Summary
This class studies large-scale systems and applications on the World Wide Web.
We will study topics around the architecture and functionality of modern search engines (e.g., Google), such as the collection and refreshing of Web pages (Web crawling), indexing, querying and retrieval, the Hidden Web, result ranking, and spam detection on the Web. We will also study topics around the organization and functionality of social networks (e.g., Facebook, Twitter, LinkedIn), such as the analysis and prediction of social graphs, community detection, information diffusion in social networks, and recommendations.
The course involves presentations and study of different research topics as well as hands-on projects on these topics.
Course Information
Spring semester 2024
Class: Wed 10:00-13:00
Instructor: Alexandros Ntoulas
antoulas -*at*- di -*dot*- uoa +*dot*+ gr
Announcements
- 04/03: Class begins on Wednesday March 06. Please join eclass to see announcements.
Projects
A topic is available unless a name appears next to it. Please email the instructor if you are interested in picking one of the topics for your term project.
- TikTok crawler: Write a crawler for downloading info from TikTok.
- Focused crawler: Implement a focused crawler that collects web pages of a given topic (e.g. cars).
- Link Ranking: Implement a set of link-based ranking algorithms (e.g. PageRank, HITS, SALSA) and compare them over a set of Web pages.
- Product specs: Implement a crawler that creates a database of product specs by crawling information from the web for comparison shopping.
- Location-aware Advertising Platform: Create a platform to support location-aware advertising.
- Hotspot logger: Implement a mouse movement logger for a web page and overlay the hotspots on top of the content.
- Recipes: Create a crawler for recipes and create a search engine for recipes based on the ingredients in someone's fridge. Improvement: identify the fridge ingredients from a picture of the fridge.
- DEH: Implement a portal where people can compare prices across different power suppliers and pricing policies (green, blue, etc.). Collect historical data and make predictions on future prices.
- Trigger alerts: Implement a system for watching specific web sites and alerting the user on changes.
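For the Link Ranking topic, a minimal sketch of PageRank computed by power iteration may help as a starting point. It is an illustration under stated assumptions, not a reference implementation: the function name `pagerank`, the dictionary-based graph format, the damping factor of 0.85, and the fixed iteration count are all choices made here for the example. Pages without outgoing links are assumed to spread their rank uniformly over all pages, which is one common convention among several.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links: dict mapping each page to the list of pages it links to
           (hypothetical input format chosen for this sketch).
    Returns a dict mapping each page to its score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from the uniform distribution.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread uniformly
        # (an assumption of this sketch).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank

    return rank


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 4))
```

On the toy graph in the usage example, page `c` receives links from three pages and should end up with the highest score. A project comparing PageRank with HITS or SALSA would likely replace the fixed iteration count with a convergence test and use a sparse matrix representation for larger crawls.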
References
The course material that we will study in this class comes mostly from the literature and from research papers published at conferences such as WWW, WSDM, WISE, VLDB, SIGMOD, Big Data, etc.
The papers that we will go through are available in the schedule table below.
For more information and a more elaborate description of each topic you can also study the following books. However, these are not a requirement for the class:
- Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008.
- Search Engines: Information Retrieval in Practice, W. Bruce Croft, Donald Metzler, Trevor Strohman, Pearson, 2009.
- Networks, Crowds, and Markets: Reasoning About a Highly Connected World, David Easley and Jon Kleinberg, Cambridge University Press, 2010.
Syllabus & Schedule
Date |
Topic |
Handouts/Assignments |
Slides/Notes |
Wed, Mar 06 |
Course description & logistics; Introduction to Web Search |
Note: You can read these papers if you want to learn more about Web Search engines, but you are not expected to review them. First reviews are due for next week's topic (Web characteristics).
- The anatomy of a large-scale hypertextual Web search engine (Google), S. Brin, L. Page, Computer Networks, 1998 [pdf]
- The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis, A. Ntoulas et al., WWW 2005 [pdf]
|
lecture 01
|
Wed, Mar 13 |
Web Characteristics |
- Deadline for forming groups
- Graph structure in the Web, Andrei Broder et al., WWW 2000 [pdf]
- What’s New on the Web? The Evolution of the Web from a Search Engine Perspective, A. Ntoulas et al., WWW 2004 [pdf]
|
lecture 02
|
Wed, Mar 20 |
Web Crawling |
- Project proposals due
- Effective Change Detection Using Sampling, J. Cho, A. Ntoulas, VLDB 2002 [pdf]
|
lecture 03 |
Wed, Mar 27 |
Hidden (Deep) & Dark Web |
- Downloading Hidden Web Content, A. Ntoulas et al., JCDL 2005 [pdf]
|
lecture 04 |
Wed, Apr 03 |
Link Analysis, Ranking & Spam Detection |
- Detecting Spam Web Pages through Content Analysis, A. Ntoulas et al., WWW 2006 [pdf]
|
lecture 05 |
Wed, Apr 10 |
Indexing |
- Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee, A. Ntoulas, J. Cho, SIGIR 2007 [pdf]
|
lecture 06 |
Wed, Apr 17 |
Computational Advertising |
- Project progress deadline
- Internet Advertising and the Generalized Second-Price Auction: Selling Billions of Dollars Worth of Keywords, B. Edelman et al., The American Economic Review, 2007 [pdf]
|
lecture 07 |
Wed, Apr 24 |
Social Recommendations |
- Exploiting Social Context for Review Quality Prediction, Y. Lu et al., WWW 2010 [pdf]
|
lecture 08 |
Wed, May 01 |
No class - Easter break |
|
|
Wed, May 08 |
No class - Easter break |
|
|
Wed, May 15 |
Link Predictions |
- Predicting Positive and Negative Links in Online Social Networks, J. Leskovec et al., WWW 2010 [pdf]
|
lecture 09 |
Wed, May 22 |
Small-world phenomena |
- The Anatomy of the Facebook Social Graph, J. Ugander et al., arXiv 2012 [pdf]
|
lecture 10 |
Wed, May 29 |
Lecture |
|
|
Wed, Jun 05 |
Lecture |
Project report deadline |
|