M151: Web Information Systems and Applications

Summary

This class studies large-scale systems and applications on the World Wide Web. We will study topics around the architecture and functionality of modern search engines (e.g., Google), such as the collection and refreshing of Web pages (Web crawling), indexing, querying and retrieval, the Hidden Web, result ranking, and spam detection on the Web. We will also study topics around the organization and functionality of social networks (e.g., Facebook, Twitter, LinkedIn), such as the analysis and prediction of social graphs, community detection, information diffusion in social networks, and recommendations. The course involves presentations and study of different research topics as well as hands-on projects on these topics.

Course Information

Spring semester 2024
  • Class: Wed 10:00-13:00
  • Instructor: Alexandros Ntoulas
    antoulas -*at*- di -*dot*- uoa +*dot*+ gr
  • Announcements

    • 04/03: Class begins on Wednesday March 06. Please join eclass to see announcements.

    Projects

    A topic is available unless a name appears next to it. Please email the instructor if you are interested in picking one of the topics for your term project.
    • TikTok crawler: Write a crawler for downloading information from TikTok.
    • Focused crawler: Implement a focused crawler that collects web pages of a given topic (e.g. cars).
    • Link Ranking: Implement a set of link-based ranking algorithms (e.g. PageRank, HITS, SALSA) and compare them over a set of Web pages.
    • Product specs: Implement a crawler that creates a database of product specs by crawling information from the web for comparison shopping.
    • Location-aware Advertising Platform: Create a platform to support location-aware advertising.
    • Hotspot logger: Implement a mouse movement logger for a web page and overlay the hotspots on top of the content.
    • Recipes: Create a crawler for recipes and create a search engine for recipes based on the ingredients in someone's fridge. Improvement: identify the fridge ingredients from a picture of the fridge.
    • DEH: Implement a portal where people can compare prices across different power suppliers and pricing policies (green, blue, etc.). Collect historical data and make predictions on future prices.
    • Trigger alerts: Implement a system for watching specific web sites and alerting the user on changes.
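
    As a possible starting point for the Link Ranking topic, the following is a minimal sketch of PageRank computed by power iteration over a small in-memory link graph. The function name `pagerank`, the damping factor of 0.85, the convergence tolerance, and the toy graph are all illustrative assumptions; they are not requirements of the project, and HITS and SALSA are not covered here.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,   # assumed value; a common default, not prescribed by the course
    tol: float = 1e-10,      # assumed L1 convergence threshold
    max_iter: int = 200,
) -> Dict[str, float]:
    """Power-iteration PageRank over a dict mapping each page to its out-links."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    if n == 0:
        return {}
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if targets:
                # Each out-link passes on an equal share of the source's rank.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web used only for illustration.
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.4f}")
```

    A real project would replace the toy graph with links extracted from crawled pages and would likely need a sparse-matrix representation once the graph no longer fits comfortably in Python dictionaries.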

    References

    The course material that we will study in this class comes mostly from the literature and from research papers at conferences such as WWW, WSDM, WISE, VLDB, SIGMOD, Big Data, etc. The papers that we will go through are listed in the schedule table below.

    For more information and a more elaborate description of each topic you can also study the following books. However, these are not a requirement for the class:



    Syllabus & Schedule

    Date Topic Handouts/Assignments Slides/Notes
    Wed, Mar 06 Course description & logistics
    Introduction to Web Search
    Note: You can read these papers if you want to learn more about Web Search engines, but you are not expected to review them. First reviews are due for next week's topic (Web characteristics).
    • The anatomy of a large-scale hypertextual Web search engine (Google), S. Brin, L. Page, Computer Networks, 1998 [pdf]
    • The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis, A. Ntoulas et al., WWW 2005 [pdf]
    lecture 01
    Wed, Mar 13 Web Characteristics
    • Deadline for forming groups
    • Graph structure in the Web, Andrei Broder et al., WWW 2000 [pdf]
    • What’s New on the Web? The Evolution of the Web from a Search Engine Perspective, A. Ntoulas et al., WWW 2004 [pdf]
    lecture 02
    Wed, Mar 20 Web Crawling
    • Project proposals due
    • Effective Change Detection Using Sampling, J. Cho, A. Ntoulas, VLDB 2002 [pdf]
    lecture 03
    Wed, Mar 27 Hidden (Deep) & Dark Web
    • Downloading Hidden Web Content, A. Ntoulas et al., JCDL 2005 [pdf]
    lecture 04
    Wed, Apr 03 Link Analysis, Ranking & Spam Detection
    • Detecting Spam Web Pages through Content Analysis, A. Ntoulas et al., WWW 2006 [pdf]
    lecture 05
    Wed, Apr 10 Indexing
    • Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee, A. Ntoulas, J. Cho, SIGIR 2007 [pdf]
    lecture 06
    Wed, Apr 17 Computational Advertising
    • Project progress deadline
    • Internet Advertising and the Generalized Second-Price Auction: Selling Billions of Dollars Worth of Keywords, B. Edelman et al., The American Economic Review, 2007 [pdf]
    lecture 07
    Wed, Apr 24 Social Recommendations
    • Exploiting Social Context for Review Quality Prediction, Y. Lu et al., WWW 2010 [pdf]
    lecture 08
    Wed, May 01 No class - Easter break
    Wed, May 08 No class - Easter break
    Wed, May 15 Link Predictions
    • Predicting Positive and Negative Links in Online Social Networks, J. Leskovec et al., WWW 2010 [pdf]
    lecture 09
    Wed, May 22 Small-world phenomena
    • The Anatomy of the Facebook Social Graph, J. Ugander et al., arXiv 2012 [pdf]
    lecture 10
    Wed, May 29 Lecture
    Wed, Jun 05 Lecture Project report deadline