Introduction to Massive Data Analysis

Course Information


Office: EECS546 Gates

Email: jswang @ cs.nthu.edu.tw

Office Hours: Monday 10:00AM-Noon, Friday 12:30AM-Noon

Meeting Times and Locations

Tuesday & Friday 9:00AM-10:00AM in EECS 546

Course description

The course discusses data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on Map-Reduce as a tool for creating parallel algorithms that can process data at this scale.

Topics include frequent itemsets and association rules, near-neighbor search in high-dimensional data, locality-sensitive hashing (LSH), dimensionality reduction, recommendation systems, clustering, link analysis, large-scale supervised machine learning, data streams, mining the Web for structured data, and Web advertising.

Course outline

The following is a tentative list of topics to be covered; it may change as the quarter progresses.

  1. Data Mining
  2. Map-Reduce and the New Software Stack
  3. Finding Similar Items
  4. Mining Data Streams
  5. Link Analysis
  6. Frequent Itemsets
  7. Clustering
  8. Advertising on the Web
  9. Recommendation Systems
  10. Mining Social-Network Graphs
  11. Dimensionality Reduction
  12. Large-Scale Machine Learning

Important Dates

Assignment       Out on    Due on (11:59pm Pacific Time)
Assignment #1
Term project

Final exam: Wed, March 25, 10:10AM-Noon


Prerequisites

Knowledge of Java

Course materials

Book: Leskovec, Rajaraman, and Ullman, Mining of Massive Datasets. The book can be downloaded for free. A print edition can be purchased from Cambridge University Press, but you are not required to buy it.

MOOC: There is a Coursera MOOC that is similar to this course. You may find it useful to view some of the videos there.

Course work and grading

The coursework for the course will consist of: