1 CSCI5570 Large Scale Data Processing Systems
Course Overview
Instructor: Prof. James Cheng
2 Course Webpage
- Check the course webpage regularly.
- Remark: I prefer to put the course webpage under my own directory to make it easier for off-campus access.
3 Topics Overview
Topic | Tentative Schedule
Introduction and Course Project | Week 1
Prerequisite: Relational Database Systems & Distributed Database Systems | Weeks 2-4 (Self Reading)
Distributed Data Analytics Systems | Weeks 2-5
NoSQL | Weeks 5-8
NewSQL | Weeks 9-10
Distributed Graph Processing Systems | Weeks 11-12
Distributed Stream Processing Systems | Weeks 12-13
Other Large Scale Data Processing Systems | ???
Distributed Machine Learning Systems | Course Project
4 Prerequisites
- Fundamental concepts of distributed database systems, prerequisite to NoSQL and NewSQL as well as other distributed data processing systems
- Parallel query processing
- Distributed query processing
5 Distributed Data Analytics Systems
- Focus on state-of-the-art big data platforms, either widely adopted by industry (e.g., Hadoop, Spark) or best in research (e.g., Naiad, Husky)
- Fundamental concepts of big data analytics systems
- Applications (too ad hoc to teach them all, but you can try them out with the systems taught in class):
  - Data collection, data extraction, data cleaning …
  - Machine learning (e.g., classification, clustering, recommendation, feature selection, dimensionality reduction …)
  - OLAP, data cube
  - Data mining
  - Graph analytics (including social network analysis)
  - Similarity search (e.g., scalable locality sensitive hashing)
6 NoSQL/NewSQL
- Relational databases are the foundation of western civilization, but now is the era of NoSQL databases
- NoSQL databases, such as MongoDB, Cassandra, CouchDB, etc., are rapidly taking large shares of the market from traditional vendors such as Oracle
- Must learn for big data analytics
- NewSQL databases try to combine the pros of both traditional DBMS and NoSQL
7 Distributed Graph Processing Systems
- Graph data: web graphs, online social networks, mobile communication networks, financial networks, biological networks, neural networks …
- Distributed systems that make the analysis of these large scale graphs/networks possible
- Key techniques and algorithms for large scale graph data processing
8 Distributed Stream Processing Systems
- Streaming data have become common today, e.g., tweets, news feeds, …
- How to analyze such massive high-speed data in real time?
- Key techniques and applications
9 Distributed Data Storage Systems
- How to store massive volumes of different types of data, retrieve them, and update them efficiently?
- How to handle consistency issues?
- How to handle availability issues?
10 Reading List
- A list of papers for each topic (except for the older topics such as Relational Database Systems and Distributed Database Systems) will be released weekly
11 Reference
Database Systems – The Complete Book, Second edition (Prentice Hall)
Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom
12 Reference
Database Management Systems, Third edition
Raghu Ramakrishnan, Johannes Gehrke
13 Assessment Criteria
Survey paper: 30 marks
- Select one of the following topics: (1) Distributed Data Analytics Systems, (2) NoSQL, (3) NewSQL, (4) Distributed Graph Processing Systems, (5) Distributed Stream Processing Systems, or (6) any other related topic (please seek the approval of the course instructor first)
- Write a survey paper on this topic
- The survey paper must cover most of the seminal works and the state-of-the-art works related to this topic, including a clear introduction to each of these works, a description of the problems they solved and their main ideas, a comparative analysis highlighting the strengths and limitations of these works, and your own conclusions and comments on this topic and its future development, etc.
- Deadline: Nov 30, 2017 HK time (submit a pdf file with filename “Lastname Firstname” to with title “5570 survey Lastname Firstname”)
14 Assessment Criteria
Course Project: 70 marks
- See details in the project specification
15 Assessment Criteria
- You will receive an F grade for the course if
  - your score for the survey paper is less than 10 marks, OR
  - your score for the course project is less than 30 marks
- You will receive at least a B- if
  - your score for the survey paper is at least 20 marks, AND
  - your score for the course project is at least 40 marks
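The two threshold rules above can be read as a small decision procedure. The sketch below is only an illustration of that reading, not an official grading tool: the function name `grade_bound` and the returned strings are hypothetical, and the maximum scores (30 and 70) are taken from the Assessment Criteria slides. Scores that trigger neither rule are not assigned a grade by these rules, so the sketch reports that case explicitly.

```python
def grade_bound(survey: float, project: float) -> str:
    """Return what the two threshold rules imply for a pair of scores.

    survey  -- survey paper score, out of 30 marks
    project -- course project score, out of 70 marks
    (Function name and return strings are illustrative assumptions.)
    """
    if not (0 <= survey <= 30 and 0 <= project <= 70):
        raise ValueError("survey must be in [0, 30] and project in [0, 70]")

    # Rule 1: failing either component fails the course (OR).
    if survey < 10 or project < 30:
        return "F"

    # Rule 2: both components must meet their thresholds (AND).
    if survey >= 20 and project >= 40:
        return "at least B-"

    # Neither rule applies; the rules do not fix the grade here.
    return "not determined by these rules"


if __name__ == "__main__":
    for scores in [(9, 70), (20, 40), (15, 35)]:
        print(scores, "->", grade_bound(*scores))
```

Note the asymmetry: a high project score cannot compensate for a survey paper below 10 marks, because the F rule uses OR while the B- rule uses AND.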
16 Academic Honesty
- Plagiarism, cheating, and misconduct in tests/exams should be reported to the Faculty Disciplinary Committee for handling.
- University Guidelines to Academic Honesty:
17 Student/Faculty Expectations
- Let’s join hands to create a positive, respectful, and engaged academic environment inside and outside the classroom.
- Full version of Student/Faculty Expectations on Teaching and Learning: xpectations.pdf