Ecole
University of Piraeus Summer 2019
Professor: Vassilis Christophides
E-mail: vassilis.christophides@inria.fr
Course Hours: Tuesday/Wednesday/Thursday 18:00-21:00
Room: TBA
Office Hours: Available upon request
Big Data requires the storage, organization, and processing of data, typically
heterogeneous and arriving as streams, at a scale and efficiency that go well
beyond the capabilities of conventional information technologies. Such
requirements were first introduced for processing the Web, and they are now
commonplace in many industries. As a result, many traditional assumptions break
down, new query and programming interfaces are required (Map/Reduce), and new
computing models emerge (Cloud Computing).
This course aims to introduce parallel/distributed data processing using
the MapReduce (M/R) paradigm and to provide insights into
developing applications on top of the Hadoop platform.
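To make the M/R paradigm concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical helpers chosen for illustration, and the shuffle step that a real cluster performs across machines is imitated here with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the values that were grouped under one key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy driver (an assumption for illustration, not a Hadoop API)."""
    # Shuffle phase: group every emitted value by its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Apply the reducer once per key.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["big data needs big clusters", "data streams never end"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both phases can in principle be spread over many machines, which is the property that platforms such as Hadoop exploit.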
Big Data also raises new challenges in data
mining. Given the scale and speed of the data that needs to be processed, as well
as the variety of parameters to be taken into account, state-of-the-art machine
learning algorithms that work offline and expect homogeneous, clean data
are also challenged. There is an ongoing effort to design Big Data Mining
algorithms that support parallel/distributed or even streaming evaluation.
Of course, this kind of incremental, partial evaluation affects the quality of
the resulting statistical models, so algorithms must trade off the quality of
the learning against computation time. The course will adopt an algorithmic
viewpoint: data mining is about applying algorithms to data, rather than using
data to “train” a machine-learning engine of some sort.
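As an illustration of the trade-off between result quality and computation cost in streaming evaluation, the following sketch uses reservoir sampling, one standard one-pass technique, to estimate the mean of a stream while keeping only a fixed number of items in memory. The stream contents, the sample size of 1000, and the helper name `reservoir_sample` are assumptions made for this example, not material taken from the course.

```python
import random
from typing import Iterable, List


def reservoir_sample(stream: Iterable[int], k: int, rng: random.Random) -> List[int]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir: List[int] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


rng = random.Random(42)          # fixed seed so runs are repeatable
stream = range(1, 100_001)       # stand-in for a data stream (assumption)
sample = reservoir_sample(stream, 1000, rng)

exact_mean = sum(stream) / len(stream)   # needs the whole data set
approx_mean = sum(sample) / len(sample)  # needs only the 1000 sampled items
print(exact_mean, approx_mean)
```

The estimate is computed from 1% of the data in a single pass, at the price of a sampling error that shrinks as the reservoir grows. A larger reservoir buys accuracy with memory and time.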
The course will consist of lectures
based both on textbook material (freely available for download on the Web) and
scientific papers. It will also include programming assignments that will
give students hands-on experience in building data-intensive
applications using existing Big Data tools and platforms. The intended audience
of this course is MSc and PhD students, as well as practitioners who plan to
design or implement state-of-the-art algorithms for Big Data
analysis.
Lecture 1 (28/05): Course Overview
Intro on Big Data Processing & Analytics [pdf]
Lecture 2 (29/05): Finding Similar Items
Minhashing and Locality-sensitive Hashing [pdf]
Homework 1 [pdf]
Lecture 3 (30/05): Mining Association Rules
Frequent Patterns Mining [pdf]
Lectures 4-5 (04-05/06): Analysing Data Streams
Sampling, Windows, Synopses and Sketches [pdf]
Lecture 6 (06/06): IoT Data Analytics
Data Anomaly Detection [pdf]