LIACS Data Mining course

Author: Guido de Coninck

1 LIACS Data Mining course: an introduction

2 Course Information. Course website (will be updated periodically). Practical exercises: two practice sessions (during lecture hours) and one challenge.

3 Course Schedule. Start: Sept 4. Last lecture: Dec 4 (exam preparation). Lecture room varies weekly: Sept 4, Sylvius 1531; Sept 11, Huygens, Sitterzaal; Sept 18, van Steenis, E004; etc. Nov 6: no lecture! Practical exercises: Sept 25, second hour; Oct 10, second hour; Oct 16, instructions challenge; Nov 13, Q&A challenge; Dec 3, deadline submission challenge. Exam: Jan 8, 14:00 – 17:00.

4 Course Textbook. Data Mining: Practical Machine Learning Tools and Techniques, third edition, Morgan Kaufmann, by Ian Witten and Eibe Frank (EUR 45,95 at Amazon.de).

5 Course participants: Bachelor Informatica x; … & Economie x; … & Biologie x; Minor Data Science x (Science Faculty x, other programmes x); PhD students x; others?

6 Introduction to Data Mining: an overview and some examples

7 Data Mining definitions. The concept of extracting previously unknown and potentially useful, interesting knowledge from large sets of data. Secondary statistics: analyzing data that wasn’t originally collected for analysis.

8 Data Mining, the big idea. Organizations collect large amounts of data, often for administrative purposes. This is a large body of experience: learning from experience. Goals: prediction/forecasting, diagnostics, optimization. Prediction: predicting the outcome of an event, say a mailing for a mortgage offer. Forecasting: applies to time series (temperature, stock market), predicting the value on a future date. All goals turn out to be forms of optimization.

9 Two Streams: mining for insight and ‘black-box’ mining. Mining for insight: understanding a domain; finding regularities between variables; interpretable models. Examples: medicine, production, maintenance. ‘Black-box’ mining: don’t care how you do it, just do it well; optimization. Examples: marketing, forecasting (financial, weather). Mining for insight, steel mill example: rolls of metal sheet go through various processing steps. 10% fail. Why? Which rolls?

10 Example: Direct Mail. Optimize the response to a mailing by targeting only those who are likely to respond: more response, fewer letters. Diagram: customer information → test mailing (response 3%) → Data Mining → customer model; customer information of the remainder → final mailing (response 30%). This is an example of Black Box data mining.
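The targeting step in the direct-mail example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer scores and responses below are invented toy values, and `response_rate` and `select_top` are hypothetical helper names, not part of the course material. The only figures taken from the slide are the 3% and 30% response rates.

```python
# Toy data, invented for illustration: (model score, responded?) per customer.
# The scores stand in for the output of a hypothetical customer model.
customers = [
    (0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.50, 0),
    (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0), (0.05, 0),
]


def response_rate(group):
    """Fraction of customers in the group that responded."""
    return sum(responded for _, responded in group) / len(group)


def select_top(group, fraction):
    """Keep the given fraction of customers with the highest model score."""
    ranked = sorted(group, key=lambda c: c[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]


base_rate = response_rate(customers)       # mail everyone
targeted = select_top(customers, 0.3)      # mail only the top 30% by score
targeted_rate = response_rate(targeted)
lift = targeted_rate / base_rate           # improvement factor from targeting

# Improvement factor implied by the slide: 30% final vs 3% test response.
slide_lift = 30 / 3

print(f"base {base_rate:.2f}, targeted {targeted_rate:.2f}, lift {lift:.2f}")
print(f"slide lift: {slide_lift:.0f}x")
```

The ratio between the targeted and the untargeted response rate is commonly called lift; the slide's figures correspond to a tenfold improvement with fewer letters sent.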

11 Example: Bioinformatics. Find genes involved in disease (Parkinson’s, Celiac, Neuroblastoma). Measurements from patients (1) and controls (0). Gene expression: measurements of 20k genes; dataset 20,001 × 100. Challenges: many variables; few examples (patients), testing is expensive; interactions between genes. This is an example of Mining for Insight.

12 Data Mining paradigms. Classification: (binary) class variable; predict class of future cases; most popular paradigm. Clustering: divide dataset into groups of similar cases. Regression: numeric target variable. Association: find dependencies between variables; basket analysis, …

13 Classification (decision tree). Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given). Figure: decision tree with splits on House (Rent / Buy / Other), Age (< 35 / ≥ 35) and Price (< 200K / ≥ 200K) and Yes/No leaves; node values shown: 0.64, 0.51, 0.25, 0.01, 0.4, 0.07, 0.1, 0.2. Never mind for the moment how the tree was produced (it was induced, i.e. derived, from the data). More on this in two weeks. This is already Mining for Insight: we now understand better which type of people may be interested. In the context of DM, we refer to the variables as attributes.

14 Applying a classifier (decision tree). New customer: (House = Rent, Age = 32, …), prediction = Yes. Figure: the same decision tree (House: Rent / Buy / Other; Age < 35 / ≥ 35; Price < 200K / ≥ 200K; leaves Yes / No). This is Black Box DM, since we don’t care about the nature of the model, just about its capability to predict.
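Applying such a tree to a new customer amounts to following one path from the root to a leaf. The sketch below hand-codes a tree of this shape. Only the path House = Rent, Age < 35 → Yes is taken from the slide; the tree layout and the remaining leaf labels are assumptions made for illustration, as is the function name `predict`.

```python
def predict(customer):
    """Follow one root-to-leaf path of a hand-coded decision tree.

    Assumed layout: the root splits on House; the Rent branch splits on Age;
    the Buy branch splits on Price. Only the leaf for (Rent, Age < 35) is
    taken from the slide; all other leaf labels are hypothetical.
    """
    house = customer["House"]
    if house == "Rent":
        return "Yes" if customer["Age"] < 35 else "No"
    if house == "Buy":
        return "Yes" if customer["Price"] < 200_000 else "No"
    return "No"  # House = Other (hypothetical leaf)


new_customer = {"House": "Rent", "Age": 32}
print(predict(new_customer))
```

Note that the Rent branch never looks at Price: a prediction uses only the attributes on its own path.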

15 Classification. Figure: the same decision tree (House: Rent / Buy / Other; Age < 35 / ≥ 35; Price < 200K / ≥ 200K; leaves Yes / No). Tree makes attribute dependencies explicit: class depends on House; the influence of other attributes is less. Dependencies are (often) fuzzy: multiple attributes are needed; perfect predictions are rare.

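16 Graphical interpretation: dataset with two attributes (x, y) + 1 class (+/-). Graphical interpretation of a decision tree. Figure: scatter plot of + and - cases over axes x and y, partitioned by the tree's splits x < t / x ≥ t and y < t’ / y ≥ t’.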
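A decision tree over two numeric attributes cuts the plane into axis-parallel rectangles, one per leaf. A minimal sketch, assuming the tree first splits on x at t and then, for x ≥ t, on y at t’; which region gets which class, and the threshold values, are hypothetical choices for illustration.

```python
def tree_classify(x, y, t, t_prime):
    """Two axis-parallel splits: first on x, then (for x >= t) on y."""
    if x < t:
        return "-"        # everything left of the vertical line x = t
    if y < t_prime:
        return "-"        # right of x = t, below the horizontal line y = t'
    return "+"            # right of x = t, on or above y = t'


# Hypothetical thresholds; print a coarse 2x2 map of the regions,
# with the top row of the plot first.
T, T_PRIME = 0.5, 0.5
for y in (0.75, 0.25):
    print(" ".join(tree_classify(x, y, T, T_PRIME) for x in (0.25, 0.75)))
```

Each test compares a single attribute with a threshold, which is why the resulting decision boundary consists only of horizontal and vertical segments.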

17 Graphical interpretation: dataset with two attributes + 1 class (+/-), other classifiers. Figure: the same x/y plot of + and - cases with the decision boundaries of a Support Vector Machine and a Neural Network. A classifier defines a decision boundary: where is the boundary between positive and negative cases? Note that graphical interpretation in 2 dimensions is intuitive (humans could also draw the decision boundary) but becomes less intuitive with D > 2.

18 Applications of DM. Marketing (outgoing, incoming); Bioinformatics & Medicine; Fraud detection; Risk management; Insurance; Enterprise resource planning.

19 Break

20 Data Mining Applications

21 Training Data Speed Skating. Speed skating team LottoNL-Jumbo. Detailed historic data: training details (duration, intensity), competition results. Finding patterns of effective training. Visualise data.

22 Kjeld Nuis. 178 races. On average 2.89% above track record. Specialises in 1000 m (2.1%). Dutch champion 1000 m, 1500 m. WC Distances: bronze 1000 m, silver 1500 m. WC Sprint: ‘silver’. ISU World Cup: gold 1000 m, silver 1500 m.

23 Total sum of load over last 5 days, morning sessions. Figure annotations: undesired result due to over-training; advised upper limit. Relatively simple model of a threshold on a single variable, but this variable is constructed (not in the original data) and selected from a large set of variables. In other words, a non-trivial discovery.
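A constructed variable of this kind can be derived from a raw training log. The sketch below is an illustration under loud assumptions: the session records are invented, load is taken to be duration × intensity, "last 5 days" is read as the five days before the race day (race day excluded), and `ADVISED_LIMIT` is a made-up value. None of these come from the actual LottoNL-Jumbo data.

```python
from datetime import date, timedelta

# Invented training log; field names and values are hypothetical.
sessions = [
    {"date": date(2015, 1, 4), "part": "morning", "duration_min": 60, "intensity": 5},
    {"date": date(2015, 1, 5), "part": "morning", "duration_min": 90, "intensity": 4},
    {"date": date(2015, 1, 6), "part": "afternoon", "duration_min": 120, "intensity": 6},
    {"date": date(2015, 1, 8), "part": "morning", "duration_min": 60, "intensity": 7},
    {"date": date(2015, 1, 9), "part": "morning", "duration_min": 30, "intensity": 8},
    {"date": date(2015, 1, 10), "part": "morning", "duration_min": 20, "intensity": 3},
]


def morning_load(log, race_day, window_days=5):
    """Sum of load over morning sessions in the window before race_day.

    Load is assumed to be duration * intensity; the window covers the
    window_days days preceding race_day and excludes race_day itself.
    """
    start = race_day - timedelta(days=window_days)
    return sum(
        s["duration_min"] * s["intensity"]
        for s in log
        if s["part"] == "morning" and start <= s["date"] < race_day
    )


ADVISED_LIMIT = 1000  # hypothetical threshold, not the value from the slide

total = morning_load(sessions, date(2015, 1, 10))
over_limit = total > ADVISED_LIMIT
print(total, "over limit" if over_limit else "within limit")
```

The model itself is a single comparison; the work lies in generating many such aggregates (different windows, session types, load definitions) and selecting the one that best separates good from poor results.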

24 InfraWatch: monitoring of infrastructure. Continuous monitoring of a large bridge, ‘Hollandse Brug’. 145 sensors; time-dependent, at frequencies up to 100 Hz; multi-modal (sensor, video, different freq.); managing large data quantities, >5 GB per day.

25 InfraWatch: monitoring of infrastructure. 34 geo-phones (vibration sensors); 44 embedded strain-gauges, 47 gauges outside; 20 thermometers; video camera; weather station.

26 sensor mining

27 Maintenance planning at KLM. Routine checks of aircraft. Maintenance requires up to 10k different parts. Ordering parts incurs delay (costs), but so does keeping stock. In theory, 10k individual predictions. Input: maintenance history; flight history (Sahara/North Pole). Only a few parts are predictable.

28 Cashflow Online. Online personal finance overview. All bank transactions are loaded into the application; transactions are classified into different categories; Data Mining predicts the category.

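29 67 Categories: Gas Water Licht (gas, water, electricity); Onderhoud huis en tuin (home and garden maintenance); Telefoon + Internet + TV (phone + internet + TV); Contributie (sport-)verenigingen ((sports) club membership fees); Levensverzekering / Lijfrente (life insurance / annuity); Rente ontvangen (interest received); Boodschappen (groceries); Hypotheekrente (mortgage interest); Naar spaarrekening (to savings account); Geldopname/chipknip (cash withdrawal / Chipknip); Verzekeringen overig (other insurance); Loterijen (lotteries); Cadeau's (gifts); Interne boeking (internal transfer); Vakantie & Recreatie (holiday & recreation); Uitgaan, hobby's en sport (going out, hobbies and sport); Creditcard (credit card); Ziektekostenverzekering (health insurance); Brandstof (fuel); Woonhuis / Opstalverzekering (home / buildings insurance); Huishouden overig (other household); School- en Studiekosten (school and study costs); Inkomsten overig (other income); Kleding & Schoenen (clothing & shoes); Lenen (loans); Openbaar vervoer/Taxi (public transport / taxi)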

30 Fragmented results: Boodschappen (groceries), Contributie (membership fees)

31 Decision Tree over all categories. Figure: decision tree with false/true branches.

32 Data Mining at LIACS. Applications: bioinformatics (LUMC); Sports Analytics (LottoNL-Jumbo, PSV); Hollandse Brug (Strukton, TU Delft, RWS); fraud detection at Achmea health insurance and NZa; ChartEx, medieval documents (English, Latin). Complex data: graphical data (molecules); relational data (criminal careers); stream data (sensor data, click streams).