CS 235 Data Mining: Winter 2017 Instructor: Dr Eamonn Keogh Computer Science & Engineering Department 318 Winston Chung Hall University of California, Riverside.


1 CS 235 Data Mining: Winter 2017 Instructor: Dr Eamonn Keogh Computer Science & Engineering Department 318 Winston Chung Hall University of California - Riverside Riverside, CA 92521 08:10 AM - 09:30 AM Riverside Campus | Bourns Hall | Room A265 Class web page Some slides adapted from Tan, Steinbach and Kumar, and from Chris Clifton

2 Textbooks
I will use: Data Mining: The Textbook by Charu C. Aggarwal. (Not perfect, but the best book out there by a large margin.) However, purchasing it is not compulsory.
Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Pretty good.
Data Mining: Introductory and Advanced Topics. Margaret H. Dunham. Good (a little “basic”).
Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber. Not so good.

3 Very useful and interesting book: The Signal and the Noise: The Art and Science of Prediction. Nate Silver. However, purchasing it is not compulsory. On election day 2012, he predicted Obama had a 90.9% chance of winning a majority of the electoral votes, and by crunching polling data he successfully predicted the correct result in 50 out of 50 states. Before election day 2016, he was the only person giving Trump a significant chance.

4 Slides
I make very nice slides; I suggest you print them out 6 per page before coming to class. Or you may follow along on your tablet or laptop. However, if you are using your tablet or laptop, you may not surf the web or work on other things. 90% of the material is on the slides, but the other 10% I will say only in class.

5 Cheating Policy
Students must read and understand UCR's policy on academic honesty. Note, I am very good at detecting cheating (I have taught classes on the subject). Anyone caught cheating will be given a final grade of F and may have a letter placed in his or her permanent record. Students are expected to take care that others cannot “cheat off them”. For example, if you leave your homework on a shared hard drive or an abandoned floppy and someone else hands it in, you are liable and will have your grade adjusted downward.

6 Classroom Behavior
I do not want to hear your cell phones during class. The first offence will result in the lowering of your final grade by one letter. The second offence will result in a failing grade and removal from class. Sending or receiving text messages, or using the web while in class, will result in a failing grade. Chronic lateness (or leaving class early) is unacceptable (it is disrespectful and disruptive to the instructor and other students). If you are late once, forget about it. The second time you are late you should approach me after class to explain why (failing to do so may result in a 1-percentage-point reduction in your grade).

7 Office Hours Open door Policy

8 I will give you a formal syllabus at the next meeting.
Grading: Homework assignments: ~40% (expect about 3). Programming assignments: ~50% (expect about 3). Participation / pop quizzes: ~10% (expect the unexpected). I will ask you to read about one paper a week. To make sure you read it, I will either give random pop quizzes or pick a random person to explain the paper.

9 Two goals in this class. Primary: teach you data mining techniques. Secondary: teach you how to publish papers.

10 Why teach you how to publish papers? I. The skills you need to publish papers (critical thinking, experimental design, literature review, etc.) are all skills we need to understand and apply data mining techniques.

11 Why teach you how to publish papers? II. As a practical matter, PhD students need to write two publishable-quality papers. The best way to prove a paper is of publishable quality is to publish it.

12 Why teach you how to publish papers? III. A strong publication record is your ticket to an interview/job/internship at a top company.

13 Why should I be the one to teach you how to publish papers?
SIGKDD top authors: Jiawei Han (68), Philip S. Yu (57), Christos Faloutsos (55), Jieping Ye (44), Padhraic Smyth (27), Jian Pei (26), Wei Fan (26), Bing Liu (25), Eamonn J. Keogh (25).
SIAM SDM top authors: Philip S. Yu (62), Charu C. Aggarwal (32), Jiawei Han (27), Vipin Kumar (25), Christos Faloutsos (22), Wei Fan (22), Eamonn J. Keogh (21).
ICDM top authors: Philip S. Yu (58), Jiawei Han (42), Xindong Wu (27), Eamonn J. Keogh (26), Zheng Chen (23).
DMKD top authors: Eamonn J. Keogh (15), Jiawei Han (12), Nikolaj Tatti (10), Aristides Gionis (7), Charu C. Aggarwal (7).
KAIS top authors: Philip S. Yu (23), Jian Pei (12), Xindong Wu (11), Eamonn J. Keogh (10), Hui Xiong (6).

14 Today's lecture overlaps with sections 2.1, 2.2, and 2.3 of our book.

15 How much data is there in the world? About 4,000 exabytes. 1 EB = 1 million terabytes. If printed as textbooks, a stack of books holding 4,000 exabytes would reach to Pluto and back 80 times. At the end of 1999, that number was only about 12 exabytes of data. (The majority of this data has never been examined by an algorithm, much less a person.)

16 What is Data Mining? Many definitions:
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

17 What is Data Mining? Example. Data mining is finding useful information in large datasets. Key observation: we often data mine datasets that were collected for other purposes.

18 What is (not) Data Mining?
Not data mining: looking up a phone number by name in a directory (database query); querying a Web search engine for information about “Amazon” (information retrieval).
Data mining: discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area); grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com).

19 Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable due to the enormity of data, the high dimensionality of data, and the heterogeneous, distributed nature of data. (Venn diagram: Statistics/AI, Machine Learning/Pattern Recognition, and Database Systems, with Data Mining at their intersection.) The term "Data Mining" appeared around 1990 in the database community.

20 Why Mine Data? Commercial viewpoint. Lots of data is being collected and warehoused: web data, e-commerce; purchases at department/grocery stores; bank/credit card transactions; social media. Computers have become cheaper and more powerful. Competitive pressure is strong: provide better, customized services for an edge (e.g. in customer relationship management).

21 Why Mine Data? Scientific viewpoint. Data is collected and stored at enormous speeds (TB/minute): remote sensors on satellites, telescopes scanning the skies, microarrays generating gene expression data, scientific simulations generating terabytes of data. Traditional techniques are infeasible for raw data. Data mining may help scientists in classifying and segmenting data, in hypothesis formation, etc.

22 Example: Mice Sing! Mining Massive Archive of Mice Sounds with Symbolized Representations. Jesin Zakaria, Sarah Rotschafer, Abdullah Mueen, Khaleel Razak, Eamonn Keogh. In the Proceedings of the SIAM International Conference on Data Mining (SDM) 2012.

23 We can deactivate (knock out, KO) genes in mice and see what happens to their songs. 1. Syllable extraction. 2. Syllable classification. (Figure: symbolized syllable sequences for a normal mouse vs. a P53-KO mouse.)

24 Mining Large Data Sets: Motivation
There is often information “hidden” in the data that is not readily evident. Human analysts may take weeks to discover useful information, if ever. Consider this dataset: it is easy to learn the rule “If you take the drug Ephedra and the drug Xanthine, you will die”. This is an example of a rule that a human could data mine from this dataset. (Table with columns Name, Ephedra, Xanthine, Outcome; Joe took the drugs and died, Sue did not and lived; further rows for Cat, Bob, Tim, Jin.)

25 Mining Large Data Sets: Motivation
Suppose that the dimensionality of the dataset is larger, with thousands of possible drugs; then the problem becomes much harder. (Same table, extended with columns Aspirin, …, Vitamin A.)

26 Mining Large Data Sets: Motivation
Suppose that instead of binary data, some fields contain the dosage given. Now the rule format becomes more complex: “If you take the drug Ephedra AND more than 30mg of the drug Xanthine, you will die”. How do we know what the dosage threshold should be? How do we know which operators to use (AND, OR, NOT, XOR)? (Same table, now with dosages; e.g. Joe: Ephedra yes, Xanthine 100mg, Aspirin 10mg, Vitamin A no, Died.)

27 Mining Large Data Sets: Motivation
Suppose that we have some missing values… (Same table, but with some cells blank or marked “?”.)

28 Mining Large Data Sets: Motivation
Suppose that we have ten million records, but there are only 10 examples of the dangerous drug interaction. This alone explains why a human will never spot this pattern: a single doctor is unlikely to see more than one patient die from this interaction in her career. We really do need the power of data mining here. (Same table; ten million records omitted for brevity.)

29 Mining Large Data Sets: Motivation
The problem is compounded by the fact that there are many rules in the data that are true, but not novel or interesting:
People that have babies tend to be female.
Most people have a vowel in their name.
People that are older than 15 years old are older than 10 years old.
(Table with columns Name, Sex, Number of live births, Blood pressure, …, Vitamin A.)

30 Challenges of Data Mining: scalability; dimensionality; cardinality; complex and heterogeneous data; data quality; data ownership and distribution; privacy preservation; streaming data.

31 If we are going to study data mining, we need to spend a little time talking about data. This is a little boring, but necessary.

32 What is Data? A collection of objects and their attributes.
An attribute is a property or characteristic of an object. Examples: eye color of a person, their GPA, their temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, exemplar, or instance. Objects could be a customer, a patient, a car, a country, a novel, a drug, a movie, etc.
(Figure: a table with objects as rows and attributes as columns.)

33 Data Dimensionality and Numerosity
The number of attributes is the dimensionality of a dataset. The number of objects is the numerosity (or just size) of a dataset. Some of the algorithms we want to use may scale badly in the dimensionality, or scale badly in the numerosity (or both). As we will see, reducing the dimensionality and/or numerosity of data is a common task in data mining.

34 Types of Attributes. There are different types of attributes:
Nominal (includes Boolean). Examples: ID numbers, eye color, zip codes, sex.
Ordinal. Examples: rankings (e.g., taste of salsa on a scale from 1-10), grades, height in {tall, medium, short}.
Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio. Examples: temperature in Kelvin, length, time, counts.

35 Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses: Distinctness: =, ≠. Order: <, >. Addition: +, −. Multiplication: *, /.
Nominal attribute: distinctness. Ordinal attribute: distinctness & order. Interval attribute: distinctness, order & addition. Ratio attribute: all 4 properties.

36 Properties of Attribute Values
Nominal attribute: distinctness. We can ask: Jewish = Jewish? (that is, is Sue.religion = 2?); Catholic ≠ Muslim? We cannot ask: Jewish < Buddhist (even though 2 < 3); (Jewish + Muslim)/2; Sqrt(Atheist) (even though Sqrt(1) is allowed). Key: Atheist: 1, Jewish: 2, Buddhist: 3. (Table with columns Name, Religion, Ad; e.g. Joe: 1, Sue: 2, Bob: 3.)

37 Properties of Attribute Values
Ordinal attribute: distinctness & order. For {newborn, infant, toddler, child, teen, adult} we can say: infant = infant; newborn < toddler. We cannot say: newborn + child; infant / newborn; cosine(child). Key: newborn: 1, infant: 2, toddler: 3, etc. (Table with columns Name, Lifestage, Ad.)

38 Properties of Attribute Values
There are a handful of tricky cases. Ordinal attribute: distinctness & order. If we have {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday}, then we can clearly say Sunday = Sunday and Sunday != Tuesday. But can we say Sunday < Tuesday? A similar problem occurs with the degree of an angle. Consider 365 degrees… is 5 degrees = 365 degrees? If this problem shows up, you have to think about what to do on a case-by-case basis.

39 Properties of Attribute Values
Interval attribute: distinctness, order & addition. Suppose it is 10 degrees Celsius (for the moment, assume integers). We can say it is not 11 degrees Celsius (10 ≠ 11). We can say it is colder than 15 degrees Celsius (10 < 15). We can say closing a window will make it two degrees hotter (NewTemp = 10 + 2). We cannot say that it is twice as hot as 5 degrees Celsius; 10 / 5 = 2 is not meaningful here, because the zero of the Celsius scale is arbitrary.

40 Properties of Attribute Values
Ratio attribute: all 4 properties. We can do anything! So 10 Kelvin really is twice as hot as 5 Kelvin. Of course, distinctness is tricky to define with real numbers: when are two measured real values truly equal?

41 Attribute Type / Description / Examples / Operations
Nominal: the values of a nominal attribute are just different names; they provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
Ordinal: the values of an ordinal attribute provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.
Interval: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
Ratio: for ratio variables, both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.

42 Attribute Level / Transformation / Comments
Nominal: any permutation of values. If all employee ID numbers were reassigned, it would not make any difference.
Ordinal: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3}, or by {0.5, 1, 10}, or by {A, B, C}.
Interval: new_value = a * old_value + b, where a and b are constants. Thus, the Fahrenheit and Celsius temperature scales differ in where their zero value is and the size of a unit (degree).
Ratio: new_value = a * old_value. Length can be measured in meters or feet.
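The interval and ratio transformations above can be sketched in a few lines. A minimal illustration, assuming the standard conversion constants (a = 9/5, b = 32 for Celsius to Fahrenheit; a ≈ 3.28084 for meters to feet), which are not from the slides:

```python
def celsius_to_fahrenheit(c):
    """Interval transformation new = a*old + b: a rescaling plus a shift of the zero point."""
    return 9 / 5 * c + 32

def meters_to_feet(m):
    """Ratio transformation new = a*old: pure rescaling; zero stays zero."""
    return m * 3.28084

print(celsius_to_fahrenheit(100))  # 212.0
print(meters_to_feet(0))           # 0.0 -- a ratio scale preserves the zero
```

Note the asymmetry: the ratio transformation maps 0 to 0 (so "twice as long" survives a unit change), while the interval one moves the zero, which is exactly why "twice as hot" is meaningless in Celsius.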

43 Discrete and Continuous Attributes
Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.
Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or weight. As a practical matter, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.

44 Discrete and Continuous Attributes
We can convert between continuous and discrete variables. For example, below we have converted real-valued heights to the ordinal {short, medium, tall}. Conversions of discrete to continuous are less common, but sometimes possible. Why convert? Sometimes the algorithms we want to use are only defined for a certain type of data; for example, hashing or Bloom filters are best defined for discrete data. Conversion may involve making choices, for example, how many “heights”, and where to place the cutoffs (equal width, equal bin counts, etc.). These choices may affect the performance of the algorithms. (Figure: heights 6'3'', 5'1'', 5'7'', 5'3'' mapped to {short, medium, tall}.)
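The two cutoff-placement choices just mentioned can be sketched in plain Python. This is a toy illustration; the heights (in inches, matching the figure's 5'1''…6'3'') and the choice of three bins are for the example only:

```python
def equal_width_bins(values, k):
    """Cutoffs split the RANGE of the values into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # map each value to a bin index 0..k-1 (the maximum lands in the last bin)
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Cutoffs are chosen so each bin receives (roughly) the same COUNT of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

heights = [61, 63, 67, 75]  # 5'1'', 5'3'', 5'7'', 6'3''
labels = ["short", "medium", "tall"]
print([labels[b] for b in equal_width_bins(heights, 3)])
print([labels[b] for b in equal_frequency_bins(heights, 3)])
```

On skewed data the two schemes give quite different cutoffs, which is why the choice can affect the downstream algorithm.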

45 Discrete and Continuous Attributes
We can also convert between discrete and continuous variables. For example, below we have converted discrete words to a real-valued time series that nicely shows when a word “bursts” in a text. There are 783,137 words in the King James Bible, of which 12,143 are unique. (Figure: local frequency of “God” across the King James Bible, annotated with the location of the first word of each book; Genesis starts at word 1964 because there is a short preamble.)

46 Even if the data is given to you as continuous, it might really be intrinsically discrete.
partID   size    Ad
12323    7.801   12
5324     7.802   61
75654    32.09   34
34523    32.13   65
424      47.94   54
25324    62.07   44

47 Even if the data is given to you as continuous, it might really be intrinsically discrete.
Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, and Eamonn Keogh (2011). Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL. ICDM 2011.

48 Data can be Missing
Data can be missing for many reasons: someone may decline to state; the attribute may be the result of an expensive medical test; sensor failure; etc.
Handling missing values: eliminate data objects; estimate missing values; ignore the missing value during analysis; replace with all possible values (weighted by their probabilities); etc.
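Two of the strategies listed above (eliminating objects vs. estimating values) can be sketched as follows. The table, names, and dosages are invented for illustration, echoing the drug tables from earlier slides:

```python
# Toy records; None marks a missing dosage value.
rows = [
    {"name": "Joe", "xanthine_mg": 100},
    {"name": "Sue", "xanthine_mg": None},  # e.g. declined to state
    {"name": "Tim", "xanthine_mg": 25},
    {"name": "Jin", "xanthine_mg": 20},
]

# Strategy 1: eliminate data objects that have a missing value.
complete = [r for r in rows if r["xanthine_mg"] is not None]

# Strategy 2: estimate (impute) the missing value, here with the column mean.
known = [r["xanthine_mg"] for r in rows if r["xanthine_mg"] is not None]
mean = sum(known) / len(known)
imputed = [dict(r, xanthine_mg=r["xanthine_mg"] if r["xanthine_mg"] is not None else mean)
           for r in rows]

print(len(complete))              # 3 rows survive elimination
print(imputed[1]["xanthine_mg"])  # Sue gets the mean of 100, 25, 20
```

Elimination is simple but throws away the rest of the record; imputation keeps the record but injects an assumption about the missing value.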

49 Data can be “Missing”: Special Case
In some cases we expect most of the data to be missing. Consider a dataset containing people's rankings of movies (or books, or music, etc.). The dimensionality is very high: there are lots of movies. However, most people have only seen a tiny fraction of them, so the movie-ranking database will be very sparse. Some platforms/languages explicitly support sparse matrices (including Matlab). Joe's ranking of MASH could be missing for two reasons: he saw it but did not rank it, or he did not see it. Even in the latter case, the data is considered missing, relative to what his ranking would have been if he had seen it. Here, inferring a missing value is equivalent to asking a question like “How much would Joe like the movie MASH?” See “collaborative filtering” / “recommender systems”. (Table: rankings of Jaws, E.T., MASH, Ted, Argo, Brave, OZ, Bait by Joe, Van, Sue, May, June; almost all cells are empty.)

50 Document Data is also Sparse
Each document is a “term” vector (vector-space representation): each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document. (Figure: a document-term matrix with documents Doc1-Doc5 as rows and terms such as “the”, “harry”, “rebel”, “god”, “cat”, “dog”, “help”, “near” as columns; most cells are zero.)
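Building the term-vector representation just described might look like this; the three toy documents (and their words) are invented for illustration:

```python
from collections import Counter

docs = {
    "Doc1": "the cat chased the dog near the barn",
    "Doc2": "god help the rebel",
    "Doc3": "the dog and the cat need help",
}

# The vocabulary is the union of all terms; each term is one attribute (column).
vocab = sorted({term for text in docs.values() for term in text.split()})

# Each row is a vector of raw term counts -- mostly zeros, hence "sparse".
matrix = {doc: [Counter(text.split())[t] for t in vocab]
          for doc, text in docs.items()}

print(matrix["Doc1"][vocab.index("the")])  # "the" occurs 3 times in Doc1
```

In practice one would store only the nonzero entries (a sparse representation), since the full matrix is dominated by zeros for any realistic vocabulary.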

51 Graph Data is also Typically Sparse
The elements of the adjacency matrix indicate whether pairs of vertices are connected or not in the graph.

52 Not all datasets naturally fit neatly into a rectangular matrix (table). We may have to deal with such data as special cases. However, we have so many algorithms and tools defined for tables that we often try to massage all datasets into a table if we can. Examples:
DNA data. First 100 base pairs of the chimp's mitochondrial DNA: gtttatgtagcttaccccctcaaagcaatacactgaaaatgtttcgacgggtttacatcaccccataaacaaacaggtttggtcctagcctttctattag. First 100 base pairs of the human's mitochondrial DNA: gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctg.
Transaction data. Spatio-temporal data.

53 Data Quality
What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems? Note: this is just a first look at these issues; we are not going to solve them all today!
Examples of data quality problems: redundancy (dependence, association, or (informally) correlation); noise and outliers; missing values; duplicate data.

54 Redundancy
Various subsets of the features are often related or correlated in some way; they are partly redundant with each other. For problems in signal processing, this redundancy is typically helpful, but for data mining, redundancy almost always hurts. Two adjacent pixels are typically redundant. Two adjacent values in a time series are typically redundant (autocorrelated). Your height in feet/inches and your height in meters are highly redundant. Your height and your inseam are highly redundant.
(Table: Height F/I, Height Meters, Weight. Row 1: 4'10'', 1.47, 166. Row 2: 6'3'', 1.90, 210. Row 3: 5'11'', 1.80, 215. Row 4: 5'4'', 1.62, 150.)

55 Why Redundancy Hurts
Some data mining algorithms scale poorly in dimensionality, say O(2^D). For the problem below, this means we take O(2^3) time, when we really only needed O(2^2) time.
We can see some data mining algorithms as counting evidence across a row (nearest neighbor classifier, Bayes classifier, etc.). If we have redundant features, we will “overcount” evidence.
It is probable that the redundant features will add errors. For example, suppose that person 1 really is exactly 4'10''. Then they are exactly 1.4732 m, but the system recorded them as 1.47 m, so we have introduced 0.0032 m of error. This is a tiny amount, but if we had 100s of such attributes, we would be introducing a lot of error.
The curse of dimensionality (discussed later in the quarter).
As we will see in the course, we can try to fix this issue with data aggregation, dimensionality reduction techniques, feature selection, feature generation, etc. (Table as on the previous slide.)

56 Detecting Redundancy
By creating a scatterplot of “Height F/I” vs. “Height Meters” we can see the redundancy, and measure it with correlation. However, if we have 100 features, we clearly cannot visually check 100^2 scatterplots. Note that two features can have zero correlation but still be related/redundant; there are more sophisticated tests of “relatedness”. Again, linear correlation is a special type of redundancy, not the only type. (Scatterplot of Height Meters vs. Height F/I for the table above.)
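The correlation measurement mentioned above can be sketched by computing Pearson's r from scratch on the two height columns of the table (the conversions of the F/I heights to inches are mine):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

height_inches = [58, 75, 71, 64]         # 4'10'', 6'3'', 5'11'', 5'4''
height_meters = [1.47, 1.90, 1.80, 1.62]

r = pearson(height_inches, height_meters)
print(round(r, 4))  # essentially 1.0: the two columns are redundant
```

A value near +1 or -1 flags linear redundancy; as the slide warns, r near 0 does not prove the features are unrelated.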

57 Visually Detecting (Linear) Correlation I
(Scatterplot of the Height F/I, Height Meters, and Weight columns from the table on the previous slides.)

58 Visually Detecting (Linear) Correlation II
As the plot below shows, two features can be highly dependent without having any (linear) correlation.

59 Noise
Noise refers to the modification of original values. Example: distortion of a person's voice when talking on a poor-quality phone. (Table: person 3's height is recorded as 180 meters; this man is not really 180 meters tall!) The two images are one man's ECGs, taken about an hour apart; the differences are mostly due to sensor noise (MIT-BIH Atrial Fibrillation Database, record afdb/08405).

60 Now that we understand the basic data format, we can preview a high-level view of the four most fundamental problems in data mining. These problems are: clustering, classification, association pattern mining, and outlier detection.

61 Consider this dataset. Note that the attribute “Insect Class” for record ten is missing. What should it be? This is the classification problem. Algorithms used to solve it include: nearest neighbor, decision trees, neural nets, Bayesian classifiers, linear classifiers, etc.
Insect ID   Abdomen Length   Antennae   Insect Class
1           2.7              5.5        Grasshopper
2           8.0              9.1        Katydid
3           0.9              4.7
4           1.1              3.1
5           5.4              8.5
6           2.9              1.9
7           6.1              6.6
8           0.5              1.0
9           8.3
10          8.1                         ?
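A minimal sketch of the first algorithm listed, nearest neighbor, using the two fully labeled records from the table as training data. Record 10's antenna value is not given above, so the query value 9.0 below is a hypothetical stand-in chosen for illustration:

```python
import math

# (abdomen_length, antenna_length) -> class, from the table's labeled rows
train = [
    ((2.7, 5.5), "Grasshopper"),
    ((8.0, 9.1), "Katydid"),
]

def classify_1nn(query):
    """Assign the class of the closest training record (Euclidean distance)."""
    return min(train, key=lambda item: math.dist(query, item[0]))[1]

# Record 10 has abdomen length 8.1; assuming an antenna length of 9.0:
print(classify_1nn((8.1, 9.0)))  # closest to (8.0, 9.1) -> Katydid
```

With more labeled rows the same function works unchanged; only the `train` list grows.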

62 Given a data matrix D, partition its rows (records) into sets C1 … Ck, such that the rows (records) in each cluster are “similar” to one another. This is the clustering problem (one solution on the next slide). Algorithms used to solve it include: k-means, hierarchical clustering, DBSCAN, density peaks, etc.
Insect ID   Abdomen Length   Antennae
1           2.7              5.5
2           8.0              9.1
3           0.9              4.7
4           1.1              3.1
5           5.4              8.5
6           2.9              1.9
7           6.1              6.6
8           0.5              1.0
9           8.3
10          8.1
11          6.7
12          2.5              1.6
13          8.4              5.1

63 Given a data matrix D, partition its rows (records) into sets C1 … Ck, such that the rows (records) in each cluster are “similar” to one another. This is the clustering problem. What does “similar” mean? Here I chose k = 2 sets, but in general, what is the best k? (Scatterplot: antenna length vs. abdomen length for the records on the previous slide, with the two clusters marked.)
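One of the listed algorithms, k-means, can be sketched compactly. This is Lloyd's algorithm with naive initialization (real implementations prefer k-means++), run on the rows of the table that have both measurements:

```python
import math

points = [(2.7, 5.5), (8.0, 9.1), (0.9, 4.7), (1.1, 3.1), (5.4, 8.5),
          (2.9, 1.9), (6.1, 6.6), (0.5, 1.0), (2.5, 1.6), (8.4, 5.1)]

def kmeans(points, k, iters=20):
    centers = points[:k]  # naive initialization
    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(points, k=2)
print(sorted(len(cl) for cl in clusters))  # cluster sizes
```

Note that k-means requires us to answer both of the slide's questions up front: "similar" is hard-coded as Euclidean distance, and k must be chosen by the user.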

64 Given a binary n × d data matrix D, determine all subsets of columns such that all the values in these columns take on the value 1 for at least a fraction s of the rows in the matrix. The relative frequency of a pattern is referred to as its support; the fraction s is referred to as the minimum support. This is frequent pattern mining (association rule mining). (Table: 10 transactions over the items Bread, Milk, Batteries, Juice, Butter.)

65 Given a binary n × d data matrix D, determine all subsets of columns such that all the values in these columns take on the value 1 for at least a fraction s of the rows in the matrix. The relative frequency of a pattern is referred to as its support; the fraction s is referred to as the minimum support. This is frequent pattern mining. It seems that bread and butter are often purchased together. (Same table as the previous slide.)
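The support computation defined above can be sketched as follows; the five toy baskets are invented for illustration, echoing the bread-and-butter pattern:

```python
from itertools import combinations

items = ["bread", "milk", "batteries", "juice", "butter"]
baskets = [
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "juice"},
    {"bread", "butter", "batteries"},
    {"juice"},
]

def support(itemset):
    """Fraction of baskets that contain every item of the candidate set."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Brute-force check of all pairs against a minimum support of 0.5.
min_support = 0.5
frequent_pairs = [set(p) for p in combinations(items, 2)
                  if support(set(p)) >= min_support]
print(frequent_pairs)  # only {bread, butter}: together in 3 of the 5 baskets
```

Enumerating all column subsets is exponential, which is why real algorithms (e.g. Apriori) prune candidates using the fact that a superset can never have higher support than its subsets.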

66 Are there any unusual records in this data? This is the outlier detection problem.
Insect ID   Abdomen Length   Antennae   Insect Class
1           2.7              5.5        Grasshopper
2           8.0              9.1        Katydid
3           0.9              4.7
4           1.1              3.1
5           5.4              8.5
6           0.5              9.9
7           6.1              6.6
8                            1.0
9           8.3
10          8.1              8.2

67 Are there any unusual records in this data? This is the outlier detection problem. It seems that record 6 is an outlier. Note that the individual values for the two attributes are not unusual, but their combination is. (Scatterplot: antenna length vs. abdomen length; record 6 sits far from all the other records.)
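One simple way to operationalize "unusual combination" is a distance-based outlier score: a record whose nearest neighbor is far away is suspicious, even when each attribute value is individually unremarkable. A sketch on the complete rows of the table (this particular score is one illustration, not the only outlier definition):

```python
import math

# (abdomen_length, antenna_length); record 6 is (0.5, 9.9)
records = [(2.7, 5.5), (8.0, 9.1), (0.9, 4.7), (1.1, 3.1), (5.4, 8.5),
           (0.5, 9.9), (6.1, 6.6), (8.1, 8.2)]

def nn_distance(i):
    """Distance from record i to its nearest other record (higher = more isolated)."""
    return min(math.dist(records[i], records[j])
               for j in range(len(records)) if j != i)

outlier = max(range(len(records)), key=nn_distance)
print(records[outlier])  # (0.5, 9.9) -- the odd combination of values
```

Note that neither 0.5 nor 9.9 is extreme on its own in this data; the record is flagged only because no other record combines a tiny abdomen with long antennae.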

68 Homework
Look at a few of the 355 datasets in the UCI repository: https://archive.ics.uci.edu/ml/datasets.html. Be prepared for a(n un)surprise quiz on Thursday, in which you…
Write 2 to 5 sentences explaining what the dataset is. (Example: The Synthetic Control Chart Time Series Data Set is a time series dataset. As the name implies, it is not real data but synthetic. However, it is supposed to be a good proxy for real data that comes from CNC machines. There are no missing values.)
Tell me the dimensionality of the data. (The dimensionality, the length of the time series, is 60.)
Tell me the numerosity of the data. (There are 600 examples, divided into 6 types.)
Tell me the data type(s) in the dataset; it could all be one type, or could be mixed. (All the data objects are real-valued (ratio).)
Tell me at least one data mining task this dataset was used for. (This dataset was used by Alcock and Manolopoulos in a paper published in 1999 to demonstrate a classification algorithm.)

69 (Figure: local word frequency across the Bible, annotated with book start locations as on the earlier slide, with Genesis, Numbers, Deuteronomy, Chronicles 1, Jeremiah, Ezekiel, and Revelation marked. Blue: “God” in the English Bible. Red: “Dios” in the Spanish Bible. Gray: “El Senor” in the Spanish Bible.)

70 Image data may best be thought of as time series.
Ratanamahatana, C. A. & Keogh, E. (2004). Using Relevance Feedback in Multimedia Databases. In the Proceedings of the 7th International Conference on Visual Information Systems.
Pierre Maurel and Guillermo Sapiro (2003). Dynamic Shapes Average. Institute for Mathematics and its Applications.
Thomas G. Dietterich and Ashit Gandhi. Content-Based Image Retrieval: Plant Species Identification using Dynamic Programming.

71 SAX
(Figure: a continuous time series of dimensionality 128 is converted to the discrete string baabccbc, which has a dimensionality of 8 and a cardinality of 3.)

72 SAX
First convert the time series to the PAA representation, then convert the PAA to symbols. It takes linear time. (Figure: a time series of length 128 reduced to 8 PAA segments, which are mapped to the symbols {a, b, c} to give baabccbc.)
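The two steps just described (PAA, then symbolization) can be sketched as follows. The breakpoints ±0.43 used here are the standard Gaussian breakpoints for a cardinality-3 alphabet, and the short toy series is invented for illustration:

```python
import bisect
import statistics

def sax(series, n_segments, breakpoints=(-0.43, 0.43), alphabet="abc"):
    # z-normalize the series first (standard in SAX)
    mu, sigma = statistics.mean(series), statistics.pstdev(series)
    z = [(v - mu) / sigma for v in series]
    # PAA step: mean of each equal-length segment (length must divide evenly here)
    seg = len(z) // n_segments
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(n_segments)]
    # symbolization step: which interval between breakpoints holds each mean?
    return "".join(alphabet[bisect.bisect(breakpoints, m)] for m in paa)

series = [1, 2, 3, 4, 5, 6, 7, 8]
print(sax(series, n_segments=4))  # "aacc": low segments -> a, high -> c
```

Both steps are a single pass over the series, which is the linear-time claim on the slide; the result is a discrete string on which discrete-data tools (hashing, suffix structures, etc.) can operate.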

73 A change of representation is a fundamental data mining trick. (Example discrete strings: bcabaabca, ccabaacca, bcabaabca.)