1 Anomaly Detection in Data Science: One-class Classification with Privileged Information for Malware Detection. Pavel Erofeev, IITP RAS, Airbus Group Russia
2 Find the Panda
3 Anomaly Detection: Hadlum vs Hadlum. The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. The average human pregnancy period is 280 days (40 weeks). Statistically, 349 days is an outlier.
4 “An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.” Hawkins, 1980
5 Defining Anomaly Detection: Observations are represented digitally as vectors. The data are a mixture of “nominal” and “abnormal” points. Anomaly points are generated by a different generative process than the nominal points.
6 Possible Settings in CS: Supervised (known attacks): training data are labeled “nominal” or “anomaly”. Clean (zero-day attacks): training data are all “nominal”; test data may be contaminated with “anomaly” points. Unsupervised (unknown attacks): training data consist of a mixture of “nominal” and “anomaly” points.
7 Real World Data Problems: Data is multivariate. There is usually more than one generating mechanism underlying the “normal” data. Anomalies may represent a different class of objects, so there are many of them. The definition of what counts as an anomaly is domain-specific. Normality evolves in time.
8 Anomaly Taxonomy Point Anomaly
9 Anomaly Taxonomy Contextual Anomaly
10 Anomaly Taxonomy Causal Anomaly
11 Taxonomy
12 Imbalanced classification: Normal data: a lot of samples. Abnormal data: very few. Standard methods do not work as expected. Standard metrics do not apply.
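To make the point about standard metrics concrete, here is a minimal sketch on hypothetical toy labels (990 normal, 10 abnormal; these numbers are an assumption for illustration, not data from the talk). A classifier that never flags anything scores high accuracy yet finds no anomalies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the anomaly label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all samples that were labeled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true anomalies that were detected."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy labels: 0 = normal, 1 = abnormal.
y_true = [0] * 990 + [1] * 10
# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

Metrics that focus on the rare class, such as recall or precision, expose the failure that accuracy hides.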
13 Imbalanced classification: Weights for classes: proved not to be helpful in most cases. Resampling methods: oversampling (bootstrap, SMOTE, etc.) and undersampling. How to choose which method to use? How to choose the resampling parameter? We compared several methods and proposed a meta-model that on average gives the best results [Papanov, Erofeev, Burnaev, 2015].
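The two resampling families named above can be sketched as follows. This is a simplified illustration, not the meta-model from the cited work: `smote_like` is a hypothetical helper that imitates the interpolation idea behind SMOTE, and `undersample` is plain random undersampling. The amounts `n_new` and `n_keep` play the role of the resampling parameter.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def undersample(majority, n_keep, seed=0):
    """Random undersampling: keep n_keep majority points without replacement."""
    return random.Random(seed).sample(majority, n_keep)


# Hypothetical toy data: 4 minority points, 100 majority points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(100)]

new_minority = minority + smote_like(minority, n_new=16)
small_majority = undersample(majority, n_keep=20)
print(len(new_minority), len(small_majority))
```

Every synthetic point lies on a segment between two existing minority points, so oversampling of this kind never leaves the convex hull of the minority class.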
14 Statistics-based models: Assume a generation procedure for the normal data (e.g. a Gaussian distribution). PCA is a method commonly used to extract the highest-variance combinations of features in the data. PCA-based anomaly detection works well in highly correlated environments.
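One common way to turn PCA into an anomaly detector is to score each point by its reconstruction error, that is, its squared distance from the leading principal axis. The sketch below assumes 2-D data so that the axis has a closed form; the toy data and the helper names are hypothetical, and the slide does not say which PCA-based score the authors used.

```python
import math


def principal_axis(points):
    """Mean and unit direction of the first principal component (2-D data)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form orientation of the leading eigenvector of a 2x2 covariance.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def reconstruction_error(point, mean, direction):
    """Squared distance between a point and its projection onto the axis."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    along = dx * direction[0] + dy * direction[1]
    return max(0.0, dx * dx + dy * dy - along * along)


# Hypothetical, highly correlated training data close to the line y = x.
train = [(0.1 * i, 0.1 * i + 0.01 * (-1) ** i) for i in range(50)]
mean, direction = principal_axis(train)

err_nominal = reconstruction_error((2.5, 2.5), mean, direction)
err_anomaly = reconstruction_error((2.5, 0.5), mean, direction)
print(err_nominal, err_anomaly)
```

The second query point has ordinary coordinate values taken one at a time, but it breaks the correlation between them, which is why the reconstruction error separates it from the nominal point.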
15 Density-based models: SVM-based and nearest-neighbours-based. How to choose the best kernel parameter?
16 One-class SVM with Privileged Information. Evgeny Burnaev, Dmitry Smolyakov. Skoltech, IITP RAS
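A nearest-neighbours score is perhaps the simplest density-based detector: points in sparse regions are far from their neighbours. The sketch below is an illustration under assumed toy data, with a hypothetical helper `knn_score`. The neighbour count `k` must be tuned much like the kernel parameter the slide asks about.

```python
import math


def knn_score(x, train, k=3):
    """Anomaly score: Euclidean distance from x to its k-th nearest training point."""
    dists = sorted(math.dist(x, p) for p in train)
    return dists[k - 1]


# Hypothetical nominal data: a 5 x 5 grid with unit spacing.
train = [(float(i), float(j)) for i in range(5) for j in range(5)]

score_nominal = knn_score((2.2, 2.1), train)    # inside the grid
score_anomaly = knn_score((10.0, 10.0), train)  # far from all training points
print(score_nominal, score_anomaly)
```

A larger score means lower local density, so a threshold on the score gives a decision rule.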
17 One-Class SVM
18 One-Class SVM
19 One-Class SVM
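The content of the One-Class SVM slides survives only as titles. For orientation, the standard one-class SVM of Schölkopf et al. can be written as below; the notation is an assumption and may differ from what was on the slides. Here \phi is the feature map, \xi_i are slack variables, \rho is the offset, n is the number of training points, and \nu \in (0, 1] upper-bounds the fraction of training points treated as outliers.

```latex
\min_{w,\,\xi,\,\rho}\quad \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\qquad \text{s.t.}\quad
  \langle w, \phi(x_i)\rangle \ge \rho - \xi_i,\quad
  \xi_i \ge 0,\quad i = 1,\dots,n

f(x) = \operatorname{sign}\bigl(\langle w, \phi(x)\rangle - \rho\bigr)
```

Points with f(x) = -1 fall on the origin side of the separating hyperplane in feature space and are reported as anomalies.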
20 One-Class SVM Kernel Trick
21 Kernel Trick
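The kernel trick replaces inner products in feature space with a kernel function evaluated on the original inputs. A minimal sketch with the Gaussian (RBF) kernel follows; choosing this kernel, the parameter name `gamma`, and the toy points are assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, gamma=1.0):
    """Pairwise kernel values; a kernel method only needs this matrix,
    never the explicit feature vectors."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


# Hypothetical toy points.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K = gram_matrix(points, gamma=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

A larger `gamma` makes the kernel decay faster with distance, so each training point influences a smaller neighbourhood.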
22 Hyper-parameter Influence
23 Decision Functions
24 Learning with Privileged Info: Example: image classification with textual description
25 Learning with Privileged Info
26 Learning with Privileged Info
27 Learning with Privileged Info
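These slides also survive only as titles. As background, the standard two-class SVM+ formulation of learning using privileged information (Vapnik and Vashist) is sketched below. The notation is an assumption, and the one-class variant presented in the talk may differ. Here x_i are the original features, x_i^* are the privileged features available only at training time, and the slack of each training point is modelled as a linear function of its privileged features.

```latex
\min_{w,\,b,\,w^{*},\,b^{*}}\quad
  \frac{1}{2}\lVert w\rVert^{2}
  + \frac{\gamma}{2}\lVert w^{*}\rVert^{2}
  + C\sum_{i=1}^{n}\bigl(\langle w^{*}, x_i^{*}\rangle + b^{*}\bigr)

\text{s.t.}\quad
  y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge
    1 - \bigl(\langle w^{*}, x_i^{*}\rangle + b^{*}\bigr),\qquad
  \langle w^{*}, x_i^{*}\rangle + b^{*} \ge 0,\qquad i = 1,\dots,n
```

Only w and b are used at test time, so the privileged features are never required for prediction.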
28 Microsoft Malware Classification Challenge: Kaggle.com competition data (2015)
29 Problem Description: 9 malware families: Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY, Gatak. Raw data: hexadecimal representation of the raw binary content; meta-data extracted from the binaries, including function calls, strings, etc.
30 Features: Original features: information from binary files, such as frequencies of bytes, number of different N-grams, etc. Privileged features: information from code disassembly, such as frequencies of commands and number of calls to external DLLs. Bytecode as an image: features based on image texture, which are commonly used for image classification.
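The two original features named above, byte frequencies and the number of different N-grams, can be computed as in the sketch below. The hex fragment is a made-up example and the helper names are hypothetical; the slide does not specify the feature pipeline used in the experiments.

```python
from collections import Counter


def byte_frequencies(data: bytes):
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def distinct_ngrams(data: bytes, n: int = 2) -> int:
    """Number of different byte n-grams occurring in the data."""
    return len({data[i:i + n] for i in range(len(data) - n + 1)})


# Hypothetical fragment given as hex text, as in a hexadecimal dump.
sample = bytes.fromhex("56 8D 44 24 08 50 8B F1 56 8D")
freqs = byte_frequencies(sample)
print(freqs[0x56], distinct_ngrams(sample, n=2))
```

The frequency vector has a fixed length of 256 regardless of file size, which makes it convenient as input to a kernel method.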
31 Features
32 Experimental Setup
33 Results
34 Thanks! Any questions?