1 Predictive Analytics – Basics of Machine Learning
2 What Is Machine Learning?
- Automating automation
- Getting computers to program themselves
- Writing software is the bottleneck
- Let the data do the work instead!
3 Traditional Programming vs. Machine Learning
- Traditional programming: Data + Program -> Computer -> Output
- Machine learning: Data + Output -> Computer -> Program
4 Sample Applications
- Web search
- Computational biology
- Finance
- E-commerce
- Space exploration
- Robotics
- Information extraction
- Social networks
- Debugging
- [Your favorite area]
5 ML in a Nutshell
- Tens of thousands of machine learning algorithms; hundreds of new ones every year
- Every machine learning algorithm has three components:
  - Representation
  - Evaluation
  - Optimization
6 Representation
- Decision trees
- Sets of rules / logic programs
- Instances
- Graphical models (Bayes/Markov nets)
- Neural networks
- Support vector machines
- Model ensembles
- Etc.
7 Evaluation
- Accuracy
- Precision and recall
- Squared error
- Likelihood
- Posterior probability
- Cost / utility
- Margin
- Entropy
- K-L divergence
- Etc.
8 Optimization
- Combinatorial optimization, e.g. greedy search
- Convex optimization, e.g. gradient descent
- Constrained optimization, e.g. linear programming
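The gradient-descent example above can be made concrete with a short sketch. Everything here (the toy objective f(w) = (w - 3)^2, the function name, the step size) is my own illustrative choice, not something from the slides.

```python
# Minimal gradient-descent sketch: repeatedly step against the gradient
# of f(w) = (w - 3)**2, whose analytic gradient is f'(w) = 2*(w - 3).
def gradient_descent(lr=0.1, steps=100):
    w = 0.0                   # arbitrary starting point
    for _ in range(steps):
        grad = 2 * (w - 3)    # analytic gradient of f at w
        w -= lr * grad        # move opposite the gradient
    return w
```

With these settings the iterate contracts toward the minimizer w = 3 by a factor of 0.8 per step.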
9 Types of Learning
- Supervised (inductive) learning: training data includes desired outputs
- Unsupervised learning: training data does not include desired outputs
- Semi-supervised learning: training data includes a few desired outputs
- Reinforcement learning: rewards from a sequence of actions
10 ML in Practice
- Understanding the domain, prior knowledge, and goals
- Data integration, selection, cleaning, pre-processing, etc.
- Learning models
- Interpreting results
- Consolidating and deploying discovered knowledge
- Loop
11 Taxonomy of Machine Learning
- Rote learning
- Learning from instruction
- Learning by analogy
- Learning from examples
  - Supervised learning (the most common type): classification, regression
  - Reinforcement learning
  - Unsupervised learning
  - Evolutionary learning
- Markov chain (learning with probability)
- Hidden Markov model (learning with probability and time)
12 Rote Learning
- The simplest form of learning: the input data are not processed in any way.
- The solution to a problem is stored directly and retrieved when the same problem appears again.
- If too many examples are stored, searching them for the solution to a new problem degrades the program's overall efficiency.
- Indexing techniques can simplify the search for stored data, and heuristics can limit the number of examples stored.
- This form was already quite mature in Samuel's CHECKER program of the late 1950s, which stored board positions.
13 Learning from Instruction
- Improves performance based on advice from an external instructor; similar to most formal education.
- The main difficulties:
  - Translating high-level human language into an internal knowledge structure the program can use.
  - Integrating this knowledge with the existing knowledge base so it can be applied effectively.
- The learner must perform some deduction, but most of the responsibility rests with the instructor, who must supply well-organized knowledge that incrementally extends the learner's existing knowledge.
- Requires building a system that can accept instruction or advice, and can store and effectively apply the acquired knowledge.
14 Learning by Analogy
- Applies existing knowledge to a new problem based on the similarities between the two, modifying that knowledge to fit the new case.
- Example: we might tell a student that a tennis serve is like hammering a nail into a high spot on a wall. Similarly, an analogy-based learning system could convert an existing computer program into one that performs a related function the original did not provide.
- Compared with rote learning or learning from instruction, learning by analogy demands more inference from the learner.
15 Learning from Examples
- Given an example set of limited size, find a concise data description.
- Supervised learning (the most common type): classification, regression
- Reinforcement learning
- Unsupervised learning
- Evolutionary learning
16 Supervised Learning
- Classification: assign new inputs to one of a number of discrete classes or categories.
  - Ex: assign a character bitmap the correct letter of the alphabet; other pattern-recognition problems.
- Regression: the output space is formed by the values of continuous variables.
  - Ex: predict the value of shares in the stock exchange market.
17 Reinforcement Learning
- Learn how to map situations to actions so as to maximize a given reward.
- The learning algorithm is not told which actions to take in a given situation.
- The learner gains information about the actions taken through a reward, which does not necessarily arrive immediately after the action is taken.
- Ex: playing chess, where the reward is winning or losing.
18 Unsupervised Learning
- The data are only a sample of objects without associated target values.
- Clustering algorithms: given a training sample of objects, extract some structure from them.
  - Ex: identify indoor vs. outdoor images.
- Dimensionality reduction methods: represent high-dimensional data in a low-dimensional space while trying to preserve the original information in the data.
19 Evolutionary Learning
- Biological organisms adapt to improve their survival rates and chances of having offspring in their environment.
20 Least Squares Learning Rule
21 A Beginning - Artificial Neuron
- Dendrites receive signals from other neural units.
- When the impulses a unit receives exceed a certain threshold, the unit fires and sends an impulse down the axon.
- The branches at the end of the axon are called synapses; they are the connection points between neurons and can be inhibitory or excitatory. An inhibitory synapse attenuates the impulse it transmits; an excitatory synapse strengthens it.
- Figure labels: cell body, dendrite, axon.
22 Artificial Neuron (Perceptron)
- Artificial neuron (two-input case), with bias input x0 = 1 and weight w0.
- Output function (activation function).
23 Artificial Neuron Output Function (Activation Function): binary threshold function
24 Artificial Neuron Output Function (Activation Function): linear threshold function
25 Artificial Neuron Output Function (Activation Function): sigmoidal function
26 Artificial Neuron Output Function (Activation Function): Gaussian function
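The four activation functions named on slides 23-26 can be sketched as follows. The slides' formulas were not preserved, so the parameter names (theta, the saturation limits, sigma) and exact forms are illustrative assumptions of the common definitions.

```python
import math

# Common forms of the four activation functions (parameters illustrative).
def binary_threshold(net, theta=0.0):
    # Fires 1 when net input reaches the threshold, else 0.
    return 1.0 if net >= theta else 0.0

def linear_threshold(net, lo=0.0, hi=1.0):
    # Linear in the middle, clipped at the saturation limits.
    return max(lo, min(hi, net))

def sigmoid(net):
    # Smooth squashing function in (0, 1).
    return 1.0 / (1.0 + math.exp(-net))

def gaussian(net, sigma=1.0):
    # Bell-shaped response, maximal at net = 0.
    return math.exp(-net ** 2 / (2 * sigma ** 2))
```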
27 Least squares learning rule - regression
- Given a training set, minimize the difference (mean square error) between the desired output and the actual output for each input vector.
28 Least squares learning rule - regression
29 Least squares learning rule - regression
30 Least squares learning rule - regression
31 Least squares learning rule - regression
32 Least squares learning rule - regression: for real-time learning
33 Least squares learning rule - regression: the weight modification depends on the output error and the inputs
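The real-time least-squares update described above is commonly written as the LMS (delta) rule: each weight changes in proportion to the output error times the corresponding input. A minimal sketch, with illustrative data, names, and learning rate:

```python
# LMS / delta rule sketch: w <- w + eta * (d - y) * x,  b <- b + eta * (d - y),
# i.e. the modification depends on the output error and the inputs.
def lms_train(samples, eta=0.1, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, d in samples:       # d is the desired output for input x
            y = w * x + b          # actual (linear) output
            err = d - y            # output error
            w += eta * err * x     # update depends on error * input
            b += eta * err         # bias update (input fixed at 1)
    return w, b
```

For example, training on points drawn from the line d = 2x + 1 recovers w close to 2 and b close to 1.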
34 Back-propagation Learning Rule
35 Back-propagation learning rule: improve the learning efficiency. (Network diagram: inputs x1 ... xn, bias term x0 = 1, weights w0 ... wn.)
36 Back-propagation learning rule
37 Back-propagation learning rule: linear output function (output-layer and hidden-layer updates)
38 Back-propagation learning rule: sigmoid output function (output-layer and hidden-layer updates)
39 Back-propagation learning rule: output layer (linear vs. sigmoid), hidden layer (linear vs. sigmoid)
40 Back-propagation learning rule
- Apply the input vector to the input layer.
- Calculate the net input of the hidden layer.
- Calculate the output of the hidden layer.
- Calculate the net input of the output layer.
- Calculate the output of the output layer.
- Update the weights of the output layer (one case for a linear output unit, one for a sigmoid unit).
- η: the learning-rate parameter (0 < η < 1).
41 Back-propagation learning rule
- Update the weights of the hidden layer (one case for a linear output unit, one for a sigmoid unit).
- If the error term is acceptably small, stop the training.
42 Back-propagation learning rule
- Weights: initialized to small random values between ±0.5.
- Bias terms: may improve the learning.
- Learning-rate parameter: a small number, on the order of 0.05 to 0.25, to ensure settling to a solution.
43 Back-propagation learning rule - An Exercise: exclusive-OR operation
44 Back-propagation learning rule - An Exercise
- Initial weights and threshold levels are set randomly: w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1, θ5 = 0.3.
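Assuming the usual convention that a neuron's net input subtracts its threshold (net = Σ w·x − θ) and that all units use the sigmoid activation, the forward pass of this 2-2-1 XOR network with the slide's initial weights can be sketched as:

```python
import math

# Initial weights and thresholds from the slide.
w13, w14, w23, w24 = 0.5, 0.9, 0.4, 1.0   # input -> hidden
w35, w45 = -1.2, 1.1                      # hidden -> output
t3, t4, t5 = 0.8, -0.1, 0.3               # thresholds

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x1, x2):
    y3 = sigmoid(w13 * x1 + w23 * x2 - t3)   # hidden neuron 3
    y4 = sigmoid(w14 * x1 + w24 * x2 - t4)   # hidden neuron 4
    y5 = sigmoid(w35 * y3 + w45 * y4 - t5)   # output neuron 5
    return y3, y4, y5
```

For input (1, 1) this gives y3 ≈ 0.525, y4 ≈ 0.881, and network output y5 ≈ 0.510, so the initial error against the desired XOR output 0 is about −0.51.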
45 Back-propagation learning rule – An Exercise
46 Back-propagation learning rule – An Exercise
47 Back-propagation learning rule - An Exercise: the training process is repeated until the sum of squared errors falls below a preset threshold.
48 Back-propagation learning rule - An Exercise: sum-squared error curve
49 Back-propagation learning rule - An Exercise: final results
50 Competitive Learning
51 Competitive Learning: Principle of Competitive Learning
52 Competitive Learning
- Discovery of significant patterns or invariants of the environment without the intervention of a teaching input.
- Self-reinforcing, based on competition; may also involve cooperation.
53 Competitive Learning - Example of CL
- Three clusters of vectors (solid dots) distributed on the unit sphere. Initially randomized codebook vectors (crosses) move under the influence of a competitive learning rule to approximate the centroids of the clusters.
- Competitive learning schemes use codebook vectors to approximate the centroids of data clusters.
54 Competitive Learning - Learning Rules
- Signal Hebbian learning rule
- Competitive learning rule
- Linear competitive learning rule
55 Competitive Learning - Kohonen learning rule: winner-take-all learning rule
56 Competitive Learning - Kohonen learning rule
- About the winner: the more nearly a node's weight vector is parallel to the input vector, the higher the probability that it wins.
57 Competitive Learning - Kohonen learning rule
- Example: a two-node Kohonen network, with training patterns (also given in polar-coordinate form) and normalized initial weights.
58 Competitive Learning - Kohonen learning rule
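The winner-take-all update on slides 55-58 can be sketched as follows: the node whose weight vector best matches the input wins, and only the winner's weights move toward the input. The function name, the use of squared Euclidean distance as the match criterion, and the learning rate are illustrative choices (the slides' example uses normalized vectors, where distance and dot-product matching agree).

```python
# One winner-take-all (Kohonen) learning step.
def kohonen_step(weights, x, eta=0.5):
    # weights: list of weight vectors, one per node; x: input vector.
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    winner = dists.index(min(dists))          # best-matching node
    weights[winner] = [wi + eta * (xi - wi)   # only the winner moves
                       for wi, xi in zip(weights[winner], x)]
    return winner
```

For a two-node network with weights [1, 0] and [0, 1], the input [0.9, 0.1] is closest to the first node, which therefore wins and moves halfway toward the input (eta = 0.5).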
59 Self-Organizing Feature Map
60 Self-Organizing Feature Map: Underlying Ideas
- Unsupervised learning process.
- Clusters of neurons win the competition.
- Weights of winning neurons are adjusted to bring about a better response to the current input.
- Final weights specify clusters of network nodes that are topologically close.
- Correspondence between signal features and response locations on the map.
- Preserves the topology of the input.
61 Self-Organizing Feature Map: Underlying Ideas
- Distance relations in high-dimensional spaces should be approximated by the network as distances in the two-dimensional neuronal field.
- Input neurons should be exposed to a sufficient number of different inputs.
- Only the winning neuron and its neighbours adapt their connections.
- A similar weight-update procedure is employed on neurons comprising topologically related subsets.
- The resulting adjustment enhances the responses to the same or a similar input that occurs subsequently.
62 Self-Organizing Feature Map - Neighbourhood Computation
- The neighbourhood is a function of time: as epochs of training elapse, the neighbourhood shrinks.
- Neighbourhood shapes.
63 Self-Organizing Feature Map - Some Observations
- Ordering phase (initial period of adaptation): the learning rate should be close to unity, then decreased linearly, exponentially, or inversely with iteration over the first 1000 epochs while remaining above 0.1.
- Convergence phase: the learning rate should be maintained at around 0.01 for a large number of epochs, typically many tens of thousands.
- During the ordering phase, the neighbourhood N_k(IJ) shrinks linearly with k to finally include only a few neurons.
- During the convergence phase, N_k(IJ) may comprise only one or no neighbours.
64 Self-Organizing Feature Map - Simulation Example
- The data comprised 500 points distributed uniformly over the bipolar square [-1, 1] × [-1, 1]; the points thus describe a geometrically square topology.
- An 8×8 planar array of neurons was applied.
65 Self-Organizing Feature Map - Simulation Example
66 Self-Organizing Feature Map - Simulation Example
- Simulation notes: initial neighbourhood radius r = 6; the neighbourhood is initially a square of width 12 centered on the winning neuron IJ.
- The neighbourhood width contracts by 1 every 200 epochs; after 1000 epochs the radius is maintained at 1.
- The neighbourhood radius can also be zero, in which case only the winning neuron updates its weights.
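A compact SOM sketch in the spirit of the simulation notes above, using a 1-D array of neurons for brevity: the winner and its neighbours within a shrinking radius move toward each input, while the learning rate decays. The array size, schedules, and constants are illustrative choices, not the slide's exact 8×8 settings.

```python
import random

# 1-D SOM sketch with decaying learning rate and shrinking neighbourhood.
def train_som(data, n_nodes=10, epochs=1000, eta0=0.9):
    # Small random initial weight vectors (2-D inputs assumed).
    w = [[random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)]
         for _ in range(n_nodes)]
    for epoch in range(epochs):
        eta = max(0.01, eta0 * (1 - epoch / epochs))            # decaying rate
        radius = int(n_nodes / 2 * (1 - epoch / epochs))        # shrinking
        x = random.choice(data)
        dists = [sum((wi - xi) ** 2 for wi, xi in zip(node, x)) for node in w]
        win = dists.index(min(dists))                           # winner node
        # Winner and topological neighbours move toward the input.
        for j in range(max(0, win - radius), min(n_nodes, win + radius + 1)):
            w[j] = [wj + eta * (xi - wj) for wj, xi in zip(w[j], x)]
    return w
```

Trained on points drawn uniformly from the bipolar square, the weight vectors spread out over the data region, approximating its distribution.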
67 Evolutionary Learning
68 Calculus-based Search - Indirect method: set the gradient of the objective function equal to 0 (fig. 1.1).
69 Calculus-based Search - Direct method: move in a direction related to the local gradient (hill-climbing).
70 Calculus-based Search: both methods are local in scope; they find the best point only in a neighborhood of the current point (fig. 1.2).
71 Calculus-based Search: both depend on the existence of derivatives.
72 Calculus-based Search: for real-world search problems, calculus-based methods are insufficient.
73 State Space Search - Breadth-first search: first in, first out
74 State Space Search - Breadth-first search
75 State Space Search - Depth-first search: last in, first out
76 State Space Search - Depth-first search
77 State Space Search - Best-first search: first in, best out
78 State Space Search - Best-first search
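The three strategies above differ only in how the frontier is ordered, which one sketch can show: a FIFO frontier gives breadth-first search and a LIFO frontier gives depth-first search (replacing the frontier with a priority queue ordered by an evaluation function would likewise give best-first search). The toy graph and function name are illustrative.

```python
from collections import deque

# Generic state-space search: the frontier discipline picks the strategy.
def graph_search(graph, start, goal, lifo=False):
    frontier, seen = deque([[start]]), {start}
    while frontier:
        # pop() -> last in, first out (DFS); popleft() -> FIFO (BFS).
        path = frontier.pop() if lifo else frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None   # goal unreachable
```

On the diamond graph A->{B,C}, B->D, C->D, breadth-first finds A-B-D while depth-first finds A-C-D.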
79 The Goals of Optimization
- Process improvement: the shortest path
- The optimum (destination) itself: goal convergence
80 Random Search
- Search and save the best.
- Genetic Algorithms (GAs) are not a directionless randomized technique: they use random choice as a tool in a directed search process.
81 Characteristics of GAs - search information
- Which is the best? Smallest, longest, fastest, fittest, ...
- The best element met so far in the array is not always the solution, but it plays a role in guiding the search direction.
82 Characteristics of GAs - search information
- The search direction:
  - breadth-first search: the first element in the array
  - depth-first search: the last element in the array
  - best-first search: the best element in the array
  - winner-take-all neural network search: the gradient calculation (gradient of the error function)
- In a GA search, how is the search direction found?
83 Characteristics of GAs - search information
- From "winner takes all" to "winner group takes all": randomly exchange the bits of the strings in the winner group to form the next search state.
84 Characteristics of GAs
- Work with a coding of the parameter set, not the parameters themselves.
- Search from a population of points, not a single point.
- Use objective-function information, not derivatives or other auxiliary knowledge.
- Use probabilistic transition rules, not deterministic rules.
85 Characteristics of GAs - coding
- Example: find the maximum of f(x) (fig. 1.5).
- How to code the parameter? Code the parameter x as a finite-length string.
86 A Simple Genetic Algorithm - Initialization
- Example population: 01101, 11000, 01000, 10011, ...
- The number of strings depends on the complexity of the problem.
87 A Simple Genetic Algorithm - Reproduction
- Copying strings according to their fitness values.
- Strings with higher values have a higher probability of contributing offspring to the next generation.
- Example: Table 1.1.
88 A Simple Genetic Algorithm - Reproduction
- Fig. 1.7: each string has a reproduction probability proportional to its corresponding area (fitness).
- Highly fit strings have a higher number of offspring in the succeeding generation.
89 A Simple Genetic Algorithm - Crossover
- Newly reproduced strings in the mating pool are mated at random.
- Randomly select a crossover point k from [1, l-1], where l is the string length, then swap all characters between positions k+1 and l.
90 A Simple Genetic Algorithm - Crossover
- Example: string 1 = 01101, string 2 = 11000; select k from [1, 5-1] = [1, 4].
- If k = 4: 01101, 11000 -> 01100, 11001 (new strings).
91 A Simple Genetic Algorithm - Crossover
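The single-point crossover just described can be sketched directly; with k = 4 it reproduces the slide's example. The function name is my own.

```python
import random

# Single-point crossover: cut both strings after position k and swap tails.
def crossover(s1, s2, k=None):
    l = len(s1)
    if k is None:
        k = random.randint(1, l - 1)   # k in [1, l-1], as on the slide
    return s1[:k] + s2[k:], s2[:k] + s1[k:]
```

For example, crossover("01101", "11000", k=4) yields the pair ("01100", "11001").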
92 A Simple Genetic Algorithm - Crossover
- Initialization: random number generation.
- Reproduction: string copy.
- Crossover: string exchange.
- The mix of direction and chance builds new solutions from the best partial solutions.
93 A Simple Genetic Algorithm - Mutation
- Reproduction and crossover may occasionally become overzealous and lose some potentially useful genetic material.
- Mutation: occasional random alteration of the value of a string position (for binary coding: 0 <-> 1).
- A random walk through the string space; an insurance policy against premature loss, applied with small probability.
94 A Simulation by Hand: maximize the objective function
95 A Simulation by Hand - Coding: x in [0, 31] is coded as a 5-bit string, 00000-11111; initialization
96 A Simulation by Hand - Reproduction: the weighted roulette wheel
97 A Simulation by Hand Crossover
98 A Simulation by Hand - Mutation: performed on a bit-by-bit basis, with a given probability applied per bit
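The hand simulation's pieces (5-bit coding of x in [0, 31], weighted-roulette-wheel reproduction, single-point crossover, per-bit mutation) can be tied together in one sketch. The objective f(x) = x**2 is an illustrative assumption, since the slides leave the formula unspecified, and all names and constants are my own choices.

```python
import random

# Simple GA sketch: coding, roulette-wheel selection, crossover, mutation.
def run_ga(pop_size=4, bits=5, generations=20, p_mut=0.01, seed=1):
    rng = random.Random(seed)
    f = lambda s: int(s, 2) ** 2          # assumed objective f(x) = x^2
    pop = ["".join(rng.choice("01") for _ in range(bits))
           for _ in range(pop_size)]       # random initialization
    for _ in range(generations):
        fits = [f(s) for s in pop]
        # Roulette-wheel reproduction: probability proportional to fitness
        # (zero fitnesses get weight 1 so the wheel is never empty).
        pool = [rng.choices(pop, weights=[fi or 1 for fi in fits])[0]
                for _ in range(pop_size)]
        # Single-point crossover on random pairs (pop_size must be even).
        nxt = []
        for i in range(0, pop_size, 2):
            k = rng.randint(1, bits - 1)
            a, b = pool[i], pool[i + 1]
            nxt += [a[:k] + b[k:], b[:k] + a[k:]]
        # Bit-by-bit mutation with small per-bit probability.
        pop = ["".join(c if rng.random() > p_mut else "10"[int(c)]
                       for c in s) for s in nxt]
    return max(pop, key=f)
```

Over generations the population drifts toward strings coding larger x, since fitness grows with x under the assumed objective.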