Ensemble method, decision tree, random forest and boosting


1 Ensemble method, decision tree, random forest and boosting (Zhiqi Peng)

2 Key concepts of supervised learning
Objective function: $J(\theta) = L(\theta) + \Omega(\theta)$
$L(\theta)$ is the training loss, which measures how well the model fits the training data.
$\Omega(\theta)$ is the regularization term, which measures the complexity of the model.

3 Key concepts of supervised learning
A lower training loss indicates a more predictive model; a lower regularization term indicates a simpler model.

4 Bias and Variance tradeoff
$Y = f(x) + \varepsilon$, with $\varepsilon \sim N(0, \sigma_{\varepsilon}^2)$
$Err(x) = E[(Y - \hat{f}(x))^2]$

5 Bias and Variance tradeoff
$$\begin{aligned}
Err(x) &= E[(Y - \hat f(x))^2] = E[(f(x) + \varepsilon - \hat f(x))^2] \\
&= E[(f(x) - \hat f(x))^2] + 2E[\varepsilon\,(f(x) - \hat f(x))] + E[\varepsilon^2] \\
&= E[\hat f(x)^2] - 2 f(x)\,E[\hat f(x)] + f(x)^2 + \sigma_{\varepsilon}^2 \\
&= E[\hat f(x)^2] - E^2[\hat f(x)] + E^2[\hat f(x)] - 2 f(x)\,E[\hat f(x)] + f(x)^2 + \sigma_{\varepsilon}^2 \\
&= Var(\hat f(x)) + \big(E[\hat f(x)] - f(x)\big)^2 + \sigma_{\varepsilon}^2 \\
&= \text{Variance} + \text{Bias}^2 + \text{Irreducible error}
\end{aligned}$$
(The cross term vanishes because $\varepsilon$ has mean zero and is independent of $\hat f(x)$.)
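The decomposition can be checked numerically. Below is a minimal Python sketch, assuming a constant target $f(x)$ and taking the mean of $n$ noisy draws, plus a deliberate offset, as the estimator $\hat f$; both choices are illustrative, not from the slides:

```python
import random
import statistics

random.seed(1)
f_x, sigma = 2.0, 1.0          # true value f(x) and noise standard deviation
n = 5                          # training points per simulated dataset

def fit():
    # Estimator fhat: mean of n noisy draws, shifted by 0.3 so bias is nonzero.
    return 0.3 + statistics.mean(f_x + random.gauss(0, sigma) for _ in range(n))

M = 50_000
fhats = [fit() for _ in range(M)]                      # one fhat per training set
ys    = [f_x + random.gauss(0, sigma) for _ in range(M)]  # fresh test observations

err      = statistics.mean((y - fh) ** 2 for y, fh in zip(ys, fhats))
variance = statistics.pvariance(fhats)                 # Var(fhat) ~ sigma^2 / n = 0.2
bias2    = (statistics.mean(fhats) - f_x) ** 2         # Bias^2 ~ 0.3^2 = 0.09
print(err, variance + bias2 + sigma ** 2)              # both close to 1.29
```

Here $Var(\hat f) = \sigma^2/n = 0.2$ and $\text{Bias}^2 = 0.09$, so both printed values land near $0.2 + 0.09 + 1.0 = 1.29$.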

6 Bias and Variance tradeoff
[Figure illustrating the bias-variance tradeoff, copied from internet]

7 Classification Error
Decision rule: $\hat y = \operatorname{sign}(\hat f - 1/2)$.
$$\Pr(\hat y \ne y \mid x) = \mathbb{1}(f < 1/2)\int_{1/2}^{\infty} p(\hat f)\, d\hat f + \mathbb{1}(f \ge 1/2)\int_{-\infty}^{1/2} p(\hat f)\, d\hat f$$
Here $f(x)$ is the true target and $\hat f(x)$ its estimate; for a fixed $x$, $\hat f(x)$ is a random variable over training samples, with density $p(\hat f)$. The expression above is the expected classification error at $x$; averaging over $x$ (e.g., under a uniform distribution of $x$) gives the overall expected error.

8 Classification Error
If we assume $\hat f \sim N(E[\hat f],\, Var(\hat f))$:
$$\Pr(\hat y \ne y \mid x) = \Phi\!\left(\operatorname{sign}(f - 1/2)\,\frac{E[\hat f] - 1/2}{\sqrt{Var(\hat f)}}\right), \quad \text{where } \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_z^{\infty} e^{-\frac{1}{2}u^2}\, du$$
Define the boundary bias: $b(f, E[\hat f]) = \operatorname{sign}(1/2 - f)\,(E[\hat f] - 1/2)$

9 Classification Error
For a given $b(f, E[\hat f])$, if $b(f, E[\hat f]) < 0$ (the mean prediction lies on the correct side of the boundary):
- the error $\Pr(\hat y \ne y)$ decreases as $|E[\hat f] - 1/2|$ increases;
- the error $\Pr(\hat y \ne y)$ decreases as $Var(\hat f)$ decreases.
This is why some high-bias methods perform well for classification while being inappropriate for function estimation: only the side of the boundary matters, not the exact value of $\hat f$.
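A numeric sketch of this effect, assuming the Gaussian model of $\hat f$ from the previous slide; the values of $f$, $E[\hat f]$, and $Var(\hat f)$ are made up for illustration:

```python
import math

def upper_tail(z):
    """Phi(z) = P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def classification_error(f, mean_fhat, sd_fhat):
    """Pr(yhat != y | x) under fhat ~ N(mean_fhat, sd_fhat^2)."""
    s = 1.0 if f >= 0.5 else -1.0
    return upper_tail(s * (mean_fhat - 0.5) / sd_fhat)

f = 0.6  # the true value is above 1/2, so the correct decision is +1

# b < 0: mean prediction on the correct side -> reducing variance helps.
err_hi = classification_error(f, mean_fhat=0.55, sd_fhat=0.2)
err_lo = classification_error(f, mean_fhat=0.55, sd_fhat=0.05)
print(err_lo < err_hi)   # True

# b > 0: mean prediction on the wrong side -> reducing variance hurts.
err_hi2 = classification_error(f, mean_fhat=0.45, sd_fhat=0.2)
err_lo2 = classification_error(f, mean_fhat=0.45, sd_fhat=0.05)
print(err_lo2 > err_hi2)  # True
```
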

10 Decision Tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. (Copied from internet)

11 Ensemble method
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.

12 Ensemble Error Rate
$$e_{ensemble} = \sum_{i > n/2}^{n} \binom{n}{i}\, e_{individual}^{\,i}\,(1 - e_{individual})^{n-i}, \qquad e_{individual} < 1/2$$
Example: $e_{individual} = 0.35$, $n = 25$ gives $e_{ensemble} \approx 0.06$.
(This assumes the $n$ base classifiers make independent errors and the ensemble takes a majority vote.)
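This error rate is a binomial tail and can be evaluated directly; a quick check of the example on this slide:

```python
import math

def ensemble_error(e_individual, n):
    """Majority-vote error of n independent classifiers, each with
    individual error rate e_individual (the binomial tail above)."""
    return sum(
        math.comb(n, i) * e_individual**i * (1 - e_individual)**(n - i)
        for i in range(n // 2 + 1, n + 1)
    )

print(ensemble_error(0.35, 25))   # about 0.06, matching the slide
```
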

13 Ensemble Method: Tree Bagging
Given a training set $X = x_1, \dots, x_n$ with responses $Y = y_1, \dots, y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples. For $b = 1, \dots, B$:
- Sample, with replacement, $n$ training examples from $X, Y$; call these $X_b, Y_b$.
- Train a decision or regression tree $f_b$ on $X_b, Y_b$.
After training, the prediction for an unseen sample $x'$ is made by averaging the predictions of the individual regression trees:
$$\hat f(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$
or by taking the majority vote in the case of classification trees.

14 Ensemble Method
This procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees; bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
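The bagging procedure above can be sketched in a few lines of Python. The base learner here is a depth-1 regression stump rather than a full tree, and the helper names (`fit_stump`, `bagged_regressor`) and the toy data are illustrative:

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree (stump): choose the threshold that
    minimizes squared error, predicting the mean of each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        ml, mr = statistics.mean(left), statistics.mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def bagged_regressor(xs, ys, B=25, seed=0):
    """Tree bagging: fit B stumps on bootstrap resamples, average predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in trees) / B   # average over the ensemble

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.0, 0.2, 0.1, 0.9, 1.1, 1.0, 0.9]       # step-like target
model = bagged_regressor(xs, ys)
print(model(1), model(6))                            # low on the left, high on the right
```
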

15 Variance reduction
Assume we have $T$ functions $\hat f_i(x)$ with $E[\hat f_i] = \mu$, $Var(\hat f_i) = \sigma^2$, and $\hat f = \frac{1}{T}\sum_{i=1}^{T} \hat f_i(x)$.
$$\begin{aligned}
Var(\hat f) &= Var\Big(\frac{1}{T}\sum_{i=1}^{T} \hat f_i\Big) = E\Big[\Big(\frac{1}{T}\sum_{i=1}^{T} \hat f_i\Big)^2\Big] - E^2\Big[\frac{1}{T}\sum_{i=1}^{T} \hat f_i\Big] \\
&= \frac{1}{T^2}\Big(E\Big[\sum_{i=1}^{T} \hat f_i^2\Big] + E\Big[\sum_{i \ne j} \hat f_i \hat f_j\Big]\Big) - \mu^2 \\
&= \frac{1}{T^2}\Big(T(\sigma^2 + \mu^2) + E\Big[\sum_{i \ne j} \hat f_i \hat f_j\Big]\Big) - \mu^2
\end{aligned}$$

16 Variance reduction
$$Var(\hat f) = \frac{1}{T^2}\Big(T(\sigma^2 + \mu^2) + E\Big[\sum_{i \ne j} \hat f_i \hat f_j\Big]\Big) - \mu^2$$
If $\hat f_i$ and $\hat f_j$ are independent, $E\big[\sum_{i \ne j} \hat f_i \hat f_j\big] = (T^2 - T)\mu^2$, so
$$Var(\hat f) = \frac{1}{T^2}\big(T\sigma^2 + T\mu^2 + (T^2 - T)\mu^2\big) - \mu^2 = \frac{\sigma^2}{T}$$
The ensemble variance is smaller than that of an individual function, while the bias stays the same.
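The $\sigma^2/T$ result is easy to verify by simulation; a small sketch using Uniform(0, 1) draws (so $\sigma^2 = 1/12$) to stand in for the individual functions:

```python
import random
import statistics

random.seed(42)
T = 10                     # ensemble size
sigma2 = 1.0 / 12.0        # variance of a single Uniform(0, 1) draw

# Each "ensemble output" is the mean of T independent draws; repeat many times.
means = [statistics.mean(random.random() for _ in range(T))
         for _ in range(100_000)]
var_ensemble = statistics.pvariance(means)
print(var_ensemble, sigma2 / T)   # both about 0.0083
```
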

17 Review: Classification Error
Ensemble methods can reduce variance. For a given $b(f, E[\hat f])$:
- if $b(f, E[\hat f]) < 0$, reducing variance decreases the classification error;
- if $b(f, E[\hat f]) > 0$, reducing variance increases the classification error.

18 Random Forest Random forests use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
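The feature-subsampling step alone can be sketched as follows; the $\sqrt{d}$ subset size is a common convention for classification forests, and the helper name `candidate_features` is illustrative, not from the slides:

```python
import math
import random

def candidate_features(n_features, rng):
    """Feature bagging: at each candidate split, consider only a random
    subset of the features (here of size sqrt(d))."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(0)
feats = candidate_features(16, rng)
print(len(feats))   # 4 distinct feature indices out of 16
```

Because different trees see different feature subsets at each split, a single dominant predictor can no longer appear in every tree, which de-correlates them.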

19 Random Forest Error Bound
Define the margin of a random forest at $(X, Y)$, for a class $j \ne Y$:
$$mg(X, Y) = \Pr_\Theta(h(X, \Theta) = Y) - \Pr_\Theta(h(X, \Theta) = j)$$
$$mg(X, Y) = E_\Theta[\,I(h(X, \Theta) = Y) - I(h(X, \Theta) = j)\,]$$
Define the generalization error: $PE = \Pr_{X,Y}(mg(X, Y) < 0)$
Define the strength of the set of classifiers: $s = E_{X,Y}[mg(X, Y)]$
Assuming $s \ge 0$, Chebyshev's inequality gives:
$$PE \le \frac{Var(mg)}{s^2}$$
($PE$ is over any given point $(X, Y)$; $s$ is the overall expectation.)

20 Random Forest Error Bound
Define the raw margin function:
$$rmg(\Theta, X, Y) = I(h(X, \Theta) = Y) - I(h(X, \Theta) = j)$$
so that $mg(X, Y) = E_\Theta[rmg(\Theta, X, Y)]$. Taking $\Theta, \Theta'$ independent with the same distribution:
$$Var(mg) = E_{\Theta,\Theta'}\big[Cov_{X,Y}\big(rmg(\Theta,\cdot),\, rmg(\Theta',\cdot)\big)\big] = E_{\Theta,\Theta'}\big[\rho(\Theta, \Theta')\,sd(\Theta)\,sd(\Theta')\big]$$
$$Var(mg) = \bar\rho\,\big(E_\Theta[sd(\Theta)]\big)^2 \le \bar\rho\,E_\Theta[Var(\Theta)]$$
Here $sd(\Theta)$ and $Var(\Theta)$ denote the standard deviation and variance of $rmg(\Theta, X, Y)$ over $(X, Y)$; $\bar\rho$ is the mean correlation; $Var(mg)$ is the overall variance; and the last step uses $E^2[\cdot] \le E[\cdot^2]$.

21 Random Forest Error Bound
$$E_\Theta[Var(\Theta)] \le E_\Theta\big[E_{X,Y}[rmg^2(\Theta, X, Y)]\big] - s^2 \le 1 - s^2$$
This gives the generalization-error upper bound:
$$PE \le \bar\rho\,(1 - s^2)/s^2$$
where $\bar\rho$ is the mean correlation and $s$ is the strength of the classifiers. The bound of 1 holds because for each classifier, $|rmg| \le 1$, so the expectation of $rmg^2$ over every $(X, Y)$ is at most 1.

22 Boosting While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.

23 Ada-boost

24 Ada-boost derivation
Boosted classifier:
$$F_t(x_i) = F_{t-1}(x_i) + \alpha_t h_t(x_i) = \alpha_1 h_1(x_i) + \alpha_2 h_2(x_i) + \dots + \alpha_t h_t(x_i)$$
Total (exponential) error:
$$E = \sum_{i=1}^{n} e^{-y_i F_t(x_i)}$$
Let $\varepsilon_i^{t-1} = e^{-y_i F_{t-1}(x_i)}$.

25 Ada-boost derivation
We have:
$$E = \sum_{i=1}^{n} e^{-y_i F_t(x_i)} = \sum_{i=1}^{n} \varepsilon_i^{t-1} e^{-y_i \alpha_t h_t(x_i)}$$
$$E = \sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} e^{\alpha_t}$$
$$E = \sum_{i=1}^{n} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} \big(e^{\alpha_t} - e^{-\alpha_t}\big)$$

26 Ada-boost derivation
$$E = \sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} e^{\alpha_t}$$
$$\frac{dE}{d\alpha_t} = -\sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} e^{\alpha_t} = 0$$
$$\alpha_t = \frac{1}{2}\ln\frac{\sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1}}{\sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1}} = \frac{1}{2}\ln\frac{\sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} \big/ \sum_i \varepsilon_i^{t-1}}{\sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} \big/ \sum_i \varepsilon_i^{t-1}} = \frac{1}{2}\ln\Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)$$
where $\epsilon_t$ is the weighted error rate of $h_t$.
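The derivation above translates directly into code. Below is a minimal AdaBoost sketch with exhaustive decision-stump search; the toy dataset and the stump parameterization are illustrative, while $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ appears exactly as derived:

```python
import math

def best_stump(xs, ys, w):
    """Exhaustively search stumps h(x) = s if x >= t else -s, returning
    the (weighted error, stump) pair with the lowest weighted error."""
    best = (float("inf"), None)
    thresholds = [x - 0.5 for x in xs] + [max(xs) + 0.5]
    for t in thresholds:
        for s in (1, -1):
            h = lambda x, t=t, s=s: s if x >= t else -s
            err = sum(wi for xi, yi, wi in zip(xs, ys, w) if h(xi) != yi)
            if err < best[0]:
                best = (err, h)
    return best

def adaboost(xs, ys, T):
    n = len(xs)
    w = [1.0 / n] * n                     # uniform initial weights
    hs, alphas = [], []
    for _ in range(T):
        eps, h = best_stump(xs, ys, w)
        if eps >= 0.5:                    # no weak learner with an edge left
            break
        eps = max(eps, 1e-10)             # guard against division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)   # alpha_t from the derivation
        hs.append(h); alphas.append(alpha)
        # Reweight: misclassified examples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    def F(x):                             # sign of the weighted vote
        return 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
    return F, alphas

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]   # not separable by any single stump
F, alphas = adaboost(xs, ys, T=3)         # three rounds fit this toy set exactly
print(sum(F(x) == y for x, y in zip(xs, ys)), "of", len(xs), "correct")  # 10 of 10 correct
```

The first stump here has weighted error $\epsilon_1 = 0.3$, so $\alpha_1 = \frac{1}{2}\ln\frac{7}{3}$; after reweighting, later stumps concentrate on the previously misclassified points, and the weighted vote of three stumps classifies every point correctly.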

27 Regression Tree
Similar to a decision tree, but each leaf contains a real-valued score instead of a class label. (Copied from internet)

28 Objective for Tree Ensemble
Assuming we have $K$ trees:
$$\hat y_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
Objective function:
$$J(\theta) = \sum_{i=1}^{n} L(y_i, \hat y_i) + \sum_{k=1}^{K} \Omega(f_k)$$

29 Gradient Boosting
$$\hat y_i^{(t)} = F_t(x_i) = F_{t-1}(x_i) + f_t(x_i) = f_1(x_i) + f_2(x_i) + \dots + f_t(x_i)$$
$$J^{(t)}(f_t) = \sum_{i=1}^{n} L\big(y_i,\, \hat y_i^{(t-1)} + f_t(x_i)\big) + \sum_{i=1}^{t} \Omega(f_i)$$
Using a second-order Taylor expansion:
$$J^{(t)}(f_t) \simeq \sum_{i=1}^{n} \Big[L\big(y_i, \hat y_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t) + constant = \sum_{i=1}^{n} \Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t) + constant$$
where $g_i = \dfrac{\partial L\big(y_i, \hat y_i^{(t-1)}\big)}{\partial \hat y_i^{(t-1)}}$ and $h_i = \dfrac{\partial^2 L\big(y_i, \hat y_i^{(t-1)}\big)}{\partial \big(\hat y_i^{(t-1)}\big)^2}$.

30 Gradient Boosting Tree
Define the mapping from input to leaf: $f_t(x) = w_{q(x)}$
Define the complexity: $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$, where $T$ is the number of leaves.
Define $I_j = \{i \mid q(x_i) = j\}$, the set of instances in leaf $j$.
Refine the objective:
$$J^{(t)}(f_t) = \sum_{i=1}^{n} \Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t) = \sum_{i=1}^{n} \Big[g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2\Big] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

31 Gradient Boosting Tree
$$J^{(t)}(f_t) = \sum_{j=1}^{T} \Big[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\Big] + \gamma T = \sum_{j=1}^{T} \Big[G_j w_j + \frac{1}{2}(H_j + \lambda)\, w_j^2\Big] + \gamma T$$
We can set $w_j^* = -\dfrac{G_j}{H_j + \lambda}$ to achieve the minimum of $J^{(t)}(f_t)$ when $H_j + \lambda > 0$:
$$\min J^{(t)}(f_t) = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
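A tiny numeric check of $w_j^* = -G_j/(H_j + \lambda)$, assuming squared loss $L(y, \hat y) = \frac{1}{2}(y - \hat y)^2$ (so $g_i = \hat y_i - y_i$ and $h_i = 1$); the numbers are made up:

```python
# With squared loss, the optimal leaf weight w* = -G/(H + lambda) is the
# lambda-shrunk mean of the negative residuals falling in that leaf.
y    = [3.0, 3.5, 9.0]     # targets in one leaf
yhat = [5.0, 5.0, 5.0]     # current ensemble prediction
lam  = 1.0

g = [p - t for p, t in zip(yhat, y)]    # gradients: [2.0, 1.5, -4.0]
h = [1.0] * len(y)                      # hessians are 1 for squared loss
G, H = sum(g), sum(h)
w_star = -G / (H + lam)                 # optimal leaf weight
leaf_obj = -0.5 * G * G / (H + lam)     # this leaf's term in min J
print(w_star)                           # 0.125
```
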

32 Find split of a tree
$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

33 An Algorithm for Split Finding
For each node, enumerate over all features:
- For each feature, sort the instances by feature value.
- Use a linear scan to decide the best split along that feature.
- Take the best split solution over all the features.
Time complexity for growing a tree of depth $K$ is $O(n\,d\,K \log n)$: each level requires an $O(n \log n)$ sort for each of the $d$ features.
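A minimal sketch of the scan above for a single feature, maintaining running sums $G_L$, $H_L$ (names follow the slides); the data and the $\lambda$, $\gamma$ values are illustrative, and distinct feature values are assumed:

```python
lam, gamma = 1.0, 0.0

x = [5.0, 1.0, 3.0, 4.0, 2.0]      # one feature (distinct values)
g = [2.0, -1.5, 1.0, 2.5, -1.0]    # gradient per instance
h = [1.0] * 5                      # hessian per instance

order = sorted(range(len(x)), key=lambda i: x[i])   # sort by feature value
G, H = sum(g), sum(h)

def score(G, H):
    return G * G / (H + lam)

best_gain, best_thresh = 0.0, None
GL = HL = 0.0
for pos in range(len(order) - 1):  # candidate split between pos and pos + 1
    i = order[pos]
    GL += g[i]; HL += h[i]         # everything scanned so far goes left
    GR, HR = G - GL, H - HL
    gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
    if gain > best_gain:
        best_gain = gain
        best_thresh = (x[i] + x[order[pos + 1]]) / 2.0
print(best_thresh, round(best_gain, 4))   # 2.5 4.0729
```

Each candidate split is evaluated in O(1) after the sort, which is what makes the overall scan linear per feature.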

34 An Algorithm for Split Finding
(Figure copied from "Introduction to Boosted Trees" by Tianqi Chen)

35 Thank you