Ensemble method, decision tree, random forest and boosting


1 Ensemble method, decision tree, random forest and boosting (Zhiqi Peng)

2 Key concepts of supervised learning
Objective function: $J(\theta) = L(\theta) + \Omega(\theta)$
$L(\theta)$ is the training loss, which measures how well the model fits the training data.
$\Omega(\theta)$ is the regularization term, which measures the complexity of the model.

3 Key concepts of supervised learning
A lower training loss indicates a more predictive model; a lower regularization term indicates a simpler model.

4 Bias and Variance tradeoff
$Y = f(x) + \varepsilon$, with $\varepsilon \sim N(0, \sigma_{\varepsilon}^2)$
$Err(x) = E[(Y - \hat{f}(x))^2]$

5 Bias and Variance tradeoff
$$\begin{aligned}
Err(x) &= E[(Y - \hat f(x))^2] = E[(f(x) + \varepsilon - \hat f(x))^2] \\
&= E[(f(x) - \hat f(x))^2] + 2E[\varepsilon\,(f(x) - \hat f(x))] + E[\varepsilon^2] \\
&= E[\hat f(x)^2] - 2 f(x)\,E[\hat f(x)] + f(x)^2 + \sigma_{\varepsilon}^2 \\
&= E[\hat f(x)^2] - E^2[\hat f(x)] + E^2[\hat f(x)] - 2 f(x)\,E[\hat f(x)] + f(x)^2 + \sigma_{\varepsilon}^2 \\
&= Var(\hat f(x)) + \big(E[\hat f(x)] - f(x)\big)^2 + \sigma_{\varepsilon}^2 \\
&= \text{Variance} + \text{Bias}^2 + \text{Irreducible error}
\end{aligned}$$
(The cross term vanishes because $\varepsilon$ has mean zero and is independent of $\hat f(x)$.)
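The decomposition can be checked numerically. Below is a minimal Python sketch, assuming a constant target $f(x)$ and taking the mean of $n$ noisy draws, plus a deliberate offset, as the estimator $\hat f$; both choices are illustrative, not from the slides:

```python
import random
import statistics

random.seed(1)
f_x, sigma = 2.0, 1.0          # true value f(x) and noise standard deviation
n = 5                          # training points per simulated dataset

def fit():
    # Estimator fhat: mean of n noisy draws, shifted by 0.3 so bias is nonzero.
    return 0.3 + statistics.mean(f_x + random.gauss(0, sigma) for _ in range(n))

M = 50_000
fhats = [fit() for _ in range(M)]                      # one fhat per training set
ys    = [f_x + random.gauss(0, sigma) for _ in range(M)]  # fresh test observations

err      = statistics.mean((y - fh) ** 2 for y, fh in zip(ys, fhats))
variance = statistics.pvariance(fhats)                 # Var(fhat) ~ sigma^2 / n = 0.2
bias2    = (statistics.mean(fhats) - f_x) ** 2         # Bias^2 ~ 0.3^2 = 0.09
print(err, variance + bias2 + sigma ** 2)              # both close to 1.29
```

Here $Var(\hat f) = \sigma^2/n = 0.2$ and $\text{Bias}^2 = 0.09$, so both printed values land near $0.2 + 0.09 + 1.0 = 1.29$.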

6 Bias and Variance tradeoff
[Figure illustrating the bias-variance tradeoff, copied from internet]

7 Classification Error
Decision rule: $\hat y = \operatorname{sign}(\hat f - 1/2)$.
$$\Pr(\hat y \ne y \mid x) = \mathbb{1}(f < 1/2)\int_{1/2}^{\infty} p(\hat f)\, d\hat f + \mathbb{1}(f \ge 1/2)\int_{-\infty}^{1/2} p(\hat f)\, d\hat f$$
Here $f(x)$ is the true target and $\hat f(x)$ its estimate; for a fixed $x$, $\hat f(x)$ is a random variable over training samples, with density $p(\hat f)$. The expression above is the expected classification error at $x$; averaging over $x$ (e.g., under a uniform distribution of $x$) gives the overall expected error.

8 Classification Error
If we assume $\hat f \sim N(E[\hat f],\, Var(\hat f))$:
$$\Pr(\hat y \ne y \mid x) = \Phi\!\left(\operatorname{sign}(f - 1/2)\,\frac{E[\hat f] - 1/2}{\sqrt{Var(\hat f)}}\right), \quad \text{where } \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_z^{\infty} e^{-\frac{1}{2}u^2}\, du$$
Define the boundary bias: $b(f, E[\hat f]) = \operatorname{sign}(1/2 - f)\,(E[\hat f] - 1/2)$

9 Classification Error
For a given $b(f, E[\hat f])$, if $b(f, E[\hat f]) < 0$ (the mean prediction lies on the correct side of the boundary):
- the error $\Pr(\hat y \ne y)$ decreases as $|E[\hat f] - 1/2|$ increases;
- the error $\Pr(\hat y \ne y)$ decreases as $Var(\hat f)$ decreases.
This is why some high-bias methods perform well for classification while being inappropriate for function estimation: only the side of the boundary matters, not the exact value of $\hat f$.
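A numeric sketch of this effect, assuming the Gaussian model of $\hat f$ from the previous slide; the values of $f$, $E[\hat f]$, and $Var(\hat f)$ are made up for illustration:

```python
import math

def upper_tail(z):
    """Phi(z) = P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def classification_error(f, mean_fhat, sd_fhat):
    """Pr(yhat != y | x) under fhat ~ N(mean_fhat, sd_fhat^2)."""
    s = 1.0 if f >= 0.5 else -1.0
    return upper_tail(s * (mean_fhat - 0.5) / sd_fhat)

f = 0.6  # the true value is above 1/2, so the correct decision is +1

# b < 0: mean prediction on the correct side -> reducing variance helps.
err_hi = classification_error(f, mean_fhat=0.55, sd_fhat=0.2)
err_lo = classification_error(f, mean_fhat=0.55, sd_fhat=0.05)
print(err_lo < err_hi)   # True

# b > 0: mean prediction on the wrong side -> reducing variance hurts.
err_hi2 = classification_error(f, mean_fhat=0.45, sd_fhat=0.2)
err_lo2 = classification_error(f, mean_fhat=0.45, sd_fhat=0.05)
print(err_lo2 > err_hi2)  # True
```
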

10 Decision Tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. (Copied from internet)

11 Ensemble method
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.

12 Ensemble Error Rate
$$e_{ensemble} = \sum_{i > n/2}^{n} \binom{n}{i}\, e_{individual}^{\,i}\,(1 - e_{individual})^{n-i}, \qquad e_{individual} < 1/2$$
Example: $e_{individual} = 0.35$, $n = 25$ gives $e_{ensemble} \approx 0.06$.
(This assumes the $n$ base classifiers make independent errors and the ensemble takes a majority vote.)
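This error rate is a binomial tail and can be evaluated directly; a quick check of the example on this slide:

```python
import math

def ensemble_error(e_individual, n):
    """Majority-vote error of n independent classifiers, each with
    individual error rate e_individual (the binomial tail above)."""
    return sum(
        math.comb(n, i) * e_individual**i * (1 - e_individual)**(n - i)
        for i in range(n // 2 + 1, n + 1)
    )

print(ensemble_error(0.35, 25))   # about 0.06, matching the slide
```
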

13 Ensemble Method: Tree Bagging
Given a training set $X = x_1, \dots, x_n$ with responses $Y = y_1, \dots, y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples. For $b = 1, \dots, B$:
- Sample, with replacement, $n$ training examples from $X, Y$; call these $X_b, Y_b$.
- Train a decision or regression tree $f_b$ on $X_b, Y_b$.
After training, the prediction for an unseen sample $x'$ is made by averaging the predictions of the individual regression trees:
$$\hat f(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$
or by taking the majority vote in the case of classification trees.

14 Ensemble Method
This procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees; bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
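The bagging procedure above can be sketched in a few lines of Python. The base learner here is a depth-1 regression stump rather than a full tree, and the helper names (`fit_stump`, `bagged_regressor`) and the toy data are illustrative:

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree (stump): choose the threshold that
    minimizes squared error, predicting the mean of each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        ml, mr = statistics.mean(left), statistics.mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def bagged_regressor(xs, ys, B=25, seed=0):
    """Tree bagging: fit B stumps on bootstrap resamples, average predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in trees) / B   # average over the ensemble

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.0, 0.2, 0.1, 0.9, 1.1, 1.0, 0.9]       # step-like target
model = bagged_regressor(xs, ys)
print(model(1), model(6))                            # low on the left, high on the right
```
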

15 Variance reduction
Assume we have $T$ functions $\hat f_i(x)$ with $E[\hat f_i] = \mu$, $Var(\hat f_i) = \sigma^2$, and $\hat f = \frac{1}{T}\sum_{i=1}^{T} \hat f_i(x)$.
$$\begin{aligned}
Var(\hat f) &= Var\Big(\frac{1}{T}\sum_{i=1}^{T} \hat f_i\Big) = E\Big[\Big(\frac{1}{T}\sum_{i=1}^{T} \hat f_i\Big)^2\Big] - E^2\Big[\frac{1}{T}\sum_{i=1}^{T} \hat f_i\Big] \\
&= \frac{1}{T^2}\Big(E\Big[\sum_{i=1}^{T} \hat f_i^2\Big] + E\Big[\sum_{i \ne j} \hat f_i \hat f_j\Big]\Big) - \mu^2 \\
&= \frac{1}{T^2}\Big(T(\sigma^2 + \mu^2) + E\Big[\sum_{i \ne j} \hat f_i \hat f_j\Big]\Big) - \mu^2
\end{aligned}$$

16 Variance reduction
$$Var(\hat f) = \frac{1}{T^2}\Big(T(\sigma^2 + \mu^2) + E\Big[\sum_{i \ne j} \hat f_i \hat f_j\Big]\Big) - \mu^2$$
If $\hat f_i$ and $\hat f_j$ are independent, $E\big[\sum_{i \ne j} \hat f_i \hat f_j\big] = (T^2 - T)\mu^2$, so
$$Var(\hat f) = \frac{1}{T^2}\big(T\sigma^2 + T\mu^2 + (T^2 - T)\mu^2\big) - \mu^2 = \frac{\sigma^2}{T}$$
The ensemble variance is smaller than that of an individual function, while the bias stays the same.
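The $\sigma^2/T$ result is easy to verify by simulation; a small sketch using Uniform(0, 1) draws (so $\sigma^2 = 1/12$) to stand in for the individual functions:

```python
import random
import statistics

random.seed(42)
T = 10                     # ensemble size
sigma2 = 1.0 / 12.0        # variance of a single Uniform(0, 1) draw

# Each "ensemble output" is the mean of T independent draws; repeat many times.
means = [statistics.mean(random.random() for _ in range(T))
         for _ in range(100_000)]
var_ensemble = statistics.pvariance(means)
print(var_ensemble, sigma2 / T)   # both about 0.0083
```
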

17 Review: Classification Error
Ensemble methods can reduce variance. For a given $b(f, E[\hat f])$:
- if $b(f, E[\hat f]) < 0$, reducing variance decreases the classification error;
- if $b(f, E[\hat f]) > 0$, reducing variance increases the classification error.

18 Random Forest Random forests use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
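The feature-subsampling step alone can be sketched as follows; the $\sqrt{d}$ subset size is a common convention for classification forests, and the helper name `candidate_features` is illustrative, not from the slides:

```python
import math
import random

def candidate_features(n_features, rng):
    """Feature bagging: at each candidate split, consider only a random
    subset of the features (here of size sqrt(d))."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(0)
feats = candidate_features(16, rng)
print(len(feats))   # 4 distinct feature indices out of 16
```

Because different trees see different feature subsets at each split, a single dominant predictor can no longer appear in every tree, which de-correlates them.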

19 Random Forest Error Bound
Define the margin of a random forest at $(X, Y)$, for a class $j \ne Y$:
$$mg(X, Y) = \Pr_\Theta(h(X, \Theta) = Y) - \Pr_\Theta(h(X, \Theta) = j)$$
$$mg(X, Y) = E_\Theta[\,I(h(X, \Theta) = Y) - I(h(X, \Theta) = j)\,]$$
Define the generalization error: $PE = \Pr_{X,Y}(mg(X, Y) < 0)$
Define the strength of the set of classifiers: $s = E_{X,Y}[mg(X, Y)]$
Assuming $s \ge 0$, Chebyshev's inequality gives:
$$PE \le \frac{Var(mg)}{s^2}$$
($PE$ is over any given point $(X, Y)$; $s$ is the overall expectation.)

20 Random Forest Error Bound
Define the raw margin function:
$$rmg(\Theta, X, Y) = I(h(X, \Theta) = Y) - I(h(X, \Theta) = j)$$
so that $mg(X, Y) = E_\Theta[rmg(\Theta, X, Y)]$. Taking $\Theta, \Theta'$ independent with the same distribution:
$$Var(mg) = E_{\Theta,\Theta'}\big[Cov_{X,Y}\big(rmg(\Theta,\cdot),\, rmg(\Theta',\cdot)\big)\big] = E_{\Theta,\Theta'}\big[\rho(\Theta, \Theta')\,sd(\Theta)\,sd(\Theta')\big]$$
$$Var(mg) = \bar\rho\,\big(E_\Theta[sd(\Theta)]\big)^2 \le \bar\rho\,E_\Theta[Var(\Theta)]$$
Here $sd(\Theta)$ and $Var(\Theta)$ denote the standard deviation and variance of $rmg(\Theta, X, Y)$ over $(X, Y)$; $\bar\rho$ is the mean correlation; $Var(mg)$ is the overall variance; and the last step uses $E^2[\cdot] \le E[\cdot^2]$.

21 Random Forest Error Bound
$$E_\Theta[Var(\Theta)] \le E_\Theta\big[E_{X,Y}[rmg^2(\Theta, X, Y)]\big] - s^2 \le 1 - s^2$$
This gives the generalization-error upper bound:
$$PE \le \bar\rho\,(1 - s^2)/s^2$$
where $\bar\rho$ is the mean correlation and $s$ is the strength of the classifiers. The bound of 1 holds because for each classifier, $|rmg| \le 1$, so the expectation of $rmg^2$ over every $(X, Y)$ is at most 1.

22 Boosting While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.

23 Ada-boost

24 Ada-boost derivation
Boosted classifier:
$$F_t(x_i) = F_{t-1}(x_i) + \alpha_t h_t(x_i) = \alpha_1 h_1(x_i) + \alpha_2 h_2(x_i) + \dots + \alpha_t h_t(x_i)$$
Total (exponential) error:
$$E = \sum_{i=1}^{n} e^{-y_i F_t(x_i)}$$
Let $\varepsilon_i^{t-1} = e^{-y_i F_{t-1}(x_i)}$.

25 Ada-boost derivation
We have:
$$E = \sum_{i=1}^{n} e^{-y_i F_t(x_i)} = \sum_{i=1}^{n} \varepsilon_i^{t-1} e^{-y_i \alpha_t h_t(x_i)}$$
$$E = \sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} e^{\alpha_t}$$
$$E = \sum_{i=1}^{n} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} \big(e^{\alpha_t} - e^{-\alpha_t}\big)$$

26 Ada-boost derivation
$$E = \sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} e^{\alpha_t}$$
$$\frac{dE}{d\alpha_t} = -\sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} e^{\alpha_t} = 0$$
$$\alpha_t = \frac{1}{2}\ln\frac{\sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1}}{\sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1}} = \frac{1}{2}\ln\frac{\sum_{y_i = h_t(x_i)} \varepsilon_i^{t-1} \big/ \sum_i \varepsilon_i^{t-1}}{\sum_{y_i \ne h_t(x_i)} \varepsilon_i^{t-1} \big/ \sum_i \varepsilon_i^{t-1}} = \frac{1}{2}\ln\Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)$$
where $\epsilon_t$ is the weighted error rate of $h_t$.
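The derivation above translates directly into code. Below is a minimal AdaBoost sketch with exhaustive decision-stump search; the toy dataset and the stump parameterization are illustrative, while $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ appears exactly as derived:

```python
import math

def best_stump(xs, ys, w):
    """Exhaustively search stumps h(x) = s if x >= t else -s, returning
    the (weighted error, stump) pair with the lowest weighted error."""
    best = (float("inf"), None)
    thresholds = [x - 0.5 for x in xs] + [max(xs) + 0.5]
    for t in thresholds:
        for s in (1, -1):
            h = lambda x, t=t, s=s: s if x >= t else -s
            err = sum(wi for xi, yi, wi in zip(xs, ys, w) if h(xi) != yi)
            if err < best[0]:
                best = (err, h)
    return best

def adaboost(xs, ys, T):
    n = len(xs)
    w = [1.0 / n] * n                     # uniform initial weights
    hs, alphas = [], []
    for _ in range(T):
        eps, h = best_stump(xs, ys, w)
        if eps >= 0.5:                    # no weak learner with an edge left
            break
        eps = max(eps, 1e-10)             # guard against division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)   # alpha_t from the derivation
        hs.append(h); alphas.append(alpha)
        # Reweight: misclassified examples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    def F(x):                             # sign of the weighted vote
        return 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
    return F, alphas

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]   # not separable by any single stump
F, alphas = adaboost(xs, ys, T=3)         # three rounds fit this toy set exactly
print(sum(F(x) == y for x, y in zip(xs, ys)), "of", len(xs), "correct")  # 10 of 10 correct
```

The first stump here has weighted error $\epsilon_1 = 0.3$, so $\alpha_1 = \frac{1}{2}\ln\frac{7}{3}$; after reweighting, later stumps concentrate on the previously misclassified points, and the weighted vote of three stumps classifies every point correctly.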

27 Regression Tree
Similar to a decision tree, but each leaf contains a real-valued score instead of a class label. (Copied from internet)

28 Objective for Tree Ensemble
Assuming we have $K$ trees:
$$\hat y_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
Objective function:
$$J(\theta) = \sum_{i=1}^{n} L(y_i, \hat y_i) + \sum_{k=1}^{K} \Omega(f_k)$$

29 Gradient Boosting
$$\hat y_i^{(t)} = F_t(x_i) = F_{t-1}(x_i) + f_t(x_i) = f_1(x_i) + f_2(x_i) + \dots + f_t(x_i)$$
$$J^{(t)}(f_t) = \sum_{i=1}^{n} L\big(y_i,\, \hat y_i^{(t-1)} + f_t(x_i)\big) + \sum_{i=1}^{t} \Omega(f_i)$$
Using a second-order Taylor expansion:
$$J^{(t)}(f_t) \simeq \sum_{i=1}^{n} \Big[L\big(y_i, \hat y_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t) + constant = \sum_{i=1}^{n} \Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t) + constant$$
where $g_i = \dfrac{\partial L\big(y_i, \hat y_i^{(t-1)}\big)}{\partial \hat y_i^{(t-1)}}$ and $h_i = \dfrac{\partial^2 L\big(y_i, \hat y_i^{(t-1)}\big)}{\partial \big(\hat y_i^{(t-1)}\big)^2}$.

30 Gradient Boosting Tree
Define the mapping from input to leaf: $f_t(x) = w_{q(x)}$
Define the complexity: $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$, where $T$ is the number of leaves.
Define $I_j = \{i \mid q(x_i) = j\}$, the set of instances in leaf $j$.
Refine the objective:
$$J^{(t)}(f_t) = \sum_{i=1}^{n} \Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t) = \sum_{i=1}^{n} \Big[g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2\Big] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

31 Gradient Boosting Tree
$$J^{(t)}(f_t) = \sum_{j=1}^{T} \Big[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\Big] + \gamma T = \sum_{j=1}^{T} \Big[G_j w_j + \frac{1}{2}(H_j + \lambda)\, w_j^2\Big] + \gamma T$$
We can set $w_j^* = -\dfrac{G_j}{H_j + \lambda}$ to achieve the minimum of $J^{(t)}(f_t)$ when $H_j + \lambda > 0$:
$$\min J^{(t)}(f_t) = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
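A tiny numeric check of $w_j^* = -G_j/(H_j + \lambda)$, assuming squared loss $L(y, \hat y) = \frac{1}{2}(y - \hat y)^2$ (so $g_i = \hat y_i - y_i$ and $h_i = 1$); the numbers are made up:

```python
# With squared loss, the optimal leaf weight w* = -G/(H + lambda) is the
# lambda-shrunk mean of the negative residuals falling in that leaf.
y    = [3.0, 3.5, 9.0]     # targets in one leaf
yhat = [5.0, 5.0, 5.0]     # current ensemble prediction
lam  = 1.0

g = [p - t for p, t in zip(yhat, y)]    # gradients: [2.0, 1.5, -4.0]
h = [1.0] * len(y)                      # hessians are 1 for squared loss
G, H = sum(g), sum(h)
w_star = -G / (H + lam)                 # optimal leaf weight
leaf_obj = -0.5 * G * G / (H + lam)     # this leaf's term in min J
print(w_star)                           # 0.125
```
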

32 Find split of a tree
$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

33 An Algorithm for Split Finding
For each node, enumerate over all features:
- For each feature, sort the instances by feature value.
- Use a linear scan to decide the best split along that feature.
- Take the best split solution over all the features.
Time complexity for growing a tree of depth $K$ is $O(n\,d\,K \log n)$: each level requires an $O(n \log n)$ sort for each of the $d$ features.
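A minimal sketch of the scan above for a single feature, maintaining running sums $G_L$, $H_L$ (names follow the slides); the data and the $\lambda$, $\gamma$ values are illustrative, and distinct feature values are assumed:

```python
lam, gamma = 1.0, 0.0

x = [5.0, 1.0, 3.0, 4.0, 2.0]      # one feature (distinct values)
g = [2.0, -1.5, 1.0, 2.5, -1.0]    # gradient per instance
h = [1.0] * 5                      # hessian per instance

order = sorted(range(len(x)), key=lambda i: x[i])   # sort by feature value
G, H = sum(g), sum(h)

def score(G, H):
    return G * G / (H + lam)

best_gain, best_thresh = 0.0, None
GL = HL = 0.0
for pos in range(len(order) - 1):  # candidate split between pos and pos + 1
    i = order[pos]
    GL += g[i]; HL += h[i]         # everything scanned so far goes left
    GR, HR = G - GL, H - HL
    gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
    if gain > best_gain:
        best_gain = gain
        best_thresh = (x[i] + x[order[pos + 1]]) / 2.0
print(best_thresh, round(best_gain, 4))   # 2.5 4.0729
```

Each candidate split is evaluated in O(1) after the sort, which is what makes the overall scan linear per feature.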

34 An Algorithm for Split Finding
(Figure copied from "Introduction to Boosted Trees" by Tianqi Chen)

35 Thank you