1 Ensemble Methods: Decision Trees, Random Forests and Boosting
Zhiqi Peng
2 Key concepts of supervised learning
Objective function: $\mathrm{Obj}(\theta) = L(\theta) + \Omega(\theta)$
$L(\theta)$ is the training loss, which measures how well the model fits the training data.
$\Omega(\theta)$ is the regularization term, which measures the complexity of the model.
3 Key concepts of supervised learning
A lower training loss yields a more predictive model; a lower regularization term corresponds to a simpler model. The objective trades the two off.
4 Bias and Variance tradeoff
$y = f(x) + \epsilon$, $\epsilon \sim N(0, \sigma_\epsilon^2)$
$\mathrm{Err}(x) = E[(y - \hat{f}(x))^2]$
5 Bias and Variance tradeoff
$\mathrm{Err}(x) = E[(y - \hat{f}(x))^2] = E[(f(x) + \epsilon - \hat{f}(x))^2]$
$= E[(f(x) - \hat{f}(x))^2 + 2\epsilon(f(x) - \hat{f}(x)) + \epsilon^2]$
$= E[\hat{f}(x)^2 - 2 f(x)\hat{f}(x) + f(x)^2] + \sigma_\epsilon^2$ (using $E[\epsilon]=0$ and independence of $\epsilon$ and $\hat{f}$)
$= E[\hat{f}(x)^2] - 2 f(x) E[\hat{f}(x)] + f(x)^2 + \sigma_\epsilon^2$
$= E[\hat{f}(x)^2] - E^2[\hat{f}(x)] + E^2[\hat{f}(x)] - 2 f(x) E[\hat{f}(x)] + f(x)^2 + \sigma_\epsilon^2$
$= \mathrm{Var}(\hat{f}(x)) + (E[\hat{f}(x)] - f(x))^2 + \sigma_\epsilon^2$
$= \mathrm{Variance} + \mathrm{Bias}^2 + \mathrm{Irreducible\ Error}$
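The decomposition above can be checked numerically. The sketch below uses a made-up setup (the numbers for $f(x)$, the noise level, and the estimator's mean and spread are all hypothetical, chosen only for illustration): at a fixed x, it repeatedly draws a fresh noisy observation and a fresh estimator, then compares the Monte Carlo error against Variance + Bias² + σ².

```python
import random
import statistics

# Monte Carlo check of Err(x) = Variance + Bias^2 + sigma_eps^2 at a fixed x.
# Hypothetical setup: f(x) = 2.0 at the point of interest, noise sd 0.5, and
# fhat(x) drawn with mean 1.8 (biased) and sd 0.3, mimicking refits on fresh
# training samples.
random.seed(0)
f_x, sigma_eps = 2.0, 0.5
mu_hat, sd_hat = 1.8, 0.3

N = 100_000
errs, fhats = [], []
for _ in range(N):
    fhat = random.gauss(mu_hat, sd_hat)      # estimator refit on a new sample
    y = f_x + random.gauss(0.0, sigma_eps)   # fresh noisy observation at x
    fhats.append(fhat)
    errs.append((y - fhat) ** 2)

err = statistics.fmean(errs)                       # left-hand side
var = statistics.pvariance(fhats)                  # Variance term
bias2 = (statistics.fmean(fhats) - f_x) ** 2       # Bias^2 term
print(err, var + bias2 + sigma_eps ** 2)  # the two should nearly coincide
```

With the assumed numbers the decomposition predicts roughly 0.09 + 0.04 + 0.25 = 0.38 for both quantities.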
6 Bias and Variance tradeoff
(Figure illustrating the bias-variance tradeoff, copied from the internet.)
7 Classification Error
Let $f(x) = \Pr(y=1 \mid x)$ and classify with $\hat{y} = 1(\hat{f}(x) \ge 1/2)$. Then
$\Pr(\hat{y} \ne y \mid x) = \Pr(\hat{y} \ne y, y=0 \mid x) + \Pr(\hat{y} \ne y, y=1 \mid x) = 1(\hat{f}(x) \ge 1/2)\,(1 - f(x)) + 1(\hat{f}(x) < 1/2)\, f(x)$
For a fixed x, $\hat{f}(x)$ is a random variable (it depends on the training sample). Averaging over the randomness of $\hat{f}$ and $y$ gives the expected classification error at x.
8 Classification Error
If we assume $\hat{f} \sim N(E\hat{f}, \mathrm{Var}(\hat{f}))$, then
$\Pr(\hat{y} \ne y \mid x) = \Phi\left[\mathrm{sign}(f - 1/2)\,\frac{E\hat{f} - 1/2}{\sqrt{\mathrm{Var}(\hat{f})}}\right]$, where $\Phi(z) = \int_z^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}\,du$ is the upper-tail normal probability.
Define the boundary bias: $b(f, E\hat{f}) = \mathrm{sign}(1/2 - f)\,(E\hat{f} - 1/2)$
9 Classification Error
For a given boundary bias $b(f, E\hat{f})$, if $b(f, E\hat{f}) < 0$ (the estimator is centered on the correct side of 1/2):
the error probability decreases as $|E\hat{f} - 1/2|$ increases;
the error probability decreases as $\mathrm{Var}(\hat{f})$ decreases.
That is why some high-bias methods perform well for classification while being inappropriate for function estimation: only the boundary bias matters, not the bias of $\hat{f}$ itself.
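A small sketch of this effect, using made-up numbers: f(x) = 0.7 (so the Bayes label is 1) and an estimator centered at 0.6, which is biased as a function estimate but sits on the correct side of 1/2, so the boundary bias is negative. Shrinking the variance then drives the error term down even though the bias of $\hat{f}$ never changes.

```python
import math

# Upper-tail normal probability Phi(z) = P(Z > z), via the complementary
# error function from the standard library.
def upper_tail(z):
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical point: f(x) = 0.7, E fhat = 0.6. Boundary bias
# b = sign(1/2 - f) * (E fhat - 1/2) = -0.1 < 0, the "good" case.
f, e_fhat = 0.7, 0.6
sign = 1.0 if f > 0.5 else -1.0
for sd in (0.3, 0.2, 0.1, 0.05):
    p = upper_tail(sign * (e_fhat - 0.5) / sd)
    print(f"sd={sd}: error term = {p:.4f}")
# The printed values shrink monotonically as sd decreases.
```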
10 Decision Tree
A decision tree is a decision-support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance-event outcomes, resource costs, and utility. It is one way to display an algorithm. (Definition copied from the internet.)
11 Ensemble method
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.
12 Ensemble Error Rate
If the base classifiers err independently with rate $\epsilon_{individual} < 1/2$, a majority vote of n of them is wrong only when more than half of the votes are wrong:
$\epsilon_{ensemble} = \sum_{k > n/2} \binom{n}{k}\, \epsilon_{individual}^{k}\,(1 - \epsilon_{individual})^{n-k}$
Example: $\epsilon_{individual} = 0.35$, $n = 25$ gives $\epsilon_{ensemble} \approx 0.06$.
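The formula above is a binomial tail and can be evaluated exactly:

```python
from math import comb

# Majority-vote error of n independent classifiers, each with error rate eps:
# the ensemble is wrong only when more than half of the votes are wrong.
def ensemble_error(n, eps):
    return sum(comb(n, k) * eps ** k * (1 - eps) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

print(ensemble_error(25, 0.35))  # roughly 0.06, matching the slide's example
```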
13 Ensemble Method
Tree bagging: given a training set $X = x_1, \dots, x_n$ with responses $Y = y_1, \dots, y_n$, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits a tree to each sample.
For $b = 1, \dots, B$:
Sample, with replacement, n training examples from X, Y; call these $X_b, Y_b$.
Train a decision or regression tree $f_b$ on $X_b, Y_b$.
After training, a prediction for an unseen sample $x'$ is made by averaging the predictions from all the individual regression trees,
$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$,
or by taking the majority vote in the case of classification trees.
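The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a full tree implementation: the base learner is a depth-1 regression stump, the helper names (`fit_stump`, `bagging_fit`) are made up, and the data is a toy noisy step function.

```python
import random
from statistics import fmean

# Depth-1 regression stump: pick the threshold minimizing squared error and
# predict the mean of y on each side.
def fit_stump(xs, ys):
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for j in range(1, len(xs)):
        thr = xs[order[j]]
        left = [ys[i] for i in order[:j]]
        right = [ys[i] for i in order[j:]]
        ml, mr = fmean(left), fmean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if x < thr else mr

def bagging_fit(xs, ys, B, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: n draws with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: fmean(t(x) for t in trees)      # average the B predictions

# Toy usage: a noisy step function jumping from 0 to 1 at x = 0.5.
rng = random.Random(1)
xs = [i / 50 for i in range(50)]
ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.2) for x in xs]
model = bagging_fit(xs, ys, B=50)
print(model(0.1), model(0.9))  # near 0 and near 1, respectively
```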
14 Ensemble Method
This procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees; bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
15 Variance reduction
Assume we have T functions $f_i(x)$ with $E[f_i] = \mu$, $\mathrm{Var}(f_i) = \sigma^2$, and let $\bar{f} = \frac{1}{T}\sum_{i=1}^{T} f_i(x)$. Then
$\mathrm{Var}(\bar{f}) = E\left[\left(\frac{1}{T}\sum_i f_i\right)^2\right] - E^2\left[\frac{1}{T}\sum_i f_i\right]$
$= \frac{1}{T^2}\left(E\left[\sum_i f_i^2\right] + E\left[\sum_{i \ne j} f_i f_j\right]\right) - \mu^2$
$= \frac{1}{T^2}\left(T\sigma^2 + T\mu^2 + E\left[\sum_{i \ne j} f_i f_j\right]\right) - \mu^2$
16 Variance reduction
$\mathrm{Var}(\bar{f}) = \frac{1}{T^2}\left(T\sigma^2 + T\mu^2 + E\left[\sum_{i \ne j} f_i f_j\right]\right) - \mu^2$
If the $f_i$ are independent, $E\left[\sum_{i \ne j} f_i f_j\right] = T(T-1)\mu^2$, so
$\mathrm{Var}(\bar{f}) = \frac{1}{T^2}\left(T\sigma^2 + T\mu^2 + T^2\mu^2 - T\mu^2\right) - \mu^2 = \frac{\sigma^2}{T}$
The ensemble variance is smaller than that of an individual function, and the bias stays the same.
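A quick simulation of the $\sigma^2/T$ result, under a toy assumption (each $f_i$ is an independent draw from $N(\mu, \sigma^2)$; every trial draws T of them and records their average):

```python
import random
from statistics import fmean, pvariance

# Check Var(fbar) ~= sigma^2 / T for independent estimators.
random.seed(0)
mu, sigma, T, trials = 3.0, 2.0, 16, 50_000
averages = [fmean(random.gauss(mu, sigma) for _ in range(T))
            for _ in range(trials)]
print(pvariance(averages), sigma ** 2 / T)  # both close to 0.25
print(fmean(averages))                      # the mean stays at mu = 3.0
```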
17 Review: Classification Error
Ensemble methods can reduce variance.
For a given $b(f, E\hat{f})$: if $b(f, E\hat{f}) < 0$, reducing variance decreases the classification error; if $b(f, E\hat{f}) > 0$, reducing variance increases the classification error.
18 Random Forest
Random forests use a modified tree-learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors of the response variable (target output), these features will be selected in many of the B trees, causing the trees to become correlated.
19 Random Forest Error Bound
Define the margin of a random forest, where $\hat{j}$ is the most-voted class other than Y:
$mg(X,Y) = P_\Theta(h(X,\Theta) = Y) - \max_{j \ne Y} P_\Theta(h(X,\Theta) = j) = E_\Theta\left[I(h(X,\Theta) = Y) - I(h(X,\Theta) = \hat{j})\right]$
Define the generalization error: $PE = \Pr_{X,Y}(mg(X,Y) < 0)$
Define the strength of the set of classifiers: $s = E_{X,Y}[mg(X,Y)]$
Assuming $s \ge 0$ and using Chebyshev's inequality:
$PE \le \frac{\mathrm{Var}(mg)}{s^2}$
(PE is over random points (X,Y); s is the overall expectation.)
20 Random Forest Error Bound
Define the raw margin function: $rmg(\Theta, X, Y) = I(h(X,\Theta) = Y) - I(h(X,\Theta) = \hat{j})$, so that $mg(X,Y) = E_\Theta[rmg(\Theta, X, Y)]$.
Since mg is an expectation over identically distributed rmg's with independent draws $\Theta, \Theta'$:
$\mathrm{Var}(mg) = E_{\Theta,\Theta'}\left[\mathrm{Cov}_{X,Y}\left(rmg(\Theta,\cdot),\, rmg(\Theta',\cdot)\right)\right]$
$= E_{\Theta,\Theta'}\left[\rho(\Theta,\Theta')\, sd(\Theta)\, sd(\Theta')\right]$
$= \bar{\rho}\,\left(E_\Theta[sd(\Theta)]\right)^2$
$\le \bar{\rho}\, E_\Theta[\mathrm{Var}(\Theta)]$
(Var(mg) is taken over (X,Y); $\bar{\rho}$ is the mean correlation.)
21 Random Forest Error Bound
$E_\Theta[\mathrm{Var}(\Theta)] \le E_\Theta\left[E_{X,Y}\left[rmg^2(\Theta, X, Y)\right]\right] - s^2 \le 1 - s^2$
(for each classifier, $|rmg| \le 1$, so the expectation of $rmg^2$ over every (X,Y) is at most 1)
This gives the generalization error upper bound:
$PE \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}$
where $\bar{\rho}$ is the mean correlation and s is the strength of the classifiers.
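A numeric look at the bound, with illustrative values that are not from the slides: lower correlation or higher strength tightens the bound, which are exactly the two levers random forests pull via feature bagging and strong trees.

```python
# Breiman-style bound PE <= rho_bar * (1 - s^2) / s^2.
def pe_bound(rho_bar, s):
    return rho_bar * (1 - s ** 2) / s ** 2

# Hypothetical (rho_bar, s) pairs:
for rho_bar, s in [(0.1, 0.6), (0.3, 0.6), (0.3, 0.9)]:
    print(f"rho_bar={rho_bar}, s={s}: bound = {pe_bound(rho_bar, s):.4f}")
# Decreasing rho_bar (first vs second line) or increasing s (second vs third)
# shrinks the bound.
```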
22 Boosting
While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.
23 Ada-boost
24 Ada-boost derivation
Boosted classifier:
$F_t(x_i) = F_{t-1}(x_i) + \alpha_t h_t(x_i) = \alpha_1 h_1(x_i) + \alpha_2 h_2(x_i) + \dots + \alpha_t h_t(x_i)$
Total (exponential) error:
$E = \sum_i e^{-y_i F_t(x_i)}$
Let $w_i^{(t-1)} = e^{-y_i F_{t-1}(x_i)}$
25 Ada-boost derivation
We have:
$E = \sum_i e^{-y_i F_t(x_i)} = \sum_i w_i^{(t-1)} e^{-y_i \alpha_t h_t(x_i)}$
$= \sum_{y_i = h_t(x_i)} w_i^{(t-1)} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} w_i^{(t-1)} e^{\alpha_t}$
$= \sum_i w_i^{(t-1)} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} w_i^{(t-1)} \left(e^{\alpha_t} - e^{-\alpha_t}\right)$
26 Ada-boost derivation
$E = \sum_{y_i = h_t(x_i)} w_i^{(t-1)} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} w_i^{(t-1)} e^{\alpha_t}$
$\frac{dE}{d\alpha_t} = -\sum_{y_i = h_t(x_i)} w_i^{(t-1)} e^{-\alpha_t} + \sum_{y_i \ne h_t(x_i)} w_i^{(t-1)} e^{\alpha_t} = 0$
$\alpha_t = \frac{1}{2}\ln\frac{\sum_{y_i = h_t(x_i)} w_i^{(t-1)}}{\sum_{y_i \ne h_t(x_i)} w_i^{(t-1)}} = \frac{1}{2}\ln\frac{\sum_{y_i = h_t(x_i)} w_i^{(t-1)} / \sum_i w_i^{(t-1)}}{\sum_{y_i \ne h_t(x_i)} w_i^{(t-1)} / \sum_i w_i^{(t-1)}} = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$
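The derivation above translates almost directly into code. A minimal AdaBoost sketch on 1-D data with labels in {-1, +1}, where the weak learner is a threshold stump (all helper names are made up); the variable names follow the derivation: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, with weights reweighted by $e^{-\alpha_t y_i h_t(x_i)}$ each round.

```python
import math

# Weighted threshold stump: predicts sign for x >= thr, -sign otherwise;
# picks the (thr, sign) pair minimizing the weighted error.
def weighted_stump(xs, ys, w):
    best = None
    for thr in xs:
        for sign in (1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            eps = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or eps < best[0]:
                best = (eps, thr, sign)
    eps, thr, sign = best
    return eps, (lambda x, t=thr, s=sign: s if x >= t else -s)

def adaboost(xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n
    model = []  # list of (alpha_t, h_t)
    for _ in range(rounds):
        eps, h = weighted_stump(xs, ys, w)
        eps = max(eps, 1e-12)
        alpha = 0.5 * math.log((1 - eps) / eps)
        model.append((alpha, h))
        # Reweight: misclassified points gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in model) >= 0 else -1

# Toy usage: an interval pattern no single stump can fit.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1, -1, -1, -1]
clf = adaboost(xs, ys, rounds=12)
correct = sum(clf(x) == y for x, y in zip(xs, ys))
print(correct, "of", len(xs), "training points correct")
```

A single stump gets at best 6 of 9 of these points right; the boosted combination fits the interval.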
27 Regression Tree
Similar to a decision tree, but each leaf stores a continuous score rather than a class label. (Figure copied from the internet.)
28 Objective for Tree Ensemble
Assuming we have K trees: $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, $f_k \in \mathcal{F}$
Objective function: $\mathrm{Obj}(\theta) = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$
29 Gradient Boosting
$\hat{y}_i^{(t)} = F_t(x_i) = F_{t-1}(x_i) + f_t(x_i) = f_1(x_i) + f_2(x_i) + \dots + f_t(x_i)$
$\mathrm{Obj}^{(t)}(f_t) = \sum_i L\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{k=1}^{t} \Omega(f_k)$
Using a second-order Taylor expansion:
$\mathrm{Obj}^{(t)}(f_t) \approx \sum_i \left[L\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + constant$
$= \sum_i \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + constant$
where $g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}$ and $h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}$
30 Gradient Boosting Tree
Define the mapping between leaves and inputs: $f_t(x) = w_{q(x)}$, where $q(x)$ gives the leaf index of x.
Define the complexity: $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$, where T is the number of leaves.
Define $I_j = \{i \mid q(x_i) = j\}$, the set of instances in leaf j.
Refine the objective:
$\mathrm{Obj}^{(t)}(f_t) = \sum_i \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) = \sum_i \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$
31 Gradient Boosting Tree
$\mathrm{Obj}^{(t)}(f_t) = \sum_{j=1}^{T} \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T = \sum_{j=1}^{T} \left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T$
We can set $w_j^* = -\frac{G_j}{H_j + \lambda}$ to achieve the minimum of $\mathrm{Obj}^{(t)}(f_t)$ (valid when $H_j + \lambda > 0$):
$\min \mathrm{Obj}^{(t)}(f_t) = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$
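The leaf-weight formulas above are a few lines of code. A sketch assuming squared loss $L(y, \hat{y}) = (y - \hat{y})^2$, so $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$; the leaf contents below are made-up illustrative values.

```python
# Optimal leaf weight w* = -G / (H + lambda) and the resulting
# per-leaf objective -G^2 / (2 (H + lambda)) (+ gamma per leaf).
def leaf_weight(gs, hs, lam=1.0):
    G, H = sum(gs), sum(hs)
    return -G / (H + lam)

def leaf_objective(gs, hs, lam=1.0, gamma=0.0):
    G, H = sum(gs), sum(hs)
    return -0.5 * G * G / (H + lam) + gamma

# One leaf holding three instances with residuals (y - yhat) = 1, 2, 3:
residuals = [1.0, 2.0, 3.0]
gs = [-2.0 * r for r in residuals]  # g = 2 * (yhat - y) = -2 * residual
hs = [2.0] * 3                      # h = 2 for squared loss
print(leaf_weight(gs, hs, lam=1.0))  # 12 / 7, about 1.714
```

Note that with $\lambda = 0$ this reduces to the mean residual of the leaf; the regularization shrinks the leaf score toward zero.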
32 Find the split of a tree
The gain of splitting a leaf into left and right children:
$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$
33 An Algorithm for Split Finding
For each node, enumerate over all features.
For each feature, sort the instances by feature value.
Use a linear scan to decide the best split along that feature.
Take the best split over all the features.
Time complexity of growing a tree of depth K: $O(n \cdot d \cdot K \cdot \log n)$ -- it costs $O(n \log n)$ to sort at each level, and there are d features.
34 An Algorithm for Split Finding
(Copied from Introduction to Boosted Trees, Tianqi Chen.)
35 Thank you