Friday, November 28, 2008

The Basic Knowledge on Random Forest

In machine learning, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler. The term came from random decision forests that was first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea and Ho's "random subspace method" to construct a collection of decision trees with controlled variations.

Learning Algorithm
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M. (假设有N个训练样本,M个变量)
2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M. (给定m个输入变量,用来确定树上一个节点的决策,m应小于M)
3. Choose a training set for this tree by choosing N times with replacement from all N available training cases(i.e. take a bootstrap sample)。 Use the rest of the cases to estimate the error of the tree, by predicting their classes.(从N个训练样本中随机重复取样N次得到一组训练集,即bootstrap取样)。预测剩余样本的类别,并用以估计决策树的误差。
4. For each node of the tree, randomly choose m variable on which to base the decision at that node. Calculate the best split based on these m variable in the training set.(对每个节点都随机选取m个基于此节点决策的变量。根据着m个变量计算其最佳分割方式)
5. Each tree is fully grown and not pruned(as may be done in constructing a normal tree classifier). (每棵树都会完整生长,不会像其他许多正常树分类器构建完成后经常做的那样被剪枝)

Advantages
The advantages of random forest are:
  • For many data sets, it produces a highly accurate classifier (分类准确度高)
  • It handles a very large number of input variables (处理大量输入变量)
  • It estimates the importance of variables in determine classification(在决策类别时,评估变量的重要性)
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses(在构建森林过程中产生对泛化误差的内部无偏差估计)
  • It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing (具有一个比较好的方法可以估计缺失值,并且如果有一大部分数据缺失,它仍可以维持准确度)
  • It provides an experimental way to detect variable interactions(提供一种试验方法侦测变量之间的相互作用)
  • It can balance error in class population unbalanced data sets(对于非平衡数据集中的分类数据,可以平衡误差)
  • It computes proximities between cases, useful for clustering, detecting outliers, and(by scaling) visualizing the data(计算各种用例之间的相似性,对于聚类、侦测离群值和数据可视化(扩大或缩小)都非常游泳)
  • Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views (它也可被拓展到无标记数据的应用,形成非监督聚类、侦测离群值和数据可视化的方法)
  • Learning is fast.(学习过程很快)
Reference:
Wiki for Random Forest
中文
Random Forest from Berkeley
RandomForest on the main page of Breiman

No comments: