Learning Algorithm
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M. (假设有N个训练样本,M个变量)
2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M. (给定m个输入变量,用来确定树上一个节点的决策,m应小于M)
3. Choose a training set for this tree by choosing N times with replacement from all N available training cases(i.e. take a bootstrap sample)。 Use the rest of the cases to estimate the error of the tree, by predicting their classes.(从N个训练样本中随机重复取样N次得到一组训练集,即bootstrap取样)。预测剩余样本的类别,并用以估计决策树的误差。
4. For each node of the tree, randomly choose m variable on which to base the decision at that node. Calculate the best split based on these m variable in the training set.(对每个节点都随机选取m个基于此节点决策的变量。根据着m个变量计算其最佳分割方式)
5. Each tree is fully grown and not pruned(as may be done in constructing a normal tree classifier). (每棵树都会完整生长,不会像其他许多正常树分类器构建完成后经常做的那样被剪枝)
Advantages
The advantages of random forest are:
- For many data sets, it produces a highly accurate classifier (分类准确度高)
- It handles a very large number of input variables (处理大量输入变量)
- It estimates the importance of variables in determine classification(在决策类别时,评估变量的重要性)
- It generates an internal unbiased estimate of the generalization error as the forest building progresses(在构建森林过程中产生对泛化误差的内部无偏差估计)
- It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing (具有一个比较好的方法可以估计缺失值,并且如果有一大部分数据缺失,它仍可以维持准确度)
- It provides an experimental way to detect variable interactions(提供一种试验方法侦测变量之间的相互作用)
- It can balance error in class population unbalanced data sets(对于非平衡数据集中的分类数据,可以平衡误差)
- It computes proximities between cases, useful for clustering, detecting outliers, and(by scaling) visualizing the data(计算各种用例之间的相似性,对于聚类、侦测离群值和数据可视化(扩大或缩小)都非常游泳)
- Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views (它也可被拓展到无标记数据的应用,形成非监督聚类、侦测离群值和数据可视化的方法)
- Learning is fast.(学习过程很快)
Wiki for Random Forest
中文
Random Forest from Berkeley
RandomForest on the main page of Breiman
No comments:
Post a Comment