<H2>Useful libSVM Links</H2>
<P><A class="external text" title=http://www.csie.ntu.edu.tw/~cjlin/libsvm/
href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/"
rel=nofollow>官方站點,有一些tutorial和測試數據</A> </P>
<P><A class="external text" title=http://bbs.ir-lab.org/cgi-bin/leoboard.cgi
href="http://bbs.ir-lab.org/cgi-bin/leoboard.cgi"
rel=nofollow>哈工大的機器學習論壇,非常好</A> </P>
<P>A graduate student at Shanghai Jiao Tong University (Pattern Analysis and Machine Intelligence Laboratory) also wrote Chinese annotations for the libsvm 2.6 source code. The original link can no longer be found, so please search for it yourself; the annotations are very well written. </P>
<A name=Decision_Trees></A>
<H1>Decision Trees</H1>
<P>The ML classes discussed in this section implement the Classification And
Regression Trees (CART) algorithm, which is described in [Breiman84]. </P>
<P>The class CvDTree represents a single decision tree that may be used alone,
or as a base class in tree ensembles (see Boosting and Random Trees). </P>
<P>A decision tree is a binary tree (i.e. a tree in which each non-leaf node has
exactly two child nodes). It can be used either for classification, when each
tree leaf is marked with a class label (multiple leaves may share the same
label), or for regression, when each tree leaf is assigned a constant (so the
approximation function is piecewise constant). </P>
<P><B>Predicting with Decision Trees</B> </P>
<P>To reach a leaf node, and thus to obtain a response for the input feature
vector, the prediction procedure starts with the root node. From each non-leaf
node the procedure goes to the left (i.e. selects the left child node as the
next observed node), or to the right based on the value of a certain variable,
which index is stored in the observed node. The variable can be either ordered
or categorical. In the first case, the variable value is compared with the
certain threshold (which is also stored in the node); if the value is less than
the threshold, the procedure goes to the left, otherwise, to the right (for
example, if the weight is less than 1 kilo, the procedure goes to the left, else
to the right). And in the second case the discrete variable value is tested,
whether it belongs to a certain subset of values (also stored in the node) from
a limited set of values the variable could take; if yes, the procedure goes to
the left, else - to the right (for example, if the color is green or red, go to
the left, else to the right). That is, in each node, a pair of entities
(<variable_index>, <decision_rule (threshold/subset)>) is used. This
pair is called split (split on the variable #<variable_index>). Once a
leaf node is reached, the value assigned to this node is used as the output of
prediction procedure. </P>
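<P>To make the traversal concrete, here is a minimal, self-contained C++ sketch
of the per-node decision rule described above. It is not the CvDTree
implementation: the ToyNode structure, the subset_mask encoding, and the
toy_predict function are illustrative assumptions that only mirror the
(<variable_index>, <decision_rule>) pair. The real structures, CvDTreeSplit and
CvDTreeNode, are documented below. </P><PRE>// Illustrative only: a simplified node and the left/right decision rule.
struct ToyNode
{
    bool      is_leaf;
    double    value;        // response stored in a leaf
    int       var_idx;      // index of the split variable
    bool      ordered;      // ordered vs. categorical variable
    float     threshold;    // used when the variable is ordered
    unsigned  subset_mask;  // bit i set => category i goes to the left
    ToyNode*  left;
    ToyNode*  right;
};

double toy_predict( const ToyNode* node, const float* sample )
{
    while( !node->is_leaf )
    {
        float v = sample[node->var_idx];
        bool go_left = node->ordered
            ? (v < node->threshold)                      // ordered: compare with threshold
            : ((node->subset_mask >> (int)v) & 1) != 0;  // categorical: subset membership
        node = go_left ? node->left : node->right;       // descend one level
    }
    return node->value;  // the leaf value is the prediction
}
</PRE>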
<P>Sometimes certain features of the input vector are missing (for example, in
the dark it is difficult to determine the object's color), and the prediction
procedure may get stuck at a certain node (in the mentioned example, if the node
is split on color). To avoid such situations, decision trees use so-called
surrogate splits. That is, in addition to the best "primary" split, every tree
node may also be split on one or more other variables with nearly the same
results. </P>
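<P>A minimal sketch of prediction with a missing feature, assuming a CvDTree
named dtree that was trained with surrogate splits enabled; the 5-feature layout
and the variable names are illustrative assumptions. CvDTree::predict accepts an
optional 8-bit mask marking the missing features. </P><PRE>#include <stdio.h>
#include <opencv/ml.h>

void predict_with_missing_feature( CvDTree& dtree )
{
    float features[5] = { 0.3f, 1.7f, 0.f, 2.f, 4.1f };
    CvMat sample = cvMat( 1, 5, CV_32FC1, features );

    // Mark feature #2 as missing; surrogate splits let the traversal continue.
    unsigned char miss[5] = { 0, 0, 1, 0, 0 };
    CvMat missing = cvMat( 1, 5, CV_8UC1, miss );

    CvDTreeNode* leaf = dtree.predict( &sample, &missing );
    printf( "predicted response: %f\n", leaf->value );
}
</PRE>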
<P><B>Training Decision Trees</B> </P>
<P>The tree is built recursively, starting from the root node. The whole
training data (feature vectors and responses) is used to split the root node. In
each node the optimum decision rule (i.e. the best "primary" split) is found
based on some criterion (in ML the Gini "purity" criterion is used for
classification, and the sum of squared errors for regression). Then, if
necessary, the surrogate splits are found that most closely resemble the results
of the primary split on the training data; all the data are divided between the
left and the right child node using the primary and the surrogate splits (just
as is done in the prediction procedure). Then the procedure recursively splits
both the left and the right node, and so on. At each node the recursive
procedure may stop (i.e. stop splitting the node further) in one of the
following cases: </P>
<UL>
<LI>The depth of the tree branch being constructed has reached the specified
maximum value.
<LI>The number of training samples in the node is less than the specified
threshold, i.e. the set is not statistically representative enough to split the
node further.
<LI>All the samples in the node belong to the same class (or, in the case of
regression, the variation is too small).
<LI>The best split found does not give any noticeable improvement compared to
a random choice. </LI></UL>
<P>When the tree is built, it may be pruned, if needed, using a cross-validation
procedure. That is, some branches of the tree that may lead to model overfitting
are cut off. Normally this procedure is applied only to standalone decision
trees, while tree ensembles usually build trees that are small enough and use
their own protection schemes against overfitting. </P>
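<P>A minimal training sketch follows, assuming the OpenCV 1.x C-style API; the
toy data, the feature count, and the particular parameter values are
illustrative assumptions, and the CvDTreeParams constructor argument order shown
in the comment is taken from the standard ML interface rather than from this
manual. </P><PRE>#include <opencv/ml.h>

int main()
{
    // 100 training samples (one per row), 4 ordered features, one response.
    CvMat* train_data = cvCreateMat( 100, 4, CV_32FC1 );
    CvMat* responses  = cvCreateMat( 100, 1, CV_32FC1 );
    for( int i = 0; i < 100; i++ )
    {
        for( int j = 0; j < 4; j++ )
            cvmSet( train_data, i, j, (i + j) % 7 );   // toy feature values
        cvmSet( responses, i, 0, i % 2 );              // toy regression target
    }

    // CvDTreeParams( max_depth, min_sample_count, regression_accuracy,
    //                use_surrogates, max_categories, cv_folds,
    //                use_1se_rule, truncate_pruned_tree, priors )
    CvDTreeParams params( 8, 10, 0.01f,
                          true,   // surrogates: needed for missing data and
                                  // for variable importance (see below)
                          10, 10, // max_categories, 10-fold CV pruning
                          true, true, 0 );

    CvDTree dtree;
    dtree.train( train_data, CV_ROW_SAMPLE, responses, 0, 0, 0, 0, params );

    cvReleaseMat( &train_data );
    cvReleaseMat( &responses );
    return 0;
}
</PRE>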
<P><B>Variable importance</B> </P>
<P>Besides the obvious use of decision trees for prediction, a tree can also be
used for various kinds of data analysis. One of the key properties of the
constructed decision tree is that it is possible to compute the importance
(relative decisive power) of each variable. For example, in a spam filter that
uses the set of words occurring in a message as a feature vector, the variable
importance rating can be used to determine the most "spam-indicating" words and
thus help keep the dictionary size reasonable. </P>
<P>The importance of each variable is computed over all the splits on this
variable in the tree, both primary and surrogate ones. Thus, to compute variable
importance correctly, the surrogate splits must be enabled in the training
parameters, even if there is no missing data. </P>
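<P>A minimal sketch of reading the importance values after training, assuming a
CvDTree trained with use_surrogates enabled (as in the sketch above);
CvDTree::get_var_importance() returns a matrix with one entry per variable,
while the printing loop itself is illustrative. </P><PRE>#include <stdio.h>
#include <opencv/ml.h>

void print_var_importance( CvDTree& dtree )
{
    const CvMat* importance = dtree.get_var_importance();
    if( !importance )
        return;
    for( int j = 0; j < importance->cols; j++ )
        printf( "variable %d: importance %.3f\n",
                j, cvGetReal1D( importance, j ) );
}
</PRE>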
<P>[Breiman84] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984),
"Classification and Regression Trees", Wadsworth. </P>
<A name=CvDTreeSplit></A>
<H2>CvDTreeSplit</H2>
<P>Decision tree node split </P><PRE>struct CvDTreeSplit
{
    int var_idx;
    int inversed;
    float quality;
    CvDTreeSplit* next;
    union
    {
        int subset[2];
        struct
        {
            float c;
            int split_point;
        }
        ord;
    };
};
</PRE>
<DL>
<DT>var_idx
<DD>Index of the variable used in the split
<DT>inversed
<DD>When it equals 1, the inverse split rule is used (i.e. the left and right
branches are exchanged in the expressions below)
<DT>quality
<DD>The split quality, a positive number. It is used to choose the best
primary split, then to choose and sort the surrogate splits. After the tree is
constructed, it is also used to compute variable importance.
<DT>next
<DD>Pointer to the next split in the node split list.
<DT>subset
<DD>Bit array indicating the value subset in case of split on a categorical
variable. The rule is: if var_value in subset then next_node<-left else
next_node<-right
<DT>c
<DD>The threshold value in case of split on an ordered variable. The rule is:
if var_value < c then next_node<-left else next_node<-right
<DT>split_point
<DD>Used internally by the training algorithm. </DD></DL>
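<P>A minimal sketch of inspecting the split list of one node, assuming a pointer
to a CvDTreeNode (described in the next section) and assuming, for brevity, that
every split variable is ordered; for a categorical variable the subset member of
the union would be read instead of ord. </P><PRE>#include <stdio.h>
#include <opencv/ml.h>

void print_splits( const CvDTreeNode* node )
{
    // The first split in the list is the primary one, the rest are surrogates.
    for( const CvDTreeSplit* s = node->split; s != 0; s = s->next )
        printf( "%s split on variable #%d, threshold %.3f, quality %.3f%s\n",
                s == node->split ? "primary" : "surrogate",
                s->var_idx, s->ord.c, s->quality,
                s->inversed ? " (inversed)" : "" );
}
</PRE>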
<A name=CvDTreeNode></A>
<H2>CvDTreeNode</H2>
<P>Decision tree node </P><PRE>struct CvDTreeNode
{
    int class_idx;
    int Tn;
    double value;
    CvDTreeNode* parent;
    CvDTreeNode* left;
    CvDTreeNode* right;
    CvDTreeSplit* split;
    int sample_count;
    int depth;
    ...
};
</PRE>
<DL>
<DT>value
<DD>The value assigned to the tree node. It is either a class label, or the
estimated function value.
<DT>class_idx
<DD>The class index assigned to the node, normalized to the range
0..class_count-1; it is used internally in classification trees and tree
ensembles.
<DT>Tn
<DD>The tree index in an ordered sequence of trees. The indices are used during
and after the pruning procedure. The root node has the maximum value Tn of the
whole tree, child nodes have Tn less than or equal to the parent's Tn, and the
nodes with Tn≤CvDTree::pruned_tree_idx are not taken into consideration at the
prediction stage (the corresponding branches are considered as cut off), even
if they have not been physically deleted from the tree at the pruning stage.
<DT>parent, left, right
<DD>Pointers to the parent node, left and right child nodes.
<DT>split
<DD>Pointer to the first (primary) split.
<DT>sample_count
<DD>The number of samples that fall into the node at the training stage. It is
used to resolve the difficult cases: when the variable for the primary split
is missing and all the variables for the surrogate splits are missing too,
the sample is directed to the left if
left->sample_count>right->sample_count and to the right otherwise.
<DT>depth
<DD>The node depth; the root node depth is 0, and a child node's depth is its
parent's depth + 1. </DD></DL>
<P>Numerous other fields of CvDTreeNode are used internally at the training
stage. </P>
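<P>A minimal sketch of walking a trained tree through these node fields,
assuming a trained CvDTree named dtree; CvDTree::get_root() returns the root
node, while the recursive dump function itself is illustrative. </P><PRE>#include <stdio.h>
#include <opencv/ml.h>

// Recursively dump the tree structure using the node fields described above.
void dump_node( const CvDTreeNode* node )
{
    if( !node )
        return;
    printf( "%*sdepth=%d samples=%d value=%.3f%s\n",
            node->depth * 2, "", node->depth, node->sample_count,
            node->value, node->split ? "" : " (leaf)" );
    dump_node( node->left );
    dump_node( node->right );
}

// Usage: dump_node( dtree.get_root() );
</PRE>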
<A name=CvDTreeParams></A>
<H2>CvDTreeParams</H2>
<P>Decision tree training parameters </P><PRE>struct CvDTreeParams
{
    int max_categories;
    int max_depth;
    int min_sample_count;
    int cv_folds;
    bool use_surrogates;
    bool use_1se_rule;
    bool truncate_pruned_tree;
    float regression_accuracy;
    const float* priors;
    ...
};
</PRE>