<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0055)http://www.cse.unsw.edu.au/~akgu380/Kmeans/K-means.html -->
<HTML><HEAD><TITLE>The k-Means Procedure</TITLE>
<META content="text/html; charset=iso-8859-1" http-equiv=Content-Type>
<META content="MSHTML 5.00.3806.1700" name=GENERATOR>
<META content="C:\Program Files\Microsoft Office\Office\html.dot" name=Template>
<META content="Mozilla/4.76 [en] (Windows NT 5.0; U) [Netscape]"
name=GENERATOR></HEAD>
<BODY link=#0000ff vLink=#800080><B><FONT face=Arial><FONT size=+4>K-means
algorithm</FONT></FONT></B> <BR><A
href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/K-means.html">Description</A> |
<A href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/KmeansDemo.html">Demo</A> |
<A href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/KmeansCode.html">Code</A> |
<P>This is a simple non-hierarchical clustering algorithm that requires the
number of clusters as an input. It works by initially creating a random centroid
for each cluster. Each data point is then classified by its minimum distance to
the centroids. The centroid positions are recalculated as points are assigned to
clusters, and this continues until all the points are grouped into the required
number of clusters and the centroids no longer change between successive
iterations.
<P><B><I><FONT face=Arial>The k-Means Procedure</FONT></I></B>
<P>Suppose that we have n example feature vectors x<SUB>1</SUB>, x<SUB>2</SUB>,
..., x<SUB>n</SUB> all from the same class, and we know that they fall into k
compact clusters, k < n. Let m<SUB>i</SUB> be the mean of the vectors in
Cluster i. If the clusters are well separated, we can use a minimum-distance
classifier to separate them. That is, we can say that x is in Cluster i if || x
- m<SUB>i</SUB> || is the minimum of all the k distances. This suggests the following
procedure for finding the k means:
<P>Make initial guesses for the means m<SUB>1</SUB>, m<SUB>2</SUB>, ...,
m<SUB>k</SUB>
<P>Until there are no changes in any mean
<DIR>
<DIR>Use the estimated means to classify the examples into clusters using a
distance measure
<P>For i from 1 to k
<DIR>
<DIR>Replace m<SUB>i</SUB> with the mean of all of the examples for Cluster
i</DIR></DIR>end_for</DIR></DIR>end_until
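<P>The procedure above can be sketched in Python (a minimal illustration, not
the reference code linked from this page; points are plain tuples, and the
initial means are chosen by randomly sampling k of the examples):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means over a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Make initial guesses for the means: randomly choose k examples.
    means = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    assignment = None
    for _ in range(max_iter):
        # Classify each example by minimum squared distance to a mean.
        new_assignment = [
            min(range(k),
                key=lambda i: sum((p[d] - means[i][d]) ** 2
                                  for d in range(dim)))
            for p in points
        ]
        if new_assignment == assignment:  # no change in any mean: stop
            break
        assignment = new_assignment
        # Replace each mean with the mean of its cluster's examples.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # skip empty clusters (see weaknesses below)
                means[i] = [sum(p[d] for p in members) / len(members)
                            for d in range(dim)]
    return means, assignment
```

On two well-separated groups of points, the procedure recovers the grouping
regardless of which examples are sampled as initial means.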
<P>This is a simple version of the k-means procedure. It can be viewed as a
greedy algorithm for partitioning the n examples into k clusters so as to
minimize the sum of the squared distances to the cluster centers. It does have
some weaknesses.
<UL>
<LI>The way to initialize the means was not specified. One popular way to
start is to randomly choose k of the examples.
<LI>The results produced depend on the initial values for the means, and it
frequently happens that sub-optimal partitions are found. The standard
solution is to try a number of different starting points.
<LI>It can happen that the set of examples closest to m<SUB>i</SUB> is empty,
so that m<SUB>i</SUB> cannot be updated. This is an annoyance that must be
handled in an implementation, but that we shall ignore.
<LI>The results depend on the metric used to measure || x - m<SUB>i</SUB> ||. A popular
solution is to normalize each variable by its standard deviation, though this
is not always desirable.
<LI>The results depend on the value of k. </LI></UL><B><I><FONT
face=Arial>Implementation issues</FONT></I></B>
<P>An implementation of this algorithm needs to pay attention to the following
issues.
<P><B>Initialization of k-means</B>: Since the basic k-means algorithm is not
guaranteed to find a global optimum and only finds a local minimum, the solution
obtained often depends very much on how it is initialized. There are a number of
different heuristics used to initialize the algorithm. Two popular methods are -
<UL>
<LI>assign the i<SUP>th</SUP> sample to class (i modulo k).
<LI>randomly assign the data to k different classes. </LI></UL><B>Distance
measure</B>: The algorithm spends most of its time computing distances, so the
measure chosen has a significant impact on the total running time.
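<P>The two initialization heuristics above can be sketched as follows (a
hypothetical illustration; the function names are our own, not from the linked
code):

```python
import random

def init_modulo(points, k):
    # Assign the i-th sample to class (i mod k), then use the mean of
    # each class as its initial centroid.
    dim = len(points[0])
    classes = [[p for j, p in enumerate(points) if j % k == i]
               for i in range(k)]
    return [[sum(p[d] for p in c) / len(c) for d in range(dim)]
            for c in classes]

def init_random(points, k, seed=0):
    # Randomly choose k distinct examples as the initial means.
    return [list(p) for p in random.Random(seed).sample(points, k)]
```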
<P><B>Number of iterations</B>: Theoretically, k-means should terminate when no
more samples are changing classes. There are proofs of termination for k-means,
which rely on the fact that both steps of k-means (assign sample to nearest
centers, move centers to cluster centroids) reduce variance. Typically, two
criteria are used -
<UL>
<LI>terminate after a predefined fixed number of iterations, or
<LI>terminate after fewer than n samples change classes, for some small
threshold n. </LI></UL><B>Dead clusters</B>: Occasionally a centroid may end up
with no members in the next iteration. Such a cluster needs to be handled with
some heuristic.
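<P>One way to handle the centroid update, including dead clusters, can be
sketched as follows (a hypothetical helper; reseeding a dead cluster with a
randomly chosen example is just one possible heuristic):

```python
import random

def update_means(points, means, assignment, seed=0):
    # Recompute each mean from its members; reseed any "dead" cluster
    # (one with no assigned members) with a randomly chosen example.
    rng = random.Random(seed)
    dim = len(points[0])
    for i in range(len(means)):
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            means[i] = [sum(p[d] for p in members) / len(members)
                        for d in range(dim)]
        else:
            means[i] = list(rng.choice(points))
    return means
```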
<P><B>Number of clusters</B>: Usually, the number of clusters required is an
input variable. However, a heuristic cost function based on inter- and
intra-cluster variance may be used to choose k in a model selection scenario.
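<P>One such cost is the total within-cluster (intra-cluster) variance, which
can be computed for each candidate k and compared across runs (a minimal
sketch, not a complete model selection procedure):

```python
def within_cluster_cost(points, means, assignment):
    # Sum of squared distances from each point to its assigned mean.
    # This always decreases as k grows, so it is typically traded off
    # against k itself (e.g. by looking for an "elbow" in the curve).
    dim = len(points[0])
    return sum(sum((p[d] - means[a][d]) ** 2 for d in range(dim))
               for p, a in zip(points, assignment))
```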
<P>
<HR width="100%">
<BR><A
href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/K-means.html">Description</A> |
<A href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/KmeansDemo.html">Demo</A> |
<A href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/KmeansCode.html">Code</A> |
<P>Comments to <A
href="mailto:akgu380@cse.unsw.edu.au">akgu380@cse.unsw.edu.au</A> <BR>
</P></BODY></HTML>