function [idx, C, sumD, D] = kmeans(X, k, varargin)
%KMEANS K-means clustering.
% IDX = KMEANS(X, K) partitions the points in the N-by-P data matrix
% X into K clusters. This partition minimizes the sum, over all
% clusters, of the within-cluster sums of point-to-cluster-centroid
% distances. Rows of X correspond to points, columns correspond to
% variables. KMEANS returns an N-by-1 vector IDX containing the
% cluster indices of each point. By default, KMEANS uses squared
% Euclidean distances.
%
% KMEANS treats NaNs as missing data, and ignores any rows of X that
% contain NaNs.
%
% [IDX, C] = KMEANS(X, K) returns the K cluster centroid locations in
% the K-by-P matrix C.
%
% [IDX, C, SUMD] = KMEANS(X, K) returns the within-cluster sums of
% point-to-centroid distances in the K-by-1 vector sumD.
%
% [IDX, C, SUMD, D] = KMEANS(X, K) returns distances from each point
% to every centroid in the N-by-K matrix D.
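%
% Note that, by these definitions and assuming no clusters are dropped, the
% outputs are related by
%
%    sumD(j) == sum(D(idx==j, j))   % within-cluster sum for cluster j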
%
% [ ... ] = KMEANS(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies
% optional parameter name/value pairs to control the iterative algorithm
% used by KMEANS. Parameters are:
%
% 'Distance' - Distance measure, in P-dimensional space, that KMEANS
% should minimize with respect to. Choices are:
% 'sqEuclidean' - Squared Euclidean distance (the default)
% 'cityblock' - Sum of absolute differences, a.k.a. L1 distance
% 'cosine' - One minus the cosine of the included angle
% between points (treated as vectors)
% 'correlation' - One minus the sample correlation between points
% (treated as sequences of values)
% 'Hamming' - Percentage of bits that differ (only suitable
% for binary data)
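%
% For example, the 'cosine' distance between a point x and a centroid c
% (both 1-by-P row vectors) is
%
%    1 - (x*c') / (norm(x)*norm(c))
%
% After the row normalization of X performed below, this reduces to one
% minus a dot product against a (re)normalized centroid.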
%
% 'Start' - Method used to choose initial cluster centroid positions,
% sometimes known as "seeds". Choices are:
% 'sample' - Select K observations from X at random (the default)
% 'uniform' - Select K points uniformly at random from the range
% of X. Not valid for Hamming distance.
% 'cluster' - Perform a preliminary clustering phase on a random 10%
% subsample of X. This preliminary phase is itself
% initialized using 'sample'.
% matrix - A K-by-P matrix of starting locations. In this case,
% you can pass in [] for K, and KMEANS infers K from
% the first dimension of the matrix. You can also
% supply a 3D array, implying a value for 'Replicates'
% from the array's third dimension.
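%
% For example, to supply explicit starting centroids and let KMEANS infer
% K from them:
%
%    C0 = X(1:3,:);                     % any K-by-P matrix of seed points
%    idx = kmeans(X, [], 'Start', C0);  % K is taken from size(C0,1)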
%
% 'Replicates' - Number of times to repeat the clustering, each with a
% new set of initial centroids. A positive integer, default is 1.
%
% 'EmptyAction' - Action to take if a cluster loses all of its member
% observations. Choices are:
% 'error' - Treat an empty cluster as an error (the default)
% 'drop' - Remove any clusters that become empty, and set
% the corresponding values in C and D to NaN.
% 'singleton' - Create a new cluster consisting of the one
% observation furthest from its centroid.
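%
% For example, to replace an empty cluster with a singleton cluster rather
% than stopping with an error:
%
%    idx = kmeans(X, k, 'EmptyAction', 'singleton');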
%
% 'Options' - Options for the iterative algorithm used to minimize the
% fitting criterion, as created by STATSET. Choices of STATSET
% parameters are:
%
% 'Display' - Level of display output. Choices are 'off' (the
% default), 'iter', and 'final'.
% 'MaxIter' - Maximum number of iterations allowed. Default is 100.
%
% 'OnlinePhase' - Flag indicating whether KMEANS should perform an "on-line
% update" phase in addition to a "batch update" phase. The on-line phase
% can be time consuming for large data sets, but guarantees a solution
% that is a local minimum of the distance criterion, i.e., a partition of
% the data where moving any single point to a different cluster increases
% the total sum of distances. 'on' (the default) or 'off'.
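%
% For example, to skip the on-line phase when clustering a large data set:
%
%    idx = kmeans(X, k, 'OnlinePhase', 'off');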
%
% Example:
%
% X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
% opts = statset('Display','final');
% [cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
% 'Replicates',5, 'Options',opts);
% plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
% X(cidx==2,1),X(cidx==2,2),'b.', ctrs(:,1),ctrs(:,2),'kx');
%
% See also LINKAGE, CLUSTERDATA, SILHOUETTE.

% KMEANS uses a two-phase iterative algorithm to minimize the sum of
% point-to-centroid distances, summed over all K clusters. The first phase
% uses what the literature often describes as "batch" updates, where each
% iteration consists of reassigning points to their nearest cluster
% centroid, all at once, followed by recalculation of cluster centroids.
% This phase occasionally (especially for small datasets) does not converge
% to a solution that is a local minimum, i.e., a partition of the data where
% moving any single point to a different cluster increases the total sum of
% distances. Thus, the batch phase can be thought of as providing a fast but
% potentially only approximate solution as a starting point for the second
% phase. The second phase uses what the literature often describes as
% "on-line" updates, where points are individually reassigned if doing so
% will reduce the sum of distances, and cluster centroids are recomputed
% after each reassignment. Each iteration during this second phase consists
% of one pass through all the points. The on-line phase will converge to a
% local minimum, although there may be other local minima with lower total
% sum of distances. The problem of finding the global minimum can only be
% solved in general by an exhaustive (or clever, or lucky) choice of
% starting points, but using several replicates with random starting points
% typically results in a solution that is a global minimum.
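%
% In outline, one batch iteration for the default squared Euclidean distance
% amounts to the sketch below; the batchUpdate and onlineUpdate phases called
% further down generalize this to the other distance measures and handle
% empty clusters.
%
%    D = distfun(X, C, distance, iter); % point-to-centroid distances
%    [d, idx] = min(D, [], 2);          % reassign every point to its nearest centroid
%    for j = 1:k
%        C(j,:) = mean(X(idx==j,:), 1); % recompute each centroid as its cluster mean
%    end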
%
% References:
%
% [1] Seber, G.A.F., Multivariate Observations, Wiley, New York, 1984.
% [2] Spath, H. (1985) Cluster Dissection and Analysis: Theory, FORTRAN
% Programs, Examples, translated by J. Goldschmidt, Halsted Press,
% New York, 226 pp.
% Copyright 1993-2007 The MathWorks, Inc.
% $Revision: 1.4.4.8 $ $Date: 2007/06/14 05:25:34 $
if nargin < 2
error('stats:kmeans:TooFewInputs','At least two input arguments required.');
end
[ignore,wasnan,X] = statremovenan(X);
hadNaNs = any(wasnan);
if hadNaNs
warning('stats:kmeans:MissingDataRemoved','Ignoring rows of X with missing data.');
end
% n points in p dimensional space
[n, p] = size(X);
pnames = { 'distance' 'start' 'replicates' 'emptyaction' 'onlinephase' 'options' 'maxiter' 'display'};
dflts = {'sqeuclidean' 'sample' [] 'error' 'on' [] [] []};
[eid,errmsg,distance,start,reps,emptyact,online,options,maxit,display] ...
= statgetargs(pnames, dflts, varargin{:});
if ~isempty(eid)
error(sprintf('stats:kmeans:%s',eid),errmsg);
end
if ischar(distance)
distNames = {'sqeuclidean','cityblock','cosine','correlation','hamming'};
j = strmatch(lower(distance), distNames);
if length(j) > 1
error('stats:kmeans:AmbiguousDistance', ...
'Ambiguous ''Distance'' parameter value: %s.', distance);
elseif isempty(j)
error('stats:kmeans:UnknownDistance', ...
'Unknown ''Distance'' parameter value: %s.', distance);
end
distance = distNames{j};
switch distance
case 'cosine'
Xnorm = sqrt(sum(X.^2, 2));
if any(min(Xnorm) <= eps(max(Xnorm)))
error('stats:kmeans:ZeroDataForCos', ...
['Some points have small relative magnitudes, making them ', ...
'effectively zero.\nEither remove those points, or choose a ', ...
'distance other than ''cosine''.']);
end
X = X ./ Xnorm(:,ones(1,p));
case 'correlation'
X = X - repmat(mean(X,2),1,p);
Xnorm = sqrt(sum(X.^2, 2));
if any(min(Xnorm) <= eps(max(Xnorm)))
error('stats:kmeans:ConstantDataForCorr', ...
['Some points have small relative standard deviations, making them ', ...
'effectively constant.\nEither remove those points, or choose a ', ...
'distance other than ''correlation''.']);
end
X = X ./ Xnorm(:,ones(1,p));
case 'hamming'
if ~all(ismember(X(:),[0 1]))
error('stats:kmeans:NonbinaryDataForHamm', ...
'Non-binary data cannot be clustered using Hamming distance.');
end
end
else
error('stats:kmeans:InvalidDistance', ...
'The ''Distance'' parameter value must be a string.');
end
if ischar(start)
startNames = {'uniform','sample','cluster'};
j = strmatch(lower(start), startNames);
if length(j) > 1
error('stats:kmeans:AmbiguousStart', ...
'Ambiguous ''Start'' parameter value: %s.', start);
elseif isempty(j)
error('stats:kmeans:UnknownStart', ...
'Unknown ''Start'' parameter value: %s.', start);
elseif isempty(k)
error('stats:kmeans:MissingK', ...
'You must specify the number of clusters, K.');
end
start = startNames{j};
if strcmp(start, 'uniform')
if strcmp(distance, 'hamming')
error('stats:kmeans:UniformStartForHamm', ...
'Hamming distance cannot be initialized with uniform random values.');
end
Xmins = min(X,[],1);
Xmaxs = max(X,[],1);
end
elseif isnumeric(start)
CC = start;
start = 'numeric';
if isempty(k)
k = size(CC,1);
elseif k ~= size(CC,1)
error('stats:kmeans:MisshapedStart', ...
'The ''Start'' matrix must have K rows.');
elseif size(CC,2) ~= p
error('stats:kmeans:MisshapedStart', ...
'The ''Start'' matrix must have the same number of columns as X.');
end
if isempty(reps)
reps = size(CC,3);
elseif reps ~= size(CC,3)
error('stats:kmeans:MisshapedStart', ...
'The third dimension of the ''Start'' array must match the ''replicates'' parameter value.');
end
% Need to center explicit starting points for 'correlation'. (Re)normalization
% for 'cosine'/'correlation' is done at each iteration.
if isequal(distance, 'correlation')
CC = CC - repmat(mean(CC,2),[1,p,1]);
end
else
error('stats:kmeans:InvalidStart', ...
'The ''Start'' parameter value must be a string or a numeric matrix or array.');
end
if ischar(emptyact)
emptyactNames = {'error','drop','singleton'};
j = strmatch(lower(emptyact), emptyactNames);
if length(j) > 1
error('stats:kmeans:AmbiguousEmptyAction', ...
'Ambiguous ''EmptyAction'' parameter value: %s.', emptyact);
elseif isempty(j)
error('stats:kmeans:UnknownEmptyAction', ...
'Unknown ''EmptyAction'' parameter value: %s.', emptyact);
end
emptyact = emptyactNames{j};
else
error('stats:kmeans:InvalidEmptyAction', ...
'The ''EmptyAction'' parameter value must be a string.');
end
if ischar(online)
j = strmatch(lower(online), {'on','off'});
if length(j) > 1
error('stats:kmeans:AmbiguousOnlinePhase', ...
'Ambiguous ''OnlinePhase'' parameter value: %s.', online);
elseif isempty(j)
error('stats:kmeans:UnknownOnlinePhase', ...
'Unknown ''OnlinePhase'' parameter value: %s.', online);
end
online = (j==1);
else
error('stats:kmeans:InvalidOnlinePhase', ...
'The ''OnlinePhase'' parameter value must be ''on'' or ''off''.');
end
% 'maxiter' and 'display' are grandfathered as separate param name/value pairs
if ~isempty(display)
options = statset(options,'Display',display);
end
if ~isempty(maxit)
options = statset(options,'MaxIter',maxit);
end
options = statset(statset('kmeans'), options);
display = strmatch(lower(options.Display), {'off','notify','final','iter'}) - 1; % 0=off, 1=notify, 2=final, 3=iter
maxit = options.MaxIter;
if ~(isscalar(k) && isnumeric(k) && isreal(k) && k > 0 && (round(k)==k))
error('stats:kmeans:InvalidK', ...
'K must be a positive integer value.');
% elseif k == 1
% this special case works automatically
elseif n < k
error('stats:kmeans:TooManyClusters', ...
'X must have more rows than the number of clusters.');
end
% Assume one replicate
if isempty(reps)
reps = 1;
end
%
% Done with input argument processing, begin clustering
%
dispfmt = '%6d\t%6d\t%8d\t%12g';
if online, Del = NaN(n,k); end % reassignment criterion
totsumDBest = Inf;
emptyErrCnt = 0;
for rep = 1:reps
switch start
case 'uniform'
C = unifrnd(Xmins(ones(k,1),:), Xmaxs(ones(k,1),:));
% For 'cosine' and 'correlation', these are uniform inside a subset
% of the unit hypersphere. Still need to center them for
% 'correlation'. (Re)normalization for 'cosine'/'correlation' is
% done at each iteration.
if isequal(distance, 'correlation')
C = C - repmat(mean(C,2),1,p);
end
if isa(X,'single')
C = single(C);
end
case 'sample'
C = X(randsample(n,k),:);
if ~isfloat(C) % X may be logical
C = double(C);
end
case 'cluster'
Xsubset = X(randsample(n,floor(.1*n)),:);
[dum, C] = kmeans(Xsubset, k, varargin{:}, 'start','sample', 'replicates',1);
case 'numeric'
C = CC(:,:,rep);
end
% Compute the distance from every point to each cluster centroid and the
% initial assignment of points to clusters
D = distfun(X, C, distance, 0);
[d, idx] = min(D, [], 2); % nearest centroid and corresponding distance for each point
m = accumarray(idx,1,[k,1]); % number of points currently assigned to each cluster
try % catch empty cluster errors and move on to next rep
% Begin phase one: batch reassignments
converged = batchUpdate();
% Begin phase two: single reassignments
if online
converged = onlineUpdate();
end
if ~converged
warning('stats:kmeans:FailedToConverge', ...
'Failed to converge in %d iterations%s.',maxit,repsMsg(rep,reps));
end
% Calculate cluster-wise sums of distances
nonempties = find(m>0);
D(:,nonempties) = distfun(X, C(nonempties,:), distance, iter);
d = D((idx-1)*n + (1:n)'); % distance from each point to its assigned centroid, via linear indexing
sumD = accumarray(idx,d,[k,1]);
totsumD = sum(sumD);
if display > 1 % 'final' or 'iter'
disp(sprintf('%d iterations, total sum of distances = %g',iter,totsumD));
end
% Save the best solution so far
if totsumD < totsumDBest
totsumDBest = totsumD;
idxBest = idx;
Cbest = C;
sumDBest = sumD;
if nargout > 3
Dbest = D;
end
end
% If an empty cluster error occurred in one of multiple replicates, catch
% it, warn, and move on to the next replicate. Error only when all replicates
% fail. Rethrow any other kind of error.
catch
err = lasterror;
if reps == 1 || ~isequal(err.identifier,'stats:kmeans:EmptyCluster')
rethrow(err);
else
emptyErrCnt = emptyErrCnt + 1;
warning('stats:kmeans:EmptyCluster', ...
'Replicate %d terminated: empty cluster created at iteration %d.',rep,iter);
if emptyErrCnt == reps
error('stats:kmeans:EmptyClusterAllReps', ...
'An empty cluster error occurred in every replicate.');
end
end
end % catch
end % replicates
% Return the best solution
idx = idxBest;
C = Cbest;
sumD = sumDBest;
if nargout > 3
D = Dbest;
end
if hadNaNs
idx = statinsertnan(wasnan, idx);
end
%------------------------------------------------------------------