% optimtips.m
%% 5. Ridge regression
%{
Ridge regression produces a biased estimator of the parameters
compared to a simple linear regression, but biased estimators
are not always a bad thing.
%}

%%
% Build some random data
n = 10;
x = randn(n,2);
y = sum(x,2) + randn(n,1);
% The coefficients of the regression model should be [1 1]',
% since we formed y by summing the columns of x, then adding
% noise to the result.
coef = x\y

%%
% If we generated many sets of data, we would see that with
% few data points in each set and a large noise component,
% our estimates of the parameters are fairly volatile. Don't
% forget, the nominal coefficients were [1 1]'.
coef = zeros(10,2);
for i = 1:10
    y = sum(x,2) + randn(n,1);
    coef(i,:) = (x\y)';
end
coef

%%
% Traditional ridge regression can be viewed & accomplished
% in several ways. The pure linear regression is designed to
% minimize the quadratic form (A*coef - y)'*(A*coef - y).
% That is, it minimizes the sum of squares of the residuals
% (A*coef - y). A ridge regression simply adds a term to that
% objective: lambda^2*sum(coef.^2). Clearly when lambda is large,
% the coefficient vector (coef) will be biased towards zero.
% For small lambda, the ridge estimates will be largely
% unchanged from the simple linear regression.
%
% There are two simple ways to accomplish this ridge regression.
% The first is to use the normal equations. Recall that the
% normal equations (to solve the system A*coef = y) are simply:
%
%   coef = inv(A'*A)*A'*y
%
% The associated ridge estimates for a given value of lambda
% are (for p parameters to estimate; in our example, p = 2):
%
%   coef = inv(A'*A + lambda^2*eye(p,p))*A'*y
%
% We can even think of the simple linear regression as a ridge
% estimate with lambda = 0.
%
% As before, we never want to actually use the normal equations.
% We can use \ or any other solver to accomplish this by
% augmenting the matrix A:
%
%   coef = [A;lambda*eye(p,p)] \ [y;zeros(p,1)]
%
% One way to interpret this expression is as if we had added
% extra data points, each of which implies that one of the
% parameters in the regression be zero. We can then think of
% lambda as a weight for each of those new "data" points.
p = 2;
lambda = 1;
coef = zeros(10,4);
for i = 1:10
    y = sum(x,2) + randn(n,1);
    coef(i,1:2) = (x\y)';
    coef(i,3:4) = ([x;lambda*eye(p,p)]\[y;zeros(p,1)])';
end
% The first two columns of coef are the standard regression
% estimates. Columns 3 and 4 are ridge estimates. Note that
% the ridge estimates are always biased towards zero. Had
% we used a larger value of lambda, the bias would be more
% pronounced.
coef
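%%
% (An added check, not part of the original text.) For a fixed lambda,
% the augmented-matrix solution should agree with the normal equations
% form inv(A'*A + lambda^2*eye(p,p))*A'*y, up to roundoff. Here the
% design matrix A is just x, and y is the last set of noisy data
% generated in the loop above.
c_aug = [x;lambda*eye(p,p)]\[y;zeros(p,1)];
c_ne = (x'*x + lambda^2*eye(p,p))\(x'*y);
disp(max(abs(c_aug - c_ne)))  % should be a tiny number, near eps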
%%
% Simple ridge estimators rarely seem useful to me. The type
% of bias (always towards zero) that they produce is often
% not that helpful.
%
% There are other variations on the ridge estimator that can
% be quite helpful however. To show this, we will build a
% simple spline model - a piecewise constant model. Start
% out, as always, with some data.
n = 50;
x = sort(rand(n,1));
y = sin(pi*x) + randn(size(x))/20;
close
plot(x,y,'o')
title 'Data from a trig function'

%%
% Choose some knots
knots = 0:.01:1;
% p is the number of parameters we will estimate
p = length(knots);
% Build the regression problem - the least squares spline.
% Define the spline in terms of its value in a given knot
% interval. There are 50 data points and 101 knots. We need
% to know which knot interval every point falls in. Histc
% tells us that. (Old versions of matlab which do not have
% histc can find bindex on the file exchange.)
[junk,bind] = histc(x,knots);
% Build the matrix as a sparse one! It is sparse, so use
% that fact to our advantage.
A = sparse((1:n)',bind,1,n,p);
% Solve for the least squares spline. Remember that we
% had 101 knots, so 101 parameters to estimate.
spl_coef = A\y
plot(x,y,'go',[knots(1:(end-1));knots(2:end)], ...
    repmat(spl_coef(1:(end-1))',2,1),'r-')
axis([0 1 -.2 1.2])
title 'Unregularized zero order "spline" model'
xlabel 'x'
ylabel 'y'

%%
% When we looked at the coefficients for this spline, we should
% have seen many coefficients that were zero. More than 1/2 of
% them were zero in fact. Where there was no data, the spline
% was estimated as zero. There was also a warning of "rank
% deficiency". Since there were only 50 data points, but 101
% knots (therefore 101 parameters to estimate), this was expected.
%
% We can improve this dramatically with the use of a ridge
% regression. Here I'll set up the bias so that each parameter
% is biased to be close to its neighbors in the spline.
B = spdiags(ones(p-1,1)*[-1 1],[0 1],p-1,p);
% Solve. Try it with a reasonably small lambda.
lambda = 1e-3;
spl_coef_r = ([A;lambda*B]\[y;zeros(p-1,1)])
% This time, no zero coefficients, and no rank problems.
plot(x,y,'go',[knots(1:(end-1));knots(2:end)], ...
    repmat(spl_coef_r(1:(end-1))',2,1),'r-')
title 'Zero order spline model with minimal regularization'
xlabel 'x'
ylabel 'y'
% This least squares spline, with only a tiny amount of bias,
% looks quite reasonable. The trick is that the regularization
% term is only significant for those knots where there is no data
% at all to estimate the spline. The information for those spline
% coefficients is provided entirely by the regularizer.
%
% We can make lambda larger. Try varying lambda in this block
% of code on your own. You will see that lambda = 1 yields just
% a bit more smoothing. Lambda = 10 or 100 begins to seriously
% bias the result not towards zero, but towards the overall mean
% of the data!
%
% There are other ways to do this regularization. We might also
% choose to bias the integral of the second derivative of the
% curve (for higher order splines); a discrete version of that
% idea is sketched just below.
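%%
% (An added sketch, not part of the original text.) A discrete analogue
% of penalizing the second derivative is a second-difference penalty on
% the coefficients. It biases each coefficient towards the average of
% its two neighbors, rather than towards the neighbor on one side.
B2 = spdiags(ones(p-2,1)*[1 -2 1],0:2,p-2,p);
spl_coef_r2 = [A;lambda*B2]\[y;zeros(p-2,1)];
plot(x,y,'go',[knots(1:(end-1));knots(2:end)], ...
    repmat(spl_coef_r2(1:(end-1))',2,1),'r-')
title 'Zero order spline model with a second-difference penalty'
xlabel 'x'
ylabel 'y'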
%%
% There is one more question to think about when doing any
% regularized regression: what is the correct value for lambda?
%
% Too small or too large, either is not good. In the middle is
% just right. But where is Goldilocks when you need her?
%
% An answer can sometimes be found in cross validation.
% This entails repeated fits, dropping out each data point in
% turn from the fit, then predicting the dropped point from the
% model. The ridge parameter (lambda) which results in the lowest
% overall prediction error sum of squares is the choice to use.
% A nice discussion of cross validation in its many forms can
% be found here:
% http://www.quantlet.com/mdstat/scripts/csa/html/node123.html
%
% I'll show an example of ordinary cross validation (OCV) in
% action for the same spline fit. First, we'll plot the prediction
% error sum of squares (PRESS) as a function of lambda.
nl = 21;
lambda = logspace(-1,2,nl);
press = zeros(1,nl);
% loop over lambda values for the plot
for i = 1:nl
    k = 1:n;
    % loop over data points, dropping each out in turn
    for j = 1:n
        % k_j is the list of data points, less the j'th point
        k_j = setdiff(k,j);
        % fit the reduced problem
        spl_coef = ([A(k_j,:);lambda(i)*B]\[y(k_j);zeros(p-1,1)]);
        % prediction at the point dropped out
        pred_j = A(j,:)*spl_coef;
        % accumulate press for this lambda
        press(i) = press(i) + (pred_j - y(j)).^2;
    end
end
% plot, using a log axis for x
semilogx(lambda,press,'-o')
title 'The "optimal" lambda minimizes PRESS'
xlabel 'Lambda'
ylabel 'PRESS'
% Note: there is a minimum in this function near lambda == 1,
% although it is only a slight dip. We could now use fminbnd
% to minimize PRESS(lambda).

%%
% I'll be lazy here, and guess from the plot that PRESS was
% minimized roughly around lambda == 2.
lambda = 2;
spl_coef_r = ([A;lambda*B]\[y;zeros(p-1,1)]);
% This time, no zero coefficients, and no rank problems.
plot(x,y,'go',[knots(1:(end-1));knots(2:end)], ...
    repmat(spl_coef_r(1:(end-1))',2,1),'r-')
title 'Zero order spline model with lambda == 2'
xlabel 'x'
ylabel 'y'

%% 6. Transforming a nonlinear problem to linearity
%{
There are some simple nonlinear models which can be transformed
into linearity. I'll talk about why you might do this, and what
the problems are. Suppose that your model was this:

  y = a * exp(b*t)

Yes, I've left off the error "term", but I did so for a reason.
We'd like to estimate a and b from some data, but the model is
nonlinear in the parameters a and b. A trick that is often used
is to take the log of our model. I.e.,

  log(y) = log(a) + b*t

We can now estimate the parameters log(a) & b using traditional
linear regression techniques. Let's try it out on some numbers:
%}

%%
t = linspace(-1,1,21)';
y = pi*exp(1.5*t) + randn(size(t));
y(y<=0) = 1; % I don't want any non-positive values here!
close
plot(t,y,'o')
title 'A nonlinear relationship to model'
xlabel 't'
ylabel 'y'

%%
% Solve for the coefficients in our transformed problem
coef1 = [ones(size(t)),t] \ log(y)
% How well did we do? We can undo the transformation with a simple
% exponentiation. The true value was pi.
coef1(1) = exp(coef1(1))

%%
% Is this the right thing to do? Well, it's not always bad. In fact,
% it's a trick I've often used to get good starting values, but it
% quietly changes the error structure of our problem.
%
% When we logged the basic model, we forgot that we actually had
% added error to y. Fitting the logged model does not treat that
% error properly. In fact, fitting the logged model implicitly
% assumes that we have a lognormal error structure, i.e., proportional
% error. If we had proportional, lognormal error, then logging the
% model would turn the noise into normally distributed noise.
%
% Estimating the simple exponential parameters directly is not really
% that hard anyway. Let's do it with lsqcurvefit:
fun = @(coef,xdata) coef(1)*exp(coef(2)*xdata);
c0 = [1 2];
coef2 = lsqcurvefit(fun,c0,t,y)
% These parameters may be more accurate since we implicitly assumed
% the appropriate error model.
yhat1 = coef1(1)*exp(coef1(2)*t);
yhat2 = coef2(1)*exp(coef2(2)*t);
plot(t,y,'ro',t,yhat1,'b-',t,yhat2,'g-')
title 'Transformation versus a pure nonlinear regression'
xlabel 't'
ylabel 'y'
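%%
% (An added sketch, not in the original text.) As noted above, the
% logged fit is most useful as a source of starting values. Rather
% than guessing c0, we could hand the back-transformed coefficients
% coef1 from the logged fit straight to lsqcurvefit:
coef3 = lsqcurvefit(fun,coef1(:)',t,y)
% A good starting point like this usually costs the solver fewer
% iterations and makes it less likely to stall at a poor local
% solution.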
%% 7. Sums of exponentials
%{
Our last examples were exponential models, so let's talk about sums
of exponentials. It can't be that bad, can it? Suppose your model
is of the form:

  y = a1*exp(a2*t) + a3*exp(a4*t) + a5*exp(a6*t) + ... + noise

The Jacobian matrix is what matters, and a measure of whether you
will have problems is the condition number of that matrix. Large
condition numbers are bad.

The Jacobian matrix is the matrix of partial derivatives of the
model with respect to the parameters. Thus for each data point,
differentiate the model with respect to each parameter. For a
linear model, this is a trivial matter. Since it is always useful
to have a baseline, let's look first at what happens for a simple
polynomial model.
%}

%%
% Choose 21 points, running from -1 to 1. The order of the model
% will be p.
n = 21;
x = linspace(-1,1,n)';
M = ones(n,1);
format short g
for p = 1:15
    M = [M,x.^p];
    k = cond(M);
    disp([p,k])
end
% Even for a 15th order polynomial model, the condition number
% was only 565000.

%%
% A sum of exponentials. For simplicity, I'll set all the
% coefficients a1 = a3 = a5 = ... = 1, leaving the model
% as simply
%
%   y = exp(a2*t) + exp(a4*t) + exp(a6*t) + ... + noise
%
% The Jacobian matrix is very simple to generate.
n = 21;
x = linspace(-1,1,n)';
c0 = linspace(1,2,8);
M = [];
for p = 1:8
    M = [M,x.*exp(c0(p)*x)];
    k = cond(M);
    disp([p,k])
end
% Note how quickly the condition number grows as we add terms -
% far faster than it did for the polynomial model above.

%% 8. Poor starting values
%{
Robustness of an optimizer to poor starting values is a nice goal,
but simply choosing the best possible starting values is always
valuable. I'll note that variable partitioning is often a big help
for least squares problems. By reducing the dimensionality of the
search space, robustness almost always improves.

Very often the sign of a parameter is important. For example, in
an exponential model, if the rate parameter is given a starting
value with the wrong sign, I have very often seen optimizations
converge to a local optimum with a clearly incorrect curve shape.

Likewise, constraints will often help prevent an optimizer from
diverging into known bad places. For example, suppose we chose to
fit a Gaussian-like model to a singly peaked set of spectral data,
ranging in wavelength from 400 to 700 nm. We might logically
require that the peak of the Gaussian lies within the range of
our data, and that the spread parameter must be not only
non-negative, but greater than some small positive value. (A
sketch of such a bound-constrained fit follows at the end of
this section.) If one is using an optimizer to optimize the knot
placement for splines, forcing the knots to be distinct by some
tolerance is also a good idea.

There are tricks one can use to choose good starting values.
Section 6 in this text discusses the idea of a linearization
by logging a model. This will probably give reasonable starting
estimates. Other linearizations might not be so helpful however.
For example, a simple first order Taylor series approximation
to an objective will not show any real gain when used to choose
starting values for a Newton derived method. This is because
methods like Gauss-Newton simply use that same truncated Taylor
series approximation in their iterations. So that Taylor series
approximation was nothing more than an initial iteration of
Newton's method. Starting values derived in this fashion will
probably gain no more than a single iteration of the optimizer.
%}
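%%
% (An added sketch, not part of the original text.) The bound-constrained
% Gaussian fit described above might look something like this. The
% "spectral" data, model, starting values and bounds are all made up
% purely for illustration.
wl = (400:5:700)';                                        % wavelengths, nm
ys = 2*exp(-((wl-550).^2)/(2*30^2)) + randn(size(wl))/50; % one noisy peak
gmodel = @(c,w) c(1)*exp(-((w-c(2)).^2)/(2*c(3)^2));
c0 = [1 500 50];     % starting values: height, center, spread
lb = [0 400 1];      % height >= 0, peak inside [400,700] nm,
ub = [inf 700 inf];  % spread at least 1 nm
cfit = lsqcurvefit(gmodel,c0,wl,ys,lb,ub)
% The bounds cannot rescue a truly terrible starting point, but they
% do keep the optimizer out of regions we already know are nonsense.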