suggest that you browse through the menu version of SPSS to learn the details. A
simple example will illustrate the parallels. Imagine that we had carried out a
study of voting and wished to know how best to predict whether people had voted
Conservative or Labour. The commands would be:</P><PRE>LOGISTIC REGRESSION /VARIABLES voting WITH age sex class
 att1 att2 att3 att4 extro psycho neuro
 /METHOD FSTEP(LR)
 /CLASSPLOT.</PRE>
<P>The dependent variable is separated from the independent variables by the
term WITH. The METHOD subcommand uses the keyword FSTEP to specify a
<B>forward</B> <B>stepwise</B> procedure; we could also use BSTEP, which does a
<B>backward stepwise</B> procedure, i.e. it starts by entering all the variables
and then takes them out one at a time; or ENTER if we were engaged in hypothesis
testing rather than exploratory analysis. If no METHOD subcommand is given, ENTER
will be assumed. The (LR) term after FSTEP specifies that likelihood ratio
considerations will be used in selecting variables to add to or delete from the
model; this is preferable but can slow computation, so it may be necessary to
omit it. The /CLASSPLOT line is not strictly necessary but aids interpretation -
<A href="http://www.ex.ac.uk/~SEGLea/multvar2/disclogi.html#classplot">see
below</A>. </P>
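<P>For instance, if we were testing a specific hypothesis with the same data, we
could force all the predictors in at once; a minimal sketch, reusing the variable
names from the example above:</P><PRE>LOGISTIC REGRESSION /VARIABLES voting WITH age sex class
 att1 att2 att3 att4 extro psycho neuro
 /METHOD ENTER
 /CLASSPLOT.</PRE>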
<P>A useful property of the LOGISTIC REGRESSION command is that it can cope
automatically with categorical independent variables; we don't have to write a
loop as we do for linear regression. All we have to do is declare any
categorical variables on a /CATEGORICAL subcommand <I>as well as</I> on the
/VARIABLES subcommand. The /CONTRAST subcommand should be used to control which
category is dropped out when the dummy variables are formed; if the control or
modal category of, say, a variable DIAGNOST was its third value, we would use
the subcommand /CONTRAST(DIAGNOST)=INDICATOR(3) to tell LOGISTIC REGRESSION
to drop level 3 of the variable in forming dummy variables. Although this is an
improvement over what we have to do when using SPSS to carry out linear
regression, there is a snag. /CONTRAST likes its category levels specified in
rather an odd way; in the example, 3 might not be the value used to code the
modal category in DIAGNOST: for example, if psychotic, neurotic and normal
people were coded 0, 1 and 2, the correct entry in /CONTRAST would be 3, not 2.
Look, I didn't write this idiot system, I'm just trying to tell you about it.
</P>
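<P>Putting those subcommands together, a run with DIAGNOST as a categorical
predictor might look like this (a sketch; the dependent variable RECOVER and the
predictor AGE are invented for illustration):</P><PRE>LOGISTIC REGRESSION /VARIABLES recover WITH age diagnost
 /CATEGORICAL diagnost
 /CONTRAST(diagnost)=INDICATOR(3)
 /METHOD ENTER.</PRE>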
<P>As in linear regression, there is no need to declare dichotomous independent
variables as categorical. </P>
<P>We can also use SPSS to carry out discriminant analysis. For the example just
considered, the commands would be: </P><PRE>DISCRIMINANT GROUPS=voting(0,1)
 /VARIABLES = age sex class att1 to att4 extro psycho neuro
 /METHOD=MINRESID
 /STATISTICS=TABLE.</PRE>
<P>Note that we have to specify the two possible levels of the dependent
variable (voting). We can use the /METHOD subcommand to request a variety of
stepwise methods (RAO is another you might like to try), or to ENTER all or a
subset of variables. The subcommand /STATISTICS=TABLE is needed to get the
classification table used for assessing goodness of fit (see below).
</P>
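<P>For example, to try Rao's V as the stepwise criterion instead, only the
/METHOD subcommand changes; a sketch, otherwise identical to the run
above:</P><PRE>DISCRIMINANT GROUPS=voting(0,1)
 /VARIABLES = age sex class att1 to att4 extro psycho neuro
 /METHOD=RAO
 /STATISTICS=TABLE.</PRE>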
<P><I>back to <A href="http://www.ex.ac.uk/~SEGLea/multvar2/disclogi.html#top">top</A></I></P>
<H3><A name=report></A>Interpreting and reporting logistic regression results</H3>
<UL>
<LI><B>Log likelihoods</B>
<P>A key concept for understanding the tests used in logistic regression (and
many other procedures using maximum likelihood methods) is that of <B>log
likelihood</B>. Likelihood just means probability, though it tends to be used
by statisticians of a <B>Bayesian</B> orientation. It always means probability
<I>under a specified hypothesis</I>. In thinking about logistic regression,
two hypotheses are likely to be of interest: the null hypothesis, which is
that all the coefficients in the regression equation take the value zero, and
the hypothesis that the model currently under consideration is accurate. We
then work out the likelihood of observing the exact data we actually did
observe under each of these hypotheses. The result is nearly always a
frighteningly small number, and to make it easier to handle, we take its
natural logarithm (i.e. its log base <I>e</I>), giving us a log likelihood.
Probabilities are always less than one, so log likelihoods are always
negative; often, we work with <B>negative log likelihoods</B> for convenience.
</P>
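<P>To make this concrete (the numbers here are invented for illustration): if
the probability of the observed data under some model were 3.7 x 10<SUP>-21</SUP>,
the log likelihood would be</P><PRE>LL = ln(3.7 x 10^-21) = -47.0,  so  -2LL = 94.1</PRE>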
<LI><B>Goodness of fit</B>
<P>Logistic regression does not give rise to an
<I>R</I><SUP>2</SUP><SUB>adj</SUB> statistic. Darlington (1990, page 449)
recommends the following statistic as a measure of goodness of fit: </P>
<CENTER><PRE>         exp[(LL<SUB>model</SUB> - LL<SUB>0</SUB>)/N] - 1
LRFC<SUB>1</SUB> = ------------------------------
             exp(-LL<SUB>0</SUB>/N) - 1
</PRE></CENTER>
<P>where exp refers to the exponential function (the inverse of the log
function), <I>N</I> as usual is sample size, and <I>LL</I><SUB>model</SUB> and
<I>LL</I><SUB>0</SUB> are the log likelihoods of the data under the model and
the null hypothesis respectively. (Note that I have changed Darlington's
notation a little to make it fit in with that used in the rest of these
notes.) Darlington's statistic is useful because it takes values between 0 and
1 (or 0% and 100%) which have much the same interpretation as values of
<I>R</I><SUP>2</SUP> or <I>R</I><SUP>2</SUP><SUB>adj</SUB> in a
linear regression, although unfortunately it looks from the formula that, of
the two, it is more closely analogous to <I>R</I><SUP>2</SUP>. SPSS does not,
however, report this statistic. It does report <I>negative</I>
log likelihoods, multiplied by 2, so with a little adjustment these can be
inserted in the equation for <I>LRFC</I><SUB>1</SUB>. </P>
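<P>A worked example with invented figures: suppose SPSS reports -2<I>LL</I> =
138.6 for the null model and 110.4 for our model, with <I>N</I> = 100. Then
<I>LL</I><SUB>0</SUB> = -69.3 and <I>LL</I><SUB>model</SUB> = -55.2, and</P><PRE>LRFC<SUB>1</SUB> = [exp((-55.2 - (-69.3))/100) - 1] / [exp(69.3/100) - 1]
      = [exp(0.141) - 1] / [exp(0.693) - 1]
      = 0.151 / 1.000
      = 0.15 approximately</PRE>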
<P>Rather than using a goodness of fit statistic, though, we often want to
look at the proportion of cases we have managed to classify correctly. For
this we need to look at the <B>classification table</B> printed out by SPSS,
which tells us how many of the cases where the observed value of the dependent
variable was 1 have been predicted with a value 1, and so on. An advantage of
the classification table is that we can get one out of either logistic
regression or discriminant analysis, so we can use it to compare the two
approaches. Statisticians claim that logistic regression tends to classify a
higher proportion of cases correctly. </P>
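<P>An illustrative classification table (the counts are invented, for a sample
of 100 cases) might look like this:</P><PRE>                    Predicted
                    0      1    Percent correct
Observed   0       42      8        84.0%
           1       11     39        78.0%
                        Overall     81.0%</PRE>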
<P><A name=classplot></A>Another very useful piece of information for
assessing goodness of fit can be gained by using the /CLASSPLOT subcommand.
This causes SPSS to print distributions of predicted logit values,
distinguishing the observed category values. The resulting plot is very useful
for spotting possible outliers. It will also tell you whether it might be
better to separate the two predicted categories by some rule other than the
simple one SPSS uses, which is to predict value 1 if logit(<I>p</I>) is
greater than 0 (i.e. if <I>p</I> is greater than 0.5). A better separation of
categories might result from using a different criterion. We might also want
to use a different criterion if the <I>a priori</I> probabilities of the two
categories were very different (one might be a rare disease, for example), or
if the costs of mistakenly predicting someone into the two categories differ
(suppose the categories were "found guilty of murder" and "not guilty", for
example). The following is an example of such a CLASSPLOT:</P><PRE>
      32 +                                                           f+
         |                                                           f|
         |                                                           f|
F        |                                                           f|
R     24 +                                                           f+
E        |                                                           f|
Q        |                                                           f|
U        |                                                           f|
E     16 +                                                           f+
N        |                                                           f|
C        |                                                           f|
Y        |                                                           f|
       8 +                                                           f+
         |                                                           f|
         |                                   f     f   f   f    ffffff|
         |n fnn nnnnnf nnfnn nnn n  fn   nnffnff  f  ff        nfnffff|
Predicted --------------+--------------+--------------+---------------
  Prob:   0           .25             .5            .75              1
  Group:  nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnffffffffffffffffffffffffffffff

          Predicted Probability is of Membership for found guilty
          Symbols: n - not guilty
                   f - found guilty
          Each Symbol Represents 2 Cases.</PRE>
<P>If we were called as expert witnesses to advise the court about the
probability that the person accused had committed murder, using the variables
in this particular logistic regression model, we might want to set a predicted
probability criterion of .9 rather than .5.</P>
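<P>A sketch of how such a criterion could be set in the syntax, assuming your
release of SPSS supports the CUT keyword on the /CRITERIA subcommand (the
variable names here are invented):</P><PRE>LOGISTIC REGRESSION /VARIABLES guilty WITH motive oppty forensic
 /METHOD ENTER
 /CRITERIA CUT(0.9)
 /CLASSPLOT.</PRE>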
<LI><B>Overall significance</B>
<P>SPSS will offer you a variety of statistical tests. Usually, though,
overall significance is tested using what SPSS calls the <I>Model
Chi</I>-<I>square</I>, which is derived from the likelihood of observing the
actual data under the assumption that the model that has been fitted is
accurate. It is convenient to use -2 times the log (base <I>e</I>) of this
likelihood; we call this -2<I>LL</I>. The difference between -2<I>LL</I> for
the best-fitting model and -2<I>LL</I> for the null hypothesis model (in which
all the <I>b</I> values are set to zero) is distributed like chi-squared, with
degrees of freedom equal to the number of predictors; this difference is the
<I>Model chi</I>-<I>square</I> that SPSS refers to. Very conveniently, the
difference between -2<I>LL</I> values for models with successive terms added
also has a chi-squared distribution, so when we use a stepwise procedure, we
can use chi-squared tests to find out if adding one or more extra predictors
significantly improves the fit of our model. <A name=coeffs></A></P>
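<P>Continuing the invented figures used above: with -2<I>LL</I> = 138.6 for the
null model and 110.4 for a model with 9 predictors, the Model chi-square would
be</P><PRE>Model chi-square = 138.6 - 110.4 = 28.2,  df = 9,  p &lt; .001</PRE>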
<LI><B>The interpretation of coefficients</B>
<P>How can we <I>describe</I> the effect of a single regressor in logistic
regression? The fundamental equation for logistic regression tells us that
with all other variables held constant, there is a constant increase of
<I>b</I><SUB>1</SUB> in logit(<I>p</I>) for every 1-unit increase in
<I>x</I><SUB>1</SUB>, and so on. But what does a constant increase in
logit(<I>p</I>) mean? Because the logit transformation is non-linear, it does
not mean a constant increase in <I>p</I>; so the increase in <I>p</I>
associated with a 1-unit increase in <I>x</I><SUB>1</SUB> changes with the
value of <I>x</I><SUB>1</SUB> you begin with. </P>
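<P>A quick numerical illustration of this non-linearity, using
<I>p</I> = exp(logit)/(1 + exp(logit)):</P><PRE>logit(p) rises 0 -> 1:  p rises .50 -> .73  (an increase of .23)
logit(p) rises 2 -> 3:  p rises .88 -> .95  (an increase of .07)</PRE>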
<P>It turns out that a constant increase in logit(<I>p</I>) does have a
reasonably straightforward interpretation. It corresponds to a constant
<I>multiplication</I> (by exp(<I>b</I>)) of the <B>odds</B> that the dependent
variable takes the value 1 rather than 0. So, suppose <I>b</I><SUB>1</SUB>
takes the value 2.30 - we choose this value as an example because exp(2.30)
equals 10, so the arithmetic will be easy. Then if <I>x</I><SUB>1</SUB>
increases by 1, the odds that the dependent variable takes the value 1
increase tenfold. So, with this value of <I>b</I><SUB>1</SUB>, let us suppose
that with all other variables at their mean values, and <I>x</I><SUB>1</SUB>
taking the value 0, we predict a logit(<I>p</I>) of 0; this means that there
is an even chance of the dependent variable taking the value 1. Now suppose
<I>x</I><SUB>1</SUB> increases to 1. The odds that the dependent variable
takes the value 1 are multiplied by 10, going from even (1:1) to 10:1, so
<I>p</I> rises from 0.5 to 10/11, or about 0.91.</P>