Gazdaságtudományi Kar. Gazdaságelméleti és Módszertani Intézet. Logistic regression. Quantitative Statistical Methods. Dr

Gazdaságtudományi Kar

Gazdaságelméleti és Módszertani Intézet

Logistic regression

Quantitative Statistical Methods

Dr. Szilágyi Roland



Connection Analysis

                            Independent variable (x)
Dependent variable (y)      Qualitative      Quantitative
Qualitative                 Crosstabs        Discriminant analysis, logistic regression
Quantitative                ANOVA            Correlation and regression analysis



Logistic regression

• Logistic regression is a multivariate method that predicts the classification of cases into groups on the basis of independent variables. The analysis identifies the independent variables (x) that cause a significant difference between the categories of the dependent variable.
  – Binary (the dependent variable has two categories)
  – Multinomial (the dependent variable has more than two categories)



Logistic regression in practice

• Market research
• Modelling (buy or not buy)
• Segmentation
• Reliability
• Enterprise analysis (default, non-default), etc.



Stages of Analysis

1. General Purpose
2. Assumptions
3. Estimation of Function Coefficients
4. Interpretation of Results
5. Validity Tests



General purposes

• To create the logistic regression function that best separates the categories of the dependent variable as a linear combination of the independent variables.
• To determine whether there is a significant difference among the groups according to the independent variables.
• To determine which independent variables explain most of the differences among the groups.
• To predict the group membership of new cases from their independent variables, based on the experience obtained from a known classification.
• To measure the accuracy of the classification.



The Assumptions for Logistic regression

1. Measure of variables
• The dependent variable should be categorical, with m (at least 2) labelled values (e.g. 1-good student, 2-bad student; or 1-prominent student, 2-average student, 3-bad student).
• Independent variables can be measured on any scale.



The Assumptions for Logistic regression

2. Independence
Not only the explanatory variables but also all cases must be independent of each other. Therefore panel, longitudinal, or pre-test data cannot be used for logistic regression analysis.



The Assumptions for Logistic regression

3. Sample size
As a general rule, the larger the sample size, the more reliable the model and the easier it is to detect significant effects. The ratio of the number of observations to the number of variables is also important. The results generalize better if there are at least 60 observations.



The Assumptions for Logistic regression

4. Multivariate normal distribution
If the variables are normally distributed, the estimation of the parameters is easier, because the parameters can be defined from the density or distribution function. Normality can be checked with histograms of the frequency distributions or with hypothesis tests.



The Assumptions for Logistic regression

5. Multicollinearity
Independent variables should be correlated with the dependent variable, but they should not be strongly correlated with each other, because multicollinearity can distort the results of the analysis (unstable coefficient estimates and inflated standard errors).



Binary Logistic Regression

• The logistic function is useful because it can take any linear combination of the independent variables (x_i) as input, whereas its output always lies between zero and one and hence is interpretable as a probability. The logistic function is defined as follows:

$$F(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$

• Note that F(x) is interpreted as the probability of the dependent variable equaling a "success" or "case" rather than a "failure" or "non-case": $F(x) = P(Y_i = 1 \mid X)$.
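The logistic function defined above can be sketched in a few lines of Python. This is only an illustration: the coefficient values and inputs are hypothetical, and the function name `logistic` is not part of the original material.

```python
import math

def logistic(intercept, coefs, xs):
    """Return F(x) = e^z / (1 + e^z), where z = b0 + b1*x1 + ... + bp*xp."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    # Use the algebraically equivalent form that avoids overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients and inputs, for illustration only.
b0, b = -1.0, [0.5, -0.25]
for xs in ([0.0, 0.0], [4.0, 2.0], [100.0, 0.0], [-100.0, 0.0]):
    # Whatever the inputs, the result stays within [0, 1].
    print(xs, logistic(b0, b, xs))
```

Even for extreme inputs the output stays in the unit interval, which is what allows it to be read as a probability.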



Binary Logistic Regression

• We can now define the inverse of the logistic function, the "logit" (log odds):

$$Y = \ln\left(\frac{F(x)}{1 - F(x)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

and, after exponentiating,

$$\mathrm{odds}_x = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}$$



Binary Logistic Regression

• The odds of the dependent variable equaling a case (given some linear combination x of the predictors) are equivalent to the exponential function of the linear regression expression:

$$\mathrm{odds}_x = \frac{P_x}{1 - P_x}$$
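The relationship between probability, odds, and the logit can be checked numerically. The following is a minimal sketch; the helper names and the probability value 0.8 are assumptions made for illustration.

```python
import math

def odds_from_prob(p):
    """odds = P / (1 - P)"""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """P = odds / (1 + odds), the inverse of odds_from_prob"""
    return odds / (1.0 + odds)

def logit(p):
    """Log odds: ln(P / (1 - P))"""
    return math.log(odds_from_prob(p))

p = 0.8                      # hypothetical probability of a "case"
odds = odds_from_prob(p)     # about 4, i.e. odds of roughly 4 to 1
z = logit(p)                 # value of the linear predictor giving this P
# Exponentiating the logit recovers the odds, and the odds recover P.
print(odds, z, prob_from_odds(math.exp(z)))
```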



Binary Logistic Regression

$$\mathrm{odds}_x = \frac{P_x}{1 - P_x}$$

$$P_x = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$



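Maximum Likelihood Method

• The maximum likelihood method finds a set of coefficients (β), called the maximum likelihood estimates, at which the likelihood function (and therefore also the log-likelihood function) attains its maximum:

$$L = \prod_{i=1}^{n} \frac{e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}}} \rightarrow \max$$

where $x_{i1}, \dots, x_{ip}$ are the predictor values of case i and $y_i \in \{0, 1\}$ is its observed outcome.

Source: Hajdu Ottó: Többváltozós statisztikai számítások; KSH, Budapest, 2003.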
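To make the idea of maximizing the likelihood concrete, here is a minimal sketch that maximizes the log-likelihood by plain gradient ascent. It is not the algorithm used by SPSS or by the cited book (statistical packages typically use Newton-type iterations); the toy data, the learning rate, and the function names are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, X, y):
    """ln L = sum_i [ y_i * z_i - ln(1 + e^{z_i}) ], with z_i = beta . x_i"""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * v for b, v in zip(beta, xi))
        total += yi * z - math.log(1.0 + math.exp(z))
    return total

def fit(X, y, rate=0.1, steps=5000):
    """Gradient ascent on the log-likelihood; returns the estimated beta."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j, v in enumerate(xi):
                grad[j] += (yi - p) * v      # d ln L / d beta_j
        beta = [b + rate * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data; the leading 1.0 in each row is the intercept column.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

beta_hat = fit(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))
```

The toy outcomes overlap (a 1 at x = 2.0 and a 0 at x = 2.5), so the groups are not perfectly separated and the maximum is attained at finite coefficient values.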



Tests of Model Fit

The Binary Logistic Regression procedure reports the Hosmer-Lemeshow goodness-of-fit statistic, which helps to determine whether the model adequately describes the data.

H0: the model fits
H1: the model does not fit

The Hosmer-Lemeshow test groups the cases into subgroups, typically deciles of the fitted risk values. Models for which the expected and observed event rates in the subgroups are similar (compared with a chi-square statistic) are called well fitted (well calibrated).
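The grouping logic of the Hosmer-Lemeshow test can be sketched as follows. This is a simplified illustration, not the SPSS implementation: the handling of ties and group boundaries is an assumption, the fitted probabilities and outcomes are hypothetical, and the p-value (from a chi-square distribution with groups minus 2 degrees of freedom) is not computed here.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees of freedom) for 0/1 outcomes y and
    fitted probabilities p, using equal-sized groups of sorted risk."""
    pairs = sorted(zip(p, y))                 # order cases by fitted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        expected = sum(pi for pi, _ in chunk)   # expected number of events
        observed = sum(yi for _, yi in chunk)   # observed number of events
        mean_p = expected / n_g
        # Assumes 0 < mean_p < 1 in every group, otherwise this divides by zero.
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return stat, groups - 2

# Hypothetical fitted probabilities and observed outcomes.
fitted = [0.05, 0.10, 0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,
          0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.90, 0.95]
observed = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]

stat, df = hosmer_lemeshow(observed, fitted, groups=5)
print(stat, df)
```

A small statistic relative to its degrees of freedom (and hence a large Sig. value) means H0 is not rejected, i.e. the model is considered to fit.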



Testing of parameters (β)

$$H_0: \beta_i = 0 \qquad H_1: \beta_i \neq 0$$

$$\mathrm{Wald}_i = \left(\frac{b_i}{s(b_i)}\right)^2$$
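As a quick numerical check of the Wald formula, the sketch below computes the statistic and its significance, assuming (as is standard) that it is compared with a chi-square distribution with 1 degree of freedom. The inputs are the rounded B and S.E. values from the "Variables in the Equation" output of this example, so the results differ slightly from the Wald values printed by SPSS, which uses unrounded estimates. The function name is hypothetical.

```python
import math

def wald_test(b, se):
    """Wald statistic (b / se)^2 and its p-value for 1 degree of freedom."""
    wald = (b / se) ** 2
    # Upper tail of a chi-square(1) variable: P(Z^2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Step 1, debt to income ratio: B = 0.121, S.E. = 0.017 (reported Wald 52.676).
wald_debt, p_debt = wald_test(0.121, 0.017)
# Step 4, constant: B = -0.605, S.E. = 0.301 (reported Wald 4.034, Sig. 0.045).
wald_const, p_const = wald_test(-0.605, 0.301)
print(wald_debt, p_debt)
print(wald_const, p_const)
```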



Choosing the Right Model

• Based on the residual sum of squares (as in linear regression)
• Based on the likelihood ratio (comparing the likelihood of the model with the likelihood of a baseline (minimal) model)
• Based on the proportion of good predictions
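The likelihood ratio comparison can be illustrated with the -2 log likelihood values reported in the Model Summary output of this example (498.012 at step 1 and 447.301 at step 2). The sketch assumes that the two models are nested and differ by exactly one parameter, so the difference is compared with a chi-square distribution with 1 degree of freedom; the function name is hypothetical.

```python
import math

def likelihood_ratio_test_1df(dev_smaller_model, dev_larger_model):
    """Compare two nested models that differ by one parameter.

    The arguments are the -2 log likelihood values of the two models.
    """
    lr = dev_smaller_model - dev_larger_model
    if lr < 0:
        raise ValueError("the larger model cannot fit worse than the smaller one")
    p_value = math.erfc(math.sqrt(lr / 2.0))   # chi-square(1) upper tail
    return lr, p_value

lr, p_value = likelihood_ratio_test_1df(498.012, 447.301)
print(lr, p_value)
```

A large drop in -2 log likelihood with a very small p-value suggests that the added variable improves the model.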



Pseudo R²

• Cox and Snell's R² is based on the log likelihood for the model compared to the log likelihood for a baseline model. However, with categorical outcomes, it has a theoretical maximum value of less than 1, even for a "perfect" model.
• Nagelkerke's R² is an adjusted version of the Cox and Snell R² that rescales the statistic to cover the full range from 0 to 1.
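Both measures can be reproduced from the -2 log likelihood values of the example. The sketch below rests on two assumptions that the slides do not state explicitly: the baseline is the intercept-only model, and n is the number of selected cases (375 non-defaulters plus 124 defaulters, as in the Step 0 classification table). The function names are hypothetical.

```python
import math

def null_deviance(n_events, n_nonevents):
    """-2 log likelihood of the intercept-only (baseline) model."""
    n = n_events + n_nonevents
    p = n_events / n
    return -2.0 * (n_events * math.log(p) + n_nonevents * math.log(1.0 - p))

def pseudo_r2(dev_null, dev_model, n):
    """Return (Cox and Snell R^2, Nagelkerke R^2)."""
    cox_snell = 1.0 - math.exp(-(dev_null - dev_model) / n)
    max_cox_snell = 1.0 - math.exp(-dev_null / n)   # upper bound, below 1
    return cox_snell, cox_snell / max_cox_snell

n = 375 + 124
dev0 = null_deviance(n_events=124, n_nonevents=375)
# -2 log likelihood values of steps 1 to 4 from the Model Summary output.
for step, dev in enumerate([498.012, 447.301, 411.553, 394.721], start=1):
    cs, nk = pseudo_r2(dev0, dev, n)
    print(step, round(cs, 3), round(nk, 3))
```

Under these assumptions the results agree with the reported Model Summary values to three decimals, which also shows why Nagelkerke's R² is always at least as large as Cox and Snell's.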



Example

• If you are a loan officer at a bank, you want to be able to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad credit risks.

Variables:
• Age in years
• Level of education
• Years with current employer
• Years at current address
• Household income in thousands
• Debt to income ratio (x100)
• Credit card debt in thousands
• Other debt in thousands
• Previously defaulted



Outputs

Classification Table (a, b), Step 0

                                Selected cases                  Unselected cases
                                Predicted       Percentage      Predicted       Percentage
Observed: Previously defaulted  No      Yes     correct         No      Yes     correct
No                              375     0       100.0           142     0       100.0
Yes                             124     0       0.0             59      0       0.0
Overall percentage                              75.2                            70.6

a. Constant is included in the model.
b. The cut value is 0.500.

Source: Help - IBM SPSS Statistics



Hosmer and Lemeshow Test

Step    Chi-square    df    Sig.
1       3.292         8     0.915
2       11.866        8     0.157
3       9.447         8     0.306
4       4.027         8     0.855

Source: Help - IBM SPSS Statistics



Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       498.012              0.116                   0.172
2       447.301              0.201                   0.299
3       411.553              0.257                   0.381
4       394.721              0.281                   0.417

Source: Help - IBM SPSS Statistics



Classification Table

                                Selected cases                  Unselected cases
        Observed:               Predicted       Percentage      Predicted       Percentage
Step    Previously defaulted    No      Yes     correct         No      Yes     correct
1       No                      361     14      96.3            137     5       96.5
        Yes                     100     24      19.4            45      14      23.7
        Overall percentage                      77.2                            75.1
2       No                      351     24      93.6            136     6       95.8
        Yes                     80      44      35.5            36      23      39.0
        Overall percentage                      79.2                            79.1
3       No                      348     27      92.8            135     7       95.1
        Yes                     72      52      41.9            28      31      52.5
        Overall percentage                      80.2                            82.6
4       No                      352     23      93.9            130     12      91.5
        Yes                     67      57      46.0            27      32      54.2
        Overall percentage                      82.0                            80.6



Classification table (Confusion matrix)

                    Predicted no (0)                Predicted yes (1)
Observed no (0)     True negative (TN)              False positive (FP), Type I     → specificity TN/(TN+FP)
Observed yes (1)    False negative (FN), Type II    True positive (TP)              → sensitivity TP/(FN+TP)
                    ↓ negative predictive value     ↓ positive predictive value
                      TN/(TN+FN)                      (precision) TP/(FP+TP)

Accuracy: (TP+TN)/(TN+FP+FN+TP)
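The measures in the confusion matrix can be computed directly from the four cell counts. The sketch below uses the Step 4 counts for the selected cases from the classification table of this example (352 true negatives, 23 false positives, 67 false negatives, 57 true positives); the function name is hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return the standard measures derived from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (fn + tp),
        "precision": tp / (fp + tp),                   # positive predictive value
        "negative_predictive_value": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
    }

# Step 4, selected cases.
metrics = classification_metrics(tn=352, fp=23, fn=67, tp=57)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

The specificity, sensitivity, and accuracy correspond to the "Percentage correct" entries of the SPSS table. The model recognizes non-defaulters far better than defaulters, which the overall percentage alone does not reveal.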



Variables in the Equation

                                                                                        95% C.I. for Exp(B)
Step    Variable                         B         S.E.     Wald       df    Sig.     Exp(B)    Lower    Upper
1       Debt to income ratio (x100)      0.121     0.017    52.676     1     0.000    1.129     1.092    1.166
        Constant                         -2.476    0.230    116.315    1     0.000    0.084
2       Years with current employer      -0.140    0.023    38.158     1     0.000    0.869     0.831    0.909
        Debt to income ratio (x100)      0.134     0.018    54.659     1     0.000    1.143     1.103    1.185
        Constant                         -1.621    0.259    39.038     1     0.000    0.198
3       Years with current employer      -0.244    0.033    54.676     1     0.000    0.783     0.734    0.836
        Debt to income ratio (x100)      0.069     0.022    9.809      1     0.002    1.072     1.026    1.119
        Credit card debt in thousands    0.506     0.101    25.127     1     0.000    1.658     1.361    2.021
        Constant                         -1.058    0.280    14.249     1     0.000    0.347
4       Years with current employer      -0.247    0.034    51.826     1     0.000    0.781     0.731    0.836
        Years at current address         -0.089    0.023    15.109     1     0.000    0.915     0.875    0.957
        Debt to income ratio (x100)      0.072     0.023    10.040     1     0.002    1.074     1.028    1.123
        Credit card debt in thousands    0.602     0.111    29.606     1     0.000    1.826     1.470    2.269
        Constant                         -0.605    0.301    4.034      1     0.045    0.546
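The Step 4 coefficients of the output above can be plugged into the logistic function to score a new case. The following sketch is illustrative only: the two applicants are hypothetical, the dictionary keys are invented labels, and the 0.5 threshold is the cut value stated in the Step 0 output.

```python
import math

# Step 4 coefficients (B) from the "Variables in the Equation" output.
STEP4 = {
    "constant": -0.605,
    "years_employer": -0.247,
    "years_address": -0.089,
    "debt_to_income": 0.072,
    "credit_card_debt": 0.602,
}

def default_probability(case, coefs=STEP4):
    """P = e^z / (1 + e^z) with z = b0 + sum of b_j * x_j."""
    z = coefs["constant"] + sum(coefs[name] * value for name, value in case.items())
    return 1.0 / (1.0 + math.exp(-z))

# Two hypothetical applicants (values invented for illustration).
applicant_a = {"years_employer": 5, "years_address": 3,
               "debt_to_income": 15, "credit_card_debt": 2.0}
applicant_b = {"years_employer": 15, "years_address": 10,
               "debt_to_income": 5, "credit_card_debt": 0.5}

for label, case in (("A", applicant_a), ("B", applicant_b)):
    p = default_probability(case)
    print(label, round(p, 3), "default" if p >= 0.5 else "no default")
```

Applicant A, with a short employment history and high debt, is classified as a likely defaulter; applicant B is not.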



Meaning of coefficients

• The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret. Exp(B) represents the ratio change in the odds of the event of interest for a one-unit change in the predictor (x_i), ceteris paribus (all other things being equal).

Source: Help - IBM SPSS Statistics
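The link between B and Exp(B) can be verified with the Step 1 row of the example output (debt to income ratio, B = 0.121, S.E. = 0.017). The sketch assumes the usual normal-approximation interval, exp(B ± 1.96 · S.E.); because the inputs are rounded, the results match the reported values only to about three decimals. The function name is hypothetical.

```python
import math

def odds_ratio_with_ci(b, se, z=1.96):
    """Return Exp(B) and an approximate 95% confidence interval for it."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

exp_b, lower, upper = odds_ratio_with_ci(0.121, 0.017)
print(exp_b, lower, upper)

# A one-unit increase in the debt to income ratio multiplies the odds of
# default by Exp(B); a five-unit increase multiplies them by Exp(B)**5.
print(exp_b ** 5)
```

Since the interval does not contain 1, the odds change significantly with the predictor, in line with the Wald test. A value of Exp(B) below 1 (for example 0.781 for years with current employer at step 4) means that the odds decrease as the predictor increases.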






Thank you for your attention!
