• Gazdaságtudományi Kar •
Gazdaságelméleti és Módszertani Intézet
Logistic regression
Quantitative Statistical Methods Dr. Szilágyi Roland
Connection Analysis

                          Independent variable (x)
Dependent variable (y)    Qualitative    Quantitative
Qualitative               Crosstabs      Discriminant analysis, logistic regression
Quantitative              ANOVA          Correlation and regression analysis
Logistic regression
• Logistic regression is a multivariate method that predicts the classification of cases into groups on the basis of independent variables. The analysis identifies the independent variables (x) that cause a significant difference between the categories of the dependent variable.
– Binary (the dependent variable has two categories)
– Multinomial (the dependent variable has more than two categories)
Logistic regression in practice
• Market research
• Modelling (buy or not buy)
• Segmentation
• Reliability
• Enterprise analysis (default, non-default), etc.
Stages of Analysis
1. General Purpose
2. Assumptions
3. Estimation of Function Coefficients
4. Interpretation of Results
5. Validity Tests
General purposes
• To create the logistic regression function that best separates the categories of the dependent variable as a linear combination of the independent variables.
• To determine whether there is a significant difference among the groups according to the independent variables.
• To determine which independent variables explain most of the differences among the groups.
• To predict the group membership of new cases from their independent variables, based on the experience obtained from a known classification.
• To measure the accuracy of the classification.
The Assumptions for Logistic regression
1. Measure of variables
• The dependent variable should be categorical, with m (at least 2) labelled categories (e.g. 1 = good student, 2 = bad student; or 1 = prominent student, 2 = average student, 3 = bad student).
• Independent variables can be measured on any scale.
The Assumptions for Logistic regression
2. Independence
Not only the explanatory variables, but also all cases must be independent. Therefore, panel, longitudinal research, or pre-test data cannot be used for logistic regression analysis.
The Assumptions for Logistic regression
3. Sample size
As a general rule, the larger the sample, the more reliable the model and the easier it is to detect significant effects. The ratio of the number of observations to the number of variables is also important. The results generalize better if we have at least 60 observations.
The Assumptions for Logistic regression
4. Multivariate normal distribution
In the case of a normal distribution, the estimation of the parameters is easier, because the parameters can be defined according to the density or distribution function. Normality can be checked with histograms of the frequency distributions or with hypothesis tests.
The Assumptions for Logistic regression
5. Multicollinearity
The independent variables should be correlated with the dependent variable, but they should not be strongly correlated with each other, because correlation among the independent variables can bias the results of the analysis.
Binary Logistic Regression
• The logistic function is useful because it can take any linear combination of the independent variables (x_i) as input, whereas the output always takes values between zero and one and hence is interpretable as a probability. The logistic function is defined as follows:

$$F(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$

• Note that F(x) is interpreted as the probability of the dependent variable equaling a "success" or "case" rather than a failure or non-case: $F(x) = P(Y_i = 1 \mid X)$.
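The logistic function above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides use SPSS, and the function names (`linear_predictor`, `logistic`) as well as the coefficient and predictor values are hypothetical, not taken from the source.

```python
import math

def linear_predictor(beta0, betas, xs):
    """Return beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

def logistic(eta):
    """Map a real-valued linear predictor to a value between 0 and 1."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)  # this branch avoids overflow for large negative eta
    return z / (1.0 + z)

# Hypothetical coefficients and predictor values, for illustration only.
eta = linear_predictor(-1.0, [0.5, 0.25], [2.0, 4.0])  # -1 + 1 + 1 = 1
p = logistic(eta)
print(f"linear predictor = {eta:.3f}, F(x) = {p:.4f}")
```

Whatever the sign or size of the linear predictor, the result can be read as P(Y = 1 | X).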
Binary Logistic Regression
• We can now define the inverse of the logistic function, the "logit" (log odds):

$$Y = \ln\frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

After exponentiating:

$$\text{odds}_x = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}$$
Binary Logistic Regression
• "The odds of the dependent variable equaling a case (given some linear combination x of the predictors) is equivalent to the exponential function of the linear regression expression."

$$\text{odds}_x = \frac{P_x}{1 - P_x}$$
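Binary Logistic Regression

$$\text{odds}_x = \frac{P_x}{1 - P_x}$$

$$P_x = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$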
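The relationship between probability, odds and log odds can be checked numerically. The following Python sketch is illustrative only; the function names and the probability value 0.2 are assumptions chosen for the example.

```python
import math

def odds_from_probability(p):
    """odds = P / (1 - P)."""
    return p / (1.0 - p)

def logit(p):
    """Log odds, the inverse of the logistic function."""
    return math.log(odds_from_probability(p))

def probability_from_odds(odds):
    """P = odds / (1 + odds), the same as e^eta / (1 + e^eta) when odds = e^eta."""
    return odds / (1.0 + odds)

p = 0.2                                   # hypothetical probability of a "case"
odds = odds_from_probability(p)           # 1 to 4
log_odds = logit(p)                       # value of the linear predictor
p_back = probability_from_odds(math.exp(log_odds))
print(odds, log_odds, p_back)
```

Exponentiating the log odds and converting back recovers the original probability, which is the round trip the formulas on these slides describe.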
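Maximum Likelihood Method
• The maximum likelihood method finds a set of coefficients (β), called the maximum likelihood estimates, at which the likelihood function (and hence the log-likelihood function) attains its maximum:

$$L = \prod_{i=1}^{n} \frac{e^{y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}}} \rightarrow \max$$

where $y_i \in \{0, 1\}$ is the observed outcome of case i.

Source: Hajdu Ottó: Többváltozós statisztikai számítások; KSH, Budapest, 2003.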
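To make the estimation step concrete, here is a minimal Python sketch that maximizes the log-likelihood of a model with one predictor by Newton-Raphson iteration. It is not the algorithm of any particular package; the function names and the eight data points are made up for illustration.

```python
import math

def predict(b0, b1, x):
    """P(y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1, xs, ys):
    """Sum over cases of y*eta - ln(1 + e^eta)."""
    return sum(y * (b0 + b1 * x) - math.log(1.0 + math.exp(b0 + b1 * x))
               for x, y in zip(xs, ys))

def fit_logistic(xs, ys, iterations=25, tol=1e-10):
    """Newton-Raphson search for the coefficients that maximize the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0             # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0     # negative Hessian (information matrix)
        for x, y in zip(xs, ys):
            p = predict(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: information^-1 * gradient
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up, non-separable toy data (x = predictor, y = 0/1 outcome).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(b0, b1, log_likelihood(b0, b1, xs, ys))
```

At the returned estimates the gradient is practically zero, and the log-likelihood is higher than at the starting point b0 = b1 = 0.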
Tests of Model Fit
The Binary Logistic Regression procedure reports the Hosmer-Lemeshow goodness-of-fit statistic, which helps to determine whether the model adequately describes the data.
H0: the model fits
H1: the model does not fit
The Hosmer-Lemeshow test groups the cases into subgroups, typically the deciles of the fitted risk values. Models for which the expected and observed event rates in the subgroups are similar (compared with a chi-square statistic) are called well fitted (well calibrated).
Testing of parameters (β)

$$H_0: \beta_i = 0 \qquad H_1: \beta_i \neq 0$$

$$\text{Wald}_i = \left(\frac{b_i}{s(b_i)}\right)^2$$
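As a small worked example, the Wald statistic can be recomputed from a coefficient and its standard error. The Python sketch below uses the Step 4 constant from the Variables in the Equation output of this deck (B = -0.605, S.E. = 0.301); the function names are illustrative, and the p-value formula assumes a chi-square distribution with 1 degree of freedom.

```python
import math

def wald_statistic(b, se):
    """Wald = (b / s(b))^2."""
    return (b / se) ** 2

def p_value_df1(chi_square):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))

w = wald_statistic(-0.605, 0.301)
p = p_value_df1(w)
print(f"Wald = {w:.3f}, Sig. = {p:.3f}")
```

The result is close to the reported Wald of 4.034 with Sig. 0.045; the small gap in the Wald value presumably comes from the rounding of B and S.E. in the printed table.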
Choosing the Right Model
• Based on the residual sum of squares (linear regression)
• Based on the likelihood ratio (compare the likelihood of the model with the likelihood of a baseline (minimal) model)
• Proportion of good predictions
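The likelihood-ratio comparison can be sketched in Python using the -2 Log likelihood values from the Model Summary output of this deck (498.012 at Step 1, 447.301 at Step 2). The function names are illustrative, and the 1 degree of freedom is an assumption that holds because Step 2 adds exactly one predictor to the Step 1 model.

```python
import math

def likelihood_ratio_chi_square(minus2ll_smaller_model, minus2ll_larger_model):
    """Drop in -2 log-likelihood when predictors are added to a nested model."""
    return minus2ll_smaller_model - minus2ll_larger_model

def p_value_df1(chi_square):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))

lr = likelihood_ratio_chi_square(498.012, 447.301)
p = p_value_df1(lr)
print(f"LR chi-square = {lr:.3f}, p = {p:.2e}")
```

A large drop in -2 Log likelihood relative to the added degrees of freedom indicates that the larger model fits significantly better.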
Pseudo R²
• Cox and Snell's R² is based on the log likelihood for the model compared to the log likelihood for a baseline model. However, with categorical outcomes, it has a theoretical maximum value of less than 1, even for a "perfect" model.
• Nagelkerke's R² is an adjusted version of the Cox & Snell R² that adjusts the scale of the statistic to cover the full range from 0 to 1.
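Both measures can be reproduced from numbers reported in the outputs of this deck. The Python sketch below assumes that the baseline is the intercept-only model for the 499 selected cases (375 non-defaulters, 124 defaulters, taken from the Step 0 classification table) and uses the Step 1 value of -2 Log likelihood (498.012); the function names are illustrative.

```python
import math

def baseline_minus2ll(n_no, n_yes):
    """-2 log-likelihood of the intercept-only model, which predicts the overall rate."""
    n = n_no + n_yes
    return -2.0 * (n_no * math.log(n_no / n) + n_yes * math.log(n_yes / n))

def cox_snell_r2(minus2ll_baseline, minus2ll_model, n):
    """1 - (L_baseline / L_model)^(2/n), written in terms of -2 log-likelihoods."""
    return 1.0 - math.exp(-(minus2ll_baseline - minus2ll_model) / n)

def nagelkerke_r2(minus2ll_baseline, minus2ll_model, n):
    """Cox & Snell R2 divided by its theoretical maximum."""
    maximum = 1.0 - math.exp(-minus2ll_baseline / n)
    return cox_snell_r2(minus2ll_baseline, minus2ll_model, n) / maximum

n_no, n_yes = 375, 124
n = n_no + n_yes
d0 = baseline_minus2ll(n_no, n_yes)
cs = cox_snell_r2(d0, 498.012, n)
nk = nagelkerke_r2(d0, 498.012, n)
print(f"Cox & Snell = {cs:.3f}, Nagelkerke = {nk:.3f}")
```

Under these assumptions the values agree with the Step 1 row of the Model Summary (0.116 and 0.172).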
Example
• If you are a loan officer at a bank, then you want to be able to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad credit risks.

Variables
• Age in years
• Level of education
• Years with current employer
• Years at current address
• Household income in thousands
• Debt to income ratio (x100)
• Credit card debt in thousands
• Other debt in thousands
• Previously defaulted
Outputs

Classification Table (a, b), Step 0

                                 Selected cases               Unselected cases
                                 Predicted                    Predicted
Observed: previously defaulted   No     Yes    % correct      No     Yes    % correct
No                               375    0      100.0          142    0      100.0
Yes                              124    0      0.0            59     0      0.0
Overall percentage                             75.2                         70.6

a. Constant is included in the model.
b. The cut value is 0.500.
Source: Help - IBM SPSS Statistics
Hosmer and Lemeshow Test

Step    Chi-square    df    Sig.
1       3.292         8     0.915
2       11.866        8     0.157
3       9.447         8     0.306
4       4.027         8     0.855

Source: Help - IBM SPSS Statistics
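The Sig. column can be recomputed from the chi-square value and the degrees of freedom. The Python sketch below is illustrative: the function name is an assumption, and it relies on the closed-form upper-tail probability of the chi-square distribution, which is available when the degrees of freedom are even (here df = 8).

```python
import math

def chi_square_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k      # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

# Chi-square values from the Hosmer and Lemeshow Test output, steps 1 to 4.
chi_squares = [3.292, 11.866, 9.447, 4.027]
sig = [chi_square_upper_tail_even_df(c, 8) for c in chi_squares]
for step, (c, s) in enumerate(zip(chi_squares, sig), start=1):
    print(f"Step {step}: chi-square = {c:.3f}, Sig. = {s:.3f}")
```

Every Sig. value is well above 0.05, so H0 (the model fits) is not rejected at any step.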
Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       498.012              0.116                   0.172
2       447.301              0.201                   0.299
3       411.553              0.257                   0.381
4       394.721              0.281                   0.417

Source: Help - IBM SPSS Statistics
Classification Table

                                         Selected cases               Unselected cases
                                         Predicted                    Predicted
Step    Observed: previously defaulted   No     Yes    % correct      No     Yes    % correct
1       No                               361    14     96.3           137    5      96.5
        Yes                              100    24     19.4           45     14     23.7
        Overall percentage                             77.2                         75.1
2       No                               351    24     93.6           136    6      95.8
        Yes                              80     44     35.5           36     23     39.0
        Overall percentage                             79.2                         79.1
3       No                               348    27     92.8           135    7      95.1
        Yes                              72     52     41.9           28     31     52.5
        Overall percentage                             80.2                         82.6
4       No                               352    23     93.9           130    12     91.5
        Yes                              67     57     46.0           27     32     54.2
        Overall percentage                             82.0                         80.6
Classification table (Confusion matrix)

                    Predicted: no (0)              Predicted: yes (1)
Observed: no (0)    True negative (TN)             False positive (FP), Type I     → specificity TN/(TN+FP)
Observed: yes (1)   False negative (FN), Type II   True positive (TP)              → sensitivity TP/(FN+TP)
                    ↓ negative predictive value    ↓ positive predictive value
                      TN/(TN+FN)                     (precision) TP/(FP+TP)

Accuracy: (TP+TN)/(TN+FP+FN+TP)
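The measures of the confusion matrix can be computed directly from the four cell counts. The Python sketch below applies them to the Step 4 classification of the selected cases from this deck (TN = 352, FP = 23, FN = 67, TP = 57); the function name and dictionary keys are illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return the standard measures derived from a 2x2 classification table."""
    return {
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (fn + tp),
        "negative_predictive_value": tn / (tn + fn),
        "precision": tp / (fp + tp),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }

# Step 4, selected cases: observed "No" row = (352, 23), observed "Yes" row = (67, 57).
metrics = classification_metrics(tn=352, fp=23, fn=67, tp=57)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Specificity, sensitivity and accuracy correspond to the "% correct" entries of the SPSS table (93.9, 46.0 and 82.0), while the two predictive values are not reported there.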
Variables in the Equation

                                                                                95% C.I. for Exp(B)
Step  Variable                        B        S.E.    Wald      df   Sig.    Exp(B)   Lower    Upper
1     Debt to income ratio (x100)     0.121    0.017   52.676    1    0.000   1.129    1.092    1.166
      Constant                        -2.476   0.230   116.315   1    0.000   0.084
2     Years with current employer     -0.140   0.023   38.158    1    0.000   0.869    0.831    0.909
      Debt to income ratio (x100)     0.134    0.018   54.659    1    0.000   1.143    1.103    1.185
      Constant                        -1.621   0.259   39.038    1    0.000   0.198
3     Years with current employer     -0.244   0.033   54.676    1    0.000   0.783    0.734    0.836
      Debt to income ratio (x100)     0.069    0.022   9.809     1    0.002   1.072    1.026    1.119
      Credit card debt in thousands   0.506    0.101   25.127    1    0.000   1.658    1.361    2.021
      Constant                        -1.058   0.280   14.249    1    0.000   0.347
4     Years with current employer     -0.247   0.034   51.826    1    0.000   0.781    0.731    0.836
      Years at current address        -0.089   0.023   15.109    1    0.000   0.915    0.875    0.957
      Debt to income ratio (x100)     0.072    0.023   10.040    1    0.002   1.074    1.028    1.123
      Credit card debt in thousands   0.602    0.111   29.606    1    0.000   1.826    1.470    2.269
      Constant                        -0.605   0.301   4.034     1    0.045   0.546
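To show how the estimated equation is used for prediction, the Python sketch below plugs the Step 4 coefficients into the logistic function. The applicant's values, the dictionary keys and the function name are hypothetical; the 0.5 cut value is the one stated in the Step 0 classification table and is assumed to apply to the later steps as well.

```python
import math

# Step 4 coefficients (B column) from the Variables in the Equation output.
STEP4_CONSTANT = -0.605
STEP4_COEFFICIENTS = {
    "years_with_current_employer": -0.247,
    "years_at_current_address": -0.089,
    "debt_to_income_ratio_x100": 0.072,
    "credit_card_debt_thousands": 0.602,
}

def default_probability(applicant):
    """Estimated probability of default for one applicant."""
    eta = STEP4_CONSTANT + sum(
        coefficient * applicant[name]
        for name, coefficient in STEP4_COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical applicant, made up for illustration.
applicant = {
    "years_with_current_employer": 10,
    "years_at_current_address": 5,
    "debt_to_income_ratio_x100": 10,
    "credit_card_debt_thousands": 2,
}
p = default_probability(applicant)
predicted_default = p >= 0.5   # classification with a cut value of 0.5
print(f"P(default) = {p:.3f}, predicted default: {predicted_default}")
```

For this applicant the linear predictor is -1.596, so the estimated probability stays below the cut value and the case is classified as "No".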
Meaning of coefficients
• The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret. Exp(B) represents the ratio change in the odds of the event of interest for a one-unit change in the predictor (x_i), ceteris paribus (all other things being equal).
Source: Help - IBM SPSS Statistics
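A short Python sketch shows how Exp(B) and its confidence interval follow from B and its standard error. It uses the Step 1 row for the debt to income ratio (B = 0.121, S.E. = 0.017); the function name is illustrative, and the interval assumes the usual normal approximation with z = 1.96.

```python
import math

def odds_ratio_with_ci(b, se, z=1.96):
    """Return Exp(B) and the bounds exp(B - z*SE), exp(B + z*SE)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

odds_ratio, lower, upper = odds_ratio_with_ci(0.121, 0.017)
print(f"Exp(B) = {odds_ratio:.3f}, 95% C.I. = ({lower:.3f}, {upper:.3f})")
```

An Exp(B) of about 1.129 means that each additional point of the debt to income ratio multiplies the odds of default by roughly 1.13, all other things being equal. The bounds differ from the reported 1.092 and 1.166 only in the third decimal, presumably because the printed B and S.E. are rounded.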
Thank you for your attention!