Logistic Regression #
Logistic regression is used to predict a binary outcome variable (e.g., Will this customer churn? Will a consumer purchase a product?). It extends the linear regression framework to the case where the dependent variable is binary. Logistic regression is widely applied in:
- Influencing-Factor Analysis (identifying which predictors affect the outcome)
- Outcome Prediction (similar to regression analysis)
The core idea of logistic regression is to estimate the probability of a certain outcome (1 or 0) given predictor variables.
The logistic regression equation can be written as:
[ \text{logit}(P) = \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n ]
where:
- ( P ) is the probability of the outcome occurring (Y = 1),
- ( X_1, X_2, \dots, X_n ) are the independent variables,
- ( \beta_0, \beta_1, \dots, \beta_n ) are the coefficients to be estimated.
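To make the equation concrete, here is a minimal Python sketch (the intercept and coefficients are made-up values, not output from any real model) that computes the logit and converts it to a probability with the logistic function:

```python
import math

def predict_probability(intercept, coefs, xs):
    """Estimate P(Y = 1) from a logistic regression equation."""
    # Linear predictor: beta_0 + beta_1*x_1 + ... + beta_n*x_n
    logit = intercept + sum(b * x for b, x in zip(coefs, xs))
    # Logistic transform: P = 1 / (1 + e^(-logit))
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up coefficients for two predictors (not from a real fit)
p = predict_probability(intercept=-1.0, coefs=[0.5, 0.3], xs=[2.0, 1.0])
print(f"P(Y = 1) = {p:.3f}")  # 0.574 with these illustrative values
```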
Assumptions of Logistic Regression #
Before performing logistic regression, the following assumptions should be met:
- Binary Dependent Variable: The outcome variable must be binary (e.g., yes/no, 1/0).
- At Least One Independent Variable: The predictor variables can be continuous or categorical.
- Independence of Observations: Each observation must be independent, and the classifications should be exhaustive and mutually exclusive.
- Sample Size: A common rule of thumb is a minimum of 15 cases per predictor variable; some researchers recommend at least 50 cases per predictor.
- Linear Relationship between Predictors and Logit: There should be a linear relationship between the continuous predictors and the logit transformation of the dependent variable.
- No Multicollinearity: The independent variables should not be highly correlated with each other.
- No Outliers or Influential Points: The data should not contain extreme outliers, leverage points, or influential points.
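As one concrete example of checking these assumptions in code, the sketch below screens for multicollinearity using variance inflation factors (VIFs) from statsmodels. The data are invented, and the cutoff of 10 is a common rule of thumb rather than a hard rule:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# Invented predictor data; substitute your own columns
df = pd.DataFrame({
    "age":    [25, 30, 35, 40, 45, 50],
    "income": [30, 42, 55, 61, 72, 80],
})

# VIF is computed against a design matrix that includes an intercept
X = add_constant(df)
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    flag = "  <- possible multicollinearity" if vif > 10 else ""
    print(f"{name}: VIF = {vif:.2f}{flag}")
```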
Key Tables in Logistic Regression Output #
When performing logistic regression analysis, the output contains several key tables. We will focus on three main tables:
1. Omnibus Tests of Model Coefficients

This table tests whether all the coefficients in the model are equal to zero. If the p-value is less than 0.05, at least one of the predictors has a statistically significant relationship with the dependent variable, meaning the model as a whole is significant.
Example Output:
| Model | Chi-Square | df | Sig.  |
|-------|------------|----|-------|
| 1     | 15.43      | 3  | 0.002 |

Here, a p-value of 0.002 indicates that the model is significant.
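The omnibus chi-square is a likelihood-ratio test of the fitted model against an intercept-only model. If you want to reproduce it outside SPSS, here is a sketch using statsmodels and scipy with simulated data:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Simulated data: 100 observations, two predictors, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100) > 0).astype(int)

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
null = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)

# Omnibus (likelihood-ratio) chi-square: 2 * (LL_full - LL_null)
lr_stat = 2 * (full.llf - null.llf)
df = full.df_model                  # number of predictors
p_value = chi2.sf(lr_stat, df)
print(f"Chi-square = {lr_stat:.2f}, df = {df:.0f}, p = {p_value:.4f}")
```

statsmodels also exposes this statistic directly as `full.llr` and `full.llr_pvalue`.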
2. Hosmer and Lemeshow Test

This test checks the goodness of fit of the model. A p-value greater than 0.05 indicates that the model fits the data well (i.e., the model adequately captures the data).
Example Output:
| Step | Chi-Square | df | Sig. |
|------|------------|----|------|
| 1    | 8.44       | 8  | 0.39 |

Here, a p-value of 0.39 indicates a good fit for the model.
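SPSS computes this test automatically. Outside SPSS there is no standard one-line function for it, so the sketch below implements the usual formulation by hand: sort observations into (by default) 10 groups by predicted probability, compare observed and expected counts of events and non-events, and refer the statistic to a chi-square distribution with (number of groups − 2) degrees of freedom. The arrays `y` and `p_hat` are assumed to hold the 0/1 outcomes and the model's fitted probabilities:

```python
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test (common textbook version)."""
    df = pd.DataFrame({"y": y, "p": p_hat})
    # Split observations into groups of roughly equal size by predicted risk
    df["bin"] = pd.qcut(df["p"], q=groups, duplicates="drop")
    stat = 0.0
    for _, g in df.groupby("bin", observed=True):
        obs1, exp1 = g["y"].sum(), g["p"].sum()      # observed/expected events
        obs0, exp0 = len(g) - obs1, len(g) - exp1    # observed/expected non-events
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    dof = df["bin"].nunique() - 2
    return stat, dof, chi2.sf(stat, dof)

# Usage with hypothetical fitted values:
# stat, dof, p = hosmer_lemeshow(y, model.predict(X_with_const))
# print(f"Chi-square = {stat:.2f}, df = {dof}, p = {p:.3f}")
```

Different software packages bin observations slightly differently, so the statistic may not match SPSS exactly.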
3. Variables in the Equation

This table displays the final selected predictors and their corresponding coefficients. The Sig. column gives the p-value for each variable's contribution to the model.
Example Output:
| Predictor   | B     | Std. Error | Wald | df | Sig.  |
|-------------|-------|------------|------|----|-------|
| (Constant)  | -1.23 | 0.50       | 6.91 | 1  | 0.009 |
| Predictor_1 | 0.45  | 0.20       | 5.09 | 1  | 0.024 |
| Predictor_2 | 0.30  | 0.12       | 7.25 | 1  | 0.007 |

- B: The coefficient for each predictor variable (how much the log-odds of the outcome change per one-unit increase in the predictor).
- Sig.: The p-value that indicates whether the variable is statistically significant.
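To reproduce a table like this outside SPSS, a statsmodels fit exposes the same quantities: B is the log-odds coefficient, the Wald statistic is the squared z-value, and Exp(B) is the odds ratio. The data below are simulated purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: 200 observations, two predictors
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# True relationship, used only to simulate outcomes
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * x1 + 0.4 * x2)))
y = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.Logit(y, X).fit(disp=0)

table = pd.DataFrame({
    "B": model.params,                        # log-odds coefficients
    "Std. Error": model.bse,
    "Wald": (model.params / model.bse) ** 2,  # Wald chi-square = z^2
    "Sig.": model.pvalues,
    "Exp(B)": np.exp(model.params),           # odds ratios
}, index=["(Constant)", "x1", "x2"])
print(table.round(3))
```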
Practical Operation: Steps in SPSS #
Assume we have a dataset with the independent variable Age and the dependent variable Churn (whether a customer churns or not). We wish to build a logistic regression model to predict customer churn.
Steps:
Step 1: Data Preparation
- Open SPSS and go to Data View.
- Enter the data, with variables as follows:
  - Churn (dependent variable): 0 = No, 1 = Yes
  - Age (independent variable): age of the customer in years
Example data:
| Age | Churn |
|-----|-------|
| 25  | 0     |
| 30  | 1     |
| 35  | 0     |
| 40  | 1     |
| 45  | 1     |
| 50  | 0     |
Step 2: Conduct Logistic Regression
- Click on the Analyze menu, select Regression, and then click Binary Logistic.
- In the pop-up dialog:
- Move Churn (dependent variable) to the Dependent box.
- Move Age (independent variable) to the Covariates box.
- Click OK to run the logistic regression.
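For readers who also want to run the analysis in code, here is a minimal statsmodels equivalent of Steps 1 and 2 using the same six example rows (far too few observations for a trustworthy model, so treat the output as illustrative only):

```python
import pandas as pd
import statsmodels.api as sm

# The six example observations from Step 1
df = pd.DataFrame({"Age":   [25, 30, 35, 40, 45, 50],
                   "Churn": [0,  1,  0,  1,  1,  0]})

X = sm.add_constant(df[["Age"]])              # intercept plus Age as the covariate
model = sm.Logit(df["Churn"], X).fit(disp=0)
print(model.summary())                        # coefficients, std. errors, p-values
```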
Step 3: View and Interpret the Output Results
Review the output from the following tables:
- Omnibus Tests of Model Coefficients: Check if the model is significant.
- Hosmer and Lemeshow Test: Check if the model fits the data well.
- Variables in the Equation: Interpret the coefficients and p-values.
Example interpretation of output:
- Sig. = 0.024 for Age means age significantly predicts customer churn.
- The B coefficient for Age is 0.45, meaning that for each additional year of age, the odds of customer churn increase by a factor of ( \exp(0.45) \approx 1.57 ).
Predicting Outcomes #
To predict the probability of a customer churning given their age, use the logistic regression equation:
[ \text{logit}(P) = -1.23 + 0.45 \times (\text{Age}) ]
For a customer who is 40 years old:
[ \text{logit}(P) = -1.23 + 0.45 \times 40 = 16.77 ]
Then, transform the logit value into the probability using the logistic function:
[ P = \frac{1}{1 + e^{-\text{logit}(P)}} = \frac{1}{1 + e^{-16.77}} \approx 0.99999995 ]
The predicted probability that this customer churns is essentially 100%.
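This calculation is easy to script. A minimal sketch that plugs the fitted coefficients from the example above into the logistic function, and also reports the odds ratio per additional year of age:

```python
import math

# Fitted coefficients from the example model above
intercept, b_age = -1.23, 0.45

age = 40
logit = intercept + b_age * age     # -1.23 + 0.45 * 40 = 16.77
p = 1 / (1 + math.exp(-logit))      # logistic transform
print(f"logit = {logit:.2f}, P(churn) = {p:.8f}")

# Odds ratio for one additional year of age
print(f"Exp(B) = {math.exp(b_age):.2f}")  # ~1.57
```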
Summary #
- Logistic Regression is used to predict binary outcomes based on predictor variables.
- Key assumptions include a binary dependent variable, independence of observations, and a linear relationship between predictors and logit.
- Focus on Omnibus Tests, Hosmer and Lemeshow Test, and Variables in the Equation tables for interpretation.
- Use the regression equation to predict probabilities of outcomes.
Through these steps, we can build a logistic regression model, interpret the results, and use it to make predictions.