SPSS 19.0|Logistic Regression

Zhenting HE / 2024-01-07


Logistic Regression #

Logistic regression is used to predict a binary outcome variable (e.g., Will this customer churn? Will a consumer purchase a product?). It is developed from linear regression and is specifically used when the dependent variable is binary. Logistic regression is widely applied in:

  1. Impact Factor Analysis
  2. Outcome Prediction (similar to regression analysis)

The core idea of logistic regression is to estimate the probability of a certain outcome (1 or 0) given predictor variables.

The logistic regression equation can be written as:

[ \text{logit}(P) = \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n ]

where:

  • ( P ) is the probability of the outcome occurring (Y = 1),
  • ( X_1, X_2, \dots, X_n ) are the independent variables,
  • ( \beta_0, \beta_1, \dots, \beta_n ) are the coefficients to be estimated.

Assumptions of Logistic Regression #

Before performing logistic regression, the following assumptions should be met:

  1. Binary Dependent Variable: The outcome variable must be binary (e.g., yes/no, 1/0).
  2. At Least One Independent Variable: The predictor variables can be continuous or categorical.
  3. Independence of Observations: Each observation must be independent, and the classifications should be exhaustive and mutually exclusive.
  4. Sample Size: The minimum sample size should be 15 times the number of predictor variables. Some researchers suggest a minimum of 50 times the number of predictors.
  5. Linear Relationship between Predictors and Logit: There should be a linear relationship between the continuous predictors and the logit transformation of the dependent variable.
  6. No Multicollinearity: The independent variables should not be highly correlated with each other.
  7. No Outliers or Influential Points: The data should not contain extreme outliers, leverage points, or influential points.

Key Tables in Logistic Regression Output #

When performing logistic regression analysis, the output contains several key tables. We will focus on three main tables:

  1. Omnibus Tests of Model Coefficients

    • This table tests whether all the coefficients in the model are equal to zero. If the p-value is less than 0.05, it indicates that at least one of the predictors has a statistically significant relationship with the dependent variable, meaning the model is significant.

    Example Output:

    Model Chi-Square df Sig.
    1 15.43 3 0.002

    Here, a p-value of 0.002 indicates that the model is significant.

  2. Hosmer and Lemeshow Test

    • This test checks the goodness of fit for the model. A p-value greater than 0.05 indicates that the model fits the data well (i.e., the data have been adequately captured by the model).

    Example Output:

    Step Chi-Square df Sig.
    1 8.44 8 0.39

    Here, a p-value of 0.39 indicates a good fit for the model.

  3. Variables in the Equation

    • This table displays the final selected predictors and their corresponding coefficients. The Sig. column indicates the p-value for each variable’s contribution to the model.

    Example Output:

    Predictor B Std. Error Wald df Sig.
    (Constant) -1.23 0.50 6.91 1 0.009
    Predictor_1 0.45 0.20 5.09 1 0.024
    Predictor_2 0.30 0.12 7.25 1 0.007
    • B: The coefficient for each predictor variable (how much the log-odds of the outcome changes per unit increase in the predictor).
    • Sig.: The p-value that indicates whether the variable is statistically significant.

Practical Operation: Steps in SPSS #

Assume we have a dataset with the independent variable Age and the dependent variable Churn (whether a customer churns or not). We wish to build a logistic regression model to predict customer churn.

Steps:

Step 1: Data Preparation

  1. Open SPSS and go to Data View.

  2. Enter the data, with variables as follows:

    • Churn (dependent variable): 0 = No, 1 = Yes
    • Age (independent variable): Age of the customer

    Example data:

    Age Churn
    25 0
    30 1
    35 0
    40 1
    45 1
    50 0

Step 2: Conduct Logistic Regression

  1. Click on the Analyze menu, select Regression, and then click Binary Logistic.
  2. In the pop-up dialog:
    • Move Churn (dependent variable) to the Dependent box.
    • Move Age (independent variable) to the Covariates box.
  3. Click OK to run the logistic regression.

Step 3: View and Interpret the Output Results

Review the output from the following tables:

  • Omnibus Tests of Model Coefficients: Check if the model is significant.
  • Hosmer and Lemeshow Test: Check if the model fits the data well.
  • Variables in the Equation: Interpret the coefficients and p-values.

Example interpretation of output:

  • Sig. = 0.024 for Age means age significantly predicts customer churn.
  • The B coefficient for Age is 0.45, meaning that for each additional year of age, the odds of customer churn increase by a factor of ( \exp(0.45) ).

Predicting Outcomes #

To predict the probability of a customer churning given their age, use the logistic regression equation:

[ \text{logit}(P) = -1.23 + 0.45 \times (\text{Age}) ]

For a customer who is 40 years old:

[ \text{logit}(P) = -1.23 + 0.45 \times 40 = 14.27 ]

Then, transform the logit value into the probability using the logistic function:

[ P = \frac{1}{1 + e^{-\text{logit}(P)}} = \frac{1}{1 + e^{-14.27}} \approx 0.999 ]

The predicted probability of this customer churning is approximately 99.9%.


Summary #

  1. Logistic Regression is used to predict binary outcomes based on predictor variables.
  2. Key assumptions include a binary dependent variable, independence of observations, and a linear relationship between predictors and logit.
  3. Focus on Omnibus Tests, Hosmer and Lemeshow Test, and Variables in the Equation tables for interpretation.
  4. Use the regression equation to predict probabilities of outcomes.

Through these steps, we can build a logistic regression model, interpret the results, and use it to make predictions.

#SPSS

Last modified on 2024-01-07