Using Logistic Regression in Research

Binary logistic regression is a statistical analysis that determines how well, if at all, a set of independent variables predicts a dichotomous dependent variable.
Questions Answered:

How does the probability of getting lung cancer change for every additional pound of overweight and for every X cigarettes smoked per day?

Do body weight, calorie intake, fat intake, and age have an influence on heart attacks (yes vs. no)?

The major assumptions are:

The outcome must be discrete; that is, the dependent variable should be dichotomous in nature (e.g., presence vs. absence).
You should assess the data for outliers by converting the continuous predictors to standardized scores (z-scores) and removing values below -3.29 or greater than 3.29.
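
The outlier screen described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name flag_outliers and the body-weight values are hypothetical, and the z-scores use the sample standard deviation.

```python
from statistics import mean, stdev

def flag_outliers(values, cutoff=3.29):
    """Return (value, z) pairs whose absolute z-score exceeds the cutoff."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    scored = [(v, (v - m) / s) for v in values]
    return [(v, z) for v, z in scored if abs(z) > cutoff]

# Hypothetical body weights in pounds: 30 typical values plus one extreme case.
weights = [150 + (i % 5) for i in range(30)] + [400]
flagged = flag_outliers(weights)
print(flagged)  # only the 400 lb case is flagged
```

Cases returned by the function are candidates for removal; whether to drop them remains a judgment for the researcher.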

Assumptions and Estimation in Binary Logistic Regression

There should be no high intercorrelations (multicollinearity) among the predictors. This can be assessed with a correlation matrix among the predictors. Tabachnick and Fidell (2012) suggest that as long as correlation coefficients among independent variables are less than 0.90, the assumption is met.
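
As a rough illustration of this check, the sketch below computes Pearson correlations for every pair of predictors and reports pairs at or above the 0.90 threshold. The function names and the data are hypothetical, chosen only to show the mechanics.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def high_correlations(predictors, threshold=0.90):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical predictor values for six cases.
predictors = {
    "calories": [1800, 2000, 2200, 2400, 2600, 2800],
    "fat": [60, 66, 75, 79, 88, 93],
    "age": [34, 61, 45, 29, 52, 40],
}
problem_pairs = high_correlations(predictors)
print(problem_pairs)  # calories and fat are flagged; age is not
```

In this made-up data, calorie intake and fat intake move almost in lockstep, so one of the two would need to be dropped or combined before fitting the model.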

You can handle categorical outcome variables with more than two categories using special forms of logistic regression. Outcome variables with three or more categories that are not ordered can be examined using multinomial logistic regression, while ordered outcome variables can be examined using various forms of ordinal logistic regression. These techniques require a number of additional assumptions and tests, so we will focus now strictly on binary logistic regression.

Estimation of Binary Logistic Regression Using Maximum Likelihood Estimation (MLE)

Binary logistic regression is estimated using Maximum Likelihood Estimation (MLE), unlike linear regression, which uses the Ordinary Least Squares (OLS) approach. MLE is an iterative procedure, which means it starts with a guess as to the best weight for each predictor variable (that is, each coefficient in the model) and then adjusts these coefficients repeatedly until there is no additional improvement in the ability to predict the value of the outcome variable (either 0 or 1) for each case. While OLS regression can be visualized as the process of finding the line which best fits the data, logistic regression is more similar to crosstabulation, given that the outcome is categorical and the test statistic utilized is the Chi-square.
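
To make the iterative idea concrete, here is a minimal sketch of MLE for a model with one predictor and an intercept. It uses Newton-Raphson updates, which is one common choice and not necessarily the exact algorithm any particular package uses. The function names and the data are hypothetical.

```python
from math import exp, log

def sigmoid(z):
    """Convert a logit into a probability."""
    return 1.0 / (1.0 + exp(-z))

def log_likelihood(x, y, b0, b1):
    """Log-likelihood of the data under P(y = 1) = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * log(p) + (1 - yi) * log(1 - p)
    return total

def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Estimate intercept b0 and slope b1 by Newton-Raphson iterations."""
    b0, b1 = 0.0, 0.0  # starting guess for the coefficients
    for iteration in range(1, max_iter + 1):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information matrix times the gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break  # no further meaningful improvement
    return b0, b1, iteration

# Hypothetical data: one continuous predictor and a 0/1 outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, n_iter = fit_logistic(x, y)
print(b0, b1, n_iter)
```

Each pass recomputes the predicted probabilities from the current coefficients and nudges the coefficients toward values that make the observed outcomes more likely. The loop stops once the adjustments become negligibly small.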

How is logistic regression run in SPSS and how is the output interpreted?
In SPSS, binary logistic regression is located on the Analyze drop-down list, under the Regression menu. The outcome variable, which must be coded as 0 and 1, is placed in the first box labeled Dependent. All predictors are entered into the Covariates box, and categorical variables should be appropriately dummy coded. SPSS predicts the value labeled 1 by default, so careful attention should be paid to the coding of the outcome (usually it makes more sense to examine the presence of a characteristic or “success”).

SPSS produces lots of output for logistic regression, but below we focus on the most important panel of coefficients to determine the direction, magnitude, and significance of each predictor.

For example, let’s examine a study that investigates whether individuals have been tested for HIV. Looking first at Age as a predictor, we see that the value in the column labeled B (also known as the logit, the logit coefficient, the logistic regression coefficient, or the parameter estimate) is -.035. This indicates that the association between age and testing is negative; that is, as age increases, testing for HIV decreases. Much like in OLS regression, a logit of 0 indicates no relationship. A positive logit increases the logged odds of success, while a negative logit decreases the logged odds of success.

But what is the magnitude of this effect? Technically we could say that for every additional year of age, the logged odds of having been tested for HIV decrease by .035. This is not very intuitive, however, as we don’t generally have a strong sense of what logged odds are. In fact, it is much more common to look to the last column of this table, labeled Exp(B). This is the Odds Ratio, obtained by exponentiating B, and you can interpret it as the multiplicative change in the odds of success for each one-unit increase in the predictor. It is important to note that Odds Ratios (ORs) are relative to 1, meaning that an OR of 1 indicates no relationship, while an OR greater than 1 indicates a positive relationship and an OR less than 1 a negative relationship. For most people, ORs are most intuitively interpreted by converting them to percent changes in the odds of success. This is simply done:

(Odds Ratio – 1) * 100 = percent change

So here we could say that each additional year of age reduces the odds of having been tested for HIV by 3.5%.

The interpretation of dummy-coded predictors is even easier in logistic regression. Here we compare the odds of those coded 1 (females in this example) to those coded 0 (males). Using the same simple equation as above, we find that women have 31.5% greater odds of having been tested for HIV compared to men. For both age and sex we see that the p-value is extremely small (< .01). Therefore, we conclude that each predictor is significantly associated with the outcome of interest.
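
The arithmetic behind these interpretations can be sketched as follows. The function names are hypothetical. The value -0.035 is the B for age reported above, and 1.315 is the odds ratio implied by the 31.5% figure for sex. Because B is displayed rounded to three decimals, the percent change computed from it (about -3.4%) can differ slightly from one read off the Exp(B) column.

```python
from math import exp

def odds_ratio(b):
    """Exponentiate a logit coefficient B to obtain Exp(B), the odds ratio."""
    return exp(b)

def percent_change_in_odds(odds_ratio_value):
    """Apply (Odds Ratio - 1) * 100 to get the percent change in the odds."""
    return (odds_ratio_value - 1) * 100

age_or = odds_ratio(-0.035)
print(round(age_or, 3))                          # 0.966
print(round(percent_change_in_odds(age_or), 1))  # -3.4
print(round(percent_change_in_odds(1.315), 1))   # 31.5
```

A negative result means the odds fall with each one-unit increase in the predictor, and a positive result means they rise.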

What are special concerns with regard to logistic regression?
One key way in which logistic regression differs from OLS regression is with regard to explained variance, or R². Because logistic regression estimates the coefficients using MLE rather than OLS (see above), there is no direct analogue to explained variance in logistic regression. Many people want to describe how good a particular model is in an equivalent way, so researchers have developed numerous pseudo-R² values. These should be interpreted with extreme caution, as they have many computational issues that cause them to be artificially high or low. A better approach is to present any of the goodness of fit tests available. Hosmer-Lemeshow is a commonly used measure of goodness of fit that is based on the Chi-square test (which makes sense given that logistic regression is related to crosstabulation).
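
For readers curious how the Hosmer-Lemeshow statistic is built, the sketch below shows the usual construction. Cases are sorted by predicted probability and split into groups (ten by convention), and observed and expected counts of the outcome are then compared within each group. The function name and the data are hypothetical. The sketch returns only the statistic and its degrees of freedom (groups minus 2) and does not compute a p-value. It also assumes no group has an average predicted probability of exactly 0 or 1.

```python
def hosmer_lemeshow_statistic(probs, outcomes, groups=10):
    """Return (statistic, degrees of freedom) for the Hosmer-Lemeshow test."""
    pairs = sorted(zip(probs, outcomes))  # order cases by predicted probability
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue  # skip empty groups when there are few cases
        size = len(chunk)
        observed = sum(y for _, y in chunk)          # observed count of 1s
        mean_p = sum(p for p, _ in chunk) / size     # average predicted probability
        expected = size * mean_p                     # expected count of 1s
        statistic += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return statistic, groups - 2

# Hypothetical predicted probabilities and observed 0/1 outcomes for 20 cases.
probs = [0.2] * 5 + [0.4] * 5 + [0.6] * 5 + [0.8] * 5
outcomes = [1, 1, 0, 0, 0] + [1, 1, 0, 0, 0] + [1, 1, 1, 0, 0] + [1, 1, 1, 0, 0]
hl_stat, hl_df = hosmer_lemeshow_statistic(probs, outcomes, groups=4)
print(hl_stat, hl_df)
```

The statistic is compared against a Chi-square distribution with the returned degrees of freedom. A small, non-significant value suggests that predicted and observed outcomes agree reasonably well.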