US 877.437.8622    UK 0.808.101.0930    info@statisticssolutions.com

What is logistic regression?

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more independent variables, which may be metric (interval or ratio scale) or, once dummy coded, categorical.

Standard linear regression requires the dependent variable to be of metric (interval or ratio) scale. How can we apply the same principle to a dichotomous (0/1) variable? Logistic regression treats the dependent variable as the outcome of a stochastic event. For instance, if we analyze a pesticide's kill rate, the outcome is either killed or alive. Since even the most resistant bug can only be in one of these two states, logistic regression works with the probability of the bug being killed. If that probability is > 0.5 the bug is classified as dead; if it is < 0.5 it is classified as alive.
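The cut-off rule can be illustrated with a minimal sketch. The coefficients `b0` and `b1` and the dose values are made up for illustration and do not come from a fitted model; the function names are likewise hypothetical.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, cutoff=0.5):
    """Label the bug 'dead' if the predicted kill probability exceeds the cutoff."""
    return "dead" if p > cutoff else "alive"

# Hypothetical coefficients: intercept and effect of pesticide dose.
b0, b1 = -3.0, 1.5

for dose in (0.0, 1.0, 2.0, 3.0, 4.0):
    p = logistic(b0 + b1 * dose)
    print(f"dose={dose:.1f}  p(killed)={p:.3f}  ->  {classify(p)}")
```

A probability of exactly 0.5 is not covered by the rule in the text; this sketch arbitrarily assigns it to "alive".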

It is quite common to run a regular linear regression analysis on a dummy dependent variable, that is, a binary variable that is treated as if it were continuous. However, such an approach has major shortcomings. Firstly, it can lead to predicted probabilities outside the [0, 1] interval; secondly, the residuals cannot have constant variance, because for each predicted value they can take only two values (think of the two parallel lines in the zpred*zresid plot).
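Both shortcomings can be seen in a small sketch that fits an ordinary least squares line to a 0/1 outcome. The data are invented for illustration, and the closed-form formulas are the standard ones for a single predictor.

```python
# Toy data (made up): pesticide dose and whether the bug was killed (1) or not (0).
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for a single predictor (closed form).
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Predicted "probabilities" at the ends of the dose range leave [0, 1].
print("smallest fitted value:", round(min(fitted), 3))
print("largest fitted value: ", round(max(fitted), 3))

# Each residual is either -fitted (y = 0) or 1 - fitted (y = 1),
# which produces the two parallel lines in a residual plot.
for fi, ri in zip(fitted, residuals):
    print(f"fitted={fi:+.3f}  residual={ri:+.3f}")
```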

The basic idea is to use the logistic function to restrict the predicted probabilities to (0,1). Technically the model is linear in the log odds (the logarithm of the odds of y = 1), also called the logit. Sometimes a probit model is used instead of a logit model. The following graph shows the difference between a logit and a probit model for values in (-4,4). Both models are commonly used for binary outcomes; often a model is fitted with both functions and the one with the better fit is chosen. Probit assumes the cumulative standard normal distribution for the probability of the event, whereas logit assumes the cumulative logistic distribution, which has heavier tails. The two curves therefore differ mainly in the tails, so the choice matters most for probabilities close to 0 or 1.
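The two curves can be tabulated over the same range with a short sketch. It uses the standard logistic function for the logit model and the standard normal cumulative distribution function (via `math.erf`) for the probit model; the function names are chosen here for illustration.

```python
import math

def logistic_cdf(z):
    """Logit model: probability from the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Probit model: probability from the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Compare both curves on the integer grid from -4 to 4.
for z in range(-4, 5):
    print(f"z={z:+d}  logit={logistic_cdf(z):.4f}  probit={normal_cdf(z):.4f}")
```

Both functions give 0.5 at zero and approach 0 and 1 at the extremes, but the logistic curve approaches them more slowly.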

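At the center of the logistic regression analysis is the task of estimating the log odds of an event. Mathematically, logistic regression estimates a multiple linear regression function defined as

$$\mathrm{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$$

for i = 1…n, where p_i is the probability that y = 1 for case i and k is the number of independent variables.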
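As a worked illustration of the log odds, the sketch below evaluates a linear function of two predictors, converts the result to a probability, and shows that raising one predictor by one unit multiplies the odds by the exponential of its coefficient. The intercept, coefficients, and predictor values are hypothetical.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Probability corresponding to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(intercept, coefs, xs):
    """Linear predictor: intercept plus the sum of coefficient * value."""
    return intercept + sum(b * x for b, x in zip(coefs, xs))

# Hypothetical model with two independent variables.
intercept = -1.0
coefs = [0.8, -0.5]

z = log_odds(intercept, coefs, [2.0, 1.0])   # -1 + 1.6 - 0.5 = 0.1
p = inverse_logit(z)
print(f"log odds = {z:.3f}, probability = {p:.3f}")

# Raise the first predictor by one unit and compare the odds.
z_plus = log_odds(intercept, coefs, [3.0, 1.0])
odds_ratio = math.exp(z_plus) / math.exp(z)
print(f"odds ratio = {odds_ratio:.3f}, exp(0.8) = {math.exp(0.8):.3f}")
```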

Logistic regression is similar to discriminant analysis. Discriminant analysis uses a regression line to split a sample into two groups along the levels of the dependent variable. Whereas logistic regression uses the concept of probabilities and log odds with a cut-off probability of 0.5, discriminant analysis cuts the geometrical plane that is represented by the scatter cloud. The practical difference lies in the assumptions of the two methods. If the data are multivariate normal, the variances and covariances are homogeneous across groups, and the independent variables are linearly related, then use discriminant analysis, because it is more statistically powerful and efficient. When these assumptions hold, discriminant analysis is typically more accurate than logistic regression in predictive classification of the dependent variable.
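The geometric cut made by discriminant analysis can be sketched in one dimension. Assuming a single predictor, equal group variances, and equal prior probabilities, the linear discriminant boundary is the midpoint between the two group means. The data and the name `lda_classify` are made up for illustration.

```python
# Made-up values of one predictor for the two levels of the dependent variable.
group0 = [1.0, 1.5, 2.0, 2.5, 3.0]   # cases with y = 0
group1 = [3.5, 4.0, 4.5, 5.0, 5.5]   # cases with y = 1

mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

# With equal variances and equal priors, the boundary is the midpoint of the means.
cut = (mean0 + mean1) / 2.0

def lda_classify(x):
    """Assign x to group 1 if it lies above the cut point (assumes mean1 > mean0)."""
    return 1 if x > cut else 0

print("cut point:", cut)
for x in (2.0, 3.0, 3.5, 5.0):
    print(f"x={x:.1f} -> group {lda_classify(x)}")
```

Logistic regression would instead estimate a probability for each value of x and apply the 0.5 cut-off to that probability.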

When selecting the model for the logistic regression analysis, another important consideration is the model fit. Adding independent variables to a logistic regression model will always increase the amount of variance explained in the log odds (typically expressed as a pseudo-R²). However, adding more and more variables to the model makes it inefficient and leads to overfitting. Occam's razor applies perfectly to the problem of overfitting: a logistic regression model should be as simple as possible, but not simpler. Statistically speaking, the more variables the logistic regression includes, the higher the probability that a significance test finds at least one of them to be significant purely by chance.
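The chance-significance problem can be quantified with a short sketch. It assumes, for simplicity, that every variable truly has no effect and that the significance tests are independent, each at level alpha; real predictors are usually correlated, so the numbers are only indicative.

```python
def prob_chance_significance(n_vars, alpha=0.05):
    """Probability that at least one of n_vars null effects tests as significant.

    Assumes independent tests, each with false-positive rate alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_vars

for n_vars in (1, 5, 10, 20):
    p = prob_chance_significance(n_vars)
    print(f"{n_vars:2d} variables -> P(at least one spurious finding) = {p:.3f}")
```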
