Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more independent variables, which may be metric (interval or ratio scale) or dummy-coded categorical variables.
Standard linear regression requires the dependent variable to be measured on a metric (interval or ratio) scale. How can we apply the same principle to a dichotomous (0/1) variable? Logistic regression treats the dependent variable as the outcome of a stochastic event. For instance, if we analyze a pesticide's kill rate, the outcome for each bug is either killed or alive. Since even the most resistant bug can only be in one of these two states, logistic regression works with the probability of the bug being killed. If that probability is greater than 0.5, the bug is classified as dead; if it is less than 0.5, it is classified as alive.
It is quite common to run a regular linear regression analysis on a binary dependent variable coded as a dummy (0/1) variable, that is, a binary variable treated as if it were continuous. However, such an approach has major shortcomings. Firstly, it can lead to predicted probabilities outside the [0, 1] interval; secondly, the residuals cannot have constant variance, because for a binary outcome the error variance p(1 - p) changes with the predicted probability (think of the two parallel lines in the zpred*zresid plot).
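The first shortcoming can be checked with a minimal sketch. The dose and outcome values below are invented for illustration only, and `ols_fit` is a hypothetical helper that fits an ordinary least squares line with one predictor to a 0/1 outcome.

```python
def ols_fit(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: pesticide dose and whether the bug was killed (1) or not (0).
dose = [1, 2, 3, 4, 5, 6, 7, 8]
killed = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, slope = ols_fit(dose, killed)

# Fitted values of the straight line, read as "probabilities" of being killed.
fitted = [intercept + slope * d for d in dose]

for d, f in zip(dose, fitted):
    flag = "" if 0.0 <= f <= 1.0 else "  <- outside [0, 1]"
    print(f"dose {d}: fitted value {f:.3f}{flag}")
```

With these made-up values the fitted line drops below 0 at the lowest dose and rises above 1 at the highest, which cannot be interpreted as a probability.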
The basic idea is to use the logistic function to restrict the predicted probabilities to (0,1). Technically, the model is linear in the log odds, also called the logit (the logarithm of the odds of y = 1). Sometimes a probit model is used instead of a logit model. The following graph shows the difference between a logit and a probit model for values in the range (-4,4). Both models are commonly used for binary outcomes; often a model is fitted with both functions and the function with the better fit is chosen. However, probit assumes a normal distribution for the underlying error term, whereas logit assumes a logistic distribution, which has heavier tails. Thus the difference between logit and probit shows up mainly for probabilities close to 0 or 1.
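The two curves can also be compared numerically. The sketch below tabulates the logistic function (the inverse of the logit) and the standard normal cumulative distribution function (the inverse of the probit) at integer values from -4 to 4; the function names are illustrative, and only Python's standard `math` module is assumed.

```python
import math

def logistic_cdf(z):
    """Inverse logit: maps a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Inverse probit: standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit(p):
    """Log odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Tabulate both curves over the range discussed in the text.
for z in range(-4, 5):
    print(f"z = {z:+d}   logit model: {logistic_cdf(z):.4f}   probit model: {normal_cdf(z):.4f}")
```

Both curves pass through 0.5 at z = 0. On this unscaled axis the logistic curve approaches 0 and 1 more slowly than the normal curve.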

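At the center of the logistic regression analysis is the task of estimating the log odds of an event. Mathematically, logistic regression estimates a multiple linear regression function defined as
$$\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$$
for i = 1…n, where p_i is the probability that y = 1 for observation i and k is the number of independent variables.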
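As a minimal sketch of how the coefficients of such a linear function of the log odds might be estimated, the code below maximizes the log-likelihood by plain gradient ascent. This is an assumption made for illustration: statistical packages usually rely on faster methods such as iteratively reweighted least squares. The data, the function names, and the learning rate and iteration count are all hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: converts log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(beta, x):
    """P(y = 1) for one observation x, given beta = [b0, b1, ..., bk]."""
    z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return sigmoid(z)

def fit_logistic(rows, labels, learning_rate=0.1, iterations=20000):
    """Estimate [b0, b1, ..., bk] by gradient ascent on the mean log-likelihood."""
    n, k = len(rows), len(rows[0])
    beta = [0.0] * (k + 1)
    for _ in range(iterations):
        gradient = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            error = y - predict_probability(beta, x)
            gradient[0] += error              # intercept term
            for j, xj in enumerate(x):
                gradient[j + 1] += error * xj
        beta = [b + learning_rate * g / n for b, g in zip(beta, gradient)]
    return beta

# Hypothetical data: one predictor (pesticide dose), outcome killed (1) or alive (0).
doses = [[1], [2], [3], [4], [5], [6], [7], [8]]
killed = [0, 0, 0, 1, 0, 1, 1, 1]

beta = fit_logistic(doses, killed)

# Classify with the 0.5 cut-off described in the text.
for x in doses:
    p = predict_probability(beta, x)
    print(f"dose {x[0]}: p(killed) = {p:.3f} -> {'dead' if p > 0.5 else 'alive'}")
```

The fixed step size works here because the example is tiny and the predictor values are small; for real data a dedicated statistics library would be the safer choice.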
Logistic regression is similar to discriminant analysis. Discriminant analysis uses a linear discriminant function to split a sample into two groups along the levels of the dependent variable. Whereas logistic regression uses the concept of probabilities and log odds with a cut-off probability of 0.5, discriminant analysis cuts the geometrical plane that is represented by the scatter cloud. The practical difference lies in the assumptions of the two methods. If the independent variables are multivariate normal, the groups have equal variance-covariance matrices (homoscedasticity), and the independent variables are linearly related, then use discriminant analysis because it is more statistically powerful and efficient. When these assumptions hold, discriminant analysis is typically more accurate in predictive classification of the dependent variable than logistic regression.
When selecting the model for the logistic regression analysis, another important consideration is the model fit. Adding independent variables to a logistic regression model will never reduce, and will almost always increase, its apparent fit, because each additional variable explains at least a little more of the variation in the outcome (typically expressed as a pseudo-R²). However, adding more and more variables makes the model inefficient, and overfitting occurs. Occam's razor applies perfectly to the problem of overfitting: a logistic regression model should be as simple as possible, but not simpler. Statistically speaking, the more variables the logistic regression includes, the higher the probability that a significance test finds some of them to be significant purely by chance.
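The last point can be illustrated with a short calculation. The sketch assumes that none of the variables has a real effect, that each is tested separately at a significance level of 0.05, and that the tests are independent; the function name is hypothetical.

```python
def chance_of_false_positive(num_variables, alpha=0.05):
    """Probability that at least one of several independent tests on
    variables with no real effect comes out significant at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_variables

# Probability of at least one spurious "significant" variable as the model grows.
for m in (1, 5, 10, 20):
    print(f"{m:2d} variables: {chance_of_false_positive(m):.3f}")
```

In practice the tests within one model are rarely independent, so these figures are only a rough guide to how quickly the risk grows.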


