The Multiple Linear Regression Analysis in SPSS
This example is based on the FBI’s 2006 crime statistics. In particular, we are interested in the relationship between city size, various property crime rates, and the murder rate in the city. Our hypothesis is that less violent crimes open the door to violent crimes, and that an effect remains even after we account for some of the effect of city size by comparing crime rates per 100,000 inhabitants.
First, we need to check whether there is a linear relationship between each independent variable and the dependent variable in our multiple linear regression model. To do this, we inspect the scatter plots. They indicate a good linear relationship between the murder rate and the burglary and motor vehicle theft rates, and only weak relationships between the murder rate and population and larceny.
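As a numeric complement to the scatter plots, the strength of a linear relationship can be summarized with the Pearson correlation coefficient. The following is a minimal sketch; the function name pearson_r and the data values are made up for illustration and are not the FBI figures.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical rates per 100,000 inhabitants (illustrative only)
vehicle_theft = [200.0, 350.0, 500.0, 650.0, 800.0]
murder_rate = [4.0, 8.0, 9.0, 14.0, 16.0]

r = pearson_r(vehicle_theft, murder_rate)
print(f"r = {r:.3f}")
```

A value of r close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests a weak one. The scatter plot should still be inspected, because r alone does not reveal curvature or outliers.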
Second, we need to check for multivariate normality. In our example, multivariate normality might not be present in the population data, which is not surprising since we truncated variability by selecting the 70 biggest cities to begin with.
We will ignore this violation of the assumption and conduct the multiple linear regression analysis. In SPSS, multiple linear regression is found under Analyze/Regression/Linear…
In our example we enter the variable murder rate as the dependent variable of our multiple linear regression model, and population, burglary, larceny, and vehicle theft as the independent variables. We also select Stepwise as the method. The default method for the multiple linear regression analysis is ‘Enter’, which means that all variables are forced into the model. But since overfitting is a concern of ours, we want only those variables in the model that explain additional variance.
Under Options we can set the stepwise criteria. We want to enter a variable into our multiple linear regression model if the probability of its F statistic is below 0.05, and to remove it again if that probability rises above 0.1.
Under Statistics we can request additional statistics that we need to assess the validity of our linear regression analysis.
It is advisable to also include the collinearity diagnostics and the Durbin-Watson test for auto-correlation. To test the assumptions of homoscedasticity and normality of residuals, we also request a plot in the Plots menu.
The first table lists the variables in our analysis. It turns out that only motor vehicle theft is useful for predicting the murder rate.
The next table shows the multiple linear regression model summary and overall fit statistics. The adjusted R² of our model is 0.398 and R² = 0.407, which means that the linear regression explains 40.7% of the variance in the data. The Durbin-Watson statistic is d = 2.074, which lies within the commonly used rule-of-thumb range 1.5 < d < 2.5, so we can assume that there is no first-order linear auto-correlation in our multiple linear regression data.
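To make the Durbin-Watson statistic more concrete, here is a minimal sketch of how d is computed from a sequence of residuals: the sum of squared differences between successive residuals divided by the sum of squared residuals. The function names and the residual values are illustrative assumptions, and the 1.5 to 2.5 range is the rule of thumb used in this example rather than a formal critical value.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences / sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def within_rule_of_thumb(d, lower=1.5, upper=2.5):
    """True if d lies inside the rule-of-thumb range used in the text."""
    return lower < d < upper

# Made-up residual series for illustration
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]     # drifts steadily upward

d_alternating = durbin_watson(alternating)
d_trending = durbin_watson(trending)
print(d_alternating, d_trending, within_rule_of_thumb(2.074))
```

Residuals that flip sign at every step push d toward 4 (negative auto-correlation), residuals that drift slowly push d toward 0 (positive auto-correlation), and values near 2 indicate no first-order auto-correlation.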
If we had forced all variables (Method: Enter) into the linear regression model, we would have seen a slightly higher R² and adjusted R² (0.458 and 0.424, respectively).
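The relationship between R² and adjusted R² can be checked with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors. The sketch below assumes n = 70, taken from the 70 cities mentioned earlier; the exact sample size used in the regression is not stated, so treat this as an assumption. The function name adjusted_r2 is illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

N_CITIES = 70  # assumption: all 70 cities enter the regression

# Stepwise model: one predictor (motor vehicle theft), reported R² = 0.407
adj_stepwise = adjusted_r2(0.407, N_CITIES, 1)

# Enter model: four predictors, taking the larger reported value 0.458 as R²
adj_enter = adjusted_r2(0.458, N_CITIES, 4)

print(f"stepwise: {adj_stepwise:.3f}, enter: {adj_enter:.3f}")
```

Under these assumptions the stepwise model gives roughly 0.398, and the model with all four predictors gives a value close to 0.424 (small differences come from the rounding of R²). Because adjusted R² can never exceed R², the smaller of the two reported values must be the adjusted one.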
The next table is the F-test. The linear regression’s F-test has the null hypothesis that there is no linear relationship between the variables (in other words, R² = 0). The F-test is highly significant, so we can assume that there is a linear relationship between the variables in our model.
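The overall F statistic is tied directly to R²: F = (R²/p) / ((1 - R²)/(n - p - 1)). The sketch below reconstructs an approximate F for the stepwise model from the reported R² = 0.407, assuming n = 70 cities and p = 1 predictor. It is a rough check based on rounded inputs, not the value printed in the SPSS table, and the function name is illustrative.

```python
def f_from_r2(r2, n, p):
    """Overall regression F statistic computed from R², n observations, p predictors."""
    df_model = p
    df_residual = n - p - 1
    return (r2 / df_model) / ((1.0 - r2) / df_residual)

# Assumed inputs: R² = 0.407 (reported), n = 70 cities, p = 1 predictor
f_stepwise = f_from_r2(0.407, 70, 1)
print(f"F(1, 68) is approximately {f_stepwise:.1f}")
```

With these inputs F comes out at about 46.7 on 1 and 68 degrees of freedom, which is far larger than the values expected when R² = 0 and is consistent with a highly significant test.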
The next table shows the multiple linear regression estimates including the intercept and the significance levels.
In our stepwise multiple linear regression analysis we find a non-significant intercept but a highly significant vehicle theft coefficient, which we can interpret as follows: for every additional 100 vehicle thefts per 100,000 inhabitants, we expect 2 additional murders per 100,000.
If we force all variables into the multiple linear regression, the Beta weights and collinearity statistics become interesting. Beta expresses the relative importance of each independent variable in standardized terms. First, we find that only burglary and motor vehicle theft are significant predictors. Second, we find that motor vehicle theft has a higher impact than burglary (beta = .507 versus beta = .333).
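A standardized Beta weight is the unstandardized coefficient rescaled by the standard deviations of predictor and outcome: beta = b × (s_x / s_y). The sketch below illustrates this for a single predictor, where the slope is estimated by ordinary least squares. The data are made up for illustration (they are not the FBI figures), and the function names are assumptions.

```python
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def standardized_beta(b, x, y):
    """Rescale an unstandardized coefficient b into standard-deviation units."""
    return b * statistics.stdev(x) / statistics.stdev(y)

# Hypothetical rates per 100,000 inhabitants (illustrative only)
vehicle_theft = [200.0, 350.0, 500.0, 650.0, 800.0]
murder_rate = [4.0, 8.0, 9.0, 14.0, 16.0]

b = ols_slope(vehicle_theft, murder_rate)
beta = standardized_beta(b, vehicle_theft, murder_rate)
print(f"b = {b:.4f}, beta = {beta:.3f}")
```

Because Beta is unit-free, coefficients of predictors measured on very different scales can be compared directly, which is what allows the comparison of motor vehicle theft and burglary in the text.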
This table also lets us check for multicollinearity in our multiple linear regression model. Tolerance should be > 0.1 (or VIF < 10) for all variables, which is the case here.
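Tolerance and VIF are two views of the same quantity. For predictor j, tolerance is 1 - R_j², where R_j² comes from regressing that predictor on all the other predictors, and VIF is the reciprocal of tolerance. The sketch below applies the thresholds from the text; the R_j² values are hypothetical, since the article does not report them, and the function names are illustrative.

```python
def tolerance(r2_j):
    """Tolerance of a predictor, given R² from regressing it on the other predictors."""
    return 1.0 - r2_j

def vif(r2_j):
    """Variance inflation factor: the reciprocal of tolerance."""
    return 1.0 / tolerance(r2_j)

def collinearity_ok(r2_j, min_tolerance=0.1):
    """Apply the rule from the text: tolerance > 0.1, equivalently VIF < 10."""
    return tolerance(r2_j) > min_tolerance

# Hypothetical R_j² values for two predictors (illustrative only)
moderate = 0.50  # half of the predictor's variance is shared with the others
severe = 0.95    # the predictor is almost a linear combination of the others

print(tolerance(moderate), vif(moderate), collinearity_ok(moderate))
print(tolerance(severe), vif(severe), collinearity_ok(severe))
```

In this sketch the first predictor passes the check (tolerance 0.5, VIF 2), while the second fails it (tolerance about 0.05, VIF about 20).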
As the Goldfeld-Quandt test is not supported in SPSS, we check the homoscedasticity and normality of residuals with the plot of *ZPRED against *ZRESID (the standardized predicted values and standardized residuals). The plot indicates that in our multiple linear regression analysis there is no tendency in the error terms. If there were such a tendency, the graph would look like a staircase.