# Conduct and Interpret a Multiple Linear Regression

**What is Multiple Linear Regression?**

Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, it is used to describe data and to explain the relationship between one dependent variable and two or more independent variables.

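At the center of multiple linear regression analysis lies the task of fitting a linear function to a multi-dimensional cloud of data points. With a single independent variable this is a line through a scatter plot; with two independent variables it is a plane, and with more it is a hyperplane. The simplest form has one dependent and two independent variables. The general form of the multiple linear regression is defined as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i$$

*for i = 1…n*, where $y_i$ is the dependent variable for observation $i$, $x_{i1}, \dots, x_{ip}$ are the $p$ independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the regression coefficients, and $\varepsilon_i$ is the error term.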
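To make the idea of fitting concrete, here is a minimal, dependency-free Python sketch of ordinary least squares for the simplest case (one dependent and two independent variables). It is an illustration only, not what SPSS does internally: the function name `fit_ols` and the data are made up, and the coefficients are obtained by solving the normal equations with Gauss-Jordan elimination.

```python
def fit_ols(rows, y):
    """Return least-squares coefficients [b0, b1, ..., bp] (b0 = intercept)."""
    # Design matrix with a leading column of ones for the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations (X'X | X'y).
    A = [[sum(x[a] * x[b] for x in X) for b in range(k)]
         + [sum(x[a] * yi for x, yi in zip(X, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 without noise,
# so the fit should recover an intercept of 1 and slopes of 2 and 3.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
outcome = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

coefficients = fit_ols(predictors, outcome)
print([round(b, 6) for b in coefficients])
```

Each slope is read as the expected change in the dependent variable for a one-unit change in that predictor while the other predictor is held fixed. For real work, a statistics package is the better choice, since it also reports standard errors and fit statistics.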

Sometimes the dependent variable is also called endogenous variable, criterion variable, prognostic variable or regressand. The independent variables are also called exogenous variables, predictor variables or regressors.

Multiple linear regression analysis consists of more than just fitting a line through a cloud of data points. It consists of three stages: 1) analyzing the correlation and directionality of the data, 2) estimating the model, i.e., fitting the line, and 3) evaluating the validity and usefulness of the model.

There are three major uses for multiple linear regression analysis: 1) causal analysis, 2) forecasting an effect, and 3) trend forecasting. Unlike correlation analysis, which focuses on the strength of the relationship between two or more variables, regression analysis assumes a dependence or causal relationship between one or more independent variables and one dependent variable.

Firstly, it might be used to identify the strength of the effect that the independent variables have on a dependent variable. Typical questions would seek to determine the strength of the relationship between dose and effect, sales and marketing spend, or age and income.

Secondly, it can be used to forecast effects or impacts of changes. That is to say, multiple linear regression analysis helps us to understand how much the dependent variable will change when we change one independent variable while the others are held constant. A typical question would be “How much additional Y do I get for one additional unit of X?”

Thirdly, multiple linear regression analysis predicts trends and future values. The multiple linear regression analysis can be used to get point estimates. Typical questions might include, “*What will the price for gold be six months from now*? *What is the total effort for a task X*?”

**The Multiple Linear Regression in SPSS**

Our research question for the multiple linear regression is as follows:

*Can we explain the reading score that a student achieved on the standardized test with the five aptitude tests?*

First, we need to check whether there is a linear relationship between the independent variables and the dependent variable in our multiple linear regression model. To do so, we check the scatter plots. We could create five individual scatter plots using *Graphs/Chart Builder…*, or we can use the Matrix Scatter Plot in the menu *Graphs/Legacy Dialogs/Scatter/Dot…*

The scatter plots indicate a good linear relationship between reading score and the aptitude tests 1 to 5, where there seems to be a positive relationship for aptitude test 1 and a negative linear relationship for aptitude tests 2 to 5.
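The visual impression from a scatter plot can be backed up with a number. As a hedged illustration, the Python sketch below computes the Pearson correlation between the dependent variable and each predictor; the sign gives the direction and the magnitude gives the strength of the linear relationship. The scores are invented for the example and are not the data set used in this article.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for six students (illustration only).
reading = [52, 58, 61, 67, 70, 76]
aptitude1 = [40, 45, 47, 55, 58, 63]   # tends to rise with the reading score
aptitude2 = [80, 74, 73, 66, 60, 57]   # tends to fall as the reading score rises

r1 = pearson_r(aptitude1, reading)
r2 = pearson_r(aptitude2, reading)
print(f"aptitude1: r = {r1:.3f}, aptitude2: r = {r2:.3f}")
```

A correlation close to zero does not rule out a relationship; it only says there is no linear one, which is why the scatter plots remain worth inspecting.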

Secondly, we need to check for normality. This can either be done with an ‘eyeball’ test on the Q-Q plots or by using the 1-Sample K-S test to test the null hypothesis that the variable approximates a normal distribution. The K-S test is not significant for any of the variables, so we do not reject the null hypothesis and can assume normality.
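The quantity behind the K-S test is the largest vertical gap between the empirical distribution of the sample and the reference distribution. The Python sketch below is illustrative only: it computes that distance D against a normal distribution with the sample's own mean and standard deviation, and it stops short of a p-value, so it is not a substitute for the SPSS output. The function name and both samples are made up for the example.

```python
from math import log
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    data = sorted(sample)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, value in enumerate(data, start=1):
        cdf = fitted.cdf(value)
        # Compare the fitted CDF with the empirical CDF just after and
        # just before the jump at this data point.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Two deterministic samples of 50 values built from evenly spaced quantiles.
n = 50
probs = [(i - 0.5) / n for i in range(1, n + 1)]
bell_shaped = [NormalDist().inv_cdf(p) for p in probs]   # normal quantiles
skewed = [-log(1 - p) for p in probs]                    # exponential quantiles

d_bell = ks_statistic_normal(bell_shaped)
d_skewed = ks_statistic_normal(skewed)
print(f"bell-shaped D = {d_bell:.3f}, skewed D = {d_skewed:.3f}")
```

A small D means the sample sits close to the normal curve; the skewed sample produces a clearly larger D. Whether a given D is significant depends on the sample size and on the reference table or correction used, which this sketch leaves out.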

Multiple linear regression is found in SPSS in *Analyze/Regression/Linear…*

To answer our research question, we need to enter the variable reading scores as the dependent variable in our multiple linear regression model and the aptitude test scores (1 to 5) as independent variables. We also select *Stepwise* as the method. The default method for the multiple linear regression analysis is *Enter*, which means that all variables are forced to be in the model. But since over-fitting is a concern of ours, we want only the variables in the model that explain additional variance. Stepwise means that the variables are entered into the regression model in the order of their explanatory power, and a variable already in the model can be removed again if it no longer contributes.

In the dialog *Options…* we can define the criteria for stepwise inclusion in the model. We want to include variables in our multiple linear regression model whose probability of F is 0.05 or less, and we want to exclude them again if their probability of F rises to 0.10 or more. This dialog box also allows us to manage missing values (e.g., replace them with the mean).
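Stepwise selection rests on a partial F test for each candidate variable: how much does the residual sum of squares drop when the variable is added, relative to the remaining unexplained variance? The Python sketch below shows one forward step under simplifying assumptions. The names `sse` and `f_to_enter` and the data are hypothetical, and the sketch stops at the F statistic; turning it into a probability needs the F distribution, which is left out here.

```python
def sse(rows, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations (X'X | X'y), solved by Gauss-Jordan.
    A = [[sum(x[a] * x[b] for x in X) for b in range(k)]
         + [sum(x[a] * yi for x, yi in zip(X, y))] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [v - f * w for v, w in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
               for x, yi in zip(X, y))

def f_to_enter(data, y, included, candidate):
    """Partial F for adding column `candidate` to the columns in `included`."""
    def select(cols):
        return [[row[c] for c in cols] for row in data]
    sse_reduced = sse(select(included), y)
    sse_full = sse(select(included + [candidate]), y)
    # Residual degrees of freedom of the larger model:
    # n minus its predictors minus one for the intercept.
    df_resid = len(y) - (len(included) + 1) - 1
    return (sse_reduced - sse_full) / (sse_full / df_resid)

# Hypothetical data: y depends on column 0; column 1 is unrelated.
x0 = [1, 2, 3, 4, 5, 6, 7, 8]
x1 = [5, 3, 6, 2, 7, 1, 8, 4]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
data = [[a, b] for a, b in zip(x0, x1)]
y = [2 * a + e for a, e in zip(x0, noise)]

# First forward step: no variable is in the model yet.
f_values = {c: f_to_enter(data, y, [], c) for c in (0, 1)}
best = max(f_values, key=f_values.get)
print(best, f_values)
```

In a full stepwise procedure this step is repeated, and after each entry the variables already in the model are re-tested against the removal criterion.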

The dialog *Statistics…* allows us to include additional statistics that we need to assess the validity of our linear regression analysis. Even though it is not a time series, we include Durbin-Watson to check for autocorrelation, and we include the collinearity diagnostics to check for multicollinearity among the independent variables.
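Both diagnostics reduce to short formulas. As a rough sketch with made-up numbers, the Python code below computes the Durbin-Watson statistic from a list of residuals and the variance inflation factor (VIF) for the special case of exactly two predictors, where it equals 1 / (1 − r²) with r the correlation between them. The function names are illustrative, and the general VIF for more predictors requires regressing each predictor on all the others, which is not shown.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r_squared)

# Hypothetical residuals in observation order.
dw_trending = durbin_watson([-3, -2, -1, 1, 2, 3])      # well below 2: positive autocorrelation
dw_alternating = durbin_watson([1, -1, 1, -1, 1, -1])   # well above 2: negative autocorrelation

# Hypothetical predictor values with a correlation of 0.8.
vif = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])

print(round(dw_trending, 3), round(dw_alternating, 3), round(vif, 3))
```

The Durbin-Watson statistic ranges from 0 to 4. A VIF of 1 means a predictor is uncorrelated with the others, and larger values indicate that its coefficient estimate is inflated by overlap with the other predictors.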

In the dialog *Plots…*, we add the standardized residual plot (ZPRED on x-axis and ZRESID on y-axis), which allows us to eyeball homoscedasticity and normality of residuals.