Assumptions of Multiple Linear Regression
Multiple linear regression analysis makes several key assumptions:
- A Linear Relationship between the outcome variable and the independent variables. A plot of the standardized residuals versus the predicted Y' values shows whether the relationship is linear or curvilinear.
- Multivariate Normality--Multiple regression assumes that the residuals are normally distributed.
- No Multicollinearity--The independent variables are assumed not to be highly correlated with each other. This assumption is tested with the Variance Inflation Factor (VIF) statistic.
- Homoscedasticity--The variance of the error terms is assumed to be similar across the values of the independent variables. As with the linear relationship assumption, a plot of the standardized residuals versus the predicted Y' values shows whether the points are equally distributed across all values of the independent variables.
The Intellectus Statistics tool automatically includes the assumption tests and plots when conducting a regression.
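The residual plot described in the list above can also be prepared by hand. The following is a minimal sketch, assuming NumPy is available and taking "standardized residual" to mean the raw residual divided by the residual standard error; the function name `standardized_residuals` and the sample data are hypothetical and only serve the illustration.

```python
import numpy as np

def standardized_residuals(X, y):
    """Fit ordinary least squares with an intercept.

    Returns (predicted values, standardized residuals).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)          # allow a single predictor
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    predicted = design @ coef
    residuals = y - predicted
    # Residual standard error with n - k - 1 degrees of freedom
    sigma = np.sqrt(residuals @ residuals / (n - k - 1))
    return predicted, residuals / sigma

# Illustrative data: a curved relationship fitted with a straight line
x = np.arange(11.0)
y = x ** 2
predicted, z = standardized_residuals(x, y)
```

Plotting `z` against `predicted` for this data would show a U shape (positive residuals at both ends, negative in the middle), which is the curvilinear pattern the linearity check is meant to catch. A roughly even band of points around zero would instead support both linearity and homoscedasticity.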
Multiple linear regression needs at least two independent variables, which can be measured on a nominal, ordinal, or interval/ratio scale. A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable in the analysis. In the simplest case of just two independent variables, that means at least 40 cases (n ≥ 40). On the IntellectusStatistics.com login page, the exact sample size is calculated for different numbers of independent variables.
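The rule of thumb is simple arithmetic. As a small sketch (the function name `minimum_cases` is hypothetical, and the result is only the rule-of-thumb figure, not an exact power calculation):

```python
def minimum_cases(n_predictors, cases_per_predictor=20):
    """Rule-of-thumb minimum sample size for a multiple regression."""
    if n_predictors < 2:
        raise ValueError("multiple regression needs at least two predictors")
    return n_predictors * cases_per_predictor

two_predictors = minimum_cases(2)    # 40
five_predictors = minimum_cases(5)   # 100
```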
Multiple Linear Regression Assumptions
First, multiple linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since multiple linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatterplots, which can reveal, for example, a curvilinear relationship or a heteroscedastic relationship.
Second, multiple linear regression analysis requires that the errors between observed and predicted values (i.e., the residuals of the regression) be normally distributed. This assumption can best be checked by plotting the residual values on a histogram with a fitted normal curve or by reviewing a Q-Q plot. Normality can also be checked with a goodness-of-fit test (e.g., the Kolmogorov-Smirnov test), though this test must be conducted on the residuals themselves. When the data is not normally distributed, a non-linear transformation (e.g., a log-transformation) might correct this issue if one or more of the individual predictor variables are to blame, though this does not directly address the normality of the residuals.
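To make the goodness-of-fit idea concrete, here is a minimal sketch that computes the Kolmogorov-Smirnov distance between standardized residuals and the standard normal distribution, using only the Python standard library. The function name `ks_statistic_normal` and both sample lists are hypothetical. The sketch returns only the distance, not a p-value. Because the mean and standard deviation are estimated from the residuals themselves, the usual Kolmogorov-Smirnov critical values are only approximate here.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(residuals):
    """KS distance between standardized residuals and the standard normal CDF."""
    m, s = mean(residuals), stdev(residuals)
    z = sorted((r - m) / s for r in residuals)
    n = len(z)
    cdf = NormalDist().cdf
    d = 0.0
    for i, value in enumerate(z, start=1):
        f = cdf(value)
        # Compare the normal CDF with the empirical CDF just after and just before each point
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Illustrative "residuals": evenly spaced quantiles of a normal and of a skewed distribution
n = 50
bell_shaped = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
skewed = [-math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]

d_bell = ks_statistic_normal(bell_shaped)
d_skewed = ks_statistic_normal(skewed)
```

The skewed sample yields a clearly larger distance than the bell-shaped one, which is the kind of difference a histogram or Q-Q plot would also reveal visually.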
Third, multiple linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are not independent of each other.
Multicollinearity is checked against 4 key criteria:
1) Correlation matrix – When computing the matrix of Pearson's bivariate correlations among all independent variables, the correlation coefficients need to be smaller than .80 in magnitude.
2) Tolerance – The tolerance measures how much of one independent variable's variance is not explained by the other independent variables; it is calculated with an initial linear regression of that variable on all the others. Tolerance is defined as T = 1 – R² for each of these first-step regression analyses. With T < 0.2 there might be multicollinearity in the data, and with T < 0.01 there certainly is.
3) Variance Inflation Factor (VIF) – The variance inflation factor of the linear regression is defined as VIF = 1/T. Similarly, VIF > 10 is an indication that multicollinearity is present.
4) Condition Index – The condition index is calculated using a factor analysis on the independent variables. Values of 10-30 indicate moderate multicollinearity in the regression variables, whereas values > 30 indicate strong multicollinearity.
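The first three criteria can be sketched in a few lines, assuming NumPy is available. The function name `tolerance_and_vif` and the simulated predictors are hypothetical; `x2` is deliberately built as a near copy of `x1` so that the diagnostics have something to flag. The condition index is left out of this sketch.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, VIF) arrays with one entry per column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tolerance = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # First-step regression: predictor j on all other predictors
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        tolerance[j] = 1.0 - r_squared      # T = 1 - R^2
    return tolerance, 1.0 / tolerance       # VIF = 1 / T

# Simulated predictors (illustrative): x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)   # Pearson correlation matrix of the predictors
tol, vif = tolerance_and_vif(X)
```

In this setup the correlation between `x1` and `x2` exceeds .80 and both have a VIF well above 10, while `x3` has a VIF close to 1.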
If multicollinearity is found in the data, one remedy might be centering the data. To center the data, simply subtract the mean score from each observation.
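Centering is a one-line operation. The sketch below uses only the standard library; the helper names `center` and `pearson_r` are hypothetical. The squared-term scenario is an illustrative assumption rather than something taken from the text above: it is chosen because centering tends to help most when a predictor appears alongside its own square or an interaction term.

```python
from statistics import mean

def center(values):
    """Subtract the mean score from each observation."""
    m = mean(values)
    return [v - m for v in values]

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    ac, bc = center(a), center(b)
    numerator = sum(p * q for p, q in zip(ac, bc))
    denominator = (sum(p * p for p in ac) * sum(q * q for q in bc)) ** 0.5
    return numerator / denominator

x = [float(v) for v in range(1, 11)]
x_centered = center(x)

# Correlation between a predictor and its square, before and after centering
r_raw = pearson_r(x, [v ** 2 for v in x])
r_centered = pearson_r(x_centered, [v ** 2 for v in x_centered])
```

Here the raw predictor and its square are almost perfectly correlated, while the centered predictor and its square are uncorrelated because the centered values are symmetric around zero.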
Another way to tackle the problem of multicollinearity in multiple linear regression is to conduct a factor analysis before the regression analysis and to rotate the factors to ensure independence of the factors in the linear regression analysis.
Fourth, multiple linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other, in other words, when the value of y(x+1) is not independent of the value of y(x). This typically occurs, for instance, in stock prices, where today's price is not independent of yesterday's price.
While a scatterplot lets you check for autocorrelation, you can test the multiple linear regression model for autocorrelation with the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly autocorrelated. While d can assume values between 0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb, values of 1.5 < d < 2.5 show that there is no autocorrelation in the multiple linear regression data. However, the Durbin-Watson test is limited to linear autocorrelation and direct neighbors (so-called first-order effects).
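The statistic itself is the sum of squared differences between neighboring residuals divided by the sum of squared residuals. A minimal sketch in plain Python follows; the function names and the two toy residual series are hypothetical, and the residuals are assumed to be in time (or collection) order.

```python
def durbin_watson(residuals):
    """Durbin-Watson d for residuals given in time order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def passes_rule_of_thumb(d):
    """Rule of thumb from the text: 1.5 < d < 2.5 suggests no autocorrelation."""
    return 1.5 < d < 2.5

alternating = [1.0, -1.0, 1.0, -1.0]             # neighbors flip sign: negative autocorrelation
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # neighbors tend to agree: positive autocorrelation

d_alternating = durbin_watson(alternating)   # 12 / 4 = 3.0
d_persistent = durbin_watson(persistent)     # 4 / 6, about 0.67
```

Both toy series fall outside the 1.5 to 2.5 band, one on each side of 2.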
The last assumption of multiple linear regression analysis is homoscedasticity. A scatter plot is a good way to check whether the data are homoscedastic (that is, whether the variance of the error terms is the same along the regression line). If the data are heteroscedastic, the scatter plot typically shows a spread of points that widens or narrows along the regression line.
The Goldfeld-Quandt Test can test for heteroscedasticity. The test splits the multiple linear regression data into high-value and low-value groups to see whether the error variances of the two samples are significantly different. If heteroscedasticity is present in our multiple linear regression model, a non-linear correction might fix the problem, but might sneak multicollinearity into the model.
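The splitting idea can be sketched as follows, assuming NumPy is available and a single predictor used for ordering. The function names, the `drop_fraction` parameter, and the constructed data are hypothetical. The sketch returns only the F ratio of the two residual variances; judging significance would additionally require comparing it with an F distribution.

```python
import numpy as np

def residual_sum_of_squares(x, y):
    """RSS of a straight-line least-squares fit of y on x."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def goldfeld_quandt_f(x, y, drop_fraction=0.0):
    """F ratio: residual variance of the high-x group over the low-x group."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    n_group = (n - int(n * drop_fraction)) // 2   # optionally drop middle cases
    low, high = slice(0, n_group), slice(n - n_group, n)
    df = n_group - 2                              # intercept and slope per group
    rss_low = residual_sum_of_squares(x[low], y[low])
    rss_high = residual_sum_of_squares(x[high], y[high])
    return (rss_high / df) / (rss_low / df)

# Constructed data (illustrative): error size grows with x in the first series only
x = np.arange(1, 41, dtype=float)
signs = np.where(np.arange(40) % 2 == 0, 1.0, -1.0)
y_hetero = 2 + 3 * x + 0.5 * x * signs
y_homo = 2 + 3 * x + 0.5 * signs

f_hetero = goldfeld_quandt_f(x, y_hetero)
f_homo = goldfeld_quandt_f(x, y_homo)
```

A ratio near 1, as for `y_homo`, is consistent with homoscedasticity, whereas a ratio well above 1, as for `y_hetero`, points to error variance that increases with the predictor.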