In this post we will discuss univariate and multivariate outliers. A univariate outlier is a data point that consists of an extreme value on one variable. A multivariate outlier is a combination of unusual scores on at least two variables. Both types of outliers can influence the outcome of statistical analyses. Outliers exist for four reasons. First, incorrect data entry can cause data to contain extreme cases. Second, failure to indicate codes for missing values in a dataset can make those codes appear as extreme scores. Third, the case may not have come from the intended sample. Finally, the variable in question may simply have a more extreme distribution in the population than a normal distribution.
In many parametric analyses, researchers must remove univariate and multivariate outliers from the dataset. When looking for univariate outliers in continuous variables, researchers can use standardized values (z scores). If the analysis does not involve a grouping variable (e.g., linear regression, canonical correlation, or SEM), researchers should assess the dataset for outliers as a whole. If the analysis does involve a grouping variable (e.g., MANOVA, ANOVA, ANCOVA, or logistic regression), researchers should assess the data for outliers separately within each group. For continuous variables, researchers can treat as univariate outliers any cases whose standardized scores exceed an absolute value of 3.29 (p < .001, two-tailed). However, researchers should exercise caution with extremely large samples, as a few cases beyond this cutoff are expected by chance in such datasets. Once researchers have removed univariate outliers from a dataset, they can assess and remove multivariate outliers.
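As a concrete illustration, here is a minimal sketch of z-score screening at the |z| > 3.29 cutoff. The data and variable names are hypothetical, and with a grouping variable the same screening would be run separately within each group:

```python
import numpy as np

def univariate_outliers(x, cutoff=3.29):
    """Flag cases whose standardized score exceeds the cutoff in absolute value."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)  # sample-based z scores
    return np.abs(z) > cutoff

# Hypothetical continuous variable with one extreme case appended
rng = np.random.default_rng(0)
scores = np.append(rng.normal(50, 10, 200), 120.0)

flags = univariate_outliers(scores)
print(np.where(flags)[0])  # the appended extreme case should be flagged
```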
Researchers can identify multivariate outliers using Mahalanobis distance, which measures the distance of a data point from the centroid of the other cases. The centroid is the point defined by the intersection of the means of all the variables under assessment. Each case is treated as a point in this multivariate space, and multivariate outliers are cases that lie an extreme distance from the centroid. The squared distances are evaluated against the χ2 distribution at p < .001, with degrees of freedom equal to the number of variables. Researchers can also identify multivariate outliers using leverage, discrepancy, and influence.
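To make this concrete, the sketch below (using NumPy and SciPy, with hypothetical bivariate data) computes each case's squared Mahalanobis distance from the centroid and flags cases that exceed the χ2 critical value at p < .001. Note that, for simplicity, the centroid and covariance are estimated from all cases rather than leaving each case out in turn:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Flag multivariate outliers via squared Mahalanobis distance vs. chi-square cutoff."""
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)                # intersection of the variable means
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - centroid
    d2 = np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs)  # squared distance per case
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])           # df = number of variables
    return d2, d2 > cutoff

# Hypothetical correlated data plus one case that is moderate on each
# variable separately but extreme on the combination
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100)
X = np.vstack([X, [2.5, -2.5]])

d2, flags = mahalanobis_outliers(X)
print(np.where(flags)[0])  # the appended case should be flagged
```

The appended case illustrates the definition of a multivariate outlier: neither coordinate is extreme on its own, but the combination runs against the positive correlation between the variables.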
Leverage is related to Mahalanobis distance but is measured on a different scale, so the χ2 distribution does not apply. A large leverage score indicates that the case is far from the others on the predictors, although it may still lie on the same regression line. Discrepancy assesses the extent to which a case is in line with the other cases. Leverage and discrepancy together determine influence, which assesses how the regression coefficients change when a case is removed. Cases with influence scores (e.g., Cook's distance) greater than 1.00 are likely to be considered outliers.
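Assuming a simple regression context, the sketch below uses statsmodels (with hypothetical data) to obtain leverage values and Cook's distance, a common influence measure for which values above 1.00 are often treated as problematic:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: one case with both high leverage and high discrepancy
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 50)
y = 2 * x + rng.normal(0, 1, 50)
x = np.append(x, 4.0)   # high leverage: far out on the predictor
y = np.append(y, -8.0)  # high discrepancy: far from the regression line

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag    # leverage (hat values) per case
cooks_d = influence.cooks_distance[0]   # Cook's distance per case

print(np.where(cooks_d > 1.0)[0])  # cases exceeding the common 1.00 cutoff
```

Because the appended case is both far out on the predictor and far from the line, it combines high leverage with high discrepancy, and its Cook's distance should exceed the 1.00 cutoff.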