# Data Analysis Plan

The data analysis plan articulates how your data will be cleaned, transformed, and analyzed. Scientific research should be replicable, and to make your study replicable you need to give the reader a roadmap of how you managed your data and conducted the analyses. Each of the following areas can be included in a data analysis plan.

**Cleaning the Data**

Cleaning the data involves removing univariate and multivariate outliers, dealing with missing data, and assessing normality.

*Univariate outliers*

State that the variables will be assessed for univariate outliers. A univariate outlier is an observation that lies more than 3.29 standard deviations above or below the variable’s mean. This is easily assessed by standardizing the scores of a variable (i.e., rescaling the variable’s scores to have a mean of zero and a standard deviation of 1) and looking for any observation whose standardized score is beyond ±3.29.
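As a rough illustration of the standardizing step, here is a minimal Python sketch using only the standard library. The function name `univariate_outliers` and the sample scores are hypothetical, and the sketch assumes the sample standard deviation (n − 1 denominator) is used for standardizing.

```python
from statistics import mean, stdev

def univariate_outliers(scores, cutoff=3.29):
    """Return (index, z) pairs for observations whose |z| exceeds the cutoff."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    z_scores = [(x - m) / s for x in scores]
    return [(i, z) for i, z in enumerate(z_scores) if abs(z) > cutoff]

# Hypothetical data: 29 typical scores plus one extreme value at the end.
scores = [50 + (i % 5) for i in range(29)] + [120]
flagged = univariate_outliers(scores)
for index, z in flagged:
    print(f"observation {index} has z = {z:.2f}")
```

With these made-up scores, only the last observation is flagged. Note that in very small samples a single score cannot reach a z of 3.29 at all, so the rule is more useful with larger datasets.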

*Multivariate outliers*

When conducting multivariate analyses, state that the sets of variables will be assessed for multivariate outliers. Multivariate outliers are observations with an unusual combination of scores on two or more variables. To assess for multivariate outliers, you can conduct a regression with the observation ID number as the dependent variable and the variables being assessed as the predictors, and request Mahalanobis’ distance for each observation. Then compare each observation’s Mahalanobis’ distance score with the chi-square critical value at the p = .001 level, where the degrees of freedom equal the number of variables. An observation whose distance exceeds that critical value is treated as a multivariate outlier.
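The sketch below computes squared Mahalanobis distances directly rather than through the regression procedure described above, and it is limited to two variables so that it can run with the Python standard library alone. The function name `mahalanobis_sq_2d` and the data are hypothetical. For two variables (df = 2) the chi-square critical value at p = .001 has a closed form, −2·ln(.001), which is about 13.82; for more variables you would look the critical value up in a chi-square table or statistical package.

```python
import math
from statistics import mean

def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d' S^-1 d written out with the closed-form inverse of a 2x2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical data: y tracks x closely, except the last observation,
# which is ordinary on each variable alone but unusual in combination.
xs = list(range(1, 30)) + [25]
ys = [x + (1 if x % 2 else -1) for x in range(1, 30)] + [5]

# For df = 2 the chi-square survival function is exp(-x / 2),
# so the critical value at p = .001 is -2 * ln(.001).
critical = -2 * math.log(0.001)

d2 = mahalanobis_sq_2d(xs, ys)
flagged = [i for i, d in enumerate(d2) if d > critical]
print(flagged)
```

The last observation would not stand out on either variable by itself, which is why a multivariate check is needed in addition to the univariate one.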

*Missing data*

Articulate how missing data will be handled. Missing data is the absence of an observation on a variable. There are a few remedies: dropping the observation with the missing data, mean substitution, and multiple imputation (using Intellectus Statistics).
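The first two remedies can be sketched in a few lines of Python. The records, variable names, and helper functions below are hypothetical, and `None` is assumed to mark a missing value. Multiple imputation is not shown because it is normally carried out with dedicated statistical software.

```python
from statistics import mean

# Hypothetical records; None marks a missing observation.
records = [
    {"id": 1, "age": 34, "score": 72},
    {"id": 2, "age": None, "score": 65},
    {"id": 3, "age": 29, "score": None},
    {"id": 4, "age": 41, "score": 80},
]

def listwise_delete(rows, variables):
    """Drop every observation that is missing any of the listed variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def mean_substitute(rows, variable):
    """Replace missing values of one variable with the mean of its observed values."""
    observed = [r[variable] for r in rows if r[variable] is not None]
    fill = mean(observed)
    return [{**r, variable: fill if r[variable] is None else r[variable]} for r in rows]

complete_cases = listwise_delete(records, ["age", "score"])
age_filled = mean_substitute(records, "age")
```

Here `complete_cases` keeps only observations 1 and 4, while `age_filled` keeps all four observations and fills the missing age with the mean of the three observed ages. Whichever remedy you choose, name it in the plan so the reader can reproduce it.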

*Normality*

Most parametric statistics (e.g., ANOVAs) assume normality. Normality refers to the shape of the distribution of scores (i.e., whether it follows a normal bell curve). To assess for normality, a researcher can examine the skewness and kurtosis of a variable, or conduct a one-sample Kolmogorov-Smirnov (KS) test. The KS test will report whether the distribution of the data is significantly different from a normal curve.
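To make these quantities concrete, the following Python sketch computes skewness, excess kurtosis, and the KS statistic with the standard library. The function names are hypothetical, and the skewness and kurtosis use simple moment-based formulas; statistical packages often apply small-sample corrections, so their values may differ slightly. Only the KS statistic is computed here, not its p-value, which you would normally obtain from a statistical package.

```python
from statistics import NormalDist, mean, pstdev

def skewness(xs):
    """Moment-based skewness: 0 for a perfectly symmetric distribution."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal curve scores about 0."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

def ks_statistic(xs):
    """Largest gap between the empirical CDF and a normal CDF fitted to xs."""
    n = len(xs)
    fitted = NormalDist(mean(xs), pstdev(xs))
    d = 0.0
    for i, x in enumerate(sorted(xs)):
        cdf = fitted.cdf(x)
        # Check the gap just after and just before each step of the empirical CDF.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Hypothetical scores with a long right tail.
scores = [1, 1, 1, 2, 2, 3, 3, 4, 6, 15]
print(skewness(scores), excess_kurtosis(scores), ks_statistic(scores))
```

One caution: when the mean and standard deviation of the normal curve are estimated from the same data, as in this sketch, the standard KS p-value tends to be too lenient, so treat the result as one piece of evidence alongside skewness and kurtosis.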

When the data is not normally distributed, a transformation of the data can be appropriate. Some common transformations are the square root, logarithmic, and inverse.
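The three transformations named above are one-liners in Python. This is a minimal sketch with hypothetical function names, and it assumes every score is strictly positive, since the logarithm and inverse are undefined at zero and the square root is undefined for negative values. The base-10 logarithm is an arbitrary choice here; the natural logarithm is equally common.

```python
import math

def sqrt_transform(xs):
    """Square root transformation."""
    return [math.sqrt(x) for x in xs]

def log_transform(xs):
    """Base-10 logarithmic transformation."""
    return [math.log10(x) for x in xs]

def inverse_transform(xs):
    """Inverse (reciprocal) transformation; note that it reverses the order of scores."""
    return [1 / x for x in xs]

# Hypothetical right-skewed scores.
scores = [1, 4, 9, 100]
print(sqrt_transform(scores))
print(log_transform(scores))
print(inverse_transform(scores))
```

After transforming, re-assess normality on the transformed scores and report which transformation was applied.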

**Describing the Specific Statistical Tests to Examine Each of the Research Questions**

The selection of the statistical analyses is based on two things: the way the hypothesis is stated in statistical language and the level of measurement of the variables.

**Hypothesis**

The way the researcher states the hypothesis makes a difference in the selection of the data analysis test. Here are three null hypothesis examples:

- (Example 1) Variable A does not relate to Variable B.
  - Example 1 tends to be stated in correlation or chi-square language.
- (Example 2) Variable A does not predict Variable B.
  - Example 2 is stated in regression language.
- (Example 3) There are no differences on Variable A by Variable B.
  - Example 3 is stated in ANOVA or Mann-Whitney language.

How is one to choose the precise data analysis test? In addition to the phraseology of differences, prediction, or relationship, the other consideration in the test selection is the level of measurement of each of the variables.

**Level of Measurement of the Variables to Select the Correct Data Analysis**

In the hypotheses above, the level of measurement of the variables is a key factor in selecting the correct data analysis.

- In example 1, if the variables are both categorical, the correct analysis would be a chi-square test, while if both variables are interval-level, a Pearson correlation would most likely be the correct analysis to examine relationships.
- In example 2, regression is the appropriate test (i.e., examining the influence of a variable on another variable). Linear regression is the correct analysis if the dependent variable is interval-level, logistic regression if the dependent variable is dichotomous, and multinomial logistic regression if the dependent variable has three or more categories.
- In example 3, if the dependent variable is interval, an ANOVA is appropriate whereas an ordinal dependent variable would lead one to select the Mann-Whitney as the appropriate test.
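The selection rules above can be summarized as a simple lookup. The Python sketch below is only a restatement of this section, not a complete decision guide; the table name, function name, and category labels are hypothetical. For relationship hypotheses the level applies to both variables, and for prediction and difference hypotheses it applies to the dependent variable.

```python
# (hypothesis type, level of measurement) -> test named in this section.
TEST_TABLE = {
    ("relationship", "categorical"): "chi-square test",
    ("relationship", "interval"): "Pearson correlation",
    ("prediction", "interval"): "linear regression",
    ("prediction", "dichotomous"): "logistic regression",
    ("prediction", "three or more categories"): "multinomial logistic regression",
    ("difference", "interval"): "ANOVA",
    ("difference", "ordinal"): "Mann-Whitney test",
}

def choose_test(hypothesis, level):
    """Look up the test for a hypothesis type and level of measurement."""
    try:
        return TEST_TABLE[(hypothesis, level)]
    except KeyError:
        raise ValueError(
            f"No rule listed for a {hypothesis!r} hypothesis with {level!r} data"
        ) from None

print(choose_test("prediction", "dichotomous"))
```

Combinations that this section does not cover raise an error rather than guessing, which mirrors the point that both the hypothesis wording and the level of measurement must be known before a test is chosen.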

**Putting the Data Analysis All Together**

In the data analysis plan, first address the data cleaning and transformation procedures, then discuss the specific data analysis tests to be conducted. Be sure to state each hypothesis in the language that matches your intent: to examine relationships, to predict, or to examine differences on a variable by another variable.
