Pearson’s Correlation Coefficient, The Most Commonly Used Bivariate Correlation


Posted August 27, 2013

It is difficult to overstate the value of the correlation coefficient to descriptive statistics.  The term “correlation coefficient” is almost always shorthand for the Pearson product-moment correlation coefficient.  There are several other well-known correlation coefficients, such as Spearman’s rho rank correlation coefficient, but it is usually safe to assume the correlation being referenced is a Pearson correlation coefficient, since it is the most commonly used measure of bivariate association.  The Pearson correlation coefficient is appropriate when the two variables of interest are scored using interval or ratio measures; associations between ordinal or nominal variables should be assessed using alternative methods.  Most statistical packages will nevertheless calculate the Pearson correlation coefficient for ordinal or nominal variables (assuming the categories are given numerical values), but the justification for using the statistic then relies on assumptions that are either untested or questionable, and the resulting value may not make real-world sense.

Before calculating a Pearson correlation coefficient, it is good practice to first visually inspect the relationship between the two variables with a scatterplot.  A bivariate scatterplot gives the researcher a better sense of the overall variability of the data and also makes visible any systematic relationship the correlation coefficient would be describing, such as a general positive or negative linear trend.  A non-linear association (e.g., a curvilinear relationship) may be present, but it would not be described well by the Pearson correlation coefficient.  Extreme values are also often readily apparent in a scatterplot and may represent outliers or data errors, to which Pearson’s correlation coefficient is sensitive.  If the distribution of either variable is skewed or contains outliers, then transforming the data or using an alternative measure of association may be warranted.  Overlooking these factors can distort the sample correlation coefficient and ultimately mislead the researcher when making inferences about the target population’s true Pearson correlation coefficient.

The use of Greek letters in statistics is common, and the Greek letter rho (ρ) represents the Pearson correlation coefficient for a population.  A researcher wants to know rho, but in practice it is typically impossible to conduct a census (that is, to measure the entire target population), so a smaller sample is taken from the target population instead, and the sample’s Pearson correlation coefficient is calculated and labeled with a lowercase r.  The value of the correlation coefficient always lies between negative one and positive one, inclusive.  These two extremes describe perfect negative and perfect positive linear relationships, respectively, while a coefficient equal to zero indicates the two variables have no linear relationship.  Regardless of the calculated value of the Pearson correlation coefficient, the researcher is not able to infer causation.  Correlation is nonetheless often mistaken for causation, and the confusion may be a result of the language and notation used when discussing the correlation of two variables.  For example, it may be said that knowing the values of one variable allows the researcher to “predict” the values of the other, as though one caused the other, but this type of prediction does not imply causation.  Also, labeling the two variables X and Y may be mistakenly interpreted as implying the dependent and independent variable roles that are customary in regression, but for the purposes of calculating a correlation coefficient it does not matter which variable is labeled X and which is labeled Y.
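
The properties described above can be checked with a short Python sketch. It assumes the standard product-moment formula for the sample coefficient (the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations); the function name `pearson_r` and the data are hypothetical and used only for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the two means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]      # perfect positive linear relationship
y_down = [10 - 3 * v for v in x]   # perfect negative linear relationship
y_other = [2, 7, 1, 8, 3]          # arbitrary values

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_xy = pearson_r(x, y_other)       # x treated as "X"
r_yx = pearson_r(y_other, x)       # labels swapped

print(r_up, r_down)
print(r_xy, r_yx)
```

The two perfectly linear data sets sit at the positive and negative ends of the range, and swapping which variable is passed first leaves the coefficient unchanged, since the formula treats the two variables symmetrically.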






