This Month
Published: 29 September 2015

Points of Significance

Association, correlation and causation

Naomi Altman¹ &
Martin Krzywinski²

Nature Methods volume 12, pages 899–900 (2015)

Abstract

Correlation implies association, but not causation. Conversely, causation implies association, but not correlation.

Main

Most studies include multiple response variables, and the dependencies among them are often of great interest. For example, we may wish to know whether the levels of mRNA and the matching protein vary together in a tissue, or whether increasing levels of one metabolite are associated with changed levels of another. This month we begin a series of columns about relationships between variables (or features of a system), starting with how pairwise dependencies can be characterized using correlation.

Two variables are independent when the value of one gives no information about the value of the other. For variables X and Y, we can express independence by saying that the chance of measuring any one of the possible values of X is unaffected by the value of Y, and vice versa, or by using conditional probability, P(X|Y) = P(X). For example, successive tosses of a coin are independent—for a fair coin, P(H) = 0.5 regardless of the outcome of the previous toss, because a toss does not alter the properties of the coin. In contrast, if a system is changed by observation, measurements may become associated or, equivalently, dependent. Cards drawn without replacement are not independent; when a red card is drawn, the probability of drawing a black card increases, because now there are fewer red cards.
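
The card example can be checked with exact arithmetic. The sketch below assumes a standard 52-card deck with 26 red and 26 black cards (the deck composition is an assumption; the text does not state it) and compares the marginal probability of drawing a black card with the conditional probability after one red card has been removed.

```python
from fractions import Fraction

# Assumed deck: 26 red and 26 black cards (standard 52-card deck).
RED, BLACK = 26, 26

# Marginal probability that a single draw is black.
p_black = Fraction(BLACK, RED + BLACK)

# Probability that the second card is black given that the first was red.
# Without replacement, 25 red cards remain among 51 cards.
p_black_given_red = Fraction(BLACK, (RED - 1) + BLACK)

# Independence would require P(black | red) == P(black); here it does not hold.
draws_are_independent = (p_black_given_red == p_black)

print(p_black, p_black_given_red, draws_are_independent)
```

For the coin, by contrast, the conditional and marginal probabilities are both 0.5, which is exactly the statement P(X|Y) = P(X).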

Association should not be confused with causality; if X causes Y, then the two are associated (dependent). However, associations can arise between variables in the presence (i.e., X causes Y) and absence (i.e., they have a common cause) of a causal relationship, as we've seen in the context of Bayesian networks¹. As an example, suppose we observe that people who drink more than 4 cups of coffee daily have a decreased chance of developing skin cancer. This does not necessarily mean that coffee confers resistance to cancer; one alternative explanation would be that people who drink a lot of coffee work indoors for long hours and thus have little exposure to the sun, a known risk. If this is the case, then the number of hours spent outdoors is a confounding variable: a cause common to both observations. In such a situation, a direct causal link cannot be inferred; the association merely suggests a hypothesis, such as a common cause, but does not offer proof. In addition, when many variables in complex systems are studied, spurious associations can arise. Thus, association does not imply causation.
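
A small simulation can make the confounding scenario concrete. In the sketch below, hours spent outdoors drives both coffee consumption and a skin cancer risk score, and coffee has no effect on risk at all. Every number in it (the coefficients, noise levels and sample size) is invented for illustration and is not taken from any study. The two observed variables come out negatively correlated, and the correlation largely disappears once the linear effect of the common cause is removed from each.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical model: the confounder influences both observed variables.
hours_outdoors = [rng.uniform(0, 8) for _ in range(n)]
coffee = [6 - 0.5 * h + rng.gauss(0, 1) for h in hours_outdoors]
risk = [0.1 * h + rng.gauss(0, 0.3) for h in hours_outdoors]

# Association between coffee and risk, ignoring the confounder.
r_marginal = pearson_r(coffee, risk)

# Same association after removing the linear effect of the confounder.
r_adjusted = pearson_r(residuals(coffee, hours_outdoors),
                       residuals(risk, hours_outdoors))

print(round(r_marginal, 2), round(r_adjusted, 2))
```

Adjusting for a measured confounder in this way only works when the confounder is known and recorded, which is rarely guaranteed in observational data.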

In everyday language, dependence, association and correlation are used interchangeably. Technically, however, association is synonymous with dependence and is different from correlation (Fig. 1a). Association is a very general relationship: one variable provides information about another. Correlation is more specific: two variables are correlated when they display an increasing or decreasing trend. For example, in an increasing trend, observing that X > μ_X implies that it is more likely that Y > μ_Y. Because not all associations are correlations, and because causality, as discussed above, can be connected only to association, we cannot equate correlation with causality in either direction.
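
A small numerical case may help separate the two ideas. In the sketch below, Y is completely determined by X through Y = X² (an illustrative choice, not an example from the column), so the variables are strongly associated. The Pearson coefficient, defined later in the column, is nevertheless zero when X is symmetric about zero, because there is no overall increasing or decreasing trend. The helper `pearson_r` is written out here only for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# X symmetric about zero: Y = X^2 is fully dependent on X but shows no trend.
x_symmetric = [-2, -1, 0, 1, 2]
y_symmetric = [v ** 2 for v in x_symmetric]
r_symmetric = pearson_r(x_symmetric, y_symmetric)

# Same function on non-negative X only: now there is an increasing trend.
x_positive = [0, 1, 2, 3, 4]
y_positive = [v ** 2 for v in x_positive]
r_positive = pearson_r(x_positive, y_positive)

print(r_symmetric, round(r_positive, 3))
```

The second case also shows that whether an association appears as a correlation can depend on the range of values sampled.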

**Figure 1: Correlation is a type of association and measures increasing or decreasing trends quantified using correlation coefficients.**

For quantitative and ordinal data, there are two primary measures of correlation: Pearson's correlation (r), which measures linear trends, and Spearman's (rank) correlation (s), which measures increasing and decreasing trends that are not necessarily linear (Fig. 1b). Like other statistics, these have population values, usually referred to as ρ. There are other measures of association that are also referred to as correlation coefficients, but which might not measure trends.

When “correlated” is used unmodified, it generally refers to Pearson's correlation, given by ρ(X, Y) = cov(X, Y)/(σ_X σ_Y), where cov(X, Y) = E((X − μ_X)(Y − μ_Y)). The correlation computed from the sample is denoted by r. Both variables must be on an interval or ratio scale; r cannot be interpreted if either variable is ordinal. For a linear trend, |r| = 1 in the absence of noise and decreases with noise, but it is also possible that |r| < 1 for perfectly associated nonlinear trends (Fig. 1b). In addition, data sets with very different associations may have the same correlation (Fig. 1c). Thus, a scatter plot should be used to interpret r. If either variable is shifted or scaled by a positive factor, r does not change: r(X, Y) = r(aX + b, Y) for a > 0 (a negative a flips the sign of r). However, r is sensitive to nonlinear monotone (increasing or decreasing) transformation. For example, when applying log transformation, r(X, Y) ≠ r(X, log(Y)). It is also sensitive to the range of X or Y values and can decrease as values are sampled from a smaller range.
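
The invariance properties in this paragraph are easy to verify numerically. The following sketch uses a small made-up data set (the values of `x` and `y` are hypothetical) and a hand-written `pearson_r` helper. It checks that a shift and positive rescaling of X leaves r unchanged, that multiplying X by a negative constant flips the sign of r, and that a log transformation of Y changes r.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation: covariance over the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with an increasing but nonlinear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 9.0, 21.0, 40.0]

r_xy = pearson_r(x, y)

# Shift and positive scaling of X: r is unchanged.
r_shift_scale = pearson_r([3.0 * v + 7.0 for v in x], y)

# Scaling by a negative constant reverses the trend, so r changes sign.
r_negated = pearson_r([-v for v in x], y)

# Nonlinear monotone transformation of Y: r changes.
r_log = pearson_r(x, [math.log(v) for v in y])

print(round(r_xy, 3), round(r_shift_scale, 3), round(r_negated, 3), round(r_log, 3))
```

In this particular data set the log transformation makes the trend closer to linear, so r increases; with other data it could just as well decrease.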

If an increasing or decreasing but nonlinear relationship is suspected, Spearman's correlation is more appropriate. It is a nonparametric method that converts the data to ranks and then applies the formula for the Pearson correlation. It can be used when X is ordinal and is more robust to outliers. It is also not sensitive to monotone increasing transformations because they preserve ranks—for example, s(X, Y) = s(X, log(Y)). For both coefficients, a smaller magnitude corresponds to increasing scatter or a non-monotonic relationship.
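
The rank-then-Pearson recipe described above can be sketched in a few lines. The helpers `ranks`, `pearson_r` and `spearman_s` are illustrative names, the data are hypothetical, and tied values receive the average of their ranks (a common convention, assumed here). The example checks that a log transformation of Y leaves s unchanged while changing Pearson's r.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_s(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: increasing overall, with some rank swaps.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 8.0, 4.0, 64.0, 32.0]
log_y = [math.log(v) for v in y]

s_raw, s_log = spearman_s(x, y), spearman_s(x, log_y)
r_raw, r_log = pearson_r(x, y), pearson_r(x, log_y)

print(round(s_raw, 4), round(s_log, 4), round(r_raw, 4), round(r_log, 4))
```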

It is possible to see large correlation coefficients even for random data (Fig. 2a). Thus, r should be reported together with a P value, which measures the degree to which the data are consistent with the null hypothesis that there is no trend in the population. For Pearson's r, to calculate the P value we use the test statistic √[d.f. × r²/(1 − r²)], which is t-distributed with d.f. = n − 2 when (X, Y) has a bivariate normal distribution (P for s does not require normality) and the population correlation is 0. Even more informative is a 95% confidence interval, often calculated using the bootstrap method². In Figure 2a we see that values with |r| < 0.63 are not statistically significant; their confidence intervals span zero. More important, there are very large correlations that are statistically significant (Fig. 2a) even though they are drawn from a population in which the true correlation is ρ = 0. These spurious cases (Fig. 2b) should be expected any time a large number of correlations is calculated; for example, a study with only 140 genes yields 9,730 pairwise correlations. Conversely, modest correlations between a few variables, known to be noisy, could be biologically interesting.
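
The arithmetic in this paragraph can be reproduced as a rough check. The sketch below computes the test statistic, inverts it to find the smallest |r| that reaches significance, and counts the pairwise correlations among 140 genes. The critical t values are standard two-sided 5% table values. The sample size n = 10 used for the threshold is an assumption (the text does not state the sample size behind Figure 2), chosen because it reproduces the quoted value of 0.63. The example r = 0.42 with n = 20 is the one discussed later in the column.

```python
import math

def t_statistic(r, n):
    """Test statistic sqrt(d.f. * r^2 / (1 - r^2)) with d.f. = n - 2."""
    df = n - 2
    return math.sqrt(df * r ** 2 / (1 - r ** 2))

def critical_r(t_crit, n):
    """Smallest |r| whose test statistic reaches t_crit (inverse of t_statistic)."""
    df = n - 2
    return t_crit / math.sqrt(df + t_crit ** 2)

# Two-sided 5% critical values of the t distribution (standard table values).
T_CRIT_DF8 = 2.306    # d.f. = 8, i.e. n = 10 (assumed sample size)
T_CRIT_DF18 = 2.101   # d.f. = 18, i.e. n = 20

# Threshold on |r| for significance at n = 10.
r_threshold_n10 = critical_r(T_CRIT_DF8, 10)

# r = 0.42 with n = 20 falls short of the critical value, so it is not significant.
t_example = t_statistic(0.42, 20)
is_significant = t_example > T_CRIT_DF18

# Number of distinct gene pairs among 140 genes.
n_pairs = math.comb(140, 2)

print(round(r_threshold_n10, 3), round(t_example, 3), is_significant, n_pairs)
```

At a 5% significance level and with no true correlations at all, roughly one in twenty of those 9,730 pairs would be expected to pass the threshold by chance.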

**Figure 2: Correlation coefficients fluctuate in random data, and spurious correlations can arise.**

Because P depends on both r and the sample size, it should never be used as a measure of the strength of the association. It is possible for a smaller r, whose magnitude can be interpreted as the estimated effect size, to be associated with a smaller P merely because of a large sample size³. Statistical significance of a correlation coefficient does not imply substantive and biologically relevant significance.

The value of both coefficients will fluctuate with different samples, as seen in Figure 2, as well as with the amount of noise and/or the sample size. With enough noise, the correlation coefficient can cease to be informative about any underlying trend. Figure 3a shows a perfectly correlated relationship (X, X) where X is a set of n = 20 points uniformly distributed in the range [0, 1] in the presence of different amounts of normally distributed noise with a standard deviation σ. As σ increases from 0.1 to 0.3 to 0.6, r(X, X + σ) decreases from 0.95 to 0.69 to 0.42. At σ = 0.6 the noise is high enough that r = 0.42 (P = 0.063) is not statistically significant—its confidence interval includes ρ = 0.
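
The effect of noise can be explored with a simulation along these lines. For X uniform on [0, 1] (variance 1/12) and independent noise with standard deviation σ, the population correlation between X and the noisy copy is 1/√(1 + 12σ²), which follows from the definition of ρ. The sketch compares that value with a simulated r. The sample size of 20,000 and the seed are arbitrary choices made so that the simulated values are stable; they are not the n = 20 used in the figure, whose values will therefore differ somewhat.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def expected_r(sigma):
    """Population correlation of X and X + noise for X ~ Uniform(0, 1)."""
    var_x = 1 / 12
    return 1 / math.sqrt(1 + sigma ** 2 / var_x)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20000               # much larger than the figure's n = 20, for stability
x = [rng.random() for _ in range(n)]

simulated = {}
for sigma in (0.1, 0.3, 0.6):
    # Add normally distributed noise with standard deviation sigma.
    noisy = [v + rng.gauss(0, sigma) for v in x]
    simulated[sigma] = pearson_r(x, noisy)
    print(sigma, round(expected_r(sigma), 3), round(simulated[sigma], 3))
```

The population values (about 0.94, 0.69 and 0.43) are close to the sample values quoted above, which come from a single small sample and are therefore subject to the sampling fluctuation the paragraph describes.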

**Figure 3: Effect of noise and sample size on Pearson's correlation coefficient r.**

When the linear trend is masked by noise, larger samples are needed to confidently measure the correlation. Figure 3b shows how the correlation coefficient varies for subsamples of size m drawn from samples at different noise levels: m = 4–20 (σ = 0.1), m = 4–100 (σ = 0.3) and m = 4–200 (σ = 0.6). When σ = 0.1, the correlation coefficient converges to 0.96 once m > 12. However, when noise is high, not only is the value of r lower for the full sample (e.g., r = 0.59 for σ = 0.3), but larger subsamples are needed to robustly estimate ρ.

The Pearson correlation coefficient can also be used to quantify how much fluctuation in one variable can be explained by its correlation with another variable. A previous discussion about analysis of variance⁴ showed that the effect of a factor on the response variable can be described as explaining the variation in the response; the response varied, and once the factor was accounted for, the variation decreased. The squared Pearson correlation coefficient r² has a similar role: it is the proportion of variation in Y explained by X (and vice versa). For example, r = 0.05 means that only 0.25% of the variance of Y is explained by X (and vice versa), and r = 0.9 means that 81% of the variance of Y is explained by X. This interpretation is helpful in assessments of the biological importance of the magnitude of r when it is statistically significant.
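
The "variance explained" reading of r² can be checked directly: for a least-squares straight-line fit of Y on X, the fraction 1 − SS_residual/SS_total equals r². The sketch below verifies this on a small hypothetical data set and reproduces the two percentages quoted above. The function names are illustrative.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def explained_fraction(x, y):
    """Share of the variation in y removed by a least-squares line on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - my) ** 2 for b in y)
    return 1 - ss_residual / ss_total

# Hypothetical data with a nearly linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_squared = pearson_r(x, y) ** 2
fraction = explained_fraction(x, y)

# Percentages quoted in the text: r = 0.05 and r = 0.9.
percent_small = 100 * 0.05 ** 2
percent_large = 100 * 0.9 ** 2

print(round(r_squared, 4), round(fraction, 4), percent_small, percent_large)
```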

Besides the correlation among features, we may also talk about the correlation among the items we are measuring. This is also expressed as the proportion of variance explained. In particular, if the units are clustered, then the intraclass correlation (which should be thought of as a squared correlation) is the percent variance explained by the clusters and given by σ_b²/(σ_b² + σ_w²), where σ_b² is the between-cluster variation and σ_b² + σ_w² is the total between- and within-cluster variation. This formula was discussed previously in an examination of the percentage of total variance explained by biological variation⁵ where the clusters are the technical replicates for the same biological replicate. As with the correlation between features, the higher the intraclass correlation, the less scatter in the data—this time measured not from the trend curve but from the cluster centers.
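
As a rough illustration of the formula σ_b²/(σ_b² + σ_w²), the sketch below estimates the two variance components from clustered data and forms their ratio. It assumes a balanced design (every cluster has the same number of measurements) and uses the usual one-way ANOVA moment estimators, which is one of several possible estimation methods and not necessarily the one used in the earlier column. The measurements are hypothetical.

```python
def intraclass_correlation(clusters):
    """ICC for a balanced layout: a list of equally sized lists of measurements."""
    g = len(clusters)       # number of clusters
    k = len(clusters[0])    # measurements per cluster (assumed equal)
    means = [sum(c) / k for c in clusters]
    grand_mean = sum(means) / g

    # Mean squares between and within clusters (one-way ANOVA).
    ms_between = k * sum((m - grand_mean) ** 2 for m in means) / (g - 1)
    ms_within = sum((v - m) ** 2
                    for c, m in zip(clusters, means)
                    for v in c) / (g * (k - 1))

    # Moment estimators of the variance components; negative estimates are set to 0.
    var_within = ms_within
    var_between = max(0.0, (ms_between - ms_within) / k)
    return var_between / (var_between + var_within)

# Hypothetical data: 3 biological replicates, each with 3 technical replicates.
tight_clusters = [[10.1, 9.9, 10.0], [12.0, 12.2, 11.8], [8.1, 7.9, 8.0]]
icc_tight = intraclass_correlation(tight_clusters)

# Clusters with identical means: none of the variation is between clusters.
icc_none = intraclass_correlation([[1.0, 3.0], [1.0, 3.0]])

print(round(icc_tight, 3), icc_none)
```

The function does not handle the degenerate case in which every measurement is identical, where both variance components are zero and the ratio is undefined.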

Association is the same as dependence and may be due to direct or indirect causation. Correlation implies specific types of association such as monotone trends or clustering, but not causation. For example, when the number of features is large compared with the sample size, large but spurious correlations frequently occur. Conversely, when there are a large number of observations, small and substantively unimportant correlations may be statistically significant.

References

1. Puga, J.L., Krzywinski, M. & Altman, N. Nat. Methods 12, 799–800 (2015).
2. Kulesa, A., Krzywinski, M., Blainey, P. & Altman, N. Nat. Methods 12, 477–478 (2015).
3. Krzywinski, M. & Altman, N. Nat. Methods 10, 1139–1140 (2013).
4. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).
5. Altman, N. & Krzywinski, M. Nat. Methods 12, 5–6 (2015).

Author information

Authors and Affiliations

Naomi Altman is a Professor of Statistics at The Pennsylvania State University.
Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Altman, N., Krzywinski, M. Association, correlation and causation. Nat Methods 12, 899–900 (2015). https://doi.org/10.1038/nmeth.3587

Issue Date: October 2015
DOI: https://doi.org/10.1038/nmeth.3587
