Points of Significance: Association, correlation and causation

Journal name: Nature Methods
Volume: 12
Pages: 899–900
Year published: 2015
DOI: 10.1038/nmeth.3587

Abstract

Correlation implies association, but not causation. Conversely, causation implies association, but not correlation.

Figures

    Figure 1: Correlation is a type of association and measures increasing or decreasing trends quantified using correlation coefficients.

    (a) Scatter plots of associated (but not correlated), non-associated and correlated variables. In the lower association example, variance in y is increasing with x. (b) The Pearson correlation coefficient (r, black) measures linear trends, and the Spearman correlation coefficient (s, red) measures increasing or decreasing trends. (c) Very different data sets may have similar r values. Descriptors such as curvature or the presence of outliers can be more specific.
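The distinctions drawn in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the code behind the figure: the data (a parabola and an exponential on a symmetric grid) and the helper names `pearson`, `ranks` and `spearman` are assumptions chosen for the example. The parabola is perfectly associated with x yet has essentially zero correlation, while the exponential is monotone, so its Spearman coefficient is 1 even though its Pearson coefficient is clearly below 1.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r: strength of a linear trend."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with v[order[i]].
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation s: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Illustrative data (an assumption, not the figure's data).
x = [i / 10 for i in range(-20, 21)]        # symmetric grid on [-2, 2]
y_quadratic = [a ** 2 for a in x]           # associated, not correlated
y_monotone = [math.exp(2 * a) for a in x]   # increasing but nonlinear

r_quad, s_quad = pearson(x, y_quadratic), spearman(x, y_quadratic)
r_mono, s_mono = pearson(x, y_monotone), spearman(x, y_monotone)

print(f"quadratic:   r = {r_quad:.3f}, s = {s_quad:.3f}")
print(f"exponential: r = {r_mono:.3f}, s = {s_mono:.3f}")
```

Computing Spearman's coefficient as Pearson's r on ranks is the standard definition; the tie handling matters here because the parabola takes each value twice.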

    Figure 2: Correlation coefficients fluctuate in random data, and spurious correlations can arise.

    (a) Distribution (left) and 95% confidence intervals (right) of correlation coefficients of 10,000 n = 10 samples of two independent normally distributed variables. Statistically significant coefficients (α = 0.05) and corresponding intervals that do not include r = 0 are highlighted in blue. (b) Samples with the three largest and smallest correlation coefficients (statistically significant) from a.
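A simulation along the lines of panel a can be sketched as follows. It is an approximation under stated assumptions, not the authors' code: the seed is arbitrary, `pearson` is a helper defined here, and significance is judged with the usual t-test for a correlation coefficient, using the tabulated two-sided 5% critical value of Student's t with 8 degrees of freedom (2.306). With independent variables, roughly 5% of samples should still cross the threshold, which is what makes spurious correlations unavoidable.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)                 # arbitrary seed, for reproducibility only
N_SAMPLES, N = 10_000, 10      # 10,000 samples of size n = 10, as in the caption

# Under the null hypothesis, t = r * sqrt((n - 2) / (1 - r^2)) follows
# Student's t with n - 2 degrees of freedom. Inverting for r gives the
# smallest |r| that is significant at alpha = 0.05.
T_CRIT = 2.306                                        # t(0.975, df = 8), from tables
R_CRIT = T_CRIT / math.sqrt(T_CRIT ** 2 + (N - 2))    # about 0.632

rs = []
for _ in range(N_SAMPLES):
    x = [random.gauss(0, 1) for _ in range(N)]
    y = [random.gauss(0, 1) for _ in range(N)]        # independent of x
    rs.append(pearson(x, y))

frac_significant = sum(abs(r) > R_CRIT for r in rs) / N_SAMPLES

print(f"critical |r| for n = {N}: {R_CRIT:.3f}")
print(f"fraction of samples with significant r: {frac_significant:.3f}")
print(f"most extreme coefficients: {min(rs):.2f}, {max(rs):.2f}")
```

The extremes printed on the last line correspond to the kind of samples shown in panel b: large coefficients that arise from chance alone.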

    Figure 3: Effect of noise and sample size on Pearson's correlation coefficient r.

    (a) r of an n = 20 sample of (X, X + ε), where ε is the normally distributed noise scaled to standard deviation σ. The amount of scatter and value of r at three values of σ are shown. The shaded area is the 95% confidence interval. Intervals that do not include r = 0 are highlighted in blue (σ < 0.58), and those that do are highlighted in gray and correspond to nonsignificant r values (ns; e.g., r = 0.42 with P = 0.063). (b) As sample size increases, r becomes less variable, and the estimate of the population correlation improves. Shown are samples with increasing size and noise: n = 20 (σ = 0.1), n = 100 (σ = 0.3) and n = 200 (σ = 0.6). Traces at the bottom show r calculated from a subsample, created from the first m values of each sample.
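Both effects in this caption can be reproduced with a short simulation. The sketch below assumes that X is standard normal and that the noise ε is independent of X; the caption does not state the scale of X, so the numbers are not expected to match the figure. Under these assumptions the population correlation of (X, X + ε) is 1/sqrt(1 + σ²), which the average sample r should approach, and the spread of r across repeated samples should shrink as n grows. The names `sample_r`, `mean_r_by_sigma` and `sd_r_by_n`, the seed, and the number of repetitions are all illustrative choices.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sample_r(n, sigma, rng):
    """r for one sample of (X, X + eps), X ~ N(0, 1), eps ~ N(0, sigma^2)."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [a + rng.gauss(0, sigma) for a in x]
    return pearson(x, y)

rng = random.Random(1)   # arbitrary seed
REPS = 1000              # repeated samples per setting (illustrative)

# Effect of noise at fixed n = 20: average r falls as sigma grows.
mean_r_by_sigma = {}
for sigma in (0.1, 0.3, 0.6, 1.0, 2.0):
    rs = [sample_r(20, sigma, rng) for _ in range(REPS)]
    mean_r_by_sigma[sigma] = statistics.mean(rs)
    rho = 1 / math.sqrt(1 + sigma ** 2)   # population value under the assumptions
    print(f"sigma = {sigma:3.1f}: mean r = {mean_r_by_sigma[sigma]:.3f}, rho = {rho:.3f}")

# Effect of sample size at fixed sigma = 0.6: r becomes less variable.
sd_r_by_n = {}
for n in (20, 100, 200):
    rs = [sample_r(n, 0.6, rng) for _ in range(REPS)]
    sd_r_by_n[n] = statistics.stdev(rs)
    print(f"n = {n:3d}: standard deviation of r = {sd_r_by_n[n]:.3f}")
```

Repeating the sampling many times, rather than drawing a single sample per setting as in the figure, makes the trend visible without plotting.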

References

  1. Puga, J.L., Krzywinski, M. & Altman, N. Nat. Methods 12, 799–800 (2015).
  2. Kulesa, A., Krzywinski, M., Blainey, P. & Altman, N. Nat. Methods 12, 477–478 (2015).
  3. Krzywinski, M. & Altman, N. Nat. Methods 10, 1139–1140 (2013).
  4. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).
  5. Altman, N. & Krzywinski, M. Nat. Methods 12, 5–6 (2015).

Author information

Affiliations

  1. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

Competing financial interests

The authors declare no competing financial interests.
