Points of Significance: Simple linear regression

Journal name: Nature Methods
Volume: 12
Pages: 999–1000
Year published:
DOI: 10.1038/nmeth.3627
Published online

Abstract

“The statistician knows...that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.”1

Figures

    Figure 1: A variable Y has a regression on variable X if the mean of Y (black line) E(Y|X) varies with X.

    (a) If the properties of Y do not change with X, there is no association. (b) Association is possible without regression. Here E(Y|X) is constant, but the variance of Y increases with X. (c) Linear regression E(Y|X) = β0 + β1X. (d) Nonlinear regression E(Y|X) = exp(β0 + β1X).
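The four panels of Figure 1 can be imitated with a small simulation. The sketch below is only illustrative: the coefficients `BETA0` and `BETA1`, the noise levels, the range of X and the helper names `simulate` and `binned_summary` are assumptions, not values from the figure. It bins X and reports the mean and s.d. of Y in each bin as a rough stand-in for E(Y|X) and the conditional spread.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed so the sketch is repeatable

BETA0, BETA1 = 0.5, 2.0  # hypothetical coefficients, not taken from the figure

def simulate(scenario, n=20000):
    """Draw (x, y) pairs for one of the four panels, with X uniform on [0, 1]."""
    pairs = []
    for _ in range(n):
        x = random.uniform(0, 1)
        if scenario == "a":    # nothing about Y depends on X
            y = random.gauss(0, 1)
        elif scenario == "b":  # constant mean, s.d. grows with X
            y = random.gauss(0, 0.5 + 2 * x)
        elif scenario == "c":  # linear regression
            y = random.gauss(BETA0 + BETA1 * x, 1)
        else:                  # "d": nonlinear (exponential) regression
            y = random.gauss(math.exp(BETA0 + BETA1 * x), 1)
        pairs.append((x, y))
    return pairs

def binned_summary(pairs, bins=4):
    """Mean and s.d. of Y within equal-width bins of X."""
    groups = [[] for _ in range(bins)]
    for x, y in pairs:
        groups[min(int(x * bins), bins - 1)].append(y)
    return [(statistics.fmean(g), statistics.stdev(g)) for g in groups]

summaries = {s: binned_summary(simulate(s)) for s in "abcd"}
for scenario, rows in summaries.items():
    print(scenario, [(round(m, 2), round(sd, 2)) for m, sd in rows])
```

In scenario (b) the binned means should stay near zero while the binned s.d. grows, which is association without regression. In (c) and (d) the binned means themselves change with X.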

    Figure 2: In a linear regression relationship, the response variable has a distribution for each value of the independent variable.

    (a) At each height, weight is distributed normally with s.d. σ = 3. (b) Linear regression of n = 3 weight measurements for each height. The mean weight varies as μ(Height) = 2 × Height/3 – 45 (black line) and is estimated by a regression line (blue line) with 95% confidence interval (blue band). The 95% prediction interval (gray band) is the region in which 95% of the population is predicted to lie for each fixed height.
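The confidence and prediction intervals described for Figure 2 can be sketched with ordinary least squares and the textbook interval formulas. The mean function μ(Height) = 2 × Height/3 – 45, σ = 3 and n = 3 per height come from the caption. The three heights in `HEIGHTS`, the random seed and the helper name `intervals` are assumptions made for illustration, and the critical value 2.365 is the tabulated 97.5th percentile of Student's t with 7 degrees of freedom.

```python
import math
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable

SIGMA = 3.0                # s.d. of weight at each height, from panel (a)
HEIGHTS = [160, 165, 170]  # hypothetical heights (cm); not given in the caption
N_PER_HEIGHT = 3           # n = 3 weight measurements per height, from panel (b)
T_CRIT = 2.365             # tabulated t quantile (97.5%), 9 - 2 = 7 d.f.

def true_mean_weight(height):
    """Population mean weight from the caption: mu(Height) = 2*Height/3 - 45."""
    return 2 * height / 3 - 45

x = [h for h in HEIGHTS for _ in range(N_PER_HEIGHT)]
y = [random.gauss(true_mean_weight(h), SIGMA) for h in x]

# Least-squares estimates of slope and intercept.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard error with n - 2 degrees of freedom.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r * r for r in residuals) / (n - 2))

def intervals(x0):
    """Fitted value, 95% CI for the mean and 95% PI for a new observation at x0."""
    fit = intercept + slope * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    half_ci = T_CRIT * s * math.sqrt(leverage)
    half_pi = T_CRIT * s * math.sqrt(1 + leverage)
    return fit, (fit - half_ci, fit + half_ci), (fit - half_pi, fit + half_pi)

for h in HEIGHTS:
    fit, ci, pi = intervals(h)
    print(f"{h} cm: fit {fit:.1f} kg, "
          f"CI ({ci[0]:.1f}, {ci[1]:.1f}), PI ({pi[0]:.1f}, {pi[1]:.1f})")
```

The prediction interval is always wider than the confidence interval because it adds the scatter of a single new observation to the uncertainty in the fitted mean. Both are narrowest at the mean of the observed heights.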

    Figure 3: Regression models associate error with the response, which tends to pull predictions closer to the mean of the data (regression to the mean).

    (a) Uncertainty in a linear regression relationship can be expressed by a 95% confidence interval (blue band) and 95% prediction interval (gray band). Shown are regressions for the relationship in Figure 2a using different amounts of scatter (normally distributed with s.d. σ). (b) Predictions using successive regressions X → Y → X′ regress to the mean. When predicting using height H = 175 cm (larger than average), we predict weight W = 71.6 kg (dashed line). If we then regress H on W at W = 71.6 kg, we predict H′ = 172.7 cm, which is closer than H to the mean height (64.6 cm). Means of height and weight are shown as dotted lines.
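The round trip in panel (b) can be reproduced in outline by fitting both regressions to the same simulated sample. This is a sketch under assumptions: the height distribution (mean 165 cm, s.d. 7 cm), the sample size, the seed and the helper name `fit_line` are hypothetical, and only the mean function and σ = 3 follow Figure 2. The numbers it prints will therefore differ from those quoted in the caption.

```python
import random

random.seed(2)  # arbitrary seed so the sketch is repeatable

# Hypothetical population of heights (cm); weights follow the Figure 2 model.
heights = [random.gauss(165, 7) for _ in range(500)]
weights = [2 * h / 3 - 45 + random.gauss(0, 3) for h in heights]

def fit_line(x, y):
    """Least-squares intercept and slope for the regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a_w, b_w = fit_line(heights, weights)  # weight regressed on height
a_h, b_h = fit_line(weights, heights)  # height regressed on weight

h = 175.0                    # starting height from the caption
w_pred = a_w + b_w * h       # X -> Y: predicted weight
h_back = a_h + b_h * w_pred  # Y -> X': height predicted from that weight

mean_h = sum(heights) / len(heights)
print(f"mean height {mean_h:.1f} cm, H = {h:.1f} cm, "
      f"W = {w_pred:.1f} kg, H' = {h_back:.1f} cm")
```

The product of the two slopes equals the squared sample correlation, so H′ – mean = r² × (H – mean). Whenever the scatter makes r² less than 1, the round trip lands closer to the mean height than where it started.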

References

  1. Box, G. J. Am. Stat. Assoc. 71, 791–799 (1976).
  2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).
  3. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).
  4. Krzywinski, M. & Altman, N. Nat. Methods 10, 809–810 (2013).
  5. Krzywinski, M. & Altman, N. Nat. Methods 10, 1041–1042 (2013).

Author information

Affiliations

  1. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

Competing financial interests

The authors declare no competing financial interests.
