Points of Significance: Analyzing outliers: influential or nuisance?

Journal name: Nature Methods
Volume: 13
Pages: 281–282
DOI: 10.1038/nmeth.3812

Some outliers influence the regression fit more than others.

Figures

  1. Figure 1: Observations near the mean have less influence on the regression estimates and fitted values.

    (a) A simple linear regression line always goes through the means of the predictor and the response. Shown are values for a sample with n = 11 (black dots) simulated with W(H) = −45 + (2/3)H + ε, where the noise ε is normally distributed with zero mean and variance 1. The regression (black line) passes through (mean height mH = 165, mean weight mW = 65.2) and has a slope of 0.70. Also shown are the 95% confidence interval (dark gray band) and 95% prediction interval (light gray band). (b) The fit (blue line) for a new sample (blue dots) derived from the observations shown in a by modifying the sixth weight, at H = 165, from W(165) = 65 to W(165) = 62. The black line is the fit from a. (c) Same as b, except here we obtained the new sample (orange dots) by changing the 11th weight in a from W(170) = 68.3 to W(170) = 65.3. The sum of squared residuals (SSE) is shown for each fit.

  2. Figure 2: The leverage, residual and Cook's distance of an observation are used to assess the robustness of the fit.

    (a) The leverage of an observation tells us about its potential to influence the fit and increases as the square of the distance from the predictor value to its mean. Shown are leverage values for the data set in Figure 1a. Leverage values larger than (2p + 2)/n = 0.36 (dotted line; p = 1, n = 11) are considered large for p predictors and sample size n. (b) The residual is the distance between the observation and its fitted value, yi − ŷi, shown here for the three fits in Figure 1 as a function of the predictor value (left) and fitted weight (right). Colors of points correspond to the colors of the fitted lines in Figure 1, and there is a horizontal offset of half the width of a data point where points would otherwise occlude each other. (c) Cook's distance is a measure of the influence of each data value on the fit and values greater than 4/n = 0.36 (dotted line; n = 11) are considered high influence. Shown are Cook's distances for each fit in Figure 1.

  3. Figure 3: A plot of residuals as a function of leverage identifies influential observations that are not modeled well by the regression.

    These quantities are shown here for each of the fits in Figure 1. The contour of Cook's distance of 4/n = 0.36 (n = 11) is shown by a black line. The sixth observation that was adjusted (Fig. 1b) stands out as a low-leverage outlier (middle panel). In contrast, the 11th observation (Fig. 1c) has high leverage, a large residual and a large Cook's distance (right panel).
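
The quantities named in the three legends above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes the standard simple-regression formulas: leverage h_i = 1/n + (x_i − x̄)²/Σ(x_j − x̄)², and Cook's distance D_i = e_i²/(k·MSE) · h_i/(1 − h_i)², with k = 2 fitted parameters. The original sample is not listed here, so the weights are hypothetical noise-free values from W(H) = −45 + (2/3)H, and the heights 160 to 170 are an assumption based on the stated mean of 165. The function and variable names are illustrative.

```python
def diagnostics(xs, ys):
    """Least-squares fit of y = a + b*x with per-observation diagnostics."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # forces the line through the means

    # Leverage grows with the squared distance of x from its mean.
    leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    residual = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    n_params = 2  # intercept and slope, i.e. p + 1 with p = 1 predictor
    sse = sum(e ** 2 for e in residual)
    mse = sse / (n - n_params)
    cooks = [e ** 2 / (n_params * mse) * h / (1 - h) ** 2
             for e, h in zip(residual, leverage)]
    return {"slope": slope, "intercept": intercept, "sse": sse,
            "leverage": leverage, "residual": residual, "cooks": cooks}


# Hypothetical noise-free sample (assumption: heights 160..170, n = 11).
heights = list(range(160, 171))
base = [-45 + 2 * h / 3 for h in heights]

# Lower one weight by 3, as in panels b and c of Figure 1.
sample_b = base[:]
sample_b[5] -= 3   # sixth observation, at the mean height
sample_c = base[:]
sample_c[10] -= 3  # 11th observation, at the largest height

fit_b = diagnostics(heights, sample_b)
fit_c = diagnostics(heights, sample_c)

n = len(heights)
leverage_cutoff = (2 * 1 + 2) / n  # (2p + 2)/n with p = 1
cooks_cutoff = 4 / n

for label, fit, i in (("b", fit_b, 5), ("c", fit_c, 10)):
    print(f"{label}: slope={fit['slope']:.3f} "
          f"leverage={fit['leverage'][i]:.3f} "
          f"cooks={fit['cooks'][i]:.3f}")
```

Under these assumptions, the change at the mean height leaves the slope at 2/3 and only shifts the intercept. The same change at H = 170 lowers the slope to about 0.530. Cook's distance for the altered point is 0.45 in the first case and 2.1 in the second. Because the altered point is the only deviation in a noise-free sample, both values exceed 4/n, so the magnitudes differ from those in the figures. The ordering matches the legends.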

Change history

Corrected online 14 April 2016
In the version of this piece initially published, there were two errors. The equation describing mean squared error (MSE) was incorrect in the PDF file. In the legend for Figure 1a, the stated values for mean height and mean weight were switched. The errors have been corrected in the HTML and the PDF versions of the piece.

References

  1. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
  2. Krzywinski, M. & Altman, N. Nat. Methods 12, 1103–1104 (2015).

Author information

Affiliations

  1. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

Competing financial interests

The authors declare no competing financial interests.
