Residual plots can be used to validate assumptions about the regression model.
So far in our discussion of linear regression, we have seen that the estimated regression coefficients and predicted values can be difficult to interpret^{1}. When the predictors are correlated^{2}, the magnitude and even the sign of the estimated regression coefficients can be highly variable, although the predicted values may be stable. When outliers are present^{3}, both the estimated regression coefficients and the predicted values can be influenced. This month, we discuss diagnostics for the robustness of the estimates and of the statistical inference—that is, the t-tests, confidence intervals and prediction intervals that are computed on the basis of the assumptions that the errors are additive, normal and independent and have zero mean and constant variance.
Recall that the linear regression model is Y = β_{0} + Σβ_{j}X_{j} + ε, where Y is the response variable, X = (X_{1}, ..., X_{p}) are the p predictor variables, (β_{0}, ..., β_{p}) are the unknown population regression coefficients and ε is the error, which is normally distributed with zero mean and constant variance σ^{2} (often written as N(0, σ^{2})). The response and predicted value for the ith observation are y_{i} and ŷ_{i}, respectively, and the difference between them is the residual r_{i} = y_{i} − ŷ_{i}. The variance is estimated by the mean squared error MSE = Σr_{i}^{2}/(n − p − 1) for sample size n and p predictors.
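These quantities can be made concrete with a short simulation. The NumPy sketch below (the data, coefficients and seed are illustrative, not from the column) fits a two-predictor model by least squares and computes MSE = Σr_{i}^{2}/(n − p − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))              # p = 2 simulated predictors
beta = np.array([1.0, 2.0, -1.0])        # illustrative beta_0, beta_1, beta_2
y = beta[0] + X @ beta[1:] + rng.normal(scale=0.5, size=n)  # eps ~ N(0, 0.25)

A = np.column_stack([np.ones(n), X])     # design matrix with intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ coef                          # predicted values
resid = y - yhat                         # residuals r_i = y_i - yhat_i
mse = np.sum(resid**2) / (n - p - 1)     # estimates sigma^2
```

With an intercept in the model, the residuals sum to exactly zero, and the MSE estimates the error variance σ^{2} (here 0.25).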
One of the most versatile regression diagnostic methods is to plot the residuals r_{i} against the predictors (x_{i}, r_{i}) and the predicted values (ŷ_{i}, r_{i}) (Fig. 1). When noise assumptions are met, these plots should have zero mean with no local nonrandom trends and constant spread (Fig. 1b). Trends indicate that the regression may be nonlinear and that terms such as polynomials (e.g., Σγ_{j}X^{2}_{j}) may be required in the model for more accurate prediction; residuals will still have an overall zero average (Fig. 1b). The absolute values of the residuals can be plotted in the same way to assess the constant variance assumption.
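A numeric stand-in for eyeballing the residual plot can make the idea concrete. In the sketch below (simulated data), a straight line is fitted to data with a quadratic component; the residuals average to zero overall, yet they correlate strongly with x^{2}—the curvature a plot of (x_{i}, r_{i}) would reveal:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 80)
y = 1 + x + 0.8 * x**2 + rng.normal(scale=0.3, size=x.size)  # quadratic truth

A = np.column_stack([np.ones_like(x), x])     # straight-line fit only
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# The residuals average to zero overall, but a plot of (x_i, r_i) would be
# U-shaped; the missed curvature shows up as a strong correlation with x^2
curvature_signal = np.corrcoef(resid, x**2)[0, 1]
```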
Unless there are substantive reasons to expect a linear relationship between the response and predictor variables, linear regression provides a convenient approximation of the true relationship. However, if the residuals are large and show a systematic trend, there may be a lack of fit between the model and the true relationship (Fig. 1b). Whereas linear trends in the residual plots indicate influential data points that have pulled the fit away from the bulk of the data^{3}, curvature indicates nonlinear trends that have not been captured. Adding powers of the predictors as additional predictors in the model allows us to fit a polynomial, which can capture this curvature. If the polynomial fit exhibits a significantly lower MSE, we might conclude that there are terms present that were not captured in the original model. For example, including an H^{2} term in the fit in Figure 1a decreases the MSE substantively from 1.32 to 0.86.
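The effect of adding a polynomial term can be sketched in the same spirit as the H^{2} term in Figure 1a (the data below are simulated, not those of the figure): including the squared predictor as an extra column captures the curvature and lowers the MSE.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 3, 60)
y = 2 + x + 0.8 * x**2 + rng.normal(scale=0.3, size=x.size)  # quadratic truth

def fit_mse(A, y):
    """Least-squares fit; returns MSE = sum(r_i^2) / (n - p - 1)."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return np.sum(r**2) / (len(y) - A.shape[1])  # A includes the intercept

A1 = np.column_stack([np.ones_like(x), x])          # straight-line model
A2 = np.column_stack([np.ones_like(x), x, x**2])    # adds the squared term
mse_linear = fit_mse(A1, y)
mse_quad = fit_mse(A2, y)                           # substantially smaller
```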
A formal test of lack of fit can be done when there are replicates at some combinations of the predictor values. The variability of the replicates can be used to estimate the error variance—an MSE much larger than the within-replicate variability is evidence that the residuals have an additional component due to lack of fit.
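With replicates, the residual sum of squares can be split into a pure-error part (within-replicate variability) and a lack-of-fit part. A sketch under simulated data, where a straight line is fitted to a nonlinear truth; the F statistic is compared to an F distribution with m − p − 1 and n − m degrees of freedom for m distinct predictor values:

```python
import numpy as np

rng = np.random.default_rng(3)
levels = np.repeat([0.0, 1.0, 2.0, 3.0, 4.0], 4)   # 4 replicates per x value
y = 1 + np.sin(levels) + rng.normal(scale=0.2, size=levels.size)  # nonlinear truth

A = np.column_stack([np.ones_like(levels), levels])  # straight-line fit
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
sse = np.sum((y - A @ coef)**2)                      # total residual SS

# Pure-error SS from within-replicate variability; the remainder of the
# residual SS is attributable to lack of fit
sspe = sum(np.sum((y[levels == v] - y[levels == v].mean())**2)
           for v in np.unique(levels))
sslof = sse - sspe
n, p, m = levels.size, 1, np.unique(levels).size     # m distinct x values
f_lof = (sslof / (m - p - 1)) / (sspe / (n - m))     # compare to F(m-p-1, n-m)
```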
One can assess the assumption of constant noise variance (homoscedasticity) by plotting the absolute values of the residuals together with a smooth, nonparametric regression line. If the noise is heteroscedastic (nonconstant variance), the spread of the absolute residuals will change across the plot and the regression line will not be horizontal (Fig. 1b).
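A crude numeric stand-in for the smoothed plot is to regress |r_{i}| on the fitted values; a clearly nonzero slope signals changing spread (in practice a nonparametric smoother would be used, but a straight-line fit suffices to illustrate the idea on simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 100)
y = 1 + 2 * x + rng.normal(scale=0.3 * x)   # noise s.d. grows with x

A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ coef
abs_resid = np.abs(y - yhat)

# Slope of |r_i| on the fitted values; a clearly positive slope means the
# spread of the residuals grows with the predicted value
slope = np.polyfit(yhat, abs_resid, 1)[0]
```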
Although the estimated regression slopes and predicted values are robust to heteroscedasticity, their sampling distributions are not^{4}. Most important, the s.d. of the sampling distributions depend on the variances weighted by values of the predictors, so that use of the test statistics and confidence intervals computed under the assumption of constant variance is no longer valid. Prediction intervals will be particularly inaccurate, as they require appropriate coverage of the error distribution as well as of the predicted value. For example, if the data are normally distributed but the variance is not constant (Fig. 1a), the MSE will be an estimate of the average variance and the prediction intervals will be too large for values with small variance and too small for values with large variance.
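The coverage failure can be checked by simulation. In the sketch below (a deliberately extreme two-group design; the leverage term of the prediction interval is ignored for simplicity), the pooled MSE produces nominal 95% intervals that over-cover the small-variance group and under-cover the large-variance group:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = np.concatenate([np.full(n, 1.0), np.full(n, 10.0)])
sd = np.where(x < 5, 0.2, 2.0)              # small- and large-variance groups
y = 1 + x + rng.normal(scale=sd)

A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
mse = np.sum((y - A @ coef)**2) / (len(y) - 2)   # pools the two variances

# Nominal 95% prediction half-width from the pooled MSE, checked against
# fresh errors drawn from each group's true error distribution
half = 1.96 * np.sqrt(mse)
cover_small = np.mean(np.abs(rng.normal(scale=0.2, size=10000)) < half)
cover_large = np.mean(np.abs(rng.normal(scale=2.0, size=10000)) < half)
```

Here `cover_small` is essentially 1 (intervals far too wide) while `cover_large` falls well short of the nominal 95%.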
Another factor that can influence the variance estimate is dependence among the errors, which can occur for a number of reasons—for example, multiple observations on the same individual, time or spatial correlation, or a latent random factor that causes familial or other cluster correlations. Error correlation biases the MSE as an estimator of the variance, and that bias can be high and in either direction depending on the type of correlation. As with heteroscedasticity, the estimated regression slopes and predicted values are robust, but their sampling distributions differ from those computed under the independence assumption^{4}. Even if the error variance is constant, in the presence of correlation use of the test statistics, confidence intervals and prediction intervals computed under the independence assumption can lead to very misleading results.
One can detect the correlation of residuals over time by plotting them versus the time at which the observations were made. Ripples and other nonrandom behavior in the plot indicate time correlation. Spatial correlation can similarly be detected in a plot of the residuals versus the spatial coordinates at which the observations were made. Other types of correlation may be more difficult to detect, particularly if they are due to unknown latent variables, such as cluster effects.
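A common numeric companion to the time plot is the Durbin–Watson statistic, which is near 2 for independent errors and well below 2 for positive serial correlation. A sketch on simulated AR(1) errors (the model and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
t = np.arange(n)
e = np.zeros(n)
for i in range(1, n):                    # AR(1) errors with rho = 0.8
    e[i] = 0.8 * e[i - 1] + rng.normal(scale=0.5)
y = 1 + 0.02 * t + e

A = np.column_stack([np.ones(n), t])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r = y - A @ coef

# Durbin-Watson statistic: about 2 for independent errors and well
# below 2 for positively time-correlated errors
dw = np.sum(np.diff(r)**2) / np.sum(r**2)
```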
Statistical inference for linear regression relies heavily on the variance estimate, MSE, and is therefore influenced by any factor that affects that estimate. Outliers, for example, can increase the MSE in two different ways. Outliers with a large residual, such as low-leverage points, can directly increase the MSE because the MSE is proportional to the sum of squared residuals. Outliers with high leverage and a high Cook's distance^{3} may have a small residual but increase the MSE indirectly by increasing the residuals of other data points, pulling the linear fit away from the majority of responses.
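Leverage and Cook's distance can be computed directly from the hat matrix. In the simulated sketch below, one point has both an extreme predictor value and a large offset from the line; it has the largest leverage and the largest Cook's distance, even though the fit is pulled toward it:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.append(rng.normal(size=30), 8.0)          # last point: high leverage
y = np.append(2 * x[:30] + rng.normal(scale=0.5, size=30),
              2 * 8.0 + 6.0)                     # ...and a large offset

A = np.column_stack([np.ones_like(x), x])
h = np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)    # leverages (hat-matrix diagonal)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r = y - A @ coef
p = A.shape[1]
mse = np.sum(r**2) / (len(y) - p)

# Cook's distance combines the residual and the leverage of each point
cooks = (r**2 / (p * mse)) * (h / (1 - h)**2)
```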
Statistical inference is typically done under the assumption that the errors are normally distributed with constant variance. A version of the central limit theorem^{5} tells us that for large samples, tests and confidence intervals for the estimated regression coefficients and fitted values continue to be accurate with nonnormal data as long as the errors are independent and identically distributed with constant variance. Here, the definition of 'large' depends on the nature of the nonnormality. For example, errors that are closer to uniform on a fixed interval are 'close' to normal, but distributions that produce many outliers require large sample sizes. The prediction intervals rely critically on the normality assumption to determine the width and symmetry of the interval. Nonnormality of the error distribution may completely disrupt the coverage of prediction intervals.
After it has been established from the residual plots that the residuals have no nonlinear trends and constant variance, informal evaluation of normality is often done using a histogram of the residuals or a Q–Q plot, also known as a normal probability plot (Fig. 2). These plots show the sorted values of the sample versus sorted values that would be expected if they were drawn from a normal distribution. Although formal tests of normality can be done, they are not considered to be effective because they may be sensitive to departures from normality that have little effect on the statistical inference.
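The Q–Q construction can be sketched numerically. Below, the reference quantiles are approximated by quantiles of a large simulated normal sample (a dedicated routine such as `scipy.stats.probplot` computes them directly); near-normal residuals track the reference closely, while heavy-tailed residuals bend away from the line:

```python
import numpy as np

rng = np.random.default_rng(8)
resid = rng.normal(size=200)             # stand-in for well-behaved residuals

# Q-Q plot coordinates: sorted residuals vs normal reference quantiles,
# here approximated from a large simulated normal sample
probs = (np.arange(1, resid.size + 1) - 0.5) / resid.size
ref = np.quantile(rng.normal(size=100000), probs)
qq_corr = np.corrcoef(np.sort(resid), ref)[0, 1]     # close to 1

# Heavy-tailed residuals deviate from the straight line
heavy = rng.standard_t(df=2, size=200)
qq_corr_heavy = np.corrcoef(np.sort(heavy), ref)[0, 1]
```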
Correlation among the predictors, also known as multicollinearity, does not affect the stability of the predicted values, but it can greatly affect the stability of the estimated regression coefficients, as we saw in the context of predicting weight from height and maximum jump height^{2}. A commonly used measure of multicollinearity is the variance inflation factor, VIF(X_{i}) = 1/(1 − R_{i}^{2}), where R_{i}^{2} is the proportion of the variance of X_{i} explained by regressing X_{i} on the other predictors; it is calculated once for each predictor. If VIF(X_{i}) is large, the regression coefficient estimate may vary greatly between samples—for example, when VIF > 10, the regression coefficients should not be interpreted. The reciprocal of the VIF, the tolerance, is sometimes used equivalently.
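The definition translates directly into code. In the simulated sketch below, two predictors are nearly collinear and a third is unrelated; the first two have very large VIFs while the third has a VIF near 1:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                   # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF(X_i) = 1 / (1 - R_i^2), from regressing column i on the rest."""
    target = X[:, i]
    others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    r = target - others @ coef
    r2 = 1 - np.sum(r**2) / np.sum((target - target.mean())**2)
    return 1 / (1 - r2)

vifs = [vif(X, i) for i in range(X.shape[1])]   # large, large, near 1
```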
Multiple regression is one of the most powerful tools in the basic statistical toolkit. Although simple to apply, it is prone to over- and misinterpretation without attention to diagnostics.
References
1. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
2. Krzywinski, M. & Altman, N. Nat. Methods 12, 1103–1104 (2015).
3. Altman, N. & Krzywinski, M. Nat. Methods 13, 281–282 (2016).
4. Eicker, F. in Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability Vol. 1 (eds. Le Cam, L.M., Neyman, J. & Scott, E.M.) 59–82 (Univ. of California Press, 1967).
5. Lumley, T. et al. Annu. Rev. Public Health 23, 151–169 (2002).
Competing interests
The authors declare no competing financial interests.
Altman, N., Krzywinski, M. Regression diagnostics. Nat Methods 13, 385–386 (2016). https://doi.org/10.1038/nmeth.3854