Last month we examined the challenge of selecting a predictive model that generalizes well, and we discussed how a model's ability to generalize is related to its number of parameters and its complexity1. An appropriate level of complexity is needed to avoid both underfitting and overfitting. An underfitted model is usually a poor fit to the training data, and an overfitted model is a good fit to the training data but not to new data. This month we explore the topic of regularization, a method that controls a model's complexity by penalizing the magnitude of its parameters.

Regularization can be used with any type of predictive model. We will illustrate this method using multiple linear regression applied to the analysis of simulated biomarker data to predict disease severity on a continuous scale. For the ith patient, let yi be the known disease severity and xij be the value of the jth biomarker. Multiple linear regression finds the parameter estimates β̂j that minimize the sum of squared errors SSE = Σi(yi − ŷi)², where ŷi = Σjβ̂jxij is the ith patient's predicted disease severity. For simplicity, we exclude the intercept, β̂0, which is a constant offset. Recall that the hat on β̂j indicates that the value is an estimate of the corresponding parameter βj in the underlying model.
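To make this concrete, here is a minimal Python sketch (not part of the original analysis; it assumes NumPy, and the sample size, noise level and seed are illustrative) that simulates three biomarkers and finds the least-squares estimates β̂j that minimize the SSE without an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_biomarkers = 100, 3                  # illustrative sizes

X = rng.uniform(size=(n_patients, n_biomarkers))   # x_ij: biomarker j for patient i
beta_true = np.array([0.0, 0.0, 5.0])              # assumed underlying model: y = 5*x3
y = X @ beta_true + rng.normal(scale=0.5, size=n_patients)

# Ordinary least squares: the beta_hat minimizing SSE = sum_i (y_i - yhat_i)^2, no intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)
print("parameter estimates:", np.round(beta_hat, 2))
print("SSE:", round(float(sse), 2))
```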

The complexity of a model is related to the number and magnitude of its parameters1. With too few parameters the model underfits, missing predictable systematic variation. With too many parameters it overfits, fitting training data noise with lower SSE but increasing variability in prediction of new data. Generally, as the number of parameters increases, so does the total magnitude of their estimated values (Fig. 1a). Thus, rather than limiting the number of parameters, we can control model complexity by constraining the magnitude of the parameters—a kind of limited budget for the model to spend on parameters. In fact, this can also reduce the number of variables in the model.

Figure 1: Regularization controls model complexity by imposing a limit on the magnitude of its parameters.

(a) Complexity of polynomial models of orders 1 to 5 fit to data points in b as measured by the sum of squared parameters. The highest order polynomial is the most complex (blue bar) and drastically overfits, fitting the data exactly (blue trace in b). (b) The effect of λ on ridge regression regularization of a fifth-order polynomial model fit to the six data points. Higher values of λ decrease the magnitude of parameters, lower model complexity and reduce overfitting. When λ = 0 the model is not regularized (blue trace). A linear fit is shown with a dashed line. (c) The effect of λ on the classification decision boundary of logistic regression applied to two classes (black and white). The regression fits a third-order polynomial to variables X and Y.
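The behavior summarized in Figure 1a,b can be reproduced qualitatively with a short sketch (assuming NumPy and scikit-learn; the six data points, the λ values and the noise level are illustrative): higher-order polynomial fits accumulate a larger sum of squared parameters, and increasing λ in ridge regression shrinks the fifth-order fit.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 6)
y = 2 * x + rng.normal(scale=0.3, size=6)           # six noisy, roughly linear points

# Complexity of unregularized polynomial fits of orders 1-5 (cf. Fig. 1a)
for order in range(1, 6):
    coefs = np.polyfit(x, y, order)
    print("order", order, "sum of squared parameters:", round(float(np.sum(coefs ** 2)), 2))

# Ridge regularization of the fifth-order model for increasing lambda (cf. Fig. 1b)
X5 = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x[:, None])
for lam in (1e-6, 0.01, 1.0):                        # 1e-6 is effectively unregularized
    ridge = Ridge(alpha=lam).fit(X5, y)              # sklearn's alpha plays the role of lambda
    print("lambda", lam, "sum of squared parameters:", round(float(np.sum(ridge.coef_ ** 2)), 2))
```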

To see how this might work, let's consider a single patient and apply a three-biomarker model given by ŷ = β̂1x1 + β̂2x2 + β̂3x3. Suppose that, in reality, the underlying model—which we don't know—is y = 5x3. The ideal estimate is (β̂1, β̂2, β̂3) = (0, 0, 5). But what happens if, by chance, the values for the first two biomarkers in our training data are perfectly correlated; for example, x1 = 2x2? Now (β̂1, β̂2, β̂3) = (50, −100, 5) also gives the same fit as the ideal estimate because 50x1 − 100x2 = 0. To put it more generally, as long as β̂2 = −2β̂1, the magnitudes of β̂1 and β̂2 can be arbitrarily large. If x1 and x2 do not have perfect correlation in new data, models with nonzero values for β̂1 and β̂2 will perform worse than the ideal solution and, the larger their magnitudes, the worse the fit. By penalizing the magnitude of the parameters, we avoid models with large values of β̂1 and β̂2 and may even force them to their correct values of 0.
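The following sketch (illustrative, assuming NumPy; the sample size and seed are arbitrary) checks this numerically: with x1 = 2x2 in the training data, the ideal estimate (0, 0, 5) and the inflated estimate (50, −100, 5) produce identical fits, but only the ideal estimate survives when the correlation is broken in new data.

```python
import numpy as np

rng = np.random.default_rng(2)
x2 = rng.uniform(size=50)
X_train = np.column_stack([2 * x2, x2, rng.uniform(size=50)])  # x1 = 2*x2 exactly
y_train = 5 * X_train[:, 2]                                    # underlying model y = 5*x3

sse = lambda b, X, y: np.sum((y - X @ b) ** 2)
ideal = np.array([0.0, 0.0, 5.0])
inflated = np.array([50.0, -100.0, 5.0])
print(sse(ideal, X_train, y_train), sse(inflated, X_train, y_train))  # both essentially 0

X_new = rng.uniform(size=(50, 3))                               # x1 and x2 no longer correlated
y_new = 5 * X_new[:, 2]
print(sse(ideal, X_new, y_new), sse(inflated, X_new, y_new))    # inflated estimate fits badly
```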

The classic approach to constraining model parameter magnitudes is ridge regression (RR), also known as Tikhonov regularization. It imposes a limit on the squared L2-norm, the sum of squares of the parameters, by requiring Σjβ̂j² ≤ T. This is equivalent to minimizing SSE + λΣjβ̂j². The second term is the regularizer function, in which λ acts as a weight that balances minimizing the SSE against limiting model complexity; the value of λ can be chosen. Achieving this balance is crucial; if we select a model based only on minimizing the SSE on the training data, we typically wind up with an overfitted model. In general, the larger the value of λ, the smaller the magnitude of the parameters and thus the lower the model complexity (Fig. 1b,c). Note that even with large values of λ, parameter magnitudes are reduced but not set to zero. Thus, a complex model such as the fifth-order polynomial (Fig. 1b) is not reduced to a lower-order polynomial. The relationship between T and λ is complex—depending on the data set and model, either T or λ may be more convenient to use. As we'll see below, a value of T directly corresponds to a boundary in the model parameter space that can be handily visualized.
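A direct way to see the effect of λ is the closed-form ridge solution β̂ = (XᵀX + λI)⁻¹Xᵀy; the sketch below (assuming NumPy; the simulated data are illustrative) shows the parameter estimates shrinking, but not reaching zero, as λ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = X @ np.array([6.0, 2.0, 0.0, 0.0, 1.0]) + rng.normal(size=80)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 1.0, 10.0, 100.0):
    b = ridge_fit(X, y, lam)
    print("lambda", lam, "estimates", np.round(b, 2),
          "sum of squares", round(float(np.sum(b ** 2)), 2))
```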

One benefit of RR is that it can select a unique model in cases where multiple models yield an equally good fit as measured by the SSE (Fig. 2a, black line): adding the regularizer function (Fig. 2b) to the SSE minimization singles out one solution (Fig. 2c, black point). Figure 2 is a simplification; in a realistic scenario the model would have more parameters, and correlation in the data would exist but would not be perfect. We show the regularization process for a fixed λ = 9 (Fig. 2b,c); the best value for λ would normally be chosen using a process like cross-validation, which evaluates the model parameter solution using a validation data set1.
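In practice λ is usually chosen by cross-validation; a minimal scikit-learn sketch could look like the following, where scikit-learn's alpha parameter plays the role of λ (the data, the candidate grid and the 5-fold split are illustrative).

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=100)

# Evaluate a grid of candidate lambdas by 5-fold cross-validation
model = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5, fit_intercept=False).fit(X, y)
print("selected lambda:", model.alpha_)
print("parameter estimates:", np.round(model.coef_, 2))
```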

Figure 2: Ridge regression (RR) can resolve cases where multiple models yield the same quality fit to the data.

(a) The SSE of a multiple linear regression fit of the model ŷ = β̂1x1 + β̂2x2 for various values of β̂1 and β̂2. For each parameter pair, 1,000 uniformly distributed and perfectly correlated samples x1 = x2 were used with an underlying model of y = 6x1 + 2x2. The minimum SSE is achieved by all models that fall on the black line β̂1 + β̂2 = 8, shown in all panels. (b) The value of the RR regularizer function, λ(β̂1² + β̂2²), which favors smaller magnitudes for each parameter. (c) A unique model parameter solution (black point) found by minimizing the sum of the SSE and the regularizer function shown in b. As λ is increased, the solution moves along the dotted line toward the origin. Color ramp is logarithmic.

An alternative regularization method is the least absolute shrinkage and selection operator (LASSO), which imposes a limit on the L1-norm by requiring Σj|β̂j| ≤ T. This is equivalent to minimizing SSE + λΣj|β̂j|. Unlike RR, as λ increases (or T decreases), LASSO removes variables from the model by reducing the corresponding parameters to zero (Fig. 3a). In other words, LASSO can be used as a variable selection method, which removes biomarkers from our diagnostic test model.
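The variable-selection behavior is easy to see with scikit-learn's Lasso (a sketch with illustrative data; note that scikit-learn rescales the error term, so its alpha is proportional to, not identical to, the λ written above): as the penalty grows, more parameters are driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [4.0, -2.0, 1.0]                 # only 3 of 20 simulated biomarkers matter
y = X @ beta + rng.normal(size=200)

for lam in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={lam}: {n_zero} of 20 parameters forced to zero")
```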

Figure 3: LASSO and elastic net (EN) can remove variables from a model by forcing their parameters to zero.

(a) The number of parameters forced to zero by each method as a function of λ. For each λ, 800 uniformly distributed samples were generated from a 100-variable underlying model with normally distributed parameters. For EN, an equal balance of the LASSO and RR penalties was used. (b) The same scenario as in Figure 2b but with independent and uniformly distributed x1 and x2. Constraint space and best solutions are shown for RR (T = 9, solid circle, black point) and LASSO (T = 3, dashed square, white point). Lighter circle and square correspond to RR with T = 25 and LASSO with T = 5. As regularization is relaxed and T is increased, the solutions for each method follow the dotted lines, and both approach the parameter estimates achieved without regularization, (β̂1, β̂2) = (6, 2). (c) RR and LASSO constraint spaces shown as in b for the correlated data scenario in Figure 2a. RR yields a unique solution, while LASSO has multiple solutions that have the same minimum SSE. EN (not shown) also yields a unique solution, and its boundary is square with slightly bulging edges.

To geometrically illustrate both RR and LASSO, we repeated the simulation from Figure 2a but without correlation in the variables (Fig. 3b). In the absence of correlation, the lowest SSE is at the parameter estimate (β̂1, β̂2) = (6, 2), which, in this example, happened to be the parameters used in the underlying model to generate the data. However, for our choice of T, this parameter estimate coordinate fell outside of the constraints of both regularization methods, and we instead had to choose a coordinate within the constraint space that had the lowest SSE. This coordinate corresponded to a model that yields a higher SSE than the minimum possible SSE but strikes a balance between model complexity and SSE. Recall that our goal here wasn't to minimize the SSE, which can easily be done by an overfitted model, but rather to find a model that would perform well with new data. In this example, we used T because it has a more direct geometrical interpretation; for example, it corresponds to the square of the radius of the RR boundary circle.

An interesting observation is that for some values of T, the LASSO solution may fall on one of the corners of the square boundary (Fig. 3b, T = 3). Since these corners sit on an axis where one of the parameters equals zero, they represent a solution in which the corresponding variable has been removed from the model. This is in contrast to RR, where because of the circular boundary, variables won't be removed except in the unlikely event that the minimum SSE already falls on an axis.

LASSO has several important weaknesses. First, it does not guarantee a unique solution (Fig. 3c). It is also not robust to collinearity in the data. For example, in a diagnostic test there may be several correlated biomarkers and, although we would typically want each of them to contribute to the model, LASSO will often select only one of them—this selection can vary even with minor changes in the data. Another potential weakness of LASSO manifests when the number of variables, P, is larger than the number of samples, N, a common occurrence in biology, especially with 'omic' data sets. LASSO cannot select more than N variables, which may be an advantage or a disadvantage depending on the data.
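A small experiment (illustrative, assuming scikit-learn, with two nearly identical simulated biomarkers) shows the contrast: ridge spreads the weight across the correlated pair, whereas LASSO typically keeps only one of them, and which one it keeps can flip with small perturbations of the data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
x = rng.normal(size=200)
X = np.column_stack([x, x + rng.normal(scale=0.01, size=200)])  # two almost identical biomarkers
y = 3 * x + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print("ridge estimates:", np.round(ridge.coef_, 2))  # weight shared across the pair
print("LASSO estimates:", np.round(lasso.coef_, 2))  # usually one of the pair is set to zero
```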

Given these weaknesses, an approach that blends RR and LASSO is often used for model selection. The elastic net method uses both regularization methods, simultaneously restricting both the L1- and L2-norms of the parameters. For linear regression, this is equivalent to minimizing SSE + λ1Σjβ̂j² + λ2Σj|β̂j|. This increases the number of models to be evaluated, as different combinations of λ1 and λ2 should be tried out during the cross-validation. If λ1 = 0, we have LASSO; and if λ2 = 0, we have RR. Elastic net is known to select more variables than LASSO (Fig. 3a), and it shrinks the nonzero model parameters like ridge regression2.
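With scikit-learn, the elastic net penalty is parameterized by an overall strength (alpha) and the L1/L2 balance (l1_ratio), and both can be cross-validated together; the sketch below uses illustrative data and candidate values.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 30))
beta = np.zeros(30)
beta[:5] = [5.0, 4.0, 3.0, 2.0, 1.0]        # 5 informative variables out of 30
y = X @ beta + rng.normal(size=150)

# Cross-validate both the penalty strength and the LASSO/RR balance
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5, fit_intercept=False).fit(X, y)
print("selected l1_ratio:", enet.l1_ratio_, "and alpha:", round(float(enet.alpha_), 4))
print("nonzero parameters:", int(np.sum(enet.coef_ != 0)), "of 30")
```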

Creating a robust predictive model requires careful control of model complexity so that the model will generalize well to new data. This complexity can be controlled by reducing the number of parameters, but doing so is challenging because many combinations of variables need to be evaluated. Regularization addresses this by creating a parameter budget that is used to prioritize variables in the model. Ridge regression provides unique solutions even when variables are correlated with one another, but it does not reduce the number of variables; LASSO performs variable selection but may not provide a unique solution in every case. Elastic net offers the best of both worlds and can be used to create a simpler model that will likely perform better on new data.