This Month
Published: 03 January 2018

Points of significance

Machine learning: supervised methods

Danilo Bzdok¹,², Martin Krzywinski³ & Naomi Altman⁴

Nature Methods volume 15, pages 5–6 (2018)

Supervised learning algorithms extract general principles from observed examples guided by a specific prediction objective.

In supervised learning, a set of input variables, such as blood metabolite or gene expression levels, is used to predict a quantitative response variable, such as hormone level, or a qualitative one, such as healthy versus diseased status. We have previously discussed several supervised learning algorithms, including logistic regression and random forests, and their typical behaviors with different sample sizes and numbers of predictor variables. This month, we look at two very common supervised methods in the context of machine learning: linear support vector machines (SVMs) and k-nearest neighbors (kNNs). Both have been successfully applied to challenging pattern-recognition problems in biology and medicine¹.

SVM and kNN exemplify several important trade-offs in machine learning (ML). SVM is often less computationally demanding than kNN and is easier to interpret, but it can identify only a limited set of patterns. On the other hand, kNN can find very complex patterns, but its output is more challenging to interpret. To illustrate both algorithms, we will apply them to classification, because they tend to perform better at predicting categorical outputs (e.g., health versus disease) than at approximating target functions with numeric outputs (e.g., hormone level). Both techniques can distinguish many classes at once, use multiple predictors and yield probabilities of class membership.

We'll illustrate SVM using a two-class problem and begin with a case in which the classes are linearly separable, meaning that a straight line can be drawn that perfectly separates the classes. The margin is the perpendicular distance between the closest points to the line from each class (Fig. 1a). Many such separating lines are possible, and SVM can be used to find the one with the widest margin (Fig. 1b). When three or more predictors are used, the separating line becomes a (hyper-)plane, but the algorithm remains the same. The closest points to the line are called support vectors¹ and are the only points that ultimately influence the position of the separating line: points that are farther from the line can be moved, removed or added with no impact on the line, as long as they stay outside the margin. When the classes are linearly separable, the wider the margin, the higher our confidence in the classification, because it indicates that the classes are less similar.

**Figure 1: A support vector machine (SVM) classifies points by maximizing the width of a margin that separates the classes.**
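
The geometry of the separable case can be checked with a few lines of code. The following is a minimal sketch, not the authors' implementation: the toy points, the two candidate lines and the function names are assumptions made for illustration. It measures the margin of a given separating line as the sum of the distances from the line to the closest point of each class, and reports which points sit at that closest distance.

```python
import math

def signed_distance(point, w, b):
    """Signed perpendicular distance from a 2D point to the line w.x + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(*w)

def margin_and_support(points, labels, w, b, tol=1e-9):
    """Return (margin width, closest points) for a line that separates the classes.

    labels are +1 or -1. The margin is the distance from the line to the
    closest +1 point plus the distance to the closest -1 point.
    """
    dists = [y * signed_distance(p, w, b) for p, y in zip(points, labels)]
    if min(dists) <= 0:
        raise ValueError("this line does not separate the two classes")
    closest = {
        cls: min(d for d, y in zip(dists, labels) if y == cls)
        for cls in (+1, -1)
    }
    margin = closest[+1] + closest[-1]
    support = [p for p, d, y in zip(points, dists, labels)
               if d - closest[y] <= tol]
    return margin, support

# Hypothetical toy data: three points per class.
points = [(2, 2), (3, 3), (3, 1), (0, 0), (-1, 1), (0, -1)]
labels = [+1, +1, +1, -1, -1, -1]

# Two candidate separating lines: x + y = 2 and x = 1.
margin_diag, support_diag = margin_and_support(points, labels, w=(1, 1), b=-2)
margin_vert, support_vert = margin_and_support(points, labels, w=(1, 0), b=-1)
print(margin_diag, support_diag)
print(margin_vert, support_vert)
```

Both lines separate the toy classes, but the diagonal line leaves a wider margin than the vertical one, so it is the one a maximum-margin criterion would prefer of the two.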

Practically, most data sets are not linearly separable, and any separating line will result in misclassification, no matter how narrow the margin is. We say that the margin is violated by a sample if it is on the wrong side of the separating line (Fig. 1c, red arrows) or is on the correct side but within the margin (Fig. 1c, orange arrow).

Even when the data are linearly separable, allowing a few points to be misclassified might improve the classifier by allowing a wider margin for the bulk of the data (Fig. 2a). To handle violations, we impose a penalty proportional to the distance between each violating point¹ and the separating line, with nonviolating points having zero penalty. In SVM, the separating line is chosen by minimizing $1/m + C\sum_i p_i$, where m is the margin width, p_i is the penalty for each point, and C is a hyperparameter (a parameter used to tune the overall fitting behavior of an algorithm) balancing the trade-off between margin width and misclassification. A point that has a nonzero penalty is considered a support vector, because it impacts the position of the separating line and its margin.
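
For readers who want to connect this description to the form found in most textbooks and software, the standard soft-margin problem is written below. This is the usual textbook formulation rather than the exact expression used in the text: the weight vector $w$, offset $b$, labels $y_i \in \{-1, +1\}$ and slack variables $\xi_i$ are notation introduced here, and the scale of C differs between the two forms.

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .

% Margin width in this parameterization:
m = \frac{2}{\lVert w \rVert}

% At the optimum each slack equals the hinge loss:
\xi_i = \max\!\left( 0,\; 1 - y_i \left( w^{\top} x_i + b \right) \right)
```

Here $\xi_i$ plays the role of the penalty p_i: it is zero for points outside the margin, between 0 and 1 for points inside the margin but on the correct side, and greater than 1 for misclassified points. The term $\tfrac{1}{2}\lVert w \rVert^{2}$ plays the role of 1/m, since both shrink as the margin widens.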

**Figure 2: The balance between the width of the margin and penalties for margin violations is controlled by a regularization parameter, C.**

When C is large, the margin width has a low impact on the minimization, and the line is placed to minimize the sum of the violation penalties (Fig. 2, C = 1,000). When C is decreased, the misclassified points have lower impact, and the line is placed with more emphasis on maximizing the margin (Fig. 2, C = 50 and C = 5). When C is very small, the violation penalties become insignificant, and the margin is encouraged to grow until it encompasses all points. Typically, C is chosen using cross-validation².
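
The effect of C can be seen by scoring two fixed candidate lines under the objective 1/m + C∑p_i. This is a hedged sketch under stated assumptions: the toy data, the two candidate lines and the C values are invented for illustration, and each penalty p_i is taken to be the hinge loss max(0, 1 − y(w·x + b)), which is one common way to measure a margin violation and may differ in scale from the penalty used in the figures.

```python
import math

def objective(points, labels, w, b, C):
    """Evaluate 1/m + C * sum(p_i) for the line w.x + b = 0.

    Assumption: the margin boundaries are w.x + b = +1 and -1, so the
    margin width is m = 2 / ||w||, and p_i is the hinge loss.
    """
    margin = 2.0 / math.hypot(*w)
    penalties = [max(0.0, 1.0 - y * (w[0] * p[0] + w[1] * p[1] + b))
                 for p, y in zip(points, labels)]
    return 1.0 / margin + C * sum(penalties)

# Hypothetical data: two well-separated groups plus one +1 point at (2, 1)
# that lies close to the -1 group.
points = [(0, 0), (1, 1), (0, 2), (6, 0), (5, 1), (6, 2), (2, 1)]
labels = [-1, -1, -1, +1, +1, +1, +1]

candidates = {
    "narrow": ((2.0, 0.0), -3.0),  # line x = 1.5, margin 1, no violations
    "wide": ((0.5, 0.0), -1.5),    # line x = 3, margin 4, one violation
}

def preferred(C):
    """Name of the candidate line with the smaller objective value."""
    return min(candidates,
               key=lambda name: objective(points, labels, *candidates[name], C))

for C in (1000, 0.1):
    print(C, preferred(C))
```

With the large C the narrow line that avoids every violation scores better, whereas with the small C the wide line is preferred even though it misclassifies the stray point.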

Recall that we previously showed³ how regularization can be used to guard against overfitting, which occurs when the prediction equation is too closely tailored to random variation in the training set. In that sense, the role of C is similar, except here it tunes the fit by adjusting the balance of terms being minimized rather than the complexity of the shape of the boundary. Large values of C force the separating line to adjust to data far from the center of each class and thus encourage overfitting. Small values tolerate many margin violations and encourage underfitting.

We can avoid the explicit assumption of a linear class boundary by using the k-nearest neighbors (kNN) algorithm. This algorithm determines the class of an unclassified point by taking the majority class vote among its k nearest neighbors in the training set (Fig. 3a). For example, a patient whose symptoms closely match those of patients with a specific diagnosis would be classified with the same disease status. Because kNN does not assume a particular boundary between the classes, its boundary can be closer to the 'true' relationship. However, for a given training set, predictions may be less stable than for SVMs, especially when k is small, and the algorithm will often overfit the training data.

**Figure 3: Illustration of the k-nearest neighbors (kNN) classifier.**
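
The voting rule is short enough to write out in full. The following is a minimal sketch using Euclidean distance and a simple majority vote; the toy measurements, the labels and the function name are assumptions made for illustration, and practical implementations usually add variable scaling and faster neighbor search.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-variable measurements for seven patients.
train_points = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.0), (1.1, 0.7),
                (5.0, 5.1), (4.7, 5.3), (5.2, 4.8)]
train_labels = ["healthy", "healthy", "healthy", "healthy",
                "diseased", "diseased", "diseased"]

print(knn_predict(train_points, train_labels, (1.5, 1.2), k=3))
print(knn_predict(train_points, train_labels, (4.8, 5.0), k=3))
```

Each new patient is assigned the status of the training patients it most resembles, with no boundary shape assumed in advance.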

The value of the hyperparameter k acts to regularize kNN, analogous to C in SVM, and is generally selected by cross-validation. To avoid ties in the vote in a two-class problem, k can be chosen to be odd. Small k gives a finely textured boundary that is sensitive to outliers and yields high model variance (k = 3, Fig. 3b). Larger k pools the effect of more distant neighbors, giving more rigid boundaries and higher model bias (k = 7, Fig. 3b). The largest possible value of k is the number of training points; at this extreme, any new observation is assigned to the majority class of the entire training sample, incurring maximum model bias.
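
Selecting k by cross-validation can be sketched as follows. This example uses leave-one-out cross-validation, which is only one of several possible schemes, and the toy data, candidate values of k and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest (point, label) pairs in train."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)

# Hypothetical data: two compact, well-separated classes of five points each.
data = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
        ((1.0, 1.0), "A"), ((0.5, 0.5), "A"),
        ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
        ((6.0, 6.0), "B"), ((5.5, 5.5), "B")]

# Odd values of k avoid tied votes in this two-class problem.
scores = {k: loo_accuracy(data, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With k = 9, every held-out point is classified by all nine remaining points, of which the other class always holds the majority, so every prediction is wrong. This is the maximum-bias extreme described above, and cross-validation steers the choice away from it.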

Neither SVM nor kNN makes explicit assumptions about the data-generating process, such as normality of the data. However, linear SVM is considered a parametric method because it can only produce linear boundaries. If the true class boundary is nonlinear, SVM will struggle to find a satisfying fit even with increased size of the training set. To help the algorithm capture nonlinear boundaries, functions of the input variables, such as polynomials, could be added to the set of predictor variables¹. This extension of the algorithm is called kernel SVM.
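
A one-variable example shows why adding a polynomial term helps. This is a hedged sketch with invented data and a hypothetical helper function: it only checks whether a single cut point can separate the classes along one variable, which is the one-dimensional analog of a linear boundary, and does not fit an SVM.

```python
def separable_by_threshold(values, labels):
    """True if a single cut point separates the two classes along this variable.

    Assumes no value is shared by points from different classes.
    """
    sorted_labels = [y for _, y in sorted(zip(values, labels))]
    # Walking along the axis, the label must switch exactly once.
    switches = sum(a != b for a, b in zip(sorted_labels, sorted_labels[1:]))
    return switches == 1

# Hypothetical data: class +1 lies at both extremes, class -1 in the middle.
x = [-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0]
y = [+1, +1, -1, -1, -1, +1, +1]

x_squared = [v * v for v in x]  # added polynomial predictor

print(separable_by_threshold(x, y))          # original variable
print(separable_by_threshold(x_squared, y))  # squared variable
```

No single threshold on x separates the classes, but a threshold on x² does, so a boundary that is linear in the expanded set of predictors corresponds to a nonlinear boundary in the original variable.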

In contrast, kNN is a nonparametric algorithm because it avoids a priori assumptions about the shape of the class boundary and can thus adapt more closely to nonlinear boundaries as the amount of training data increases. kNN has higher variance than linear SVM, but it has the advantage of producing classification fits that adapt to any boundary. Even though the true class boundary is unknown in most real-world applications, kNN has been shown to approach the theoretically optimal classification boundary as the training set grows very large¹. However, because kNN does not impose any structure on the boundary, it can create class boundaries that may be less interpretable than those of linear SVM. The simplicity of the linear SVM boundary also lends itself more directly to formal tests of statistical significance that give P values for the relevance of individual variables.

There are also trade-offs in the number of samples and the number of variables that can be handled by these approaches. SVM can achieve good prediction accuracy for new observations despite large numbers of input variables. SVM therefore serves as an off-the-shelf technique that is frequently used in genome-wide analysis and brain imaging, two application domains that often have small sample sizes (e.g., hundreds of participants) but very high numbers of inputs (e.g., hundreds of thousands of genes or brain locations).

By contrast, the classification performance of kNN rapidly deteriorates with high numbers of input variables¹, particularly when many of the variables are unrelated to the classification or contribute only small amounts of information. Because equal attention is given to all variables, the nearest neighbors may be defined by irrelevant variables. This so-called curse of dimensionality occurs for many algorithms that become more flexible as the number of predictors increases¹.
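
The influence of irrelevant variables on nearest neighbors can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are synthetic, with one informative variable that separates the classes perfectly and a chosen number of pure-noise variables, and the sample sizes, seed and function names are arbitrary choices made for illustration.

```python
import math
import random

def make_sample(rng, label, n_noise):
    """One observation: an informative variable followed by n_noise noise variables."""
    # Class 0 falls in [0, 0.3] and class 1 in [0.7, 1.0] on the informative variable.
    informative = rng.uniform(0.0, 0.3) + 0.7 * label
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
    return [informative] + noise

def one_nn_accuracy(n_noise, n_train=60, n_test=60, seed=0):
    """Test accuracy of a 1-nearest-neighbor classifier on the synthetic data."""
    rng = random.Random(seed)
    train = [(make_sample(rng, i % 2, n_noise), i % 2) for i in range(n_train)]
    test = [(make_sample(rng, i % 2, n_noise), i % 2) for i in range(n_test)]
    correct = 0
    for x, label in test:
        nearest = min(train, key=lambda item: math.dist(item[0], x))
        correct += nearest[1] == label
    return correct / n_test

acc_clean = one_nn_accuracy(n_noise=0)    # informative variable only
acc_noisy = one_nn_accuracy(n_noise=200)  # plus 200 irrelevant variables
print(acc_clean, acc_noisy)
```

With the informative variable alone, every test point is closer to its own class than to the other. Once the noise variables are added, they dominate the distance calculation, and the nearest neighbor is chosen largely without regard to class.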

Finally, computation and memory resources are important practical considerations⁴ when analyzing extensive data. SVM only needs a small subset of training points (the support vectors) to define the classification rule, making it often more memory efficient and less computationally demanding when inferring the class of a new observation. In contrast, kNN typically requires higher computation and memory resources because it needs to use all input variables and training samples for each new observation to be classified.
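
The reliance on support vectors alone can be made concrete with a toy classifier. This is a hedged sketch, not a fitted model: the data are invented, and the multipliers (`alpha`) and `bias` were worked out by hand for this particular data set, whereas in practice a solver would compute them. The decision value is a weighted sum of inner products between the new point and each stored support vector.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical training set; the widest-margin line is x = 1, and only
# (0, 0) and (2, 0) lie on the margin boundaries.
training = [((0.0, 0.0), -1), ((-1.0, 1.0), -1), ((-2.0, -1.0), -1),
            ((2.0, 0.0), +1), ((3.0, 1.0), +1), ((4.0, -2.0), +1)]

# Everything the classifier stores: (support vector, label, alpha) and a bias.
# These values are hand-derived assumptions for this toy example.
support_vectors = [((2.0, 0.0), +1, 0.5), ((0.0, 0.0), -1, 0.5)]
bias = -1.0

def svm_decision(x):
    """Decision value computed from the stored support vectors only."""
    return sum(alpha * y * dot(sv, x) for sv, y, alpha in support_vectors) + bias

def svm_predict(x):
    return +1 if svm_decision(x) >= 0 else -1

predictions = [svm_predict(point) for point, _ in training]
print(predictions)
```

Two stored points are enough to classify all six training observations here, whereas a kNN classifier would need to keep the full training set and compute a distance to every point in it for each new observation.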

References

1. Hastie, T., Tibshirani, R. & Friedman, J. Springer Series in Statistics, Heidelberg (2001).
2. Lever, J., Krzywinski, M. & Altman, N. Nat. Methods 13, 703–704 (2016).
3. Lever, J., Krzywinski, M. & Altman, N. Nat. Methods 13, 803–804 (2016).
4. Bzdok, D. & Yeo, B.T.T. Neuroimage 155, 549–564 (2017).

Author information

Authors and Affiliations

Danilo Bzdok: Department of Psychiatry, RWTH Aachen University, Germany; INRIA/Neurospin, Saclay, France
Martin Krzywinski: staff scientist, Canada's Michael Smith Genome Sciences Centre
Naomi Altman: Professor of Statistics, The Pennsylvania State University

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Bzdok, D., Krzywinski, M. & Altman, N. Machine learning: supervised methods. Nat Methods 15, 5–6 (2018). https://doi.org/10.1038/nmeth.4551

Issue Date: 01 January 2018
DOI: https://doi.org/10.1038/nmeth.4551
