Points of Significance: Classification evaluation

Journal name: Nature Methods
Volume: 13
Pages: 603–604
Year published: 2016
DOI: 10.1038/nmeth.3945
Corrected online: 16 September 2016

It is important to understand both what a classification metric expresses and what it hides.

Figures

Figure 1: The confusion matrix shows the counts of true and false predictions obtained with known data.

    Blue and gray circles indicate cases known to be positive (TP + FN) and negative (FP + TN), respectively, and blue and gray backgrounds/squares depict cases predicted as positive (TP + FP) and negative (FN + TN), respectively. Equations for calculating each metric are encoded graphically in terms of the quantities in the confusion matrix. FDR, false discovery rate.
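
The metric definitions encoded graphically in Figure 1 can also be written out directly in terms of the four confusion-matrix counts. The short Python sketch below is illustrative only; the example counts are hypothetical and not taken from the figure.

  # Illustrative sketch: the metric definitions from Figure 1, written in terms of
  # the four confusion-matrix counts. Example counts are hypothetical.
  def confusion_metrics(tp, fp, fn, tn):
      total = tp + fp + fn + tn
      return {
          "accuracy":    (tp + tn) / total,
          "sensitivity": tp / (tp + fn),   # recall, true positive rate
          "specificity": tn / (tn + fp),   # true negative rate
          "precision":   tp / (tp + fp),   # positive predictive value
          "fdr":         fp / (tp + fp),   # false discovery rate = 1 - precision
      }

  print(confusion_metrics(tp=40, fp=10, fn=10, tn=40))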

Figure 2: The same value of a metric can correspond to very different classifier performance.

    (a–d) Each panel shows three different classification scenarios with a table of corresponding values of accuracy (ac), sensitivity (sn), precision (pr), F1 score (F1) and Matthews correlation coefficient (MCC). Scenarios in a group have the same value (0.8) for the metric in bold in each table: (a) accuracy, (b) sensitivity (recall), (c) precision and (d) F1 score. In each panel, those observations that do not contribute to the corresponding metric are struck through with a red line. The color-coding is the same as in Figure 1; for example, blue circles (cases known to be positive) on a gray background (predicted to be negative) are FNs.
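
To make the point of Figure 2 concrete, the illustrative Python sketch below compares two hypothetical confusion matrices (not the figure's scenarios) that share an accuracy of 0.8 but differ sharply in F1 score and MCC.

  # Illustrative sketch only: both hypothetical confusion matrices below have
  # accuracy 0.8, yet their F1 scores and Matthews correlation coefficients differ.
  from math import sqrt

  def f1_and_mcc(tp, fp, fn, tn):
      precision = tp / (tp + fp)
      recall = tp / (tp + fn)
      f1 = 2 * precision * recall / (precision + recall)
      mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
      return round(f1, 2), round(mcc, 2)

  print(f1_and_mcc(tp=40, fp=10, fn=10, tn=40))  # accuracy 0.8 -> F1 = 0.8, MCC = 0.6
  print(f1_and_mcc(tp=1, fp=11, fn=9, tn=79))    # accuracy 0.8 -> F1 = 0.09, MCC = -0.02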

Figure 3: Graphical evaluation of classifiers.

    (a,b) Findings obtained with the (a) ROC, which plots the true positive rate (TPR) versus the false positive rate (FPR), and (b) PR curves. In both panels, curves depict classifiers that are (A) good, (B) similar to random classification and (C) worse than random. The expected performance of a random classifier is shown by the dotted line in a. The equivalent for the PR curve depends on the class balance and is not shown.
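
Both curves in Figure 3 are traced out by sweeping a decision threshold over the classifier's scores. A minimal Python sketch of this procedure is shown below; the scores and labels are hypothetical.

  # Minimal sketch of how ROC and PR curves are traced out: sweep a decision
  # threshold over the scores and record one (FPR, TPR) and one (recall, precision)
  # point per step. Scores and labels are hypothetical.
  def roc_and_pr_points(scores, labels):
      ranked = sorted(zip(scores, labels), reverse=True)  # highest-scoring cases first
      p = sum(labels)                # number of known positives
      n = len(labels) - p            # number of known negatives
      tp = fp = 0
      roc, pr = [], []
      for score, label in ranked:    # lower the threshold one case at a time
          if label == 1:
              tp += 1
          else:
              fp += 1
          roc.append((fp / n, tp / p))          # (FPR, TPR)
          pr.append((tp / p, tp / (tp + fp)))   # (recall, precision)
      return roc, pr

  scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
  labels = [1,   1,   0,   1,   1,   0,   0,   1,   0,   0]
  roc, pr = roc_and_pr_points(scores, labels)
  print(roc)
  print(pr)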

Figure 4: Graphical representation of classifier performance avoids setting an exact threshold on results but may be insensitive to important aspects of the data.

    (a,b) ROC and PR curves for two data sets with very different class balances: (a) 5% positive and (b) 50% positive observations. For each panel, observations are shown as vertical lines (top), of which 5% or 50% are positive (blue).
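
The contrast in Figure 4 follows from simple arithmetic: at a fixed operating point, TPR and FPR (the ROC coordinates) do not depend on class balance, whereas precision does. The sketch below illustrates this with hypothetical TPR and FPR values.

  # Illustrative arithmetic behind Figure 4: the ROC coordinates (TPR, FPR) are
  # unchanged by class balance, but precision is not. TPR and FPR are hypothetical.
  def precision_at(tpr, fpr, prevalence):
      tp_fraction = tpr * prevalence        # true positives as a fraction of all cases
      fp_fraction = fpr * (1 - prevalence)  # false positives as a fraction of all cases
      return tp_fraction / (tp_fraction + fp_fraction)

  for prevalence in (0.50, 0.05):
      print(prevalence, round(precision_at(tpr=0.8, fpr=0.1, prevalence=prevalence), 3))
  # 50% positives -> precision ~ 0.889; 5% positives -> precision ~ 0.296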

Change history

Corrected online 16 September 2016
In the version of this article initially published, the expression defining the Fβ score was incorrect. The correct expression is Fβ = (1 + β²)(Precision × Recall)/(β² × Precision + Recall). The error has been corrected in the HTML and PDF versions of the article.
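
For reference, a small Python sketch of the corrected Fβ expression is given below; the example precision and recall values are hypothetical.

  # Sketch of the corrected F-beta expression, with hypothetical precision and recall.
  def f_beta(precision, recall, beta=1.0):
      return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

  print(f_beta(0.75, 0.60, beta=1))  # F1, the harmonic mean of precision and recall
  print(f_beta(0.75, 0.60, beta=2))  # F2 weights recall more heavily than precision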

Author information

Affiliations

  1. Jake Lever is a PhD candidate at Canada's Michael Smith Genome Sciences Centre.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

  3. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

Competing financial interests

The authors declare no competing financial interests.
