We thank Prof. Pinker for raising important points about how to assess the performance of machine learning models. The central finding of our work is that a machine learning pipeline operating on an open-source data format for electronic health records can render accurate predictions across multiple tasks and multiple health systems. To demonstrate this, we selected three commonly used binary prediction tasks (inpatient mortality, 30-day unplanned readmission, and length of stay) as well as the task of predicting every discharge diagnosis. The main metric we used for the binary predictions was the area under the receiver operating characteristic curve (AUROC).

We would first like to clarify a few issues. We highlight that, in our Results section, we did report the number-needed-to-evaluate, or work-up to detection ratio, for both the inpatient mortality model and the baseline model; this metric, defined as 1/PPV, is commonly accepted as clinically relevant.1
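As a minimal sketch of how this metric relates to model output (the array names and the alerting threshold below are illustrative assumptions, not values or code from our study), the number-needed-to-evaluate is simply the reciprocal of the positive predictive value at a chosen operating point:

```python
import numpy as np

def number_needed_to_evaluate(y_true, y_score, threshold=0.5):
    """Work-up to detection ratio (1 / PPV) at a given alerting threshold.

    y_true    : binary outcome labels (1 = event, e.g. inpatient death)
    y_score   : predicted risk scores from the model
    threshold : hypothetical operating point at which an alert is raised
    """
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_score) >= threshold      # patients the model flags
    true_pos = np.sum(flagged & (y_true == 1))      # flagged patients who have the event
    ppv = true_pos / max(flagged.sum(), 1)          # positive predictive value
    return np.inf if ppv == 0 else 1.0 / ppv        # patients worked up per true case found

# Illustrative use with hypothetical inputs (not data from the study):
# nne = number_needed_to_evaluate(y_true, y_score, threshold=0.1)
```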

Also, as described in the “Study Cohort” section, we only included hospitalizations of 24 h or longer, and Table 1 reports the inpatient mortality rates of the hospitals to be approximately 2% in that cohort. This should not be confused with 2.3% of patients dying within 24 h.

Table 1 Area under the precision-recall curves for various predictions

Prof. Pinker states that the public could be misled by the way the mainstream media reported the results of our paper. We observed that many reports incorrectly conflated accuracy with AUROC. We take seriously our responsibility to explain our results clearly to a more general audience, and we simultaneously released a public blog post.2 In that post, we described the AUROC explicitly: “The most common way to assess accuracy is by a measure called the area-under-the-receiver-operator curve, which measures how well a model distinguishes between a patient who will have a particular future outcome compared to one who will not. In this metric, 1.00 is perfect, and 0.50 is no better than random chance, so higher numbers mean the model is more accurate.”

We agree that the AUROC has its limitations, although we would note that no single metric conveys a complete picture of a model's performance. The AUROC has the advantage of being a commonly reported metric in both clinical and recent machine-learning papers.3 We did caution in our manuscript that direct comparison of AUROCs from studies using different cohorts is problematic.4

However, we do agree that the area under the precision-recall curve (AUPRC) is relevant for prediction tasks and can be particularly informative for clinical tasks with high class imbalance, as illustrated by the sketch below.
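As a toy illustration of this point (a synthetic simulation with made-up prevalence and scores, not data or code from our study), both metrics can be computed with scikit-learn; under heavy class imbalance the AUROC can remain high while the AUPRC better reflects how many false positives accompany each true positive:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic cohort with roughly 2% prevalence, loosely mimicking the
# class imbalance of inpatient mortality (illustrative only).
n = 10_000
y_true = (rng.random(n) < 0.02).astype(int)

# Synthetic risk scores: positives tend to score higher, with overlap.
y_score = np.where(y_true == 1,
                   rng.normal(0.7, 0.15, n),
                   rng.normal(0.4, 0.15, n))

print("AUROC:", roc_auc_score(y_true, y_score))            # insensitive to prevalence
print("AUPRC:", average_precision_score(y_true, y_score))  # penalized by false positives
```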

Therefore, we report the AUPRC for each of the binary prediction tasks for the primary models reported in the manuscript, the clinical baselines, and the enhanced baselines described in the supplemental materials (Table 1). The confidence intervals were calculated by stratified bootstrapping of the positive and negative classes, as is common for this metric.5 It is worth noting that the models evaluated here were tuned to optimize the AUROC, and it is well known that a model tuned to optimize AUROC does not necessarily optimize AUPRC (and vice versa). The size of the test sets (9624 for Hospital A and 12,127 for Hospital B) limits the power to make comparisons between models, although the point estimates are higher for the deep learning models in each case.
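For completeness, the following is a minimal sketch of the stratified bootstrap described above, resampling the positive and negative classes separately so that prevalence is preserved in each replicate; the function name and parameters are hypothetical, not the code used to produce Table 1:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def stratified_bootstrap_auprc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for AUPRC, resampling each class separately."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos = np.where(y_true == 1)[0]
    neg = np.where(y_true == 0)[0]
    stats = []
    for _ in range(n_boot):
        # Resample positives and negatives independently to keep class balance fixed.
        idx = np.concatenate([rng.choice(pos, size=pos.size, replace=True),
                              rng.choice(neg, size=neg.size, replace=True)])
        stats.append(average_precision_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```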