Machine learning has led to a surge in the development of clinical predictive algorithms. The generalizability of these algorithms often goes untested1, leaving the community in the dark about their accuracy and safety when applied to a specific medical setting. We need clear objectives with respect to generalizability that align with the intended use. Journals, funding organizations, and regulatory bodies provide some guidance on generalizability requirements for clinical predictive algorithms, but a clear definition is often lacking. For example, it is considered best practice to ‘Describe the generalizability of the model including the performance of the model on validation and testing datasets2’. We consider this recommendation too vague: it is not clear what type of generalizability is referred to, nor whether it is sufficient for the intended use of the algorithm (see Supplementary Table 1 for more examples and suggestions for improvement). This commentary aims to clarify the different objectives related to generalizability by giving an overview of three main types of generalizability summarized from the literature, together with their associated goals, methodology, and stakeholders.

We performed a scoping review to identify different types of generalizability (see Supplementary Methods and Supplementary Table 2). In the context of clinical prediction models or predictive algorithms, generalizability refers to an algorithm’s ability to perform adequately across different settings3. Setting is defined by the clinical context of included patients, time, and place. Algorithm performance can then be assessed along various axes, including discrimination3, calibration4, and measures for clinical usefulness, such as Net Benefit5. We extracted three distinct types of generalizability. Examples of published validation use cases for each generalizability type can be found in Supplementary Table 3.
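
These performance axes can be made concrete in a few lines of code. The sketch below (Python with scikit-learn) is illustrative only: it assumes a binary outcome y and predicted risks p from an already fitted algorithm, the function names are our own, and the calibration intercept is taken from the joint recalibration fit for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def discrimination(y, p):
    """Discrimination: area under the ROC curve (c-statistic)."""
    return roc_auc_score(y, p)

def calibration_slope_intercept(y, p, eps=1e-8):
    """Calibration: regress the outcome on the logit of the predicted risk;
    a slope near 1 and an intercept near 0 suggest good calibration.
    (Intercept from the joint fit is a common simplification.)"""
    p = np.clip(p, eps, 1 - eps)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    fit = LogisticRegression(C=1e6).fit(logit, y)  # large C ~ unpenalized
    return fit.coef_[0, 0], fit.intercept_[0]

def net_benefit(y, p, threshold):
    """Clinical usefulness: net benefit at a chosen decision threshold."""
    y, p = np.asarray(y), np.asarray(p)
    n = len(y)
    treated = p >= threshold
    tp = np.sum(treated & (y == 1))
    fp = np.sum(treated & (y == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)
```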

A key distinction should be made between internal and external validation (Fig. 1). Internal validation assesses the reproducibility of algorithm performance in data that is distinct from the development (or training) data but derived from the exact same underlying population. It provides an optimism-corrected estimate of performance for the setting from which the data originated6. Cross-validation and bootstrapping are the recommended methods to assess internal validity6,7. Cross-validation splits the data into equal parts (usually five or ten) and trains the algorithm on all but one holdout part, which is used for testing. This process is repeated until all parts have been used as test data. The whole procedure is preferably repeated multiple times for more stability, e.g., a 10 × 10-fold cross-validation procedure. Bootstrapping repeatedly samples data points from the development data with replacement (usually 500–2000 times). These samples are used to train the algorithm, with the original development data serving as the test set6,8. Internal validation is necessary but not sufficient to ensure safe clinical applicability. The main stakeholder is the developer of the algorithm, who uses internal validation to assess the validity of the development process and to quantify overoptimism in expected performance7,9.
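
As an illustration, a minimal sketch of both internal-validation strategies is given below (Python with scikit-learn). It assumes a binary outcome, a classifier exposing predict_proba, and the AUC as performance measure; these are illustrative choices rather than part of the cited recommendations.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.utils import resample

def repeated_cv_auc(model, X, y, n_splits=10, n_repeats=10, seed=0):
    """10 x 10-fold cross-validation: average performance on held-out folds."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    aucs = []
    for train_idx, test_idx in cv.split(X, y):
        m = clone(model).fit(X[train_idx], y[train_idx])
        aucs.append(roc_auc_score(y[test_idx],
                                  m.predict_proba(X[test_idx])[:, 1]))
    return np.mean(aucs)

def bootstrap_optimism_auc(model, X, y, n_boot=500, seed=0):
    """Bootstrap optimism correction: refit on each bootstrap sample, evaluate
    on the bootstrap sample and on the original development data; the average
    gap (optimism) is subtracted from the apparent performance."""
    rng = np.random.RandomState(seed)
    apparent = roc_auc_score(y, clone(model).fit(X, y).predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)
        m = clone(model).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, m.predict_proba(Xb)[:, 1])
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)
    return apparent - np.mean(optimism)
```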

Fig. 1: Generalizability types.

Schematic overview of the different types of generalizability with the validation’s goals, methods, and stakeholders.

External validity assesses the transportability of the clinical predictive algorithm to settings other than those considered during development (Fig. 1). It encompasses three generalizability types: temporal, geographical, and domain generalizability. Temporal validity assesses the performance of an algorithm over time in the development setting. This type of generalizability is required to understand data drift (a change in the data over time relative to the data used during development)10. Temporal validity may be assessed by testing the algorithm on a dataset derived from the same setting as the development cohort but from a later time. Variations in design are possible, such as a ‘waterfall’ design, in which the development time window is repeatedly increased11. The main stakeholders of temporal validity are clinicians, hospital administrators, and other clinical end-users who plan to implement the algorithm into their clinical practice. These stakeholders need proof of temporal validity to ensure the safe use of the algorithm at their local clinical institution or hospital.
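
A hedged sketch of how such a temporal (‘waterfall’) validation could be coded is shown below; the DataFrame layout, the admission-date column, and the choice of cutoff dates are assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def waterfall_validation(model, df, date_col, feature_cols, outcome_col, cutoffs):
    """Repeatedly enlarge the development window up to each cutoff date and
    test on the patients seen after that cutoff; a drop in performance for
    later cutoffs points to data drift."""
    results = []
    for cutoff in cutoffs:
        dev = df[df[date_col] < cutoff]
        test = df[df[date_col] >= cutoff]
        if dev.empty or test.empty:
            continue
        m = clone(model).fit(dev[feature_cols], dev[outcome_col])
        auc = roc_auc_score(test[outcome_col],
                            m.predict_proba(test[feature_cols])[:, 1])
        results.append({"development_until": cutoff, "test_auc": auc})
    return pd.DataFrame(results)
```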

Geographical validation assesses the generalizability of an algorithm to a place (institution or location) that is different from where the algorithm was developed. This type of validation assesses the heterogeneity across places. Geographical validity can be assessed by testing the algorithm on data collected from the new place(s). More complex designs are possible, such as a leave-one-site-out (or internal-external) validation, in which the algorithm is developed on all but one location and tested on the left-out one12. This process is repeated until all locations have been used as the test location. Geographical validation is required when the algorithm is going to be used outside of the original development place. The main stakeholders are the clinical end-users at a new implementation site who want proof of validity for safe use at their site. Manufacturers, insurers, and governing bodies could be other stakeholders interested in evidence for the general or widespread applicability of the prediction tool. When geographical generalizability is low, a global model that is valid across different places may not be tenable13. Instead, a local variant of the algorithm could be obtained by updating the global algorithm at each individual place4.
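
A minimal sketch of a leave-one-site-out (internal-external) validation is given below, using scikit-learn’s LeaveOneGroupOut; the site labels, the binary outcome, and the AUC metric are illustrative assumptions.

```python
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_site_out_auc(model, X, y, site):
    """Develop the algorithm on all sites except one, test on the held-out
    site, and repeat until every site has served as the test location.
    Large heterogeneity across sites signals limited geographical validity."""
    per_site = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=site):
        m = clone(model).fit(X[train_idx], y[train_idx])
        held_out = site[test_idx][0]
        per_site[held_out] = roc_auc_score(
            y[test_idx], m.predict_proba(X[test_idx])[:, 1])
    return per_site
```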

Domain validation assesses the generalizability of an algorithm to a different clinical context14,15. This type of validation considers generalizability across medical backgrounds (e.g., 30-day mortality risk for emergency versus surgical patients), medical settings (e.g., fall prevention in nursing homes versus hospitals), and demographics (e.g., emergency admission risk for adult versus pediatric patients). For example, some COVID-19 prediction models were developed for related respiratory diagnoses16. In a large study on the generalizability of prediction models, model performance was found to be better in ‘closely related’ than in ‘distantly related’ validation cohorts, which underscores the relevance of domain generalizability17. Like geographical validation, domain validity is assessed by testing the algorithm on data collected from the new domain. Stakeholders of domain validity include clinical end-users from the new domain, manufacturers, insurers, and governing bodies. If the algorithm does not generalize across domains, the underlying relationships may be truly different, warranting separate algorithms for each domain.
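
In code, a domain validation amounts to applying the frozen development-domain algorithm to data from the new domain and inspecting performance there. The sketch below assumes an already fitted classifier and reports discrimination plus calibration-in-the-large; the cohort split and the metrics are our own illustrative choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def domain_validation(fitted_model, X_new_domain, y_new_domain):
    """Apply the unchanged development-domain model to the new clinical
    context (e.g., pediatric instead of adult patients) and compare mean
    predicted risk with the observed event rate."""
    p = fitted_model.predict_proba(X_new_domain)[:, 1]
    return {
        "auc_new_domain": roc_auc_score(y_new_domain, p),
        "mean_predicted_risk": float(np.mean(p)),
        "observed_event_rate": float(np.mean(y_new_domain)),
    }
```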

The overview presented in Fig. 1 may be used as a starting point by regulatory bodies, industry, and academia when formulating guidelines and requirements for the generalizability of a clinical predictive algorithm. Building on previous work18,19,20,21, we argue that validation studies should be suited to the target context and the intended use of the clinical predictive algorithm. Always aiming for a specific type of generalizability may not be defensible for some predictive algorithms and their intended use18,22.

During algorithm development and validation, researchers and developers should adhere to guidelines, specifically TRIPOD or its forthcoming variant, TRIPOD-AI20,23. They should report on the algorithm’s capacity to generalize and justify their chosen validation strategy by relating it to the intended operational period, (clinical) population, and environment. Moreover, they should add a disclaimer about the type of generalizability and intended use of their algorithms. If generalizability is limited, this ought to be acknowledged alongside other implementation risks, for example, when only internal or temporal validation was performed, or when poor generalizability was found across places or clinical contexts. Researchers and developers should also report when the algorithm’s scope limits the necessary validation steps. For example, domain validation may not be attempted when a predictive algorithm cannot be used (or has very limited use) outside of its domain (e.g., a prostate biopsy model).

In conclusion, we propose a more precise specification of the desired and required type of generalizability for the implementation of clinical predictive algorithms. The three generalizability types discussed here (temporal, geographical, and domain generalizability) each serve a unique goal and specific application purpose. Hence, researchers, developers, journals, funding organizations, and regulatory bodies should ensure that claims about an algorithm’s generalizability and intended use align with the underlying evidence. Future research may assess the impact of different types of heterogeneity on generalizability and identify steps to improve the generalizability of clinical predictive algorithms.