
Demographic bias in misdiagnosis by computational pathology models

Abstract

Despite increasing numbers of regulatory approvals, evaluations of deep learning-based computational pathology systems often overlook the impact of demographic factors on performance, potentially leading to biases. This concern is all the more important as computational pathology has leveraged large public datasets that underrepresent certain demographic groups. Using publicly available data from The Cancer Genome Atlas and the EBRAINS brain tumor atlas, as well as internal patient data, we show that whole-slide image classification models display marked performance disparities across different demographic groups when used to subtype breast and lung carcinomas and to predict IDH1 mutations in gliomas. For example, when using common modeling approaches, we observed performance gaps (in area under the receiver operating characteristic curve) between white and Black patients of 3.0% for breast cancer subtyping, 10.9% for lung cancer subtyping and 16.0% for IDH1 mutation prediction in gliomas. We found that richer feature representations obtained from self-supervised vision foundation models reduce performance variations between groups. These representations provide improvements upon weaker models even when those weaker models are combined with state-of-the-art bias mitigation strategies and modeling choices. Nevertheless, self-supervised vision foundation models do not fully eliminate these discrepancies, highlighting the continuing need for bias mitigation efforts in computational pathology. Finally, we demonstrate that our results extend to other demographic factors beyond patient race. Given these findings, we encourage regulatory and policy agencies to integrate demographic-stratified evaluation into their assessment guidelines.
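The demographic-stratified evaluation the abstract advocates can be sketched in a few lines. The helper below is a hypothetical illustration (not from the paper's codebase): it computes per-group AUROC via the Mann-Whitney statistic and reports the largest pairwise gap, the quantity behind figures such as the 10.9% white-Black gap for lung cancer subtyping.

```python
import numpy as np

def _auroc(y, s):
    """AUROC as the Mann-Whitney statistic: P(positive score > negative
    score), with ties counted as one half."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def stratified_auroc(y_true, y_score, group):
    """Per-group AUROC plus the largest pairwise gap across groups."""
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    aucs = {g: _auroc(y_true[group == g], y_score[group == g])
            for g in np.unique(group)}
    return aucs, max(aucs.values()) - min(aucs.values())
```

Given slide-level labels, model scores and a per-patient demographic annotation, `stratified_auroc(labels, scores, race)` returns both the per-group AUROCs and the worst-case disparity, which is the summary a regulator-facing stratified report would surface.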

Fig. 1: Dataset characteristics, fairness metrics and modeling choices investigated.
Fig. 2: Investigating bias from data characteristics.
Fig. 3: Investigating bias from MIL model architectures and bias mitigation strategies.
Fig. 4: Evaluating race information in embeddings.
Fig. 5: Effect of training set diversity and size on disparities.
Fig. 6: Investigating lung subtyping disparities beyond race.

Data availability

Public data from TCGA, including digital histology and the clinical annotations used, are available at https://portal.gdc.cancer.gov/ and https://cbioportal.org. The EBRAINS brain tumor atlas can be accessed at https://search.kg.ebrains.eu/instances/Dataset/8fc108ab-e2b4-406f-8999-60269dc1f994. Restrictions apply to the availability of the in-house data, which were used with institutional permission for the current study and are thus not publicly available. We note that these data were not specifically collected for this study. All requests for data may be addressed to the corresponding author and will be promptly evaluated based on institutional and departmental policies to determine whether the data requested are subject to intellectual property or patient privacy obligations. Internal data can only be shared for noncommercial, academic purposes and will require a data user agreement.

Code availability

All code was implemented in Python using PyTorch as the primary deep learning package. Code and scripts to reproduce the training experiments of this paper are available at https://github.com/mahmoodlab/CPATH_demographics.
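The repository above is the authoritative implementation. For orientation only, here is a minimal PyTorch sketch of attention-based multiple-instance learning (MIL), the family of whole-slide classifiers whose architectures the paper investigates: patch embeddings form a bag, an attention branch scores each patch, and the attention-weighted mean is classified at the slide level. Dimensions and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Minimal attention-based MIL head (illustrative sketch, not the
    authors' code): pools a bag of patch embeddings into one
    slide-level prediction."""

    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        # Attention branch scores each patch embedding with a scalar.
        self.attn = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, 1))
        self.head = nn.Linear(in_dim, n_classes)

    def forward(self, bag):
        # bag: (n_patches, in_dim) embeddings from one whole-slide image.
        a = torch.softmax(self.attn(bag), dim=0)   # (n_patches, 1), sums to 1
        slide = (a * bag).sum(dim=0)               # attention-weighted mean
        return self.head(slide), a                 # logits and attention map
```

The returned attention weights are what make such models inspectable: one can ask which patches drove a slide-level call, and, as in the paper, whether embeddings from stronger self-supervised foundation encoders narrow subgroup disparities relative to weaker features.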

References

  1. Song, A. H. et al. Artificial intelligence for digital and computational pathology. Nat. Rev. Bioeng. 1, 930–949 (2023).

    Article  Google Scholar 

  2. van der Laak, J., Litjens, G. & Ciompi, F. Deep learning in histopathology: the path to the clinic. Nat. Med. 27, 775–784 (2021).

    Article  PubMed  Google Scholar 

  3. Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  5. Skrede, O.-J. et al. Deep learning for prediction of colorectal cancer outcome: a discovery and validation study. Lancet 395, 350–360 (2020).

    Article  CAS  PubMed  Google Scholar 

  6. Courtiol, P. et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25, 1519–1525 (2019).

    Article  CAS  PubMed  Google Scholar 

  7. Chen, R. J. et al. Pan-cancer integrative histology–genomic analysis via multimodal deep learning. Cancer Cell 40, 865–878 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Kather, J. N. et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Nat. Cancer 1, 789–799 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Fu, Y. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).

    Article  CAS  PubMed  Google Scholar 

  10. Chen, R. J. et al. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. in Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition 16144–16155 (IEEE, 2022).

  11. Shao, Z. et al. TransMIL: transformer based correlated multiple instance learning for whole slide image classification. in Advances in Neural Information Processing Systems Vol. 34 (eds. Ranzato, M. et al.) 2136–2147 (Curran Associates, 2021).

  12. Chan, T. H., Cendra, F. J., Ma, L., Yin, G. & Yu, L. Histopathology whole slide image analysis with heterogeneous graph representation learning. in Proc. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition 15661–15670 (IEEE, 2023).

  13. Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Leo, P. et al. Computer extracted gland features from H&E predicts prostate cancer recurrence comparably to a genomic companion diagnostic test: a large multi-site study. NPJ Precis. Oncol. 5, 35 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Howard, F. M. et al. The impact of site-specific digital histology signatures on deep learning model accuracy and bias. Nat. Commun. 12, 4423 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Chatterji, S. et al. Prediction models for hormone receptor status in female breast cancer do not extend to males: further evidence of sex-based disparity in breast cancer. NPJ Breast Cancer 9, 91 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Dehkharghanian, T. et al. Biased data, biased AI: deep networks predict the acquisition site of TCGA images. Diagn. Pathol. 18, 67 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  18. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).

    Article  CAS  PubMed  Google Scholar 

  19. Mhasawade, V., Zhao, Y. & Chunara, R. Machine learning and algorithmic fairness in public and population health. Nat. Mach. Intell. 3, 659–666 (2021).

    Article  Google Scholar 

  20. Gichoya, J. W. et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit. Health 4, e406–e414 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Pierson, E., Cutler, D. M., Leskovec, J., Mullainathan, S. & Obermeyer, Z. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nat. Med. 27, 136–140 (2021).

    Article  CAS  PubMed  Google Scholar 

  22. Population Estimates, July 1, 2022 (V2022). U.S. Census Bureau QuickFacts https://www.census.gov/quickfacts/fact/table/US/PST045222 (2022).

  23. Landry, L. G., Ali, N., Williams, D. R., Rehm, H. L. & Bonham, V. L. Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Aff. (Millwood) 37, 780–785 (2018).

    Article  PubMed  Google Scholar 

  24. Liu, J. et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell 173, 400–416 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Spratt, D. E. et al. Racial/ethnic disparities in genomic sequencing. JAMA Oncol. 2, 1070–1074 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  26. Khor, S. et al. Racial and ethnic bias in risk prediction models for colorectal cancer recurrence when race and ethnicity are omitted as predictors. JAMA Netw. Open 6, e2318495 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  27. van der Burgh, A. C., Hoorn, E. J. & Chaker, L. Removing race from kidney function estimates. JAMA 325, 2018 (2021).

    Article  PubMed  Google Scholar 

  28. Diao, J. A. et al. Clinical implications of removing race from estimates of kidney function. JAMA 325, 184–186 (2021).

    Article  PubMed  Google Scholar 

  29. Marmot, M. Social determinants of health inequalities. Lancet 365, 1099–1104 (2005).

    Article  PubMed  Google Scholar 

  30. Dietze, E. C., Sistrunk, C., Miranda-Carboni, G., O’Reagan, R. & Seewaldt, V. L. Triple-negative breast cancer in African-American women: disparities versus biology. Nat. Rev. Cancer 15, 248–254 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Cormier, J. N. et al. Ethnic differences among patients with cutaneous melanoma. Arch. Intern. Med. 166, 1907–1914 (2006).

    Article  PubMed  Google Scholar 

  32. Rubin, J. B. The spectrum of sex differences in cancer. Trends Cancer 8, 303–315 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Lara, O. D. et al. Pan-cancer clinical and molecular analysis of racial disparities. Cancer 126, 800–807 (2020).

    Article  CAS  PubMed  Google Scholar 

  34. Heath, E. I. et al. Racial disparities in the molecular landscape of cancer. Anticancer Res. 38, 2235–2240 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  35. Gucalp, A. et al. Male breast cancer: a disease distinct from female breast cancer. Breast Cancer Res. Treat. 173, 37–48 (2019).

    Article  PubMed  Google Scholar 

  36. Dong, M. et al. Sex differences in cancer incidence and survival: a pan-cancer analysis. Cancer Epidemiol. Biomarkers Prev. 29, 1389–1397 (2020).

    Article  PubMed  Google Scholar 

  37. Butler, E. N., Kelly, S. P., Coupland, V. H., Rosenberg, P. S. & Cook, M. B. Fatal prostate cancer incidence trends in the United States and England by race, stage, and treatment. Br. J. Cancer 123, 487–494 (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  38. Zavala, V. A. et al. Cancer health disparities in racial/ethnic minorities in the United States. Br. J. Cancer 124, 315–332 (2021).

    Article  PubMed  Google Scholar 

  39. Ngan, H.-L., Wang, L., Lo, K.-W. & Lui, V. W. Y. Genomic landscapes of EBV-associated nasopharyngeal carcinoma vs. HPV-associated head and neck cancer. Cancers (Basel) 10, 210 (2018).

    Article  PubMed  Google Scholar 

  40. Singh, H., Singh, R., Mhasawade, V. & Chunara, R. Fairness violations and mitigation under covariate shift. in Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency 3–13 (Association for Computing Machinery, 2021).

  41. Maity, S., Mukherjee, D., Yurochkin, M. & Sun, Y. Does enforcing fairness mitigate biases caused by subpopulation shift? in Advances in Neural Information Processing Systems Vol. 34 (eds. Ranzato, M. et al.) 25773–25784 (Curran Associates, 2021).

  42. Giguere, S. et al. Fairness guarantees under demographic shift. in Proc. 10th International Conference on Learning Representations (ICLR, 2022).

  43. Schrouff, J. et al. Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. in Advances in Neural Information Processing Systems Vol. 35 (eds. Koyejo, S. et al.) 19304–19318 (Curran Associates, 2022).

  44. Chen, S. et al. Machine learning-based pathomics signature could act as a novel prognostic marker for patients with clear cell renal cell carcinoma. Br. J. Cancer 126, 771–777 (2022).

    Article  CAS  PubMed  Google Scholar 

  45. US Food and Drug Administration. Evaluation of automatic class III designation for Paige Prostate. www.accessdata.fda.gov/cdrh_docs/reviews/DEN200080.pdf (2021).

  46. Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I. Y. & Ghassemi, M. CheXclusion: fairness gaps in deep chest X-ray classifiers. Pac. Symp. Biocomput. 26, 232–243 (2021).

    PubMed  Google Scholar 

  47. Seyyed-Kalantari, L., Zhang, H., McDermott, M., Chen, I. Y. & Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 27, 2176–2182 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Glocker, B., Jones, C., Bernhardt, M. & Winzeck, S. Risk of bias in chest X-ray foundation models. Preprint at https://arxiv.org/abs/2209.02965v1 (2022).

  49. Beheshtian, E., Putman, K., Santomartino, S. M., Parekh, V. S. & Yi, P. H. Generalizability and bias in a deep learning pediatric bone age prediction model using hand radiographs. Radiology 306, e220505 (2023).

    Article  PubMed  Google Scholar 

  50. Röösli, E., Bozkurt, S. & Hernandez-Boussard, T. Peeking into a black box, the fairness and generalizability of a MIMIC-III benchmarking model. Sci. Data 9, 24 (2022).

    Article  PubMed  PubMed Central  Google Scholar 

  51. Bernhardt, M., Jones, C. & Glocker, B. Potential sources of dataset bias complicate investigation of underdiagnosis by machine learning algorithms. Nat. Med. 28, 1157–1158 (2022).

    Article  CAS  PubMed  Google Scholar 

  52. Mukherjee, P. et al. Confounding factors need to be accounted for in assessing bias by machine learning algorithms. Nat. Med. 28, 1159–1160 (2022).

    Article  CAS  PubMed  Google Scholar 

  53. Meng, C., Trinh, L., Xu, N., Enouen, J. & Liu, Y. Interpretability and fairness evaluation of deep learning models on MIMIC-IV dataset. Sci. Rep. 12, 7166 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Vyas, D. A., Eisenstein, L. G. & Jones, D. S. Hidden in plain sight—reconsidering the use of race correction in clinical algorithms. N. Engl. J. Med. 383, 874–882 (2020).

    Article  PubMed  Google Scholar 

  55. Madras, D., Creager, E., Pitassi, T. & Zemel, R. Learning adversarially fair and transferable representations. in Proc. 35th International Conference on Machine Learning 3384–3393 (PMLR, 2018).

  56. Wang, R., Chaudhari, P. & Davatzikos, C. Bias in machine learning models can be significantly mitigated by careful training: evidence from neuroimaging studies. Proc. Natl Acad. Sci. USA 120, e2211613120 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Yang, J., Soltan, A. A., Eyre, D. W. & Clifton, D. A. Algorithmic fairness and bias mitigation for clinical machine learning with deep reinforcement learning. Nat. Mach. Intell. 5, 884–894 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  58. Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H. & Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl Acad. Sci. USA 117, 12592–12594 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  59. Burlina, P., Joshi, N., Paul, W., Pacheco, K. D. & Bressler, N. M. Addressing artificial intelligence bias in retinal diagnostics. Transl. Vis. Sci. Technol. 10, 13 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  60. Relli, V., Trerotola, M., Guerra, E. & Alberti, S. Distinct lung cancer subtypes associate to distinct drivers of tumor progression. Oncotarget 9, 35528–35540 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  61. Relli, V., Trerotola, M., Guerra, E. & Alberti, S. Abandoning the notion of non-small cell lung cancer. Trends Mol. Med. 25, 585–594 (2019).

    Article  PubMed  Google Scholar 

  62. Yan, H. et al. IDH1 and IDH2 mutations in gliomas. N. Engl. J. Med. 360, 765–773 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Hardt, M., Price, E. & Srebro, N. Equality of opportunity in supervised learning. in Advances in Neural Information Processing Systems Vol. 29 (eds. Lee, D. D. et al.) 3315–3323 (Curran Associates, 2016).

  64. Barocas, S., Hardt, M. & Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities (MIT Press, 2023); fairmlbook.org/pdf/fairmlbook.pdf

  65. Chouldechova, A. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5, 153–163 (2017).

    Article  PubMed  Google Scholar 

  66. Wang, X. et al. Characteristics of The Cancer Genome Atlas cases relative to U.S. general population cancer cases. Br. J. Cancer 119, 885–892 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  67. Roetzer-Pejrimovsky, T. et al. The Digital Brain Tumour Atlas, an open histopathology resource. Sci. Data 9, 55 (2022).

    Article  PubMed  PubMed Central  Google Scholar 

  68. Maron, O. & Lozano-Pérez, T. A framework for multiple-instance learning. in Advances in Neural Information Processing Systems Vol. 10 (eds. Jordan, M. I. et al.) 570–576 (MIT Press, 1998).

  69. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).

  70. Wang, X. et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. Image Anal. 81, 102559 (2022).

    Article  PubMed  Google Scholar 

  71. Chen, R. J. et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 30, 850–862 (2024).

    Article  CAS  PubMed  Google Scholar 

  72. Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. in Proc. 35th International Conference on Machine Learning 2127–2136 (PMLR, 2018).

  73. Jaume, G., Song, A. H. & Mahmood, F. Integrating context for superior cancer prognosis. Nat. Biomed. Eng. 6, 1323–1325 (2022).

    Article  CAS  PubMed  Google Scholar 

  74. Kamiran, F. & Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33, 1–33 (2012).

    Article  Google Scholar 

  75. Krasanakis, E., Spyromitros-Xioufis, E., Papadopoulos, S. & Kompatsiaris, Y. Adaptive sensitive reweighting to mitigate bias in fairness-aware classification. in Proc. 2018 World Wide Web Conference 853–862 (International World Wide Web Conferences Steering Committee, 2018).

  76. Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N. & Varshney, K. R. Optimized pre-processing for discrimination prevention. in Advances in Neural Information Processing Systems Vol. 30 (eds. Guyon, I. et al.) 3995–4004 (Curran Associates, 2017).

  77. Zemel, R., Wu, Y., Swersky, K., Pitassi, T. & Dwork, C. Learning fair representations. in Proc. 30th International Conference on Machine Learning 325–333 (PMLR, 2013).

  78. Zafar, M. B., Valera, I., Rodriguez, M. G. & Gummadi, K. P. Fairness beyond disparate treatment and disparate impact: learning classification without disparate mistreatment. in Proc. 26th International Conference on World Wide Web 1171–1180 (International World Wide Web Conferences Steering Committee, 2017).

  79. Celis, L. E. & Keswani, V. Improved adversarial learning for fair classification. Preprint at https://arxiv.org/abs/1901.10443 (2019).

  80. Zhong, Y. et al. MEDFAIR: benchmarking fairness for medical imaging. in Proc. International Conference on Learning Representations (ICLR, 2023).

  81. Yang, Y., Zhang, H., Katabi, D. & Ghassemi, M. Change is hard: a closer look at subpopulation shift. in International Conference on Machine Learning (ICML, 2023).

  82. Breen, J. et al. Efficient subtyping of ovarian cancer histopathology whole slide images using active sampling in multiple instance learning. in Proc. SPIE 12471 (eds. Tomaszewski, J. E. & Ward, A. D.) 1247110 (Society of Photo-Optical Instrumentation Engineers, 2023).

  83. Yao, J., Zhu, X., Jonnagaddala, J., Hawkins, N. & Huang, J. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Med. Image Anal. 65, 101789 (2020).

    Article  PubMed  Google Scholar 

  84. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).

    Article  Google Scholar 

  85. Subbaswamy, A. & Saria, S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 21, 345–352 (2020).

    PubMed  Google Scholar 

  86. Finlayson, S. G. et al. The clinician and dataset shift in artificial intelligence. N. Engl. J. Med. 385, 283–286 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  87. Castro, D. C., Walker, I. & Glocker, B. Causality matters in medical imaging. Nat. Commun. 11, 3673 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  88. Macenko, M. et al. A method for normalizing histology slides for quantitative analysis. in Proc. 6th IEEE International Conference on Symposium on Biomedical Imaging: From Nano to Macro 1107–1110 (IEEE, 2009).

  89. Janowczyk, A., Basavanhally, A. & Madabhushi, A. Stain Normalization using Sparse AutoEncoders (StaNoSA): application to digital pathology. Comput. Med. Imaging Graph. 57, 50–61 (2017).

    Article  PubMed  Google Scholar 

  90. Ciompi, F. et al. The importance of stain normalization in colorectal tissue classification with convolutional networks. in Proc. 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017) 160–163 (IEEE, 2017).

  91. Tellez, D. et al. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med. Image Anal. 58, 101544 (2019).

    Article  PubMed  Google Scholar 

  92. Glocker, B., Jones, C., Bernhardt, M. & Winzeck, S. Algorithmic encoding of protected characteristics in chest X-ray disease detection models. EBioMedicine 89, 104467 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  93. Adleberg, J. et al. Predicting patient demographics from chest radiographs with deep learning. J. Am. Coll. Radiol. 19, 1151–1161 (2022).

    Article  PubMed  Google Scholar 

  94. Yi, P. H. et al. Radiology ‘forensics’: determination of age and sex from chest radiographs using deep learning. Emerg. Radiol. 28, 949–954 (2021).

    Article  PubMed  Google Scholar 

  95. Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).

    Article  CAS  PubMed  Google Scholar 

  96. Naik, N. et al. Deep learning-enabled breast cancer hormonal receptor status determination from base-level H&E stains. Nat. Commun. 11, 5727 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  97. Movva, R. et al. Coarse race data conceals disparities in clinical risk score performance. in Machine Learning for Healthcare Conference 443–472 (PMLR, 2023)

  98. Mamary, A. J. et al. Race and gender disparities are evident in COPD underdiagnoses across all severities of measured airflow obstruction. Chronic Obstr. Pulm. Dis. 5, 177–184 (2018).

    PubMed  PubMed Central  Google Scholar 

  99. Sun, T. Y. et al. Exploring gender disparities in time to diagnosis. in Machine Learning for Healthcare Conference (Curran Associates, 2020).

  100. Gianfrancesco, M. A., Tamang, S., Yazdany, J. & Schmajuk, G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern. Med. 178, 1544–1547 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  101. Glocker, B., Jones, C., Roschewitz, M. & Winzeck, S. Risk of bias in chest radiography deep learning foundation models. Radiol. Artif. Intell. 5, e230060 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  102. Pfohl, S. R., Foryciarz, A. & Shah, N. H. An empirical characterization of fair machine learning for clinical risk prediction. J. Biomed. Inform. 113, 103621 (2021).

    Article  PubMed  Google Scholar 

  103. Borrell, L. N. et al. Race and genetic ancestry in medicine—a time for reckoning with racism. N. Engl. J. Med. 384, 474–480 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  104. Chen, R. J. et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. 7, 719–742 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  105. Banda, Y. et al. Characterizing race/ethnicity and genetic ancestry for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. Genetics 200, 1285–1295 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  106. Bamshad, M., Wooding, S., Salisbury, B. A. & Stephens, J. C. Deconstructing the relationship between genetics and race. Nat. Rev. Genet. 5, 598–609 (2004).

    Article  CAS  PubMed  Google Scholar 

  107. Bhargava, H. K. et al. Computationally derived image signature of stromal morphology is prognostic of prostate cancer recurrence following prostatectomy in African American patients. Clin. Cancer Res. 26, 1915–1923 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  108. Shi, Y. et al. A prospective, molecular epidemiology study of EGFR mutations in Asian patients with advanced non-small-cell lung cancer of adenocarcinoma histology (PIONEER). J. Thorac. Oncol. 9, 154–162 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  109. Martini, R. et al. African ancestry-associated gene expression profiles in triple-negative breast cancer underlie altered tumor biology and clinical outcome in women of African descent. Cancer Discov. 12, 2530–2551 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  110. Zhang, G. et al. Characterization of frequently mutated cancer genes in Chinese breast tumors: a comparison of Chinese and TCGA cohorts. Ann. Transl. Med. 7, 179 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  111. McCradden, M. D., Joshi, S., Mazwi, M. & Anderson, J. A. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit. Health 2, e221–e223 (2020).

    Article  PubMed  Google Scholar 

  112. Sung, H., DeSantis, C. E., Fedewa, S. A., Kantelhardt, E. J. & Jemal, A. Breast cancer subtypes among Eastern-African-born black women and other black women in the United States. Cancer 125, 3401–3411 (2019).

    Article  CAS  PubMed  Google Scholar 

  113. Li, X., Wu, P. & Su, J. Accurate fairness: improving individual fairness without trading accuracy. in Proc. 37th AAAI Conference on Artificial Intelligence Vol. 37 (eds. Williams, B. et al.) 14312–14320 (Association for the Advancement of Artificial Intelligence, 2023).

  114. Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  115. Yang, Y., Zha, K., Chen, Y., Wang, H. & Katabi, D. Delving into deep imbalanced regression. in Proc. 38th International Conference on Machine Learning 11842–11851 (PMLR, 2021).

  116. Morik, M., Singh, A., Hong, J. & Joachims, T. Controlling fairness and bias in dynamic learning-to-rank. in Proc. 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval 429–438 (Association for Computing Machinery, 2020).

  117. Zack, T. et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit. Health 6, e12–e22 (2024).

    Article  CAS  PubMed  Google Scholar 

  118. Vorontsov, E. et al. Virchow: a million-slide digital pathology foundation model. Preprint at https://arxiv.org/abs/2309.07778 (2023).

  119. Dippel, J. et al. RudolfV: a foundation model by pathologists for pathologists. Preprint at https://arxiv.org/abs/2401.04079 (2024).

  120. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).

    Article  Google Scholar 

  121. Pfohl, S. R. et al. Understanding subgroup performance differences of fair predictors using causal models. in NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models (2023).

  122. Cai, T. T., Namkoong, H. & Yadlowsky, S. Diagnosing model performance under distribution shift. Preprint at https://arxiv.org/abs/2303.02011 (2023).

  123. Morning, A. The racial self-identification of South Asians in the United States. J. Ethn. Migr. Stud. 27, 61–79 (2001).

    Article  Google Scholar 

  124. Chadban, S. J. et al. KDIGO clinical practice guideline on the evaluation and management of candidates for kidney transplantation. Transplantation 104, S11–S103 (2020).

    Article  PubMed  Google Scholar 

  125. Eneanya, N. D., Yang, W. & Reese, P. P. Reconsidering the consequences of using race to estimate kidney function. JAMA 322, 113–114 (2019).

    Article  PubMed  Google Scholar 

  126. Zelnick, L. R., Leca, N., Young, B. & Bansal, N. Association of the estimated glomerular filtration rate with vs without a coefficient for race with time to eligibility for kidney transplant. JAMA Netw. Open 4, e2034004 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  127. del Barrio, E., Gordaliza, P. & Loubes, J.-M. Review of mathematical frameworks for fairness in machine learning. Preprint at http://arxiv.org/abs/2005.13755 (2020).

  128. Binns, R. On the apparent conflict between individual and group fairness. in Proc. 2020 Conference on Fairness, Accountability, and Transparency 514–524 (Association for Computing Machinery, 2020).

  129. Braveman, P., Egerter, S. & Williams, D. R. The social determinants of health: coming of age. Annu. Rev. Public Health 32, 381–398 (2011).

    Article  PubMed  Google Scholar 

  130. Walker, R. J., Williams, J. S. & Egede, L. E. Influence of race, ethnicity and social determinants of health on diabetes outcomes. Am. J. Med. Sci. 351, 366–373 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  131. Link, B. G. & Phelan, J. Social conditions as fundamental causes of disease. J. Health Soc. Behav. 35, 80–94 (1995).

    Article  Google Scholar 

  132. Richardson, L. D. & Norris, M. Access to health and health care: how race and ethnicity matter. Mt. Sinai J. Med. 77, 166–177 (2010).

    Article  PubMed  Google Scholar 

  133. Yearby, R. Racial disparities in health status and access to healthcare: the continuation of inequality in the United States due to structural racism. Am. J. Econ. Sociol. 77, 1113–1152 (2018).

    Article  Google Scholar 

  134. van Ryn, M. Research on the provider contribution to race/ethnicity disparities in medical care. Med. Care 40, I140–I151 (2002).

  135. George, S., Ragin, C. & Ashing, K. T. Black is diverse: the untapped beauty and benefit of cancer genomics and precision medicine. JCO Oncol. Pract. 17, 279–283 (2021).

  136. Campbell, M. C. & Tishkoff, S. A. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genomics Hum. Genet. 9, 403–433 (2008).

  137. Bonham, V. L., Green, E. D. & Pérez-Stable, E. J. Examining how race, ethnicity, and ancestry data are used in biomedical research. JAMA 320, 1533–1534 (2018).

  138. Daneshjou, R. et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8, eabq6147 (2022).

  139. Zou, J., Gichoya, J. W., Ho, D. E. & Obermeyer, Z. Implications of predicting race variables from medical images. Science 381, 149–150 (2023).

  140. Chen, I. Y., Johansson, F. D. & Sontag, D. Why is my classifier discriminatory? in Advances in Neural Information Processing Systems Vol. 31 (Curran Associates, 2018).

  141. Puyol-Antón, E. et al. Fairness in cardiac magnetic resonance imaging: assessing sex and racial bias in deep learning-based segmentation. Front. Cardiovasc. Med. 9, 859310 (2022).

  142. US Food and Drug Administration. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD). www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper.pdf (2019).

  143. Gaube, S. et al. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ Digit. Med. 4, 31 (2021).

  144. Zhu, S., Gilbert, M., Chetty, I. & Siddiqui, F. The 2021 landscape of FDA-approved artificial intelligence/machine learning-enabled medical devices: an analysis of the characteristics and intended use. Int. J. Med. Inform. 165, 104828 (2022).

  145. Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit. Health 2, e537–e548 (2020).

  146. Sounderajah, V. et al. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open 11, e047709 (2021).

  147. Lipkova, J. et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 40, 1095–1110 (2022).

  148. Lipkova, J. et al. Deep learning-enabled assessment of cardiac allograft rejection from endomyocardial biopsies. Nat. Med. 28, 575–582 (2022).

  149. Smith, B., Hermsen, M., Lesser, E., Ravichandar, D. & Kremers, W. Developing image analysis pipelines of whole-slide images: pre- and post-processing. J. Clin. Transl. Sci. 5, e38 (2020).

  150. Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. in Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 9992–10002 (IEEE, 2021).

  151. Chen, X., Xie, S. & He, K. An empirical study of training self-supervised vision transformers. in Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE, 2021).

  152. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. in Proc. International Conference on Learning Representations (ICLR, 2021).

  153. Oquab, M. et al. DINOv2: learning robust visual features without supervision. Transactions on Machine Learning Research (2024).

  154. Dolezal, J. M. et al. Slideflow: deep learning for digital histopathology with real-time whole-slide visualization. Preprint at https://arXiv.org/abs/2304.04142 (2023).

  155. Kriegsmann, M. et al. Deep learning for the classification of small-cell and non-small-cell lung cancer. Cancers (Basel) 12, 1604 (2020).

  156. Janßen, C. et al. Multimodal lung cancer subtyping using deep learning neural networks on whole slide tissue images and MALDI MSI. Cancers (Basel) 14, 6181 (2022).

  157. Celik, Y., Talo, M., Yildirim, O., Karabatak, M. & Acharya, U. R. Automated invasive ductal carcinoma detection based using deep transfer learning with whole-slide images. Pattern Recognit. Lett. 133, 232–239 (2020).

  158. Han, Z. et al. Breast cancer multi-classification from histopathological images with structured deep learning model. Sci. Rep. 7, 4172 (2017).

  159. Srikantamurthy, M. M., Rallabandi, V. P. S., Dudekula, D. B., Natarajan, S. & Park, J. Classification of benign and malignant subtypes of breast cancer histopathology imaging using hybrid CNN-LSTM based transfer learning. BMC Med. Imaging 23, 19 (2023).

  160. Xiong, Y. et al. Nyströmformer: a Nyström-based algorithm for approximating self-attention. in Proc. AAAI Conference on Artificial Intelligence Vol. 35 14138–14148 (Association for the Advancement of Artificial Intelligence, 2021).

  161. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. in Proc. International Conference on Learning Representations (ICLR, 2019).

  162. Berrada, L., Zisserman, A. & Kumar, M. P. Smooth loss functions for deep top-k classification. in Proc. 6th International Conference on Learning Representations (ICLR, 2018).

  163. Jiang, H. & Nachum, O. Identifying and correcting label bias in machine learning. in Proc. 23rd International Conference on Artificial Intelligence and Statistics Vol. 108 702–712 (PMLR, 2020).

  164. Chai, X. et al. Unsupervised domain adaptation techniques based on auto-encoder for non-stationary EEG-based emotion recognition. Comput. Biol. Med. 79, 205–214 (2016).

  165. Fang, T., Lu, N., Niu, G. & Sugiyama, M. Rethinking importance weighting for deep learning under distribution shift. in Advances in Neural Information Processing Systems Vol. 33 (eds. Larochelle, H. et al.) 11996–12007 (Curran Associates, 2020).

  166. Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).

  167. Ruopp, M. D., Perkins, N. J., Whitcomb, B. W. & Schisterman, E. F. Youden index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom. J. 50, 419–430 (2008).

  168. Wu, E. et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat. Med. 27, 582–584 (2021).

  169. American Cancer Society. Key statistics for breast cancer—how common is breast cancer? www.cancer.org/cancer/types/breast-cancer/about/how-common-is-breast-cancer.html (2024).

  170. American Cancer Society. Key statistics for lung cancer—how common is lung cancer? www.cancer.org/cancer/types/lung-cancer/about/key-statistics.html (2024).

  171. Kim, M. et al. Glioblastoma as an age-related neurological disorder in adults. Neurooncol. Adv. 3, vdab125 (2021).

  172. Cao, J., Yan, W., Zhan, Z., Hong, X. & Yan, H. Epidemiology and risk stratification of low-grade gliomas in the United States, 2004–2019: a competing-risk regression model for survival analysis. Front. Oncol. 13, 1079597 (2023).

  173. scikit-learn developers. 1.1. Linear models. scikit-learn.org/stable/modules/linear_model.html (2022).

  174. Phipson, B. & Smyth, G. K. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. https://doi.org/10.2202/1544-6115.1585 (2010).

  175. Ernst, M. D. Permutation methods: a basis for exact inference. Stat. Sci. 19, 676–685 (2004).

  176. Fisher, R. The Design of Experiments Vol. 6 (Hafner, 1951).

  177. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57, 289–300 (1995).

  178. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  179. Lane, D. M. Confidence Interval on Pearson’s Correlation (Rice Univ., 2018); onlinestatbook.com/2/estimation/correlation_ci.html

  180. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. in Advances in Neural Information Processing Systems Vol. 32 (eds. Wallach, H. et al.) 8026–8037 (Curran Associates, 2019).

  181. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

Acknowledgements

This work was supported in part by the Brigham and Women’s Hospital (BWH) President’s Fund, BWH and Massachusetts General Hospital Pathology, and National Institute of General Medical Sciences R35GM138216 (to F.M.). R.J.C. was supported by the National Science Foundation Graduate Fellowship. Y.Y. was supported by the Takeda Fellowship. M.Y.L. was supported by the Siebel Scholars program. D.F.K.W. was supported by the National Institutes of Health/National Cancer Institute Ruth L. Kirschstein National Service Award (T32CA251062). The content is solely the responsibility of the authors and does not reflect the official views of the funding sources.

Author information

Contributions

A.V., R.J.C., and F.M. conceived the study. All authors designed the experiments. A.V., R.J.C., M.Y.L., D.F.K.W., T.Y.C., J.L. and M.S. performed data collection and cleaning. A.V. and R.J.C. conducted the experimental analysis with assistance from all coauthors. D.F.K.W. analyzed the misclassified cases. A.V., D.F.K.W., R.J.C., A.H.S., G.J., T.H., Y.Y., E.C.D. and F.M. prepared the paper with input from all coauthors. F.M. supervised the research.

Corresponding author

Correspondence to Faisal Mahmood.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Jakob Kather and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Effects of data processing strategies on disparities in breast subtyping.

Race stratified subtyping ROC curves and true positive rate disparity for ABMIL models for breast subtyping trained on TCGA-BRCA (n = 1,049 slides) and tested on MGB-breast (n = 1,265 slides) with: A-D, ResNet50IN patch encoder; E-H, CTransPath patch encoder; I-L, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, E, I); (ii) 10-fold site-preserving splits (B, F, J); (iii) 10-fold site-preserving splits and stain-normalized features (C, G, K); (iv) with stain normalization and site-preserving folds, ABMIL is tested on unbiased test cohorts (1,000 white, 1,000 Black, and 1,000 Asian slides, with 500 slides per subtype for each race) (D, H, L). ROC curves show the mean curve (n = 10 folds for site-stratified splits and n = 20 folds for Monte Carlo splits) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds for site-stratified splits and n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values from a non-parametric two-sided paired permutation test after multiple hypothesis correction are presented. Demographic distributions for each task are in Supplementary Data Table 2.
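These captions repeatedly report the true positive rate (TPR) disparity between demographic groups on each fold. As a minimal sketch of how such a per-group metric can be computed (the function names and group encoding are illustrative, not the paper's code):

```python
import numpy as np

def tpr(y_true, y_pred):
    # True positive rate: fraction of positive cases the model flags correctly.
    pos = y_true == 1
    return float((y_pred[pos] == 1).mean())

def tpr_disparity(y_true, y_pred, groups):
    # Largest gap in per-group TPR (e.g. between race groups on one fold).
    rates = {g: tpr(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)}
    return max(rates.values()) - min(rates.values()), rates

# Toy example: group "B" positives are detected half as often as group "A".
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
gap, rates = tpr_disparity(y_true, y_pred, groups)  # gap = 0.5
```

Each dot in the box plots would correspond to one such `gap` value from one trained fold.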

Extended Data Fig. 2 Effects of data processing strategies on disparities in lung subtyping.

Race stratified subtyping ROC curves and true positive rate disparity for ABMIL models for lung subtyping trained on TCGA-lung (n = 1,043 slides) and tested on MGB-lung (n = 1,960 slides) with: A-D, ResNet50IN patch encoder; E-H, CTransPath patch encoder; I-L, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, E, I); (ii) 10-fold site-preserving splits (B, F, J); (iii) 10-fold site-preserving splits and stain-normalized features (C, G, K); (iv) with stain normalization and site-preserving folds, ABMIL is tested on unbiased test cohorts (1,000 white, 1,000 Black, and 1,000 Asian slides, with 500 slides per subtype for each race) (D, H, L). ROC curves show the mean curve (n = 10 folds for site-stratified splits and n = 20 folds for Monte Carlo splits) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds for site-stratified splits and n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values from a non-parametric two-sided paired permutation test after multiple hypothesis correction are presented. Demographic distributions for each task are in Supplementary Data Table 3.

Extended Data Fig. 3 Effects of data processing strategies on disparities in IDH1 mutation prediction.

Race stratified ROC curves and true positive rate disparity for ABMIL models for IDH1 mutation prediction trained on the EBRAINS brain tumor atlas (n = 873 slides) and tested on the TCGA-GBMLGG cohort (n = 1,123 slides) with: A-C, ResNet50IN patch encoder; D-F, CTransPath patch encoder; G-I, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, D, G); (ii) ABMIL trained using stain-normalized features (B, E, H); (iii) with stain normalization, ABMIL is tested on unbiased test cohorts (1,000 white, 1,000 Black, and 1,000 Asian, with 500 slides per class for each race) (C, F, I). ROC curves show the mean curve (n = 20 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values from a non-parametric two-sided paired permutation test after multiple hypothesis correction are presented. Demographic distributions for each task are in Supplementary Data Table 4.
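The captions describe p values from a non-parametric two-sided paired permutation test over the per-fold models. A hedged sketch of one common form of such a test — sign-flipping the paired per-fold differences, with the "+1" correction recommended by Phipson & Smyth (ref. 174) so that p values are never exactly zero; the paper's exact implementation may differ:

```python
import numpy as np

def paired_permutation_pvalue(a, b, n_perm=10_000, seed=0):
    # Two-sided paired permutation test: under the null hypothesis the sign
    # of each per-fold difference is exchangeable, so we randomly flip signs
    # and compare the permuted |mean difference| with the observed one.
    rng = np.random.default_rng(seed)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    # "+1" in numerator and denominator keeps the p value away from zero.
    return (1 + int((null >= observed).sum())) / (n_perm + 1)
```

Here `a` and `b` might be, for example, per-fold TPRs for two race groups; multiple hypothesis correction (e.g. Benjamini-Hochberg, ref. 177) would then be applied across the resulting p values.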

Extended Data Fig. 4 Effect of stain normalization on disparities.

Race stratified ROC curves and true positive rate disparity for ABMIL models trained in a 20-fold study with ResNet50IN and UNI patch encoders and Macenko stain normalization for A. breast subtyping B. lung subtyping C. IDH1 mutation prediction. ABMIL was trained on the TCGA-BRCA (n = 1,049 slides) and TCGA-lung (n = 1,043 slides) cohorts for breast and lung subtyping and tested on the resampled MGB-breast and MGB-lung cohorts, respectively. For IDH1 mutation prediction, ABMIL was trained on EBRAINS (n = 873 slides) and tested on resampled TCGA-GBMLGG. All unbiased test cohorts have 1,000 white, 1,000 Black, and 1,000 Asian slides, with 500 slides per class for each race. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values from a non-parametric two-sided paired permutation test after multiple hypothesis correction are presented. Demographic distributions for each task are in Supplementary Data Table 24.

Extended Data Fig. 5 Effect of pre-training dataset size on demographic disparities.

Race-stratified and overall ROC AUC for ABMIL models with patch encoders pre-trained on natural images and histology image datasets of varying sizes for: A. breast subtyping B. lung subtyping C. IDH1 mutation prediction. All models were trained on 20-fold Monte Carlo splits on TCGA-BRCA (n = 1,049 slides), TCGA-Lung (n = 1,043 slides), and EBRAINS brain tumor atlas (n = 873 slides) and tested on resampled MGB-breast, MGB-lung, and TCGA-GBMLGG (1,000 white, 1,000 Black, and 1,000 Asian slides, with 500 slides per class for each race) for breast subtyping, lung subtyping, and IDH1 mutation prediction, respectively. The number of images used for pre-training of each encoder is shown in brackets under the encoder name. Refer to Methods for details of each encoder. Error bars in bar plots indicate 95% CI, with the center being the mean value (n = 20 folds).

Extended Data Fig. 6 Demographic stratified performance of internal validation cohorts.

Race stratified breast and lung subtyping ROC curves and true positive rate disparity for ABMIL models trained and tested on: (A) TCGA-BRCA (B) TCGA-lung (C) MGB-breast (D) MGB-lung. To create training splits, 25 examples from each subtype were sampled 10 times to create 10 folds, and the rest of the data was used for validation. ABMIL with UNI patch encoder used. ROC curves show mean (n = 10 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P value from non-parametric two-sided paired permutation test after multiple hypothesis correction presented. Demographic distributions for each task in Supplementary Data Tables 2 and 3.

Extended Data Fig. 7 Investigating IDH1 mutation prediction disparities beyond race.

TPR disparity was assessed in various demographic subgroups of the TCGA-GBMLGG test cohort (n = 1,123 slides) for ABMIL model trained with UNI features on the EBRAINS brain tumor atlas cohort (n = 873 slides) in a 20-fold study for IDH1 mutation prediction. A. TPR disparity for different race groups. B. The TPR disparity is computed for white IDH1 wild-type (WT) and mutant (MT) patients (n = 983 slides), stratified by age. C. TPR disparity for different age groups (years). D. The TPR disparity is computed for IDH1 wild-type and mutant patients aged ≤40 (n = 303 slides), stratified by race. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained in one of the folds. P value from non-parametric two-sided paired permutation test after multiple hypothesis correction presented. Demographic distributions for the task in Supplementary Data Table 4.

Extended Data Fig. 8 Investigating breast subtyping disparities beyond race.

TPR disparity was assessed in various demographic subgroups of the MGB-breast test cohort (n = 1,265 slides) for the ABMIL model trained with the UNI patch encoder on the TCGA-BRCA cohort (n = 1,049 slides) in a 20-fold study for breast subtyping. A. TPR disparity for different postal-code-inferred income groups. (B-D) The TPR disparity is computed for subgroups of IDC and ILC patients from low-income postal codes (n = 407 slides), stratified by other demographic variables: B, racial groups; C, insurance groups; D, age groups (years). E. TPR disparity for different racial groups. (F-H) The TPR disparity is computed for subgroups of the white IDC and ILC patients (n = 904 samples), stratified by other demographic variables: F, insurance groups; G, income groups inferred from postal code; H, age groups. (I-K) The TPR disparity is computed for subgroups of the Black IDC and ILC patients (n = 164 samples), stratified by other demographic variables: I, insurance groups; J, income groups inferred from postal code; K, age groups (years). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values from a non-parametric two-sided paired permutation test after multiple hypothesis correction are presented. Demographic distributions for the task are in Supplementary Data Table 2.

Extended Data Fig. 9 Stain distributions by races for different scanners.

For both the MGB-breast and MGB-lung cohorts, we randomly sampled 50 slides per scanner and per race, segmented the tissue from background, and patched the tissue into 256 × 256 tiles at 20x magnification. We sampled 1,000 patches from each slide, converted them from RGB to HSV space, and calculated their average hue and saturation. We compare the distributions of hue and saturation by race for A. the overall MGB-breast cohort B. slides in the MGB-breast cohort scanned on the Aperio GT450 scanner C. slides in the MGB-breast cohort scanned on the Hamamatsu S210 scanner, and for D. the overall MGB-lung cohort E. slides in the MGB-lung cohort scanned on the Aperio GT450 scanner F. slides in the MGB-lung cohort scanned on the Hamamatsu S210 scanner. We do not find any statistically significant difference in the hues or saturations of whole slide images by race for either scanner or the overall category, as compared by two-sided non-parametric paired permutation tests. Boxes indicate quartile values of the metric shown on the respective axis (n = 50 whole slide images), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot represents a unique slide.
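The per-patch hue/saturation statistic described above can be sketched as follows (a simplified stand-in using the standard-library `colorsys` conversion; the paper's actual tiling and conversion code is not shown here):

```python
import colorsys

import numpy as np

def mean_hue_saturation(patch_rgb):
    # patch_rgb: H x W x 3 array of RGB values scaled to [0, 1].
    # Convert every pixel to HSV and average hue and saturation,
    # mirroring the per-patch statistics compared across races.
    flat = patch_rgb.reshape(-1, 3)
    hsv = np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in flat])
    return float(hsv[:, 0].mean()), float(hsv[:, 1].mean())

# A uniformly red patch has hue 0 and full saturation.
patch = np.zeros((4, 4, 3))
patch[..., 0] = 1.0
hue, sat = mean_hue_saturation(patch)  # (0.0, 1.0)
```

Per-slide averages of these values could then be compared across races with the same permutation test used elsewhere in the paper.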

Extended Data Fig. 10 Effect of resampling sample size on TPR disparities.

We trained ABMIL models with the UNI patch encoder on 20-fold Monte Carlo splits on TCGA-BRCA for breast subtyping, TCGA-lung for lung subtyping, and EBRAINS for IDH1 mutation prediction, and tested them on the original and resampled MGB-breast, MGB-lung, and TCGA-GBMLGG cohorts, respectively. We show different resampling variants of the test set (no resampling/original, 500 and 1,000 slides per class and per race) for A. breast subtyping B. lung subtyping C. IDH1 mutation prediction. Resampling is done for each disease class and race (see Methods for more details). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5x the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Demographic distributions for the task are in Supplementary Data Table 24.
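The resampling described above — drawing a fixed number of slides for every (disease class, race) cell of the test set — can be sketched roughly as follows (sampling with replacement; the helper and variable names are illustrative, not the paper's code):

```python
import numpy as np

def resample_balanced(labels, races, n_per_cell, seed=0):
    # Return slide indices such that every (class, race) cell contributes
    # exactly n_per_cell slides, sampled with replacement from that cell.
    rng = np.random.default_rng(seed)
    labels, races = np.asarray(labels), np.asarray(races)
    picks = []
    for c in np.unique(labels):
        for r in np.unique(races):
            cell = np.flatnonzero((labels == c) & (races == r))
            picks.append(rng.choice(cell, size=n_per_cell, replace=True))
    return np.concatenate(picks)

# Two classes x two races at 3 slides per cell -> 12 indices.
labels = ["IDC", "IDC", "ILC", "ILC", "IDC", "ILC"]
races = ["white", "Black", "white", "Black", "Black", "white"]
idx = resample_balanced(labels, races, n_per_cell=3)
```

With 500 or 1,000 slides per class and per race, this yields the "unbiased" test cohorts used throughout the extended data figures.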

Supplementary information

Supplementary Information

Supplementary Data Tables 1–44.

Reporting Summary

Supplementary Table 1

Analysis of FDA-approved algorithms: names of FDA-approved medical imaging algorithms, their approval number and year, the modality they are intended for, their risk group, and (1) whether the company reports the demographics of their test sets or (2) whether demographic-stratified metrics are reported on the test set. ‘1’ indicates that metrics are present or the approval documentation states that no differences were found by demographics. ‘0’ indicates that such metrics are not present.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Vaidya, A., Chen, R.J., Williamson, D.F.K. et al. Demographic bias in misdiagnosis by computational pathology models. Nat Med 30, 1174–1190 (2024). https://doi.org/10.1038/s41591-024-02885-z
