Abstract
Despite increasing numbers of regulatory approvals, deep learning-based computational pathology systems often overlook the impact of demographic factors on performance, potentially leading to biases. This concern is all the more important as computational pathology has leveraged large public datasets that underrepresent certain demographic groups. Using publicly available data from The Cancer Genome Atlas and the EBRAINS brain tumor atlas, as well as internal patient data, we show that whole-slide image classification models display marked performance disparities across different demographic groups when used to subtype breast and lung carcinomas and to predict IDH1 mutations in gliomas. For example, when using common modeling approaches, we observed performance gaps (in area under the receiver operating characteristic curve) between white and Black patients of 3.0% for breast cancer subtyping, 10.9% for lung cancer subtyping and 16.0% for IDH1 mutation prediction in gliomas. We found that richer feature representations obtained from self-supervised vision foundation models reduce performance variations between groups. These representations provide improvements upon weaker models even when those weaker models are combined with state-of-the-art bias mitigation strategies and modeling choices. Nevertheless, self-supervised vision foundation models do not fully eliminate these discrepancies, highlighting the continuing need for bias mitigation efforts in computational pathology. Finally, we demonstrate that our results extend to other demographic factors beyond patient race. Given these findings, we encourage regulatory and policy agencies to integrate demographic-stratified evaluation into their assessment guidelines.
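To make the demographic-stratified evaluation advocated above concrete, the following is a minimal sketch (with simulated labels and scores for illustration only; not code from this study) of computing per-group AUROC and the gap between two groups:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auc_gap(y_true, y_score, groups, group_a, group_b):
    """AUROC within each demographic group, plus the gap between two groups."""
    aucs = {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}
    return aucs, abs(aucs[group_a] - aucs[group_b])

# Simulated slide-level predictions, purely for illustration.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=600)
y_score = np.clip(0.3 * y_true + rng.normal(0.5, 0.25, size=600), 0.0, 1.0)
groups = rng.choice(np.array(["white", "Black", "Asian"]), size=600)
aucs, gap = stratified_auc_gap(y_true, y_score, groups, "white", "Black")
print(aucs, f"white-Black AUROC gap: {gap:.3f}")
```

Reporting such per-group metrics alongside the overall AUROC is the kind of stratified evaluation the abstract recommends for regulatory assessment.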
Data availability
Public data from TCGA, including digital histology and the clinical annotations used, are available at https://portal.gdc.cancer.gov/ and https://cbioportal.org. The EBRAINS brain tumor atlas can be accessed at https://search.kg.ebrains.eu/instances/Dataset/8fc108ab-e2b4-406f-8999-60269dc1f994. Restrictions apply to the availability of the in-house data, which were used with institutional permission for the current study and are thus not publicly available. We note that these data were not specifically collected for this study. All requests for data may be addressed to the corresponding author and will be promptly evaluated based on institutional and departmental policies to determine whether the data requested are subject to intellectual property or patient privacy obligations. Internal data can only be shared for noncommercial, academic purposes and will require a data user agreement.
Code availability
All code was implemented in Python using PyTorch as the primary deep learning package. Code and scripts to reproduce the training experiments of this paper are available at https://github.com/mahmoodlab/CPATH_demographics.
Acknowledgements
This work was supported in part by the Brigham and Women’s Hospital (BWH) President’s Fund, BWH and Massachusetts General Hospital Pathology, and National Institute of General Medical Sciences R35GM138216 (to F.M.). R.J.C. was supported by the National Science Foundation Graduate Fellowship. Y.Y. was supported by the Takeda Fellowship. M.Y.L. was supported by the Siebel Scholars program. D.F.K.W. was supported by the National Institutes of Health/National Cancer Institute Ruth L. Kirschstein National Service Award (T32CA251062). The content is solely the responsibility of the authors and does not reflect the official views of the funding sources.
Author information
Authors and Affiliations
Contributions
A.V., R.J.C. and F.M. conceived the study. All authors designed the experiments. A.V., R.J.C., M.Y.L., D.F.K.W., T.Y.C., J.L. and M.S. performed data collection and cleaning. A.V. and R.J.C. conducted the experimental analysis with assistance from all coauthors. D.F.K.W. analyzed the misclassified cases. A.V., D.F.K.W., R.J.C., A.H.S., G.J., T.H., Y.Y., E.C.D. and F.M. prepared the paper with input from all coauthors. F.M. supervised the research.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks Jakob Kather and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Effects of data processing strategies on disparities in breast subtyping.
Race-stratified subtyping ROC curves and true positive rate (TPR) disparity for ABMIL breast subtyping models trained on TCGA-BRCA (n = 1,049 slides) and tested on MGB-breast (n = 1,265 slides) with: A-D, ResNet50IN patch encoder; E-H, CTransPath patch encoder; I-L, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, E, I); (ii) 10-fold site-preserving splits (B, F, J); (iii) 10-fold site-preserving splits with stain-normalized features (C, G, K); (iv) with stain normalization and site-preserving folds, testing on unbiased test cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per subtype for each race) (D, H, L). ROC curves show the mean curve (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Table 2.
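The TPR disparity plotted in these box plots compares subgroup true positive rates at a fixed operating point. A minimal sketch of one common formulation (each group's TPR minus the overall TPR; the paper's Methods give the exact definition used) is:

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate (sensitivity) from binary labels and predictions."""
    pos = y_true == 1
    return (y_pred[pos] == 1).mean() if pos.any() else np.nan

def tpr_disparity(y_true, y_pred, groups):
    """Per-group TPR minus the overall TPR; values near zero for every
    group indicate equal-opportunity-style parity."""
    overall = tpr(y_true, y_pred)
    return {g: tpr(y_true[groups == g], y_pred[groups == g]) - overall
            for g in np.unique(groups)}
```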
Extended Data Fig. 2 Effects of data processing strategies on disparities in lung subtyping.
Race-stratified subtyping ROC curves and true positive rate (TPR) disparity for ABMIL lung subtyping models trained on TCGA-lung (n = 1,043 slides) and tested on MGB-lung (n = 1,960 slides) with: A-D, ResNet50IN patch encoder; E-H, CTransPath patch encoder; I-L, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, E, I); (ii) 10-fold site-preserving splits (B, F, J); (iii) 10-fold site-preserving splits with stain-normalized features (C, G, K); (iv) with stain normalization and site-preserving folds, testing on unbiased test cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per subtype for each race) (D, H, L). ROC curves show the mean curve (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Table 3.
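All of the models in these panels are ABMIL classifiers that pool a bag of patch features into one slide-level prediction. The following is a minimal PyTorch sketch of gated attention-based MIL pooling in the style of Ilse et al. (2018); it is an illustrative re-implementation, and the dimensions and hyperparameters are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention pooling over patch features, then a linear classifier."""

    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, h):                                  # h: (n_patches, in_dim)
        a = self.attn_w(self.attn_v(h) * self.attn_u(h))   # (n_patches, 1)
        a = torch.softmax(a, dim=0)                        # attention over patches
        slide_embedding = (a * h).sum(dim=0)               # weighted average
        return self.classifier(slide_embedding), a

# One slide = one bag of patch features (for example, UNI features are 1,024-d).
feats = torch.randn(5000, 1024)
logits, attention = GatedAttentionMIL()(feats)
```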
Extended Data Fig. 3 Effects of data processing strategies on disparities in IDH1 mutation prediction.
Race-stratified ROC curves and true positive rate (TPR) disparity for ABMIL IDH1 mutation prediction models trained on the EBRAINS brain tumor atlas (n = 873 slides) and tested on the TCGA-GBMLGG cohort (n = 1,123 slides) with: A-C, ResNet50IN patch encoder; D-F, CTransPath patch encoder; G-I, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, D, G); (ii) training with stain-normalized features (B, E, H); (iii) with stain normalization, testing on unbiased test cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per class for each race) (C, F, I). ROC curves show the mean curve (n = 20 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Table 4.
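The P values reported throughout these captions come from non-parametric two-sided paired permutation tests with multiple-hypothesis correction. A minimal sketch of such a test on paired per-fold metrics, using a sign-flip null with the never-zero estimator of Phipson and Smyth (2010) and Benjamini-Hochberg adjustment (an assumed combination; the paper's Methods specify the exact procedure), is:

```python
import numpy as np

def paired_permutation_pvalue(metric_a, metric_b, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-fold metrics,
    with the (b + 1) / (m + 1) estimator so P is never exactly zero."""
    rng = np.random.default_rng(seed)
    d = np.asarray(metric_a) - np.asarray(metric_b)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values for multiple comparisons."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(p)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```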
Extended Data Fig. 4 Effect of stain normalization on disparities.
Race-stratified ROC curves and true positive rate (TPR) disparity for ABMIL models trained in a 20-fold study with ResNet50IN and UNI patch encoders and Macenko stain normalization for: A, breast subtyping; B, lung subtyping; C, IDH1 mutation prediction. For breast and lung subtyping, ABMIL was trained on the TCGA-BRCA (n = 1,049 slides) and TCGA-lung (n = 1,043 slides) cohorts and tested on the resampled MGB-breast and MGB-lung cohorts, respectively. For IDH1 mutation prediction, ABMIL was trained on the EBRAINS brain tumor atlas (n = 873 slides) and tested on the resampled TCGA-GBMLGG cohort. All unbiased test cohorts have 1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per class for each race. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Tables 2–4.
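For reference, Macenko normalization estimates each image's two dominant stain vectors in optical-density (OD) space and remaps its stain concentrations onto a fixed reference. A self-contained NumPy sketch with conventional default parameters and reference values (illustrative only; not necessarily the implementation or settings used in this study) is:

```python
import numpy as np

def macenko_normalize(img, Io=240, alpha=1, beta=0.15):
    """Normalize an RGB H&E tile (uint8, shape (h, w, 3)) onto reference
    stain vectors, following Macenko et al. (2009)."""
    he_ref = np.array([[0.5626, 0.2159],    # conventional reference hematoxylin/
                       [0.7201, 0.8012],    # eosin stain vectors in OD space
                       [0.4062, 0.5581]])
    max_c_ref = np.array([1.9705, 1.0308])  # reference maximum concentrations
    h, w, _ = img.shape
    od = -np.log((img.reshape(-1, 3).astype(float) + 1.0) / Io)
    od_hat = od[~np.any(od < beta, axis=1)]            # drop near-transparent pixels
    _, eigvecs = np.linalg.eigh(np.cov(od_hat.T))
    plane = eigvecs[:, 1:3]                            # top-2 eigenvector plane
    phi = np.arctan2(od_hat @ plane[:, 1], od_hat @ plane[:, 0])
    v1 = plane @ np.array([np.cos(np.percentile(phi, alpha)),
                           np.sin(np.percentile(phi, alpha))])
    v2 = plane @ np.array([np.cos(np.percentile(phi, 100 - alpha)),
                           np.sin(np.percentile(phi, 100 - alpha))])
    he = np.array([v1, v2]).T if v1[0] > v2[0] else np.array([v2, v1]).T
    conc = np.linalg.lstsq(he, od.T, rcond=None)[0]    # stain concentrations (2, N)
    conc *= (max_c_ref / np.percentile(conc, 99, axis=1))[:, None]
    out = Io * np.exp(-he_ref @ conc)                  # reconstruct with reference stains
    return np.clip(out.T.reshape(h, w, 3), 0, 255).astype(np.uint8)
```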
Extended Data Fig. 5 Effect of pre-training dataset size on demographic disparities.
Race-stratified and overall ROC AUC for ABMIL models with patch encoders pre-trained on natural-image and histology datasets of varying sizes for: A, breast subtyping; B, lung subtyping; C, IDH1 mutation prediction. All models were trained on 20-fold Monte Carlo splits of TCGA-BRCA (n = 1,049 slides), TCGA-lung (n = 1,043 slides) and the EBRAINS brain tumor atlas (n = 873 slides), and tested on resampled MGB-breast, MGB-lung and TCGA-GBMLGG cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per class for each race) for breast subtyping, lung subtyping and IDH1 mutation prediction, respectively. The number of images used to pre-train each encoder is shown in brackets under the encoder name; refer to Methods for details of each encoder. Error bars in bar plots indicate 95% CI, with the center being the mean value (n = 20 folds).
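Each patch encoder above maps tissue tiles to feature vectors that ABMIL then pools. A minimal sketch of patch embedding with an ImageNet-pretrained ResNet-50 loaded via timm (the natural-image baseline; CTransPath and UNI would be substituted with their own pretrained weights and preprocessing) is:

```python
import torch
import timm
from torchvision import transforms

# num_classes=0 makes timm return globally pooled features (2,048-d for ResNet-50).
encoder = timm.create_model("resnet50", pretrained=True, num_classes=0).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),   # ImageNet statistics
])

@torch.no_grad()
def embed_patches(patches):
    """patches: list of 256 x 256 RGB PIL images tiled from one slide."""
    batch = torch.stack([preprocess(p) for p in patches])
    return encoder(batch)   # (n_patches, 2048) bag of features for ABMIL
```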
Extended Data Fig. 6 Demographic stratified performance of internal validation cohorts.
Race-stratified breast and lung subtyping ROC curves and true positive rate (TPR) disparity for ABMIL models trained and tested on: (A) TCGA-BRCA; (B) TCGA-lung; (C) MGB-breast; (D) MGB-lung. To create training splits, 25 examples from each subtype were sampled 10 times to create 10 folds, with the remaining data used for validation. ABMIL with the UNI patch encoder was used. ROC curves show the mean curve (n = 10 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Tables 2 and 3.
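A minimal sketch of the fold construction described in this caption (25 training examples per subtype drawn 10 times, with the remainder held out; the seed and sampling details are assumptions, not the authors' exact code):

```python
import numpy as np

def make_few_shot_folds(labels, n_per_class=25, n_folds=10, seed=0):
    """For each fold, sample n_per_class training cases per subtype without
    replacement; all remaining cases form that fold's validation set."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = []
    for _ in range(n_folds):
        train_idx = np.concatenate([
            rng.choice(np.where(labels == c)[0], size=n_per_class, replace=False)
            for c in np.unique(labels)])
        val_idx = np.setdiff1d(np.arange(labels.size), train_idx)
        folds.append((train_idx, val_idx))
    return folds
```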
Extended Data Fig. 7 Investigating IDH1 mutation prediction disparities beyond race.
TPR disparity was assessed in various demographic subgroups of the TCGA-GBMLGG test cohort (n = 1,123 slides) for an ABMIL model trained with UNI features on the EBRAINS brain tumor atlas cohort (n = 873 slides) in a 20-fold study for IDH1 mutation prediction. A, TPR disparity for different race groups. B, TPR disparity for white IDH1 wild-type (WT) and mutant (MT) patients (n = 983 slides), stratified by age. C, TPR disparity for different age groups (years). D, TPR disparity for IDH1 WT and MT patients aged 40 years or younger (n = 303 slides), stratified by race. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained in one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for the task are given in Supplementary Data Table 4.
Extended Data Fig. 8 Investigating breast subtyping disparities beyond race.
TPR disparity was assessed in various demographic subgroups of the MGB-breast test cohort (n = 1,265 slides) for an ABMIL model trained with the UNI patch encoder on the TCGA-BRCA cohort (n = 1,049 slides) in a 20-fold study for breast subtyping. A, TPR disparity for different income groups inferred from postal code. B-D, TPR disparity for subgroups of invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) patients from low-income postal codes (n = 407 slides), stratified by other demographic variables: B, racial groups; C, insurance groups; D, age groups (years). E, TPR disparity for different racial groups. F-H, TPR disparity for subgroups of white IDC and ILC patients (n = 904 slides), stratified by other demographic variables: F, insurance groups; G, income groups inferred from postal code; H, age groups (years). I-K, TPR disparity for subgroups of Black IDC and ILC patients (n = 164 slides), stratified by other demographic variables: I, insurance groups; J, income groups inferred from postal code; K, age groups (years). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for the task are given in Supplementary Data Table 2.
Extended Data Fig. 9 Stain distributions by races for different scanners.
For both the MGB-breast and MGB-lung cohorts, we randomly sampled 50 slides per scanner and per race, segmented the tissue from background, and patched the tissue into 256 × 256 tiles at 20× magnification. We sampled 1,000 patches from each slide, converted them from RGB to HSV space, and calculated their average hue and saturation. We compare the distributions of hue and saturation by race for: A, the overall MGB-breast cohort; B, slides in the MGB-breast cohort scanned on the Aperio GT450 scanner; C, slides in the MGB-breast cohort scanned on the Hamamatsu S210 scanner; D, the overall MGB-lung cohort; E, slides in the MGB-lung cohort scanned on the Aperio GT450 scanner; F, slides in the MGB-lung cohort scanned on the Hamamatsu S210 scanner. We do not find any statistically significant difference in hue or saturation by race for either scanner or for the overall cohorts, as assessed by two-sided non-parametric paired permutation tests. Boxes indicate quartile values of the metric shown on the respective axis (n = 50 whole-slide images), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot represents a unique slide.
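A minimal sketch of the per-slide color summary described above, assuming patches have already been extracted as RGB arrays (note that a plain mean of hue ignores its circularity, which is acceptable for H&E pinks and purples that sit away from the hue wrap-around point):

```python
import numpy as np
from skimage.color import rgb2hsv

def mean_hue_saturation(patches):
    """Average hue and saturation over a slide's sampled patches.
    patches: iterable of (256, 256, 3) RGB uint8 arrays."""
    vals = []
    for p in patches:
        hsv = rgb2hsv(p.astype(float) / 255.0)  # channels: hue, saturation, value
        vals.append([hsv[..., 0].mean(), hsv[..., 1].mean()])
    return np.mean(vals, axis=0)                # (mean hue, mean saturation) in [0, 1]
```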
Extended Data Fig. 10 Effect of resampling sample size on TPR disparities.
We trained ABMIL models with the UNI patch encoder on 20-fold Monte Carlo splits of TCGA-BRCA (breast subtyping), TCGA-lung (lung subtyping) and the EBRAINS brain tumor atlas (IDH1 mutation prediction), and tested them on the original and resampled MGB-breast, MGB-lung and TCGA-GBMLGG cohorts, respectively. We show different resampling variants of the test set (no resampling/original; 500 slides per class and per race; 1,000 slides per class and per race) for: A, breast subtyping; B, lung subtyping; C, IDH1 mutation prediction. Resampling is done separately for each disease class and race (see Methods for details). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Demographic distributions for each task are given in Supplementary Data Tables 2–4.
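A minimal sketch of building such a resampled test cohort by drawing with replacement from every race-by-class cell (the 'race' and 'label' column names are hypothetical placeholders, not the authors' schema):

```python
import pandas as pd

def resample_unbiased_cohort(df, n_per_cell=500,
                             races=("white", "Black", "Asian"), seed=0):
    """Equal-representation test cohort: n_per_cell slides sampled with
    replacement for every (race, disease class) combination."""
    cells = [df[(df["race"] == r) & (df["label"] == c)]
             .sample(n=n_per_cell, replace=True, random_state=seed)
             for r in races for c in sorted(df["label"].unique())]
    return pd.concat(cells, ignore_index=True)
```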
Supplementary information
Supplementary Information
Supplementary Data Tables 1–44.
Supplementary Table 1
Analysis of FDA-approved algorithms: names of FDA-approved medical imaging algorithms, their approval number and year, the modality they are intended for, their risk group, and whether (1) the company reports the demographics of its test sets and (2) demographic-stratified metrics are reported on the test set. '1' indicates that the metrics are present or that the approval documentation states no differences were found by demographics; '0' indicates that such metrics are not present.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vaidya, A., Chen, R.J., Williamson, D.F.K. et al. Demographic bias in misdiagnosis by computational pathology models. Nat Med 30, 1174–1190 (2024). https://doi.org/10.1038/s41591-024-02885-z
This article is cited by
- Using unlabeled data to enhance fairness of medical AI. Nature Medicine (2024).