Abstract
Despite increasing numbers of regulatory approvals, deep learning-based computational pathology systems often overlook the impact of demographic factors on performance, potentially leading to biases. This concern is all the more important as computational pathology has leveraged large public datasets that underrepresent certain demographic groups. Using publicly available data from The Cancer Genome Atlas and the EBRAINS brain tumor atlas, as well as internal patient data, we show that whole-slide image classification models display marked performance disparities across different demographic groups when used to subtype breast and lung carcinomas and to predict IDH1 mutations in gliomas. For example, when using common modeling approaches, we observed performance gaps (in area under the receiver operating characteristic curve) between white and Black patients of 3.0% for breast cancer subtyping, 10.9% for lung cancer subtyping and 16.0% for IDH1 mutation prediction in gliomas. We found that richer feature representations obtained from self-supervised vision foundation models reduce performance variations between groups. These representations provide improvements upon weaker models even when those weaker models are combined with state-of-the-art bias mitigation strategies and modeling choices. Nevertheless, self-supervised vision foundation models do not fully eliminate these discrepancies, highlighting the continuing need for bias mitigation efforts in computational pathology. Finally, we demonstrate that our results extend to other demographic factors beyond patient race. Given these findings, we encourage regulatory and policy agencies to integrate demographic-stratified evaluation into their assessment guidelines.
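To make the demographic-stratified evaluation advocated above concrete, the following is a minimal sketch (with simulated labels and scores for illustration only; not code from this study) of computing per-group AUROC and the gap between two groups:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auc_gap(y_true, y_score, groups, group_a, group_b):
    """AUROC within each demographic group, plus the gap between two groups."""
    aucs = {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}
    return aucs, abs(aucs[group_a] - aucs[group_b])

# Simulated slide-level predictions, purely for illustration.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=600)
y_score = np.clip(0.3 * y_true + rng.normal(0.5, 0.25, size=600), 0.0, 1.0)
groups = rng.choice(np.array(["white", "Black", "Asian"]), size=600)
aucs, gap = stratified_auc_gap(y_true, y_score, groups, "white", "Black")
print(aucs, f"white-Black AUROC gap: {gap:.3f}")
```

Reporting such per-group metrics alongside the overall AUROC is the kind of stratified evaluation the abstract recommends for regulatory assessment.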
Data availability
Public data from TCGA, including digital histology and the clinical annotations used, are available at https://portal.gdc.cancer.gov/ and https://cbioportal.org. The EBRAINS brain tumor atlas can be accessed at https://search.kg.ebrains.eu/instances/Dataset/8fc108ab-e2b4-406f-8999-60269dc1f994. Restrictions apply to the availability of the in-house data, which were used with institutional permission for the current study and are thus not publicly available. We note that these data were not specifically collected for this study. All requests for data may be addressed to the corresponding author and will be promptly evaluated based on institutional and departmental policies to determine whether the data requested are subject to intellectual property or patient privacy obligations. Internal data can only be shared for noncommercial, academic purposes and will require a data user agreement.
Code availability
All code was implemented in Python using PyTorch as the primary deep learning package. Code and scripts to reproduce the training experiments of this paper are available at https://github.com/mahmoodlab/CPATH_demographics.
Acknowledgements
This work was supported in part by the Brigham and Women’s Hospital (BWH) President’s Fund, BWH and Massachusetts General Hospital Pathology, and National Institute of General Medical Sciences R35GM138216 (to F.M.). R.J.C. was supported by the National Science Foundation Graduate Fellowship. Y.Y. was supported by the Takeda Fellowship. M.Y.L. was supported by the Siebel Scholars program. D.F.K.W. was supported by the National Institutes of Health/National Cancer Institute Ruth L. Kirschstein National Service Award (T32CA251062). The content is solely the responsibility of the authors and does not reflect the official views of the funding sources.
Author information
Authors and Affiliations
Contributions
A.V., R.J.C. and F.M. conceived the study. All authors designed the experiments. A.V., R.J.C., M.Y.L., D.F.K.W., T.Y.C., J.L. and M.S. performed data collection and cleaning. A.V. and R.J.C. conducted the experimental analysis with assistance from all coauthors. D.F.K.W. analyzed the misclassified cases. A.V., D.F.K.W., R.J.C., A.H.S., G.J., T.H., Y.Y., E.C.D. and F.M. prepared the paper with input from all coauthors. F.M. supervised the research.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Medicine thanks Jakob Kather and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Effects of data processing strategies on disparities in breast subtyping.
Race-stratified subtyping ROC curves and true positive rate (TPR) disparity for ABMIL breast subtyping models trained on TCGA-BRCA (n = 1,049 slides) and tested on MGB-breast (n = 1,265 slides) with: A-D, ResNet50IN patch encoder; E-H, CTransPath patch encoder; I-L, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, E, I); (ii) 10-fold site-preserving splits (B, F, J); (iii) 10-fold site-preserving splits with stain-normalized features (C, G, K); (iv) with stain normalization and site-preserving folds, testing on unbiased test cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per subtype for each race) (D, H, L). ROC curves show the mean curve (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Table 2.
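The TPR disparity plotted in these box plots compares subgroup true positive rates at a fixed operating point. A minimal sketch of one common formulation (each group's TPR minus the overall TPR; the paper's Methods give the exact definition used) is:

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate (sensitivity) from binary labels and predictions."""
    pos = y_true == 1
    return (y_pred[pos] == 1).mean() if pos.any() else np.nan

def tpr_disparity(y_true, y_pred, groups):
    """Per-group TPR minus the overall TPR; values near zero for every
    group indicate equal-opportunity-style parity."""
    overall = tpr(y_true, y_pred)
    return {g: tpr(y_true[groups == g], y_pred[groups == g]) - overall
            for g in np.unique(groups)}
```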
Extended Data Fig. 2 Effects of data processing strategies on disparities in lung subtyping.
Race-stratified subtyping ROC curves and true positive rate (TPR) disparity for ABMIL lung subtyping models trained on TCGA-lung (n = 1,043 slides) and tested on MGB-lung (n = 1,960 slides) with: A-D, ResNet50IN patch encoder; E-H, CTransPath patch encoder; I-L, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, E, I); (ii) 10-fold site-preserving splits (B, F, J); (iii) 10-fold site-preserving splits with stain-normalized features (C, G, K); (iv) with stain normalization and site-preserving folds, testing on unbiased test cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per subtype for each race) (D, H, L). ROC curves show the mean curve (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds for site-stratified splits; n = 20 folds for Monte Carlo splits), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Table 3.
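All of the models in these panels are ABMIL classifiers that pool a bag of patch features into one slide-level prediction. The following is a minimal PyTorch sketch of gated attention-based MIL pooling in the style of Ilse et al. (2018); it is an illustrative re-implementation, and the dimensions and hyperparameters are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention pooling over patch features, then a linear classifier."""

    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, h):                                  # h: (n_patches, in_dim)
        a = self.attn_w(self.attn_v(h) * self.attn_u(h))   # (n_patches, 1)
        a = torch.softmax(a, dim=0)                        # attention over patches
        slide_embedding = (a * h).sum(dim=0)               # weighted average
        return self.classifier(slide_embedding), a

# One slide = one bag of patch features (for example, UNI features are 1,024-d).
feats = torch.randn(5000, 1024)
logits, attention = GatedAttentionMIL()(feats)
```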
Extended Data Fig. 3 Effects of data processing strategies on disparities in IDH1 mutation prediction.
Race-stratified ROC curves and true positive rate (TPR) disparity for ABMIL IDH1 mutation prediction models trained on the EBRAINS brain tumor atlas (n = 873 slides) and tested on the TCGA-GBMLGG cohort (n = 1,123 slides) with: A-C, ResNet50IN patch encoder; D-F, CTransPath patch encoder; G-I, UNI patch encoder. In each case, the ABMIL model was trained using different strategies: (i) 20-fold Monte Carlo splits (A, D, G); (ii) training with stain-normalized features (B, E, H); (iii) with stain normalization, testing on unbiased test cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per class for each race) (C, F, I). ROC curves show the mean curve (n = 20 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Table 4.
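The P values reported throughout these captions come from non-parametric two-sided paired permutation tests with multiple-hypothesis correction. A minimal sketch of such a test on paired per-fold metrics, using a sign-flip null with the never-zero estimator of Phipson and Smyth (2010) and Benjamini-Hochberg adjustment (an assumed combination; the paper's Methods specify the exact procedure), is:

```python
import numpy as np

def paired_permutation_pvalue(metric_a, metric_b, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-fold metrics,
    with the (b + 1) / (m + 1) estimator so P is never exactly zero."""
    rng = np.random.default_rng(seed)
    d = np.asarray(metric_a) - np.asarray(metric_b)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values for multiple comparisons."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(p)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```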
Extended Data Fig. 4 Effect of stain normalization on disparities.
Race-stratified ROC curves and true positive rate (TPR) disparity for ABMIL models trained in a 20-fold study with ResNet50IN and UNI patch encoders and Macenko stain normalization for: A, breast subtyping; B, lung subtyping; C, IDH1 mutation prediction. For breast and lung subtyping, ABMIL was trained on the TCGA-BRCA (n = 1,049 slides) and TCGA-lung (n = 1,043 slides) cohorts and tested on the resampled MGB-breast and MGB-lung cohorts, respectively. For IDH1 mutation prediction, ABMIL was trained on the EBRAINS brain tumor atlas (n = 873 slides) and tested on the resampled TCGA-GBMLGG cohort. All unbiased test cohorts have 1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per class for each race. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Tables 2–4.
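For reference, Macenko normalization estimates each image's two dominant stain vectors in optical-density (OD) space and remaps its stain concentrations onto a fixed reference. A self-contained NumPy sketch with conventional default parameters and reference values (illustrative only; not necessarily the implementation or settings used in this study) is:

```python
import numpy as np

def macenko_normalize(img, Io=240, alpha=1, beta=0.15):
    """Normalize an RGB H&E tile (uint8, shape (h, w, 3)) onto reference
    stain vectors, following Macenko et al. (2009)."""
    he_ref = np.array([[0.5626, 0.2159],    # conventional reference hematoxylin/
                       [0.7201, 0.8012],    # eosin stain vectors in OD space
                       [0.4062, 0.5581]])
    max_c_ref = np.array([1.9705, 1.0308])  # reference maximum concentrations
    h, w, _ = img.shape
    od = -np.log((img.reshape(-1, 3).astype(float) + 1.0) / Io)
    od_hat = od[~np.any(od < beta, axis=1)]            # drop near-transparent pixels
    _, eigvecs = np.linalg.eigh(np.cov(od_hat.T))
    plane = eigvecs[:, 1:3]                            # top-2 eigenvector plane
    phi = np.arctan2(od_hat @ plane[:, 1], od_hat @ plane[:, 0])
    v1 = plane @ np.array([np.cos(np.percentile(phi, alpha)),
                           np.sin(np.percentile(phi, alpha))])
    v2 = plane @ np.array([np.cos(np.percentile(phi, 100 - alpha)),
                           np.sin(np.percentile(phi, 100 - alpha))])
    he = np.array([v1, v2]).T if v1[0] > v2[0] else np.array([v2, v1]).T
    conc = np.linalg.lstsq(he, od.T, rcond=None)[0]    # stain concentrations (2, N)
    conc *= (max_c_ref / np.percentile(conc, 99, axis=1))[:, None]
    out = Io * np.exp(-he_ref @ conc)                  # reconstruct with reference stains
    return np.clip(out.T.reshape(h, w, 3), 0, 255).astype(np.uint8)
```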
Extended Data Fig. 5 Effect of pre-training dataset size on demographic disparities.
Race-stratified and overall ROC AUC for ABMIL models with patch encoders pre-trained on natural-image and histology datasets of varying sizes for: A, breast subtyping; B, lung subtyping; C, IDH1 mutation prediction. All models were trained on 20-fold Monte Carlo splits of TCGA-BRCA (n = 1,049 slides), TCGA-lung (n = 1,043 slides) and the EBRAINS brain tumor atlas (n = 873 slides), and tested on resampled MGB-breast, MGB-lung and TCGA-GBMLGG cohorts (1,000 white, 1,000 Black and 1,000 Asian slides, with 500 slides per class for each race) for breast subtyping, lung subtyping and IDH1 mutation prediction, respectively. The number of images used to pre-train each encoder is shown in brackets under the encoder name; refer to Methods for details of each encoder. Error bars in bar plots indicate 95% CI, with the center being the mean value (n = 20 folds).
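Each patch encoder above maps tissue tiles to feature vectors that ABMIL then pools. A minimal sketch of patch embedding with an ImageNet-pretrained ResNet-50 loaded via timm (the natural-image baseline; CTransPath and UNI would be substituted with their own pretrained weights and preprocessing) is:

```python
import torch
import timm
from torchvision import transforms

# num_classes=0 makes timm return globally pooled features (2,048-d for ResNet-50).
encoder = timm.create_model("resnet50", pretrained=True, num_classes=0).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),   # ImageNet statistics
])

@torch.no_grad()
def embed_patches(patches):
    """patches: list of 256 x 256 RGB PIL images tiled from one slide."""
    batch = torch.stack([preprocess(p) for p in patches])
    return encoder(batch)   # (n_patches, 2048) bag of features for ABMIL
```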
Extended Data Fig. 6 Demographic stratified performance of internal validation cohorts.
Race-stratified breast and lung subtyping ROC curves and true positive rate (TPR) disparity for ABMIL models trained and tested on: (A) TCGA-BRCA; (B) TCGA-lung; (C) MGB-breast; (D) MGB-lung. To create training splits, 25 examples from each subtype were sampled 10 times to create 10 folds, with the remaining data used for validation. ABMIL with the UNI patch encoder was used. ROC curves show the mean curve (n = 10 folds) with 95% CI. Boxes indicate quartile values of TPR disparity (n = 10 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for each task are given in Supplementary Data Tables 2 and 3.
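A minimal sketch of the fold construction described in this caption (25 training examples per subtype drawn 10 times, with the remainder held out; the seed and sampling details are assumptions, not the authors' exact code):

```python
import numpy as np

def make_few_shot_folds(labels, n_per_class=25, n_folds=10, seed=0):
    """For each fold, sample n_per_class training cases per subtype without
    replacement; all remaining cases form that fold's validation set."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = []
    for _ in range(n_folds):
        train_idx = np.concatenate([
            rng.choice(np.where(labels == c)[0], size=n_per_class, replace=False)
            for c in np.unique(labels)])
        val_idx = np.setdiff1d(np.arange(labels.size), train_idx)
        folds.append((train_idx, val_idx))
    return folds
```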
Extended Data Fig. 7 Investigating IDH1 mutation prediction disparities beyond race.
TPR disparity was assessed in various demographic subgroups of the TCGA-GBMLGG test cohort (n = 1,123 slides) for an ABMIL model trained with UNI features on the EBRAINS brain tumor atlas cohort (n = 873 slides) in a 20-fold study for IDH1 mutation prediction. A, TPR disparity for different race groups. B, TPR disparity for white IDH1 wild-type (WT) and mutant (MT) patients (n = 983 slides), stratified by age. C, TPR disparity for different age groups (years). D, TPR disparity for IDH1 WT and MT patients aged 40 years or younger (n = 303 slides), stratified by race. Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained in one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for the task are given in Supplementary Data Table 4.
Extended Data Fig. 8 Investigating breast subtyping disparities beyond race.
TPR disparity was assessed in various demographic subgroups of the MGB-breast test cohort (n = 1,265 slides) for an ABMIL model trained with the UNI patch encoder on the TCGA-BRCA cohort (n = 1,049 slides) in a 20-fold study for breast subtyping. A, TPR disparity for different income groups inferred from postal code. B-D, TPR disparity for subgroups of invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) patients from low-income postal codes (n = 407 slides), stratified by other demographic variables: B, racial groups; C, insurance groups; D, age groups (years). E, TPR disparity for different racial groups. F-H, TPR disparity for subgroups of white IDC and ILC patients (n = 904 slides), stratified by other demographic variables: F, insurance groups; G, income groups inferred from postal code; H, age groups (years). I-K, TPR disparity for subgroups of Black IDC and ILC patients (n = 164 slides), stratified by other demographic variables: I, insurance groups; J, income groups inferred from postal code; K, age groups (years). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. P values are from non-parametric two-sided paired permutation tests after multiple-hypothesis correction. Demographic distributions for the task are given in Supplementary Data Table 2.
Extended Data Fig. 9 Stain distributions by races for different scanners.
For both the MGB-breast and MGB-lung cohorts, we randomly sampled 50 slides per scanner and per race, segmented the tissue from background, and patched the tissue into 256 × 256 tiles at 20× magnification. We sampled 1,000 patches from each slide, converted them from RGB to HSV space, and calculated their average hue and saturation. We compare the distributions of hue and saturation by race for: A, the overall MGB-breast cohort; B, slides in the MGB-breast cohort scanned on the Aperio GT450 scanner; C, slides in the MGB-breast cohort scanned on the Hamamatsu S210 scanner; D, the overall MGB-lung cohort; E, slides in the MGB-lung cohort scanned on the Aperio GT450 scanner; F, slides in the MGB-lung cohort scanned on the Hamamatsu S210 scanner. We do not find any statistically significant difference in hue or saturation by race for either scanner or for the overall cohorts, as assessed by two-sided non-parametric paired permutation tests. Boxes indicate quartile values of the metric shown on the respective axis (n = 50 whole-slide images), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot represents a unique slide.
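A minimal sketch of the per-slide color summary described above, assuming patches have already been extracted as RGB arrays (note that a plain mean of hue ignores its circularity, which is acceptable for H&E pinks and purples that sit away from the hue wrap-around point):

```python
import numpy as np
from skimage.color import rgb2hsv

def mean_hue_saturation(patches):
    """Average hue and saturation over a slide's sampled patches.
    patches: iterable of (256, 256, 3) RGB uint8 arrays."""
    vals = []
    for p in patches:
        hsv = rgb2hsv(p.astype(float) / 255.0)  # channels: hue, saturation, value
        vals.append([hsv[..., 0].mean(), hsv[..., 1].mean()])
    return np.mean(vals, axis=0)                # (mean hue, mean saturation) in [0, 1]
```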
Extended Data Fig. 10 Effect of resampling sample size on TPR disparities.
We trained ABMIL models with the UNI patch encoder on 20-fold Monte Carlo splits of TCGA-BRCA (breast subtyping), TCGA-lung (lung subtyping) and the EBRAINS brain tumor atlas (IDH1 mutation prediction), and tested them on the original and resampled MGB-breast, MGB-lung and TCGA-GBMLGG cohorts, respectively. We show different resampling variants of the test set (no resampling/original; 500 slides per class and per race; 1,000 slides per class and per race) for: A, breast subtyping; B, lung subtyping; C, IDH1 mutation prediction. Resampling is done separately for each disease class and race (see Methods for details). Boxes indicate quartile values of TPR disparity (n = 20 folds), with the center being the 50th percentile. Whiskers extend to data points within 1.5× the interquartile range. Each dot in the box plots represents a unique model trained for one of the folds. Demographic distributions for each task are given in Supplementary Data Tables 2–4.
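A minimal sketch of building such a resampled test cohort by drawing with replacement from every race-by-class cell (the 'race' and 'label' column names are hypothetical placeholders, not the authors' schema):

```python
import pandas as pd

def resample_unbiased_cohort(df, n_per_cell=500,
                             races=("white", "Black", "Asian"), seed=0):
    """Equal-representation test cohort: n_per_cell slides sampled with
    replacement for every (race, disease class) combination."""
    cells = [df[(df["race"] == r) & (df["label"] == c)]
             .sample(n=n_per_cell, replace=True, random_state=seed)
             for r in races for c in sorted(df["label"].unique())]
    return pd.concat(cells, ignore_index=True)
```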
Supplementary information
Supplementary Information
Supplementary Data Tables 1–44.
Supplementary Table 1
Analysis of FDA-approved algorithms: names of FDA-approved medical imaging algorithms, their approval number and year, the modality they are intended for, their risk group, and whether (1) the company reports the demographics of its test sets and (2) demographic-stratified metrics are reported on the test set. '1' indicates that the metrics are present or that the approval documentation states no differences were found by demographics; '0' indicates that such metrics are not present.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vaidya, A., Chen, R.J., Williamson, D.F.K. et al. Demographic bias in misdiagnosis by computational pathology models. Nat Med 30, 1174–1190 (2024). https://doi.org/10.1038/s41591-024-02885-z
This article is cited by
- Using unlabeled data to enhance fairness of medical AI. Nature Medicine (2024).