Endomyocardial biopsy (EMB) screening represents the standard of care for detecting allograft rejections after heart transplant. Manual interpretation of EMBs is affected by substantial interobserver and intraobserver variability, which often leads to inappropriate treatment with immunosuppressive drugs, unnecessary follow-up biopsies and poor transplant outcomes. Here we present a deep learning-based artificial intelligence (AI) system for automated assessment of gigapixel whole-slide images obtained from EMBs, which simultaneously addresses detection, subtyping and grading of allograft rejection. To assess model performance, we curated a large dataset from the United States, as well as independent test cohorts from Turkey and Switzerland, which includes large-scale variability across populations, sample preparations and slide scanning instrumentation. The model detects allograft rejection with an area under the receiver operating characteristic curve (AUC) of 0.962; assesses the cellular and antibody-mediated rejection type with AUCs of 0.958 and 0.874, respectively; detects Quilty B lesions, benign mimics of rejection, with an AUC of 0.939; and differentiates between low-grade and high-grade rejections with an AUC of 0.833. In a human reader study, the AI system showed non-inferior performance to conventional assessment and reduced interobserver variability and assessment time. This robust evaluation of cardiac allograft rejection paves the way for clinical trials to establish the efficacy of AI-assisted EMB assessment and its potential for improving heart transplant outcomes.
Please email all requests for academic use of raw and processed data to the corresponding author. Restrictions apply to the availability of the in-house and external data, which were used with institutional permission for the current study, and are thus not publicly available. All requests will be promptly evaluated based on institutional and departmental policies to determine whether the data requested are subject to intellectual property or patient privacy obligations. Data can only be shared for non-commercial academic purposes and will require a data user agreement. A subset of whole-slide images used in the study can be accessed through the interactive demonstration available at http://crane.mahmoodlab.org. ImageNet data are available at https://image-net.org/. Source data are provided with this paper.
All code was implemented in Python using PyTorch as the primary deep learning package. All code and scripts to reproduce the experiments of this paper are available at https://github.com/mahmoodlab/CRANE.
Ziaeian, B. & Fonarow, G. C. Epidemiology and aetiology of heart failure. Nat. Rev. Cardiol. 13, 368–378 (2016).
Benjamin, E. J. et al. Heart disease and stroke statistics–2018 update: a report from the American Heart Association. Circulation 137, e67–e492 (2018).
Badoe, N. & Shah, P. in Contemporary Heart Transplantation (eds Bogar, L. & Stempien-Otero, A.) 3–12 (Springer, 2020).
Orrego, C. M., Cordero-Reyes, A. M., Estep, J. D., Loebe, M. & Torre-Amione, G. Usefulness of routine surveillance endomyocardial biopsy 6 months after heart transplantation. J. Heart Lung Transplant. 31, 845–849 (2012).
Lund, L. H. et al. The Registry of the International Society for Heart and Lung Transplantation: thirty-fourth adult heart transplantation report—2017; focus theme: allograft ischemic time. J. Heart Lung Transplant. 36, 1037–1046 (2017).
Colvin-Adams, M. & Agnihotri, A. Cardiac allograft vasculopathy: current knowledge and future direction. Clin. Transplant. 25, 175–184 (2011).
Kfoury, A. G. et al. Cardiovascular mortality among heart transplant recipients with asymptomatic antibody-mediated or stable mixed cellular and antibody-mediated rejection. J. Heart Lung Transplant. 28, 781–784 (2009).
Costanzo, M. R. et al. The International Society of Heart and Lung Transplantation Guidelines for the care of heart transplant recipients. J. Heart Lung Transplant. 29, 914–956 (2010).
Kobashigawa, J. A. The search for a gold standard to detect rejection in heart transplant patients: are we there yet? Circulation 135, 936–938 (2017).
Angelini, A. et al. A web-based pilot study of inter-pathologist reproducibility using the ISHLT 2004 working formulation for biopsy diagnosis of cardiac allograft rejection: the European experience. J. Heart Lung Transplant. 30, 1214–1220 (2011).
Crespo-Leiro, M. G. et al. Concordance among pathologists in the second Cardiac Allograft Rejection Gene Expression Observational Study (CARGO II). Transplantation 94, 1172–1177 (2012).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Bejnordi, B. E. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210 (2017).
Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 580, 252–256 (2020).
Chen, P.-H. C. et al. An augmented reality microscope with real-time artificial intelligence integration for cancer diagnosis. Nat. Med. 25, 1453–1457 (2019).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).
Bulten, W. et al. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. Lancet Oncol. 21, 233–241 (2020).
Chen, R. J. et al. Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Trans. Med. Imaging https://doi.org/10.1109/TMI.2020.3021387 (2020).
Mahmood, F. et al. Deep adversarial training for multi-organ nuclei segmentation in histopathology images. IEEE Trans. Med. Imaging 39, 3257–3267 (2020).
Fu, Y. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).
Kather, J. N. et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Nat. Cancer 1, 789–799 (2020).
Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).
Peyster, E. G. et al. An automated computational image analysis pipeline for histological grading of cardiac allograft rejection. Eur. Heart J. 42, 2356–2369 (2021).
Tong, L., Hoffman, R., Deshpande, S. R. & Wang, M. D. Predicting heart rejection using histopathological whole-slide imaging and deep neural network with dropout. In 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI) 1–4 (IEEE, 2017).
Nirschl, J. J. et al. A deep-learning classifier identifies patients with clinical heart failure using whole-slide images of H&E tissue. PLoS ONE 13, e0192726 (2018).
Peyster, E. G., Madabhushi, A. & Margulies, K. B. Advanced morphologic analysis for diagnosing allograft rejection: the case of cardiac transplant rejection. Transplantation 102, 1230–1239 (2018).
Sellaro, T. L. et al. Relationship between magnification and resolution in digital pathology systems. J. Pathol. Inform. 4, 21 (2013).
Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In Proc. 35th International Conference on Machine Learning (eds Dy, J. & Krause, A.) 2132–2141 (PMLR, 2018).
Halloran, P. F. et al. Exploring the cardiac response to injury in heart transplant biopsies. JCI Insight 3, e123674 (2018).
Schmauch, B. et al. A deep learning model to predict RNA-seq expression of tumours from whole slide images. Nat. Commun. 11, 3877 (2020).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Karimi, D., Dou, H., Warfield, S. K. & Gholipour, A. Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020).
Mitani, A., Hammel, N. & Liu, Y. Retinal detection of kidney disease and diabetes. Nat. Biomed. Eng. 5, 487–489 (2021).
Biscotti, C. V. et al. Assisted primary screening using the automated ThinPrep Imaging System. Am. J. Clin. Pathol. 123, 281–287 (2005).
Halloran, P. F. et al. Building a tissue-based molecular diagnostic system in heart transplant rejection: the heart Molecular Microscope Diagnostic (MMDx) System. J. Heart Lung Transplant. 36, 1192–1200 (2017).
Duong Van Huyen, J.-P. et al. MicroRNAs as non-invasive biomarkers of heart transplant rejection. Eur. Heart J. 35, 3194–3202 (2014).
Giarraputo, A. et al. A changing paradigm in heart transplantation: an integrative approach for invasive and non-invasive allograft rejection monitoring. Biomolecules 11, 201 (2021).
De Vlaminck, I. et al. Circulating cell-free DNA enables noninvasive diagnosis of heart transplant rejection. Sci. Transl. Med. 6, 241ra77 (2014).
Kennel, P. J. et al. Serum exosomal protein profiling for the non-invasive detection of cardiac allograft rejection. J. Heart Lung Transplant. 37, 409–417 (2018).
Anglicheau, D. & Suthanthiran, M. Noninvasive prediction of organ graft rejection and outcome using gene expression patterns. Transplantation 86, 192–199 (2008).
Dong, Q., Gong, S. & Zhu, X. Imbalanced deep learning by minority class incremental rectification. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1367–1381 (2019).
Matesanz, R., Mahillo, B., Alvarez, M. & Carmona, M. Global observatory and database on donation and transplantation: world overview on transplantation activities. Transplant. Proc. 41, 2297–2301 (2009).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Tan, M. & Le, Q. V. EfficientNet: rethinking model scaling for convolutional neural networks. In Proc. 36th International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 6105–6114 (PMLR, 2019).
We thank A. Bruce for scanning internal cohorts of patient histology slides at Brigham and Women’s Hospital (BWH); K. Bronstein, L. Cirelli and E. Askeland for querying the BWH slide database and retrieving archival slides; C. Li for assistance with EMRs and the Research Patient Data Registry (RPDR); M. Bragg, T. Mellen, S. Zimmet and T. A. Mages for logistical support; and K. Tung for anatomical illustrations. Y.B. and K.E.O. thank the Translational Research Unit team at the Institute of Pathology of the University of Bern for technical assistance and IT assistance; in particular, M. Skowronska, L. Daminescu and S. Reinhard. This work was supported in part by the BWH President’s Fund, National Institute of General Medical Sciences (NIGMS) R35GM138216 (to F.M.), Google Cloud Research Grant, Nvidia GPU Grant Program and internal funds from BWH and Massachusetts General Hospital (MGH) Pathology. M.S. was supported by the National Institutes of Health (NIH) National Library of Medicine (NLM) Biomedical Informatics and Data Science Research Training Program, T15LM007092. M.W. was funded by the NIH National Human Genome Research Institute (NHGRI) Ruth L. Kirschstein National Research Service Award Bioinformatics Training Grant, T32HG002295. T.Y.C. was funded by the NIH National Cancer Institute (NCI) Ruth L. Kirschstein National Service Award, T32CA251062. R.J.C. was funded by the National Science Foundation (NSF) Graduate Fellowship. The content is solely the responsibility of the authors and does not reflect the official views of the NIGMS, NIH, NLM, NHGRI, NCI or NSF.
The authors declare no competing interests.
Peer review information
Nature Medicine thanks Geert Litjens and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Michael Basson, in collaboration with the Nature Medicine team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
a. Polar scatter plot depicts the differences in the train (US) and test (US, Turkish, Swiss) cohorts, each acquired with different scanners and staining protocols. The angle represents the color (i.e. hue) and the polar axis corresponds to the saturation. Each point represents the average hue and saturation of an image patch selected from each cohort. To construct the figure, 100 WSIs were randomly selected from each cohort. For each selected slide, 4 patches of size 1024×1024 at ×10 magnification were randomly selected from the segmented tissue regions. A hue–saturation–density color transform is applied to correct for the logarithmic relationship between light intensity and stain amount. The Swiss cohort demonstrates a large variation in both hue and saturation, whereas the US and Turkish cohorts have a relatively uniform saturation but variable hue. Examples of patches with diverse hue and saturation from each cohort are shown in subplots b. and c.
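The hue and saturation coordinates described in this caption can be sketched as follows. This is a minimal illustration, assuming the commonly used hue-saturation-density formulation (per-channel optical density, then chromatic coordinates normalized by the mean density) and a white background intensity of 255. The function names `hsd_chroma` and `patch_mean_chroma` and the background value are hypothetical and are not taken from the authors' code.

```python
import math

def hsd_chroma(r, g, b, background=255.0):
    """Return (hue_radians, saturation) of one RGB pixel, or None for unstained background."""
    # Optical density per channel: logarithmic in transmitted light intensity.
    # Intensities are clamped to 1 to avoid log(0) for black pixels.
    od = [-math.log(max(float(c), 1.0) / background) for c in (r, g, b)]
    density = sum(od) / 3.0
    if density < 1e-6:
        return None  # no measurable stain, chromatic coordinates undefined
    # Chromatic coordinates, normalized by the overall density.
    cx = od[0] / density - 1.0
    cy = (od[1] - od[2]) / (math.sqrt(3.0) * density)
    return math.atan2(cy, cx), math.hypot(cx, cy)

def patch_mean_chroma(pixels, background=255.0):
    """Circular-mean hue and mean saturation over the stained pixels of a patch."""
    sin_sum = cos_sum = sat_sum = 0.0
    count = 0
    for r, g, b in pixels:
        chroma = hsd_chroma(r, g, b, background)
        if chroma is None:
            continue  # skip background pixels
        hue, sat = chroma
        sin_sum += math.sin(hue)
        cos_sum += math.cos(hue)
        sat_sum += sat
        count += 1
    if count == 0:
        return None
    return math.atan2(sin_sum, cos_sum), sat_sum / count
```

Each patch then contributes one point to the polar plot, with the returned hue as the angle and the returned saturation as the radius. The circular mean is used for hue so that angles near the wrap-around point are not averaged to a misleading value.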
A supervised, patch-level classifier is trained to refine the detected high-grade (2R + 3R) cellular rejections into grades 2 and 3. Subplot a. shows manual annotations of the predictive region for each grade as outlined by a pathologist. b. Patches extracted from the respective annotation regions serve as input for the binary classifier. Subplot c. shows the model performance on patches extracted from the US (m = 290 patches) and Turkish (m = 131 patches) cohorts. Reported are ROC curves with 95% confidence intervals (CIs). The bar plots represent the model accuracy, F1-score, and Cohen’s κ for each cohort. Error bars indicate the 95% CIs while the center is always the computed value of each classification performance metric (specified by its respective axis labels). The slide-level performance is reported in Supplemental Table 6. The Swiss cohort was excluded from the analysis due to the absence of grade 3 rejections.
Model performance at different magnification scales at a. slide-level and b. patient-level. Reported are AUC-ROC curves with 95% CI for 40×, 20× and 10× computed for the US test set (n = 995 WSIs, N = 336 patients). For the rejection detection tasks, the model typically performs better at higher magnification, while the grade predictions benefit from the increased context presented at lower magnifications. To account for the information from different scales, the detection of rejections and Quilty-B lesions is performed from the fusion of the model predictions from all available scales. In comparison, the rejection grade is determined from 10× magnification. c. Model performance during training and validation. Shown is cross-entropy loss for the multi-task model assessing the biopsy state and for the single-task model estimating the rejection grade. Reported is slide-level performance at 40× for the multi-task model, while the grading scores are measured at 10× magnification. The model with the lowest validation loss encountered during training is used as the final model.
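The two selection steps mentioned in this caption can be illustrated with a short sketch. The caption does not state how predictions from different scales are fused, so the plain averaging used here is an assumption, and the names `fuse_scales` and `best_epoch` are hypothetical.

```python
def fuse_scales(probs_by_scale):
    """Average a slide's predicted probability over the magnifications that are available.

    Averaging is an assumed fusion rule; scales without a prediction are passed as None.
    """
    available = [p for p in probs_by_scale.values() if p is not None]
    if not available:
        raise ValueError("no prediction available at any scale")
    return sum(available) / len(available)

def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss (earliest epoch on ties)."""
    if not val_losses:
        raise ValueError("no validation losses recorded")
    return min(range(len(val_losses)), key=val_losses.__getitem__)
```

For example, `fuse_scales({"40x": 0.9, "20x": 0.7, "10x": None})` averages only the two scales that produced a prediction, and `best_epoch` identifies which checkpoint would be kept as the final model.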
The CRANE model was evaluated on the test set from the US (n = 995 WSIs, N = 336 patients) and two independent external cohorts from Turkey (n = 1,717, N = 585) and Switzerland (n = 123, N = 123). a. Receiver operating characteristic (ROC) curves for the multi-task classification of EMB and grading at the slide-level. The area under the ROC curve (AUC) scores are reported together with the 95% CIs. b. The bar plots reflect the model accuracy for each task. Error bars (marked by the black lines) indicate 95% CIs while the center is always the computed value for each cohort (specified by the respective axis labels). The results suggest that the CRANE model generalizes across diverse populations, scanners and staining protocols without any domain-specific adaptation. Clinical deployment might benefit from fine-tuning the model with local data and scanners.
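An AUC with a 95% CI of the kind reported here can be sketched as below. The caption does not say how the intervals were obtained, so the nonparametric percentile bootstrap is an assumption, as are the function names and the default of 1,000 resamples.

```python
import random

def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class, AUC undefined
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

The pairwise AUC computation is quadratic in the number of cases, which is acceptable for a sketch but would be replaced by a rank-based implementation for cohorts with thousands of slides.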
Model robustness can be assessed through the confidence of the predictions. Models that suffer from overfitting usually reach high performance on the training dataset by memorizing the specifics of the training data rather than learning the task at hand. As a consequence, such models produce incorrect but highly confident predictions during deployment. The bar plots show the fraction of model predictions achieved with high confidence, for both correctly (blue) and incorrectly (yellow) estimated patient cases. The fraction of highly confident correctly predicted samples is consistently higher than the fraction of confident incorrect predictions across all the tasks. These results indicate the robustness of the model predictions for all tasks.
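The comparison shown in the bar plots can be sketched as follows. The caption does not define "high confidence", so the 0.9 threshold on the probability of the predicted class is an assumption, and `high_confidence_fractions` is a hypothetical name.

```python
def high_confidence_fractions(labels, probs, cutoff=0.5, confident=0.9):
    """Return (fraction of correct predictions, fraction of incorrect predictions)
    that were made with high confidence, for a binary task."""
    # Maps "was the prediction correct?" to [high-confidence count, total count].
    counts = {True: [0, 0], False: [0, 0]}
    for y, p in zip(labels, probs):
        pred = int(p >= cutoff)
        # Confidence is the probability assigned to the predicted class.
        conf = p if pred == 1 else 1.0 - p
        bucket = counts[pred == y]
        bucket[1] += 1
        if conf >= confident:
            bucket[0] += 1

    def fraction(bucket):
        return bucket[0] / bucket[1] if bucket[1] else 0.0

    return fraction(counts[True]), fraction(counts[False])
```

A well-behaved model in the sense of this caption would return a first value that is clearly larger than the second.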
Reported are confusion matrices for a. rejection detection (including both ACR and AMR), detection of b. ACRs, c. AMRs, d. Quilty-B lesions, and e. discrimination between low (grade 1) and high (grade 2 + 3) rejections. To assess the model’s ability to detect rejections of different grades, subplot f. shows the distinction between normal cases and low-grade rejections, while g. reports the distinction between normal cases and high-grade rejections. In both external cohorts, the model reached higher performance for detecting the more clinically relevant high-grade rejections, whereas in the internal cohort the performance is comparable for both low and high-grade cases. The rows of the confusion matrices show the model predictions and the columns represent the diagnosis reported in the patient’s records. The prediction cut-off for each task was computed from the validation set. For clinical deployment, the cut-off can be modified and fine-tuned with local data to meet the desired false-negative rate. The performance is demonstrated on the US hold-out test set (N = 336 patients with 155 normal cases, 181 rejections, 161 ACRs, 31 AMRs, 65 Quilty-B lesions, 113 low-grade, and 68 high-grade), the Turkish cohort (N = 585 patients with 308 normal cases, 277 rejections, 271 ACRs, 16 AMRs, 74 Quilty-B lesions, 166 low-grade, and 111 high-grade) and the Swiss cohort (N = 123 patients with 54 normal cases, 69 rejections, 66 ACRs, 10 AMRs, 18 Quilty-B lesions, 59 low-grade and 10 high-grade). Details on each cohort are reported in Supplemental Table 1.
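One way to tune a cut-off to a desired false-negative rate, as suggested in this caption, is sketched below. The rule used here (the highest cut-off whose validation false-negative rate stays within a target) is an assumption rather than the authors' procedure, and `pick_cutoff` and `confusion_matrix` are hypothetical names.

```python
def pick_cutoff(val_labels, val_probs, max_fnr=0.05):
    """Highest cut-off whose false-negative rate on the validation set is at most max_fnr."""
    positives = sorted(p for y, p in zip(val_labels, val_probs) if y == 1)
    if not positives:
        raise ValueError("validation set has no positive cases")
    allowed_misses = int(max_fnr * len(positives))
    # Positives scoring below the cut-off are missed, so the cut-off is placed
    # at the lowest-scoring positive that still has to be detected.
    return positives[allowed_misses]

def confusion_matrix(labels, probs, cutoff):
    """2x2 matrix with rows = model prediction and columns = recorded diagnosis."""
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(labels, probs):
        matrix[int(p >= cutoff)][y] += 1
    return matrix
```

Lowering `max_fnr` moves the cut-off down, trading more false positives for fewer missed rejections. The row and column orientation follows the convention stated in the caption.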
Extended Data Fig. 7 Analysis of case with concurrent cellular, antibody-mediated rejection, and Quilty-B lesions.
a-b. The selected biopsy region and the corresponding H&E stained WSI. Attention heatmaps are computed for each task (c,d,e) and the grade (f). For the cellular task (c.), the high-attention regions correctly identified diffuse, multi-focal interstitial inflammatory infiltrate, predominantly comprised of lymphocytes, and associated myocyte injury. For the antibody heatmap (d.), the high-attention regions identified interstitial edema, endothelial swelling, and mild inflammation, consisting of lymphocytes and macrophages. For the Quilty-B heatmap (e.), the high-attention regions highlighted a focal, dense collection of lymphocytes within the endocardium, with mild crush artifact. For the grade (f.), the high-attention regions identified areas with diffuse, interstitial lymphocytic infiltrate with associated myocyte injury, corresponding to high grade cellular rejection. The high-attention regions for both types of rejection and Quilty-B lesions appear similar at the slide level at low power magnification, since all three tasks assign high-attention to regions with atypical myocardial tissue. However, at higher magnification, the highest attention in each task comes from regions with the task-specific morphology. The image patches with the highest attention scores from each task are shown in the last column. This example also illustrates the potential of CRANE to discriminate between ACR and similarly appearing Quilty-B lesions.
While the attention scores provide only the relative importance of each biopsy region for the model predictions, we attempted to quantify their relevance for diagnostic interpretability at patch- and slide-level. From the internal test set, we randomly selected 30 slides from each diagnosis and computed the attention heatmaps for each task (a-b,f-g). For the patch-level assessment, we selected 3 non-overlapping patches from the highest attention region in each slide. Since the regions with the lowest attention scores often include just a small fraction of tissue, we randomly selected 3 non-overlapping patches from the regions with medium-to-low attention (i.e. attention scores < 0.5). We randomly removed 5% of the patches to prevent the pathologist from providing an equal number of diagnoses, resulting in a total of 513 patches. A pathologist evaluated each patch as relevant or non-relevant for the given diagnosis. The pathologist’s scores are compared against the model predictions of diagnostically relevant (high-attention) vs non-relevant (medium-to-low attention) patches. The subplot shows AUC-ROC scores across all patches, using the normalized attention scores as the probability estimates. The accuracy, F1-score, and Cohen’s κ, computed for all patches and for the specific diagnoses, are reported in e. These results suggest a high agreement between the model and pathologist’s interpretation of diagnostically relevant regions. For the slide-level assessment, we compare concordance in the predictive regions used by the model and pathologists. A pathologist annotated in each slide the most relevant biopsy region(s) for the given diagnosis (f.). The regions with the top 10% highest attention scores in each slide are used to determine the most relevant regions used by the model (g.). These are compared against the pathologist’s annotations. The detection rate for all slides, and the individual diagnosis, are reported in h.
Although the model did not use any pixel-level annotations during training, these results imply relatively high concordance in the predictive regions used by the model and the pathologist. It should be noted that the attention heatmaps are always normalized and not absolute; hence, the highest attended region is considered for the analysis, similar to ref. 17.
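The normalization and top-10% selection described in this caption can be illustrated with a short sketch. The exact normalization is not given here, so the rank-based scaling to [0, 1] within each slide is an assumption, and `normalize_attention` and `top_attention_patches` are hypothetical names.

```python
def normalize_attention(raw_scores):
    """Map raw attention scores to [0, 1] by their rank within the slide (assumed scheme)."""
    n = len(raw_scores)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: raw_scores[i])
    normalized = [0.0] * n
    for rank, i in enumerate(order):
        normalized[i] = rank / (n - 1)
    return normalized

def top_attention_patches(raw_scores, fraction=0.10):
    """Sorted indices of the patches with the highest attention (at least one patch)."""
    k = max(1, round(fraction * len(raw_scores)))
    order = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i], reverse=True)
    return sorted(order[:k])
```

Because the scores are relative within a slide, every slide has a highest-attention region regardless of the diagnosis, which is why the analysis compares the location of that region with the pathologist's annotation rather than interpreting absolute values.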
The design of the reader study is depicted in a-b. Subplot c. shows the agreement between each pair of pathologists, while the agreement between the AI model and each pathologist is shown in d. The average agreement for each task is plotted as a vertical solid line. The analysis was performed on a subset of 150 cases randomly selected from the Turkey test cohort: 91 ACR, 23 AMR cases (including 14 concurrent ACR and AMR cases) and 50 normal biopsies. The AI model was trained on the US cohort. For evaluation purposes, the pathologists assessed each case using the H&E slides only. It should be noted that the assessment presented here is based on Cohen’s κ and is not the absolute agreement. Cohen’s κ is a metric that ranges between −1 and 1 and takes into account agreement by chance.
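For readers unfamiliar with the metric, Cohen’s κ for two raters can be computed as in the sketch below, which uses the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohens_kappa` is hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of cases")
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Two raters who agree on half of a balanced set of cases purely by chance obtain κ = 0, even though their absolute agreement is 50%, which is why κ values are lower than raw agreement percentages.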
An independent reader study was conducted to assess the potential of CRANE to serve as an assistive diagnostic tool. Subplot a. illustrates the study design. A panel of five cardiac pathologists from an independent center was asked to assess 150 EMBs randomly selected from the Turkey cohort, the same set of slides as used for the assessment of interobserver variability presented in Extended Data Fig. 9. The pathologists were randomly split into two groups. In the first round, the readers from the first group used WSIs only, while the readers from the second group also received assistance from CRANE in the form of attention heatmaps (HMs) plotted on top of the H&E slides. Following a washout period, the pathologists repeated the task. In the second round, the readers from the first group received WSIs and AI assistance, while the second group used WSIs only. Subplots b-e. report accuracy and assessment time (f.) of the readers without and with AI assistance marked as (WSI) and (HM + WSI), respectively. The ground truth labels were constructed based on the pathologists’ consensus from the reader study presented in Extended Data Fig. 9. The ability of CRANE to mark diagnostically relevant regions increased the accuracy of manual biopsy assessment for all tasks and all readers, and reduced the assessment time. These results support the feasibility of CRANE in reducing the interobserver variability and increasing the efficiency of manual biopsy reads.
About this article
Cite this article
Lipkova, J., Chen, T.Y., Lu, M.Y. et al. Deep learning-enabled assessment of cardiac allograft rejection from endomyocardial biopsies. Nat Med 28, 575–582 (2022). https://doi.org/10.1038/s41591-022-01709-2