Abstract
Deep learning methods have been shown to achieve excellent performance on diagnostic tasks, but how to optimally combine them with expert knowledge and existing clinical decision pathways is still an open challenge. This question is particularly important for the early detection of cancer, where high-volume workflows may benefit from (semi-)automated analysis. Here we present a deep learning framework to analyze samples of the Cytosponge-TFF3 test, a minimally invasive alternative to endoscopy, for detecting Barrett’s esophagus, which is the main precursor of esophageal adenocarcinoma. We trained and independently validated the framework on data from two clinical trials, analyzing a combined total of 4,662 pathology slides from 2,331 patients. Our approach exploits decision patterns of gastrointestinal pathologists to define eight triage classes of varying priority for manual expert review. By substituting manual review with automated review in low-priority classes, we can reduce pathologist workload by 57% while matching the diagnostic performance of experienced pathologists.
Data availability
The dataset is governed by data usage policies specified by the data controller (University of Cambridge, Cancer Research UK). We are committed to complying with Cancer Research UK’s Data Sharing and Preservation Policy. Whole-slide images used in this study will be available for non-commercial research purposes upon approval by a Data Access Committee according to institutional requirements. Applications for data access should be directed to rcf29@cam.ac.uk. Data derived from the raw images are freely available at a public repository: https://github.com/markowetzlab/cytosponge-triage. The code and included data enable replication of the results and figures in this manuscript.
Code availability
The source code of this work is freely available at a public repository: https://github.com/markowetzlab/cytosponge-triage.
Acknowledgements
This research was supported by Cancer Research UK (FM: C14303/A17197), the Medical Research Council (RCF: RG84369) and Cambridge University Hospitals NHS Foundation Trust. BEST2 was funded by Cancer Research UK (12088 and 16893). M.G. acknowledges support from an Enrichment Fellowship from the Alan Turing Institute. M.C.O. acknowledges support from a Borysiewicz Fellowship from the University of Cambridge and a Junior Research Fellowship from Trinity College, Cambridge. F.M. is a Royal Society Wolfson Research Merit Award holder. We thank M. Schneider, R. Drews, P. Martinez-Gonzalez and T. Whitmarsh for valuable input on this work. The authors thank the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) and the Experimental Cancer Medicine Centre for their support and for providing the infrastructure for the research procedures in Cambridge. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. In addition, we thank the Human Research Tissue Bank at Addenbrooke's Hospital, which is supported by the UK National Institute for Health Research Cambridge Biomedical Research Centre. Finally, we thank the BEST2 trial team, the Histopathology core facility at the Cancer Research UK Cambridge Institute and Pathognomics Ltd. for their support.
Author information
Authors and Affiliations
Contributions
M.G. conceived and led the analysis. M.C.O. and A.B. contributed to the analysis. M.G. and A.B. wrote the code for analysis. M.O. and R.C.F. were involved in the collection and labeling of the data. R.C.F. conceived the study. R.C.F. and F.M. directed the project. M.G. and F.M. wrote the manuscript with the assistance and feedback of all other co-authors.
Corresponding authors
Ethics declarations
Competing interests
The Cytosponge device technology and the associated TFF3 biomarker are licensed to Covidien GI solutions (now owned by Medtronic) by the Medical Research Council. M.G., M.C.O. and F.M. are named inventors on a patent pertaining to technology applied in this work. R.C.F. and M.O. are named inventors on patents pertaining to the Cytosponge and associated technology. M.G., M.O. and R.C.F. are shareholders of Cyted Ltd., a company working on early detection technology.
Additional information
Peer review information Nature Medicine thanks Marnix Jansen, Nasir Rajpoot and Pratik Shah for their contribution to the peer review of this work. Javier Carmona was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Differential increase of training partition size for ResNet-18.
Training subset refers to the relative proportion of the training partition used in the model training phase; development subset refers to the relative proportion used in the model development phase. The peak development weighted recall (a) and precision (b) correspond to the best-performing cohort for each training run. The size of the development set was fixed at 15 patients, and an average of 3,500 tiles was used per patient. For both H&E and TFF3, no substantial increase in performance metrics was observed beyond a training subset size of 50 patients. Individual Cytosponge H&E sections are already highly heterogeneous, which limits the value gained by increasing the size of the training dataset. We opted to retain all annotated data in the training set to maximize the chances of capturing the whole spectrum of data variability and therefore the robustness of the model. The H&E model benefited more from an increased number of patients than the TFF3 model; this difference is associated with the greater complexity of detecting different tissue morphologies on H&E versus brown goblet cells on TFF3. In TFF3 slides, regions were extensively annotated by pathologists, and this ground truth served as the comparator for the recall reported in both panels.
Extended Data Fig. 2 Comparison of pathologist landmarks with saliency maps extracted from VGG-16 architectures.
Additional examples of saliency maps for Hematoxylin & Eosin stain (squamous cells and columnar epithelium) and Trefoil factor 3 (positive goblet cells). Landmarks selected by an experienced pathologist are shown as overlays with red borders on pathology tile images. For all classes, there was visual agreement between highlighted areas by the pathologist and saliency map activations.
Extended Data Fig. 3 Determination of probability thresholds used to obtain tile counts.
Both plots show the AUC-ROC for individual probability thresholds (after softmax), which are used to decide whether a tile falls into the relevant class. a, AUC-ROC for quality control (QC) ground truth determined by the pathologist compared with the number of tiles containing columnar epithelium at individual probability thresholds. b, AUC-ROC for diagnosis ground truth determined by endoscopy (with confirmed IM on pathology) compared with the number of tiles containing positive goblet cells at individual probability thresholds.
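The threshold-sweep idea behind this figure can be sketched as follows: a softmax probability cutoff turns per-tile probabilities into a slide-level count of positive tiles, and that count is scored against the slide's ground-truth label via AUC-ROC. The slides and probabilities below are synthetic illustrations, not trial data; the AUC is computed with the Mann-Whitney rank statistic rather than any particular library.

```python
# Sketch of sweeping a per-tile probability threshold and scoring the
# resulting slide-level tile counts with AUC-ROC (synthetic data).

def auc_roc(labels, scores):
    """AUC via the Mann-Whitney U statistic (ties counted as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tile_count(tile_probs, threshold):
    """Number of tiles whose class probability reaches the threshold."""
    return sum(p >= threshold for p in tile_probs)

# Synthetic slides: (ground-truth label, per-tile positive-class probabilities).
slides = [
    (1, [0.9, 0.8, 0.7, 0.2]),
    (1, [0.85, 0.6, 0.1]),
    (0, [0.3, 0.2, 0.1]),
    (0, [0.55, 0.1]),
]

labels = [label for label, _ in slides]
for threshold in (0.5, 0.7, 0.9):
    counts = [tile_count(probs, threshold) for _, probs in slides]
    print(threshold, auc_roc(labels, counts))
```

In the actual analysis, the threshold giving the highest AUC against the relevant ground truth would be the one carried forward.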
Extended Data Fig. 4 Performance of all deep learning architectures on the calibration cohort.
a, ROC analysis of the number of tiles containing columnar epithelium on H&E compared with pathologist ground truth from Cytosponge. b, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with pathologist ground truth from Cytosponge. c, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with endoscopy (with confirmed IM) ground truth. A weak dependency of AUC on architecture complexity can be observed.
Extended Data Fig. 5 Performance of all deep learning architectures on the internal validation cohort.
a, ROC analysis of the number of tiles containing columnar epithelium on H&E compared with pathologist ground truth from Cytosponge. b, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with pathologist ground truth from Cytosponge. c, ROC analysis of the number of tiles containing positive goblet cells on TFF3 compared with endoscopy (with confirmed IM) ground truth. As in the calibration cohort, a weak dependency of AUC on architecture complexity can be observed.
Extended Data Fig. 6 Application of quality control and diagnostic confidence class scheme to calibration cohort.
The lines indicate operating points chosen by three different expert observers. a, Quality ground truth by pathologist from Cytosponge (top) compared with the number of columnar epithelium (CE) tiles on H&E detected by VGG-16 (bottom). For the first operating point, E#2 and E#3 agreed, whereas E#1 selected a higher cut-off; majority voting resulted in the lower cut-off being chosen. For the second operating point, all three observers (E#1, E#2 and E#3) agreed on the same threshold; the line drawn by E#1 effectively resulted in the same operating point as those of E#2 and E#3. b, Diagnosis ground truth by pathologist from Cytosponge (top) and endoscopy (with confirmed IM on biopsy) ground truth (middle) compared with the number of TFF3-positive tiles detected by ResNet-18 (bottom). For both the first and second operating points, E#1, E#2 and E#3 agreed; the line drawn by E#3 for the second operating point effectively resulted in the same operating point as those of E#1 and E#2.
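The majority-voting rule used to consolidate the three observers' cut-offs can be sketched in a few lines. The cut-off values below are hypothetical, and the median fallback for a three-way disagreement is an assumption for illustration, not a rule stated in the study.

```python
# Sketch of selecting an operating point by majority vote among three
# expert observers (hypothetical cut-off values).
from collections import Counter

def majority_cutoff(cutoffs):
    """Return the cut-off chosen by at least two of the three observers;
    fall back to the median if all three disagree (illustrative assumption)."""
    value, count = Counter(cutoffs).most_common(1)[0]
    if count >= 2:
        return value
    return sorted(cutoffs)[len(cutoffs) // 2]

# E#2 and E#3 agree on a lower cut-off; E#1 prefers a higher one,
# so the majority (lower) cut-off is chosen.
print(majority_cutoff([120, 80, 80]))
```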
Extended Data Fig. 7 Performance of semi-automated, triage-driven model on external validation cohort.
a, Cumulative substitution scheme starting with fully manual review, followed by substitution with automated review of class no. 1, then 1 and 2, etc. b, Cumulative substitution scheme starting with fully manual review, followed by substitution with automated review of class no. 8, then 8 and 7, etc.
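The two cumulative substitution schemes can be sketched as a workload-reduction curve: classes are switched from manual to automated review one at a time, in either priority order. The per-class slide counts below are hypothetical placeholders, not the trial's actual class volumes.

```python
# Sketch of the cumulative substitution schemes in panels a and b.
# Hypothetical slide counts per triage class; class 1 is the lowest
# review priority, class 8 the highest.
slides_per_class = {1: 300, 2: 250, 3: 200, 4: 150, 5: 100, 6: 80, 7: 60, 8: 40}

def cumulative_reduction(class_order, slides_per_class):
    """Workload reduction achieved as triage classes are cumulatively
    switched from manual to automated review, in the given order."""
    total = sum(slides_per_class.values())
    automated, curve = 0, []
    for c in class_order:
        automated += slides_per_class[c]
        curve.append(automated / total)
    return curve

low_priority_first = cumulative_reduction(range(1, 9), slides_per_class)      # panel a
high_priority_first = cumulative_reduction(range(8, 0, -1), slides_per_class)  # panel b
```

Plotting each curve against diagnostic performance at the corresponding substitution level reproduces the trade-off explored in this figure.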
Supplementary information
Supplementary Information
Supplementary Tables 1–6.
About this article
Cite this article
Gehrung, M., Crispin-Ortuzar, M., Berman, A.G. et al. Triage-driven diagnosis of Barrett’s esophagus for early detection of esophageal adenocarcinoma using deep learning. Nat Med 27, 833–841 (2021). https://doi.org/10.1038/s41591-021-01287-9