The development of decision support systems for pathology and their deployment in clinical practice have been hindered by the need for large manually annotated datasets. To overcome this problem, we present a multiple instance learning-based deep learning system that uses only the reported diagnoses as labels for training, thereby avoiding expensive and time-consuming pixel-wise manual annotations. We evaluated this framework at scale on a dataset of 44,732 whole slide images from 15,187 patients without any form of data curation. Tests on prostate cancer, basal cell carcinoma and breast cancer metastases to axillary lymph nodes resulted in areas under the curve above 0.98 for all cancer types. Its clinical application would allow pathologists to exclude 65–75% of slides while retaining 100% sensitivity. Our results show that this system has the ability to train accurate classification models at unprecedented scale, laying the foundation for the deployment of computational decision support systems in clinical practice.
The publicly shared MSK breast cancer metastases dataset is available at http://thomasfuchslab.org/data/. The dataset consists of 130 de-identified WSIs of axillary lymph node specimens from 78 patients (see Extended Data Fig. 8). The tissue was stained with hematoxylin and eosin and scanned on Leica Biosystems AT2 digital slide scanners at MSK. Metastatic carcinoma is present in 36 whole slides from 27 patients, and the corresponding label is included in the dataset.
The remaining data that support the findings of this study were offered, upon request, to editors and peer reviewers at the time of submission for the purpose of evaluating the manuscript. These data are not publicly available, in accordance with institutional requirements governing human subject privacy protection.
The source code of this work can be downloaded from https://github.com/MSKCC-Computational-Pathology/MIL-nature-medicine-2019.
Ball, C. S. The early history of the compound microscope. Bios 37, 51–60 (1966).
Hajdu, S. I. Microscopic contributions of pioneer pathologists. Ann. Clin. Lab. Sci. 41, 201–206 (2011).
Fuchs, T. J., Wild, P. J., Moch, H. & Buhmann, J. M. Computational pathology analysis of tissue microarrays predicts survival of renal clear cell carcinoma patients. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 1–8 (Lecture Notes in Computer Science Vol 5242, Springer, 2008).
Fuchs, T. J. & Buhmann, J. M. Computational pathology: challenges and promises for tissue analysis. Comput. Med. Imaging Graph. 35, 515–530 (2011).
Louis, D. N. et al. Computational pathology: a path ahead. Arch. Pathol. Lab. Med. 140, 41–50 (2016).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 1097–1105 (2012).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Preprint at https://arxiv.org/abs/1512.03385 (2015).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
Liu, Y. et al. Detecting cancer metastases on gigapixel pathology images. Preprint at https://arxiv.org/abs/1703.02442 (2017).
Das, K., Karri, S. P. K., Guha Roy, A., Chatterjee, J. & Sheet, D. Classifying histopathology whole-slides using fusion of decisions from deep convolutional network on a collection of random multi-views at multi-magnification. In 2017 IEEE 14th International Symposium on Biomedical Imaging 1024–1027 (IEEE, 2017).
Valkonen, M. et al. Metastasis detection from whole slide images using local features and random forests. Cytom. Part A 91, 555–565 (2017).
Bejnordi, B. E. et al. Using deep convolutional neural networks to identify and classify tumor-associated stroma in diagnostic breast biopsies. Mod. Pathol. 31, 1502–1512 (2018).
Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl Acad. Sci. USA 115, E2970–E2979 (2018).
Wang, D., Khosla, A., Gargeya, R., Irshad, H. & Beck, A. H. Deep learning for identifying metastatic breast cancer. Preprint at https://arxiv.org/abs/1606.05718 (2016).
Janowczyk, A. & Madabhushi, A. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J. Pathol. Inform. 7, 29 (2016).
Litjens, G. et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep. 6, 26286 (2016).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Olsen, T. et al. Diagnostic performance of deep learning algorithms applied to three common diagnoses in dermatopathology. J. Pathol. Inform. 9, 32 (2018).
Ehteshami Bejnordi, B. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. J. Am. Med. Assoc. 318, 2199–2210 (2017).
Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2016. CA Cancer J. Clin. 66, 7–30 (2016).
Ozdamar, S. O. et al. Intraobserver and interobserver reproducibility of WHO and Gleason histologic grading systems in prostatic adenocarcinomas. Int. Urol. Nephrol. 28, 73–77 (1996).
Svanholm, H. & Mygind, H. Prostatic carcinoma reproducibility of histologic grading. APMIS 93, 67–71 (1985).
Gleason, D. F. Histologic grading of prostate cancer: a perspective. Hum. Pathol. 23, 273–279 (1992).
LeBoit, P. E. et al. Pathology and Genetics of Skin Tumours (IARC Press, 2006).
Rogers, H. W., Weinstock, M. A., Feldman, S. R. & Coldiron, B. M. Incidence estimate of nonmelanoma skin cancer (keratinocyte carcinomas) in the US population, 2012. JAMA Dermatol. 151, 1081–1086 (2015).
Dietterich, T. G., Lathrop, R. H. & Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89, 31–71 (1997).
Andrews, S., Hofmann, T. & Tsochantaridis, I. Multiple instance learning with generalized support vector machines. In AAAI/IAAI 943–944 (AAAI, 2002).
Verma, N. Learning from Data with Low Intrinsic Dimension (Univ. California, 2012).
Zhang, C., Platt, J. C. & Viola, P. A. Multiple instance boosting for object detection. Adv. Neural Inf. Process. Syst. 1417–1424 (2006).
Zhang, Q. & Goldman, S. A. EM-DD: an improved multiple-instance learning technique. Adv. Neural Inf. Process. Syst. 1073–1080 (2002).
Kraus, O. Z., Ba, J. L. & Frey, B. J. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics 32, i52–i59 (2016).
Hou, L. et al. Patch-based convolutional neural network for whole slide tissue image classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2424–2433 (IEEE, 2016).
Bychkov, D. et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci. Rep. 8, 3395 (2018).
Goode, A., Gilbert, B., Harkes, J., Jukic, D. & Satyanarayanan, M. OpenSlide: a vendor-neutral software foundation for digital pathology. J. Pathol. Inform. 4, 27 (2013).
Paszke, A. et al. Automatic differentiation in PyTorch. In 31st Conference on Neural Information Processing Systems (2017).
R Development Core Team R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2017).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
Carpenter, J. & Bithell, J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat. Med. 19, 1141–1164 (2000).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Yu, Y. et al. Sentinel lymph node biopsy after neoadjuvant chemotherapy for breast cancer: retrospective comparative evaluation of clinically axillary lymph node positive and negative patients, including those with axillary lymph node metastases confirmed by fine needle aspiration. BMC Cancer 16, 808 (2016).
Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
We thank The Warren Alpert Center for Digital and Computational Pathology and MSK’s high-performance computing team for their support. We also thank J. Samboy for leading the digital scanning initiative and E. Stamelos and F. Cao, from the pathology informatics team at MSK, for their invaluable help querying the digital slide and LIS databases. We are indebted to P. Schueffler for extending the digital whole slide viewer specifically for this study and for supporting its use by the whole research team. Finally, we thank C. Virgo for managing the project, D. V. K. Yarlagadda for development support and D. Schnau for help editing the manuscript. This research was funded in part through the NIH/NCI Cancer Center Support Grant P30 CA008748.
T.J.F. is the Chief Scientific Officer of Paige.AI. T.J.F. and D.S.K. are co-founders and equity holders of Paige.AI. M.G.H., V.W.K.S., D.S.K. and V.E.R. are consultants for Paige.AI. V.E.R. is a consultant for Cepheid. M.G.H. is on the medical advisory board of PathPresenter. D.S.K. has received speaking/consulting compensation from Merck. G.C. and T.J.F. have intellectual property interests relevant to the work that is the subject of this paper. MSK has financial interests in Paige.AI and intellectual property interests relevant to the work that is the subject of this paper.
Peer review information: Javier Carmona was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Geographical distribution of the external consultation slides submitted to MSKCC.
We included in our work a total of 17,661 consultation slides: 17,363 came from other US institutions located across 48 US states, Washington DC and Puerto Rico; 248 cases came from international institutions spread across 44 countries in all continents. a, Distribution of consultation slides coming from other US institutions. Top, geographical distribution of slides in the continental United States. Red points correspond to pathology laboratories. Bottom, consultation slides distribution per state (including Washington DC and Puerto Rico). b, Distribution of consultation slides coming from international institutions. Top, geographical locations of consultation slides across the world (light gray, countries that did not contribute slides; light blue, countries that contributed slides; dark blue, United States). Bottom, distribution of external consultation slides per country of origin (excluding the United States).
Performance on the respective test datasets was measured in terms of AUC. a, Best results were achieved on the prostate dataset (n = 1,784), with an AUC of 0.989 at 20× magnification. b, For BCC (n = 1,575), the model trained at 5× performed the best, with an AUC of 0.990. c, The worst performance came on the breast metastasis detection task (n = 1,473), with an AUC of 0.965 at 20×. The axillary lymph node dataset is the smallest of the three datasets, which is in agreement with the hypothesis that larger datasets are necessary to achieve lower error rates on real-world clinical data.
Extended Data Fig. 3 t-SNE visualization of the representation space for the BCC and axillary lymph node models.
Two-dimensional t-SNE projections of the 512-dimensional representation space were generated for 100 randomly sampled tiles per slide. a, BCC representation (n = 144,935). b, Axillary lymph nodes representation (n = 139,178).
The MIL model was run on each slide of the test dataset (n = 1,784) with a stride of 40 pixels. From the resulting tumor probability heat map, hand-engineered features were extracted for classification with the random forest (RF) model. The best MIL-RF model (ensemble model; AUC = 0.987) was not statistically significantly better than the MIL-only model (20× model; AUC = 0.986; see Fig. 3), as determined using DeLong’s test for two correlated ROC curves.
a, Prostate model trained with MIL on MSK in-house slides tested on: (1) an in-house slides test set (n = 1,784) digitized on Aperio scanners; (2) an in-house slides test set digitized on a Philips scanner (n = 1,274); and (3) external slides submitted to MSK for consultation (n = 12,727). b,c, Comparison of the proposed MIL approach with state-of-the-art fully supervised learning for breast metastasis detection in lymph nodes. For b, the breast model was trained on MSK data with our proposed method (MIL-RNN) and tested on the MSK breast data test set (n = 1,473) and on the test set of the CAMELYON16 challenge (n = 129), and achieved AUCs of 0.965 and 0.895, respectively. For c, the fully supervised model was trained on CAMELYON16 data and tested on the CAMELYON16 test set (n = 129), achieving an AUC of 0.930. Its performance dropped to AUC = 0.727 when tested on the MSK test set (n = 1,473).
For each dataset, slides are ordered by their probability of being positive for cancer, as predicted by the respective MIL-RNN model. The sensitivity is computed at the case level. a, BCC (n = 1,575): given a positive prediction threshold of 0.025, it is possible to ignore roughly 68% of the slides while maintaining 100% sensitivity. b, Breast metastases (n = 1,473): given a positive prediction threshold of 0.21, it is possible to ignore roughly 65% of the slides while maintaining 100% sensitivity.
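The thresholding idea in this caption can be illustrated with a minimal sketch. It assumes slide-level labels and scores (the caption computes sensitivity at the case level, which would require grouping slides by case first); the function name `excludable_fraction` and the toy data are hypothetical and are not taken from the published code.

```python
def excludable_fraction(scores, labels):
    """Return (threshold, fraction of slides scoring below it).

    The threshold is the lowest score assigned to any positive slide, so
    calling every slide at or above it positive keeps sensitivity at 100%.
    Illustrative, slide-level simplification only.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("need at least one positive slide")
    threshold = min(positives)
    # Slides strictly below the threshold are all negative and can be skipped.
    excluded = sum(1 for s in scores if s < threshold)
    return threshold, excluded / len(scores)


# Toy example: 8 slides, 3 of them positive (made-up numbers).
scores = [0.01, 0.02, 0.03, 0.10, 0.40, 0.90, 0.015, 0.95]
labels = [0, 0, 0, 0, 1, 1, 0, 1]
threshold, fraction = excludable_fraction(scores, labels)
```

In this toy example the lowest-scoring positive slide sets the threshold, and the five negative slides below it are the ones a pathologist could skip.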
Extended Data Fig. 7 Example of a slide tiled on a grid with no overlap at different magnifications.
A slide represents a bag, and the tiles constitute the instances in that bag. In this work, instances at different magnifications are not part of the same bag. mpp, microns per pixel.
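The bag-and-instance framing can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the tile size of 224 pixels, the helper names `tile_grid` and `bag_probability`, and the use of max-pooling over instance probabilities (the standard MIL assumption that a bag is positive if at least one instance is) are all choices made for this example.

```python
def tile_grid(width, height, tile_size=224):
    """Top-left corners of non-overlapping tiles that fit fully inside the slide.

    tile_size is an assumed value for illustration.
    """
    return [
        (x, y)
        for y in range(0, height - tile_size + 1, tile_size)
        for x in range(0, width - tile_size + 1, tile_size)
    ]


def bag_probability(instance_probs):
    """Max-pooling aggregation: the bag score is its most suspicious instance."""
    if not instance_probs:
        raise ValueError("a bag needs at least one instance")
    return max(instance_probs)


# One bag per slide and per magnification; here a toy 1000 x 500 pixel region.
bag = tile_grid(1000, 500)
slide_score = bag_probability([0.10, 0.05, 0.90])
```

Because instances at different magnifications are not mixed in one bag, a separate grid (and a separate bag) would be built for each magnification level.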
Extended Data Fig. 8 The publicly shared MSK breast cancer metastases dataset is representative of the full MSK breast cancer metastases test set.
We created an additional dataset of the size of the test set of the CAMELYON16 challenge (130 slides) by subsampling the full MSK breast cancer metastases test set, ensuring that the models achieved similar performance for both datasets. Left, the model was trained on MSK data with our proposed method (MIL-RNN) and tested on: the full MSK breast data test set (n = 1,473; AUC = 0.968); the public MSK dataset (n = 130; AUC = 0.965); and the test set of the CAMELYON16 challenge (n = 129; AUC = 0.898). Right, the model was trained on CAMELYON16 data with supervised learning (ref. 18) and tested on: the test set of the CAMELYON16 challenge (n = 129; AUC = 0.932); the full MSK breast data test set (n = 1,473; AUC = 0.731); and the public MSK dataset (n = 130; AUC = 0.737). Error bars represent 95% confidence intervals for the true AUC calculated by bootstrapping each test set.
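A bootstrap confidence interval for the AUC can be sketched as follows. The caption does not say which bootstrap variant was used, so this example assumes a simple percentile bootstrap; the function names, the number of resamples and the toy data are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half (Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed variant)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        # Resample slides with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the classes
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int(round((alpha / 2) * (n_boot - 1)))]
    hi = stats[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return lo, hi


# Toy data (made-up numbers), far smaller than any real test set.
scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]
point_estimate = auc(scores, labels)
ci_low, ci_high = bootstrap_auc_ci(scores, labels, n_boot=200)
```

With test sets of roughly 130 slides, intervals computed this way are expected to be considerably wider than those for the full test set of 1,473 slides.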