Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models

Carrillo-Perez, Francisco; Pizurica, Marija; Zheng, Yuanning; Nandi, Tarak Nath; Madduri, Ravi; Shen, Jeanne; Gevaert, Olivier

doi:10.1038/s41551-024-01193-8

Article
Published: 21 March 2024

Nature Biomedical Engineering (2024)

Abstract

Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.

**Fig. 1: RNA-CDM model architecture used for generating RNA-seq embeddings and synthetic WSI tiles using diffusion models.**

**Fig. 2: RNA-to-image multicancer synthetic samples generated by conditioning on the gene-expression latent representation.**

**Fig. 3: Synthetic samples maintain the cell distributions observed in real-world data.**

**Fig. 4: Pretraining on synthetic samples improves classification performance in a multicancer classification problem.**

Data availability

TCGA data can be downloaded from the GDC platform (https://portal.gdc.cancer.gov/). The two GEO series used in this study can be downloaded from the GEO platform: GSE50760 and GSE226069. The PBTA dataset can be downloaded from the Gabriella Miller Kids First Data Resource Portal (KF-DRC, https://kidsfirstdrc.org). Microsatellite-instability-status data can be downloaded from the Kaggle platform: https://www.kaggle.com/datasets/joangibert/tcga_coad_msi_mss_jpg. Case IDs used for this work as well as the RNA-seq encodings obtained for all experiments are available under an academic-use-only licence at https://rna-cdm.stanford.edu. One million synthetic images are available in the Dryad platform at https://doi.org/10.5061/dryad.6djh9w174 (ref. ⁷⁷).

Code availability

A demo for generating synthetic images and the code are available under an academic-use-only licence at https://rna-cdm.stanford.edu.

References

1. Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
2. Jones, P. A. & Baylin, S. B. The epigenomics of cancer. Cell 128, 683–692 (2007).
3. Lujambio, A. & Lowe, S. W. The microcosmos of cancer. Nature 482, 347–355 (2012).
4. Frangioni, J. V. New technologies for human cancer imaging. J. Clin. Oncol. 26, 4012–4021 (2008).
5. Williams, B. J., Bottoms, D. & Treanor, D. Future-proofing pathology: the case for clinical adoption of digital pathology. J. Clin. Pathol. 70, 1010–1018 (2017).
6. Heindl, A., Nawaz, S. & Yuan, Y. Mapping spatial heterogeneity in the tumor microenvironment: a new era for digital pathology. Lab. Invest. 95, 377–384 (2015).
7. Cheng, J. et al. Identification of topological features in renal tumor microenvironment associated with patient survival. Bioinformatics 34, 1024–1030 (2018).
8. Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
9. Castillo, D. et al. Integration of RNA-seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinformatics 18, 506 (2017).
10. Yu, D. et al. Copy number variation in plasma as a tool for lung cancer prediction using Extreme Gradient Boosting (XGBoost) classifier. Thorac. Cancer 11, 95–102 (2020).
11. Maros, M. E. et al. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nat. Protoc. 15, 479–512 (2020).
12. Chen, R. J. et al. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16144–16155 (IEEE, 2022).
13. Carrillo-Perez, F. et al. Machine-learning-based late fusion on multi-omics and multi-scale data for non-small-cell lung cancer diagnosis. J. Pers. Med. 12, 601 (2022).
14. Lee, C. & van der Schaar, M. A variational information bottleneck approach to multi-omics data integration. In International Conference on Artificial Intelligence and Statistics 1513–1521 (PMLR, 2021).
15. Chen, R. J. et al. Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Trans. Med. Imaging 41, 757–770 (2020).
16. Cheerla, A. & Gevaert, O. Deep learning with multimodal representation for pancancer prognosis prediction. Bioinformatics 35, i446–i454 (2019).
17. Chen, R. J. et al. Pan-cancer integrative histology–genomic analysis via multimodal deep learning. Cancer Cell 40, 865–878 (2022).
18. Vanguri, R. S. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L) 1 blockade in patients with non-small cell lung cancer. Nat. Cancer 3, 1151–1164 (2022).
19. Lipkova, J. et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 40, 1095–1110 (2022).
20. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
21. Jennings, C. N. et al. Bridging the gap with the UK Genomics Pathology Imaging Collection. Nat. Med. 28, 1107–1108 (2022).
22. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012).
23. Quiros, A. C., Murray-Smith, R. & Yuan, K. PathologyGAN: learning deep representations of cancer tissue. In Proceedings of the Third Conference on Medical Imaging with Deep Learning 121, 669–695 (PMLR, 2020).
24. Quiros, A. C., Murray-Smith, R. & Yuan, K. Learning a low dimensional manifold of real cancer tissue with PathologyGAN. Preprint at https://arxiv.org/abs/1907.02644v5 (2020).
25. Viñas, R., Andrés-Terré, H., Liò, P. & Bryson, K. Adversarial generation of gene expression data. Bioinformatics 38, 730–737 (2022).
26. Mitra, R. & MacLean, A. L. RVAgene: generative modeling of gene expression time series data. Bioinformatics 37, 3252–3262 (2021).
27. Qiu, Y. L., Zheng, H. & Gevaert, O. Genomic data imputation with variational auto-encoders. Gigascience 9, giaa082 (2020).
28. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 5769–5779 (Curran Associates, 2017).
29. Metz, L., Poole, B., Pfau, D. & Sohl-Dickstein, J. Unrolled generative adversarial networks. Preprint at https://doi.org/10.48550/arXiv.1611.02163 (2016).
30. Salimans, T. et al. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29 (eds Lee, D. et al.) 2234–2242 (Curran Associates, 2016).
31. Zhao, S., Song, J. & Ermon, S. InfoVAE: balancing learning and inference in variational autoencoders. Proc. AAAI Conf. Artif. Intell. 33, 5885–5892 (2019).
32. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://doi.org/10.48550/arXiv.2204.06125 (2022).
33. Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems 35, 36479–36494 (PMLR, 2022).
34. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning 2256–2265 (PMLR, 2015).
35. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning 8748–8763 (PMLR, 2021).
36. Yu, K. H. et al. Association of omics features with histopathology patterns in lung adenocarcinoma. Cell Syst. 5, 620–627 (2017).
37. Fu, Y. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).
38. Schmauch, B. et al. A deep learning model to predict RNA-seq expression of tumours from whole slide images. Nat. Commun. 11, 3877 (2020).
39. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://doi.org/10.48550/arXiv.1802.03426 (2018).
40. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 6629–6640 (Curran Associates, 2017).
41. Binkowski, M., Sutherland, D. J., Arbel, M. & Gretton, A. Demystifying MMD GANs. Preprint at https://doi.org/10.48550/arXiv.1801.01401 (2018).
42. Kim, S. K. et al. A nineteen gene-based risk score classifier predicts prognosis of colorectal cancer patients. Mol. Oncol. 8, 1653–1666 (2014).
43. Quintanal-Villalonga, A. et al. Comprehensive molecular characterization of lung tumors implicates AKT and MYC signaling in adenocarcinoma to squamous cell transdifferentiation. J. Hematol. Oncol. 14, 170 (2021).
44. Graham, S. et al. Hover-Net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med. Image Anal. 58, 101563 (2019).
45. Karimi, E. et al. Single-cell spatial immune landscapes of primary and metastatic brain tumours. Nature 614, 555–563 (2023).
46. Han, S. et al. Rescuing defective tumor-infiltrating T-cell proliferation in glioblastoma patients. Oncol. Lett. 12, 2924–2929 (2016).
47. Steyaert, S. et al. Multimodal deep learning to predict prognosis in adult and pediatric brain tumors. Commun. Med. 3, 44 (2023).
48. Lehrer, M. et al. in Advances in Biology and Treatment of Glioblastoma (ed. Somasundaram, K.) 143–159 (Springer, 2017).
49. Yamashita, R. et al. Deep learning model for the prediction of microsatellite instability in colorectal cancer: a diagnostic study. Lancet Oncol. 22, 132–141 (2021).
50. Marisa, L. et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med. 10, e1001453 (2013).
51. Li, W. et al. High resolution histopathology image generation and segmentation through adversarial training. Med. Image Anal. 75, 102251 (2022).
52. Karras, T., Aittala, M., Aila, T. & Laine, S. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35, 26565–26577 (PMLR, 2022).
53. Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
54. Azizi, S. et al. Robust and efficient medical imaging with self-supervision. Preprint at https://doi.org/10.48550/arXiv.2205.09723 (2022).
55. Dries, R. et al. Advances in spatial transcriptomic data analysis. Genome Res. 31, 1706–1718 (2021).
56. Zheng, H., Brennan, K., Hernaez, M. & Gevaert, O. Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples. Gigascience 8, giz145 (2019).
57. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
58. Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).
59. Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
60. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979).
61. Goode, A., Gilbert, B., Harkes, J., Jukic, D. & Satyanarayanan, M. OpenSlide: a vendor-neutral software foundation for digital pathology. J. Pathol. Inform. 4, 27 (2013).
62. Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
63. Ijaz, H. et al. Pediatric high-grade glioma resources from the Children’s Brain Tumor Tissue Consortium. Neuro Oncol. 22, 163–165 (2020).
64. Higgins, I. et al. beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations 1–13 (ICLR, 2017).
65. Hyvärinen, A. & Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709 (2005).
66. Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 23, 1661–1674 (2011).
67. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
68. Ho, J. et al. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 1–33 (2022).
69. Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention (eds Navab, N. et al.) 234–241 (Springer, 2015).
70. Grill, J. B. et al. Bootstrap your own latent–a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020).
71. Kaiser, L. et al. Fast decoding in sequence models using discrete latent variables. Proc. Mach. Learn. Res. 80, 2390–2399 (2018).
72. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
73. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).
74. Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the yield of medical tests. JAMA 247, 2543–2546 (1982).
75. Longato, E., Vettoretti, M. & Di Camillo, B. A practical perspective on the concordance index for the evaluation and selection of prognostic time-to-event models. J. Biomed. Inform. 108, 103496 (2020).
76. Graf, E., Schmoor, C., Sauerbrei, W. & Schumacher, M. Assessment and comparison of prognostic classification schemes for survival data. Stat. Med. 18, 2529–2545 (1999).
77. Carrillo-Perez, F. RNA-to-image multi-cancer synthesis using cascaded diffusion models, one million synthetic images. Dryad https://doi.org/10.5061/dryad.6djh9w174 (2023).

Acknowledgements

The results published here are in whole or in part based on data generated by the TCGA Research Network (https://www.cancer.gov/tcga). F.C.-P. was supported by MCIN/AEI/10.13039/501100011033 (grant number PID2021-128317OB-I00) and Consejería de Universidad, Investigación e Innovación (grant number P20-00163), which are both funded by ‘ERDF A way of making Europe’, and by a predoctoral scholarship from the Fulbright Spanish Commission. M.P. was supported by the Belgian American Educational Foundation and FWO (grant number 1161223N). Research reported here was further supported by the National Cancer Institute (NCI) (grant number R01 CA260271). This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory, and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, Stanford, CA, USA
Francisco Carrillo-Perez, Marija Pizurica, Yuanning Zheng & Olivier Gevaert
Internet technology and Data science Lab (IDLab), Ghent University, Ghent, Belgium
Marija Pizurica
Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, USA
Tarak Nath Nandi & Ravi Madduri
Department of Pathology, Stanford University, School of Medicine, Palo Alto, CA, USA
Jeanne Shen
Department of Biomedical Data Science, Stanford University, School of Medicine, Stanford, CA, USA
Olivier Gevaert

Contributions

F.C.-P., M.P. and O.G. conceived and designed the study. F.C.-P., M.P. and Y.Z. performed data preprocessing. F.C.-P. developed the code. T.N.N. and R.M. contributed to code optimization and parallel training. R.M. and T.N.N. provided access to the Argonne National Laboratory platform. J.S. performed the analysis of the clinical impact and analysed the digital pathology quality. Y.Z. obtained the deconvolved RNA-seq data. F.C.-P. and M.P. generated the figures. O.G. supervised the work and obtained the funding. F.C.-P. and O.G. wrote the manuscript with contributions and/or revisions from all authors.

Corresponding author

Correspondence to Olivier Gevaert.

Ethics declarations

Competing interests

Stanford has submitted a provisional patent application for this work with patent number 18/538,743, United States, 2023. The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Moritz Gerstung, Ke Yuan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Cell-percentage comparison between using bulk RNA-seq and deconvolved expression.

a. Percentage of lymphocytes detected by Hover-Net in synthetic tiles generated using bulk RNA-seq and haematopoietic deconvolved RNA-seq. A significantly higher percentage of lymphocytes was found in four out of five cancer types (TCGA-CESC p-value = 0.15; TCGA-KIRP p-value = 6.08 × 10⁻²¹; TCGA-LUAD p-value = 9.86 × 10⁻¹⁶; TCGA-GBM p-value = 2.02 × 10⁻⁷; TCGA-COAD p-value = 1.07 × 10⁻²²). The median difference is annotated in the plot per cancer type. b. UMAP projection of the bulk RNA-seq expression (circles) and the corresponding deconvolved haematopoietic RNA-seq expression (crosses). Clear differences can be observed in the expression, with a mean percentage difference of 7% across the cancer types, which corresponds to a similar increase in lymphocytes in the majority of the cancer types.
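
The caption reports a median difference and a p-value per cancer type but does not name the statistical test behind them. As a rough illustration of this kind of two-group comparison, the sketch below computes a difference of medians and a two-sided permutation p-value. The permutation test is a stand-in chosen for this example, not necessarily the authors' method, and the tile percentages, function names and parameters are all hypothetical.

```python
import random
import statistics


def median_difference(group_a, group_b):
    """Median of group_b minus median of group_a."""
    return statistics.median(group_b) - statistics.median(group_a)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of medians.

    NOTE: the caption does not name its statistical test; a permutation
    test is used here purely as an illustrative stand-in.
    """
    rng = random.Random(seed)
    observed = abs(median_difference(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(median_difference(pooled[:n_a], pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical lymphocyte percentages per synthetic tile (made-up numbers).
bulk_tiles = [4.0, 5.5, 6.1, 3.8, 5.0, 4.7, 6.3, 5.2]
deconvolved_tiles = [11.2, 12.8, 10.5, 13.1, 12.0, 11.7, 12.4, 13.5]

delta = median_difference(bulk_tiles, deconvolved_tiles)
p_value = permutation_p_value(bulk_tiles, deconvolved_tiles)
print(f"median difference = {delta:.2f} percentage points, p = {p_value:.4f}")
```

A permutation test makes no distributional assumption about the per-tile percentages, which is why it is used here; a rank-based test would be an equally reasonable choice.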

Extended Data Fig. 2 Microsatellite-instability-status prediction.

Comparison between a model trained from scratch and a model pretrained using SimCLR on synthetic tiles, for different numbers of real tiles sampled from the training set. Metrics are computed using fivefold cross-validation (CV), and results correspond to those obtained on the different test sets. The model pretrained on the synthetic tiles always outperforms the model trained from scratch, regardless of the number of training samples used.
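
The evaluation protocol described in this caption (fivefold CV, with the training portion subsampled to several sizes) can be sketched as follows. This is a minimal illustration of the data-splitting logic only: the sample count, the `train_sizes` values and the function names are assumptions made for the example, and model pretraining and training are outside its scope.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds disjoint test folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def learning_curve_splits(n_samples, train_sizes, n_folds=5, seed=0):
    """Yield (fold_id, train_size, train_subset, test_indices) tuples.

    For each fold, the held-out test set stays fixed while the training
    pool is subsampled to each requested size.
    """
    folds = kfold_indices(n_samples, n_folds, seed)
    rng = random.Random(seed + 1)
    for fold_id, test_idx in enumerate(folds):
        # Everything outside the current test fold is available for training.
        train_pool = [i for k, fold in enumerate(folds) if k != fold_id for i in fold]
        for size in train_sizes:
            size = min(size, len(train_pool))
            yield fold_id, size, rng.sample(train_pool, size), test_idx


# Hypothetical setting: 100 real tiles, three training-set sizes.
splits = list(learning_curve_splits(n_samples=100, train_sizes=[10, 40, 80]))
for fold_id, size, train_idx, test_idx in splits[:3]:
    print(f"fold {fold_id}: {size} training tiles, {len(test_idx)} test tiles")
```

Both models being compared would be trained on the same `train_idx` subset and scored on the same `test_idx` fold, so that any difference in the metric is attributable to pretraining and not to the split.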

Supplementary information

Supplementary Information

Supplementary Figures and Tables.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Carrillo-Perez, F., Pizurica, M., Zheng, Y. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat. Biomed. Eng (2024). https://doi.org/10.1038/s41551-024-01193-8

Received: 27 December 2022
Accepted: 29 February 2024
Published: 21 March 2024
DOI: https://doi.org/10.1038/s41551-024-01193-8