Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models

Abstract

Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.
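The diffusion models mentioned in the abstract build on the forward noising process of denoising diffusion probabilistic models (ref. 67). The following is a minimal sketch of that process only, not the authors' implementation: the linear schedule values (1e-4 to 0.02 over 1,000 steps) are assumptions taken from the DDPM literature, the four-element vector is a toy stand-in for image pixels, and the conditioning on the RNA-seq latent representation and the cascaded super-resolution stages are omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed values, see lead-in)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # hypothetical flattened pixel values
xt, eps = noise_sample(x0, t=999, alpha_bars=alpha_bars, rng=rng)
```

At the last step almost no signal remains (alpha_bar is close to zero), which is why a trained denoiser can start generation from pure Gaussian noise and, in a conditional setting, be steered by an external embedding.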

Fig. 1: RNA-CDM model architecture used for generating RNA-seq embeddings and synthetic WSI tiles using diffusion models.
Fig. 2: RNA-to-image multicancer synthetic samples generated by conditioning on the gene-expression latent representation.
Fig. 3: Synthetic samples maintain the cell distributions observed in real-world data.
Fig. 4: Pretraining on synthetic samples improves classification performance in a multicancer classification problem.

Data availability

TCGA data can be downloaded from the GDC platform (https://portal.gdc.cancer.gov/). The two GEO series used in this study can be downloaded from the GEO platform: GSE50760 and GSE226069. The PBTA dataset can be downloaded from the Gabriella Miller Kids First Data Resource Portal (KF-DRC, https://kidsfirstdrc.org). Microsatellite-instability-status data can be downloaded from the Kaggle platform: https://www.kaggle.com/datasets/joangibert/tcga_coad_msi_mss_jpg. Case IDs used for this work as well as the RNA-seq encodings obtained for all experiments are available under an academic-use-only licence at https://rna-cdm.stanford.edu. One million synthetic images are available in the Dryad platform at https://doi.org/10.5061/dryad.6djh9w174 (ref. 77).

Code availability

A demo for generating synthetic images and the code are available under an academic-use-only licence at https://rna-cdm.stanford.edu.

References

  1. Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).

  2. Jones, P. A. & Baylin, S. B. The epigenomics of cancer. Cell 128, 683–692 (2007).

  3. Lujambio, A. & Lowe, S. W. The microcosmos of cancer. Nature 482, 347–355 (2012).

  4. Frangioni, J. V. New technologies for human cancer imaging. J. Clin. Oncol. 26, 4012–4021 (2008).

  5. Williams, B. J., Bottoms, D. & Treanor, D. Future-proofing pathology: the case for clinical adoption of digital pathology. J. Clin. Pathol. 70, 1010–1018 (2017).

  6. Heindl, A., Nawaz, S. & Yuan, Y. Mapping spatial heterogeneity in the tumor microenvironment: a new era for digital pathology. Lab. Invest. 95, 377–384 (2015).

  7. Cheng, J. et al. Identification of topological features in renal tumor microenvironment associated with patient survival. Bioinformatics 34, 1024–1030 (2018).

  8. Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).

  9. Castillo, D. et al. Integration of RNA-seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinformatics 18, 506 (2017).

  10. Yu, D. et al. Copy number variation in plasma as a tool for lung cancer prediction using Extreme Gradient Boosting (XGBoost) classifier. Thorac. Cancer 11, 95–102 (2020).

  11. Maros, M. E. et al. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nat. Protoc. 15, 479–512 (2020).

  12. Chen, R. J. et al. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16144–16155 (IEEE, 2022).

  13. Carrillo-Perez, F. et al. Machine-learning-based late fusion on multi-omics and multi-scale data for non-small-cell lung cancer diagnosis. J. Pers. Med. 12, 601 (2022).

  14. Lee, C. & van der Schaar, M. A variational information bottleneck approach to multi-omics data integration. In International Conference on Artificial Intelligence and Statistics 1513–1521 (PMLR, 2021).

  15. Chen, R. J. et al. Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Trans. Med. Imaging 41, 757–770 (2020).

  16. Cheerla, A. & Gevaert, O. Deep learning with multimodal representation for pancancer prognosis prediction. Bioinformatics 35, i446–i454 (2019).

  17. Chen, R. J. et al. Pan-cancer integrative histology–genomic analysis via multimodal deep learning. Cancer Cell 40, 865–878 (2022).

  18. Vanguri, R. S. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L)1 blockade in patients with non-small cell lung cancer. Nat. Cancer 3, 1151–1164 (2022).

  19. Lipkova, J. et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 40, 1095–1110 (2022).

  20. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).

  21. Jennings, C. N. et al. Bridging the gap with the UK Genomics Pathology Imaging Collection. Nat. Med. 28, 1107–1108 (2022).

  22. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012).

  23. Quiros, A. C., Murray-Smith, R. & Yuan, K. PathologyGAN: learning deep representations of cancer tissue. In Proceedings of the Third Conference on Medical Imaging with Deep Learning 121, 669–695 (PMLR, 2020).

  24. Quiros, A. C., Murray-Smith, R. & Yuan, K. Learning a low dimensional manifold of real cancer tissue with PathologyGAN. Preprint at https://arxiv.org/abs/1907.02644v5 (2020).

  25. Viñas, R., Andrés-Terré, H., Liò, P. & Bryson, K. Adversarial generation of gene expression data. Bioinformatics 38, 730–737 (2022).

  26. Mitra, R. & MacLean, A. L. RVAgene: generative modeling of gene expression time series data. Bioinformatics 37, 3252–3262 (2021).

  27. Qiu, Y. L., Zheng, H. & Gevaert, O. Genomic data imputation with variational auto-encoders. Gigascience 9, giaa082 (2020).

  28. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 5769–5779 (Curran Associates, 2017).

  29. Metz, L., Poole, B., Pfau, D. & Sohl-Dickstein, J. Unrolled generative adversarial networks. Preprint at https://doi.org/10.48550/arXiv.1611.02163 (2016).

  30. Salimans, T. et al. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29 (eds Lee, D. et al.) 2234–2242 (Curran Associates, 2016).

  31. Zhao, S., Song, J. & Ermon, S. InfoVAE: balancing learning and inference in variational autoencoders. Proc. AAAI Conf. Artif. Intell. 33, 5885–5892 (2019).

  32. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://doi.org/10.48550/arXiv.2204.06125 (2022).

  33. Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems 35, 36479–36494 (PMLR, 2022).

  34. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning 2256–2265 (PMLR, 2015).

  35. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning 8748–8763 (PMLR, 2021).

  36. Yu, K. H. et al. Association of omics features with histopathology patterns in lung adenocarcinoma. Cell Syst. 5, 620–627 (2017).

  37. Fu, Y. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).

  38. Schmauch, B. et al. A deep learning model to predict RNA-seq expression of tumours from whole slide images. Nat. Commun. 11, 3877 (2020).

  39. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://doi.org/10.48550/arXiv.1802.03426 (2018).

  40. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 6629–6640 (Curran Associates, 2017).

  41. Binkowski, M., Sutherland, D. J., Arbel, M. & Gretton, A. Demystifying MMD GANs. Preprint at https://doi.org/10.48550/arXiv.1801.01401 (2018).

  42. Kim, S. K. et al. A nineteen gene-based risk score classifier predicts prognosis of colorectal cancer patients. Mol. Oncol. 8, 1653–1666 (2014).

  43. Quintanal-Villalonga, A. et al. Comprehensive molecular characterization of lung tumors implicates AKT and MYC signaling in adenocarcinoma to squamous cell transdifferentiation. J. Hematol. Oncol. 14, 170 (2021).

  44. Graham, S. et al. Hover-Net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med. Image Anal. 58, 101563 (2019).

  45. Karimi, E. et al. Single-cell spatial immune landscapes of primary and metastatic brain tumours. Nature 614, 555–563 (2023).

  46. Han, S. et al. Rescuing defective tumor-infiltrating T-cell proliferation in glioblastoma patients. Oncol. Lett. 12, 2924–2929 (2016).

  47. Steyaert, S. et al. Multimodal deep learning to predict prognosis in adult and pediatric brain tumors. Commun. Med. 3, 44 (2023).

  48. Lehrer, M. et al. in Advances in Biology and Treatment of Glioblastoma (ed. Somasundaram, K.) 143–159 (Springer, 2017).

  49. Yamashita, R. et al. Deep learning model for the prediction of microsatellite instability in colorectal cancer: a diagnostic study. Lancet Oncol. 22, 132–141 (2021).

  50. Marisa, L. et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med. 10, e1001453 (2013).

  51. Li, W. et al. High resolution histopathology image generation and segmentation through adversarial training. Med. Image Anal. 75, 102251 (2022).

  52. Karras, T., Aittala, M., Aila, T. & Laine, S. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35, 26565–26577 (PMLR, 2022).

  53. Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).

  54. Azizi, S. et al. Robust and efficient medical imaging with self-supervision. Preprint at https://doi.org/10.48550/arXiv.2205.09723 (2022).

  55. Dries, R. et al. Advances in spatial transcriptomic data analysis. Genome Res. 31, 1706–1718 (2021).

  56. Zheng, H., Brennan, K., Hernaez, M. & Gevaert, O. Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples. Gigascience 8, giz145 (2019).

  57. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).

  58. Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).

  59. Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).

  60. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979).

  61. Goode, A., Gilbert, B., Harkes, J., Jukic, D. & Satyanarayanan, M. OpenSlide: a vendor-neutral software foundation for digital pathology. J. Pathol. Inform. 4, 27 (2013).

  62. Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).

  63. Ijaz, H. et al. Pediatric high-grade glioma resources from the Children’s Brain Tumor Tissue Consortium. Neuro Oncol. 22, 163–165 (2020).

  64. Higgins, I. et al. beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations 1–13 (ICLR, 2017).

  65. Hyvärinen, A. & Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709 (2005).

  66. Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 23, 1661–1674 (2011).

  67. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).

  68. Ho, J. et al. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 1–33 (2022).

  69. Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention (eds Navab, N. et al.) 234–241 (Springer, 2015).

  70. Grill, J. B. et al. Bootstrap your own latent–a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020).

  71. Kaiser, L. et al. Fast decoding in sequence models using discrete latent variables. Proc. Mach. Learn. Res. 80, 2390–2399 (2018).

  72. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).

  73. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).

  74. Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the yield of medical tests. JAMA 247, 2543–2546 (1982).

  75. Longato, E., Vettoretti, M. & Di Camillo, B. A practical perspective on the concordance index for the evaluation and selection of prognostic time-to-event models. J. Biomed. Inform. 108, 103496 (2020).

  76. Graf, E., Schmoor, C., Sauerbrei, W. & Schumacher, M. Assessment and comparison of prognostic classification schemes for survival data. Stat. Med. 18, 2529–2545 (1999).

  77. Carrillo-Perez, F. RNA-to-image multi-cancer synthesis using cascaded diffusion models, one million synthetic images. Dryad https://doi.org/10.5061/dryad.6djh9w174 (2023).

Acknowledgements

The results published here are in whole or in part based on data generated by the TCGA Research Network (https://www.cancer.gov/tcga). F.C.-P. was supported by MCIN/AEI/10.13039/501100011033 (grant number PID2021-128317OB-I00), Consejería de Universidad, Investigación e Innovación (grant number P20-00163), which are both funded by ‘ERDF A way of making Europe.’, and a Predoctoral scholarship from the Fulbright Spanish Commission. M.P. was supported by the Belgian American Educational Foundation and FWO (grant number 1161223N). Research reported here was further supported by the National Cancer Institute (NCI) (grant number R01 CA260271). This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Contributions

F.C.-P., M.P. and O.G. conceived and designed the study. F.C.-P., M.P. and Y.Z. performed data preprocessing. F.C.-P. developed the code. T.N.N. and R.M. contributed to code optimization and parallel training. R.M. and T.N.N. provided access to the Argonne National Laboratory platform. J.S. performed the analysis of the clinical impact and analysed the digital pathology quality. Y.Z. obtained the deconvolved RNA-seq data. F.C.-P. and M.P. generated the figures. O.G. supervised the work and obtained the funding. F.C.-P. and O.G. wrote the manuscript with contributions and/or revisions from all authors.

Corresponding author

Correspondence to Olivier Gevaert.

Ethics declarations

Competing interests

Stanford has submitted a provisional patent application for this work with patent number 18/538,743, United States, 2023. The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Moritz Gerstung, Ke Yuan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Cell-percentage comparison between using bulk RNA-Seq and de-convolved expression.

a. Percentage of lymphocyte cells found by Hover-Net in synthetic tiles generated using bulk RNA-Seq and haematopoietic deconvolved RNA-Seq. A higher percentage of lymphocytes was found, with the difference reaching significance in four out of five cancer types (TCGA-CESC p-value = 0.15; TCGA-KIRP p-value = 6.08 × 10^−21; TCGA-LUAD p-value = 9.86 × 10^−16; TCGA-GBM p-value = 2.02 × 10^−7; TCGA-COAD p-value = 1.07 × 10^−22). The median difference is annotated in the plot per cancer type. b. UMAP projection of the bulk RNA-Seq expression (circles) and the counterpart deconvolved haematopoietic RNA-Seq (crosses). Clear differences can be observed in the expression, with a mean percentage difference of 7% across the cancer types, which corresponds to a similar increase in lymphocytes in the majority of the cancer types.
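A comparison of this kind (a per-cancer-type median difference with an accompanying p-value) can be illustrated with a short sketch. The statistical test behind the reported p-values is not stated in this caption, so the two-sided permutation test below is a swapped-in assumption, and the lymphocyte percentages are hypothetical toy values rather than data from the study.

```python
import random
import statistics


def median_difference(group_a, group_b):
    """Median of group_b minus median of group_a."""
    return statistics.median(group_b) - statistics.median(group_a)


def permutation_p_value(group_a, group_b, num_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference of medians."""
    rng = random.Random(seed)
    observed = abs(median_difference(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        diff = abs(median_difference(pooled[:n_a], pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (num_permutations + 1)


# Hypothetical lymphocyte percentages per synthetic tile (toy values).
bulk_tiles = [float(v) for v in range(1, 11)]
deconvolved_tiles = [v + 20.0 for v in bulk_tiles]

diff = median_difference(bulk_tiles, deconvolved_tiles)
p_separated = permutation_p_value(bulk_tiles, deconvolved_tiles)
p_identical = permutation_p_value(bulk_tiles, bulk_tiles)
```

A permutation test makes no distributional assumption, which suits percentages bounded between 0 and 100; a rank-based test would be an equally reasonable choice.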

Extended Data Fig. 2 Microsatellite-instability-status prediction.

Comparison between a model trained from scratch and a model that has been pretrained using SimCLR on synthetic tiles, for different numbers of real tiles sampled from the training set. Metrics are computed with fivefold cross-validation (CV), and results correspond to those obtained on the different test sets. The model pretrained on the synthetic tiles always outperforms the model trained from scratch, regardless of the number of training samples used.
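The evaluation protocol described in this caption (fivefold CV repeated for several training-set sizes) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `train_and_score` is a hypothetical callback standing in for training a classifier (from scratch or from SimCLR-pretrained weights) and returning a test metric, and the budgets and dataset size are arbitrary.

```python
import random


def k_fold_indices(num_items, k=5, seed=0):
    """Yield (train, test) index lists for k disjoint folds."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def subsample(train_indices, num_samples, seed=0):
    """Pick at most num_samples training items without replacement."""
    rng = random.Random(seed)
    return rng.sample(train_indices, min(num_samples, len(train_indices)))


def mean_score_per_budget(num_items, budgets, train_and_score, k=5):
    """Average the test score over k folds for each training budget."""
    results = {}
    for budget in budgets:
        scores = []
        for fold_id, (train, test) in enumerate(k_fold_indices(num_items, k)):
            subset = subsample(train, budget, seed=fold_id)
            scores.append(train_and_score(subset, test))
        results[budget] = sum(scores) / len(scores)
    return results


def toy_score(train_subset, test_indices):
    # Hypothetical placeholder for "train a model, return a test metric".
    return len(train_subset) / (len(train_subset) + len(test_indices))


results = mean_score_per_budget(100, budgets=[5, 10, 50], train_and_score=toy_score)
```

Running the same loop once with a from-scratch model and once with a pretrained one, on identical folds and subsamples, keeps the two curves directly comparable.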

Supplementary information

Supplementary Information

Supplementary Figures and Tables.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Carrillo-Perez, F., Pizurica, M., Zheng, Y. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat. Biomed. Eng (2024). https://doi.org/10.1038/s41551-024-01193-8

  • DOI: https://doi.org/10.1038/s41551-024-01193-8
