Transfer learning enables predictions in network biology

Abstract

Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding [1,2] and computer vision [3] by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.

Fig. 1: Geneformer architecture and transfer learning strategy.
Fig. 2: Geneformer boosted predictions of gene dosage sensitivity with limited data.
Fig. 3: Geneformer boosted predictions of chromatin dynamics with limited data.
Fig. 4: Geneformer encoded gene network hierarchy.
Fig. 5: In silico deletion revealed network connections.
Fig. 6: In silico treatment revealed candidate therapeutic targets.

Data availability

Genecorpus-30M is available on the Huggingface Dataset Hub at https://huggingface.co/datasets/ctheodoris/Genecorpus-30M.

Code availability

The pretrained Geneformer model, transcriptome tokenizer and code for pretraining and fine-tuning the model are available on the Huggingface Model Hub at https://huggingface.co/ctheodoris/Geneformer. All other code used in this study is available upon request from the corresponding authors.
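
As a minimal sketch of how the resources named in the two availability statements could be retrieved programmatically, the Python snippet below uses `snapshot_download` from the `huggingface_hub` client library. Only the two repository identifiers come from the statements above. The helper names `hub_url` and `fetch` are hypothetical, and the use of `huggingface_hub` is an assumption, not the authors' documented workflow.

```python
"""Sketch: locate and fetch the Geneformer model and Genecorpus-30M from the Hugging Face Hub."""

MODEL_REPO = "ctheodoris/Geneformer"        # from the Code availability statement
DATASET_REPO = "ctheodoris/Genecorpus-30M"  # from the Data availability statement


def hub_url(repo_id: str, repo_type: str = "model") -> str:
    """Rebuild the Hub web URL for a repo (hypothetical helper).

    Dataset repos live under the /datasets/ path; model repos sit at the root.
    """
    prefix = "datasets/" if repo_type == "dataset" else ""
    return f"https://huggingface.co/{prefix}{repo_id}"


def fetch(repo_id: str, repo_type: str = "model", local_dir=None) -> str:
    """Download a full repo snapshot and return its local path (hypothetical helper).

    Assumes network access and `pip install huggingface_hub`. The import is
    deferred so this module still loads when the package is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)


if __name__ == "__main__":
    print(hub_url(MODEL_REPO))
    print(hub_url(DATASET_REPO, "dataset"))
    # Uncomment to download. Both repositories are large.
    # model_dir = fetch(MODEL_REPO)
    # corpus_dir = fetch(DATASET_REPO, repo_type="dataset")
```

Passing `repo_type="dataset"` matters for the corpus, because the Hub keeps models and datasets in separate namespaces. Code described as available upon request is not covered by this sketch.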

References

  1. Vaswani, A. et al. Attention is all you need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2017).

  2. Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).

  3. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  4. Theodoris, C. V. et al. Human disease modeling reveals integrated transcriptional and epigenetic mechanisms of NOTCH1 haploinsufficiency. Cell 160, 1072–1086 (2015).

  5. Theodoris, C. V. et al. Network-based screen in iPSC-derived cells reveals therapeutic candidate for heart valve disease. Science 371, eabd0724 (2021).

  6. Shao, X. et al. ScDeepSort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic Acids Res. 49, e122 (2021).

  7. Lieberman, Y., Rokach, L. & Shay, T. CaSTLe—classification of single cells by transfer learning: harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS ONE 13, e0205499 (2018).

  8. Lin, T., Wang, Y., Liu, X. & Qiu, X. A survey of transformers. Preprint at https://doi.org/10.48550/arXiv.2106.04554 (2021).

  9. Ren, J. et al. ZeRO-offload: democratizing billion-scale model training. In Proc. 2021 USENIX Annual Technical Conference 551–564 (USENIX, 2021).

  10. Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. ZeRO: memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis 1–16 (IEEE, 2020).

  11. Selewa, A. et al. Systematic comparison of high-throughput single-cell and single-nucleus transcriptomes during cardiomyocyte differentiation. Sci. Rep. 10, 1535 (2020).

  12. 10x Genomics Datasets https://www.10xgenomics.com/resources/datasets/frozen-pbm-cs-donor-a-1-standard-1-1-0.

  13. 10x Genomics Datasets https://www.10xgenomics.com/resources/datasets/fresh-68-k-pbm-cs-donor-a-1-standard-1-1-0.

  14. Li, Y. et al. Single-cell transcriptome analysis reveals dynamic cell populations and differential gene expression patterns in control and aneurysmal human aortic tissue. Circulation 142, 1374–1388 (2020).

  15. Xing, Q. R. et al. Diversification of reprogramming trajectories revealed by parallel single-cell transcriptome and chromatin accessibility sequencing. Sci. Adv. 6, 463–474 (2020).

  16. Guo, D. et al. iMyoblasts for ex vivo and in vivo investigations of human myogenesis and disease modeling. eLife 11, e70341 (2022).

  17. Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).

  18. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  19. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  20. Shihab, H. A., Rogers, M. F., Campbell, C. & Gaunt, T. R. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics 33, 1751–1757 (2017).

  21. Ni, Z., Zhou, X. Y., Aslam, S. & Niu, D. K. Characterization of human dosage-sensitive transcription factor genes. Front. Genet. 10, 1208 (2019).

  22. Collins, R. L. et al. A cross-disorder dosage sensitivity map of the human genome. Cell 185, 3041–3055 (2022).

  23. Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, 808 (2020).

  24. Pirruccello, J. P. et al. Analysis of cardiac magnetic resonance imaging in 36,000 individuals yields genetic insights into dilated cardiomyopathy. Nat. Commun. 11, 2254 (2020).

  25. Bolte, C. et al. Expression of Foxm1 transcription factor in cardiomyocytes is required for myocardial development. PLoS ONE 6, e22217 (2011).

  26. Bolte, C. et al. Postnatal ablation of Foxm1 from cardiomyocytes causes late onset cardiac hypertrophy and fibrosis without exacerbating pressure overload-induced cardiac remodeling. PLoS ONE 7, e48713 (2012).

  27. Currey, L., Thor, S. & Piper, M. TEAD family transcription factors in development and disease. Development 148, dev196675 (2021).

  28. Bernstein, B. E. et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125, 315–326 (2006).

  29. Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).

  30. Pan, G. et al. Whole-genome analysis of histone H3 lysine 4 and lysine 27 methylation in human embryonic stem cells. Cell Stem Cell 1, 299–312 (2007).

  31. Chen, C. H. et al. Determinants of transcription factor regulatory range. Nat. Commun. 11, 2472 (2020).

  32. Litviňuková, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).

  33. Ang, Y. S. et al. Disease model of GATA4 mutation reveals transcription factor cooperativity in human cardiogenesis. Cell 167, 1734–1749 (2016).

  34. Kathiriya, I. S. et al. Modeling human TBX5 haploinsufficiency predicts regulatory networks for congenital heart disease. Dev. Cell 56, 292–309 (2021).

  35. Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).

  36. Hinson, J. T. et al. Titin mutations in iPS cells define sarcomere insufficiency as a cause of dilated cardiomyopathy. Science 349, 982–986 (2015).

  37. Seidman, C. E. & Seidman, J. G. Identifying sarcomere gene mutations in hypertrophic cardiomyopathy: a personal history. Circ. Res. 108, 743–750 (2011).

  38. Kamisago, M. et al. Mutations in sarcomere protein genes as a cause of dilated cardiomyopathy. New Engl. J. Med. 343, 1688–1696 (2000).

  39. Ramaccini, D. et al. Mitochondrial function and dysfunction in dilated cardiomyopathy. Front. Cell Dev. Biol. https://doi.org/10.3389/fcell.2020.624216 (2021).

  40. Ho, D., Yan, L., Iwatsubo, K., Vatner, D. E. & Vatner, S. F. Modulation of β-adrenergic receptor signaling in heart failure and longevity: targeting adenylyl cyclase type 5. Heart Fail. Rev. 15, 495–512 (2010).

  41. Wagner, A. H. et al. DGIdb 2.0: mining clinically relevant drug-gene interactions. Nucleic Acids Res. 44, D1036–D1044 (2016).

  42. Nakagawa, O. et al. Centronuclear myopathy in mice lacking a novel muscle-specific protein kinase transcriptionally regulated by MEF2. Genes Dev. 19, 2066–2077 (2005).

  43. Akazawa, H. & Komuro, I. Roles of cardiac transcription factors in cardiac hypertrophy. Circ. Res. 92, 1079–1088 (2003).

  44. Henighan, T. et al. Scaling laws for autoregressive generative modeling. Preprint at https://doi.org/10.48550/arXiv.2010.14701 (2020).

  45. Madissoon, E. et al. ScRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol. 21, 1 (2019).

  46. Anderson, D. J. et al. NKX2-5 regulates human cardiomyogenesis via a HEY2 dependent transcriptional network. Nat. Commun. 9, 1373 (2018).

  47. Smillie, C. S. et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 178, 714–730 (2019).

  48. Lee, J. S. et al. Immunophenotyping of Covid-19 and influenza highlights the role of type I interferons in development of severe Covid-19. Sci. Immunol. 5, eabd1554 (2020).

  49. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).

  50. Fang, Z. et al. Single-cell heterogeneity analysis and CRISPR screen identify key β-cell-specific disease genes. Cell Rep. 26, 3132–3144 (2019).

  51. Agarwal, D. et al. A single-cell atlas of the human substantia nigra reveals cell-specific pathways associated with neurological disorders. Nat. Commun. 11, 4183 (2020).

  52. Rasouli, J. et al. A distinct GM-CSF+ T helper cell subset requires T-bet to adopt a TH1 phenotype and promote neuroinflammation. Sci. Immunol. 5, eaba9953 (2020).

  53. Park, J.-E. et al. A cell atlas of human thymic development defines T cell repertoire formation. Science 367, eaay3224 (2020).

  54. Mende, N. et al. Quantitative and molecular differences distinguish adult human medullary and extramedullary haematopoietic stem and progenitor cell landscapes. Preprint at BioRxiv https://doi.org/10.1101/2020.01.26.919753 (2020).

  55. Setty, M. et al. Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 37, 451–460 (2019).

  56. Popescu, D.-M. et al. Decoding human fetal liver haematopoiesis. Nature 574, 365–371 (2019).

  57. Vento-Tormo, R. et al. Single-cell reconstruction of the early maternal-fetal interface in humans. Nature 563, 347–353 (2018).

  58. Ramachandran, P. et al. Resolving the fibrotic niche of human liver cirrhosis at single-cell level. Nature 575, 512–518 (2019).

  59. Kinchen, J. et al. Structural remodeling of the human colonic mesenchyme in inflammatory bowel disease. Cell 175, 372–386 (2018).

  60. James, K. R. et al. Distinct microbial and immune niches of the human colon. Nat. Immunol. 21, 343–353 (2020).

  61. Zhou, L. et al. Single-cell RNA-seq analysis uncovers distinct functional human NKT cell sub-populations in peripheral blood. Front. Cell Dev. Biol. 8, 384 (2020).

  62. Liao, J. et al. Single-cell RNA sequencing of human kidney. Sci. Data 7, 4 (2020).

  63. Jäkel, S. et al. Altered human oligodendrocyte heterogeneity in multiple sclerosis. Nature 566, 543–547 (2019).

  64. Merrick, D. et al. Identification of a mesenchymal progenitor cell hierarchy in adipose tissue. Science 364, eaav2501 (2019).

  65. Habermann, A. C. et al. Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Sci. Adv. 6, eaba1972 (2020).

  66. Rosa, F. F. et al. Direct reprogramming of fibroblasts into antigen-presenting dendritic cells. Sci. Immunol. 3, eaau4292 (2018).

  67. Stewart, B. J. et al. Spatiotemporal immune zonation of the human kidney. Science 365, 1461–1466 (2019).

  68. MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).

  69. Welch, J. et al. Integrative inference of brain cell similarities and differences from single-cell genomics. Preprint at BioRxiv https://doi.org/10.1101/459891 (2018).

  70. Ledergor, G. et al. Single cell dissection of plasma cell heterogeneity in symptomatic and asymptomatic myeloma. Nat. Med. 24, 1867–1876 (2018).

  71. Lukowski, S. W. et al. A single-cell transcriptome atlas of the adult human retina. EMBO J. 38, e100811 (2019).

  72. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

  73. Zirkel, A. et al. HMGB2 loss upon senescence entry disrupts genomic organization and induces CTCF clustering across cell types. Mol. Cell 70, 730–744 (2018).

  74. Goudot, C. et al. Aryl hydrocarbon receptor controls monocyte differentiation into dendritic cells versus macrophages. Immunity 47, 582–596 (2017).

  75. McCauley, K. B. et al. Single-cell transcriptomic profiling of pluripotent stem cell-derived SCGB3A2+ airway epithelium. Stem Cell Rep. 10, 1579–1595 (2018).

  76. Das, R. et al. Early B cell changes predict autoimmunity following combination immune checkpoint blockade. J. Clin. Invest. 128, 715–720 (2018).

  77. Kini Bailur, J. et al. Changes in bone marrow innate lymphoid cell subsets in monoclonal gammopathy: target for IMiD therapy. Blood Adv. 1, 2343–2347 (2017).

  78. Patil, V. S. et al. Precursors of human CD4+ cytotoxic T lymphocytes identified by single-cell transcriptome analysis. Sci. Immunol. 3, eaan8664 (2018).

  79. Wang, C. et al. Expansion of hedgehog disrupts mesenchymal identity and induces emphysema phenotype. J. Clin. Invest. 128, 4343–4358 (2018).

  80. Hermann, B. P. et al. The mammalian spermatogenesis single-cell transcriptome, from spermatogonial stem cells to spermatids. Cell Rep. 25, 1650–1667 (2018).

  81. Menon, R. et al. Single-cell analysis of progenitor cell dynamics and lineage specification in the human fetal kidney. Development 145, dev164038 (2018).

  82. Czerniecki, S. M. et al. High-throughput screening enhances kidney organoid differentiation from human pluripotent stem cells and enables automated multidimensional phenotyping. Cell Stem Cell 22, 929–940 (2018).

  83. Papa, L. et al. Ex vivo human HSC expansion requires coordination of cellular reprogramming with mitochondrial remodeling and p53 activation. Blood Adv. 2, 2766–2779 (2018).

  84. Schulthess, J. et al. The short chain fatty acid butyrate imprints an antimicrobial program in macrophages. Immunity 50, 432–445 (2019).

  85. Guo, J. et al. The adult human testis transcriptional cell atlas. Cell Res. 28, 1141–1157 (2018).

  86. Karow, M. et al. Direct pericyte-to-neuron reprogramming via unfolding of a neural stem cell-like program. Nat. Neurosci. 21, 932–940 (2018).

  87. Xin, Y. et al. Pseudotime ordering of single human β-cells reveals states of insulin production and unfolded protein response. Diabetes 67, 1783–1794 (2018).

  88. Phipson, B. et al. Evaluation of variability in human kidney organoids. Nat. Methods 16, 79–87 (2019).

  89. Balan, S. et al. Large-scale human dendritic cell differentiation revealing notch-dependent lineage bifurcation and heterogeneity. Cell Rep. 24, 1902–1915 (2018).

  90. Milpied, P. et al. Human germinal center transcriptional programs are de-synchronized in B cell lymphoma. Nat. Immunol. 19, 1013–1024 (2018).

  91. Parikh, K. et al. Colonic epithelial cell diversity in health and inflammatory bowel disease. Nature 567, 49–55 (2019).

  92. Habiel, D. M. et al. CCR10+ epithelial cells from idiopathic pulmonary fibrosis lungs drive remodeling. JCI Insight 3, e122211 (2018).

  93. Paik, D. T. et al. Large-scale single-cell RNA-seq reveals molecular signatures of heterogeneous populations of human induced pluripotent stem cell-derived endothelial cells. Circ. Res. 123, 443–450 (2018).

  94. Martin, J. C. et al. Single-cell analysis of Crohn’s disease lesions identifies a pathogenic cellular module associated with resistance to anti-TNF therapy. Cell 178, 1493–1508 (2019).

  95. Zheng, Y. et al. A human circulating immune cell landscape in aging and COVID-19. Protein Cell 11, 740–770 (2020).

  96. Hochane, M. et al. Single-cell transcriptomics reveals gene expression dynamics of human fetal kidney development. PLoS Biol. 17, e3000152 (2019).

  97. Sohni, A. et al. The neonatal and adult human testis defined at the single-cell level. Cell Rep. 26, 1501–1517 (2019).

  98. Tran, T. et al. In vivo developmental trajectories of human podocyte inform in vitro differentiation of pluripotent stem cell-derived podocytes. Dev. Cell 50, 102–116 (2019).

  99. Wang, Y. et al. Single-cell transcriptome analysis reveals differential nutrient absorption functions in human intestine. J. Exp. Med. 217, e20191130 (2020).

  100. Vieira Braga, F. A. et al. A cellular census of human lungs identifies novel cell states in health and in asthma. Nat. Med. 25, 1153–1163 (2019).

  101. Guo, J. et al. The dynamic transcriptional cell atlas of testis development during human puberty. Cell Stem Cell 26, 262–276 (2020).

  102. Voigt, A. P. et al. Single-cell transcriptomics of the human retinal pigment epithelium and choroid in health and macular degeneration. Proc. Natl Acad. Sci. USA 116, 24100–24107 (2019).

  103. Menon, M. et al. Single-cell transcriptomic atlas of the human retina identifies cell types associated with age-related macular degeneration. Nat. Commun. 10, 4902 (2019).

  104. Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 26, 1070–1076 (2020).

  105. Li, B. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat. Methods 17, 793–798 (2020).

  106. Daniszewski, M. et al. Single cell RNA sequencing of stem cell-derived retinal ganglion cells. Sci. Data 5, 180013 (2018).

  107. Goveia, J. et al. An integrated gene expression landscape profiling approach to identify lung tumor endothelial cell heterogeneity and angiogenic candidates. Cancer Cell 37, 21–36 (2020).

  108. Norelli, M. et al. Monocyte-derived IL-1 and IL-6 are differentially required for cytokine-release syndrome and neurotoxicity due to CAR T cells. Nat. Med. 24, 739–748 (2018).

  109. Daniszewski, M. et al. Single-cell profiling identifies key pathways expressed by iPSCs cultured in different commercial media. iScience 7, 30–39 (2018).

  110. Miller, A. J. et al. In vitro and in vivo development of the human airway at single-cell resolution. Dev. Cell 53, 117–128 (2020).

  111. Silvin, A. et al. Elevated calprotectin and abnormal myeloid cell subsets discriminate severe from mild COVID-19. Cell 182, 1401–1418 (2020).

  112. Deprez, M. et al. A single-cell atlas of the human healthy airways. Am. J. Resp. Crit. Care Med. 202, 1636–1645 (2020).

  113. Sridhar, A. et al. Single-cell transcriptomic comparison of human fetal retina, hPSC-derived retinal organoids, and long-term retinal cultures. Cell Rep. 30, 1644–1659 (2020).

  114. Wu, H. et al. Comparative analysis and refinement of human PSC-derived kidney organoid differentiation with single-cell transcriptomics. Cell Stem Cell 23, 869–881 (2018).

  115. Vijay, J. et al. Single-cell analysis of human adipose tissue identifies depot and disease specific cell types. Nat. Metab. 2, 97–109 (2020).

  116. Solé-Boldo, L. et al. Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun. Biol. 3, 188 (2020).

  117. Adams, T. S. et al. Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis. Sci. Adv. 6, eaba1983 (2020).

  118. Moreira, L. M. et al. Paracrine signalling by cardiac calcitonin controls atrial fibrogenesis and arrhythmia. Nature 587, 460–465 (2020).

  119. Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).

  120. Bunis, D. G. et al. Single-cell mapping of progressive fetal-to-adult transition in human naive T cells. Cell Rep. 34, 108573 (2021).

  121. Plasschaert, L. W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).

  122. Takeda, A. et al. Single-cell survey of human lymphatics unveils marked endothelial cell heterogeneity and mechanisms of homing for neutrophils. Immunity 51, 561–572 (2019).

  123. Frumm, S. M. et al. A hierarchy of proliferative and migratory keratinocytes maintains the tympanic membrane. Cell Stem Cell 28, 315–330 (2021).

  124. Yu, Z. et al. Single-cell transcriptomic map of the human and mouse bladders. J. Am. Soc. Nephrol. 30, 2159–2176 (2019).

  125. Rubenstein, A. B. et al. Single-cell transcriptional profiles in human skeletal muscle. Sci. Rep. 10, 229 (2020).

  126. McCracken, I. R. et al. Transcriptional dynamics of pluripotent stem cell-derived endothelial cell differentiation revealed by single-cell RNA sequencing. Eur. Heart J. 41, 1024–1036 (2020).

  127. Hua, P. et al. Single-cell analysis of bone marrow-derived CD34+ cells from children with sickle cell disease and thalassemia. Blood 134, 2111–2115 (2019).

  128. Orozco, L. D. et al. Integration of eQTL and a single-cell atlas in the human eye identifies causal genes for age-related macular degeneration. Cell Rep. 30, 1246–1259 (2020).

  129. Hurley, K. et al. Reconstructed single-cell fate trajectories define lineage plasticity windows during differentiation of human PSC-derived distal lung progenitors. Cell Stem Cell 26, 593–608 (2020).

  130. Schafflick, D. et al. Integrated single cell analysis of blood and cerebrospinal fluid leukocytes in multiple sclerosis. Nat. Commun. 11, 247 (2020).

  131. Su, C. et al. Single-cell RNA sequencing in multiple pathologic types of renal cell carcinoma revealed novel potential tumor-specific markers. Front. Oncol. 11, 719564 (2021).

  132. He, J. et al. Dissecting human embryonic skeletal stem cell ontogeny by single-cell transcriptomic and functional analyses. Cell Res. 31, 742–757 (2021).

  133. Liao, M. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat. Med. 26, 842–844 (2020).

  134. Liu, X. et al. Reprogramming roadmap reveals route to human induced trophoblast stem cells. Nature 586, 101–107 (2020).

  135. He, S. et al. Single-cell transcriptome profiling of an adult human cell atlas of 15 major organs. Genome Biol. 21, 294 (2020).

  136. Wu, C.-L. et al. Single cell transcriptomic analysis of human pluripotent stem cell chondrogenesis. Nat. Commun. 12, 362 (2021).

  137. Cowan, C. S. et al. Cell types of the human retina and its organoids at single-cell resolution. Cell 182, 1623–1640 (2020).

  138. Savas, P. et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat. Med. 24, 986–993 (2018).

  139. Wang, L. et al. Single-cell map of diverse immune phenotypes in the metastatic brain tumor microenvironment of non small cell lung cancer. Preprint at BioRxiv https://doi.org/10.1101/2019.12.30.890517 (2019).

  140. Lu, Y.-C. et al. Single-cell transcriptome analysis reveals gene signatures associated with T-cell persistence following adoptive cell therapy. Cancer Immunol. Res. 7, 1824–1836 (2019).

  141. Wang, L. et al. The phenotypes of proliferating glioblastoma cells reside on a single axis of variation. Cancer Discov. 9, 1708–1719 (2019).

  142. Wang, R. et al. Adult human glioblastomas harbor radial glia-like cells. Stem Cell Rep. 14, 338–350 (2020).

  143. Wang, L., Catalan, F., Shamardani, K., Babikir, H. & Diaz, A. Ensemble learning for classifying single-cell data and projection across reference atlases. Bioinformatics 36, 3585–3587 (2020).

  144. Ruffin, A. T. et al. B cell signatures and tertiary lymphoid structures contribute to outcome in head and neck squamous cell carcinoma. Nat. Commun. 12, 3349 (2021).

  145. Zhang, Q. et al. Landscape and dynamics of single immune cells in hepatocellular carcinoma. Cell 179, 829–845 (2019).

  146. Song, Q. et al. Dissecting intratumoral myeloid cell plasticity by single cell RNA-seq. Cancer Med. 8, 3072–3085 (2019).

  147. Kim, N. et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat. Commun. 11, 2285 (2020).

  148. Tang-Huau, T.-L. et al. Human in vivo-generated monocyte-derived dendritic cells and macrophages cross-present antigens through a vacuolar pathway. Nat. Commun. 9, 2570 (2018).

  149. Peng, J. et al. Single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res. 29, 725–738 (2019).

  150. 10x Genomics Datasets: Single Cell Gene Expression. 10x Genomics https://www.10xgenomics.com/resources/datasets?menu%5Bproducts.name%5D=Single%20Cell%20Gene%20Expression&query=&page=1&configure%5Bfacets%5D%5B0%5D=chemistryVersionAndThroughput&configure%5Bfacets%5D%5B1%5D=pipeline.version&configure%5BhitsPerPage%5D=500.

  151. de Andrade, L. F. et al. Discovery of specialized NK cell populations infiltrating human melanoma metastases. JCI Insight 4, e133103 (2019).

  152. Zhang, P. et al. Dissecting the single-cell transcriptome network underlying gastric premalignant lesions and early gastric cancer. Cell Rep. 27, 1934–1947 (2019).

  153. Durante, M. A. et al. Single-cell analysis reveals new evolutionary complexity in uveal melanoma. Nat. Commun. 11, 496 (2020).

  154. Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database 2020, baaa073 (2020).

  155. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  156. Xin, J. et al. High-performance web services for querying gene and variant annotation. Genome Biol. 17, 91 (2016).

  157. Dunning, T. The t-digest: efficient estimates of distributions. Softw. Impacts 7, 100049 (2021).

  158. Lhoest, Q. et al. Datasets: a community library for natural language processing. Preprint at https://doi.org/10.48550/arXiv.2109.02846 (2021).

  159. Wolf, T. et al. HuggingFace’s transformers: state-of-the-art natural language processing. Preprint at https://doi.org/10.48550/arXiv.1910.03771 (2019).

  160. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. Preprint at https://doi.org/10.48550/arXiv.1711.05101 (2017).

Acknowledgements

We thank J. Rae for helpful scientific discussions and Google Research for providing tensor processing unit (TPU) resources for experimentation. P.T.E. was supported by grants from the National Institutes of Health (NIH) (1RO1HL092577, 1R01HL157635 and 5R01HL139731), American Heart Association Strategically Focused Research Networks (18SFRN34110082) and European Union (MAESTRIA 965286). C.V.T. was supported by NIH T32GM007748 and the Helen Hay Whitney Foundation Postdoctoral Fellowship. L.X. was supported by the American Heart Association (20CDA35260081).

Author information

Contributions

C.V.T. conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. L.X., A.C., Z.R.A.S., M.C.H., H.M. and E.M.B. performed experimental validation in engineered cardiac microtissues. M.D.C. performed preprocessing, cell annotation and differential expression analysis of the cardiomyopathy dataset. Z.Z. provided data from the TISCH database for inclusion in Genecorpus-30M. X.S.L. and P.T.E. designed analyses and supervised the work. C.V.T., X.S.L. and P.T.E. wrote the manuscript. All authors edited the manuscript.

Corresponding authors

Correspondence to Christina V. Theodoris or Patrick T. Ellinor.

Ethics declarations

Competing interests

X.S.L. conducted this work while on faculty at Dana-Farber Cancer Institute and is now a board member and CEO of GV20 Therapeutics. P.T.E. has received sponsored research support from Bayer AG, IBM Research, Bristol Myers Squibb and Pfizer. P.T.E. has also served on advisory boards or consulted for Bayer AG, MyoKardia and Novartis. A.C. is an employee of Bayer US LLC (a subsidiary of Bayer AG) and may own stock in Bayer AG. E.M.B. was a full-time employee of Bayer when this work was performed. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature thanks Amir Bashan, Natasa Przulj and Nathan Palpant for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Geneformer transfer learning strategy.

a, Schematic of standard modelling approach, which necessitates retraining a new model from scratch for each new task. b, Schematic of transfer learning strategy. Through a single initial self-supervised large-scale pretraining on a generalizable learning objective, the model gains fundamental knowledge of the learning domain that is then democratized to a multitude of downstream applications distinct from the pretraining learning objective, transferring knowledge to new tasks. c, Transcription factors are normalized by a statistically significantly lower factor (resulting in higher prioritization in the rank value encoding) compared to all genes. Housekeeping genes on average show a trend of a higher normalization factor (resulting in deprioritization in the rank value encoding) compared to all genes (*p < 0.05 by Wilcoxon, FDR-corrected; all genes n = 17,903, housekeeping genes n = 11, transcription factors n = 1,384; error bars = standard deviation). d, Pretraining was performed with a randomly subsampled corpus of 100,000 cells, holding out 10,000 cells for evaluation, with 3 different random seeds. Evaluation loss was essentially equivalent in the 3 trials, indicating robustness to the set of genes randomly masked for each cell during the pretraining. e, Pretraining was performed with a randomly subsampled corpus of 100,000 cells, holding out 10,000 cells for evaluation, with 3 different masking percentages. 15% masking had marginally lower evaluation loss compared to 5% or 30% masking. f, Pretraining was performed with a randomly subsampled corpus of 90,000 cells and the model was then fine-tuned to distinguish dosage-sensitive vs. -insensitive transcription factors using 10,000 cells that were either included in or excluded from the 90,000 cell pretraining corpus. Predictive potential on the downstream fine-tuning task was measured by fivefold cross-validation with these 10,000 cells, demonstrating essentially equivalent results by AUC, confusion matrices, and F1 score. Because the fine-tuning applications are trained on classification objectives that are completely separate from the masked learning objective, whether or not task-specific data was included in the pretraining corpus is not relevant to the downstream classification predictions.
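
Panel c refers to a per-gene normalization factor that sets each gene's priority in the rank value encoding. The following is a minimal sketch of that idea, assuming each cell's counts are divided by a per-gene factor and the detected genes are then ordered by the normalized value. The function name, gene labels and numbers are illustrative assumptions and not the authors' implementation.

```python
import numpy as np

def rank_value_encode(counts, norm_factors, gene_names):
    """Order the detected genes of one cell by normalized expression, highest first."""
    counts = np.asarray(counts, dtype=float)
    norm_factors = np.asarray(norm_factors, dtype=float)
    detected = np.flatnonzero(counts > 0)            # undetected genes are left out
    normalized = counts[detected] / norm_factors[detected]
    order = detected[np.argsort(-normalized, kind="stable")]
    return [gene_names[i] for i in order]

# Toy cell (assumed values): the housekeeping-like gene has the highest raw
# count but also the largest normalization factor, so it is deprioritized.
genes = ["HOUSEKEEPING_A", "TF_B", "GENE_C", "GENE_D"]
counts = [50, 4, 10, 0]
norm_factors = [100.0, 2.0, 10.0, 5.0]
encoding = rank_value_encode(counts, norm_factors, genes)
print(encoding)
```

In this toy cell the gene with the smallest normalization factor moves to the front despite its low raw count, which mirrors the prioritization of transcription factors described in panel c.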

Extended Data Fig. 2 Geneformer was context-aware and robust to batch-dependent technical artefacts.

a, Effect of gene versus the indicated batch-dependent technical artefact on pretrained Geneformer gene embeddings (*p < 0.05 by Wilcoxon, FDR-corrected; NS: non-significant). We found that the gene embeddings were robust to sequencing platform11, preservation method12,13, and individual patient variability14. b, UMAP of pretrained Geneformer cell embeddings of cells undergoing iPSC reprogramming appropriately captured temporal trajectory of reprogramming (cell types as annotated by original study15; iPSC negative or positive refers to expression of marker TRA-1-60). Cell embeddings suggested that cells which do not progress to the iPSC state bifurcate into an alternative fate compared to cells that progress to the iPSC state after the day 12 stage. c, Compared to in silico reprogramming with random genes, in silico reprogramming of fibroblasts by artificially adding OCT4, SOX2, KLF4, and MYC (OSKM) to the front of their rank value encodings significantly shifted the gene embeddings from their initial fibroblast state to the embedding of that gene in the iPSC state (*p < 0.05 by Wilcoxon). d, UMAP of pretrained Geneformer cell embeddings of cells undergoing iPSC to myoblast differentiation at the earlier S1 (PAX3+) and later S2B (PAX3+/MYOD+) stages (cell types as annotated by original study16). e, Compared to in silico differentiation with random genes, in silico differentiation of the early-stage myogenic cells by artificially adding MYOD to the front of their rank value encodings significantly shifted the gene embeddings from their earlier state to the embedding of that gene in the later MYOD+ myogenic state (*p < 0.05 by Wilcoxon).
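
Panels c and e describe in silico perturbation by adding genes to the front of a cell's rank value encoding. A minimal sketch of that manipulation on a list of gene tokens is given below; the function name and the toy fibroblast encoding are assumptions for illustration, and the downstream step of passing the perturbed encoding through the model to compare embeddings is not shown.

```python
def in_silico_activate(encoding, genes_to_add):
    """Place the given genes at the front of a rank value encoding.

    Genes already present are moved rather than duplicated; the relative
    order of all other genes is preserved.
    """
    added = list(genes_to_add)
    added_set = set(added)
    rest = [gene for gene in encoding if gene not in added_set]
    return added + rest

# Toy fibroblast encoding (assumed ordering, highest-ranked gene first).
fibroblast = ["COL1A1", "KLF4", "VIM", "ACTB"]
oskm = ["OCT4", "SOX2", "KLF4", "MYC"]
reprogrammed = in_silico_activate(fibroblast, oskm)
print(reprogrammed)
```

A matched control, as in the caption, would apply the same function with an equal number of randomly chosen genes.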

Extended Data Fig. 3 Geneformer encoded context-specificity of key NOTCH pathway genes.

Known context-dependent NOTCH genes showed higher variance in their contextual embeddings across variable aortic cell types compared to housekeeping gene GAPDH.

Extended Data Fig. 4 Geneformer pretrained and fine-tuned cell embeddings were robust to batch-dependent technical artefacts.

a, While original data (left) was highly affected by patient batch effect, cell embeddings generated by pretrained Geneformer (right) (without fine-tuning) clustered primarily by cell type and phenotype. Of note, affected individuals 1, 2, and 4 had the phenotype of ascending only aortic aneurysm, which is a different phenotype than aortic aneurysm that includes the root. b, Imbalance in the number of genes detected in each of the two platforms (single-cell Drop-seq versus single-nucleus DroNc-seq), which may result in batch-dependent technical artefacts. c, Cell embeddings from each layer of the Geneformer model fine-tuned to distinguish the indicated cell types (as annotated by original study11) using only the Drop-seq data. As the cells pass through each layer, the model successively extrudes them from each other to derive separable embeddings that distinguish the cell types. d, Cell type predictions on the DroNc-seq data by the model fine-tuned only on the Drop-seq data (out of sample accuracy 84%). Of note, inaccurate predictions were predominantly in predicting that cardiomyocyte type 2 was type 1, as expected given the minimal examples of cardiomyocyte type 2 in the Drop-seq data. e, The imbalance of cardiomyocyte type 1 and 2 between the platforms also suggests that these cellular subtypes may be an artefact of variable gene detection between the two platforms. f, Geneformer fine-tuned with only Drop-seq data automatically integrated DroNc-seq data such that the fine-tuned Geneformer cell embeddings primarily clustered by cell types and showed improved integration of platforms compared to the original data even after batch effect removal using the ComBat17 or Harmony18 methods.

Extended Data Fig. 5 Geneformer boosted predictions in multiclass cell type annotation.

a, Predictive potential (as measured by accuracy and macro F1 score) of Geneformer fine-tuned for cell type annotation in the indicated human tissues as compared to XGBoost (CaSTLe) and deep neural network-based (scDeepSort) methods. The top bar graph indicates the number of cell type classes for each tissue; the gap in performance of Geneformer compared to alternatives increased as the number of cell type classes increased, indicating that Geneformer was robust in even increasingly complex multiclass prediction applications. b, Lung, c, large intestine, or d, pancreas out of sample predictions by Geneformer fine-tuned to distinguish cell types in each tissue (training on 80% of cells, predictions on held-out 20% of cells).
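
Panel a reports both accuracy and macro F1 score. The sketch below, using assumed toy labels rather than data from the study, shows why the two can diverge in multiclass annotation: macro F1 averages the per-class F1 scores with equal weight, so a missed rare cell type lowers it sharply even when overall accuracy stays high.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example (assumed labels): a classifier that never predicts the rare class.
y_true = ["T cell"] * 8 + ["B cell"] * 2
y_pred = ["T cell"] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, f1)
```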

Extended Data Fig. 6 Embedding dimension activations distinguish cell types in fine-tuned Geneformer model.

a, Kidney, b, liver, c, blood, d, spleen, e, brain, or f, placenta out of sample predictions by Geneformer fine-tuned to distinguish cell types in each tissue (training on 80% of cells, predictions on held-out 20% of cells). g, Specific embedding dimension activations distinguish each lung cell type in the fine-tuned model.

Extended Data Fig. 7 Geneformer boosted predictions in a diverse panel of downstream tasks.

a, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods (as described in Fig. 2a) for downstream task of distinguishing dosage-sensitive vs. insensitive transcription factors. b, Effect on cardiomyocyte embeddings from in silico deletion of genes linked by prior transcriptome-wide association study (TWAS)-prioritized GWAS24 to cardiac MRI traits relevant to cardiac pathology (left ventricular (LV) end diastolic volume (EDV), LV end systolic volume (LVESV), LV ejection fraction (LVEF), and stroke volume (SV)) compared to in silico deletion of control cardiac disease genes expressed in cardiomyocytes but whose pathology occurs in non-cardiomyocyte cell types (hyperlipidemia) (*p < 0.05 by Wilcoxon, FDR-corrected; centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = outliers). c, Quantitative PCR (QPCR) data of CRISPR-mediated knockout of TEAD4 in iPSC-derived cardiomyocytes (n = 3, *p < 0.05 by t-test; centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = experimental replicates). d, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing bivalent vs. non-methylated genes (56 highly conserved loci28). e, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing bivalent vs. Lys4-only methylated genes (56 highly conserved loci28).

Extended Data Fig. 8 Geneformer boosted predictions in a diverse panel of downstream tasks.

a, Confusion matrix and F1 score for Geneformer predictions vs. alternative methods (as described in Fig. 2a) for downstream task of distinguishing genome-wide30 bivalent vs. Lys4-only methylated genes with model fine-tuned only on 56 highly conserved loci28. b, ROC curve of Geneformer fine-tuned to distinguish genome-wide bivalent vs. Lys4-only-methylated genes using limited data (about 15K ESCs), compared to alternative methods. c, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing genome-wide bivalent vs. non-methylated genes with model fine-tuned on 80% of genome-wide loci and predicting on 20% of out of sample loci. d, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing long- vs. short-range transcription factors. e, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing central vs. peripheral genes within the N1-dependent network in endothelial cells.

Extended Data Fig. 9 In silico deletion strategy revealed network connectivity.

a, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods (as described in Fig. 2a) for downstream task of distinguishing N1-activated vs. non-targets. b, Confusion matrix and F1 score of Geneformer predictions of central vs. peripheral genes within the N1-dependent network in endothelial cells (ECs) with model fine-tuned only on 884 ECs from healthy or dilated aortas14. c, Pretrained Geneformer attention weights in aortic ECs demonstrated that specific attention heads learned in a completely self-supervised way the relative centrality of the top most central versus most peripheral genes in the N1-dependent gene network (higher valence = more central) (*p < 0.05 by Wilcoxon, FDR-corrected). d, Pretrained Geneformer contextual attention versus gene rank in rank value encoding in the indicated aortic cell types, which each have different sets of highest ranked genes based on cell type context (higher rank is leftward on x axis) (*p < 0.05 by Wilcoxon, FDR-corrected, * position = side with higher attention). All cells used for analysis had the same number of genes so that the rank values would be comparable. e, In silico deletion of GATA4 was significantly more deleterious to the previously reported highest confidence GATA4 targets33 than to housekeeping genes. f, In silico deletion of TBX5 was significantly more deleterious to previously reported TBX5 direct targets34 than to housekeeping genes or TBX5 indirect targets. In (e–f): *p < 0.05 by Wilcoxon, FDR-corrected; centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = outliers.
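
Panels e and f compare how strongly an in silico deletion affects different gene sets. One plausible way to quantify such an effect, sketched here under stated assumptions, is to remove the gene from the rank value encoding and measure the cosine similarity between each remaining gene's embedding before and after the deletion (lower similarity meaning a larger shift). The `toy_embed` function is a hypothetical stand-in for a model forward pass, and the gene labels are placeholders.

```python
import numpy as np

def in_silico_delete(encoding, gene):
    """Remove one gene from a rank value encoding, keeping the order of the rest."""
    return [g for g in encoding if g != gene]

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def deletion_shift(embed, encoding, gene):
    """Cosine similarity of each remaining gene's embedding before vs. after deletion.

    `embed` is any callable mapping an encoding to {gene: vector}.
    """
    before = embed(encoding)
    after = embed(in_silico_delete(encoding, gene))
    return {g: cosine_similarity(before[g], after[g]) for g in after}

def toy_embed(encoding):
    # Hypothetical placeholder: the vector depends only on rank position.
    return {g: np.array([1.0, float(i)]) for i, g in enumerate(encoding)}

shifts = deletion_shift(toy_embed, ["GATA4", "TARGET_A", "TARGET_B"], "GATA4")
print(shifts)
```

With a real model, the per-gene similarities for a target gene set could then be compared against those for a control set such as housekeeping genes.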

Extended Data Fig. 10 Geneformer fine-tuned cardiomyocyte embeddings clustered by phenotype.

a, While original data (left) was highly affected by patient batch effect, cell embeddings generated by pretrained Geneformer (right) (without fine-tuning) clustered primarily by cell type. b, UMAP of cardiomyocyte embeddings from the model fine-tuned to distinguish cardiomyocytes in non-failing hearts from cardiomyocytes in patients with hypertrophic or dilated cardiomyopathy. c, Gene sets significantly associated with hypertrophic or dilated cardiomyopathy states by Geneformer in silico deletion disease modelling significantly overlapped with genes differentially expressed in those respective disease states (differentially expressed vs. non-failing) compared to the overlap of those differentially expressed genes with background genes (the remainder of the genes detected in cardiomyocytes that were not significantly associated with hypertrophic or dilated cardiomyopathy by Geneformer disease modelling) (*p < 0.05 by χ² test, FDR-corrected). d, Pathway enrichment for genes whose in silico deletion in cardiomyocytes from hypertrophic cardiomyopathy patients significantly shifted embeddings towards the non-failing state and away from the dilated cardiomyopathy state, suggesting candidate therapeutic targets. e, QPCR data of CRISPR-mediated knockout of indicated genes in TTN+/− iPSC-derived cardiomyocytes (n = 3, *p < 0.05 by t-test). Centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = experimental replicates.

Supplementary information

Supplementary Information

Supplementary Methods.

Reporting Summary

Peer Review File

Supplementary Table 1

Dataset composition of Genecorpus-30M.

Supplementary Table 2

Fine-tuning training classes and task-specific data.

Supplementary Table 3

Predicted deleterious effect of in silico deletion of genes in fetal cardiomyocytes.

Supplementary Table 4

Gene set enrichments of genes whose in silico deletion is predicted to be deleterious in fetal cardiomyocytes.

Supplementary Table 5

Predicted deleterious effect of in silico deletion or activation of genes in cardiomyocytes from non-failing hearts.

Supplementary Table 6

Gene set enrichments of genes whose in silico deletion defines the hypertrophic cardiomyopathy state.

Supplementary Table 7

Gene set enrichments of genes whose in silico activation defines the hypertrophic cardiomyopathy state.

Supplementary Table 8

Gene set enrichments of genes whose in silico deletion defines the dilated cardiomyopathy state.

Supplementary Table 9

Gene set enrichments of genes whose in silico activation defines the dilated cardiomyopathy state.

Supplementary Table 10

Gene set enrichments of genes whose in silico deletion uniquely defines the dilated rather than hypertrophic cardiomyopathy state.

Supplementary Table 11

Gene set enrichments of genes whose in silico activation uniquely defines the dilated rather than hypertrophic cardiomyopathy state.

Supplementary Table 12

Predicted beneficial effect of in silico deletion or activation of genes in cardiomyocytes from hypertrophic or dilated cardiomyopathy.

Supplementary Table 13

Gene set enrichments of hypertrophic cardiomyopathy candidate therapeutic targets from in silico treatment analysis by in silico deletion.

Supplementary Table 14

Gene set enrichments of hypertrophic cardiomyopathy candidate therapeutic targets from in silico treatment analysis by in silico activation.

Supplementary Table 15

Gene set enrichments of dilated cardiomyopathy candidate therapeutic targets from in silico treatment analysis by in silico deletion.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Theodoris, C.V., Xiao, L., Chopra, A. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023). https://doi.org/10.1038/s41586-023-06139-9

  • DOI: https://doi.org/10.1038/s41586-023-06139-9
