Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding1,2 and computer vision3 by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
Genecorpus-30M is available on the Huggingface Dataset Hub at https://huggingface.co/datasets/ctheodoris/Genecorpus-30M.
The pretrained Geneformer model, transcriptome tokenizer and code for pretraining and fine-tuning the model are available on the Huggingface Model Hub at https://huggingface.co/ctheodoris/Geneformer. All other code used in this study is available upon request from the corresponding authors.
Vaswani, A. et al. Attention is all you need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2017).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Theodoris, C. V. et al. Human disease modeling reveals integrated transcriptional and epigenetic mechanisms of NOTCH1 haploinsufficiency. Cell 160, 1072–1086 (2015).
Theodoris, C. V. et al. Network-based screen in iPSC-derived cells reveals therapeutic candidate for heart valve disease. Science 371, eabd0724 (2021).
Shao, X. et al. ScDeepSort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic Acids Res. 49, e122 (2021).
Lieberman, Y., Rokach, L. & Shay, T. CaSTLe—classification of single cells by transfer learning: harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS ONE 13, e0205499 (2018).
Lin, T., Wang, Y., Liu, X. & Qiu, X. A survey of transformers. Preprint at https://doi.org/10.48550/arXiv.2106.04554 (2021).
Ren, J. et al. ZeRO-offload: democratizing billion-scale model training. In Proc. 2021 USENIX Annual Technical Conference 551–564 (USENIX, 2021).
Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. ZeRO: memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis 1–16 (IEEE, 2020).
Selewa, A. et al. Systematic comparison of high-throughput single-cell and single-nucleus transcriptomes during cardiomyocyte differentiation. Sci. Rep. 10, 1535 (2020).
10x Genomics Datasets https://www.10xgenomics.com/resources/datasets/frozen-pbm-cs-donor-a-1-standard-1-1-0.
10x Genomics Datasets https://www.10xgenomics.com/resources/datasets/fresh-68-k-pbm-cs-donor-a-1-standard-1-1-0.
Li, Y. et al. Single-cell transcriptome analysis reveals dynamic cell populations and differential gene expression patterns in control and aneurysmal human aortic tissue. Circulation 142, 1374–1388 (2020).
Xing, Q. R. et al. Diversification of reprogramming trajectories revealed by parallel single-cell transcriptome and chromatin accessibility sequencing. Sci. Adv. 6, 463–474 (2020).
Guo, D. et al. iMyoblasts for ex vivo and in vivo investigations of human myogenesis and disease modeling. eLife 11, e70341 (2022).
Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Shihab, H. A., Rogers, M. F., Campbell, C. & Gaunt, T. R. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics 33, 1751–1757 (2017).
Ni, Z., Zhou, X. Y., Aslam, S. & Niu, D. K. Characterization of human dosage-sensitive transcription factor genes. Front. Genet. 10, 1208 (2019).
Collins, R. L. et al. A cross-disorder dosage sensitivity map of the human genome. Cell 185, 3041–3055 (2022).
Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, 808 (2020).
Pirruccello, J. P. et al. Analysis of cardiac magnetic resonance imaging in 36,000 individuals yields genetic insights into dilated cardiomyopathy. Nat. Commun. 11, 2254 (2020).
Bolte, C. et al. Expression of Foxm1 transcription factor in cardiomyocytes is required for myocardial development. PLoS ONE 6, e22217 (2011).
Bolte, C. et al. Postnatal ablation of Foxm1 from cardiomyocytes causes late onset cardiac hypertrophy and fibrosis without exacerbating pressure overload-induced cardiac remodeling. PLoS ONE 7, e48713 (2012).
Currey, L., Thor, S. & Piper, M. TEAD family transcription factors in development and disease. Development 148, dev196675 (2021).
Bernstein, B. E. et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125, 315–326 (2006).
Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz406 (2019).
Pan, G. et al. Whole-genome analysis of histone H3 lysine 4 and lysine 27 methylation in human embryonic stem cells. Cell Stem Cell 1, 299–312 (2007).
Chen, C. H. et al. Determinants of transcription factor regulatory range. Nat. Commun. 11, 2472 (2020).
Litviňuková, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
Ang, Y. S. et al. Disease model of GATA4 mutation reveals transcription factor cooperativity in human cardiogenesis. Cell 167, 1734–1749 (2016).
Kathiriya, I. S. et al. Modeling human TBX5 haploinsufficiency predicts regulatory networks for congenital heart disease. Dev. Cell 56, 292–309 (2021).
Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).
Hinson, J. T. et al. Titin mutations in iPS cells define sarcomere insufficiency as a cause of dilated cardiomyopathy. Science 349, 982–986 (2015).
Seidman, C. E. & Seidman, J. G. Identifying sarcomere gene mutations in hypertrophic cardiomyopathy: a personal history. Circ. Res. 108, 743–750 (2011).
Kamisago, M. et al. Mutations in sarcomere protein genes as a cause of dilated cardiomyopathy. New Engl. J. Med. 343, 1688–1696 (2000).
Ramaccini, D. et al. Mitochondrial function and dysfunction in dilated cardiomyopathy. Front. Cell Dev. Biol. https://doi.org/10.3389/fcell.2020.624216 (2021).
Ho, D., Yan, L., Iwatsubo, K., Vatner, D. E. & Vatner, S. F. Modulation of β-adrenergic receptor signaling in heart failure and longevity: targeting adenylyl cyclase type 5. Heart Fail. Rev. 15, 495–512 (2010).
Wagner, A. H. et al. DGIdb 2.0: mining clinically relevant drug-gene interactions. Nucleic Acids Res. 44, D1036–D1044 (2016).
Nakagawa, O. et al. Centronuclear myopathy in mice lacking a novel muscle-specific protein kinase transcriptionally regulated by MEF2. Genes Dev. 19, 2066–2077 (2005).
Akazawa, H. & Komuro, I. Roles of cardiac transcription factors in cardiac hypertrophy. Circ. Res. 92, 1079–1088 (2003).
Henighan, T. et al. Scaling laws for autoregressive generative modeling. Preprint at https://doi.org/10.48550/arXiv.2010.14701 (2020).
Madissoon, E. et al. ScRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol. 21, 1 (2019).
Anderson, D. J. et al. NKX2-5 regulates human cardiomyogenesis via a HEY2 dependent transcriptional network. Nat. Commun. 9, 1373 (2018).
Smillie, C. S. et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 178, 714–730 (2019).
Lee, J. S. et al. Immunophenotyping of Covid-19 and influenza highlights the role of type I interferons in development of severe Covid-19. Sci. Immunol. 5, eabd1554 (2020).
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).
Fang, Z. et al. Single-cell heterogeneity analysis and CRISPR screen identify key β-cell-specific disease genes. Cell Rep. 26, 3132–3144 (2019).
Agarwal, D. et al. A single-cell atlas of the human substantia nigra reveals cell-specific pathways associated with neurological disorders. Nat. Commun. 11, 4183 (2020).
Rasouli, J. et al. A distinct GM-CSF+ T helper cell subset requires T-bet to adopt a TH1 phenotype and promote neuroinflammation. Sci. Immunol. 5, eaba9953 (2020).
Park, J.-E. et al. A cell atlas of human thymic development defines T cell repertoire formation. Science 367, eaay3224 (2020).
Mende, N. et al. Quantitative and molecular differences distinguish adult human medullary and extramedullary haematopoietic stem and progenitor cell landscapes. Preprint at BioRxiv https://doi.org/10.1101/2020.01.26.919753 (2020).
Setty, M. et al. Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 37, 451–460 (2019).
Popescu, D.-M. et al. Decoding human fetal liver haematopoiesis. Nature 574, 365–371 (2019).
Vento-Tormo, R. et al. Single-cell reconstruction of the early maternal-fetal interface in humans. Nature 563, 347–353 (2018).
Ramachandran, P. et al. Resolving the fibrotic niche of human liver cirrhosis at single-cell level. Nature 575, 512–518 (2019).
Kinchen, J. et al. Structural remodeling of the human colonic mesenchyme in inflammatory bowel disease. Cell 175, 372–386 (2018).
James, K. R. et al. Distinct microbial and immune niches of the human colon. Nat. Immunol. 21, 343–353 (2020).
Zhou, L. et al. Single-cell RNA-seq analysis uncovers distinct functional human NKT cell sub-populations in peripheral blood. Front. Cell Dev. Biol. 8, 384 (2020).
Liao, J. et al. Single-cell RNA sequencing of human kidney. Sci. Data 7, 4 (2020).
Jäkel, S. et al. Altered human oligodendrocyte heterogeneity in multiple sclerosis. Nature 566, 543–547 (2019).
Merrick, D. et al. Identification of a mesenchymal progenitor cell hierarchy in adipose tissue. Science 364, eaav2501 (2019).
Habermann, A. C. et al. Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Sci. Adv. 6, eaba1972 (2020).
Rosa, F. F. et al. Direct reprogramming of fibroblasts into antigen-presenting dendritic cells. Sci. Immunol. 3, eaau4292 (2018).
Stewart, B. J. et al. Spatiotemporal immune zonation of the human kidney. Science 365, 1461–1466 (2019).
MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).
Welch, J. et al. Integrative inference of brain cell similarities and differences from single-cell genomics. Preprint at BioRxiv https://doi.org/10.1101/459891 (2018).
Ledergor, G. et al. Single cell dissection of plasma cell heterogeneity in symptomatic and asymptomatic myeloma. Nat. Med. 24, 1867–1876 (2018).
Lukowski, S. W. et al. A single-cell transcriptome atlas of the adult human retina. EMBO J. 38, e100811 (2019).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Zirkel, A. et al. HMGB2 loss upon senescence entry disrupts genomic organization and induces CTCF clustering across cell types. Mol. Cell 70, 730–744 (2018).
Goudot, C. et al. Aryl hydrocarbon receptor controls monocyte differentiation into dendritic cells versus macrophages. Immunity 47, 582–596 (2017).
McCauley, K. B. et al. Single-cell transcriptomic profiling of pluripotent stem cell-derived SCGB3A2+ airway epithelium. Stem Cell Rep. 10, 1579–1595 (2018).
Das, R. et al. Early B cell changes predict autoimmunity following combination immune checkpoint blockade. J. Clin. Invest. 128, 715–720 (2018).
Kini Bailur, J. et al. Changes in bone marrow innate lymphoid cell subsets in monoclonal gammopathy: target for IMiD therapy. Blood Adv. 1, 2343–2347 (2017).
Patil, V. S. et al. Precursors of human CD4+ cytotoxic T lymphocytes identified by single-cell transcriptome analysis. Sci. Immunol. 3, eaan8664 (2018).
Wang, C. et al. Expansion of hedgehog disrupts mesenchymal identity and induces emphysema phenotype. J. Clin. Invest. 128, 4343–4358 (2018).
Hermann, B. P. et al. The mammalian spermatogenesis single-cell transcriptome, from spermatogonial stem cells to spermatids. Cell Rep. 25, 1650–1667 (2018).
Menon, R. et al. Single-cell analysis of progenitor cell dynamics and lineage specification in the human fetal kidney. Development 145, dev164038 (2018).
Czerniecki, S. M. et al. High-throughput screening enhances kidney organoid differentiation from human pluripotent stem cells and enables automated multidimensional phenotyping. Cell Stem Cell 22, 929–940 (2018).
Papa, L. et al. Ex vivo human HSC expansion requires coordination of cellular reprogramming with mitochondrial remodeling and p53 activation. Blood Adv. 2, 2766–2779 (2018).
Schulthess, J. et al. The short chain fatty acid butyrate imprints an antimicrobial program in macrophages. Immunity 50, 432–445 (2019).
Guo, J. et al. The adult human testis transcriptional cell atlas. Cell Res. 28, 1141–1157 (2018).
Karow, M. et al. Direct pericyte-to-neuron reprogramming via unfolding of a neural stem cell-like program. Nat. Neurosci. 21, 932–940 (2018).
Xin, Y. et al. Pseudotime ordering of single human β-cells reveals states of insulin production and unfolded protein response. Diabetes 67, 1783–1794 (2018).
Phipson, B. et al. Evaluation of variability in human kidney organoids. Nat. Methods 16, 79–87 (2019).
Balan, S. et al. Large-scale human dendritic cell differentiation revealing notch-dependent lineage bifurcation and heterogeneity. Cell Rep. 24, 1902–1915 (2018).
Milpied, P. et al. Human germinal center transcriptional programs are de-synchronized in B cell lymphoma. Nat. Immunol. 19, 1013–1024 (2018).
Parikh, K. et al. Colonic epithelial cell diversity in health and inflammatory bowel disease. Nature 567, 49–55 (2019).
Habiel, D. M. et al. CCR10+ epithelial cells from idiopathic pulmonary fibrosis lungs drive remodeling. JCI Insight 3, e122211 (2018).
Paik, D. T. et al. Large-scale single-cell RNA-seq reveals molecular signatures of heterogeneous populations of human induced pluripotent stem cell-derived endothelial cells. Circ. Res. 123, 443–450 (2018).
Martin, J. C. et al. Single-cell analysis of Crohn’s disease lesions identifies a pathogenic cellular module associated with resistance to anti-TNF therapy. Cell 178, 1493–1508 (2019).
Zheng, Y. et al. A human circulating immune cell landscape in aging and COVID-19. Protein Cell 11, 740–770 (2020).
Hochane, M. et al. Single-cell transcriptomics reveals gene expression dynamics of human fetal kidney development. PLoS Biol. 17, e3000152 (2019).
Sohni, A. et al. The neonatal and adult human testis defined at the single-cell level. Cell Rep. 26, 1501–1517 (2019).
Tran, T. et al. In vivo developmental trajectories of human podocyte inform in vitro differentiation of pluripotent stem cell-derived podocytes. Dev. Cell 50, 102–116 (2019).
Wang, Y. et al. Single-cell transcriptome analysis reveals differential nutrient absorption functions in human intestine. J. Exp. Med. 217, e20191130 (2020).
Vieira Braga, F. A. et al. A cellular census of human lungs identifies novel cell states in health and in asthma. Nat. Med. 25, 1153–1163 (2019).
Guo, J. et al. The dynamic transcriptional cell atlas of testis development during human puberty. Cell Stem Cell 26, 262–276 (2020).
Voigt, A. P. et al. Single-cell transcriptomics of the human retinal pigment epithelium and choroid in health and macular degeneration. Proc. Natl Acad. Sci. USA 116, 24100–24107 (2019).
Menon, M. et al. Single-cell transcriptomic atlas of the human retina identifies cell types associated with age-related macular degeneration. Nat. Commun. 10, 4902 (2019).
Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 26, 1070–1076 (2020).
Li, B. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat. Methods 17, 793–798 (2020).
Daniszewski, M. et al. Single cell RNA sequencing of stem cell-derived retinal ganglion cells. Sci. Data 5, 180013 (2018).
Goveia, J. et al. An integrated gene expression landscape profiling approach to identify lung tumor endothelial cell heterogeneity and angiogenic candidates. Cancer Cell 37, 21–36 (2020).
Norelli, M. et al. Monocyte-derived IL-1 and IL-6 are differentially required for cytokine-release syndrome and neurotoxicity due to CAR T cells. Nat. Med. 24, 739–748 (2018).
Daniszewski, M. et al. Single-cell profiling identifies key pathways expressed by iPSCs cultured in different commercial media. iScience 7, 30–39 (2018).
Miller, A. J. et al. In vitro and in vivo development of the human airway at single-cell resolution. Dev. Cell 53, 117–128 (2020).
Silvin, A. et al. Elevated calprotectin and abnormal myeloid cell subsets discriminate severe from mild COVID-19. Cell 182, 1401–1418 (2020).
Deprez, M. et al. A single-cell atlas of the human healthy airways. Am. J. Resp. Crit. Care Med. 202, 1636–1645 (2020).
Sridhar, A. et al. Single-cell transcriptomic comparison of human fetal retina, hPSC-derived retinal organoids, and long-term retinal cultures. Cell Rep. 30, 1644–1659 (2020).
Wu, H. et al. Comparative analysis and refinement of human PSC-derived kidney organoid differentiation with single-cell transcriptomics. Cell Stem Cell 23, 869–881 (2018).
Vijay, J. et al. Single-cell analysis of human adipose tissue identifies depot and disease specific cell types. Nat. Metab. 2, 97–109 (2020).
Solé-Boldo, L. et al. Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun. Biol. 3, 188 (2020).
Adams, T. S. et al. Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis. Sci. Adv. 6, eaba1983 (2020).
Moreira, L. M. et al. Paracrine signalling by cardiac calcitonin controls atrial fibrogenesis and arrhythmia. Nature 587, 460–465 (2020).
Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).
Bunis, D. G. et al. Single-cell mapping of progressive fetal-to-adult transition in human naive T cells. Cell Rep. 34, 108573 (2021).
Plasschaert, L. W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).
Takeda, A. et al. Single-cell survey of human lymphatics unveils marked endothelial cell heterogeneity and mechanisms of homing for neutrophils. Immunity 51, 561–572 (2019).
Frumm, S. M. et al. A hierarchy of proliferative and migratory keratinocytes maintains the tympanic membrane. Cell Stem Cell 28, 315–330 (2021).
Yu, Z. et al. Single-cell transcriptomic map of the human and mouse bladders. J. Am. Soc. Nephrol. 30, 2159–2176 (2019).
Rubenstein, A. B. et al. Single-cell transcriptional profiles in human skeletal muscle. Sci. Rep. 10, 229 (2020).
McCracken, I. R. et al. Transcriptional dynamics of pluripotent stem cell-derived endothelial cell differentiation revealed by single-cell RNA sequencing. Eur. Heart J. 41, 1024–1036 (2020).
Hua, P. et al. Single-cell analysis of bone marrow-derived CD34+ cells from children with sickle cell disease and thalassemia. Blood 134, 2111–2115 (2019).
Orozco, L. D. et al. Integration of eQTL and a single-cell atlas in the human eye identifies causal genes for age-related macular degeneration. Cell Rep. 30, 1246–1259 (2020).
Hurley, K. et al. Reconstructed single-cell fate trajectories define lineage plasticity windows during differentiation of human PSC-derived distal lung progenitors. Cell Stem Cell 26, 593–608 (2020).
Schafflick, D. et al. Integrated single cell analysis of blood and cerebrospinal fluid leukocytes in multiple sclerosis. Nat. Commun. 11, 247 (2020).
Su, C. et al. Single-cell RNA sequencing in multiple pathologic types of renal cell carcinoma revealed novel potential tumor-specific markers. Front. Oncol. 11, 719564 (2021).
He, J. et al. Dissecting human embryonic skeletal stem cell ontogeny by single-cell transcriptomic and functional analyses. Cell Res. 31, 742–757 (2021).
Liao, M. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat. Med. 26, 842–844 (2020).
Liu, X. et al. Reprogramming roadmap reveals route to human induced trophoblast stem cells. Nature 586, 101–107 (2020).
He, S. et al. Single-cell transcriptome profiling of an adult human cell atlas of 15 major organs. Genome Biol. 21, 294 (2020).
Wu, C.-L. et al. Single cell transcriptomic analysis of human pluripotent stem cell chondrogenesis. Nat. Commun. 12, 362 (2021).
Cowan, C. S. et al. Cell types of the human retina and its organoids at single-cell resolution. Cell 182, 1623–1640 (2020).
Savas, P. et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat. Med. 24, 986–993 (2018).
Wang, L. et al. Single-cell map of diverse immune phenotypes in the metastatic brain tumor microenvironment of non small cell lung cancer. Preprint at BioRxiv https://doi.org/10.1101/2019.12.30.890517 (2019).
Lu, Y.-C. et al. Single-cell transcriptome analysis reveals gene signatures associated with T-cell persistence following adoptive cell therapy. Cancer Immunol. Res. 7, 1824–1836 (2019).
Wang, L. et al. The phenotypes of proliferating glioblastoma cells reside on a single axis of variation. Cancer Discov. 9, 1708–1719 (2019).
Wang, R. et al. Adult human glioblastomas harbor radial glia-like cells. Stem Cell Rep. 14, 338–350 (2020).
Wang, L., Catalan, F., Shamardani, K., Babikir, H. & Diaz, A. Ensemble learning for classifying single-cell data and projection across reference atlases. Bioinformatics 36, 3585–3587 (2020).
Ruffin, A. T. et al. B cell signatures and tertiary lymphoid structures contribute to outcome in head and neck squamous cell carcinoma. Nat. Commun. 12, 3349 (2021).
Zhang, Q. et al. Landscape and dynamics of single immune cells in hepatocellular carcinoma. Cell 179, 829–845 (2019).
Song, Q. et al. Dissecting intratumoral myeloid cell plasticity by single cell RNA-seq. Cancer Med. 8, 3072–3085 (2019).
Kim, N. et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat. Commun. 11, 2285 (2020).
Tang-Huau, T.-L. et al. Human in vivo-generated monocyte-derived dendritic cells and macrophages cross-present antigens through a vacuolar pathway. Nat. Commun. 9, 2570 (2018).
Peng, J. et al. Single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res. 29, 725–738 (2019).
10x Genomics Datasets: Single Cell Gene Expression. 10x Genomics https://www.10xgenomics.com/resources/datasets?menu%5Bproducts.name%5D=Single%20Cell%20Gene%20Expression&query=&page=1&configure%5Bfacets%5D%5B0%5D=chemistryVersionAndThroughput&configure%5Bfacets%5D%5B1%5D=pipeline.version&configure%5BhitsPerPage%5D=500.
de Andrade, L. F. et al. Discovery of specialized NK cell populations infiltrating human melanoma metastases. JCI Insight 4, e133103 (2019).
Zhang, P. et al. Dissecting the single-cell transcriptome network underlying gastric premalignant lesions and early gastric cancer. Cell Rep. 27, 1934–1947 (2019).
Durante, M. A. et al. Single-cell analysis reveals new evolutionary complexity in uveal melanoma. Nat. Commun. 11, 496 (2020).
Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database 2020, baaa073 (2020).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Xin, J. et al. High-performance web services for querying gene and variant annotation. Genome Biol. 17, 91 (2016).
Dunning, T. The t-digest: efficient estimates of distributions. Softw. Impacts 7, 100049 (2021).
Lhoest, Q. et al. Datasets: a community library for natural language processing. Preprint at https://doi.org/10.48550/arXiv.2109.02846 (2021).
Wolf, T. et al. HuggingFace’s transformers: state-of-the-art natural language processing. Preprint at https://doi.org/10.48550/arXiv.1910.03771 (2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. Preprint at https://doi.org/10.48550/arXiv.1711.05101 (2017).
We thank J. Rae for helpful scientific discussions and Google Research for providing tensor processing unit (TPU) resources for experimentation. P.T.E. was supported by grants from the National Institutes of Health (NIH) (1R01HL092577, 1R01HL157635 and 5R01HL139731), American Heart Association Strategically Focused Research Networks (18SFRN34110082) and European Union (MAESTRIA 965286). C.V.T. was supported by NIH T32GM007748 and the Helen Hay Whitney Foundation Postdoctoral Fellowship. L.X. was supported by the American Heart Association (20CDA35260081).
X.S.L. conducted this work while on faculty at Dana-Farber Cancer Institute and is now a board member and CEO of GV20 Therapeutics. P.T.E. has received sponsored research support from Bayer AG, IBM Research, Bristol Myers Squibb and Pfizer. P.T.E. has also served on advisory boards or consulted for Bayer AG, MyoKardia and Novartis. A.C. is an employee of Bayer US LLC (a subsidiary of Bayer AG) and may own stock in Bayer AG. E.M.B. was a full-time employee of Bayer when this work was performed. The remaining authors declare no competing interests.
Peer review information
Nature thanks Amir Bashan, Natasa Przulj and Nathan Palpant for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
a, Schematic of standard modelling approach, which necessitates retraining a new model from scratch for each new task. b, Schematic of transfer learning strategy. Through a single initial self-supervised large-scale pretraining on a generalizable learning objective, the model gains fundamental knowledge of the learning domain that is then democratized to a multitude of downstream applications distinct from the pretraining learning objective, transferring knowledge to new tasks. c, Transcription factors are normalized by a statistically significantly lower factor (resulting in higher prioritization in the rank value encoding) compared to all genes. Housekeeping genes on average show a trend of a higher normalization factor (resulting in deprioritization in the rank value encoding) compared to all genes (*p < 0.05 by Wilcoxon, FDR-corrected; all genes n = 17,903, housekeeping genes n = 11, transcription factors n = 1,384; error bars = standard deviation). d, Pretraining was performed with a randomly subsampled corpus of 100,000 cells, holding out 10,000 cells for evaluation, with 3 different random seeds. Evaluation loss was essentially equivalent in the 3 trials, indicating robustness to the set of genes randomly masked for each cell during the pretraining. e, Pretraining was performed with a randomly subsampled corpus of 100,000 cells, holding out 10,000 cells for evaluation, with 3 different masking percentages. 15% masking had marginally lower evaluation loss compared to 5% or 30% masking. f, Pretraining was performed with a randomly subsampled corpus of 90,000 cells and the model was then fine-tuned to distinguish dosage-sensitive vs. -insensitive transcription factors using 10,000 cells that were either included in or excluded from the 90,000 cell pretraining corpus. Predictive potential on the downstream fine-tuning task was measured by fivefold cross-validation with these 10,000 cells, demonstrating essentially equivalent results by AUC, confusion matrices, and F1 score. Because the fine-tuning applications are trained on classification objectives that are completely separate from the masked learning objective, whether or not task-specific data was included in the pretraining corpus is not relevant to the downstream classification predictions.
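The rank value encoding in panel c and the random masking in panels d and e can be illustrated with a minimal Python sketch. It assumes that each gene's count is divided by a per-gene normalization factor, that genes are then ordered from highest to lowest scaled value, and that undetected genes are dropped. The function names, the `<mask>` token, the placeholder gene names `TF_A`, `GENE_B` and `GENE_C`, and all numbers are hypothetical and are not taken from the Geneformer code base.

```python
import numpy as np


def rank_value_encode(counts, norm_factors, gene_ids):
    """Order detected genes by normalized expression, highest first.

    Assumption: scaled value = count / per-gene normalization factor,
    so a lower factor moves a gene towards the front of the encoding.
    """
    counts = np.asarray(counts, dtype=float)
    norm_factors = np.asarray(norm_factors, dtype=float)
    scaled = counts / norm_factors
    detected = np.flatnonzero(counts > 0)  # drop genes with zero counts
    # Stable sort on the negated values gives descending order and
    # keeps the input order for ties.
    order = detected[np.argsort(-scaled[detected], kind="stable")]
    return [gene_ids[i] for i in order]


def mask_tokens(tokens, mask_fraction=0.15, mask_token="<mask>", seed=0):
    """Replace a random fraction of tokens with a mask token.

    Returns the masked sequence and a dict mapping each masked position
    to the original token (the target of the masked learning objective).
    """
    rng = np.random.default_rng(seed)
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = rng.choice(len(tokens), size=n_mask, replace=False)
    masked = list(tokens)
    labels = {}
    for p in positions:
        labels[int(p)] = masked[p]
        masked[p] = mask_token
    return masked, labels


# Toy cell: the transcription factor has few counts but a low
# normalization factor, so it is ranked ahead of the housekeeping gene.
genes = ["GAPDH", "TF_A", "GENE_B", "GENE_C"]
encoding = rank_value_encode([50, 4, 10, 0], [25.0, 1.0, 5.0, 2.0], genes)
print(encoding)  # ['TF_A', 'GAPDH', 'GENE_B']
```

In this toy cell the transcription factor is prioritized despite its low raw count, which is the behaviour panel c attributes to a lower normalization factor. Changing `seed` in `mask_tokens` changes which genes are masked, and changing `mask_fraction` corresponds to the 5%, 15% and 30% settings compared in panel e.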
Extended Data Fig. 2 Geneformer was context-aware and robust to batch-dependent technical artefacts.
a, Effect of gene versus the indicated batch-dependent technical artefact on pretrained Geneformer gene embeddings (*p < 0.05 by Wilcoxon, FDR-corrected; NS: non-significant). We found that the gene embeddings were robust to sequencing platform11, preservation method12,13, and individual patient variability14. b, UMAP of pretrained Geneformer cell embeddings of cells undergoing iPSC reprogramming appropriately captured temporal trajectory of reprogramming (cell types as annotated by original study15; iPSC negative or positive refers to expression of marker TRA-1-60). Cell embeddings suggested that cells which do not progress to the iPSC state bifurcate into an alternative fate compared to cells that progress to the iPSC state after the day 12 stage. c, Compared to in silico reprogramming with random genes, in silico reprogramming of fibroblasts by artificially adding OCT4, SOX2, KLF4, and MYC (OSKM) to the front of their rank value encodings significantly shifted the gene embeddings from their initial fibroblast state to the embedding of that gene in the iPSC state (*p < 0.05 by Wilcoxon). d, UMAP of pretrained Geneformer cell embeddings of cells undergoing iPSC to myoblast differentiation at the earlier S1 (PAX3+) and later S2B (PAX3+/MYOD+) stages (cell types as annotated by original study16). e, Compared to in silico differentiation with random genes, in silico differentiation of the early-stage myogenic cells by artificially adding MYOD to the front of their rank value encodings significantly shifted the gene embeddings from their earlier state to the embedding of that gene in the later MYOD+ myogenic state (*p < 0.05 by Wilcoxon).
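The in silico perturbation in panels c and e, in which genes are artificially added to the front of a cell's rank value encoding, can be sketched as a simple list operation. This is an illustrative sketch only: the function names and the toy fibroblast encoding are hypothetical, and removing an existing copy of an added gene from its old position is an assumption about how duplicates would be handled.

```python
import random


def add_genes_to_front(rank_encoding, genes_to_add):
    """Place genes_to_add at the front of a rank value encoding.

    Assumption: a gene that is already present is moved rather than
    duplicated, and the remaining genes keep their relative order.
    """
    front = list(genes_to_add)
    added = set(front)
    rest = [g for g in rank_encoding if g not in added]
    return front + rest


def random_control(rank_encoding, vocabulary, n_genes, seed=0):
    """Control perturbation: add n_genes randomly drawn genes instead."""
    rng = random.Random(seed)
    picked = rng.sample(list(vocabulary), n_genes)
    return add_genes_to_front(rank_encoding, picked)


# Hypothetical fibroblast encoding (highest-ranked gene first).
fibroblast = ["COL1A1", "VIM", "KLF4", "GAPDH"]
reprogrammed = add_genes_to_front(fibroblast, ["OCT4", "SOX2", "KLF4", "MYC"])
print(reprogrammed)
# ['OCT4', 'SOX2', 'KLF4', 'MYC', 'COL1A1', 'VIM', 'GAPDH']
```

The perturbed and control encodings would then be passed through the model so that the resulting shifts in gene embeddings can be compared, as in the Wilcoxon tests reported above.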
Known context-dependent NOTCH genes showed higher variance in their contextual embeddings across variable aortic cell types compared to housekeeping gene GAPDH.
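One way to quantify how much a gene's contextual embedding varies across cell types is sketched below. The caption does not specify the exact variance measure, so the choice here (variance across contexts for each embedding dimension, averaged over dimensions) is an assumption, and the function name and toy embeddings are hypothetical.

```python
import numpy as np


def contextual_variance(embeddings):
    """Mean per-dimension variance of one gene's embeddings.

    embeddings: array of shape (n_contexts, embedding_dim), one row per
    context (for example, one row per cell type).
    """
    emb = np.asarray(embeddings, dtype=float)
    return float(emb.var(axis=0).mean())


# Toy 2-dimensional embeddings in three contexts.
context_dependent = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
context_stable = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

print(contextual_variance(context_dependent) > contextual_variance(context_stable))
# True
```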
Extended Data Fig. 4 Geneformer pretrained and fine-tuned cell embeddings were robust to batch-dependent technical artefacts.
a, While original data (left) was highly affected by patient batch effect, cell embeddings generated by pretrained Geneformer (right) (without fine-tuning) clustered primarily by cell type and phenotype. Of note, affected individuals 1, 2, and 4 had the phenotype of ascending only aortic aneurysm, which is a different phenotype than aortic aneurysm that includes the root. b, Imbalance in the number of genes detected in each of the two platforms (single-cell Drop-seq versus single-nucleus DroNc-seq), which may result in batch-dependent technical artefacts. c, Cell embeddings from each layer of the Geneformer model fine-tuned to distinguish the indicated cell types (as annotated by original study11) using only the Drop-seq data. As the cells pass through each layer, the model successively extrudes them from each other to derive separable embeddings that distinguish the cell types. d, Cell type predictions on the DroNc-seq data by the model fine-tuned only on the Drop-seq data (out of sample accuracy 84%). Of note, inaccurate predictions were predominantly in predicting that cardiomyocyte type 2 was type 1, as expected given the minimal examples of cardiomyocyte type 2 in the Drop-seq data. e, The imbalance of cardiomyocyte type 1 and 2 between the platforms also suggests that these cellular subtypes may be an artefact of variable gene detection between the two platforms. f, Geneformer fine-tuned with only Drop-seq data automatically integrated DroNc-seq data such that the fine-tuned Geneformer cell embeddings primarily clustered by cell types and showed improved integration of platforms compared to the original data even after batch effect removal using the ComBat17 or Harmony18 methods.
a, Predictive potential (as measured by accuracy and macro F1 score) of Geneformer fine-tuned for cell type annotation in the indicated human tissues as compared to XGBoost (CaSTLe) and deep neural network-based (scDeepSort) methods. The top bar graph indicates the number of cell type classes for each tissue; the gap in performance of Geneformer compared to alternatives increased as the number of cell type classes increased, indicating that Geneformer was robust in even increasingly complex multiclass prediction applications. b, Lung, c, large intestine, or d, pancreas out of sample predictions by Geneformer fine-tuned to distinguish cell types in each tissue (training on 80% of cells, predictions on held-out 20% of cells).
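To make the two metrics in panel a concrete, the sketch below computes accuracy and macro F1 score from paired label lists. It is a plain-Python illustration with made-up labels, not the evaluation code used in the study; the function names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Classes are taken from the union of true and predicted labels; a class
    with no true positives contributes an F1 of 0.
    """
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: a classifier that always predicts the common class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8 while macro F1 is about 0.44, because the missed rare class counts as much as the common one. This is why macro F1 is informative alongside accuracy when the number of cell type classes is large and class sizes are uneven.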
Extended Data Fig. 6 Embedding dimension activations distinguish cell types in fine-tuned Geneformer model.
a, Kidney, b, liver, c, blood, d, spleen, e, brain, or f, placenta out of sample predictions by Geneformer fine-tuned to distinguish cell types in each tissue (training on 80% of cells, predictions on held-out 20% of cells). g, Specific embedding dimension activations distinguish each lung cell type in the fine-tuned model.
a, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods (as described in Fig. 2a) for downstream task of distinguishing dosage-sensitive vs. insensitive transcription factors. b, Effect on cardiomyocyte embeddings from in silico deletion of genes linked by prior transcriptome-wide association study (TWAS)-prioritized GWAS24 to cardiac MRI traits relevant to cardiac pathology (left ventricular (LV) end diastolic volume (EDV), LV end systolic volume (LVESV), LV ejection fraction (LVEF), and stroke volume (SV)) compared to in silico deletion of control cardiac disease genes expressed in cardiomyocytes but whose pathology occurs in non-cardiomyocyte cell types (hyperlipidemia). (*p < 0.05 by Wilcoxon, FDR-corrected; centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = outliers). c, Quantitative PCR (QPCR) data of CRISPR-mediated knockout of TEAD4 in iPSC-derived cardiomyocytes (n = 3, *p < 0.05 by t-test; centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = experimental replicates). d, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing bivalent vs. non-methylated genes (56 highly conserved loci28). e, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing bivalent vs. Lys4-only methylated genes (56 highly conserved loci28).
a, Confusion matrix and F1 score for Geneformer predictions vs. alternative methods (as described in Fig. 2a) for downstream task of distinguishing genome-wide30 bivalent vs. Lys4-only methylated genes with model fine-tuned only on 56 highly conserved loci28. b, ROC curve of Geneformer fine-tuned to distinguish genome-wide bivalent vs. Lys4-only-methylated genes using limited data (about 15K ESCs), compared to alternative methods. c, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing genome-wide bivalent vs. non-methylated genes with model fine-tuned on 80% of genome-wide loci and predicting on 20% of out of sample loci. d, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing long- vs. short-range transcription factors. e, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods for downstream task of distinguishing central vs. peripheral genes within the N1-dependent network in endothelial cells.
a, Confusion matrices and F1 score for Geneformer predictions vs. alternative methods (as described in Fig. 2a) for downstream task of distinguishing N1-activated vs. non-targets. b, Confusion matrix and F1 score of Geneformer predictions of central vs. peripheral genes within the N1-dependent network in endothelial cells (ECs) with model fine-tuned only on 884 ECs from healthy or dilated aortas14. c, Pretrained Geneformer attention weights in aortic ECs demonstrated that specific attention heads learned in a completely self-supervised way the relative centrality of the top most central versus most peripheral genes in the N1-dependent gene network (higher valence = more central) (*p < 0.05 Wilcoxon, FDR-corrected). d, Pretrained Geneformer contextual attention versus gene rank in rank value encoding in the indicated aortic cell types, which each have different sets of highest ranked genes based on cell type context (higher rank is leftward on x axis) (*p < 0.05 by Wilcoxon, FDR-corrected, * position = side with higher attention). All cells used for analysis had the same number of genes so that the rank values would be comparable. e, In silico deletion of GATA4 was significantly more deleterious to the previously reported highest confidence GATA4 targets33 than to housekeeping genes. f, In silico deletion of TBX5 was significantly more deleterious to previously reported TBX5 direct targets34 than to housekeeping genes or TBX5 indirect targets. In (e–f): *p < 0.05 by Wilcoxon, FDR-corrected; centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = outliers.
a, While original data (left) was highly affected by patient batch effect, cell embeddings generated by pretrained Geneformer (right) (without fine-tuning) clustered primarily by cell type. b, UMAP of cardiomyocyte embeddings from the model fine-tuned to distinguish cardiomyocytes in non-failing hearts from cardiomyocytes in patients with hypertrophic or dilated cardiomyopathy. c, Gene sets significantly associated with hypertrophic or dilated cardiomyopathy states by Geneformer in silico deletion disease modelling significantly overlapped with genes differentially expressed in those respective disease states (differentially expressed vs. non-failing) compared to the overlap of those differentially expressed genes with background genes (the remainder of the genes detected in cardiomyocytes that were not significantly associated with hypertrophic or dilated cardiomyopathy by Geneformer disease modelling) (*p < 0.05 by χ² test, FDR-corrected). d, Pathway enrichment for genes whose in silico deletion in cardiomyocytes from hypertrophic cardiomyopathy patients significantly shifted embeddings towards the non-failing state and away from the dilated cardiomyopathy state, suggesting candidate therapeutic targets. e, QPCR data of CRISPR-mediated knockout of indicated genes in TTN+/− iPSC-derived cardiomyocytes (n = 3, *p < 0.05 by t-test). Centre line = median, box limits = upper and lower quartiles, whiskers = 1.5× interquartile range, points = experimental replicates.
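The overlap comparison in panel c can be framed as a test on a 2 × 2 contingency table (disease-associated versus background genes, by differentially expressed versus not). The sketch below computes a Pearson chi-square statistic and its p-value for such a table. The absence of a continuity correction, the function name, and the counts are all assumptions for illustration; the multiple-testing (FDR) correction mentioned in the legend is not shown.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up counts, not taken from the study:
# rows = disease-associated genes, background genes
# cols = differentially expressed, not differentially expressed
stat, p = chi2_2x2(40, 60, 100, 800)
```

A small p-value here would indicate that differentially expressed genes are distributed unevenly between the disease-associated and background sets; the direction of the enrichment is read from the table proportions.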
Dataset composition of Genecorpus-30M.
Fine-tuning training classes and task-specific data.
Predicted deleterious effect of in silico deletion of genes in fetal cardiomyocytes.
Gene set enrichments of genes whose in silico deletion is predicted to be deleterious in fetal cardiomyocytes.
Predicted deleterious effect of in silico deletion or activation of genes in cardiomyocytes from non-failing hearts.
Gene set enrichments of genes whose in silico deletion defines the hypertrophic cardiomyopathy state.
Gene set enrichments of genes whose in silico activation defines the hypertrophic cardiomyopathy state.
Gene set enrichments of genes whose in silico deletion defines the dilated cardiomyopathy state.
Gene set enrichments of genes whose in silico activation defines the dilated cardiomyopathy state.
Gene set enrichments of genes whose in silico deletion uniquely defines the dilated rather than hypertrophic cardiomyopathy state.
Gene set enrichments of genes whose in silico activation uniquely defines the dilated rather than hypertrophic cardiomyopathy state.
Predicted beneficial effect of in silico deletion or activation of genes in cardiomyocytes from hypertrophic or dilated cardiomyopathy.
Gene set enrichments of hypertrophic cardiomyopathy candidate therapeutic targets from in silico treatment analysis by in silico deletion.
Gene set enrichments of hypertrophic cardiomyopathy candidate therapeutic targets from in silico treatment analysis by in silico activation.
Gene set enrichments of dilated cardiomyopathy candidate therapeutic targets from in silico treatment analysis by in silico deletion.
Cite this article
Theodoris, C.V., Xiao, L., Chopra, A. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023). https://doi.org/10.1038/s41586-023-06139-9