Transcription factor over-expression is a proven method for reprogramming cells to a desired cell type for regenerative medicine and therapeutic discovery. However, a general method for the identification of reprogramming factors to create an arbitrary cell type is an open problem. Here we examine the success rate of methods and data for differentiation by testing the ability of nine computational methods (CellNet, GarNet, EBseq, AME, DREME, HOMER, KMAC, diffTF and DeepAccess) to discover and rank candidate factors for eight target cell types with known reprogramming solutions. We compare methods that use gene expression, biological networks and chromatin accessibility data, and comprehensively test parameters and preprocessing of input data to optimize performance. We find that the best factor identification methods can identify an average of 50–60% of reprogramming factors within the top ten candidates, and that methods that use chromatin accessibility perform the best. Among the chromatin accessibility methods, the complex methods DeepAccess and diffTF have higher correlation with the ranked significance of transcription factor candidates within reprogramming protocols for differentiation. We provide evidence that AME and diffTF are optimal methods for transcription factor recovery that will allow for systematic prioritization of transcription factor candidates to aid in the design of new reprogramming protocols.
Normalized area under rank recall curve values for all methods are available in Supplementary Tables 2–6 and for epigenomic marks of liver cells in Extended Source Data Fig. 3. Ranks of each reprogramming factor for all methods are available in Supplementary Table 8. The consensus mouse transcription factor motif database derived from the mouse HOCOMOCOv11 database (ref. 50), shared mouse enhancer sequences, and a list of mouse transcription factors are available at: https://cgs.csail.mit.edu/ReprogrammingRecovery/. Publicly available ATAC-seq and RNA-seq samples were downloaded as fastqs from Nucleotide Read Archive (Supplementary Table 1) and processed as described in the sections on ATAC-seq processing and RNA-seq processing. Uniformly processed gene count and peak files are also available at https://cgs.csail.mit.edu/ReprogrammingRecovery/. Data collection software used were conda/bioconda (v.4.9.0), bedtools (v.2.29.2), Trimgalore (v.21032019), cutadapt (v.0.6.2), samtools (v.1.7), bwa (v.0.7.17), MACS2 (v.126.96.36.199), FASTQC (v.0.11.8), STAR (v.2.5.2b), RSEM (v.1.3.0), R (v.3.6.1), python (v.3.6.9), DeepAccess (v.0.0.1), EBseq (v.1.2.0), CellNet (v.0.1.0), GarNet (v.0.5.0), HOMER (v.4.9.1), AME/DREME/TomTom (v.5.0.5), KMAC (GEM v.3.4), diffTF (v.1.7.1), PWMScan (v.1.1.1), HOCOMOCO (v.11), GENCODE (v.m24) and mouse genome (mm10), and are cited in Supplementary Table 17. Source data are provided with this paper.
The custom script for performing motif discovery with AME, DREME, HOMER and KMAC is available at: https://cgs.csail.mit.edu/ReprogrammingRecovery/.
Pellegrino, M. et al. RNA-seq following PCR-based sorting reveals rare cell transcriptional signatures. BMC Genomics 17, 361 (2016).
Habib, N. et al. Div-Seq: single-nucleus RNA-seq reveals dynamics of rare adult newborn neurons. Science 353, 925–928 (2016).
Corces, M. R. et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat. Methods 14, 959–962 (2017).
Rai, V. et al. Single-cell ATAC-seq in human pancreatic islets and deep learning upscaling of rare cells reveals cell-specific type 2 diabetes regulatory signatures. Mol. Metab. 32, 109–121 (2020).
Sasagawa, Y. et al. Quartz-Seq: a highly reproducible and sensitive single-cell RNA-seq reveals non-genetic gene expression heterogeneity. Genome Biol. 14, 3097 (2013).
Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866.e17 (2016).
Angermueller, C. et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods 13, 229–232 (2016).
Grün, D. et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255 (2015).
Pijuan-Sala, B. et al. A single-cell molecular map of mouse gastrulation and early organogenesis. Nature 566, 490–495 (2019).
Lake, B. B. et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat. Biotechnol. 36, 70–80 (2018).
Pijuan-Sala, B. et al. Single-cell chromatin accessibility maps reveal regulatory programs driving early mouse organogenesis. Nat. Cell Biol. 22, 487–497 (2020).
Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Wichterle, H., Lieberam, I., Porter, J. A. & Jessell, T. M. Directed differentiation of embryonic stem cells into motor neurons. Cell 110, 385–397 (2002).
Marson, A. et al. Wnt signaling promotes reprogramming of somatic cells to pluripotency. Cell Stem Cell 3, 132–135 (2008).
Ichida, J. K. et al. A small-molecule inhibitor of Tgf-β signaling replaces Sox2 in reprogramming by inducing Nanog. Cell Stem Cell 5, 491–503 (2009).
Oh, Y. & Jang, J. Directed differentiation of pluripotent stem cells by transcription factors. Mol. Cells 42, 200–209 (2019).
Mazzoni, E. O. et al. Synergistic binding of transcription factors to cell-specific enhancers programs motor neuron identity. Nat. Neurosci. 16, 1219–1227 (2013).
Takahashi, K. & Yamanaka, S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663–676 (2006).
Rackham, O. J. L. et al. A predictive computational framework for direct reprogramming between human cell types. Nat. Genet. 48, 331–335 (2016).
Heinäniemi, M. et al. Gene-pair expression signatures reveal lineage control. Nat. Methods 10, 577–583 (2013).
Roost, M. S. et al. KeyGenes, a tool to probe tissue differentiation using a human fetal transcriptional atlas. Stem Cell Rep. 4, 1112–1124 (2015).
Lang, A. H., Li, H., Collins, J. J. & Mehta, P. Epigenetic landscapes explain partially reprogrammed cells and identify key reprogramming genes. PLoS Comput. Biol. 10, e1003734 (2014).
D’Alessio, A. C. et al. A systematic approach to identify candidate transcription factors that control cell identity. Stem Cell Rep. 5, 763–775 (2015).
Sharma, N. et al. The emergence of transcriptional identity in somatosensory neurons. Nature 577, 392–398 (2020).
Morris, S. A. et al. Dissecting engineered cell types and enhancing cell fate conversion via CellNet. Cell 158, 889–902 (2014).
Cahan, P. et al. CellNet: network biology applied to stem cell engineering. Cell 158, 903–915 (2014).
Radley, A. H. et al. Assessment of engineered cells using CellNet and RNA-seq. Nat. Protoc. 12, 1089–1102 (2017).
Bonneau, R. et al. The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol. 7, R36 (2006).
Miraldi, E. R. et al. Leveraging chromatin accessibility for transcriptional regulatory network inference in T Helper 17 Cells. Genome Res. 29, 449–463 (2019).
Tuncbag, N. et al. Network-based interpretation of diverse high-throughput datasets through the omics integrator software package. PLoS Comput. Biol. 12, e1004879 (2016).
Kedaigle, A. J. & Fraenkel, E. in Cancer Systems Biology (ed. Stechow, L.) 13–26 (Springer, 2018).
Leng, N. et al. EBseq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29, 1035–1043 (2013).
Machanick, P. & Bailey, T. L. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27, 1696–1697 (2011).
Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 1653–1659 (2011).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Guo, Y., Tian, K., Zeng, H., Guo, X. & Gifford, D. K. A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction. Genome Res. 28, 891–900 (2018).
Hammelman, J., Krismer, K., Banerjee, B., Gifford, D. K. & Sherwood, R. I. Identification of determinants of differential chromatin accessibility through a massively parallel genome-integrated reporter assay. Genome Res. 30, 1468–1480 (2020).
Hammelman, J. & Gifford, D. K. Discovering differential genome sequence activity with interpretable and efficient deep learning. PLoS Comput. Biol. 17, e1009282 (2021).
Berest, I. et al. Quantification of differential transcription factor activity and multiomics-based classification into activators and repressors: diffTF. Cell Rep. 29, 3147–3159 (2019).
De Dieuleveult, M. et al. Genome-wide nucleosome specificity and function of chromatin remodellers in ES cells. Nature 530, 113–116 (2016).
Ferrari, F. et al. DOT1L-mediated murine neuronal differentiation associates with H3K79me2 accumulation and preserves SOX2-enhancer accessibility. Nat. Commun. 11, 5200 (2020).
Cheloufi, S. et al. The histone chaperone CAF-1 safeguards somatic cell identity. Nature 528, 218–224 (2015).
Ramachandran, K. et al. Dynamic enhancers control skeletal muscle identity and reprogramming. PLoS Biol. 17, e3000467 (2019).
Quaife-Ryan, G. A. et al. Multicellular transcriptional analysis of mammalian heart regeneration. Circulation 136, 1123–1139 (2017).
Lawlor, N., Youn, A., Kursawe, R., Ucar, D. & Stitzel, M. L. Alpha TC1 and Beta-TC-6 genomic profiling uncovers both shared and distinct transcriptional regulatory features with their primary islet counterparts. Sci. Rep. 7, 11959 (2017).
McClymont, S. A. et al. Parkinson-associated SNCA enhancer variants revealed by open chromatin in mouse dopamine neurons. Am. J. Hum. Genet. 103, 874–892 (2018).
Closser, M. et al. An expansion of the non-coding genome and its regulatory potential underlies vertebrate neuronal diversity. Neuron 110, 70–85.e6 (2022).
Cernilogar, F. M. et al. Pre-marked chromatin and transcription factor co-binding shape the pioneering activity of Foxa2. Nucleic Acids Res. 47, 9069–9086 (2019).
Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
Frey, B. J. & Dueck, D. Clustering by passing messages between data points. Science 315, 972–976 (2007).
Zhang, Y. et al. Model-based analysis of ChIP-seq (MACS). Genome Biol. 9, R137 (2008).
Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120 (2012).
Fu, S. et al. Differential analysis of chromatin accessibility and histone modifications for predicting mouse developmental enhancers. Nucleic Acids Res. 46, 11184–11201 (2018).
Wamstad, J. A., Wang, X., Demuren, O. O. & Boyer, L. A. Distal enhancers: new insights into heart development and disease. Trends Cell Biol. 24, 294–302 (2014).
Soufi, A. et al. Pioneer transcription factors target partial DNA motifs on nucleosomes to initiate reprogramming. Cell 161, 555–568 (2015).
Yamamizu, K. et al. Identification of transcription factors for lineage-specific ESC differentiation. Stem Cell Rep. 1, 545–559 (2013).
Simeonov, K. P. & Uppal, H. Direct reprogramming of human fibroblasts to hepatocyte-like cells by synthetic modified mRNAs. PLoS ONE 9, e100134 (2014).
Bai, F. et al. Directed differentiation of embryonic stem cells into cardiomyocytes by bacterial injection of defined transcription factors. Sci. Rep. 5, 15014 (2015).
Jin, Y. et al. Enhanced differentiation of human pluripotent stem cells into cardiomyocytes by bacteria-mediated transcription factors delivery. PLoS ONE 13, e0194895 (2018).
Pistocchi, A. et al. Conserved and divergent functions of NFIX in skeletal muscle development during vertebrate evolution. Development 140, 1528–1536 (2013).
Messina, G. et al. NFIX regulates fetal-specific transcription in developing skeletal muscle. Cell 140, 554–566 (2010).
De Vas, M. G. et al. Hnf1b controls pancreas morphogenesis and the generation of Ngn3+ endocrine progenitors. Development 142, 871–882 (2015).
Ait-Lounis, A. et al. The transcription factor Rfx3 regulates beta-cell differentiation, function, and glucokinase expression. Diabetes 59, 1674–1685 (2010).
Piccand, J. et al. Rfx6 maintains the functional identity of adult pancreatic β cells. Cell Rep. 9, 2219–2232 (2014).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Koo, P. K., Anand, P., Paul, S. B. & Eddy, S. R. Inferring sequence-structure preferences of RNA-binding proteins with convolutional residual networks. Preprint at bioRxiv 418459 (2018).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Kim, D. et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat. Genet. 53, 1564–1576 (2021).
Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).
Minnoye, L. et al. Cross-species analysis of enhancer logic using deep learning. Genome Res. 30, 1815–1834 (2020).
Jung, S., Appleton, E., Ali, M., Church, G. M. & del Sol, A. A computer-guided design tool to increase the efficiency of cellular conversions. Nat. Commun. 12, 1659 (2021).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Liu, Y. et al. CRISPR activation screens systematically identify factors that drive neuronal fate and reprogramming. Cell Stem Cell 23, 758–771 (2018).
Yang, J. et al. Genome-scale CRISPRa screen identifies novel factors for cellular reprogramming. Stem Cell Rep. 12, 757–771 (2019).
Black, J. B. et al. Master regulators and cofactors of human neuronal cell fate specification identified by CRISPR gene activation screens. Cell Rep. 33, 108460 (2020).
Genga, R. M. J. et al. Single-cell RNA-sequencing-based CRISPRi screening resolves molecular drivers of early human endoderm development. Cell Rep. 27, 708–718.e10 (2019).
Ng, A. H. M. et al. A comprehensive library of human transcription factors for cell fate engineering. Nat. Biotechnol. 39, 510–519 (2020).
Nakatake, Y. et al. Generation and profiling of 2,135 human ESC lines for the systematic analyses of cell states perturbed by inducing single transcription factors. Cell Rep. 31, 107655 (2020).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv:1303.3997 (2013).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 17, 10–12 (2011).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
Dobin, A. & Gingeras, T. R. Mapping RNA‐seq reads with STAR. Curr. Protoc. Bioinforma. 51, 11–14 (2015).
Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).
Ambrosini, G., Groux, R. & Bucher, P. PWMScan: a fast tool for scanning entire genomes with a position-specific weight matrix. Bioinformatics 34, 2483–2484 (2018).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Waskom, M. L. Seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).
Grüning, B. et al. Practical computational reproducibility in the life sciences. Cell Syst. 6, 631–635 (2018).
We thank members of the Gifford and Wichterle laboratories for helpful discussions. We gratefully acknowledge funding from 1RO1HG008363 (D.G.), 1R01HG008754 (D.G.), 1R01NS109217 (D.G. and H.W.), R01NS116141 (H.W.), NINDS Postdoctoral NRSA Fellowship (F32NS105372) (T.P.), Brain Initiative K99 (1K99NS121136) (T.P.) and National Science Foundation Graduate Research Fellowship (1122374) (J.H.).
The authors declare no competing interests.
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Madhura Mukhopadhyay was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
a, HOCOMOCO v11 mouse transcription factor core motif database is used as input. Motif PWM similarity to the HOCOMOCO database is computed using Tomtom. b, For each pair of motifs, Pearson correlation between Tomtom scores is computed, resulting in a symmetric correlation matrix. Affinity propagation clustering is applied to the correlation matrix, resulting in 107 clusters of transcription factor motifs with one motif being selected as the representative motif of the cluster. c, Cluster representing OCT/SOX heterodimer-like motifs with SOX2 motif selected as the representative. d, Cluster representing LIM-like motifs with LHX3 motif selected as the representative.
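The clustering steps in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' script: it assumes the Tomtom results have already been parsed into a numeric matrix with one row of similarity scores per motif, and it uses scikit-learn's `AffinityPropagation` on the precomputed correlation matrix. The function name `cluster_motifs` and the toy scores are hypothetical.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def cluster_motifs(tomtom_scores, random_state=0):
    """Cluster motifs from a (motifs x database motifs) similarity score matrix.

    Returns the motif-by-motif Pearson correlation matrix, one cluster label
    per motif, and the row indices of the exemplar (representative) motifs.
    """
    scores = np.asarray(tomtom_scores, dtype=float)
    # Pearson correlation between the score profiles (rows) of each motif pair;
    # the result is a symmetric matrix.
    corr = np.corrcoef(scores)
    # Affinity propagation on the precomputed correlation matrix. Each
    # cluster's exemplar plays the role of the representative motif.
    model = AffinityPropagation(affinity="precomputed", random_state=random_state)
    model.fit(corr)
    return corr, model.labels_, model.cluster_centers_indices_


# Hypothetical toy scores: motifs 0-2 resemble one family, motifs 3-5 another.
toy_scores = [
    [9.0, 8.0, 7.5, 0.5, 0.2, 0.1],
    [8.5, 9.0, 7.0, 0.3, 0.4, 0.2],
    [7.0, 7.5, 9.0, 0.1, 0.3, 0.5],
    [0.2, 0.4, 0.1, 9.0, 8.0, 7.0],
    [0.5, 0.1, 0.3, 8.0, 9.0, 7.5],
    [0.3, 0.2, 0.6, 7.5, 7.0, 9.0],
]
corr, labels, exemplars = cluster_motifs(toy_scores)
print("cluster labels:", labels)
print("exemplar rows:", exemplars)
```

The number of clusters is not set in advance; affinity propagation chooses it from the similarities, which is consistent with the caption reporting the cluster count as a result.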
Extended Data Fig. 2 Comparing input features and methods for transcription factor recovery from chromatin accessibility data.
a, Effects on reprogramming factor recovery, estimated by linear models, for each decision axis in the input to chromatin methods, measured as AURC of the top 100 ranked factor motifs. Stem cell reprogramming factors are excluded from prediction in order to estimate the effect of using fibroblast or stem cell as the source cell type. The other axes include selection of cell type-specific regions, or selection of top regions without eliminating regions that are accessible in the source cell type. b, Cell type AURC for the top 100 ranked factor motifs, stratified by decision axis and marginalized over the other axes. Box plots show median and quartile values. Whiskers extend to represent the rest of the data distribution, with the exception of outliers, which are defined as values greater than 1.5 times the inter-quartile range and are plotted as individual points.
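The outlier rule for the box plots can be illustrated with a short sketch. It assumes the common convention, used by default in plotting libraries such as seaborn, that a point is an outlier when it lies more than 1.5 times the inter-quartile range below the first quartile or above the third quartile. The caption does not spell out this detail, and the function name `tukey_outliers` is hypothetical.

```python
import statistics


def tukey_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    # "inclusive" quartiles use linear interpolation between data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - whisker * iqr
    high = q3 + whisker * iqr
    # Points outside [low, high] would be drawn individually on the box plot.
    return [v for v in values if v < low or v > high]


# Hypothetical values; only the last one falls outside the whisker range.
print(tukey_outliers([1, 2, 3, 4, 5, 100]))
```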
Extended Data Fig. 3 Comparing chromatin accessibility overlapping histone mark and EP300 epigenomic data for transcription factor recovery.
AURC for the top 100 ranked factor motifs in liver, per method, using chromatin accessibility (ATAC-seq) and the overlap of chromatin accessibility with H3K27ac, EP300, H3K4me1, H3K4me3, or three enhancer markers (EP300, H3K27ac and H3K4me1). For DREME, HOMER and KMAC, performance is worst using ATAC + H3K4me3, a mark that is correlated with promoter activity. For all methods, performance is similar with ATAC, ATAC + H3K27ac and ATAC + H3K4me1, the latter two of which mark enhancers.
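The "ATAC + mark" inputs can be thought of as accessible regions that overlap at least one peak of the given mark. The sketch below shows that overlap test on plain Python tuples. It assumes BED-style half-open coordinates and is only an illustration of the idea, not the pipeline used in the paper; the function names and peak coordinates are hypothetical.

```python
def intervals_overlap(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def peaks_overlapping_mark(atac_peaks, mark_peaks):
    """Keep the accessible regions that overlap at least one mark peak."""
    return [
        peak
        for peak in atac_peaks
        if any(intervals_overlap(peak, mark) for mark in mark_peaks)
    ]


# Hypothetical peaks: only the first ATAC peak overlaps a mark peak.
atac = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
h3k27ac = [("chr1", 150, 250), ("chr2", 300, 400)]
print(peaks_overlapping_mark(atac, h3k27ac))
```

The nested scan is quadratic in the number of peaks, which is fine for a toy example; genome-scale peak sets would normally be sorted first or handled with a dedicated interval tool.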
Chromatin methods use optimal input features for each background.
a, Normalized area under the rank recall curve for the top 10 ranked motifs, averaged over cell types. b, Scatter plot of normalized area under the rank recall curve for fibroblast (x-axis) and stem cell (y-axis). Each dot represents the normalized area under the rank recall curve for the top 10 ranked motifs for one cell type and one method, and color represents the method. c, Scatter plot as in b, with each dot representing the normalized area under the rank recall curve for the top 100 ranked motifs.
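One plausible way to compute a normalized area under the rank recall curve for the top k ranked candidates is sketched below. The exact definition is given in the paper's Methods, not in this caption, so the normalization here (dividing the summed recall by k, so that a score of 1 means every known factor is ranked first) is an assumption, as are the function names and the placeholder factor names.

```python
def rank_recall_curve(ranked_candidates, known_factors, k):
    """Recall of known factors at each rank cutoff 1..k."""
    known = set(known_factors)
    if not known:
        raise ValueError("at least one known factor is required")
    found = 0
    curve = []
    # Assumes each candidate appears only once in the ranking.
    for candidate in ranked_candidates[:k]:
        if candidate in known:
            found += 1
        curve.append(found / len(known))
    # If fewer than k candidates were ranked, recall stays flat afterwards.
    last = curve[-1] if curve else 0.0
    curve.extend([last] * (k - len(curve)))
    return curve


def normalized_aurc(ranked_candidates, known_factors, k):
    """Area under the rank recall curve, scaled to the range 0..1."""
    return sum(rank_recall_curve(ranked_candidates, known_factors, k)) / k


# Hypothetical ranking with two known factors found at ranks 1 and 3.
ranking = ["TF_A", "TF_B", "TF_C", "TF_D"]
print(normalized_aurc(ranking, {"TF_A", "TF_C"}, k=4))
```

Under this definition a method is rewarded both for recovering known factors and for placing them early in the ranking, which matches how the captions compare top 10 and top 100 cutoffs.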
Extended Data Fig. 5 GarNet distance thresholds do not majorly impact performance for transcription factor recovery.
Fraction of reprogramming factors over eight target cell types recovered by GarNet using 2 kb, 10 kb and 100 kb thresholds for the maximum distance between a transcription factor binding site and the gene transcription start site.
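The role of the distance threshold can be illustrated with a small sketch that links a predicted binding site to a gene only when the site lies within a maximum distance of the gene's transcription start site. This is not GarNet's interface; the function name, data layout and coordinates are hypothetical and serve only to show how the 2 kb, 10 kb and 100 kb settings change which site-gene pairs are considered.

```python
def link_sites_to_genes(binding_sites, tss_by_gene, max_distance):
    """Pair each binding site with genes whose TSS is within max_distance bp.

    binding_sites: list of (factor, chrom, position) tuples.
    tss_by_gene: dict mapping gene name to (chrom, tss_position).
    """
    links = []
    for factor, chrom, position in binding_sites:
        for gene, (gene_chrom, tss) in tss_by_gene.items():
            # Only sites on the same chromosome and close enough are linked.
            if chrom == gene_chrom and abs(position - tss) <= max_distance:
                links.append((factor, gene))
    return links


# Hypothetical sites 1.5 kb and 57.5 kb away from the same TSS.
sites = [("TF_A", "chr1", 1000), ("TF_B", "chr1", 60000)]
tss = {"GeneX": ("chr1", 2500)}
for threshold in (2_000, 10_000, 100_000):
    print(threshold, link_sites_to_genes(sites, tss, threshold))
```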
Decision chart for performing optimal reprogramming factor recovery given chromatin accessibility data for a desired target cell type.
Statistical source data.
Cite this article
Hammelman, J., Patel, T., Closser, M. et al. Ranking reprogramming factors for cell differentiation. Nat Methods 19, 812–822 (2022). https://doi.org/10.1038/s41592-022-01522-2