Transcriptional synergy as an emergent property defining cell subpopulation identity enables population shift

Single-cell RNA sequencing allows defining molecularly distinct cell subpopulations. However, the identification of specific sets of transcription factors (TFs) that define the identity of these subpopulations remains a challenge. Here we propose that subpopulation identity emerges from the synergistic activity of multiple TFs. Based on this concept, we develop a computational platform (TransSyn) for identifying synergistic transcriptional cores that determine cell subpopulation identities. TransSyn leverages single-cell RNA-seq data, and performs a dynamic search for an optimal synergistic transcriptional core using an information theoretic measure of synergy. A large-scale TransSyn analysis identifies transcriptional cores for 186 subpopulations, and predicts identity conversion TFs between 3786 pairs of cell subpopulations. Finally, TransSyn predictions enable experimental conversion of human hindbrain neuroepithelial cells into medial floor plate midbrain progenitors, capable of rapidly differentiating into dopaminergic neurons. Thus, TransSyn can facilitate designing strategies for conversion of cell subpopulation identities with potential applications in regenerative medicine.

R ecent advances in single-cell RNA-seq technologies have allowed to classify cells into distinct cell subpopulations based on their gene expression profiles. The identity of these cell subpopulations can range from well-defined cell types, subtypes of a same cell type to cells with unclear characters. It has been observed that a handful of specific TFs is sufficient to maintain cell subpopulation identity 1 . Identification of such core TFs can facilitate the characterization and conversion of any cell subpopulation, including rare and previously unknown ones, opening thus novel functional applications 2 . However, this is a challenge since the core TFs that determine the identity of such novel cell subpopulations are largely unknown. Importantly, the definition of identity TFs is dependent on the cellular context in which it is employed 3 . In the context of cell/tissue types, for example between neurons and hepatocytes, the identity TFs are defined by the comparison between these largely different cell types. However, in the context of cell subpopulations within a cell type, such as different subtypes of dopaminergic neurons 4 , the definition of identity TFs becomes subtler due to the increased commonality between them.
Existing methods for identifying TFs for cell identity or cellular conversions 5-7 rely on a set of gene expression profiles of bulk cell/tissue types. Consequently, the application of these methods is limited to those bulk cell/tissue types, and cannot be applied to novel subpopulations of cells identified in a newly generated single-cell dataset. In addition, these methods detect potential identity TFs by focusing on properties of individual TFs, such as gene expression levels or the number of their unique target genes, rather than emergent properties of potential identity TFs themselves, such as transcriptional synergy among them.
Combinatorial binding of specific TFs to enhancers is known to result in a synergistic activity essential for robust and specific transcriptional programmes during development 8 . The functionality of several TFs operating together to achieve a common output has been studied in detail in embryonic stem cells (ESCs), where a transcriptional core involving Pou5f1, Sox2, and Nanog controls pluripotency 9 . Furthermore, it has been observed in different systems that multiple TFs are required to function cooperatively to sustain the overall cellular phenotype 10 .
Here, we propose the general concept that cell subpopulation identity is an emergent property arising from a synergistic activity of multiple TFs that stabilizes their gene expression levels. Based on this concept, we develop a computational platform, TransSyn, for the identification of synergistic transcriptional cores defining cell subpopulation identities. TransSyn does not depend on the inference of gene regulatory networks (GRNs), which are often incomplete and their topological characteristics not always capture the multiple direct and indirect interactions between genes. In addition, it only requires a single-cell RNA-seq data of distinct subpopulations as input (Fig. 1a), and does not depend on pre-compiled gene expression datasets or any other prior knowledge. Consequently, TransSyn infers subpopulation identities within a cell population, and aids in designing strategies to convert cell subpopulation identities, especially in cases of closely related subpopulations in functionally different states. Finally, as a direct application of TransSyn, we show that the knowledge of cell subpopulation-specific synergistic transcriptional cores enables experimental conversion of human hindbrain neuroepithelial cells into medial floor plate midbrain progenitors, which rapidly differentiate into DA neurons. Thus, TransSyn can facilitate conversion of cell subpopulation identities with potential applications in regenerative medicine.

Results
Rationale and outline of the method. TransSyn identifies a specific combination of TFs that are most frequently expressed and exhibit high transcriptional synergy computed by multivariate mutual information (MMI) 11 . MMI measures the information (i.e., predictability) gained by an additional variable (TF), which cannot be explained by the simple summation of the information given by the subsets of variables. For example, MMI among three TFs, X, Y, and Z, is defined as: This indicates that when MMI is negative, the three TFs are synergistically interacting with each other, because the knowledge of both X and Y together (i.e., I(X, Y; Z)) provides more information about Z than the sum of the knowledges given by X and Y separately (i.e., I(X; Z)+I(Y; Z)) (Fig. 1b). The same principle applies to MMI with higher numbers of variables. In this way, TransSyn considers all possible direct and indirect regulatory interactions that can be measured by gene expression. Therefore, it can account for the disparate nature of synergistic transcriptional regulation, including combinatorial/cooperative binding of TFs to target gene promoter/enhancer regions 8 , and proteinprotein interactions among transcriptional co-factors. TransSyn requires single-cell RNA-seq data for MMI computation. Ideally, MMI for all possible combinations of TFs should be calculated to identify the most synergistic TF combination. However, such computation is infeasible (for example, the number of all combinations of 3, 4, 5, and 6 TFs among 100 TFs already adds up to 1, 271, 422, 845). Therefore, we implemented a dynamic search algorithm, in which an initial set of most synergistic 3-TF combinations (seed combinations) are progressively extended by adding TFs one by one as long as MMI calculated for the new combination exceeds the MMI of the previous seed combination ( Fig. 1c; Supplementary Fig. 1) (see Methods). The search is terminated when the addition of a new TF results in no further decrease in MMI, and the current TF combination exhibiting the least MMI (i.e., most synergistic) is considered the synergistic transcriptional core. Upon termination, if more than one TF combination exhibits the highest synergy, they are ranked by another information theoretic measurement, total correlation (TC), which, unlike MMI, incorporates interactions between all possible combinations of TFs within each core providing a measure of interaction strength 12 .
TransSyn captures known synergistic transcriptional cores. By applying TransSyn to a large compilation of published single-cell RNA-seq data, we created a catalog of synergistic transcriptional cores specific to 186 cell subpopulations (Supplementary Data 1). Here, by subpopulations we mean distinct groups of cells within a heterogeneous cell population identified based on their gene expression profiles, and do not discriminate between well-defined cell types, subtypes of a same cell type and cells with unclear identity. The predicted synergistic transcriptional cores, when evidence is available, consistently contained TFs known to maintain the respective cell subpopulation identities. For example, the key pluripotency factors POU5F1, NANOG, and SOX2 that maintain the ESC phenotype were found as the most synergistic transcriptional core in hESCs (Table 1; Supplementary Data 1). Notably, these TFs have been speculated to act synergistically via large clusters of enhancers 13 . Another example is the blood progenitor subpopulation 14 that contained Tal1, Gata2, Runx1, and Fli1 in its synergistic transcriptional core (Table 1; Supplementary Data 1). These TFs have been shown to form complexes via protein-protein interactions that stabilize their cooperative binding to DNA and synergistically control the subpopulation identity 15 . Therefore, this represents another known example where a synergistic interaction of TFs defines a cell subpopulation identity. Finally, the synergistic core of human fetal oculomotor and trochlear nucleus (hOMTN) subpopulation consisted of ISL1 and PHOX2A (Supplementary Data 1), which have been shown to synergistically specify cranial motor neurons from mESCs 16 .
TransSyn predictions also contained several TFs known to interact with each other to control cell subpopulation identities. For example, Gata1, Gfi1b Klf1, and Ikzf1, known to maintain embryonic blood cells 17,18 were found in the synergistic transcriptional core of the embryonic primitive erythrocyte subpopulation 14 (Supplementary Data 1). Gata1 and Ikzf1 are known to functionally regulate each other. In addition, the synergistic transcriptional core of the embryonic visceral endoderm subpopulation 14 included Eomes, Otx2, Zic3, Foxa2, and Hnf4a (Table 1; Supplementary Data 1), which are known to regulate each other and other downstream targets specific to this cell subpopulation 19,20 . Id3, Klf13, Klf6, and Klf4 are known for their roles in the acquisition of vascular endothelial cell fate, whose synergistic transcriptional core contained these TFs 21,22 . The synergistic core of the mouse enteroendocrine cell contained Neurog3, Neurog1, Insm1, Nkx2-2, Foxa1, Foxa2, Pax4, and Lmx1a, all of which are known to be essential for the functioning of the cell [23][24][25][26][27][28][29] . We also examined the synergistic transcriptional core of the human subpopulations for which only mouse functional data is available, such as hProgFPM and hDA2  neurons thought to give rise to substantia nigra DA neurons postnatally 4 . The synergistic transcriptional core of hDA2 neurons identified NR4A2, a nuclear receptor that controls mDA neuron identity and survival in mice 30 , and FOXA1, a TF that together with FOXA2 contros mDA identity and neurogenesis in mice 31 . Finally, the hProgFPM synergistic core included TFs previously identified in the mouse midbrain floor plate and important for mDA neuron development in mice, such as FOXA2, OTX2, LMX1A 10, 32-34 which have not been previously recognized as the core of hProgFPM ( Evaluation of TransSyn performance. For an unbiased assessment of TransSyn performance, we calculated the percentage of cell subpopulations where at least one predicted TF has previously been experimentally validated to define the identity of that cell subpopulation. This showed that TransSyn could predict at least one such TF for 85 % of the cell subpopulations, for which at least one experimentally validated TF is known. We followed this criterion since the current knowledge of experimentally validated TFs is not complete (i.e., previously not tested) and includes TFs which are not classified as identity TFs according to our definition. The compiled list of TFs known to maintain cell subpopulation identities is shown in (Supplementary Data 2).
Importantly, we observed that pair-wise mutual information (MI) was not able to capture all the interactions among TFs in synergistic cores, supporting that these TFs interact synergistically rather than pair-wise (Supplementary Data 3). For example, this was observed in the case of interaction between the three plutipotency TFs (NANOG, POU5F1, and SOX2) in hESC, and Runx1, Fli1, Gata2, and Tal1 in the blood progenitor subpopulation described above, due to the multifactorial nature of the transcriptional regulatory mechanism. On the contrary, a set of TFs exhibiting pair-wise interactions among themselves does not necessarily display a multiple synergistic interaction, and therefore will not represent a synergistic transcriptional core. To show this, we performed a topological analysis of subpopulation specific GRNs inferred from pair-wise co-expression to identify top 10 subpopulation-specific hubs that could potentially be TFs that define subpopulation identities. Results showed that only a few known TFs were recovered as unique hubs (Table 2; Supplementary Data 4), indicating that transcriptional synergy is more suited for unraveling TFs that define subpopulation identities.
Next, we compared the performance of TransSyn to a method for identifying candidate identity TFs for bulk cell/tissue types using Jensen-Shannon Divergence (JSD) 7 . Since JSD was computed from bulk microarray data in this earlier study, we computed JSD using the average single-cell gene expression in each cell subpopulation. Results showed that in general, JSD predicted at least one TF in 33 % cell subpopulations in contrast to 85% achieved by TransSyn (Table 1A; Supplementary Data 5).  Table 1). The reprogramming factors identified by the latter for ESCs, NSCs, pancreatic mast cells and endothelial cells contained known identity TFs, whereas the factors for neurons from NSCs and between lung fibroblast and bronchial epithelial cells did not contain any known TF.
Experimental validation of predicted identity TFs. Finally, to demonstrate the usefulness of TransSyn, we carried out an experiment to shift the identity of hidbrain hNES cell line (SAI2) 35 to that of midbrain hProgFPM cells 4 . We first generated singlecell RNA-seq data of hNES cells, and found that its synergistic transcriptional core is quite different from that of hProgFPM cells (Fig. 2a). Analysis of the TFs required to convert hNES cells into hProgFPM cells identified OTX2, LMX1A, and FOXA2 (Fig. 2a). Since OTX2 is known to induce LMX1A 36 , the conversion was performed by inducing the expression of the other two TFs, OTX2, and FOXA2. This was achieved by treating hNES cells during proliferation (FGF2+EGF) with two factors: (i) The small molecule smoothened agonist (SAG, 500 nM), which directly activates Shh signaling 37 and induces FOXA2 38 . (ii) The Wnt antagonist Dickoppf1 (Dkk1, 150 ng/ml), to reduce Wnt/β-catenin signaling to the levels required to induce OTX2 10 and midbrain dopamine neuron development 39 (Fig. 2b). Our expectation was that SAG would ventralize hNES cells and change their basolateral identity 35 into floor plate cells expressing FOXA2; and Dkk1 would anteriorize hindbrain cells expressing GBX2 into midbrain cells expressing OTX2 40 . Our results show that treatment of proliferating hNES cells with SAG and Dkk1 did not change the levels of the common midbrain-hindbrain TFs engrailed1 (EN1) and PAX2 (Fig. 2c, d), but increased the ratio of OTX2:GBX2 expression (Fig. 2e), indicating efficient anteriorization and acquisition of midbrain identity. In addition, we also observed increased levels of FOXA2 (Fig. 2f) and decreased levels of lateral genes, such as PAX6 and IRX3 (Fig. 2g, h), indicative of efficient ventralization. These results were also confirmed by immunohistochemistry, which showed increased numbers of OTX2-positive cells (Fig. 2i).
To further confirm that the identity of the hNES cells had become that of hProgFPM cells, we tested their function, as assessed by their capacity to induce the expression of LMX1A at a later time-point and differentiate into midbrain DA neurons, reasoning that cells with the correct identity will be more efficient at generating DA neurons than the parental cells. Differentiation involved the removal of mitogens (FGF2 and EGF), as well as treatment with well-know midbrain patterning and differentiation factors such as Shh, Wnt5a, BDNF, GDNF, TGFβ3, and Wnt5a (reviewed in ref. 40 ). In addition, we tested whether treatment with FGF8, a factor that was recently found to improve midbrain patterning and differentiation in human ES cells 41 was capable of further improving our protocol (Fig. 3a). We found that while both protocols strongly increased OTX2 and decreased GBX2 expression, only the protocol without FGF8 significantly increased LMX1A expression at day 8, as assessed by RT-qPCR (Fig. 3b). Similarly, both protocols increased the levels of NR4A2 and SLC6A3, but TH expression was only significantly increased by the SAG and Dkk1 protocol (Fig. 3c). Accordingly, while control unconverted cells were only capable of giving rise to rare and weak TH+ cells, cells differentiated after SAG and Dkk1 treatment gave rise to abundant TH+ neurons (Fig. 3d), and significant increased in the number of OTX2+ cells (Fig. 3e) and of TH+ cells (Fig. 3f). Moreover, TH+ cells were also LMX1A+, NR4A2+, and PBX1+ (Fig. 3g-i), confirming their midbrain identity 30,33,46 . In addition, TH+ cells expressed the mature neuronal marker MAP2 (Fig. 3j) and some of them were found to acquire a mature neuronal morphology with long processes and varicosities and bipolar morphology, typical of mDA neurons (Fig. 3k). Thus, our results show that by switching the identity of hNES to hProgFPM prior to differentiation, it is possible to rapidly differentiate hNES into DA neurons.

Discussion
In this study we postulated that cell subpopulation identity is determined by TFs that exhibit transcriptional synergy. Based on this proposition, we developed a computational method that dynamically searches for optimal synegistic transcriptional cores using an information theoretic measure of synergy computed from single-cell RNA-seq data. The predicted transcriptional cores recapitulatd known identity TFs in 85% of the tested cases and known synergistic TF interactions that relate to cell identity. Thus, the concept of transcriptional synergy employed in TransSyn represents a novel approach to specifically identifying transcriptional cores defining cell subpopulation identities. Following the experimental validation of the predicted identity transcriptional core of hProgFPM cells, we compiled a list of TFs whose up-/down-regulation may convert one cell subpopulation into another for 3786 pairs of initial and target cell subpopulations (Supplementary Data 6). Further validation of these transcriptional cores will reinforce the generality of the method. Importantly, unlike previously introduced methods, TransSyn does not require pre-compiled reference single-cell datasets, which are unavailable for newly identified cell subpopulations. In addition, TransSyn does not rely on GRN inference and analysis, which could be a bottleneck for accurate predictions of identity transcriptional cores. In summary, such unbiased identification of synergistic transcriptional cores may facilitate the development of general strategies for cell subpopulation conversions, opening up novel functional applications in regenerative medicine, such as the generation of DA neurons for Parkinson's disease.

Methods
Single-cell RNA-seq data. Single-cell RNA-seq data used in this study were obtained for the following biological systems; the mouse datasets for lung, striatum, cortex, and hippocampus, quiescent, and active NSCs, intestine, circulating pancreatic tumor cells, hair follicles, and gastrulating embryo, and the human datasets   Fig. 2 Conversion of basal hNES cells into medial floor plate midbrain progenitors (hProgFPM) by treatment of proliferating hNES for 2 days with the smoothen agonist (SAG, 500 nM) and Dickoppf1 (Dkk1, 150 ng/ml). a Transcription factors forming the synergistic transcriptional core of hNES cells and required for their conversion to midbrain progenitors. b Schematic representation of the treatment followed to convert proliferating hNES into HProgFPM, which included SAG, smoothen agonist (500 nM) and Dkk1, Dickoppf1 (150 ng/ml). c-f RT-qPCR analysis at day 2 showing the expression of TFs that define cell populations in the midbrain and hindbrain regions: EN1 (c) and PAX2 (d); the ventral midbrain: OTX2:GBX2 ratio (e) and FOXA2 (f), as well as more lateral compartments: PAX6 (g) and IRX3 (h). i Comparison of OTX2+cells in control and SAB+Dkk1-converted NES cells. Scale 50 μm. Box plots (ch): Center line, median; hinges, 25% and 75% quartiles; whiskers, 1.5 interquartile range. Statistics: t-test; *p ≤ 0.05; *p ≤ 0.01. N = 3 for midbrain, CD127+ lymphoid cells, pancreas, ovarian cancer, germline cells, and in vitro hESCs. The reference to each dataset is described in (Supplementary Data 1). We used the same subpopulation classifications defined in the respective original studies. We also analyzed other datasets not listed here, however, they did not have an enough number of expressed TFs in the majority of cell subpopulations, and were therefore discarded. In addition, synergistic transcriptional cores for cell subpopulations that were either "undefined" or with less than three cells were not considered. We did not reprocess each raw data and same gene expression values that were used in the original studies were also used in this study. TFs were considered "expressed" if their expression values were ≧1 in RNA-seq FPKM/ RPKM/TPM values, ≧10 in normalized read counts, or ≧1 in UMI counts. TFs below these thresholds were considered "not expressed". Exceptionally, the expression cutoff of 10 was used for the hESC dataset, since setting it to 1 resulted in too many expressed TFs and the subsequent computation became infeasible.  Identification of most frequently expressed TFs. The definition of TFs was obtained from the AnimalTF database 42 . The fraction of cells expressing each TF was computed in each subpopulation and the top 10% most frequently expressed TFs were shortlisted for further analyses. Among these TFs, we discarded those that were not expressed in more than 30% of cells. For the La Manno et al., 2016, dataset 4 , the binarized expression status estimated in the original study was used. Since this filtering retained many TFs that were expressed at very low intensity, TFs with mean UMI count <1 were further discarded. Since the subsequent computation becomes infeasible for standard desktop computers if the number of TFs is more than 150, in these cases TFs with highest coefficient of variation were discarded to make the number of TFs ≦150.
MMI computation. In each cell subpopulation MMI 11 among the most frequently expressed TFs was computed by: where S ¼ X 1; X 2 ; :::X n n o , T is a subset of S, and T j j indicates the number of variables in this subset. In the current study these variables are discretized gene expression values of TFs. This equation becomes MI if only two variables are considered. In the case of three variables the equation can be written as: To compute Shannon's entropy, gene expression values were first log10transformed. Zero gene expression values were converted into 1 prior to the transformation. This value was then discretized within each cell subpopulation using the Freedman-Diaconis rule implemented in the R nclass.FD function and Shannon's entropy of each TF was computed on these discretized values. The input value for the nclass.FD function was set to the number of cells +1 for FPKM/ RPKM/TPM values, and normalized read counts, while the number of cells +6 was used for UMI counts. The range of gene expression value was set between 0 and maximum value of a given cell subpopulation. Since the bin size for each TF is different, the entropy was normalized by the theoretical maximum entropy (i.e., entropy when all bins contain an equal number of variable) to enable a direct comparison between different TF entropies. The MMI was then computed using all cells in the entire population of a given dataset except for the ones in the subpopulation for which MMI is being computed. As described above, the joint entropy was also normalized by the theoretical maximum entropy prior to MMI computation.
Dynamic search for synergistic transcriptional cores. MMI was first computed for all combinations of three TFs and the top one percent lowest MMI (i.e., synergistic) combinations were taken. Then, these TF combinations were ranked by TC defined as: where S ¼ fX 1; X 2 ; :::X n g. TC measures the interaction strengths (MI) shared among all subsets of the variables within a combination, and is more appropriate for comparing interaction strengths between different combinations than MMI 12 , which measures the information gain from the previous seed combination. Then, top one percent highest TC combinations were used as initial seeds for the subsequent search for higher-level synergistic combinations of TFs. To this end, new TFs were added to each seed combination one by one and MMI for the new combination was computed. Then, the combinations that showed lower (less than 0.05) MMI than the seed were taken to the next iteration. MMI between the new combination and seed was negative, then that new combination was kept. Then, these new, more synergistic TF combinations were again ranked by TC and the top 10 combinations were used as seeds for the identification of best 5 TF combinations next, and so on. This procedure was continued until no new combination is more synergistic than the seed. We also terminated the procedure when the number of TFs reached 15, since continuing with more than this number was often computationally impractical. We think this operation is acceptable, since usually at this point most TFs are shared among different combinations. Once the search is terminated, MMI for all combinations of the top 20 best TC combinations is computed and if there is more synergistic combination(s), then those combinations are ranked by TC as the final synergistic transcriptional cores. If more than one top combination (i.e., ties) is present, they are ranked by the highest summed mean gene expression and the top three combinations were kept as the final synergistic transcriptional cores. For the identification of cell conversion TFs, TFs in the synergistic transcriptional core of a target cell subpopulation were ranked by the mean gene expression fold change between the target cell subpopulation and starting cell subpopulation. The main part of TransSyn was written in C++, which was wrapped in R using the Rcpp package.
MI computation between TFs. Pair-wise MI was computed for TF pairs in transcriptional synergistic cores, in which at least two TF are known to maintain that cell subpopulation. The gene expression values were first log2-transformed and then discretized within each cell subpopulation using the Freedman-Diaconis rule, as described above. Then Shannon's entropy of each TF and joint entropies of each pair of TFs were computed on these discretized values. MI was then computed by: The statistical significance of each edge was computed by a t-test against a null distribution formed by randomizing data 50 times and edges with the top 1% lowest p-value were kept as the final edges. Cultivation of Lt-NES SAI2 cells. In our study we used the Long-term selfrenewing neuroepithelial-like stem cells (Lt-NES) SAI2 line generated from human hindbrain fetal tissue 35 . Mycoplasma-free cells have been kept in proliferation according to previously described protocols 45 , in 6-well plates coated with poly-Lornithine (1:5 in water; Sigma) and laminin (1:500 in water, Invitrogen), using maintenance media based on DMEM F12 Glutamax Medium (GIBCO, Life-Technologies) supplemented with N2 (1:100, GIBCO, LifeTechnologies), B27 (1:1000, GIBCO, LifeTechnologies), and the growth factors hEGF (10 ng/ml, R&D) and FGF2 (10 ng/ml, R&D). To modify the identity of Lt-NES, cells were treated for 48 h with SAG (500 nM, Tocris) and Dkk1 (150 ng/ml, R&D) in the proliferation media. Treated and non-treated cells were compared.