Gene-pair expression signatures reveal lineage control



The distinct cell types of multicellular organisms arise owing to constraints imposed by gene regulatory networks on the collective change of gene expression across the genome, creating self-stabilizing expression states, or attractors. We curated human expression data comprising 166 cell types and 2,602 transcription-regulating genes and developed a data-driven method for identifying putative determinants of cell fate built around the concept of expression reversal of gene pairs, such as those participating in toggle-switch circuits. This approach allows us to organize the cell types into their ontogenic lineage relationships. Our method identifies genes in regulatory circuits that control neuronal fate, pluripotency and blood cell differentiation, and it may be useful for prioritizing candidate factors for direct conversion of cell fate.


Mammalian organisms contain at least 250 cell types1, each with a characteristic gene expression profile. Despite the increasing availability of expression data, comprehensive characterization of cell type–specific expression profiles remains challenging because of inconsistencies in annotations and technical issues such as data normalization. Moreover, common differential expression analyses alone are insufficient to recover ontogenic cell-lineage relationships or to reflect regulatory relationships among transcription factors (TFs) that lead some to function as fate determinants.

We describe a data-driven method that addresses these problems in the context of the very mechanisms by which the gene regulatory networks govern lineage development. Our analysis is motivated by a two-gene circuit motif known to control binary developmental decisions2. This motif, first hypothesized to control developmental switches in Drosophila3,4, contains a pair of mutually repressive TF-encoding genes and effectively constitutes a toggle switch. These circuits allow a bipotent progenitor cell to simultaneously express two opposing TF genes at low levels, existing in the poised state TF1 ≈ TF2 (ref. 2), but force the differentiating cell to choose between either of two stable configurations in which one TF dominates the other, TF1 ≫ TF2 or TF2 ≫ TF1.

Such pairs of antagonistic TFs can govern the development of 'sister' lineages: in addition to cross-inhibiting each other, they act as lineage-specifying master regulators of reciprocally expressed target genes, thus establishing lineage-specific gene expression profiles2. The pair {SPI1, GATA1} is a well-studied example in the hematopoietic system5. SPI1 (PU.1) specifies the myeloid lineage, characterized by SPI1 ≫ GATA1, whereas GATA1 specifies the erythroid lineage, in which GATA1 ≫ SPI1 (ref. 6). Such a lineage split manifests as the establishment of a mutual exclusion of the fate-determining TFs, resulting in their reversed expression, which can be used to identify master regulators. We scored genes for potential participation in such expression reversals. We expected gene pairs that function as lineage determinants to exhibit consistent relative expression across samples from the same cell type (and lineage) and consistent reversal of relative expression between cell types from sister lineages, a property that has been exploited in expression-based classifiers7,8,9.

By applying this method to curated gene expression data from 166 cell types and 2,602 transcription-regulating genes, we showed that experimentally verified master regulators of cell fate were indeed revealed through quantification of their participation in expression reversals. In analyses focused on hematopoiesis, our method revealed known and novel candidate fate-specifying genes. Finally, we derived a cell type similarity measure from expression reversals with which we could recover known ontogenic lineage relationships reminiscent of the branching valleys of the epigenetic landscape envisioned by Waddington10.


Gene expression reversal analysis

We curated a data set comprising 2,919 microarrays and representing 166 normal human cell types (Supplementary Results, Supplementary Tables 1,2,3 and Supplementary Fig. 1) and selected genes with functional annotation related to transcription regulation (Supplementary Results, Supplementary Tables 4 and 5 and Supplementary Fig. 2). A subset formed from strictly defined TF genes will be referred to as the 'TF set' (844 genes). 'TF' will be used to refer to all transcription-regulating genes for simplicity.

For each pair of genes and each pair of cell types, we defined the reversal score Δ to be the difference between cell types of the mean rank difference (within each cell type) between genes (equations (1,2,3) in Online Methods, Fig. 1 and Supplementary Results). Use of rank data rather than absolute expression obviates the need for sample normalization, which is typically required to correct for sample distribution differences (Supplementary Fig. 3), because all direct comparisons between genes happen within samples, and conventional normalization methods are rank preserving. Thus, large absolute values of Δ identify gene pairs that reverse expression between cell types. Δ is clamped to 0 for pairs of genes that do not change relative expression (that is, the difference in their mean ranks does not change sign) between cell types. Fixing the gene pair in Δ and letting the cell types vary (Supplementary Results) produces gene pair–reversal plots that visualize the potential for a gene pair to participate in a lineage split between any pair of cell types (Fig. 1b). Finally, we defined the participation score Ψ for a particular gene (equations (4) and (5) in Online Methods) to be an aggregate measure of the number and strength of reversals in which the gene participates. The quantification of Ψ encompassing all cell type comparisons produces 'reversal participation portraits', which can be examined at the level of the gene or the cell type (Fig. 1c). These quantities offer a new way to explore gene expression profiles across cell types and are made available as an interactive tool online. Source code to perform the analysis is available as Supplementary Software, and updated versions are available upon request.

Figure 1: Gene-pair expression-reversal analysis.

(a) Ranks of two hypothetical genes g and g′ scaled by the total number of genes, plotted from microarray samples assigned to three hypothetical cell types. δ, normalized mean rank difference of two genes. (b) Gene pair–reversal plot. The reversal behavior of n = 3 cell types for the {g, g′} gene pair is shown as an n × n symmetric matrix. The Δ value, indicating the extent of reversal behavior, is represented by the color in the heat map; gray corresponds to Δ = 0, red tones indicate that the configuration changes from g ≫ g′ in the first cell type to g ≪ g′ in the second cell type, and opposite reversals are indicated in blue. (c) Reversal participation. The Ψ value for gene g quantifies reversal participation, a measure of the number and strength of reversals in which the gene participates. The matrix on the left displays a cell portrait in which rows correspond to the reversal participation scores of genes for those pairwise cell type comparisons involving cell type 12 (comparisons to self are indicated in dark blue). The portraits are sorted to reveal highest-scoring genes on top. Alternatively, for assessing reversal participation of a particular gene across all cell types (here, pairwise comparisons of 32 hypothetical cell types) the Ψ values can be visualized as gene portraits (note the corresponding rows in the two matrices: the row showing reversal participation of gene g in cell type 12 on the left matches the row for cell type 12 in the gene portrait shown on the right).

Revealing critical factors for induced pluripotency

We hypothesized that participation of a gene in reversals involving a given cell type is indicative of the specificity of the gene for that cell type as well as its potential to participate in lineage determination. We sorted genes by their participation scores in comparisons of embryonic stem cells (ESCs) with other cell types (Fig. 2a). The genes NANOG, POU5F1 (OCT3 or OCT4), SOX2 and LIN28A appear among the 20 highest-ranking genes; these are precisely the genes that jointly are capable of inducing the pluripotent state from differentiated cells11 (see also Supplementary Fig. 4). A critical role in regulation of stem cell transcription has been reported for 17 of the top 20 genes (Supplementary Table 6). These results are robust with respect to noise and sample size differences (Supplementary Figs. 5, 6, 7 and Supplementary Results).

Figure 2: Reversal participation analysis in ESCs.

(a) First 100 rows (of 2,602 TF genes evaluated) of the ESC portrait; the names of top 20 ESC-specific transcription-regulating genes are indicated (refer to Supplementary Table 3 for the order of cell types in columns). (b) Plots showing gene-reversal portraits, ENCODE12 RNA-seq data (R) and ChIP-seq histone methylation data (C) for the top 20 ESC-specific genes. H3K4me3 data are shown for six ENCODE cell types: human ESC line H1 (H1 ES), breast epithelial cell (HMEC), skeletal muscle myoblast (HSMM), umbilical vein endothelial cell (HUVEC), epithelial keratinocyte (NHEK) and lung fibroblast (NHLF). RNA-seq data are shown for H1 ES, HUVEC and NHEK cells. Ψ, reversal participation score.

We examined previously published sequencing data12 for the transcription start site (TSS) activity chromatin marker trimethylated histone 3 lysine 4 (H3K4me3) (obtained via sequencing of immunoprecipitated chromatin, ChIP-seq) and for transcription (via RNA-seq) in multiple normal human cell types (including ESCs) for the top 20 genes associated with ESCs, as defined by the reversal participation score (Fig. 2b). Genes with a highly ESC-restricted gene portrait appeared ESC specific in both ChIP-seq and RNA-seq results. Furthermore, TF ChIP-seq data suggest that the pluripotency-inducing TFs NANOG, POU5F1 and SOX2 co-occupy regulatory regions of genes that are at the top of the list of ESC-associated genes13 (Supplementary Fig. 8).

Reversals expose genes with lineage-determining potential

In addition to capturing cell type–restricted expression, our data on ESCs suggest that reversal participation may identify TFs with lineage-specifying power that could be used to induce conversion toward a particular cell type. We investigated this possibility in a published reprogramming experiment14. ASCL1 is a critical TF that alone and in combination with other factors was discovered to induce fibroblast-to-neuron conversion14. We generated reversal participation portraits of the 19 genes encoding TFs initially evaluated as candidates for fibroblast-to-neuron conversion and examined the portraits in light of each factor's potency14 in enhancing ASCL1-induced conversion (Fig. 3). Those TFs previously observed to be effective in inducing transdifferentiation showed strong reversal participation signal localized to comparisons involving neuronal cell types. The diffuse patterns in the plots are in agreement with experimental results14 in which those genes showed no effect on ASCL1-induced conversion. Therefore, gene-reversal participation also identifies potential fate-determining roles of a TF in a given lineage.

Figure 3: Reversal participation analysis of a candidate gene set for neuronal specification.

The reversal participation (Ψ) portraits of 19 candidate genes for inducing fibroblast-to-neuron conversion14 are shown. ASCL1 was previously found to be most potent on its own for this induction. The ordering of the portraits reflects previous experimental success with induction of neuronal fate in combination with ASCL1. The gray bar indicates the location (rows) of neuronal cells in each portrait. The combination of genes indicated in bold resulted in the best reprogramming efficiency.

Expression reversals in the hematopoietic lineage splits

To demonstrate how gene pair–reversal analysis can shed light on toggle-switch circuits, we selected three characterized mutual repression circuits involved in blood cell lineage control: {GATA1, SPI1}, {GATA1, GATA2} and {GFI1, EGR2}. These pairs govern the lineage splits between erythroid versus myeloid, erythroid versus megakaryocyte and granulocyte versus macrophage, respectively5,15,16.

For the first lineage split, we observed the SPI1 ≈ GATA1 configuration in progenitor cells, a result consistent with previous data17, whereas a pronounced reversal of their relative expression occurred between the proerythroid and promyeloid cells: GATA1 ≫ SPI1 in all array data from proerythroid cells and GATA1 ≪ SPI1 in all data from promyeloid cells (Supplementary Fig. 9a). Thus, the behavior of this gene pair across all cell types in the comparison set, as seen in the gene-pair plot, highlighted the erythroid-myeloid lineage split (Supplementary Fig. 9b). Similarly, expression of the {GATA1, GATA2} pair was reversed between proerythroid cells and platelets that segregate in a downstream lineage split15 (Supplementary Fig. 9c). Finally, expression of the {GFI1, EGR2} pair was strongly reversed between granulocyte-lineage progenitors and differentiated macrophages. Notably, this pair exhibited a signal of mutual exclusion in the lymphoid lineage, suggesting a broader role in the blood system: that is, the reuse of circuits for different decisions2 (Supplementary Fig. 9d).

Lineage branching is often controlled not just by one toggle-switch circuit but rather by the integrated action of many interconnected18, mutually repressing gene pairs. We evaluated the reversal behavior of all gene pairs in the TF set in an extended set of hematopoietic cell types, attempting to identify pairs that exhibit an expression reversal associated specifically with the erythroid-myeloid lineage split or the B versus T lymphoid lineage split (Online Methods). To increase specificity, we required that the TF gene pairs separating erythroid and myeloid cells be disjoint with the pairs separating lymphoid cells.

We matched the expression reversal pattern expected in these lineage splits, shown in ideal reversal plots (Fig. 4a), against the data to identify pairs of TF genes that were maximally lineage restricted for either the common erythroid-myeloid or lymphoid progenitors and that exhibited minimal reversal outside these cell types. To distinguish these reversals from those obtained by chance in comparisons between irrelevant cell types, we ordered the results of our reversal analysis by the probability of obtaining reversals in the entire 166 × 166 cell type comparison matrix using the hypergeometric distribution. We identified five pairs that fulfilled the erythroid-myeloid reversal pattern (exhibiting at least one reversal with |Δ| > 1) (Fig. 4b), including {GATA1, SPI1} and three pairs that matched the lymphoid pattern (Fig. 4c), each containing GATA3. Interestingly, many of the genes found, including the validated GATA1-SPI1 toggle switch, are known to be part of the core network that controls erythropoiesis, myelopoiesis or lymphopoiesis19,20,21,22,23,24,25,26,27 and have been shown in some cases to engage in mutual interaction5,28,29,30. For comparison, we also used standard rank-based differential expression to identify genes that would be specific for the lineage split examined (Supplementary Results). Although we also obtained several of the same genes, these approaches failed to capture the lineage-differentiating property, as it is not attributable to single genes but rather to pairs of genes (Supplementary Results and Supplementary Tables 7,8,9).

Figure 4: Identification of reversal pairs in lineage splits of the blood system.

(a) The tree shows early splits in the blood lineage. Lineage-determining TF gene pairs at binary splits are expected to follow the reversal pattern shown in the idealized gene pair–reversal plots for pairwise comparisons of the hematopoietic stem cell (HSC), erythroid, myeloid and lymphoid cell types. An ideal pair will also show no reversals for other cell type pairs in the full 166 × 166 cell type comparison matrix. (b,c) Gene pair–reversal plots for cell types from the blood lineage used in ranking (top row) and for all cell type comparisons (bottom row). Pairs of TF genes that show statistically significant restricted reversal in the 166 × 166 cell type data are shown with their P values (hypergeometric test) for the erythroid-myeloid (b) and B-T lymphoid (c) splits. Color in the gene pair–reversal plots is as in Figure 1b and corresponds to the Δ value indicating the extent of reversal.

Several previous experiments support the involvement in the T versus B lymphoid lineage split of the {GATA3, EBF1} gene pair we identified by expression-reversal scoring. Gata3 binding was observed in mouse ChIP-seq data31 near the TSS of Ebf1 but not Spib or Aff3, which led us to focus on the {GATA3, EBF1} pair as the best candidate for the lymphoid lineage split. In agreement with this, Gata3 was shown to be repressed by Ebf1 in a gain-of-function study32. Furthermore, ChIP-seq data from human lymphoblastoid cells12 indicates EBF1 binding near the GATA3 TSS. ChIP-seq data also confirmed the possibility of cross-inhibitory interactions at the level of DNA binding for three of the five putative toggle-switch circuits from the erythroid-myeloid analysis (Supplementary Figs. 10 and 11). Moreover, the observed binding of regulatory factors to their own promoters indicates possible autoregulation, which is proposed to be important for genes that participate in lineage-regulatory toggle circuits for stabilizing the poised progenitor state2,6.

Using multiple independent ChIP-seq data sets (Supplementary Table 10), we examined genome-wide binding of the TFs GATA1, TAL1, SPI1, EBF1 and GATA3, which show evidence of cross-inhibitory interactions with the other TF in each identified pair. Specifically, we performed genomic region enrichment analyses (Online Methods and Supplementary Results) to test whether their binding preferentially occurred in the vicinity of genes associated with specific hematopoietic lineages. Indeed, we found that GATA1 and TAL1 binding were clearly associated with ontology terms related to erythrocyte phenotype and differentiation, SPI1 with the myeloid-macrophage, EBF1 with B cells and GATA3 with T cells (Supplementary Tables 11,12,13,14,15), also matching known TF gene knockout phenotypes in mouse (Supplementary Table 16). Furthermore, each member of the antagonistic pairs was also associated with phenotype terms of the respective sister lineage, which could indicate widespread repressive regulation, beyond that of the antagonistic pair.

Gene-pair reversals reflect lineage relationships

Lineage relationships are often illustrated as a tree because of the developmental genealogy of cell types, but the detailed structure of the actual tree for all cell types in higher metazoa remains unknown. We hypothesized that the number of gene pairs with reversed expression between a pair of cell types is indicative of the relatedness of the cell types. Formalizing this, we defined a similarity measure Φ(X, Y) between two cell types, X and Y, as the count of gene pairs for which |Δ| ≥ 1. We selected well-studied sets of hematopoietic cells and developmentally related endothelial cells to test whether the similarity measure Φ was able to capture known hierarchical lineage relationships. Several precursor cells of these lineages were present in the transcriptome data set we analyzed, permitting the study of branch points. Although traditional hierarchical clustering methods generate dendrograms, cluster labels in dendrograms are placed on the terminal branches (leaves). Thus, such methods cannot reflect the biological lineage tree because all precursors (which exhibit promiscuous gene expression profiles) would necessarily be placed on the leaves rather than on the branch points.

To build this biological intuition into our analysis, we first performed a hierarchical clustering of similarity Φ in differentiated cell types, which we followed by a separate placement of precursor cell types onto the tree branch points, taking Φ into consideration (Online Methods). The resulting dendrogram (Fig. 5a) reflected the well-known hierarchical lineage relationships among these cell types. To facilitate interpretation, we used the similarity Φ of each cell profile to that of the ESC to superimpose an elevation onto the dendrogram. This exposed a key feature of the cell-fate map: we observed that the hematopoietic stem cell and other precursor cell types were more proximal to the ESC than to terminally differentiated cells. The third dimension therefore captured properties of a true differentiation landscape reminiscent of Waddington's metaphoric epigenetic landscape10. We obtained a similar landscape for blood cell types using an independent data set (Supplementary Fig. 12 and Supplementary Table 17).

Figure 5: Lineage relationships based on gene-pair expression reversals.

An evaluation of the utility of the similarity Φ for reflecting lineage separation is shown. (a) Hierarchical clustering of 29 differentiated cell types based on similarity Φ. The circular dendrogram in the xy plane arranges cells into branching lineages. Ten precursor cell types placed at branch points according to the Hungarian algorithm (Online Methods) are indicated. The landscape elevation z represents the Φ similarity to ESCs. Blue color and high altitude on the landscape correspond to large similarity to the pluripotent state. (b,c) To represent all 166 cell types, landscapes as in a are shown with multidimensional scaling for (b) TF genes or (c) metabolic genes41.

To challenge the possibility that the landscape reflects differentiation potential, we extended the clustering to include all 166 cell types (Fig. 5b) and compared it to the landscape results obtained using metabolic genes33 instead of TF genes (Fig. 5c and Supplementary Fig. 13). As the precursors of many cell types are not present in the data set, we used multidimensional scaling to visualize cell type dissimilarities on a plane. After again superimposing elevation using the similarity to ESCs, we observed precursor cell types at elevated locations and a distinct peak for the pluripotent cells in the TF landscape. In contrast, metabolic genes, which are not expected to drive lineage determination, failed to discriminate precursor cells. In the metabolic landscape, these cells resided in a large basin that connects cell types from multiple lineages and differentiation stages.


We show a unique way to analyze cell type–specific gene expression profiles that is connected to the very principles by which gene circuits govern cell type diversification. Using the information in the reversal of expression between pairs of genes encoding TFs in cell type comparisons, we generated 'participation portraits' of cell types that identified TFs known to play a role in fate determination. Our curated sets of TFs that operate at the core of cell fate–switch circuits should pave the way toward investigating how TFs, chromatin modification and RNA processing act together in cell lineage control34 and in regulatory networks. For instance, two genes that were highly ranked in ESCs by our analysis, DNMT3B and TET1, regulate DNA methylation: DNMT3B has been described as an epigenetic regulator of pluripotency genes35,36,37. Upon its discovery, TET1 lacked annotation of its cellular function38. Our analysis suggested a developmental function, and a key role for TET1 in maintaining pluripotency was indeed subsequently found39. Knowing the mechanistic interactions of transcriptional regulatory networks in different cell types40 will enable cell type–specific modeling of genetic networks and an understanding of how mutually repressive pairs of TF genes that act as bistable lineage-determining toggle switches affect other TF genes and ultimately the global state of the network.

By showing that bidirectional regulation epitomized by the toggle-switch circuits is manifested in expression reversal behavior, we grounded our method on proposed mechanisms in developmental biology2,3,4. We successfully identified lineage-specific profiles and TF genes involved in core fate-determining circuits. Because the identified genes are not only reporters correlated with cell lineages but also possibly involved in regulatory circuits that carry out cell-fate decisions, our data and the interactive tool we provide to explore them could also inform the choice of candidate genes for cell fate reprogramming.

We identified with high significance eight gene pairs for the developmental circuitry of the common blood progenitors that allowed us to explore further how inherent properties of antagonistic pairs may manifest in other types of large-scale data sets. Finally, we used the reversal analysis to design a cell type–similarity measure that integrates regulatory information, affording a first opportunity to capture the epigenetic landscape of the cell differentiation tree directly from expression profile data. In conclusion, our methods for the global analysis of published cell type transcriptomes capture the underlying regulatory dynamics in static gene expression profiles.


Data set collection and preprocessing of expression values.

We analyzed 2,919 microarrays comprising 166 different cell types (in some cases, tissues) that represent each cell type in its normal state. The data set was collected from the GEO microarray repository from the hgu133Plus2 array type, with each cell type represented by at least two arrays. Further details on the selection of the samples can be found in the Supplementary Results.

Gene expression for the transcription-regulating gene set was summarized using the GC-RMA algorithm42 (no quantile normalization) and custom probe mappings. In total, 2,602 genes are included in the analysis, of which 844 represent TF genes with high confidence (TF set). Details on gene set curation and probe mapping can be found in the Supplementary Results.

Representing gene expression data as gene-pair data.

To derive a normalization-independent quantity, we first convert the gene expression values to rank $r$ within each sample. The quantity that represents the gene-pair configuration on a cell type level, the normalized mean rank difference of two genes, δ, is calculated as the mean rank difference of the two genes from each sample that represents this cell type, with the requirement that the relative ranking between the pair members must be consistent (always $r_g > r_{g'}$ or always $r_g < r_{g'}$).

Toward this end, let $T$ be an ordered set of cell type labels, $G$ an ordered set of genes and $n_t$ the number of samples for $t \in T$ ($n_t \ge 2$ always). Let $R_t = [r^{(t)}_{gi}]$ be the matrix of normalized expression ranks for gene $g \in G$ and sample $i$ of cell type $t$. By averaging over all $n_t$ samples for a given cell type $t$, we construct the matrix $R = [r_{gt}]$ of mean normalized expression ranks.

Normalized here means that simple rank values (integers in $1, \ldots, |G|$) are scaled by $|G|^{-1}$ so that $r^{(t)}_{gi} \in [|G|^{-1}, 1]$. Clearly, $r_{gt} \in [|G|^{-1}, 1]$ as well. In the sequel, we will use 'ranks' with the understanding that we are speaking of normalized ranks.
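The rank normalization described above can be sketched in a few lines of Python. This is only an illustration: the function name `normalized_ranks` and the sample values are hypothetical, and ties are broken by input order, which is an assumption because tie handling is not specified here.

```python
def normalized_ranks(expression):
    """Map one sample's expression values to ranks scaled into [1/|G|, 1].

    Assumption of this sketch: ties are broken by input order
    (Python's sort is stable).
    """
    n = len(expression)
    order = sorted(range(n), key=lambda i: expression[i])
    ranks = [0.0] * n
    for position, gene_index in enumerate(order, start=1):
        ranks[gene_index] = position / n   # integer rank scaled by 1/|G|
    return ranks


sample = [7.2, 3.1, 9.8, 5.5]      # hypothetical values for four genes
print(normalized_ranks(sample))    # [0.75, 0.25, 1.0, 0.5]
```

Because only the ordering within a sample is used, any rank-preserving transformation of the sample leaves the output unchanged.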

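To detect a gene-pair expression reversal, we are interested in how the two genes' ranks differ between cell types. To this end, we define the mean normalized rank difference of two genes in a given cell type as

$$\delta(g, g', t) = \begin{cases} r_{gt} - r_{g't}, & \text{if } r^{(t)}_{gi} > r^{(t)}_{g'i} \text{ for all } i, \text{ or } r^{(t)}_{gi} < r^{(t)}_{g'i} \text{ for all } i \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

Notice that $\delta(g, g', t)$ is nonzero if and only if the genes' ranks manifest the same strict inequality across all samples associated with cell type $t$. Clearly, $\delta(g, g', t) \in (-1, 1)$. In the text we denote this by δ for short.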
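As a minimal sketch of this definition, the following Python function computes δ for one gene pair in one cell type from per-sample normalized ranks. The function name `delta` and the example values are hypothetical.

```python
def delta(ranks_g, ranks_h):
    """Mean normalized rank difference of two genes within one cell type.

    ranks_g, ranks_h: normalized ranks of genes g and g' in each sample
    of the cell type. Returns 0.0 unless the same strict inequality
    holds in every sample.
    """
    diffs = [a - b for a, b in zip(ranks_g, ranks_h)]
    consistent = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
    if not consistent:
        return 0.0
    return sum(diffs) / len(diffs)


# g ranks above g' in all three samples, so delta is their mean difference.
print(delta([0.9, 0.8, 0.7], [0.2, 0.3, 0.1]))
# The ordering flips in the second sample, so delta is clamped to zero.
print(delta([0.9, 0.2], [0.5, 0.6]))   # 0.0
```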

Comparison of gene-pair data across cell types: gene pair–reversal analysis.

Because we are interested in reversals of the genes' relationship between cell types, we similarly define the difference of differences as

$$\Delta(g, g', t, t') = \begin{cases} \delta(g, g', t) - \delta(g, g', t'), & \text{if } \operatorname{sgn}\,\delta(g, g', t) \cdot \operatorname{sgn}\,\delta(g, g', t') = -1 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where sgn is the sign of a real number.

Clearly, $\Delta(g, g', t, t') \in (-2, 2)$ and is nonzero only if calculated from nonzero values. Those pairs with Δ ≠ 0 are referred to as reversal pairs. To extract only results in which both members of a gene pair change their mean rank between the cell types, |Δ| ≥ 1 must hold. In the text we use the notation Δ for $\Delta(g, g', t, t')$.
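A small Python sketch of the reversal score follows. It assumes the sign convention Δ = δ(t) − δ(t′), which is an assumption of this illustration; the function name `reversal_score` is hypothetical.

```python
def reversal_score(delta_t, delta_u):
    """Difference of differences between two cell types.

    Returns 0.0 unless the two deltas are both nonzero and of opposite
    sign, i.e. unless the gene pair reverses its ordering.
    Assumed sign convention: delta in the first cell type minus delta
    in the second.
    """
    if delta_t * delta_u >= 0:     # same sign, or at least one is zero
        return 0.0
    return delta_t - delta_u


print(reversal_score(0.6, 0.2))    # 0.0, no reversal
print(reversal_score(0.0, -0.5))   # 0.0, inconsistent ordering in one type
print(reversal_score(0.6, -0.5))   # about 1.1, a strong reversal
```

Swapping the two cell types flips the sign of the score, which is why the plots built from these values are skew-symmetric.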

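A simple result to justify thresholds: |Δ| ≥ 1 is possible only when both genes' mean ranks change between cell types. Assume without loss of generality that the mean rank of $g$ does not change between cell types, so $r_{gt} = r_{gt'}$. Then, for a reversal pair,

$$\Delta(g, g', t, t') = (r_{gt} - r_{g't}) - (r_{gt'} - r_{g't'}) = r_{g't'} - r_{g't} \tag{3}$$

and $-1 < r_{g't'} - 1 \le r_{g't'} - r_{g't} < r_{g't'} \le 1$, where each inequality holds because every entry of $[r_{gt}]$ is positive and at most 1. Hence |Δ| < 1 whenever one of the two mean ranks is unchanged.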

Reversal participation.

We define the reversal participation score Ψ to quantify the strength of participation of gene g in (potentially bistable) expression reversals in all pairs of cell types, t and t′. That is, g is fixed for the entire plot displayed, and t and t′ correspond to cell types. This measure of strength is the product of (the log of) the number of reversals above a given threshold in which the gene participates and the actual magnitude of the strongest (positive or negative) reversal in which it participates.

First, we identify the gene $\hat{g}$ with respect to which $g$ exhibits the strongest reversal Δ for a given pair of cell types, $t$ and $t'$, as

$$\hat{g}(g, t, t') = \underset{g' \in G}{\arg\max}\; |\Delta(g, g', t, t')| \tag{4}$$

and then define the reversal participation score as

$$\Psi(g, t, t') = \Delta(g, \hat{g}, t, t') \cdot \log \sum_{g' \in G} I\big(|\Delta(g, g', t, t')| > H\big) \tag{5}$$

where $H$ is the |Δ| value above which we deem a reversal to have occurred and $I$ is the indicator function. We use $H = 1$ in our analysis. As $t$ and $t'$ range over all 166 cell types, this yields square, skew-symmetric plots. Note that genes that are ubiquitously highly expressed do not appear in reversal pairs; thus, they are separated from lineage-specific highly expressed genes.
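The participation score can be sketched as follows for one gene and one pair of cell types, given that gene's Δ values against every partner gene. Several details are assumptions of this sketch rather than statements from the text: the natural logarithm is used, the threshold comparison is strict, and the score is set to 0 when no reversal exceeds the threshold. The function name `participation` is hypothetical.

```python
import math


def participation(partner_deltas, threshold=1.0):
    """Reversal participation of one gene for one cell type pair.

    partner_deltas: reversal scores of the gene against each partner gene.
    Assumptions: natural log, strict threshold, and a score of 0.0 when
    no reversal exceeds the threshold.
    """
    strong = [d for d in partner_deltas if abs(d) > threshold]
    if not strong:
        return 0.0
    strongest = max(partner_deltas, key=abs)   # signed strongest reversal
    return strongest * math.log(len(strong))


# Three reversals exceed H = 1; the strongest one is negative.
print(participation([1.2, -1.5, 0.3, 1.1]))   # -1.5 * log(3)
print(participation([0.5, -0.2]))             # 0.0
```

Under this literal reading a gene with exactly one reversal above the threshold also scores 0, because log(1) = 0.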

Finding the top reversal pairs for a specific lineage split.

A supervised search for candidate toggle gene pairs was formulated by setting criteria based on biological knowledge of lineage relationships and expected reversal pattern of such a gene pair in the precursor (P), lineage 1 (L1) and lineage 2 (L2) cells. An external (E) group corresponds to cell types outside the lineage split. The search was performed to extract the top pairs of the erythroid-myeloid and B-T lymphoid splits.

Erythroid-myeloid. The hematopoietic stem cell was selected as the precursor cell type (P), L1 has three erythroid (proerythroid, erythroblast, erythrocyte) and L2 has five myeloid (promyeloid, CD11b+ bone marrow cell, monocyte, CD16+ monocyte and neutrophil) cell types included, and three cell types from the lymphoid lineage (naive CD4+ T cell, naive CD8+ T cell and naive B cell) were selected as an external (E) group.

B-T lymphoid. The hematopoietic stem cell served again as the precursor cell type (P), L1 has four B lymphoid (naive B cell, activated B cell, germinal-center centrocyte and centroblast) and L2 has four T lymphoid (naive CD4+ T cell, activated CD4+ T cell, naive CD8+ T cell and activated CD8+ T cell) cell types included, and the proerythroid and promyeloid cell types were selected as an external (E) group.

To identify candidate toggle pairs, we consider the ternary states Δ < 0, Δ > 0 and Δ = 0 and compare the expected configuration for the lineage split to the one observed in a particular cell type set (with representative cell types of a lineage split). We expect no reversals (Δ = 0) in the P-L1, P-L2, P-E, L1-E and L2-E comparisons and always a reversal in all L1-L2 comparisons (Δ < 0 for each L1 vs. L2 and Δ > 0 for each L2 vs. L1, or Δ > 0 for each L1 vs. L2 and Δ < 0 for each L2 vs. L1). The exact match is the first filter to find candidate pairs. (The external group can be omitted, but it is useful if pairs that do not exhibit expression reversals in neighboring lineages should be excluded.) Additionally, at least one reversal with |Δ| > 1 is required to accept a candidate gene pair to the final list shown. Supplementary Table 7 shows additional results when one or more of these criteria are relaxed. Invariantly, the top pairs presented are among the most promising candidates. Finally, the hypergeometric probability to obtain a defined set of reversals is calculated for each pair and used to sort the gene pairs. To calculate this distribution, the number of successes in the sample corresponds to the observed reversals in the specified cell type set, the number of successes in the population to the observed reversals across all cell type comparisons and the sample size to the number of cell types assigned to P, L1, L2 and E.
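The two steps of this search, the exact pattern match and the hypergeometric ranking, can be sketched in Python as below. The names `matches_split` and `hypergeom_tail`, the dictionary layout and the example values are all hypothetical. The use of an upper-tail probability and the meaning of the population size are assumptions, since the text only names the successes in the sample, the successes in the population and the sample size.

```python
from math import comb


def sign(x):
    return (x > 0) - (x < 0)


def matches_split(big_delta, groups, strong=1.0):
    """Check the expected ternary pattern for one gene pair.

    big_delta maps (cell_type_a, cell_type_b) to the reversal score.
    Requires a same-signed reversal in every L1-L2 comparison, at least
    one with |score| > strong, and no reversal in the other comparisons.
    """
    split = [big_delta[(a, b)] for a in groups["L1"] for b in groups["L2"]]
    if {sign(d) for d in split} not in ({1}, {-1}):
        return False                       # mixed signs or a missing reversal
    if not any(abs(d) > strong for d in split):
        return False                       # no strong reversal
    quiet = [("P", "L1"), ("P", "L2"), ("P", "E"), ("L1", "E"), ("L2", "E")]
    return all(big_delta[(a, b)] == 0
               for x, y in quiet for a in groups[x] for b in groups[y])


def hypergeom_tail(k, population, successes, sample):
    """P(X >= k) for a hypergeometric draw (assumed form of the test)."""
    total = comb(population, sample)
    upper = min(successes, sample)
    return sum(comb(successes, x) * comb(population - successes, sample - x)
               for x in range(k, upper + 1)) / total


groups = {"P": ["HSC"], "L1": ["proerythroid"],
          "L2": ["promyeloid"], "E": ["naive B"]}
big_delta = {
    ("proerythroid", "promyeloid"): 1.3,
    ("HSC", "proerythroid"): 0.0, ("HSC", "promyeloid"): 0.0,
    ("HSC", "naive B"): 0.0,
    ("proerythroid", "naive B"): 0.0, ("promyeloid", "naive B"): 0.0,
}
print(matches_split(big_delta, groups))   # True
print(hypergeom_tail(1, 10, 3, 2))        # 24/45, about 0.533
```

Pairs that pass the filter would then be sorted by ascending probability, so that reversals concentrated in the chosen cell types rank first.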

Clustering of cell types.

We define a dissimilarity measure based on gene-pair expression reversals, Φ, as the number of reversal pairs with |Δ| ≥ 1 (as defined above) for a given cell type comparison. By examining all possible pairs of TF genes in our data set, we count the number of reversal pairs {g, g′} between two cell types (X, Y). The greater the number of reversal pairs, the greater the dissimilarity Φ(X, Y) between the two cell types.
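As a rough sketch of how Φ could be tabulated, the snippet below counts gene pairs with |Δ| ≥ 1 for each cell type comparison and fills a symmetric matrix. The input layout (a dictionary of precomputed Δ values per cell type pair), the function names and the toy values are assumptions made for illustration; the computation of Δ itself is defined earlier in the paper and is not reproduced here.

```python
def phi(delta_by_gene_pair, threshold=1.0):
    """Number of gene pairs whose reversal magnitude satisfies |Δ| >= threshold."""
    return sum(1 for d in delta_by_gene_pair.values() if abs(d) >= threshold)

def phi_matrix(deltas, cell_types, threshold=1.0):
    """Symmetric matrix of Φ(X, Y) over all listed cell types."""
    index = {c: i for i, c in enumerate(cell_types)}
    matrix = [[0] * len(cell_types) for _ in cell_types]
    for (x, y), by_pair in deltas.items():
        count = phi(by_pair, threshold)
        matrix[index[x]][index[y]] = count
        matrix[index[y]][index[x]] = count
    return matrix

# Toy Δ values (invented) for three cell types and three gene pairs.
toy_deltas = {
    ("A", "B"): {("g1", "g2"): 1.2, ("g1", "g3"): 0.0, ("g2", "g3"): -2.0},
    ("A", "C"): {("g1", "g2"): 0.3, ("g1", "g3"): 0.0, ("g2", "g3"): 0.0},
    ("B", "C"): {("g1", "g2"): -1.0, ("g1", "g3"): 0.0, ("g2", "g3"): 0.0},
}
toy_phi = phi_matrix(toy_deltas, ["A", "B", "C"])
```

With 2,602 transcription-regulating genes, the number of gene pairs is on the order of millions per comparison, so a practical implementation would likely vectorize this count rather than loop over dictionaries.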

The cell lineage was reconstructed for the endothelial and hematopoietic cell types using hierarchical clustering with average linkage. Clustering was applied to terminally differentiated cell types. Hematopoietic and endothelial cells are closely related in early development: the hemangioblast is a progenitor for both hematopoietic and endothelial precursors43. Our data set contains neither this common precursor cell type nor a precursor for endothelial differentiation; therefore, all endothelial cells are treated as differentiated cell types. The hematopoietic stem cell is the common precursor of the blood cell types and is placed at the center. There are three early precursor cell types for the erythroid-myeloid lineage: erythroblast, bone marrow promyelocyte and CD11b+ cells. In addition, we chose to assign the monocyte as a precursor cell type, as the data set contains multiple monocyte-derived cell types (macrophages and dendritic cells). There is no early lymphoid precursor in the data set, so we chose to assign the naive cell types as precursors. For the B cell lineage, a further maturation step occurs in the germinal centers43; for this reason, the germinal-center centrocyte and centroblast are also assigned as precursors. The other cell types are considered to represent a differentiated state.
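For readers unfamiliar with average linkage, the following sketch shows the agglomeration rule on a small matrix, assuming that Φ is used directly as the distance between cell types. The function name, the merge-list output format and the toy matrix are illustrative assumptions; a production analysis would more likely rely on an established clustering library.

```python
from itertools import combinations

def average_linkage(dist, labels):
    """Agglomerative clustering; returns the merges as (members_a, members_b, distance)."""
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def avg(a, b):
        # Average linkage: mean pairwise distance between members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((tuple(labels[i] for i in a),
                       tuple(labels[i] for i in b),
                       avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy distance matrix (invented values) for three cell types.
toy_dist = [[0, 2, 6],
            [2, 0, 5],
            [6, 5, 0]]
toy_merges = average_linkage(toy_dist, ["A", "B", "C"])
```

Each entry of `toy_merges` corresponds to one internal node of the clustering tree, which is the structure onto which the precursor cell types are subsequently placed.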

The placement of the progenitor cell types {B_1, ..., B_M}, where M is the number of progenitor cell types, was done using the Hungarian algorithm (HA)44 to solve an assignment problem. Let X_n = {Φ(a_1, B_i), ..., Φ(a_k, B_i)} and Y_n = {Φ(b_1, B_i), ..., Φ(b_l, B_i)} contain the dissimilarities Φ from progenitor cell type B_i to the cell types on the left {a_1, ..., a_k} and right {b_1, ..., b_l} branches of node n, respectively, where n ∈ {1, ..., N}, N is the number of branches in the clustering tree, and k and l are the numbers of cell types in the left and right branches. The dissimilarity D(n, B_i) of cell type B_i from node n is defined as D(n, B_i) = |mean{X_n} − mean{Y_n}|, where |·| denotes absolute value and mean{·} denotes the mean of a set of dissimilarities. The resulting N × M matrix D, containing D(n, B_i) for all node and cell type pairs, is then scaled by the dissimilarity of each progenitor cell type to the ESC: D_s = D × d_esc, where d_esc = [Φ(A_esc, B_1), ..., Φ(A_esc, B_M)], A_esc is the ESC and d_esc is normalized to the [0, 1] interval. This makes the ESC a reference point. HA is then applied to D_s to obtain the optimal assignment for each progenitor cell type.
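The placement step can be illustrated with a small sketch. Several choices here are assumptions rather than details given in the text: the optimum is taken to be the assignment that minimizes the summed scaled values, the ESC vector is normalized by dividing by its maximum, and an exhaustive search over assignments stands in for the Hungarian algorithm (both return a minimum-cost assignment, but the exhaustive search is only feasible for tiny inputs). All function names and numbers are hypothetical.

```python
from itertools import permutations
from statistics import mean

def node_score(left, right):
    """D(n, B_i) = |mean(X_n) - mean(Y_n)| for one node and one progenitor."""
    return abs(mean(left) - mean(right))

def scale_by_esc(D, esc):
    """Scale each progenitor column of the N x M matrix D by its normalized Φ to the ESC."""
    top = max(esc)
    norm = [e / top for e in esc]  # assumed normalization to [0, 1]
    return [[row[i] * norm[i] for i in range(len(esc))] for row in D]

def assign_progenitors(Ds):
    """Give every progenitor (column) a distinct node (row), minimizing the total cost.

    Brute-force stand-in for the Hungarian algorithm; requires N >= M.
    """
    n_nodes, n_prog = len(Ds), len(Ds[0])
    best = min(permutations(range(n_nodes), n_prog),
               key=lambda nodes: sum(Ds[n][i] for i, n in enumerate(nodes)))
    return list(best)  # best[i] is the node index assigned to progenitor i

# Toy example (invented values): three nodes, two progenitor cell types.
D = [[0.5, 4.0],
     [3.0, 1.0],
     [2.0, 2.5]]
esc = [10, 20]                      # Φ(ESC, B_1), Φ(ESC, B_2)
Ds = scale_by_esc(D, esc)
placement = assign_progenitors(Ds)  # node chosen for each progenitor
```

Because there are more nodes than progenitors, some nodes remain unassigned, which matches the situation described for the clustering tree.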

It should be noted that there are more nodes in the clustering tree than there are progenitor cell types with measurement data. Thus, progenitor cell types are assigned only to the best-fitting nodes according to the HA optimization. For a representation containing all 166 cell types, multidimensional scaling is used to obtain a two-dimensional representation of the full reversal dissimilarity matrix. A landscape is interpolated over the 2D representation of cell types using the dissimilarity Φ to the ESC as elevation.

ChIP-seq data.

The ChIP-seq data sets used are listed in Supplementary Table 10, and their use is further described in the Supplementary Results. The peak lists as published by the authors were assembled for each TF. The peak sizes were equalized to ±250 bp around the peak center. For the ESC data, overlapping intervals representing the binding of the same protein were merged into one. The intersection of peak lists between pairs of TF genes was defined as a minimum 1-bp overlapping region. The genomic region enrichment analysis was performed using the GREAT45 tool (binomial test, FDR 1%).
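The peak-handling steps above can be sketched in a few functions. This is an illustrative outline under stated assumptions: peaks are represented as (chromosome, start, end) tuples with half-open coordinates, the peak center is taken as the interval midpoint, and the function names and example coordinates are invented. The actual analysis may have used dedicated interval tools.

```python
def equalize(peaks, half_width=250):
    """Resize every peak to ±half_width bp around its center (midpoint assumed)."""
    resized = []
    for chrom, start, end in peaks:
        center = (start + end) // 2
        resized.append((chrom, max(0, center - half_width), center + half_width))
    return resized

def merge_overlapping(peaks):
    """Merge overlapping intervals of the same factor into single intervals."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def intersect(peaks_a, peaks_b, min_overlap=1):
    """Pairs of peaks from two factors that share at least min_overlap bp."""
    hits = []
    for chrom_a, start_a, end_a in peaks_a:
        for chrom_b, start_b, end_b in peaks_b:
            shared = min(end_a, end_b) - max(start_a, start_b)
            if chrom_a == chrom_b and shared >= min_overlap:
                hits.append(((chrom_a, start_a, end_a), (chrom_b, start_b, end_b)))
    return hits

# Invented coordinates for two hypothetical factors.
tf1 = merge_overlapping(equalize([("chr1", 1000, 1100), ("chr1", 1300, 1500)]))
tf2 = equalize([("chr1", 1850, 1950), ("chr1", 1800, 1898), ("chr2", 1000, 1100)])
shared_sites = intersect(tf1, tf2)
```

In this toy input, one `tf2` peak merely touches the merged `tf1` interval without sharing a base, so it is not counted, whereas the other chr1 peak overlaps and is reported. According to the text, the merging step applies to the ESC data only.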

Online resource.

The online data resource and interactive tool encompassing pairwise comparisons of the genes and cell types presented in this article is available to explore transcriptome diversity in metazoa and is accompanied by a user guide and video tutorial. The TF gene landscape is also available online in an interactive, browsable format. The source code to perform the analysis is available as Supplementary Software, and updated versions are available upon request.


  1. Alberts, B. et al. Cells and genomes. in Molecular Biology of the Cell 3rd edn. Ch. 22 (Garland Science, New York, 1994).

  2. Zhou, J.X. & Huang, S. Understanding gene circuits at cell-fate branch points for rational cell reprogramming. Trends Genet. 27, 55–62 (2011).

  3. Kauffman, S.A. Control circuits for determination and transdetermination. Science 181, 310–318 (1973).

  4. Kauffman, S.A., Shymko, R.M. & Trabert, K. Control of sequential compartment formation in Drosophila. Science 199, 259–270 (1978).

  5. Zhang, P. et al. Negative cross-talk between hematopoietic regulators: GATA proteins repress PU.1. Proc. Natl. Acad. Sci. USA 96, 8705–8710 (1999).

  6. Huang, S. et al. Bifurcation dynamics in lineage-commitment in bipotent progenitor cells. Dev. Biol. 305, 695–713 (2007).

  7. Geman, D., d'Avignon, C., Naiman, D.Q. & Winslow, R.L. Classifying gene expression profiles from pairwise mRNA comparisons. Stat. Appl. Genet. Mol. Biol. 3, Article 19 (2004).

  8. Tan, A.C. et al. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21, 3896–3904 (2005).

  9. Price, N.D. et al. Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas. Proc. Natl. Acad. Sci. USA 104, 3414–3419 (2007).

  10. Waddington, C.H. The Strategy of the Genes: A Discussion of Some Aspects of Theoretical Biology (Allen & Unwin, London, 1957).

  11. Yu, J. et al. Induced pluripotent stem cell lines derived from human somatic cells. Science 318, 1917–1920 (2007).

  12. The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).

  13. Chen, X. et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106–1117 (2008).

  14. Vierbuchen, T. et al. Direct conversion of fibroblasts to functional neurons by defined factors. Nature 463, 1035–1041 (2010).

  15. Grass, J.A. et al. GATA-1-dependent transcriptional repression of GATA-2 via disruption of positive autoregulation and domain-wide chromatin remodeling. Proc. Natl. Acad. Sci. USA 100, 8811–8816 (2003).

  16. Laslo, P. et al. Multilineage transcriptional priming and determination of alternate hematopoietic cell fates. Cell 126, 755–766 (2006).

  17. Hu, M. et al. Multilineage gene expression precedes commitment in the hemopoietic system. Genes Dev. 11, 774–785 (1997).

  18. Zhou, J.X., Brusch, L. & Huang, S. Predicting pancreas cell fate decisions and reprogramming with a hierarchical multi-attractor model. PLoS ONE 6, e14752 (2011).

  19. Hosoya, T. et al. GATA-3 is required for early T lineage progenitor development. J. Exp. Med. 206, 2987–3000 (2009).

  20. Miranda-Saavedra, D. & Göttgens, B. Transcriptional regulatory networks in haematopoiesis. Curr. Opin. Genet. Dev. 18, 530–535 (2008).

  21. Swiers, G., Patient, R. & Loose, M. Genetic regulatory networks programming hematopoietic stem cells and erythroid lineage specification. Dev. Biol. 294, 525–540 (2006).

  22. Feinberg, M.W. et al. The Kruppel-like factor KLF4 is a critical regulator of monocyte differentiation. EMBO J. 26, 4138–4148 (2007).

  23. Hoang, T. et al. Opposing effects of the basic helix-loop-helix transcription factor SCL on erythroid and monocytic differentiation. Blood 87, 102–111 (1996).

  24. Ma, C. & Staudt, L.M. LAF-4 encodes a lymphoid nuclear protein with transactivation potential that is homologous to AF-4, the gene fused to MLL in t(4;11) leukemias. Blood 87, 734–745 (1996).

  25. Nagasawa, M., Schmidlin, H., Hazekamp, M.G., Schotte, R. & Blom, B. Development of human plasmacytoid dendritic cells depends on the combined action of the basic helix-loop-helix factor E2-2 and the Ets factor Spi-B. Eur. J. Immunol. 38, 2389–2400 (2008).

  26. Hagman, J., Belanger, C., Travis, A., Turck, C.W. & Grosschedl, R. Cloning and functional characterization of early B-cell factor, a regulator of lymphocyte-specific gene expression. Genes Dev. 7, 760–773 (1993).

  27. Zandi, S. et al. EBF1 is essential for B-lineage priming and establishment of a transcription factor network in common lymphoid progenitors. J. Immunol. 181, 3364–3372 (2008).

  28. Lukin, K. et al. A dose-dependent role for EBF1 in repressing non-B-cell-specific genes. Eur. J. Immunol. 41, 1787–1793 (2011).

  29. Dontje, W. et al. Delta-like1-induced Notch1 signaling regulates the human plasmacytoid dendritic cell versus T-cell lineage decision through control of GATA-3 and Spi-B. Blood 107, 2446–2452 (2006).

  30. Rosa, A. et al. The interplay between the master transcription factor PU.1 and miR-424 regulates human monocyte/macrophage differentiation. Proc. Natl. Acad. Sci. USA 104, 19849–19854 (2007).

  31. Wei, G. et al. Genome-wide analyses of transcription factor GATA3-mediated gene regulation in distinct T cell types. Immunity 35, 299–311 (2011).

  32. Treiber, T. et al. Early B cell factor 1 regulates B cell gene networks by activation, repression, and transcription-independent poising of chromatin. Immunity 32, 714–725 (2010).

  33. Duarte, N.C. et al. Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc. Natl. Acad. Sci. USA 104, 1777–1782 (2007).

  34. Pardo, M. et al. An expanded Oct4 interaction network: implications for stem cell biology, development, and disease. Cell Stem Cell 6, 382–395 (2010).

  35. Kashyap, V. et al. Regulation of stem cell pluripotency and differentiation involves a mutual regulatory circuit of the NANOG, OCT4, and SOX2 pluripotency transcription factors with polycomb repressive complexes and stem cell microRNAs. Stem Cells Dev. 18, 1093–1108 (2009).

  36. Li, J.-Y. et al. Synergistic function of DNA methyltransferases Dnmt3a and Dnmt3b in the methylation of Oct4 and Nanog. Mol. Cell Biol. 27, 8748–8759 (2007).

  37. Sinkkonen, L. et al. MicroRNAs control de novo DNA methylation through regulation of transcriptional repressors in mouse embryonic stem cells. Nat. Struct. Mol. Biol. 15, 259–267 (2008).

  38. Tahiliani, M. et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 324, 930–935 (2009).

  39. Ito, S. et al. Role of Tet proteins in 5mC to 5hmC conversion, ES-cell self-renewal and inner cell mass specification. Nature 466, 1129–1133 (2010).

  40. Neph, S. et al. Circuitry and dynamics of human transcription factor regulatory networks. Cell 150, 1274–1286 (2012).

  41. Wu, Z. & Irizarry, R.A. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J. Comput. Biol. 12, 882–893 (2005).

  42. Nishikawa, S.I. et al. Progressive lineage analysis by cell sorting and culture identifies FLK1+VE-cadherin+ cells at a diverging point of endothelial and hemopoietic lineages. Development 125, 1747–1757 (1998).

  43. Allen, C.D.C., Okada, T. & Cyster, J.G. Germinal-center organization and cellular dynamics. Immunity 27, 190–202 (2007).

  44. Burkard, R., Dell'Amico, M. & Martello, S. Assignment Problems (SIAM, Philadelphia, 2009).

  45. McLean, C.Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010).


We thank R. Bressler (Institute for Systems Biology) for providing the interactive landscape visualization for the web page, T. Sauter and T. Schilling (University of Luxembourg) for the use of their computational resource, D. Galas and C. Carlberg for useful discussions and suggestions and E. Friederich and N. Vlassis for reading the manuscript. We gratefully acknowledge these sources of funding: the Academy of Finland, project no. 132877 (M.N.), the University of Luxembourg, Tekes FiDiPro Program (S.A.K.), Alberta Innovates the Future (S.H.) and US National Institutes of Health–National Institute of General Medical Sciences grants R01GM072855 and P50GM076547 (I.S.).

Author information

M.H., M.N., R. Kramer and I.S. designed the gene-pair analysis, and M.H. and R. Kramer performed the analysis. M.H. and A.W.-B. designed the gene curation pipeline, and M.H., A.W.-B. and L.S. curated the genes. M.N., M.H., J.X.Z., S.A.K., S.H. and I.S. designed the clustering experiments and visualization of cell type dissimilarities. M.N. designed the branch-point placement algorithm. M.H. and M.N. compiled the ChIP-seq validations. M.H. and S.H. designed the reversal participation analysis. R. Kreisberg, M.H., M.N. and I.S. designed the content of the online resource. M.H., M.N., R. Kramer, S.H. and I.S. wrote the manuscript. All authors commented on the manuscript.

Correspondence to Ilya Shmulevich.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Tables 4 and 6 and Supplementary Results (PDF 2503 kb)

Supplementary Table 1

Cell type and tissue ontology terms (XLS 50 kb)

Supplementary Table 2

Microarray samples mapped to ontology terms (XLS 247 kb)

Supplementary Table 3

The order of cell types as it appears in the heat maps presented (XLS 29 kb)

Supplementary Table 5

Functional evidence for a role in transcription regulation found in the gene-set curation (XLS 306 kb)

Supplementary Table 7

Identification of candidate toggle pairs (XLS 69 kb)

Supplementary Table 8

Rank-based differential expression analysis comparison using RCoS (XLS 710 kb)

Supplementary Table 9

Rank-based differential expression analysis comparison using RDAM (XLS 245 kb)

Supplementary Table 10

Public ChIP-seq data sets used (XLS 23 kb)

Supplementary Table 11

Genomic region enrichment results for GATA1 ChIP-seq data (XLS 1413 kb)

Supplementary Table 12

Genomic region enrichment results for TAL1 ChIP-seq data (XLS 713 kb)

Supplementary Table 13

Genomic region enrichment results for SPI1 ChIP-seq data (XLS 1936 kb)

Supplementary Table 14

Genomic region enrichment results for EBF1 ChIP-seq data (XLS 969 kb)

Supplementary Table 15

Genomic region enrichment results for GATA3 ChIP-seq data (XLS 3041 kb)

Supplementary Table 16

Mouse knockout phenotypes of Gata1, Tal1, Sfpi1, Ebf1 and Gata3 (XLS 114 kb)

Supplementary Table 17

Additional microarray data used for validation. (XLS 97 kb)

Supplementary Software

Online data resource and tool TREL. The online data resource and interactive tool encompassing pairwise comparisons of the genes and cell types presented in this article is available to explore transcriptome diversity in metazoa; this resource is accompanied by a user guide and video tutorial. (ZIP 4901 kb)

About this article

Cite this article

Heinäniemi, M., Nykter, M., Kramer, R. et al. Gene-pair expression signatures reveal lineage control. Nat Methods 10, 577–583 (2013) doi:10.1038/nmeth.2445
