Introduction

Mammalian genomes encode 1,500–2,000 transcription factors (TFs)1, which cross-regulate one another to form a TF network. This network controls the transcriptome of cells and thereby defines their identity. A powerful approach to deciphering such a complex network is the systematic perturbation of individual TFs followed by global gene expression profiling2.

Results

Here we report the generation of mouse embryonic stem (ES) cell lines, each of which has been engineered by integrating an expression cassette of a specific transcription factor (TF) into the ubiquitously expressed Rosa26 locus (Fig. 1a)2. The Rosa26 locus3 drives relatively uniform expression of the exogenous copy (transgene) of a TF, which is repressed by doxycycline (Dox) and can be induced by culturing cells in Dox- conditions (Fig. 1b)4. Combined with the 53 ES cell lines reported previously2, we present a total of 137 ES cell lines. The majority of the manipulated genes were TFs, which were selected from a set of high-priority genes involved in critical functions in mouse ES cells and their differentiation5. To ensure the quality of these ES cell lines, we implemented rigorous quality control (QC) steps that have been described previously in detail2. As a part of the characterization of these ES cell lines, we carried out global gene expression profiling by DNA microarrays 48 hours after TF induction (Fig. 1c; GEO accession number, GSE31381). The induction of each TF was confirmed by qRT-PCR (Fig. 1d; see Supplementary Table S1 for primer pairs). The effect of TF induction on the transcriptome of mouse ES cells was highly variable (Fig. 1e; Supplementary Table S2). Ranked by the number of genes whose expression changed significantly (FDR ≤ 0.05, fold change ≥1.5), the top 10% of studied TFs changed 4676 genes on average (e.g., Dmrt1), whereas the bottom 50% of TFs caused significant changes in the expression of only 54.5 genes on average (e.g., Mbd3) (Fig. 1c, d).

Figure 1

Induction of transcription factors (TFs) in ES cells:

(a) plasmid structure that includes loxP recombination sites, puromycin resistance gene, open reading frame (ORF) of a TF with hCMV promoter followed by His6-FLAG tag; (b) schematic diagram showing the expression of transgenic TF induced in Dox- conditions; (c) examples of scatterplots of gene expression in Dox- versus Dox+ condition. Green and red dots indicate genes that are differentially expressed with statistical significance (FDR<0.05, change >1.5 fold); (d) Increase of transcription factor expression after the induction of a transgene, as measured by qPCR (Dox- vs. Dox+); results from two biological replicates (3 technical replicates each); error bars (S.E.M.; ANOVA); and dashed line = 2 fold change; (e) a list of TFs and the number of genes up- or down-regulated by the induction of the TF (FDR<0.05, change >1.5 fold) (Supplementary Table S2).

To further characterize the transcriptome alterations caused by each TF, we compared our microarray data with 3 public databases: the gene expression profiles of many mouse organs/tissues at The Genomics Institute of the Novartis Research Foundation (GNF) (ver. 2 & 3)6,7, the Genetic Association Database (GAD) on gene sets associated with mouse phenotypes8 and the MSigDB database (ver. 3) of gene sets associated with signaling pathways and cellular functions9. Because the GNF database is quantitative and the two other databases are qualitative, we used different methods to quantify association: correlation of median-subtracted log-transformed gene expression values for the GNF database and Parametric Analysis of Gene Set Enrichment (PAGE)10 for the GAD and MSigDB databases (see Supplementary Methods).

A comparison of our microarray data with the GNF database showed that the induction of a TF in ES cells often initiates the differentiation of ES cells into specific cell types as soon as 48 hr later, when cells do not yet exhibit any overt phenotypes (Fig. 2 for GNF ver. 3; Supplementary Fig. 1 for GNF ver. 2). For example, the transcriptome of ES cells shifted toward a neural profile after the induction of Sox9, Foxg1, Klf3, or Pou5f1; toward endoderm after the induction of Hnf4a, Gata2, Gata3, or Esx1; and toward skeletal muscle and heart after the induction of Myod1 or Mef2c. Similarly, the transcriptome of ES cells shifted toward hematopoietic cell lineages after the induction of Sfpi1, Elf1, or Irf2; and toward T-cells and thymocytes after the induction of Elf5 or Tgif1. Interestingly, TFs associated positively with transcriptome changes toward specific lineages showed a negative association with those toward different cell lineages (Fig. 2). For example, TFs associated with transcriptome changes toward neural tissues were negatively associated with those toward hematopoietic lineages (e.g., Sox9 and Foxg1 in Fig. 2) and vice versa (e.g., Irf2, Elf1, Sfpi1 in Fig. 2). These data suggest that TF networks are organized to cross-regulate as if different tissue lineages are mutually exclusive.

Figure 2

Correlation of gene expression response to the induction of TFs with tissue-specific gene expression from the GNF ver. 3 database7.

A comparison of our microarray data with the GAD database identified associations of TFs with mouse phenotypes (Fig. 3). Many newly identified associations are consistent with published data. For example, Hoxa2 was associated with pancreatic alpha and beta cells11; Foxc1, with hair follicle/shaft12,13; and Sox11, with skeletal defects14. A comparison of our microarray data with the MSigDB database identified the association of each TF with specific cells and pathways (Fig. 4). For example, Smad6 was associated with keratinocytes15; Myod1, with alveolar rhabdomyosarcoma16; and Hnf4a, with lipoproteins17.

Figure 3

Enrichment of gene sets associated with mouse phenotypes from the GAD database8 among genes that were upregulated (positive) or downregulated (negative) after the induction of various TFs.

Figure 4

Enrichment of gene sets associated with various functions and signaling pathways from the MSigDB ver. 3 database9 among genes that were upregulated (positive) or downregulated (negative) after the induction of various TFs.

Discussion

The collection of mouse ES cell lines reported here is freely available to the research community (http://esbank.nia.nih.gov/index.html). The analysis presented here can help researchers select ES cell lines suitable for their own research programs. For example, these TF-manipulable ES cell lines can be used to study the complex mechanisms of ES cell differentiation toward specific lineages. These ES cell lines are also adaptable to a variety of experiments and analyses, as shown in our previous report2. For example, each TF is C-terminally tagged with His6-FLAG, which simplifies studies of TF localization, protein-protein interactions and protein-DNA interactions2. Further mining of the microarray results reported here, as well as additional experiments with the provided ES cell lines and their derivatives, will yield more insight into gene regulatory networks. Carrying out similar experiments for more regulatory proteins (ideally for all TFs and additional signaling proteins) should provide increasingly complete information about gene regulation in mammalian cells and organs.

Methods

Derivation of transgenic ES cell lines

ES cell lines with inducible TF transgenes were derived from MC1 mouse ES cells (129S6/SvEvTac), passage 17. Cells were cultured in DMEM with 15% FBS and LIF on feeder cells. Cells were electroporated with a linearized pMWROSATcH vector and selected with hygromycin B. Knock-in at the ROSA-TET locus was confirmed by Southern blotting. For exchange vectors, PCR-amplified ORFs were subcloned into pZhcSfi, which was modified to express a His6-FLAG-tagged protein and a puromycin resistance gene. ES cells were co-transfected with a sequence-verified exchange vector and pCAGGS-Cre and selected with puromycin in the presence of doxycycline (Dox). Isolated clones were tested for Venus expression, hygromycin B susceptibility, transgene RNA expression, Cre-mediated integration (by genotyping) and mycoplasma contamination.

Gene expression analysis of cells with induced TFs

ES cells (passage 25) were cultured in the standard LIF+ medium with Dox+ on gelatin-coated dishes throughout the experiments. Cells from each cell line were split into 6 wells and the medium was changed 24 hr after cell plating: 3 wells received Dox+ medium and 3 wells received Dox- medium to induce the transgenic TF. Dox was removed by washing 3 times with PBS at 3 hour intervals. Total RNA was isolated with TRIzol (Invitrogen) after 48 hr, and two replicates were used for real-time qPCR (see primers in Supplementary Table S1) and for microarray hybridization. Total RNA samples were labeled using the Low RNA Input Fluorescent Linear Amplification Kit (Agilent). For most TFs, we hybridized a Cy3-CTP labeled sample from Dox- medium together with a Cy5-CTP labeled sample from Dox+ medium. For 7 TFs, however, we labeled samples from Dox- and Dox+ with Cy3 and hybridized them independently with a Cy5-labeled reference target, which is a mixture of Stratagene Universal Mouse Reference RNA and MC1 cell RNA (this method requires twice as many arrays). Analysis showed that both methods produce results of comparable quality. Targets were hybridized to the NIA Mouse 44K Microarray v3.0 (Agilent, design ID 015087)18. Slides were scanned with an Agilent DNA Microarray Scanner. All DNA microarray data are available in Supplementary Table S2, at GEO/NCBI19 (http://www.ncbi.nlm.nih.gov/geo; accession number GSE31381) and in the NIA Array Analysis software20 (http://lgsun.grc.nia.nih.gov/ANOVA).

Normalization of microarray data and detection of outliers

Two methods of array hybridization were used in this study: (1) RNA extracted from cells with induced transcription factors (TFs) (cultured in Dox- conditions) and from control cells (cultured in Dox+ conditions) was labeled with Cy3, and each sample was hybridized on a separate array together with reference RNA labeled with Cy5; and (2) RNA extracted from cells with induced TFs (Dox-) was labeled with Cy3 and hybridized together with RNA from control cells (Dox+), which was labeled with Cy5. The second method does not use reference RNA. Data processing depended on the method of hybridization. Potential Cy3/Cy5 bias in microarrays with the hybridization of Dox- vs. Dox+ samples was removed by normalization to the median logratio of gene expression change in all TF-manipulation experiments. The details of the method are available in Supplementary Information.
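The dye-bias correction for method (2) can be illustrated with a minimal sketch. It assumes one plausible reading of the text: for each gene, the median logratio across all TF-manipulation experiments is treated as the dye bias and subtracted. The function name and data layout are hypothetical, and the actual procedure is the one described in Supplementary Information.

```python
from statistics import median

def remove_dye_bias(logratios):
    """Subtract each gene's median logratio across experiments.

    logratios: dict mapping experiment name -> dict mapping gene -> logratio
    (Dox- vs. Dox+). All experiments are assumed to share the same genes.
    This per-gene median correction is an assumption, not the published code.
    """
    experiments = list(logratios)
    genes = list(logratios[experiments[0]])
    # Per-gene bias estimate: median logratio over all experiments.
    bias = {g: median(logratios[e][g] for e in experiments) for g in genes}
    return {e: {g: logratios[e][g] - bias[g] for g in genes}
            for e in experiments}
```

The rationale for this reading is that any single gene responds to only a minority of TFs, so its median logratio across many experiments mostly reflects systematic Cy3/Cy5 differences and not biology.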

Statistical analysis of microarray data

For statistical analysis we used NIA Array Analysis, which estimates the False Discovery Rate (FDR) to account for multiple hypothesis testing20. The response of genes to the induction of TFs was measured as a logratio (i.e., the difference between means of log-transformed intensities) between manipulated (Dox-) and control (Dox+) cells. We considered a gene expression change significant if the logratio was significantly different from zero (FDR < 0.05) and the change of expression was >1.5 fold.
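The significance criteria above can be sketched as follows. This is an illustration only: it assumes log10-transformed intensities, and it uses the Benjamini-Hochberg procedure as a stand-in FDR estimator, which is not necessarily what NIA Array Analysis computes. All function names are hypothetical.

```python
import math
from statistics import mean

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def logratio(induced, control):
    """Difference between means of log10-transformed intensities."""
    return (mean(math.log10(v) for v in induced)
            - mean(math.log10(v) for v in control))

def is_significant(lr, fdr, fdr_cutoff=0.05, fold_cutoff=1.5):
    """FDR below the cutoff and absolute change above 1.5-fold."""
    return fdr < fdr_cutoff and abs(lr) > math.log10(fold_cutoff)
```

Note that the fold-change cutoff must be expressed in the same logarithm base as the logratio; with log10 values, a 1.5-fold change corresponds to an absolute logratio of about 0.176.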

Correlation with tissue-specific gene expression

Association of gene expression changes induced by TF manipulation with tissue-specific gene expression was evaluated based on the correlation between our microarray results and the GNF database7. Correlation was estimated between gene expression responses to TF manipulation (logratio of Dox- vs. Dox+) and median-centered log-transformed gene expression in various tissues from the GNF database (ver. 2 and 3). Because the importance of genes in ES cells and adult tissues may differ, and because the microarray platforms used in these studies are not fully compatible, we applied correlation analysis to a subset of genes that are highly expressed and dynamic in both data sets. In each database we selected the 10,000 genes with the highest score, defined as the product of average log-expression and the standard deviation of expression (after induction of various TFs or in different tissues), and then took the intersection of the two selections: 5,595 genes for GNF ver. 3 (5,295 genes for ver. 2). Correlation values and corresponding z-values were then estimated based on this subset of genes. The matrix was sorted using hierarchical clustering in TMEV (ver. 3.1)21.
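The gene selection and correlation steps can be sketched as below. This is a minimal illustration under stated assumptions: Pearson correlation is used, the standard deviation is the population form, and the z-value is obtained with the Fisher transformation, none of which the text specifies. All names are hypothetical.

```python
import math
from statistics import mean, pstdev

def dynamic_score(values):
    """Score = average log-expression x standard deviation across samples."""
    return mean(values) * pstdev(values)

def top_genes(expr, n):
    """expr: dict gene -> list of log-expression values; return top-n set."""
    ranked = sorted(expr, key=lambda g: dynamic_score(expr[g]), reverse=True)
    return set(ranked[:n])

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r, n):
    """z-value via the Fisher transformation (an assumption; requires |r| < 1)."""
    return math.atanh(r) * math.sqrt(n - 3)

def response_tissue_correlation(response, tissue, shared):
    """Correlate TF response logratios with median-centered tissue expression.

    response, tissue: dict gene -> value; shared: genes selected in both sets.
    """
    genes = sorted(shared)
    r = pearson([response[g] for g in genes], [tissue[g] for g in genes])
    return r, fisher_z(r, len(genes))
```

In use, `shared` would be the intersection of `top_genes(es_expr, 10000)` and `top_genes(gnf_expr, 10000)`, where `es_expr` and `gnf_expr` are placeholder names for the two expression tables.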

Analysis of gene set enrichment

Enrichment of target genes among genes that are upregulated and/or downregulated following the manipulation of a TF was quantified using a modified Parametric Analysis of Gene Set Enrichment (PAGE)10. PAGE is based on the comparison of the average expression change in a specific subset of genes, xset, with the average expression change in all genes, xall:

$$z = \frac{(x_{set} - x_{all})\,\sqrt{n_{set}}}{SD_{all}} \qquad (1)$$

where nset is the size of the gene set and SDall is the standard deviation of expression change among all genes. We modified this method by applying equation (1) separately to the subset of N top upregulated genes and to the subset of N top downregulated genes, instead of to all genes combined, which allowed us to detect the enrichment of the same gene set among both upregulated and downregulated genes. The value of N = 5000 was selected empirically because the enrichment of genes with TF binding sites appeared always to be limited to the top 5000 upregulated or downregulated genes. The probability distribution of expression change within subsets of N upregulated and downregulated genes is not normal; however, because we compare averages for large sets of genes (usually, nset is >50), the probability distribution of these averages is close to normal based on the central limit theorem22. Thus, it is reasonable to use equation (1) as an approximation. When a specific functional gene set was enriched among both upregulated and downregulated genes, we subtracted the smaller z-value from both z-values. The matrix of z-values was first sorted using hierarchical clustering in TMEV (ver. 3.1)21 and then manually converted to a semi-diagonal form.
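The modified PAGE procedure can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the z-score is taken as the difference of means times the square root of nset, divided by SDall; the sign of the z-value in the downregulated subset is flipped so that a positive value means enrichment in both directions; and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def page_z(changes, gene_set):
    """PAGE z-score of gene_set within `changes` (dict gene -> logratio)."""
    members = [changes[g] for g in gene_set if g in changes]
    all_values = list(changes.values())
    sd_all = pstdev(all_values)
    if not members or sd_all == 0:
        return 0.0
    return (mean(members) - mean(all_values)) * math.sqrt(len(members)) / sd_all

def modified_page(changes, gene_set, n_top=5000):
    """Apply PAGE separately to the top-N up- and downregulated genes."""
    ranked = sorted(changes, key=changes.get, reverse=True)
    up = {g: changes[g] for g in ranked[:n_top]}
    down = {g: changes[g] for g in ranked[-n_top:]}
    z_up = page_z(up, gene_set)
    # Flip the sign so that positive means enrichment among the most
    # strongly downregulated genes (a convention assumed here).
    z_down = -page_z(down, gene_set)
    # If the set is enriched in both directions, subtract the smaller z.
    if z_up > 0 and z_down > 0:
        smaller = min(z_up, z_down)
        z_up -= smaller
        z_down -= smaller
    return z_up, z_down
```

With this convention, a gene set concentrated at the extreme of only one subset yields a positive z-value for that direction, while a set enriched equally at both extremes cancels to zero.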