Introduction

A gene regulatory network (GRN) describes how gene expression dynamics are regulated in an organism under different biological conditions. Building a GRN requires information concerning three domains: the components and circuits of the network, how these components and circuits are used under various conditions, and the output of the network, i.e. the dynamics of gene expression patterns. Over the past decade, enormous progress has been made in the first and third domains, but only minimal progress has been made in integrating these two domains, which would enhance our knowledge of the second domain.

Transcription factors (TFs) bind to cis-regulatory sequences or motifs within a gene’s promoter and regulate expression. The binding of TFs to promoters of TF and non-TF genes constitutes the backbone of a GRN. Many TFs’ binding motifs have been characterized and listed in databases such as JASPAR and TRANSFAC1, 2. Recently, the human ENCODE project has mapped TF binding sites at the genome level using chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) and expression regulatory regions using DNase I hypersensitive sites sequencing (DNase-Seq)3,4,5. Data generated from these analyses have been used to derive the circuits and architectures of TF regulatory networks3,4,5. Such studies have led to an extensive characterization of the components of regulatory networks. Interactions between and combinations of these components provide a vast regulatory space and potential for gene expression regulation.

Human gene expression in various tissues, during development, or under diverse environmental conditions has also been cataloged systematically in NCBI GEO or ArrayExpress databases. These large datasets have been used to generate gene co-expression networks, in which genes with similar expression patterns were connected6. These networks effectively group genes with similar functions or functioning in the same processes, and have been used to analyze the transcriptome of the human brain, primary cell lines, and various tissues. This advanced the identification of, for example, specific molecular pathways in autism and amyotrophic lateral sclerosis7,8,9,10,11. Although these analyses catalogued distinct expression patterns, they failed to predict a specific TF or group of TFs that regulate the identified co-expressed genes in the network.

A substantial amount of data on the components and the output of human GRNs has now been accumulated. However, very limited efforts have been made to integrate these datasets. Although an enormous amount of TF-binding site data is available in public repositories, it is difficult to uncover the relevant components for specific conditions without further careful informatics-based analyses. In addition, the co-expressed gene groups derived from co-expression networks provide little insight into the regulatory TFs that drive the expression of genes in the co-expression network.

To overcome these shortcomings, we recently described novel methods to integrate the two distinct components of a regulatory network for a plant model system, Arabidopsis thaliana 12. We conducted promoter motif analysis overlying the gene co-expression network and identified target genes regulated by specific cis-regulatory motifs via motif enrichment and motif position bias towards the transcription start site (TSS). The target genes were then used to identify motif-regulated gene co-expression modules. The relevant TFs driving the expression of genes within the network were then identified. Compared to other co-expression network studies, our approach provided much-needed mechanistic insights into how gene co-expression networks are regulated by different TFs12.

Here, we describe a human GRN by merging both regulatory components and gene co-expression networks. We used data from 948 microarray datasets from ArrayExpress13 to build a human gene co-expression network. Promoter motif analysis over the network identified many target genes and co-expression modules via motif enrichment and motif position bias methods. Many known and novel modules regulated by the nuclear factor Y (NF-Y), specificity protein 1 (Sp1), and 37 other cis-regulatory motifs were identified. The interactions between NF-Y and Sp1 TFs and their target genes were validated using ENCODE ChIP-Seq data. Interestingly, while modules regulated by NF-Y are mainly involved in house-keeping functions, the Sp1 motif targets include both house-keeping and tissue-specific gene expression modules. The derived Sp1 modules were superimposed on various genomic and epigenomic data to provide insights into how Sp1 regulates diverse gene targets. Modules were also identified for 37 additional motifs, such as the YY1, RFX2, and IRF1 binding motifs. Our approach identified numerous novel target genes for various motifs, and organized these targets into co-expression modules. The modules then enabled the integration of various genomic/epigenomic data into a coherent regulatory system, providing a valuable resource to identify transcriptional regulators for various human developmental, disease, or immunity pathways.

Results

Human gene co-expression network

We constructed a gene co-expression network for 19,718 human genes based on the graphical Gaussian model (GGM)14, 15 using Affymetrix U133 Plus 2.0 microarray data deposited in the ArrayExpress database13. GGM uses the partial correlation coefficient (pcor), i.e. the correlation between two genes after removing the effects of other genes, to measure gene expression similarity. Pcor performs better than the conventional Pearson’s correlation coefficient in gene network analyses15, 16. As shown in Fig. 1a, 97% of the gene pairs have pcor values in the range of −0.01 to 0.01, indicating no correlation. The gene pairs with pcor >= 0.04 (false discovery rate, FDR, 3.56E-15) were selected. As a result, 186,132 significantly correlated gene pairs (0.095% of all possible pairs) among 19,376 genes were used to construct a human GGM gene co-expression network.
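To illustrate what "removing the effects of other genes" means, the following minimal sketch computes a first-order partial correlation for three expression profiles. It is a simplified illustration only: a GGM conditions each gene pair on all remaining genes, whereas this example conditions on a single third gene, and the function names and toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy data (hypothetical): genes x and y both follow a shared driver z.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [zi + e for zi, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]

print("Pearson r(x, y):     ", round(pearson(x, y), 3))
print("Partial r(x, y | z): ", round(partial_correlation(x, y, z), 3))
```

In this toy case the two genes look strongly correlated by Pearson's coefficient, but the partial correlation is close to zero once the shared driver is accounted for, which is the kind of indirect association a pcor-based network is intended to filter out.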

Figure 1

Characterization of the human co-expression network. (a) Histogram showing the distribution of the partial correlation coefficient (pcor) between gene pairs. Most gene pairs show pcors between -0.02 and 0.02. (b) A sub-network for immunity-related modules extracted from the entire gene co-expression network. In the network, each sphere represents a gene, and a connection between genes indicates that they have similar expression patterns. Genes are colored according to their module identities. (c) A simplified version of the sub-network from (b). The genes from the same module are represented by a single sphere. The size of the sphere is proportional to the number of genes within a module. The number shown within the module sphere represents the module # shown in Supplementary Dataset 1. The network is shown in a 3-D space layout and some modules (e.g. #477 and #851) are hidden behind modules in the foreground.

The derived network was consolidated into 930 clusters via the Markov Cluster Algorithm (MCL) (Supplementary Dataset 1)17. These clusters were treated as co-expression modules. Gene ontology (GO) analysis identified 36 modules enriched with genes functioning in immunity pathways (pValue < 1E-5) (Supplementary Dataset 2). A sub-network extracted for these 36 modules (Fig. 1b and c) includes multiple aspects of immune signaling pathways such as B-cell (module #51, 80, 477), T-cell (#32, 60, 851), and natural killer cell signaling (#31, 556, 702), p53 signaling and apoptosis (#68), interferon α/β signaling (#43, 199), MHC I (#61) and MHC II (#238) antigen processing and presentation, complement & coagulation cascades (#28, 63, 149), NOD-like receptor (NLR) signaling (#45), and inflammatory response (#67, 82). In addition to immune signaling modules, our network identified another 142 modules enriched with genes functioning in development, metabolism, house-keeping functions, or other signaling pathways (Supplementary Dataset 1).

Identification of targets of promoter motifs from gene co-expression network

The gene co-expression network contains gene co-expression modules regulated by specific promoter motif(s). A bottom-up approach was employed to identify such motif-regulated modules. The target genes for a specific motif are first identified by motif analysis12 over the gene co-expression network, and the target gene list is then examined to determine whether the genes form any modules. For each gene, the gene itself and its neighbors in the network are treated as a group, with the gene itself as the seed gene. The group’s promoters are then analyzed to determine whether the motif is enriched within them (measured with a pValue via the hypergeometric distribution), or whether the motif shows a position bias towards the transcription start site (TSS) (measured with a Z score, see below for details). If the seed gene’s promoter contains the motif, and the motif is enriched in the group’s promoters or demonstrates significant position bias towards the TSS, all the genes within the group that contain the motif are considered to be regulated by that motif (Fig. 2). A sub-network is then extracted for the target genes and used for detection of gene co-expression modules.
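The seed-and-neighbors procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the network is assumed to be a dictionary mapping each gene to the set of its neighbors, the background is assumed to be all genes in the network, and only the enrichment branch of the rule is shown (the position bias branch would be an alternative acceptance criterion). The names `hypergeom_tail` and `targets_for_seed` are hypothetical.

```python
from math import comb

def hypergeom_tail(k, M, K, n):
    """P(X >= k): chance of seeing at least k motif-containing promoters
    when n promoters are drawn from M, of which K contain the motif."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(M - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(M, n)

def targets_for_seed(seed, network, motif_genes, p_cutoff=1e-5):
    """Return the genes in the seed's group called as motif targets.

    network     : dict gene -> set of neighbouring genes (assumed layout)
    motif_genes : set of genes whose promoter contains the motif
    """
    if seed not in motif_genes:          # the seed itself must carry the motif
        return set()
    group = {seed} | network[seed]       # the seed gene plus its neighbours
    with_motif = group & motif_genes
    p = hypergeom_tail(len(with_motif), len(network), len(motif_genes), len(group))
    # Only the motif-containing members of an enriched group become targets.
    return with_motif if p <= p_cutoff else set()

# Toy network (hypothetical): 20 genes, 5 of which carry the motif.
network = {f"g{i}": set() for i in range(20)}
network["g0"] = {"g1", "g2", "g3", "g4"}
motif_genes = {"g0", "g1", "g2", "g3", "g4"}

print(sorted(targets_for_seed("g0", network, motif_genes, p_cutoff=1e-3)))
```

Running the function once per gene and taking the union of the returned sets would give the target list from which the sub-network is extracted.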

Figure 2

Motif target identified through motif position bias towards TSS. (a) A diagram showing a group of promoters. Within the promoters, a random/non-functional motif (grey triangle) distributes randomly along the promoters, while a functional motif (solid black triangle) distributes towards the transcription start site (TSS). The white bars represent transposon or repetitive sequence in the promoters, which are excluded from motif analysis. The arrow indicates the direction of transcription. (b) A representative distribution of a random/non-functional motif (grey line) and a functional motif (black line) with bias distribution along promoters. (c) Distribution of the NF-Y (CCAAT) motif within gene promoters from the modules regulated by this motif. Data from the first 10 modules shown in Supplementary Dataset 3 are indicated by different colors, as specified by the color key on the top left. (d) Distribution of the Sp1 motif within the gene promoters for modules regulated by the Sp1 motif. Data from first 10 modules shown in Supplementary Dataset 5 are indicated by different colors, as specified by the color key on the top left.

While motif enrichment analysis has been widely used18,19,20, there has yet to be an efficient and accurate model to measure motif position bias, although such bias has been used in numerous reports as evidence of bona fide motifs21, 22. We recently described a model based on the discrete uniform distribution to measure such motif position bias12. Here, we expanded the model to accommodate more complex conditions. First, we excluded from the motif analyses sequences that form simple repeats or transposons within all promoters. Second, we took into account that cis-regulatory motifs can be present either upstream or downstream of the TSS (Fig. 2a). If an irrelevant motif arises randomly within a group of promoters, it will be distributed uniformly along the promoters (Fig. 2b). The motif’s expected distance E(d) from the TSS and its variance V(d) can then be calculated (see Material & Methods for details). In contrast, many functional motifs are distributed in a biased manner towards the TSS, with a much smaller distance (Fig. 2b). For a given motif that appears n times within the same group of promoters, with a mean distance of \(\overline{|x|}\) from the TSS, a Z score can be calculated via the following formula as a measurement of biased distribution.

$$Z=(E(d)-\overline{|x|})/\sqrt{V(d)/n}$$
(1)

The higher the Z score, the stronger the evidence that the motif is distributed with a bias towards the TSS, and the more likely it is that the genes are regulated via that motif. Using this approach, we identified many known and hitherto unknown target genes for NF-Y and Sp1 binding cis-regulatory motifs (Fig. 2c and d).
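Equation (1) can be illustrated with a short sketch. The exact expressions for E(d) and V(d) are given in the Methods and are not reproduced here, so this example makes simplifying assumptions that should be read as such: every promoter spans the same window, all positions in that window are equally likely for a random motif, and masked repeats are ignored. The window sizes and function names are hypothetical.

```python
import math

def uniform_distance_moments(upstream, downstream):
    """Mean and variance of the distance |x| from the TSS for a motif placed
    uniformly at random in a window of `upstream` bases before and
    `downstream` bases after the TSS (assumed window, no masked regions)."""
    distances = list(range(1, upstream + 1)) + list(range(1, downstream + 1))
    mean = sum(distances) / len(distances)
    var = sum((d - mean) ** 2 for d in distances) / len(distances)
    return mean, var

def position_bias_z(positions, upstream, downstream):
    """Z score of Equation (1): (E(d) - mean|x|) / sqrt(V(d) / n).

    positions : signed motif positions relative to the TSS across the group
    """
    n = len(positions)
    expected, variance = uniform_distance_moments(upstream, downstream)
    mean_abs = sum(abs(x) for x in positions) / n
    return (expected - mean_abs) / math.sqrt(variance / n)

# Toy example (hypothetical positions): ten motif sites clustered near the TSS.
sites = [-40, -60, -55, -70, -45, -65, -50, -58, -62, -48]
print(round(position_bias_z(sites, upstream=1000, downstream=0), 2))
```

Sites that sit, on average, at the expected distance give a Z score near zero, while sites clustered close to the TSS give a large positive Z score.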

Gene expression modules regulated by the NF-Y motif

The gene co-expression network was analyzed to identify potential targets regulated by the ubiquitously expressed NF-Y TFs that bind to the CCAAT motif23. This motif appears in the promoters of 11,998 human genes (61% of all genes analyzed). The motif enrichment method identified only 85 genes as NF-Y motif target genes with a pValue cutoff set at 1E-05. In contrast, the motif position bias analysis identified 3,062 genes as NF-Y motif target genes with Z scores >= 3.5. These 3,062 genes include the 85 genes identified via the motif enrichment method. The promoter sequences of all genes used in the network analysis were then randomized and subjected to the same analysis. On average, 34 genes were identified as NF-Y motif targets in each permutation experiment. Thus the false discovery rate (FDR) for the NF-Y motif target gene analysis is 1.1% (34/3062).
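The permutation-based FDR estimate reduces to a simple ratio, sketched below. How the promoter sequences were randomized is not specified in this section, so shuffling the bases within each promoter is an assumption made purely for illustration; the function names are hypothetical.

```python
import random

def shuffle_promoter(seq, rng):
    """Randomize a promoter by shuffling its bases (assumed scheme);
    base composition is preserved, motif positions are not."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def empirical_fdr(observed_targets, permuted_target_counts):
    """Mean number of targets called on randomized promoters divided by
    the number of targets called on the real promoters."""
    mean_false = sum(permuted_target_counts) / len(permuted_target_counts)
    return mean_false / observed_targets

rng = random.Random(0)
print(shuffle_promoter("GGCCAATCTGA", rng))

# Figures reported in the text: 34 targets per permutation on average,
# 3,062 targets on the real promoters.
print(f"{100 * empirical_fdr(3062, [34]):.1f}%")
```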

After identifying 3,062 NF-Y target genes by performing motif analysis over the whole co-expression network, we then asked whether these target genes form any gene modules. A sub-network was extracted from the whole co-expression network for these 3,062 target genes and clustered into 129 modules, 33 of which contain enriched GO terms (pValue < 5E-7, Table 1 and Supplementary Dataset 3). These modules are potentially regulated by the NF-Y TFs through the CCAAT motif. The CCAAT motif displays position bias towards the TSS among the genes within these modules (Fig. 2c). Many of the modules function in known pathways regulated by NF-Y TFs, such as the cell cycle (module #1), RNA/mRNA processing (#13, 16, 77), protein folding and ER related functions (#4, 26), cholesterol and lipid metabolism (#7, 48, 63), developmental patterning (#62, 68, 83), glucose and carboxylic acid metabolism (#11, 20), fatty acid oxidation (#49), and antigen processing via MHC class II (#29) (Table 1)24,25,26. Other modules indicate novel functions for human NF-Y, such as Golgi vesicle transport (#18), protein polymerization (#22), circadian rhythmic regulation (#24), cilium organization and spermatogenesis (#36), gland development (#53), platelet activation (#59), and cellular response to lipopolysaccharide (#38) (Table 1). Supporting our findings for novel modules #36 and #53, an NF-YB homolog in Schmidtea mediterranea is required for male germ cell development, while NF-Y binding sites are required for basal transcription of TBX3, a key developmental regulator in module #53 (refs 27, 28).

Table 1 NF-Y motif (CCAAT)-regulated modules.

We validated the binding of NF-Y TFs to the gene promoters within the above-described modules using ENCODE ChIP-Seq data4, 25 (Table 1). Among all the 11,998 genes with NF-Y motifs in their promoters used in our network analysis, 3,378 (or 28%) contained NF-Y motif site(s) that were bound by NF-YA and/or NF-YB protein in at least one of three human cell lines (K562, GM12878, and HeLa S3) used in the ENCODE ChIP-Seq analyses. Among the 3,062 NF-Y motif target genes identified from our network analyses, 1,508 (or 49%) contained NF-Y protein-bound NF-Y motif sites, representing a 1.75-fold (49%/28%) enrichment compared to the genome-wide level (pValue = 3.6E-187). Furthermore, 20 of the 34 NF-Y motif regulated modules that we identified in our analyses, including the ones with novel functions, show enrichment for NF-Y binding (pValue < 0.05, Table 1). For example, 19 of the 35 genes (54%) in the circadian rhythm module (#24) have NF-Y protein-bound NF-Y motif sites in their promoters, representing a 1.9-fold enrichment (pValue = 0.001) compared to the genome-wide average level. However, for modules #48, #62, #68, and #85, we observed under-representation of NF-YA/B binding in the ENCODE ChIP-Seq data (Table 1). Interestingly, module #48 functions in lipid metabolism specifically in adipocytes, while modules #62 and #68 participate in developmental pattern regulation. Therefore, we hypothesize that the low coverage in the ENCODE ChIP-Seq data might arise because the gene promoters in these modules are regulated by NF-Y TFs in a cell type-specific manner. Consistent with this, the genes in these three modules are expressed at very low levels in the three cell lines used in the ENCODE ChIP-Seq experiment (Supplementary Dataset 4).
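The fold-enrichment figures quoted above follow directly from the reported counts, as the worked sketch below shows. The helper name `fold_enrichment` is hypothetical, and the sketch reproduces only the ratio, not the associated pValues.

```python
def fold_enrichment(hits_in_set, set_size, hits_in_background, background_size):
    """Ratio of the bound fraction in a gene set to the bound fraction
    among all motif-containing genes (the background)."""
    return (hits_in_set / set_size) / (hits_in_background / background_size)

# Counts reported in the text.
BOUND_BACKGROUND, BACKGROUND = 3378, 11998   # NF-Y-bound / all CCAAT-containing genes

all_targets = fold_enrichment(1508, 3062, BOUND_BACKGROUND, BACKGROUND)
module_24 = fold_enrichment(19, 35, BOUND_BACKGROUND, BACKGROUND)

print(f"All NF-Y motif targets: {all_targets:.2f}-fold")
print(f"Circadian module #24:   {module_24:.1f}-fold")
```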

Additionally, regulation of selected modules’ gene expression by NF-Y TFs was confirmed using published microarray data25, 26. Fleming et al. conducted expression microarray analysis on HeLa S3 cells after depleting NF-YA gene expression using small hairpin RNA25. Based on their data, NF-YA depletion resulted in down-regulation of the cell cycle module (#1) and up-regulation of the nucleosome organization module (#10) and DNA damage response module (#42) (Supplementary Fig. S1). Benatti et al. also measured the transcriptomes of NF-YA depleted epithelial HCT116 cells, within which the modules involved in cholesterol biosynthesis (#7), pyruvate metabolism (#11), fatty acid oxidation (#49), and vesicle trafficking (#18) were repressed (Supplementary Fig. S1). To our knowledge, regulation of nucleosome organization (#10) and vesicle trafficking (#18) by NF-Y has not been reported before. It should be noted that there are three genes encoding NF-Y TFs in humans, namely NF-YA, NF-YB, and NF-YC, and knocking down just NF-YA might not affect all the modules regulated by the NF-Y motif described here.

Gene expression modules regulated by the Sp1 motif

Sp1 is a ubiquitously expressed zinc finger TF that binds to the GC-rich Sp1 motif29, 30 (JASPAR motif ID: MA0079.3) and regulates diverse cellular processes such as cell differentiation and growth, apoptosis, immune response, DNA damage response, and chromatin remodeling. Polymorphisms in Sp1 binding motif sites are risk factors for many diseases such as osteoporosis, heart disease, type 2 diabetes, Alzheimer’s disease, and tumors31,32,33,34,35. The Sp1 binding sites have been mapped for human chromosomes 21 and 22 using ChIP-Chip36. Additionally, the ENCODE project has mapped whole genome Sp1 binding sites in four human cell lines using ChIP-Seq. Interestingly, our network motif-based findings described below identified many novel Sp1 motif-regulated genes that were not captured by the ChIP-Chip or the ENCODE ChIP-Seq experiments.

Among the genes used in our network analysis, 10,459 genes’ promoters contain the Sp1 motif. Our analysis identified 8,048 of them as potential Sp1 motif target genes. Among these target genes, 8,037 were identified by the motif position bias method (Z >= 4), 703 by the motif enrichment method (pValue <= 1E-4), and 694 were identified by both methods. The promoter sequences of all genes used in the analysis were then randomized and subjected to the same analysis. In each permutation run, on average only 3 genes were identified as Sp1 motif targets by our analysis. Thus the FDR for Sp1 motif analysis is 0.04% (3/8048).

A sub-network extracted for the 8,048 Sp1 motif target genes contained 410 modules (Supplementary Dataset 5). Within these modules, the Sp1 motif shows position bias towards the TSS in the genes’ promoters (Fig. 2d). Sixty of these modules have significantly enriched GO terms (pValue < 5E-7) and can be divided into three categories: house-keeping or generic cellular function related modules, development-related or tissue-specific modules, and immunity-related modules (Table 2). A sub-network for the immunity and development related modules is shown in Fig. 3. Consistent with previous reports on Sp1 motif functions, the immunity modules include platelet activation (module #19, 28), TNF-α signaling (#24), osteoclast differentiation (#27), interferon α/β signaling (#62), antigen processing and presentation via MHC I (#81), and chemokine-mediated signaling (#160) (Table 2)37,38,39,40,41.

Table 2 Sp1 motif-regulated modules.

Figure 3

Co-expression modules regulated by the Sp1 motif. A sub-network for development and immune response modules regulated by the Sp1 motif is shown. The size of a sphere is proportional to the number of genes within the module. The number shown within the module represents module # shown in Table 2.

The house-keeping or generic cellular function category contains modules with known functions of Sp1 such as cell cycle regulation (module #2), DNA damage response and DNA repair (#36), response to stimulus (#15), chromatin modification (#37), and lipid biosynthesis (#32) (Table 2)30, 42. It also includes novel functional modules regulated by Sp1: RNA processing (#6, 31, 34), protein folding (#23), vesicle trafficking (#48), and regulation of circadian rhythm (#45) (Table 2).

The ENCODE ChIP-Seq data of four human cell lines (K562, GM12878, H1-hESC, and HepG2) identified 4,361 gene promoters with Sp1 motif bound by Sp1 TF among all 10,459 Sp1 motif-containing genes used in our network. Out of the 23 house-keeping modules identified in our analyses, 16 show enrichment for Sp1 TF binding in the ENCODE ChIP-Seq data (pValue < 0.05, Table 2). These results provide validation of our network findings. Genes within these 16 modules are expressed well in the four cell lines used in the ENCODE project (Fig. 4a) and in diverse human primary cell lines (Supplementary Fig. S2).

Figure 4

Gene expression and epigenetic regulation of Sp1 motif regulated modules. (a) The median gene expression level for genes within the Sp1-regulated modules in four human cell lines. The module numbers shown on the X-axis belong to the development and house-keeping categories described in Table 2. FPKM values from RNA-Seq experiments conducted by the ENCODE project are used for the Y-axis. (b) The median H3K4me3 level in the promoters for genes within the Sp1-regulated modules in three human cell lines used in the ENCODE project. (c) The average promoter DNA methylation level for the genes within the Sp1-regulated modules in four human cell lines used in the ENCODE project.

The development-related or tissue specificity-related category regulated by Sp1 includes various modules functioning in development of the nervous system (#17, 33, 70, 217), skeletal system (#66), muscle (#14), skin (#7, 85), hepatocyte (#46), blood vessel (#5), pancreas (#152), thyroid (#212), kidney (#272), cartilage (#44), reproductive system (#42, 255), and stem cell (#91) (Fig. 3 and Table 2). Previous studies have shown Sp1’s involvement in the development of these tissues individually. For example, Sp1 is important for nervous system development43. Huntington’s disease, a neurodegenerative disease, is caused by mutated huntingtin protein, which interacts with Sp1 and thereby prevents Sp1 from binding to DNA43. Sp1 also regulates the expression of the NOS3 gene in module #5, which encodes the endothelial nitric oxide synthase critical for blood vessel and embryonic heart development44. In the skin development module (#7), Sp1 functions as a repressor to down-regulate KLK5 and KLK7 expression45. Additional examples of the Sp1 motif regulating developmental genes are shown in Supplementary Dataset 6, which together support our novel network findings.

Interestingly, when we examined the ENCODE ChIP-Seq data, the Sp1 motif sites within the genes of the above-described developmental modules were under-represented or even depleted of Sp1 TF binding in the four human cell lines used for ChIP-Seq analyses (Table 2). Compared to Sp1-regulated house-keeping modules, the developmental modules have little or no expression in the four cell lines used in the ENCODE project (Fig. 4a). Instead, they display specific expression in other primary cell lines (Supplementary Fig. S2), indicating cell/tissue-specific expression of genes regulated by the Sp1 motif.

Epigenetic regulation of Sp1-regulated house-keeping and developmental modules

We reasoned that the expression difference between the Sp1-regulated house-keeping modules and the developmental modules could be due to epigenetic regulation. Therefore, we analyzed histone H3 lysine 4 tri-methylation (H3K4me3) and DNA methylation patterns for the promoters of target genes identified in our analyses using data from ENCODE. Apart from TF binding data, the ENCODE project has also measured these two epigenetic marks over ~60 different cell lines that include the ones used for the Sp1 ChIP-Seq experiments (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeUwHistone; http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeHaibMethyl450).

H3K4me3 is normally associated with active or poised promoters46. All house-keeping modules identified in our analyses show high levels of H3K4me3 (Fig. 4b) in the K562, GM12878 and HepG2 human cell lines (data for H1-hESC are not available), consistent with their high expression levels (Fig. 4a). Interestingly, in agreement with their low gene expression levels (Fig. 4a), most of the developmental modules have low H3K4me3 levels in these three cell lines (Fig. 4b) as well as in 56 other cell lines (Supplementary Fig. S3) used by the ENCODE project. However, there are cell line-specific increases in H3K4me3 level for many of the Sp1-regulated developmental modules that match their functions. For example, module #85 is enriched with epithelium development related genes, and their H3K4me3 levels are relatively higher in esophageal epithelial cells (HEEpiC), mammary epithelial cells (HMEC), and small airway epithelial cells (SAEC) (Supplementary Fig. S3). Module #5, enriched with blood vessel genes, has the highest H3K4me3 levels in umbilical vein endothelial cells (HUVEC) (Supplementary Fig. S3).

DNA methylation in gene promoters normally represses their transcription. Consistent with this, we found higher methylation levels in developmental gene modules than in house-keeping modules regulated by Sp1 in the four cell lines used in the ENCODE ChIP-Seq (Fig. 4c), as well as in 59 other cell lines used in ENCODE (Supplementary Fig. S3). Cell line-specific reduction in methylation level is also observed in the developmental modules. For example, the lowest DNA methylation level for module #85 is in three epithelium-related cell lines, NHBE (bronchial epithelial cells), HEEpiC, and SAEC, and for module #5 in the HUVEC cells (Supplementary Fig. S3). Therefore, we hypothesize that Sp1 TF only binds to the gene promoters of the developmental modules after they are activated in specific cells/tissues through epigenetic modifications, and thus Sp1 regulates genes in a cell/tissue-specific manner. Consistent with this, Sp1 binds to the promoters of NOS3 and ACVRL1 of module #5 specifically in the endothelial HUVEC cell line, and in other cell lines after treatment with DNA methylation inhibitors44, 47. DNA methylation also regulates the expression of LDHC of module #28 (ref. 48).

Gene expression modules regulated by other motifs

We also analyzed other human TF motifs catalogued in the JASPAR database or derived from the ENCODE project1, 49. Gene targets for 37 motifs, with FDRs ranging from 0.1% to 4.2%, were identified (Supplementary Dataset 7). Discussed below are three specific examples of gene modules regulated by the Yin Yang 1 (YY1), Regulatory Factor X 2 (RFX2), and Interferon Regulatory Factor 1 (IRF1) motifs (Table 3).

Table 3 Gene modules regulated by YY1, RFX2, and IRF1 motifs.

YY1 is a zinc finger TF with both activation and repression functions50. The modules identified for the YY1 motif (MA0095.2) regulate two mitochondrial pathways (module #1 for cellular respiration, and #12 for fatty acid beta-oxidation), cell cycle (#4), histone modification (#18), RNA processing, splicing and metabolism (#7, 37, 42, 46), ribosome biogenesis (#17), translational elongation (#10), and the type I interferon signaling pathway (#43). Previous reports identified YY1 binding targets enriched with mitochondrial, ribosomal, and RNA metabolism genes51,52,53, which supports our network findings.

The RFX TFs belong to a winged-helix DNA-binding domain-containing TF family conserved in yeast, flies, and vertebrates54. They play important roles in transcriptional regulation of ciliogenesis55. We identified gene co-expression modules regulated by the human RFX2 promoter motif (MA0600.1). Among them are three modules involved in cilium morphogenesis, cilium assembly, and microtubule-based movement (module #1, 2, 14). Other modules indicate novel processes regulated by RFX2 or RFX TFs, such as synaptic transmission (#21), striated muscle cell differentiation (#33), mitotic cell cycle (#42), cellular respiration (#10), and protein folding (#4). Supporting our findings, RFX2 is involved in nerve cell response to paclitaxel in rats, while the RFX TF gene sak + regulates mitotic cell cycle in fission yeast56, 57.

The IRF family TFs are important regulators of immunity58. The gene co-expression modules regulated by the human IRF1 promoter motif (MA0050.2) include those for interferon α/β signaling (module #1, #12), interferon γ signaling (#3), antigen processing and presentation via MHC I (#2), NOD-like receptor signaling (#13), response to virus (#4), leukocyte activation (#7), and regulation of transposition (#18). Interestingly, also included are three development-related modules, for synaptic transmission (#5), neuronal action potential (#20), and hair follicle development (#6), indicating that IRF1 or related IRF TFs might regulate these processes.

The motif target lists of NF-Y, Sp1, YY1, RFX2, and IRF1 from the analyses described here, together with the target lists of the other 34 motifs listed in Supplementary Dataset 7, provide the basis for deciphering the human gene expression regulatory mechanisms that shape the expression landscape as captured by our gene co-expression network.

Discussion

We describe here a human gene co-expression network that we used to identify gene co-expression modules regulated by various cis-regulatory motifs. Compared to other co-expression network studies, a distinctive advantage of our approach is that it provides in-depth characterizations of the TF motifs that regulate gene expression within the network. Using motif enrichment and, more importantly, motif position bias methods, many target genes were identified with high confidence for well-studied NF-Y, Sp1, and other TF motifs. Interestingly, ~90% of the Sp1 motif targets were only identified by the position bias method but not by the typically used motif enrichment method. Additionally, rather than merely producing a list of target genes for selected motifs, our analysis also organized and placed them into diverse gene co-expression modules, providing a clear and streamlined overview of the gene expression landscapes regulated by specific motifs.

The gene network and gene co-expression modules also enable easy integration of various types of omics-based data into a coherent regulatory system. While the regulatory modules were identified based on gene expression and promoter sequence analysis, we used independent TF-promoter interaction data from the ENCODE ChIP-Seq experiments to verify our predictions. The modules from our analyses can be interrogated with gene expression and various types of epigenomic data. For example, the H3K4me3 and DNA methylation profiles of the Sp1 motif target genes correlate closely with both the Sp1 binding profile and the gene expression profile in different cell types. Furthermore, we hypothesize that dynamic changes in these two epigenetic marks will be accompanied by dynamic changes in Sp1 binding and its target genes’ expression, which provides a possible mechanism for how Sp1 differentially regulates house-keeping and tissue-specific gene co-expression modules. Cell line-specific ChIP-Seq or ChIP-qPCR measuring the Sp1 binding sites will be helpful to validate such a hypothesis. The modules could also be used to dissect the function of other epigenetic marks in gene expression regulation, including the genome-wide data for more than 20 histone marks deposited by the NIH Roadmap Epigenome project59.

The robustness and novelty of our approach are demonstrated by the gene co-expression modules identified in our analyses. For example, our analyses captured known and novel modules for the two well-studied motifs NF-Y and Sp1. Because these two motifs are widely distributed in the genome, their targets are hard to identify with the typically used motif enrichment method. Our motif position bias method, which tests for a bias in motif location towards the TSS, made it possible to identify targets of NF-Y, Sp1, and other TFs. While most of the NF-Y targets are related to house-keeping functions, the Sp1 targets also include immunity and tissue development processes. The large number of target modules regulated through the Sp1 motif is also consistent with the recognition that deregulation of Sp1 target gene expression is associated with many disease risk factors. These deregulations usually involve polymorphisms within Sp1 motif sites of target gene promoters. We expect that the network described here, with its expression module-based approach, will promote the identification of additional disease-associated deregulation events.

Our results also show that integrating a gene co-expression network with different types of omics data allows construction of integrated gene expression systems with multiple layers of regulation (Fig. 5). The bottom layer of such an integrated approach will conceivably be the gene co-expression network, where co-expression modules can be identified. These co-expression modules perform specific functions in metabolism or signaling pathways. Superimposed on this is a layer of epigenetic regulation affecting promoter activation, through H3K4me3, DNA methylation, histone acetylation, and other processes. These epigenetic regulations can determine the activated or deactivated status of gene promoters. It should be noted that the current study focused mainly on promoter motifs proximal to the TSS, but there are also other cis-regulatory motifs located at enhancers more than 2.5 kb upstream or 0.2 kb downstream of the TSS. Datasets generated by chromosome conformation capture techniques would be helpful to incorporate such distal regulation into our current model. Another parallel layer is represented by different TFs, which bind to cis-regulatory motifs within the activated gene promoters to regulate their expression. The TFs themselves can interact with each other to co-regulate their downstream target genes. Therefore, the results described here provide a snapshot of gene co-expression modules (expression layer), motif-promoter interactions, and epigenetic regulatory effects. However, it must be noted that multiple TFs, of the same or different gene families, may bind to the same motif sites within a gene's promoter. Therefore, identifying the corresponding TFs that drive the expression of the individual modules will be an important future task.

Figure 5

A gene regulatory system with different layers of regulation. The gene regulatory system consists of three layers: the gene co-expression network, the transcription factor network, and epigenetic regulation. The gene co-expression network captures the output of the regulatory system, i.e. the expression patterns, which are regulated by both the transcription factor network and epigenetic regulation.

Methods

Gene co-expression network

Publicly available microarray datasets generated with Affymetrix U133 plus 2.0 arrays and deposited in the ArrayExpress database13 were used to construct the human gene co-expression network based on GGM as described previously14. See the supplementary methods for more details.

Promoter motif analysis and target gene identification

Promoter sequences for the 19,718 analyzed genes were extracted as 2500 bp upstream and 200 bp downstream of the TSS, except for the NF-Y motif analysis. For the NF-Y motif, due to its high prevalence, promoter sequences were defined as 1000 bp upstream of the TSS. After excluding repeat and transposon sequences (by replacing all nucleotides within these sequences with the letter code “N”), the promoter sequences were scanned for the presence of selected motif sites with motif position weight matrices (PWMs) obtained from JASPAR or derived from the ENCODE project1, 49, using POSSUM with a p-value cutoff of \(4^{-8}\)60, 61. Motif enrichment and motif position bias analyses were then carried out for genes in the network to identify target genes regulated by the motif. Permutation analysis on randomized promoters was conducted to assess the FDR. See supplementary experimental procedure for more details.
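The promoter definition and masking step above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function names `promoter_window` and `mask_intervals`, the 0-based half-open coordinate convention, the inclusion of the TSS base in the downstream part, and the choice of zero downstream bases for the NF-Y window are all assumptions made for illustration.

```python
def promoter_window(tss, strand, upstream=2500, downstream=200):
    """Return a 0-based, half-open (start, end) genomic interval around a TSS.

    Assumption: `tss` is the 0-based coordinate of the TSS base, and the TSS
    base itself counts as the first downstream base.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher genomic coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end


def mask_intervals(seq, intervals):
    """Replace bases inside each 0-based, half-open interval with 'N'."""
    chars = list(seq)
    for start, end in intervals:
        for j in range(max(start, 0), min(end, len(chars))):
            chars[j] = "N"
    return "".join(chars)


# Default window (2500 bp upstream, 200 bp downstream) on each strand.
print(promoter_window(10000, "+"))
print(promoter_window(10000, "-"))
# NF-Y-style window: 1000 bp upstream only (downstream=0 is an assumption).
print(promoter_window(10000, "+", upstream=1000, downstream=0))
# Mask a hypothetical repeat annotated at positions 2..4 of a toy sequence.
print(mask_intervals("ACGTACGT", [(2, 5)]))
```

Masked positions carry the letter "N", so a PWM scanner that skips ambiguous bases will not report sites inside repeats, and those positions can later be counted as not free in the background model.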

Briefly, the motif position bias analysis calculates the extent to which a motif’s distribution within a group of promoters deviates from a random uniform distribution towards the TSS. A background model for a uniformly distributed motif is first established to calculate the motif’s expected distance from the TSS and its variance. Suppose that, for all promoters in the group, each defined as M bp upstream and N bp downstream of the TSS, there are K free bp in total that are not occupied by repeat or transposon sequences. The motif has an equal chance of appearing at any of these K bp. Suppose that at position i (\(-M \le i \le N\)) along the promoters, i.e. i bp relative to the TSS, there are \(k_i\) free bp; that is, \(k_i\) promoters are not occupied by repeat or transposon sequences at that position. Then the motif’s expected mean distance from the TSS is given by:

$$E(d)=\,\sum _{i\,=-M}^{N}\frac{{k}_{i}}{K}\times |i|$$
(2)

And its variance is given by:

$$V(d)=\,\sum _{i\,=-M}^{N}\frac{{k}_{i}}{K}\times {i}^{2}-{(E(d))}^{2}$$
(3)

For an actual motif that appears n times in that group of promoters, with mean distance \(\overline{|x|}\) from the TSS, a Z score is calculated as:

$$Z=\,\frac{E(d)-\overline{|x|}\,}{\sqrt{\frac{V(d)}{n}}}$$
(4)

A motif with a Z score greater than or equal to the selected cutoff value is considered to have a significant position bias towards the TSS.
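Equations (2) to (4) can be written out as a short function. This is a sketch under stated assumptions, not the authors' implementation: the name `position_bias_z` and its input layout (a mapping from position i to \(k_i\), plus a list of observed motif site positions relative to the TSS) are hypothetical.

```python
import math


def position_bias_z(free_counts, motif_positions):
    """Z score for motif position bias towards the TSS, Eqs. (2)-(4).

    free_counts: dict mapping position i (bp relative to the TSS) to k_i,
        the number of promoters with a free (unmasked) base at that position.
    motif_positions: positions i of the n observed motif sites.
    """
    K = sum(free_counts.values())
    n = len(motif_positions)
    if K == 0 or n == 0:
        raise ValueError("need at least one free position and one motif site")

    # Eq. (2): expected distance from the TSS under the uniform background.
    e_d = sum(k * abs(i) for i, k in free_counts.items()) / K
    # Eq. (3): variance of that distance.
    v_d = sum(k * i * i for i, k in free_counts.items()) / K - e_d ** 2
    if v_d <= 0:
        raise ValueError("background variance must be positive")

    # Observed mean absolute distance of the motif sites from the TSS.
    mean_abs = sum(abs(x) for x in motif_positions) / n
    # Eq. (4): positive Z means the sites lie closer to the TSS than expected.
    return (e_d - mean_abs) / math.sqrt(v_d / n)


# Toy example: one promoter spanning -2..+2 with every base free.
background = {i: 1 for i in range(-2, 3)}
print(position_bias_z(background, [0, 0, 1, -1]))  # sites near the TSS
print(position_bias_z(background, [2, -2]))        # sites at the window edges
```

In the toy example, the sites clustered near the TSS give a positive Z score and the sites at the window edges give a negative one, which matches the sign convention of Eq. (4).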

Co-expression modules GO, TF binding, H3K4me3, methylation, RNA-Seq data, and NF-Y related microarray data analyses

Target genes for a specific motif were used to extract a sub-network from the whole human gene co-expression network. The sub-network was clustered using the MCL clustering algorithm and visualized using BioLayout Express 3D62, 63. GO analysis was carried out using the GOStats package in Bioconductor64.
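As a minimal illustration of the sub-network extraction step, the sketch below keeps only the edges whose two endpoints are both target genes. Treating the sub-network as this induced subgraph is an assumption, as are the function name `induced_subnetwork` and the edge-list representation; the clustering itself (MCL) is not reproduced here.

```python
def induced_subnetwork(edges, target_genes):
    """Keep only co-expression edges that connect two target genes.

    edges: iterable of (gene_a, gene_b) pairs from the whole network.
    target_genes: iterable of gene identifiers predicted as motif targets.
    """
    targets = set(target_genes)
    return [(a, b) for a, b in edges if a in targets and b in targets]


# Hypothetical toy network; gene names are placeholders.
network = [("A", "B"), ("B", "C"), ("C", "D")]
print(induced_subnetwork(network, ["A", "B", "C"]))
```

The resulting edge list can then be passed to a clustering tool to delineate co-expression modules.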

To evaluate the enrichment of the corresponding TF’s binding to its motifs within the promoters of co-expression modules, the ENCODE TF binding data were used3. RNA-Seq, H3K4me3, and DNA methylation (Methyl 450K bead array) data from the ENCODE project were used to evaluate the expression and epigenetic modification levels associated with the genes in the modules. See supplementary experimental procedure for more details.

Two published microarray datasets25, 26 (GSE40215 and GSE70543) were also used to assess whether NF-Y motif modules were regulated by the NF-Y TFs. In these two datasets, the NF-YA gene was knocked down using small hairpin RNA in two human cell lines25, 26, HeLa S3 and HCT116. The list of gene regulation values from each dataset was analyzed with Gene Set Enrichment Analysis (GSEA v2.2.0)65 to test whether any of the NF-Y motif modules were up- or down-regulated within the sample. A module was called up-regulated if its NES value was >0 and its FDR value was <= 0.05, or down-regulated if its NES value was <0 and its FDR value was <= 0.05.
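The calling rule above amounts to a simple decision on the NES and FDR values that GSEA reports for each module. The sketch below restates it; the function name `call_module` and the label "unchanged" for modules that pass neither criterion are assumptions introduced for illustration.

```python
def call_module(nes, fdr, fdr_cutoff=0.05):
    """Classify a module from its GSEA NES and FDR values."""
    if fdr <= fdr_cutoff:
        if nes > 0:
            return "up-regulated"
        if nes < 0:
            return "down-regulated"
    # Not significant, or NES exactly zero.
    return "unchanged"


# Hypothetical (NES, FDR) pairs, not values from the study.
for nes, fdr in [(1.8, 0.01), (-2.1, 0.05), (1.5, 0.20)]:
    print(nes, fdr, call_module(nes, fdr))
```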