Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology1,2. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for individual factors are general principles of metazoan regulation that are remarkably well-preserved despite extensive functional divergence of individual network connections. The comparative maps of regulatory circuitry provided here will drive an improved understanding of the regulatory underpinnings of model organism biology and how these relate to human biology, development and disease.
Transcription regulatory factors guide the development and cellular activities of all organisms through highly cooperative and dynamic control of gene expression programs. Regulatory factor coding genes are often conserved across deep phylogenies, their DNA-binding protein domains are preferentially conserved at the amino-acid level, and their in vitro binding specificities are also frequently conserved across large distances3,4. However, the specific DNA targets and binding partners of regulators can evolve much more rapidly than DNA-binding domains, making it unclear whether the in vivo binding properties of regulatory factors are conserved across large evolutionary distances.
Comparisons of the locations of regulatory binding across species has been controversial, with some studies suggesting extensive conservation1,2,5,6,7,8,9,10, whereas others suggest extensive turnover11,12,13,14. Although it is generally assumed that across very large evolutionary distances regulatory circuitry is largely diverged, there exist highly conserved sub-networks15,16,17,18. Thus, confusion exists in the level of regulatory turnover between related species, possibly owing to the small number of factors studied. Moreover, despite recent observations of the architecture of metazoan regulatory networks a direct comparison of their topology and structure—such as clustered binding and regulatory network motif—has not been possible owing to large differences in the procedures employed to assay regulatory factor binding in distinct species. Here we present a systematic and uniform comparison of regulation using many factors across distantly related species to help address these questions on a scale not previously possible.
To compare regulatory architecture and binding across diverse organisms, the modENCODE and ENCODE consortia mapped the binding locations of 93 Caenorhabditis elegans regulatory factors, 52 Drosophila melanogaster regulatory factors and 165 human regulatory factors as a community resource (Fig. 1 and Supplementary Table 1). These regulatory factor binding data sets represent a substantial increase over those previously published for worm (194 new data sets for a total of 219) and human (211 new, 707 total) and a substantial improvement in data quality in fly with a move from chromatin immunoprecipitation with DNA microarray (ChIP-chip) to ChIP followed by sequencing (ChIP-seq) (93 new, 93 total)2,8,19,20. The majority of regulatory factors are site-specific transcription factors (83 in worm, 41 in fly, and 119 in human), although general regulatory factors such as RNA Pol II were also assayed.
All regulatory factors were analysed by ChIP-seq according to modENCODE/ENCODE standards: antibodies were extensively characterized, and at least two independent biological replicates were analysed21. Worm regulatory factors were assayed in embryo and stage 1–4 larvae (L1–L4 larvae), fly regulatory factors in early embryo, late embryo and post embryo, and human regulatory factors in myelocytic leukaemia K562 cells, lymphoblastoid GM12878 cells, H1 embryonic stem cells, cervical cancer HeLa cells, and liver eptihelium HepG2 cells. Binding sites were scored using a uniform pipeline that identifies reproducible targets using irreproducible discovery rate (IDR) analysis (Extended Data Fig. 1)22 and quality-filtered experiments (see Methods). These rigorous quality metrics insure that the data sets used here are robust. All data presented are available at http://www.ENCODEProject.org/comparative/regulation/.
To explore motif conservation, we examined the 31 cases in which we had members of orthologous transcription-factor families profiled in at least two species (Extended Data Fig. 2a and Methods). Sequence enriched motifs were found for 18 of the 31 families and for 12 orthologous families (41 regulatory factors), the same motif is enriched in both species (Extended Data Fig. 2b, c). For 18 of 31 families (64 of 93 regulatory factors), the motif from one species is enriched in the bound regions of another species (one-sided hypergeometric, P = 3.3 × 10−4). These findings indicate that many factors retain highly similar in vivo sequence specificity within orthologous families, a feature noted previously across studies working on smaller numbers of factors.
Next, we used RNA-seq data3 to determine whether targets of orthologous regulatory factors are specifically expressed at similar developmental stages between fly and worm. As a class, orthologous regulatory factors (both assayed here and not) are significantly expressed at similar stages (Extended Data Fig. 3a–c). However, expression of orthologous targets of orthologous regulatory factors in worm and fly shows little significant target overlap (Extended Data Fig. 3d) and the large majority of orthologous regulatory factors did not show conserved target functions (Extended Data Fig. 4a–c), suggesting extensive re-wiring of regulatory control across metazoans. Nevertheless, human and worm orthologous regulatory factors were more likely to show conserved target gene functions than non-orthologous regulatory factors (Extended Data Fig. 4d, Wilcoxon test P < 3.9 × 10−6), highlighting regulatory factors with conserved target functions.
Regulatory factor binding is not randomly distributed throughout the genome, but rather, in all three species, approximately 50% of binding events are found in highly-occupied clusters, termed high-occupancy target (HOT) regions1,2,5,8,10. HOT regions show enhancer function in integrated transcriptional reporters11 and are stabilized by cohesin15,17. HOT regions show no significant enrichment with non-specific antibodies (Extended Data Fig. 5), in contrast to recent work using raw signal19 rather than IDR peaks, although the possibility that they are artefacts has been raised.
By comparing HOT regions across different developmental times and cells types, we find that 5–10% of HOT regions are constitutive, indicating that HOT regions are dynamically established, rather than an intrinsic property of specific regions. In humans we find that approximately 90% of constitutive HOT regions fall within promoter chromatin states compared to only approximately 10–20% of context-specific HOT regions (Fig. 2a and Extended Data Fig. 6). Instead, approximately 80–90% of context-specific HOT regions fall within enhancer states. Moreover, these context-specific HOT regions are specifically enriched for enhancers in matching cell types or developmental stages. For example, 80% of GM12878-called HOT regions fall within GM12878-specific enhancers but only approximately 10% of GM12878-called HOT regions fall within enhancers called in other cell-types (Fig. 2b). These patterns remain similar for all cell types (Extended Data Fig. 7), suggesting the two types of HOT regions are established concordantly and dynamically between cell types, though these patterns are weaker in the worm and fly data.
We constructed regulatory networks in each species by predicting gene targets of each regulatory factor using TIP23 and used simulated annealing to reveal the organization of regulatory factors in three layers of master-regulators, intermediate regulators, and low-level regulators (Fig. 3a, b). The algorithm found only 7% of regulatory factors at the top layer of the network in fly and 13% in worm, compared to 33% in human. We also found that more edges are upward flowing in human (30%) than worm and fly (22% and 7%). This suggests differences in the global network organization with more extensive feedback and a higher number of master regulators in human.
We next assessed the local structure of regulatory networks, by searching for enriched sub-graphs known as network motifs (Fig. 3c). We found that the same network motifs were most and least enriched in the three species. In each case, the most abundant was the feed-forward loop (FFL), while the least abundant were cascade motifs, and both divergent and convergent regulation. Moreover, specific regulatory factors were enriched for origin, target, or intermediate regulators in these FFLs in each species (Fig. 3d). Surprisingly, the number of feed forward loops (FFLs) varied by developmental stage in both worm and fly, with L1 stage in worm and late-embryo stage in fly showing the highest number of FFLs (Extended Data Fig. 8), suggesting increased filtering fluctuations and accelerating responses in these stages24.
We asked whether the three species showed conserved regulatory factor co-associations. We first focused on global co-associations where two factors co-associate frequently regardless of context, either by intermolecular interactions or independent recruitment (Extended Data Fig. 9). With the exception of a small number of conserved global regulatory factor co-associations (for example, SIN3A with HDAC1, HDAC2 and NR2C2 in fly and human25,26,27, and MXI1 with E2F1, E2F4 and E2F6 in worm and human), the majority of global co-associations were not conserved in the contexts and species pairs analysed.
As regulatory factor co-association at distinct binding regions is local and contextual (that is, different combinations of factors co-associate at different genomic locations), we next used an approach to detect co-association at distinct regions of the genome based on conserved patterns of regulatory factor binding. This method uses self-organizing maps (SOMs) to analyse co-association patterns at specific loci by better exploring the full combinatorial space of regulatory factor binding than traditional co-association approaches (Fig. 4a–c)28. We demonstrate that co-associations at distinct genomic regions reveal a more complex view of regulatory structure and bring forth categorical enrichments that are lost in a larger, genomic context.
We examined whether specific contextual co-associations are conserved for orthologous regulatory factors by using binding data from each organismal pair; that is, human–worm and human–fly (Fig. 4b, g). Specific regulatory factor co-associations were observed; most are conserved to varying degrees across each organism with very few that are entirely organism-specific (Fig. 4b, g). These co-associations result in expected sets of factors such as the previously noted SIN3A + HDAC co-association. In addition, we find new co-associations such as the pattern in Fig. 4f for human–worm, which in worm is highly enriched for GO terms associated with sex determination. We further examined which co-associations are conserved at distinct gene locations (that is, proximal and distal). We found distinct combinations of conserved co-associations in relation to transcription start site (TSS) regions. Interestingly, virtually all TSS-proximal co-associations in human remain TSS-proximal in worm (approximately 80%) and fly (approximately 100%), indicating that co-associations that occur at promoters are often highly conserved (Fig. 4h). Conversely, co-associations at distal regions are much less conserved.
Our results, obtained using a large resource of regulatory binding information, suggest that there is little conservation of individual regulatory targets and binding patterns for these highly divergent metazoans: C. elegans, D. melanogaster and H. sapiens. However, we do find strong conservation of overall regulatory architecture, both in network motif usage and in concentrated regulatory binding at dynamically established HOT regions. We observe an increased conservation of in vivo sequence preferences and some target gene functions, with context-specific regulatory factor partners still observed at specific loci in these distal comparisons. These findings are consistent with previous results indicating that the gene targets of regulation are typically quite divergent and are likely to account for many of the phenotypic differences among species12,13,14,16,29,30, despite conserved sequence preferences. We significantly extend these observations, both in the number of regulators studied and in the range of regulatory properties studied, and provide specific examples of conserved and diverged regulatory functions. Lastly, beyond its potential for comparative studies of gene regulation, the primary data sets provide invaluable new information of genome-wide transcription-factor binding information both in human, and in two of themost important metazoan models of human biology, development, and disease.
A data portal has been created for the modENCODE project where data from all stages of analysis in this project are available (http://ENCODEProject.org/comparative/regulation/).
Experimental methods for D. melanogaster ChIP-seq assay
Transgenic lines containing GFP-tagged transcription factors within their endogenous genomic contexts were produced as described previously1,31. Chromatin was collected and chromatin immunoprecipitation was performed as described previously20. Multiplexing allowed for sequencing of between 4 and 12 samples per lane on an Illumina Hi-Seq for a minimum of 5 million reads per sample. New GFP-tagged lines are made available at the Bloomington Stock Center. Tagged line stock numbers are: Abd-B stock 38625; Eip74EF stock 38636; Lola stock 38660; N stock 38665; Stat92E stock 38670; usp stock 38672.
Experimental methods for C. elegans ChIP-seq assay
C. elegans ChIP-seq assays were performed as described in32, with a few modifications. In brief, transgenic worms containing GFP-tagged transcription factors were grown to the desired developmental stage under controlled conditions and cross-linked with 2% formaldehyde. Cell extracts were sonicated to yield predominantly DNA fragments in the range of 200–500 bp. The sonicated lysates were immunoprecipitated in either 5% or 1% Triton using anti-GFP antibody. Sequencing libraries were prepared from the two independent biological replicates of immunoprecipitation-enriched and input DNA fragments. Libraries were multiplexed using four 4-bp barcodes33 and sequenced on Illumina Genome Analyzer II.
Experimental methods for human ChIP-seq assay
Human ChIP-seq was performed using the overall method outlined in ref. 21. In brief, 2 × 107 cells were cross-linked using 1% formaldehyde at room temperature followed by treatment with 125 mM glycine. The cross-linked cells were resuspended in hypotonic buffer and the cells were lysed by Dounce homogenization. The resulting nuclear extract was sonicated to obtain DNA fragments in the target size of 200–500 bp. Immunoprecipitation was performed overnight at 4 °C using 2 µg of antibody. The transcription factor–antibody complexes were collected using protein A and Protein G agarose beads. The immunoprecipitation-enriched DNA (transcription-factor antibody as well as control IgG) was used to prepare sequencing libraries similar to the methods used for C. elegans ChIP-seq library preparation. A single sample was run per lane of the Illumina Genome Analyzer II.
Uniform processing of transcription factor ChIP-seq data sets
We used a uniform processing pipeline to identify high-confidence binding events (peaks) for a large collection of ChIP-seq data sets in three species from the modENCODE and ENCODE consortia; worm (C. elegans), fly (D. melanogaster) and human (H. sapiens). For human, we analysed 707 distinct ChIP-seq data sets (with at least two replicate experiments) representing 165 unique regulatory factors (generic and sequence-specific factors). The data sets span 91 human cell types and some are in various treatment conditions. These data sets were generated by production groups located at the following universities: The Broad Institute, Stanford University, Yale University, University of California Davis, Harvard University, HudsonAlpha, University of Texas (Austin) and University of Washington. For worm, we analysed 220 distinct ChIP-seq data sets (with at least two replicates) spanning 93 unique regulatory factors in 11 developmental stages. For fly, we analysed 93 distinct ChIP-seq data sets (with at least two replicates each) spanning 52 unique regulatory factors in 17 developmental stages.
For each experiment, mapped reads in the form of BAM files were downloaded from the ENCODE University of California Santa Cruz Data Coordination Center (http://encodeproject.org/ENCODE/downloads.html) and the modENCODE Data Coordination Center (http://www.modencode.org/). These BAM files were generated by the individual data production labs using different mappers and mapping parameters. In order to standardize the mapping protocol, we used custom mappability tracks to filter out multi-mapping reads and only retain unique mapping reads that is, reads that map to exactly one location in the genome. We also filtered all positional and polymerase chain reaction (PCR) duplicates.
A number of quality metrics for all replicate experiments of each data set were computed (ref. 21, and A.K., unpublished observations). In brief, these metrics measure ChIP enrichment and signal-to-noise ratios, sequencing depth and library complexity and reproducibility of peak calling. These measures will be reported at the ENCODE portal at http://encodeproject.org/ENCODE/qualityMetrics.html. Data sets that did not pass the minimum quality control thresholds were discarded and not used in any analyses. Data sets that passed most but not all quality metrics were flagged.
All ChIP-seq experiments were scored against an appropriate control designated by the production groups (either input DNA or DNA obtained from a control immunoprecipitation). For human and worm data sets, we used the SPP peak caller to identify and score (rank) potential binding sites and peaks34. However, for fly data sets, we instead used the MACS (v.2) peak caller35. Most of the fly data sets used the NexTera sample preparation protocol which resulted in non-canonical distribution of reads around binding sites and lower signal to noise ratios. These characteristics made them unsuitable for use with the SPP peak caller which specifically models peak shape and penalizes peaks with non-canonical stranded distribution of reads around binding sites. The MACS v.2 peak caller does not directly model such peak structure and is thus more immune to non-canonical read distributions.
To obtain optimal thresholds, we used the irreproducible discovery rate (IDR) framework to determine high confidence binding events by leveraging the reproducibility and rank consistency of peak identifications across replicate experiments of a data set22 (A.K., unpublished observations). Code and detailed step-by-step instructions to call peaks using the IDR framework are available at https://sites.google.com/site/anshulkundaje/projects/idr.
For worm and human data sets, the SPP peak caller34 was used with a relaxed peak calling threshold (FDR = 0.9) to obtain a large number of peaks (maximum of 300,000 for human and 30,000 for worm) that span true signal as well as noise (false identifications). Peaks were ranked using the signal score output from SPP (which is a combination of enrichment over control with a penalty for peak shape). The IDR method analyses a pair of replicates, and considers peaks that are present in both replicates to belong to one of two populations: a reproducible signal group or an irreproducible noise group. Peaks from the reproducible group are expected to show relatively higher ranks (ranked based on signal scores) and stronger rank-consistency across the replicates, relative to peaks in the irreproducible groups. Based on these assumptions, a two-component probabilistic copula-mixture model is used to fit the bivariate peak rank distributions from the pairs of replicates22.
The method adaptively learns the degree of peak-rank consistency in the signal component and the proportion of peaks belonging to each component. The model can then be used to infer an IDR score for every peak that is found in both replicates. The IDR score of a peak represents the expected probability that the peak belongs to the noise component, and is based on its ranks in the two replicates. Hence, low IDR scores represent high-confidence peaks. An IDR score threshold of 2% for human data sets and 5% for worm data sets was used to obtain an optimal peak rank threshold on the replicate peak sets (cross-replicate threshold). If a data set had more than two replicates, all pairs of replicates were analysed using the IDR method. The maximum peak rank threshold across all pairwise analyses was used as the final cross-replicate peak rank threshold.
Any thresholds based on reproducibility of peak calling between biological replicates are bounded by the quality and enrichment of the worst replicate. Valuable signal is lost in cases for which a data set has one replicate that is significantly worse in data quality than another replicate. Hence, we used a rescue strategy to overcome this issue. In order to balance data quality between a set of replicates, mapped reads were pooled across all replicates of a data set, and then randomly sampled (without replacement) to generate two pseudo-replicates with equal numbers of reads. This sampling strategy tends to transfer signal from stronger replicates to the weaker replicates, thereby balancing cross-replicate data quality and sequencing depth. These pseudo-replicates were then processed using the same IDR pipeline as was used for the true biological replicates to learn a rescue threshold. For data sets with comparable replicates (based on independent measures of data quality), the rescue threshold and cross-replicate thresholds were found to be very similar. However, for data sets with replicates of differing data quality, the rescue thresholds were often higher than the cross-replicate thresholds, and were able to capture more peaks that showed statistically significant and visually compelling ChIP-seq signal in one replicate but not in the other. Ultimately, for each data set, the best of the cross-replicate and rescue thresholds were used to obtain a final rank threshold. Reads from replicate data sets were then pooled and SPP was once again used to call peaks on the pooled data with a relaxed FDR of 0.9. Pooled-data peaks were once again ranked by signal-score. The final rank threshold (best of cross-replicate and rescue threshold) was then used to threshold the ranked set of pooled-data peaks.
For fly data sets, we used a slightly modified version of the above pipeline. For each replicate experiment of a data set, we used the MACS v.2 peak caller35 with a relaxed P value threshold of 1 × 10−3 to obtain a maximum of 30,000 peaks (replicate sets). Peaks were ranked based on their P values. Reads from the replicate experiments were then pooled and once again MACS v.2 was used with a P-value threshold of 1 × 10−3 to obtain a relaxed set of peaks (pooled set). We only retained peaks in the pooled set that overlapped at least one peak in both replicate sets (replicate-reproducible peaks). For each replicate-reproducible peak in the pooled set, we obtained a pair of P values corresponding to the overlapping peaks in each of the replicate sets. If a peak in the replicate-reproducible set overlapped multiple peaks in a replicate-set then the P value of the replicate-set peak with the maximal overlap with the pooled-set peak was used. Thus, we obtain two independent rankings based on P values from each replicate for the same set of replicate-reproducible peaks (using peak coordinates learned on the pooled set). The pair of ranked lists for the replicate-reproducible peaks were then used as input to the IDR framework as described above to learn cross-replicate rank thresholds at an IDR of 5%. The above protocol was repeated for pseudo-replicates to obtain a rescue rank threshold at an IDR of 5%. The better of the two rank thresholds was used to truncate the replicate-reproducible peaks in the pooled set to obtain the final set of optimal rank consistent and reproducible peaks.
All peak sets were then screened against specially curated empirical blacklists for each species. In brief, these blacklist regions typically show the following characteristics: unstructured and extreme high signal in sequenced input-DNA and control data sets as well as open chromatin data sets irrespective of cell-type identity; an extreme ratio of multi-mapping to unique mapping reads from sequencing experiments; overlap with specific types of repeat regions such as centromeric, telomeric and satellite repeats that often have few unique mappable locations interspersed in repeats.
Identification of HOT and XOT regions
To identify regions with higher than expected binding occupancies, we first determined for each specific context in each organism the number and size distribution of observed binding sites for each factor assayed, as well as the total number and size distribution of binding regions in which these binding sites from all factors are clustered. For each target case (context per species evaluated), we first analysed the number and size distribution of target binding regions (in which factor binding sites are concentrated). For each target case simulation, we randomly selected an equivalent number of random binding regions with a matched size distribution. Next, for each factor assayed (in the target case), we evaluated the number and size of observed binding sites, and simulated an equivalent number and size distribution of target binding sites, restricting their placement to the simulated binding regions. We collapsed simulated binding sites from all factors into binding regions, verifying that these cluster into a similar number of simulated binding regions as the target binding regions. For each target case simulation, the occupancy (number of peaks), density (peaks per kb), and complexity (diversity of factors) in the simulated binding regions are annotated. This procedure was repeated 1,000 times for each case (human = 5 contexts; worm = 5 contexts; fly = 3 contexts). For each target case, we constructed expected binding region occupancy distributions from the corresponding 1,000 simulations. We determined the cutoffs at which fewer than 5% and 1% of the simulated binding regions have higher occupancies (Extended Data Fig. 2). We classified observed binding regions with occupancies higher than the 5% and 1% cutoffs as high-occupancy target (HOT) and extreme-occupancy target (XOT) regions, respectively. As such, HOT regions include XOT regions.
GO enrichment analysis
To evaluate the functional role of regulators we performed GO enrichment analysis on the targets of binding of each ChIP-seq experiment. In brief, we applied ChIPpeakAnno to assign factor binding to genic targets and to evaluate the enrichment of genic targets for GO ontologies using standard procedures36. We required a minimum of 20 peaks per ChIP-seq experiment to evaluate enrichment and report Benjamini–Hochberg corrected P values of enrichment (hypergeometric testing). We report GO terms in which at least one ChIP-seq experiment was significantly enriched (corrected P < 0.05). The specific enrichments for each human, worm and fly ChIP-seq experiment are provided in Supplementary Tables 2, 3, and 4, respectively.
To compare the functional conservation of regulatory binding between transcription-factor orthologues, we evaluated the overlap in GO term enrichments for orthologous factors between species. Specifically, for each species comparison, we calculated the significance of the overlap in GO term enrichments for all ChIP-seq experiments involving orthologous factors assayed in the two species. Overlap enrichment and depletion P values between ChIP-seq experiments of each species were determined using directional Fisher’s exact tests and were Benjamini–Hochberg-corrected. To generate a final overlap score, we selected the most significant of the enrichment and depletion scores, reporting the −log10(P value of enrichment) or the log10(P value of depletion).
Generation of orthologue list
Analysis was performed on twelve Drosophila species (D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavenis, D. virilis, D. grimshawi) using the September 2010 release of FlyBase, five Caenorhabditis species (C. elegans, C. brenneri, C. briggsae, C. japonica, C. remanei) using WormBase WS220, and two mammals (H. sapiens, Mus musculus) and one out-group species (Saccharomyces cerevisiae) using Ensembl release 61.
Gene families were defined using Ensembl Compara gene families for the primary species (human, mouse, D. melanogaster, C. elegans, S. cerevisiae), and these clusters were supplemented by genes from the additional fly and worm species using BLAST37. For each gene family, we aligned the peptide sequences using MUSCLE38. From this alignment, we built an initial gene tree using RAxML39 with the PROTGAMMAJTT model, then corrected for topological uncertainty using TreeFix40, and finally accounted for possible incomplete lineage sorting using DLCoal41. For DLCoal, we used species tree parameters from literature for the main species and assumed that the remaining fly and worm species take the parameters of D. melanogaster or C. elegans, respectively. To infer homologues, we considered two genes as orthologues (paralogues) if their most recent common ancestor is a speciation (duplication) node. To improve orthologue calls, we filtered out duplications was zero consistency score42. Finally, we remapped Ensembl identifiers to release 65.
From a total of 31,751 identified gene families within the three genomes, our data sets here capture 242, with 34 families having a transcription factor from at least two species (100 transcription factors, 459 data sets), and 6 families from all three species (24 transcription factors, 130 data sets). Overall, we found 14 pairs of homologous factors between worm and fly (corresponding to 12 transcription factors in worm and 12 in fly), 41 pairs between worm and human (23 and 36 transcription factors, respectively) and 28 pairs between fly and human (17 and 24 transcription factors, respectively). 14 orthologous triplets were in common for all three organisms (corresponding to 10 transcription factors in human, 8 in fly, and 6 in worm).
Many of these factors are quite divergent in sequence among the species with the exception of RNA polymerases II and III, histone deacetylases, and TBP. Multiple experiments in different stages were available for many factors and some of the common factors are expressed at analogous times for worm and fly3.
We restrict our analysis to the entire genome excluding HOT regions, unmappable and blacklist regions, 3′ untranslated regions (UTRs), coding exons, and several other exons for human (ribosomal RNAs, small nucleolar RNAs, and other miscellaneous RNAs, small nuclear RNAs, microRNAs) and worm (all available). These background regions are randomly split into two groups: one for motif discovery and another for ranking the enrichment of the discovered motifs. Transcription factors that have more than five available data sets in a species have five data sets randomly selected for analysis. Motif discovery is conducted on the top 200 peaks for each data set that overlap the discovery background using five discovery tools: AlignACE43 (v.4.0 with default parameters), MDscan44 (v.2004 with default parameters), MEME45 (v.4.7.0 with -maxw 26 and -nmotifs 6), Weeder46 (v.1.4.2 with option large), and Trawler47 (v.1.2 with 200 random intergenic blocks for background). For each species and factor family, the top three motifs are selected after ranking by the enrichment in the data sets for that species and excluding motifs for which a similar motif has already been selected (Pearson r > 0.7). These discovered motifs are augmented with all known literature motifs for factors in that gene family48,49,50.
Enrichments are computed by taking the fraction of motif instances that are inside the bound regions and dividing that by the fraction of shuffle motif instances inside (where the bound regions are filtered against the background regions, defined below). They are also corrected for small counts by using a Wilson’s binomial confidence interval (with Z = 1.5) around each fraction and taking the extreme which leads to the enrichment closest to 1. Motifs are considered enriched if this corrected enrichment is at least 1.5-fold.
The discovered motifs, their enrichments, and the underlying annotations are available at http://www.broadinstitute.org/~pouyak/motif-disc/integrate-cold/.
Enrichment of orthologous transcription-factor expression
To match the developmental stages of D. melanogaster and C. elegans, we first estimated the expression levels of orthologous genes between fly and worm at different developmental stages by applying Cufflinks51 to modENCODE time course RNA-seq data. We next identified stage-associated genes—genes highly expressed at that stage but not always highly expressed across all stages—for every fly and worm developmental stage. Then for every possible pair of fly and worm stages, we counted the number of orthologous gene pairs between their stage-associated genes, which would be used to test against the null hypothesis that the fly and worm stages have independent stage-associated genes. For the resulting p values, we applied Bonferroni correction and used the corrected P values to decide which fly and worm stages ‘match’ (have dependent stage-associated genes).
Transcription-factor co-association (intervalStats)
We determined the similarity in binding sites between ChIP-seq experiments applying recently developed interval statistics methods that allow calculation of exact P values for proximity between binding sites52. Using this method, we performed all pairwise comparisons of ChIP-seq experiments for each organism, evaluating binding similarity in 114,582 human comparisons, 34,782 worm comparisons, and 3,906 fly comparisons. For each species, we restrained interval analyses to the promoter domains by excluding binding intervals outside promoter regions. To exclude the possibility of promiscuous binding regions and generate more conservative co-association estimates, we excluded binding sites from XOT regions in each specific context from these analyses. Promoter regions were defined as 5,000 bp upstream to 500 bp downstream of human TSSs, and 2,000 bp to 200 bp downstream of worm and fly TSSs. Focusing co-association analyses on the promoter domains serves to focus co-association evaluations on transcriptional regulatory interactions and to account for the known biases in binding at TSSs and produces more conservative estimates of co-association significance. For each comparison, the intervals of the query ChIP-seq experiment are compared individually against all reference intervals of the alternate ChIP-seq experiment, calculating the probability that a randomly located query interval of the same length would be at least as close to the reference set. For each comparison, we compute the fraction of proximal binding events in promoter domains that are significant (P value <0.05). Because these comparisons are asymmetric—depending on the assignment of experiments as query or reference sets—we report the mean values of the complementary (inverted) comparisons.
Transcription-factor co-association (SOM)
Using the orthologous factors between human-worm or human-fly, we defined a cis-regulatory module as the maximum overlapping block of the intersection of all transcription-factor binding peaks on either genome. We require a minimum of two transcription factors bound in a cis-regulatory module to be considered for further analysis in the self-organizing map (SOM). Several window sizes were examined for co-association (500 bp, 1 kb, and DNase hypersensitive sites53) with similar results found in each case.
We binarized each cis-regulatory module as either bound (1) or not bound (0) by overlap with peaks from each transcription factor. This results in the cis-regulatory modules being represented as a binary vector of the number of dimensions being the count of orthologous transcription-factor families. These vectors, which map back to specific genomic locations, are now directly comparable across species. These are used as input to the SOM and resulting descriptions of each neuron are also described in this form.
For each SOM trained, we followed the rules described previously in ref. 28. In brief, these rules are: the SOM is initialized as a random toroid; the SOM is hexagonal; the SOM is trained for 100 epochs (that is, complete iterations through the data set); the SOM update radius was one-third of the map size with a learning rate (alpha) of 0.05 (these were linearly decreased throughout the training process); the best out of 1,000 trials, based on lowest quantization error, were selected for analysis (defined as the average Euclidean distance of all CRMs to their best matching neuron).
The training described above is performed in R using a variant of the ‘kohonen’ package available from CRAN54. Minor modifications were performed to the R package to allow for better handling of the large data sets in memory. Furthermore, significant changes to the graphical output of the package were made to allow for the improved figures displayed here and on the supplementary website. Final optimal seeds for the training were human–worm SOM: 49,027 and human–fly SOM: 60938. One hundred epochs of training resulted in stabilization of the classification error, and of the 1,000 iterations of the SOM there was minimal divergence with the best SOM having less than 0.3% difference in error than the average error of the non-optimal SOMs. Final SOM sizes were 25 × 18 and 17 × 14 for the human–worm and human–fly SOMs respectively and average CRM distance to the best matching neuron was 0.429 and 0.308 for human–worm and human–fly respectively.
Interactive SOMs can be accessed at http://ENCODEProject.org/comparative/regulation/Worm/SOM/ and http://ENCODEProject.org/comparative/regulation/Fly/SOM/.
The targets of individual transcription factors in human, worm and fly were identified using TIP23. The regulatory networks are the superposition of all the regulatory edges in the three species respectively. For the analysis of transcription factor–transcription factor regulatory networks (Fig. 3a, c, d), we used a Q-value threshold of 0.1 in all three species. For the analysis including various target genes, a Q-value threshold of 0.01 was employed. In Fig. 3a, b, the hierarchical organization was constructed by assigning the nodes in three levels such that an energy function based on the number of feedback edges was minimized. For enrichment analysis (Fig. 3c, d) the null model is an ensemble of random networks with the same degree distribution as the network of interest. In part d, the tendency of a transcription factor at a particular position of a FFL is obtained by counting how often it appears at the position in the network of interest, and how often in appears at the same position in the null model.
This work is supported by the NHGRI as part of the modENCODE and ENCODE projects. This work was funded by U01HG004264, RC2HG005679 and P50GM081892 to K.P.W., U54HG006996, U54HG004558 and U01HG004267 to M.S., and F32GM101778 to K.E.G.
Extended data figures
This zipped file contains Supplementary Tables 1-4.
About this article
Nature Communications (2016)