To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human–mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.
Determining the similarities and differences between mouse and human regulatory networks will not only improve our understanding of the evolution of regulatory mechanisms, but also help to interpret biomedical insights derived from research performed on mouse models. Recent genome-wide binding studies of eight TFs in several species uncovered many regulatory networks that have been highly rewired since the divergence of ancestors to mouse and human1,2,3,4, consistent with early studies in other species5. These results contrast sharply with other data showing that conservation of genomic DNA sequences can be a useful guide to discovery of regulatory regions6, and that the regulatory landscape can be highly conserved among more distant species7. Considering the large numbers of known TFs and their functional diversity, comprehensive studies on a broader range of TFs are needed to resolve these apparent discrepancies. Furthermore, our knowledge of the functional consequences of either divergence or conservation of TF occupancy remains limited.
The mouse–human orthologous occupancy profiles
To examine conservation of TF binding regions both between species and across different cell types, we generated and analysed a large data set of genome-wide binding profiles for 34 TFs in mouse and human. A diverse panel of TFs was chosen, including those that bind DNA through specific consensus sequences, comprise part of the general transcriptional machinery such as RNA polymerase 2 (POL2), and modify or remodel chromatin (Extended Data Fig. 1a and Supplementary Information). For simplicity, we refer to the entire collection as TFs, even though some are general factors. We focused on occupancy by 32 TFs in cell line models for erythroid progenitors (mouse erythroleukaemia MEL and human leukaemia K562 cells) and lymphoblasts (mouse lymphoma CH12 and human B lymphoblastoid GM12878 cells) in mouse and human, and we also showed that the results are similar to those obtained in mouse and human embryonic stem cells (Extended Data Fig. 8). Chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) assays were conducted using replicate experiments and in accordance with ENCODE standards8. A total of 120 data sets were generated and analysed.
Conserved and non-conserved features
These genome-wide binding data for a large and diverse set of TFs revealed both conserved and non-conserved features of TF occupancy between mouse and human. First, although most TFs can reside at both promoters and distal sites, each shows a pronounced preference (Fig. 1a and Extended Data Fig. 2a, b). The preference is strongly conserved between mouse and human (R = 0.8; Extended Data Fig. 2c). The one exception is ETS1. Even though the primary motif in ETS1 is conserved between mouse and human (Fig. 1b), it preferentially binds proximal to promoters in human but not in mouse. ETS1 is responsible for the mouse-specific expression of the T-cell marker Thy-1 in the thymus9, and we propose that this marked difference in its binding location may contribute to immune system differences between mouse and human10. Second, although the primary motifs of most sequence-specific TFs are conserved between mouse and human, the secondary motifs (for example, motifs of associated factors; see Supplementary Information) tend to be lineage-specific (Fig. 1b and Extended Data Fig. 2d), indicating a change in co-associated partners.
The preferred chromatin states, defined by histone modifications, for occupied sequences (OSs) of orthologous TFs are also conserved between mouse and human. Using data on five histone modifications, the mouse and human genomes were segmented into eight chromatin states (Fig. 1c and Extended Data Fig. 3a, b). Most TF OSs are located in states characteristic of promoters and enhancers (states 1–4). By contrast, approximately 50% of OSs for the CTCF–cohesin complex (CTCF, RAD21 and SMC3)11,12 are located in states 5 and 8, which mark quiescent regions with very low signal for all the histone modifications. MAFK also shows a preference for quiescent regions. Notably, both the CTCF–cohesin complex and MAFK13 can mediate long-range interactions in the genome. The state preference is conserved between mouse and human (Fig. 1c; R = 0.9; Extended Data Fig. 3b), suggesting that the overall functions of the occupied segments are similar in the two species. Indeed, the proportion of enhancers, predicted by a different approach14,15, is also conserved (R = 0.7) (Extended Data Fig. 4).
We also examined DNA methylation profiles in TF OSs by using both methylated DNA immunoprecipitation (MeDIP) and DNA digestion with methyl-sensitive restriction enzymes followed by sequencing (MRE-seq)16. The TF OSs are highly enriched for MRE-seq signals and depleted of MeDIP-seq signals, showing that TF OSs are generally hypomethylated in both species (Fig. 1d and Extended Data Fig. 3c).
TF- and location-specific occupancy conservation
The TF binding regions are enriched for conservation of DNA sequences, showing a strong signal for evolutionary constraint within ±50 base pairs (bp) of ChIP-seq peak summits (Fig. 2a). This result indicates that purifying selection has acted on DNA sequences in many of the TF OSs, but it does not mean that all TF OSs are uniformly under constraint. Approximately 50% of TF OSs do not align between mouse and human15 because either they are lineage-specific sequences such as transposable elements17, or they have diverged to an extent that they no longer align.
We then focused on the subset of TF OSs in which the sequences aligned between mouse and human to determine whether orthologous DNA sequences are also occupied by orthologous TFs (details in Supplementary Methods). Notably, the proportion of TF OSs at which occupancy was conserved varied markedly both among TFs and with the genomic locations (Fig. 2b). Conservation of occupancy is consistently higher in the promoter regions and lower in distal regions for almost all TFs, suggesting that the promoters may be under stronger selection than distal enhancers. Conserved promoter occupancy is observed both for factors that bind near promoters (NRF1 and MAZ) and for factors with a minority of binding sites in promoter regions (for example, MEF2A and TAL1). A notable exception is the CTCF–cohesin complex, which not only shows high levels of occupancy conservation as described previously18, but also the conservation remains high at proximal, middle and distal regions relative to the transcription start site (TSS) (Fig. 2b). These patterns of variation in conservation of occupancy are robust. One potential confounding factor is the tendency for promoter sequences to be more conserved than other regulatory regions, but adjusting the occupancy conservation by the sequence conservation difference revealed similar trends, that is, the OSs in promoter regions are more conserved than those in other regions (Extended Data Fig. 5a). Similarly, removal of the few TFs for which markedly different numbers of peaks were called between mouse and human did not change the patterns of conservation of occupancy (Extended Data Fig. 5b and Supplementary Information).
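The per-location proportion of occupancy-conserved OSs can be illustrated with a minimal sketch. It assumes that each aligned OS has already been mapped to the other species' coordinates and labelled with a genomic location, and that any overlap of at least 1 bp with a peak counts as conserved occupancy. The function names and the overlap criterion are illustrative assumptions; the criteria actually used are given in the Supplementary Methods.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def conservation_by_location(mapped_os, other_species_peaks):
    """Fraction of occupancy-conserved OSs per genomic location.

    mapped_os: list of (location_label, (chrom, start, end)), with intervals
               already expressed in the other species' coordinates (assumed input).
    other_species_peaks: list of (chrom, start, end) peaks for the orthologous TF.
    """
    total = defaultdict(int)
    conserved = defaultdict(int)
    for location, interval in mapped_os:
        total[location] += 1
        # Assumption: any overlap with a peak counts as conserved occupancy.
        if any(overlaps(interval, peak) for peak in other_species_peaks):
            conserved[location] += 1
    return {loc: conserved[loc] / total[loc] for loc in total}
```

A linear scan over all peaks is adequate for a toy example; genome-scale data would need an interval index.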
Next, we investigated how epigenetic factors influence TF binding at orthologous sites between mouse and human. As expected, the distribution of chromatin states is highly similar for occupancy-conserved TF OSs. For orthologues of TF OSs that can be aligned between the two species but are bound only in one species, a smaller proportion were in enhancer-associated states (states 3 and 4) and a larger proportion were in either repressed (state 7) or quiescent (states 5 and 8) chromatin states (Fig. 2c and Extended Data Fig. 6a, b). Thus species-specific loss of TF occupancy at many sites is accompanied by a shift to repressive or quiescent chromatin. By contrast, the promoter states (states 1 and 2) were largely maintained in the second species even with the loss of TF binding. This result indicates that other TFs may help to maintain conservation of a promoter state in these regions. We also searched for changes in the level of DNA methylation between TF OSs and their orthologous sequences. DNA methylation levels remained low in both species for occupancy-conserved TF OSs (Fig. 2d and Extended Data Fig. 6c), but the DNA methylation levels were significantly increased in the unbound, orthologous sequences. Thus, species-specific loss of TF occupancy is also associated with species-specific increases in DNA methylation.
Occupancy conservation associates with pleiotropy
We proposed that TF OSs with regulatory functions in several tissues would be under increased selective pressure, and thus more likely to be conserved in occupancy. To test this hypothesis, we first examined DNase I hypersensitive sites (DHSs) across 55 mouse tissues and cell lines15 to measure the chromatin accessibility of each TF OS among different tissues. Because DHSs are a proxy for regulatory element activity19, TF OS regions accessible in multiple tissues are more likely to function in those tissues. Chromatin accessibility of TF OSs presents wide variation, ranging from tissue-specific to ubiquitous patterns (Fig. 3a). Notably, the TF OSs with more pervasive chromatin accessibility across different tissues show the highest extent of occupancy conservation between mouse and human. The association between tissue usage and occupancy conservation is general; it was observed for most of the TFs examined (Extended Data Fig. 7b, c). This association is also robust to several potential confounding factors. CTCF–cohesin complexes, which are abundant and conserved across different tissue types and species18,20, might be expected to bias the result; however, we obtained comparable results after removing all the genomic regions occupied by CTCF, RAD21 or SMC3 (Extended Data Fig. 7a). The conservation of promoter regions among several tissues and species14 might also be expected to bias our analysis, but, after removal of occupancy-conserved TF OSs that lie within 2 kilobases (kb) of TSSs, we still found that the association between tissue usage and TF occupancy conservation holds for distal TF OSs (Extended Data Fig. 7d, e). Furthermore, specifically examining distal TF OSs that overlapped with enhancers predicted by chromatin signals14 showed that broad tissue usage of presumptive enhancers tracks strongly with conservation of occupancy between mouse and human (Fig. 3b).
A prediction of our hypothesis is that occupancy-conserved TF OSs will tend to be active in multiple tissues. To test this prediction experimentally, we randomly chose ten occupancy-conserved GATA1 OSs. Even though OSs were chosen on the basis of the occupancy profile of an erythroid-specific regulatory factor, all ten conserved OSs overlapped with DHS peaks and predicted enhancers in many tissues, such as brain (Fig. 3c). When tested for in vivo enhancer activity in transgenic mouse reporter assays at embryonic day 11.5, nine of the ten showed strong, reproducible in vivo enhancer activity, and four were active in non-erythroid tissues such as midbrain and neural tube (Fig. 3c). We expanded our analysis to examine other mouse GATA1 OSs that overlapped with previously tested enhancers deposited in the VISTA Enhancer Browser (http://enhancer.lbl.gov)21. Six GATA1 OSs that are specific to mouse generated positive enhancer assays; only one (16%) showed expression in tissues other than blood vessels and heart. By contrast, among 12 additional occupancy-conserved GATA1 OSs with in vivo enhancer activity, 6 (50%) were active in non-erythroid tissues such as midbrain (Supplementary Table 5).
Conservation and divergence of TF co-association
Because precise gene regulation requires complex interactions among different TFs, we speculated that differences in conservation of TF occupancy may be related, at least in part, to different co-association partners. By calculating the occupancy signals for all the TFs in each TF OS, we found that, in general, occupancy-conserved TF OSs tend to be bound by more TFs compared to lineage-specific TF OSs (P < 2.2 × 10^−16, two-tailed t-test; Fig. 4a), suggesting that co-association with several TFs increases the level of purifying selection on the occupied sequences. Furthermore, by examining each co-associated TF pair (Fig. 4b), we determined whether the co-associations were more enriched in occupancy-conserved versus species-specific binding sites (Fig. 4c and Extended Data Fig. 9). The relationships fell into three categories. In the first category, co-association of TFs is not linked with occupancy conservation. For example, RAD21 is highly associated with CTCF in MEL cells; however, this co-association occurs with equivalent frequency at occupancy-conserved and species-specific binding sites. In the second category, TF co-association is negatively correlated with occupancy conservation. For example, the co-association of MYC OSs with EP300, an enhancer-associated factor22, is highly enriched in the mouse-specific binding sites. In the last category, TF co-association is positively correlated with occupancy conservation, as exemplified by the co-association of MYC OSs with the co-repressor SIN3A (ref. 23), suggesting that MYC-associated repressors tend to be conserved between mouse and human.
Occupancy conservation and functional SNVs
In a previous study, we assigned putative regulatory potential to genome variations by combining high-throughput experimental data sets, computational predictions, and manual annotation24. Interestingly, even though conservation was not considered during the previous classifications, we found that single nucleotide variants (SNVs) with high regulatory potential were highly enriched in occupancy-conserved TF OSs (Extended Data Table 1a). Moreover, examination of the distribution of genome-wide association study (GWAS) single nucleotide polymorphisms (SNPs) as a function of TF OS occupancy conservation revealed a significant enrichment of GWAS SNPs in occupancy-conserved TF OSs (P < 2.2 × 10^−16, Fisher’s exact test; see Supplementary Information) compared with the background distribution of all genetic variation in the SNP database (dbSNP). When examining individual phenotypes, we found that SNPs associated with several phenotypes such as type I diabetes are significantly enriched in occupancy-conserved TF OSs (P = 0.019, Fisher’s exact test; Extended Data Table 1b). However, SNPs associated with other phenotypes, such as pulmonary function, are highly human-specific (P = 0.027, Fisher’s exact test; Extended Data Table 1b). Thus, although GWAS SNPs are generally enriched in occupancy-conserved TF OSs, this enrichment is phenotype-specific.
Here we report that the conservation of TF occupancy associates with pleiotropic functions. This observation was further validated by in vivo enhancer assays in transgenic mice. To our knowledge, this is the first systematic investigation and validation of the relationship between pleiotropic TF OSs and their occupancy conservation. The pleiotropic functions of a regulatory module subject it to several constraints that preserve the underlying motifs and occupancy patterns. However, the roles in different tissues need not be carried out by the same TF. Paralogous proteins that bind to the same DNA motif (for example, GATA5 or GATA6) could be the active proteins in non-erythroid tissues at the GATA1 OSs with conserved occupancy and pleiotropic functions. This prediction can be tested in future studies.
Cell lines were used in this study because they provide an abundant source of almost identical cells, whereas obtaining primary cells in sufficient number for a study of this scale is problematic for many cell types. One concern is that cell lines across different species may not be entirely analogous. Although this possibility cannot be ruled out, when we compared the expression profile of the four cell lines with those of many other mouse tissues, we found that both MEL and K562, and also CH12 and GM12878, were the most similar pairs (Supplementary Fig. 2a). This close similarity was also seen for genome-wide histone modification signatures (Supplementary Fig. 2b). Thus, we conclude that the K562 and MEL pair of cell lines and the GM12878 and CH12 cell-line pair are sufficiently similar for meaningful cross-species comparisons. Another concern is that the trends observed in cell lines may not be representative of primary cells. Examination of binding of five TFs in mouse and human ES cells confirmed the preferential conservation of binding at promoters and the correlation of occupancy conservation with pleiotropy of DHSs (Extended Data Fig. 8). Thus, the principles gleaned from our examination of many TFs in cell lines are likely to hold for TFs in primary cells.
ChIP for TFs was carried out as previously described25. Cultured cells for biological replicates were grown in separate batches and at separate times. In brief, 5 × 10^7 cells were grown to a density of 0.6–0.8 × 10^6 per ml and then cross-linked in 1% formaldehyde for 10 min at room temperature. Nuclear lysates were sonicated using a Branson 250 Sonifier (power setting 7, 100% duty cycle for 12 × 20 s intervals), such that the chromatin fragments ranged from 50 to 2,000 bp. Information on control IgG and TF antibodies used for ChIP-seq experiments is listed in Supplementary Table 2. Protein–DNA–TF antibody complexes were captured on Protein A/G agarose beads (Millipore 16-156/16-266) and eluted in 1% SDS TE buffer at 65 °C. After cross-link reversal and DNA purification, the ChIP DNA sequencing libraries were prepared as described8. Libraries were sequenced on an Illumina Genome Analyzer II and HiSeq 2000.
Uniform ChIP-seq data processing pipeline
We used a uniform processing pipeline to identify high-confidence binding peaks in mouse and human.
Read mapping: for human ChIP-seq, mapped reads in the form of BAM files were downloaded from the ENCODE Data Coordination Center (DCC) at the University of California, Santa Cruz (UCSC) (http://encodeproject.org/ENCODE/downloads.html). For mouse ChIP-seq, reads were mapped by BWA26. To standardize the mapping protocol, we used custom mappability tracks to filter out multi-mapping reads and retain only uniquely mapping reads (reads that map to exactly one location in the genome). We also filtered all positional and PCR duplicates.
Quality control: several quality metrics for all replicate experiments of each data set were computed. In brief, these metrics measure ChIP enrichment, signal-to-noise ratios, sequencing depth, library complexity and reproducibility of peak calling8. ChIP-seq data sets that did not pass the minimum quality control thresholds were discarded and not used in any analyses.
Peak calling: all ChIP-seq experiments were scored against an appropriate control designated by the production groups (either input DNA or DNA obtained from a control immunoprecipitation). We used the SPP peak caller27 to identify and score (rank) potential occupancy sites/peaks. To obtain optimal thresholds, we used the irreproducible discovery rate (IDR) framework to determine high-confidence occupancy events by leveraging the reproducibility and rank consistency of peak identifications across replicate experiments of a data set. Code and detailed step-by-step instructions to call peaks using the IDR framework are available at: https://sites.google.com/site/anshulkundaje/projects/idr.
Blacklist: all peak sets were then screened against specially curated empirical blacklists for each species (A.P.B. and A.K., manuscript submitted).
In brief, these blacklist regions typically show the following characteristics: unstructured and extremely high signal in sequenced input DNA and control data sets as well as open chromatin data sets irrespective of cell type identity; an extreme ratio of multi-mapping to unique mapping reads from sequencing experiments; overlap with specific types of repeat regions such as centromeric, telomeric and satellite repeats that often have few unique mappable locations interspersed in repeats. The human blacklist is available at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz. The mouse blacklist can be downloaded from: http://www.broadinstitute.org/~anshul/projects/mouse/blacklist/mm9-blacklist.bed.gz. In this study, the blacklist-filtered IDR binding peaks for the same TF in the same cell line generated by different institutes were merged. All the raw read files, mapped files and peak files for mouse are deposited at http://mouseencode.org. The human data can be accessed at https://www.encodeproject.org. The accession ID for each experiment can be found in Supplementary Table 2.
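The blacklist screening and merging steps can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: intervals are BED-style half-open (chrom, start, end) tuples, a peak is discarded if it overlaps a blacklist region by at least 1 bp, and peaks from different institutes are merged by taking the union of overlapping or directly adjacent intervals. The function names are hypothetical.

```python
def filter_blacklisted(peaks, blacklist):
    """Drop peaks overlapping any blacklist interval (half-open coordinates)."""
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in peaks:
        regions = by_chrom.get(chrom, [])
        # Assumption: a single overlapping base is enough to discard a peak.
        if not any(start < b_end and b_start < end for b_start, b_end in regions):
            kept.append((chrom, start, end))
    return kept

def merge_peaks(peaks):
    """Union of overlapping or directly adjacent peaks (assumed merge rule)."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

In this sketch, filtering is applied before merging, which matches the order described above (blacklist-filtered peaks are merged).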
To compare mouse and human regulatory networks, we applied the de novo motif discovery approach that we developed previously28 and obtained a list of high-confidence sequence motifs using the ChIP-seq data sets. For each ChIP-seq data set, our computational pipeline reported up to five significant motifs. Typically, one of the motifs is the canonical motif of the TF, reflecting its DNA-binding specificity, and we call this the primary motif. If the TF does not have a DNA binding domain, we define the strongest motif as its primary motif. We call the remaining motifs secondary motifs. When the primary motifs of a pair of orthologous TFs are compared, they are either ‘conserved’ or ‘not conserved’ on the basis of whether the similarity between them passes the cut off (1.0 × 10^−5). Because a TF may have several secondary motifs, the secondary motifs of two orthologous TFs are ‘partly conserved’ if a subset, but not all, of the motifs are conserved. When neither the human TF nor the mouse TF has a secondary motif, we assign the situation as motif ‘not available’.
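The classification of secondary motifs can be expressed as a short sketch. It assumes that the similarity measure behaves like a P value (smaller means more similar, so a pair passes when its score is at or below the cut off), and that a case in which only one species has secondary motifs counts as 'not conserved'. Both are assumptions for illustration, and the function and argument names are hypothetical.

```python
CUTOFF = 1.0e-5  # cut off from the text; assumed to be a P-value-like score

def classify_secondary(human_motifs, mouse_motifs, similarity):
    """Classify the secondary motifs of an orthologous TF pair.

    similarity: dict mapping (human_motif, mouse_motif) -> score;
                pairs that are absent are treated as dissimilar.
    """
    if not human_motifs and not mouse_motifs:
        return "not available"

    def matched(h, m):
        return similarity.get((h, m), 1.0) <= CUTOFF

    # For each motif, check whether it has a partner in the other species.
    human_hit = [any(matched(h, m) for m in mouse_motifs) for h in human_motifs]
    mouse_hit = [any(matched(h, m) for h in human_motifs) for m in mouse_motifs]
    hits = human_hit + mouse_hit
    if all(hits):
        return "conserved"
    if any(hits):
        return "partly conserved"
    return "not conserved"
```

The same `matched` test on a single pair gives the 'conserved' or 'not conserved' call for primary motifs.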
ChromHMM29 was applied on the ChIP-seq data of five histone modifications to learn a multivariate HMM model for segmentation of mapped genome in each cell type. Specifically, the ChIP-seq mapped reads were first pooled from replicates for each of the five histone modifications (H3K4me3, H3K4me1, H3K36me3, H3K27ac and H3K27me3). These mapped reads were first processed by ChromHMM into binarized data in every 200-bp window over the entire mapped genome, with ChIP ‘input’ reads as the background control. To learn the model jointly from mouse and human, a pseudo genome table was first constructed by concatenating the mouse mm9 and human hg19 table, then the model was learned from the binarized data in all four cell lines, giving a single model with a common set of emission parameters and transition parameters, which was then used to produce segmentations in all cell types based on the most likely state assignment of the model. We tried models with up to 20 states and selected an eight-state-model as it appeared most parsimonious in the sense that all eight states had clearly distinct emission properties, while the interpretability of distinction between states in models with additional states was less clear.
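The binarization into 200-bp windows can be pictured with the simplified sketch below. ChromHMM performs this step internally; the code here is not its implementation. It assumes a Poisson background whose rate is the control read count in the same window (with a floor of 1), and a hypothetical significance threshold of 10^−4. All names are illustrative.

```python
import math

BIN_SIZE = 200  # window size stated in the text

def bin_counts(read_positions, chrom_length):
    """Count reads per 200-bp window from 0-based read start positions."""
    n_bins = math.ceil(chrom_length / BIN_SIZE)
    counts = [0] * n_bins
    for pos in read_positions:
        counts[pos // BIN_SIZE] += 1
    return counts

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def binarize(signal_counts, control_counts, p_threshold=1e-4):
    """Mark a window 1 when its signal is unlikely under the background.

    p_threshold and the floor on the background rate are assumptions.
    """
    marks = []
    for signal, control in zip(signal_counts, control_counts):
        lam = max(control, 1)  # avoid a zero-rate background
        marks.append(1 if poisson_sf(signal, lam) < p_threshold else 0)
    return marks
```

The resulting 0/1 vectors, one per histone modification, are the kind of input from which a multivariate HMM is learned.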
MeDIP-seq and MRE-seq
MeDIP-seq and MRE-seq experiments were performed as previously described16. The reads were aligned to hg19 and mm9 using BWA. MRE-seq reads were further normalized for difference in enzyme efficiency.
Defining different genomic locations
TSSs were defined by the ENCODE consortium15. Promoter regions were defined as the regions within 2 kb upstream or downstream of a TSS. Distal regions were defined as regions more than 10 kb away from a TSS. The remaining genomic regions were defined as middle regions. The three genomic locations are mutually exclusive, and the order of priority in the definition is promoter, distal and middle. Each TF OS was assigned to one (and only one) genomic location. If a TF OS overlapped several regions, the centre of the OS was used to decide the assignment.
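The assignment rule can be sketched as a function of the distance from the OS centre to the nearest TSS. The treatment of the exact boundaries (a distance of exactly 2 kb counted as promoter, exactly 10 kb counted as middle) is an assumption, as are the names used here.

```python
import bisect

PROMOTER_MAX = 2_000   # within 2 kb of a TSS
DISTAL_MIN = 10_000    # more than 10 kb from the nearest TSS

def assign_location(os_start, os_end, tss_positions):
    """Assign one OS to 'promoter', 'middle' or 'distal'.

    tss_positions: non-empty sorted list of TSS coordinates on the
                   same chromosome as the OS.
    """
    centre = (os_start + os_end) // 2
    # Only the TSSs immediately left and right of the centre can be nearest.
    i = bisect.bisect_left(tss_positions, centre)
    neighbours = tss_positions[max(i - 1, 0): i + 1]
    distance = min(abs(centre - tss) for tss in neighbours)
    if distance <= PROMOTER_MAX:
        return "promoter"
    if distance > DISTAL_MIN:
        return "distal"
    return "middle"
```

Using the centre of the OS guarantees that each OS receives exactly one label, even when it spans a class boundary.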
TF OS sequence conservation
phyloP30 wiggle tracks were downloaded from the UCSC browser. Specifically, the hg19 phyloP46way track was used for human and the mm9 phyloP30way track was used for mouse. The average phyloP score was calculated at one-base-pair resolution in 200-bp regions centred on the summits of TF peaks.
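A minimal sketch of the averaging step is given below. It assumes that per-base phyloP scores for one chromosome have already been loaded into a dictionary keyed by position, and that bases without a score are skipped; both the data structure and the handling of missing bases are assumptions.

```python
def mean_phylop(scores, summit, window=200):
    """Average phyloP score in a window centred on a peak summit.

    scores: dict mapping base position -> phyloP score (assumed input format).
    Bases with no score are skipped; returns NaN if none are scored.
    """
    half = window // 2
    values = [scores[p] for p in range(summit - half, summit + half) if p in scores]
    return sum(values) / len(values) if values else float("nan")
```

For whole-genome tracks, an array or an indexed file reader would replace the dictionary, but the calculation is the same.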
Mapping reciprocal orthologous sequences between human and mouse
Orthologous DNA sequences between human and mouse were mapped by bnMapper (O. Denas, R. Sandstrom and J. Taylor, manuscript submitted) using reciprocal chain with default setting (bnMapper.py -f BED12).
RegulomeDB SNV and occupancy conservation
SNPs assigned with pre-calculated regulatory potentials were downloaded from: http://www.regulomedb.org/downloads. dbSNP138 was downloaded from the UCSC genome browser. TF OSs were divided into two exclusive groups: occupancy-conserved and human-specific. The number of SNPs with high regulatory potential and the number of dbSNPs located in each group of TF OSs were calculated. Fisher’s exact test was conducted to examine the enrichment of SNPs with high regulatory potential in each group.
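The enrichment test operates on a 2 × 2 table of SNP counts. As a sketch, assume that rows are SNP classes (high regulatory potential versus background dbSNP) and columns are the two OS groups (occupancy-conserved versus human-specific); this layout is an assumption. The function below computes a two-sided Fisher's exact P value from the hypergeometric distribution using only the standard library.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables that are no more probable than the observed one.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)
```

For counts on the scale of dbSNP, an optimized routine such as `scipy.stats.fisher_exact` would be the practical choice; the version above is meant only to make the calculation explicit.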
GWAS SNPs and occupancy conservation
The GWAS catalogue file was downloaded from: http://www.genome.gov/admin/gwascatalog.txt. Lead SNPs that overlapped with exons were removed. For each lead SNP, if either the SNP itself or a SNP in linkage disequilibrium with it was located within a given TF OS, it was assigned to that TF OS. Lead SNPs that could be assigned to several TF OSs were also removed. Two-sided Fisher’s exact tests were conducted to calculate the enrichment of conservation in each given phenotype compared with the distribution of all dbSNPs, and P values were further adjusted by the Benjamini–Hochberg procedure.
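The Benjamini–Hochberg adjustment applied to the per-phenotype P values can be written compactly. The sketch below is a standard formulation of the procedure, with a hypothetical function name; it returns adjusted P values in the order of the input.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that adjusted
    # values never increase as the raw P value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A phenotype would then be reported as significant when its adjusted P value falls below the chosen false discovery rate.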
This work is funded by grants 3RC2HG005602, 5U54HG006996 and 1U54HG00699 (M.P.S.), and R01DK065806 and RC2HG005573 (R.C.H.). A.V. and L.A.P. were supported by National Human Genome Research Institute (NHGRI) grant R01HG003988, U54HG006997 and supplementary funds provided by the American Recovery and Reinvestment Act. The in vivo enhancer activity assays were conducted at the E.O. Lawrence Berkeley National Laboratory and performed under Department of Energy Contract DE-AC02-05CH11231, University of California. We acknowledge R. M. Myers for providing access to ChIP-seq data in human embryonic cells. Illumina sequencing services were performed by the Stanford Center for Genomics and Personalized Medicine.