Main

Gene regulation is classically partitioned into cis- and trans-acting compartments, which are in turn integrated to form a regulatory network. The cis compartment comprises DNA elements that encode TF recognition sites, while the trans compartment encompasses hundreds of TF genes and their DNA recognition repertoires. The cross-regulation of TF genes by one another creates a regulatory network that facilitates complex information processing and potentiates robustness at the cellular and higher levels1.

In metazoan genomes, actuatable TF recognition sites are clustered into compact (100–300 bp) regulatory DNA regions that give rise to DNase I hypersensitive sites (DHSs) upon TF occupancy in place of a canonical nucleosome2. Mice and humans diverged approximately 90 million years ago3, and an extensive survey of mouse DHSs indicates that the cis-regulatory DNA compartment has evolved markedly since the last common ancestor4, generalizing and extending observations from selected TFs assayed by ChIP-seq in one or a few tissues5,6. However, given the limited experimental resolution of previous studies, it is currently unknown how dynamic individual in vivo TF recognition sites are within broader regulatory regions, or more generally how cis-regulatory dynamics relate to the conservation of the higher-level cellular and physiological features that define mammals. Earlier studies of individual regulatory elements in Drosophila7 and zebrafish8 indicate a potential for functional conservation without sequence conservation, and the maintenance of regulatory activity with different phenotypic outcomes. However, the generality of these observations and their broader relevance for mammalian evolution are unclear.

Genomic DNase I footprinting enables systematic delineation of TF–DNA interactions at nucleotide resolution and on a global scale9,10,11, permitting: (1) the simultaneous interrogation of hundreds of DNA-binding TFs expressed in a given cell type in a single experiment; (2) de novo derivation of the cis-regulatory lexicon of an organism; and (3) systematic mapping of TF-to-TF cross-regulatory networks1,10.

To delineate an expansive set of specific mouse genomic sequence elements contacted by TFs in vivo, we performed genomic DNase I footprinting on 25 diverse mouse cell and tissue types (Extended Data Table 1). From an average of 323 million uniquely mapped DNase I cleavages per cell type, we identified an average of 1 million high-confidence (false discovery rate (FDR) 1%; refs 10, 11) DNase I footprints (6 to 40 base pairs (bp)), and a total of 8.6 million differentially occupied footprints (Fig. 1a and Extended Data Fig. 1a). DNase I footprints were highly reproducible (Extended Data Fig. 1b) and robust to intrinsic DNase I cleavage propensities (Extended Data Fig. 2a).

Figure 1: Footprinting the mouse genome and comparison with human footprints.

a, Derivation of 8.6 million differentially occupied DNase I footprints from 25 mouse cell and tissue types. b, Per-nucleotide DNase I cleavage across three gene promoters in both mouse and human cell types; shared TF occupancy sites are indicated by faded boxes. c, Percentage of mouse DNase I footprints with sequence aligning to the human genome but not occupied in any human cell type (grey) versus aligning footprints that are occupied in one or more human cell type (red).

Evolutionary turnover of TF footprints

To study the evolution of TF occupancy patterns between mouse and human, we compared mouse DNase I footprint maps with those from 41 diverse human cell types10,12 by using bi-directional pairwise alignments of the mouse and human genomes4 to resolve mouse DNase I footprints to the human genome (Fig. 1b). In total, 65% of mouse TF footprint sequences could be localized within the human genome, comparable to the cross-alignment rate of entire 150-bp DHSs4 (Fig. 1c). However, whereas 35% of mouse DHSs have human orthologues that are also DNase I hypersensitive in at least one human cell type4, only 22% of mouse TF footprints have human sequence orthologues that are occupied in any of the human cell types assayed (Fig. 1c). This indicates that the individual DNA elements within DHSs that are directly contacted by TFs in vivo have undergone massive turnover since the last common ancestor of mouse and human.

Conservation of TF recognition lexicon

Although most mouse TFs have human orthologues, the collective consequences of divergence in DNA binding domains and lineage-specific expansion of certain TF families (for example, KRAB zinc fingers) for the genomic occupancy landscape are unknown. We thus next explored the evolutionary stability of the mammalian TF recognition repertoire encompassed within mouse and human TF footprints. At directly occupied recognition sites for a given TF, footprinting data closely recapitulate TF ChIP-seq10,11 (Extended Data Fig. 3), and average per-nucleotide DNase I cleavage profiles mirror the morphology of the DNA–protein binding interface10,11,13. Examination of cleavage profiles at occupied sites for diverse TFs showed these to be nearly identical between mouse and human cell types (Fig. 2a and Extended Data Fig. 2b), suggesting that in vivo DNA recognition preferences for many TFs have experienced little change between mouse and human.

Figure 2: Mouse TF footprints define a conserved cis-regulatory lexicon.

a, Average per-nucleotide DNase I cleavage at occupied TF recognition sites within mouse and human DHSs. b, Of 604 motif models derived de novo from mouse footprints, 355 match curated databases. c, Comparison of 249 novel mouse motif models with models derived from human footprints. d, DNase I footprinting pattern at a novel mouse-selective motif instance. e, Preferential occupancy of 16 out of 22 mouse-selective motifs (red); occupancy of pluripotency-related TFs is shown in blue. f, Average human nucleotide diversity (π) in different classes of human DNase I footprints partitioned by matches to mouse-derived motifs (mean ± 95% confidence interval (CI); bootstrap resampling). NS, not significant.

To investigate comprehensively the divergence of mouse and human TF recognition repertoires, we performed de novo motif discovery on the 8.6 million mouse TF footprints. In total, we defined 604 unique motif models collectively accounting for the large majority of footprints (Fig. 2b), of which 355 models (59%) matched those within motif databases and 249 were novel (Extended Data Fig. 4a). Comparison of known and novel mouse-derived motif models to motif models derived de novo from 8.4 million human DNase I footprints10 revealed that >94% of the collective TF lexicon is conserved between mouse and humans (Fig. 2c). The human lineage has witnessed expansion of certain TF gene families, notably zinc finger TFs14; our results indicate that the proportion of genomic DNA elements bound by lineage-specific TFs in vivo is comparatively small. The fact that TF footprints in mouse and human contain highly similar effective in vivo recognition sequence repertoires indicates that regulatory divergence between mouse and humans has occurred chiefly at the level of individual TF-binding cis-regulatory elements.

A total of 22 novel motif models were selective for the mouse lineage and 14 were selective for the human lineage (Fig. 2c). The 22 novel mouse-selective motifs are found chiefly in distal elements (Extended Data Fig. 4b), where they populate 2% of DNase I footprints and show cell/tissue-specific occupancy, predominantly for mouse ES cells (Fig. 2d, e). This suggests that the TFs recognizing these elements may have important roles in very early development, when humans and rodents show more differences than at later stages15, and further highlights the role of distal gene regulation in species divergence16. Notably, whereas sequence matches to the 14 human-selective models in human DNase I footprints showed evidence of strong human-specific evolutionary constraint10,17 (Fig. 2f), nucleotide diversity at sequence matches to the 22 mouse-selective models in human DNase I footprints is compatible with significantly reduced human-specific evolutionary constraint (P < 0.05) (Fig. 2f), consistent with a loss of TF occupancy (and selective pressure) due to divergence (or loss) of the cognate factor within the human lineage.

Conservation of TF-to-TF connections

We next sought to characterize the core mouse TF regulatory network, and to compare its features with the human TF network. Genomic footprinting provides a direct and empirical approach for mapping the core TF regulatory network of an organism, which comprises cross-regulatory interactions (network edges) between TF genes (network nodes)1. Footprint-anchored TF regulatory networks precisely recapitulate well-validated TF-to-TF regulatory connections1,18, and are agnostic to whether any given TF-to-TF regulatory interaction is positive (activating) or negative (repressive), as these may vary conditionally even for a given TF. Following the approach of ref. 1, we mapped mouse TF-to-TF networks connecting the 586 mouse TF genes with known recognition sequences (Supplementary Information) within each of the 25 cell/tissue types (Fig. 3a). This disclosed an average of 22,970 unique TF-to-TF edges per cell type, totalling 77,084 non-redundant edges across all 25 cell types. Differences between cell types derived both from the cell-selective usage of TFs and from the cell-selective occupancy patterns of these TFs. For example, the neuronal developmental regulator OTX2 is selective for neuronal tissue, but its connectivity/occupancy patterns differ between distinct neuronal cell/tissue types (Fig. 3b).

Figure 3: Evolutionary dynamics of cis-regulatory logic.

a, Schematic for construction of cell-type regulatory networks using TF footprints: TF genes = network nodes; occupied TF motifs = directed network edges. b, TF genes regulated by OTX2 in fetal brain and retina networks. Symbols indicate known roles of target genes in brain versus retina development. c, Clustering of cell/tissue TF regulatory networks using Jaccard distances between regulatory networks. Cell/tissue types are coloured using physioanatomical and/or functional properties. d, Heat map showing network similarity (Jaccard index) between human and mouse cell-type regulatory networks. e, Pairwise similarities (Jaccard index) between the regulatory networks of all human and mouse cell/tissue types.

Mouse TF regulatory networks from functionally similar cell and tissue types are coherently organized into anatomical and functional groups (Fig. 3c), analogous to results from human TF regulatory networks1. However, although the similarity (pairwise Jaccard indices) between all mouse and human networks was mostly maximal between orthologous mouse–human cell and tissue pairs (Fig. 3d, e), network differences within each species were smaller than differences between species (Fig. 3e).
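The network comparisons above rest on the Jaccard index between edge sets. The following is a minimal sketch of that calculation, assuming each network is represented as a set of directed (regulator, target) pairs; the function name and the toy edges are hypothetical and are not taken from the study's data.

```python
def jaccard_index(edges_a, edges_b):
    """Jaccard index of two directed edge sets: |A & B| / |A | B|."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty networks
    return len(a & b) / len(union)


# Toy networks (hypothetical edges); each edge is (regulator TF, target TF).
mouse_net = {("OTX2", "SOX2"), ("SP1", "OTX2"), ("CTCF", "SP1")}
human_net = {("OTX2", "SOX2"), ("SP1", "OTX2"), ("NRF1", "SP1")}

similarity = jaccard_index(mouse_net, human_net)  # 2 shared edges out of 4 distinct edges
distance = 1.0 - similarity                       # Jaccard distance, as used for clustering
```

Because edges are directed, (A, B) and (B, A) count as different edges in this representation.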

We next asked to what extent specific mouse TF-to-TF regulatory connections were conserved in human. We first identified TF-to-TF connections that were mouse-specific, human-specific or shared across both orthologous human and mouse cell types (Fig. 4a and Extended Data Table 2). We then differentiated shared regulatory edges (that is, present in both a mouse cell type and its human orthologue) arising from TF occupancy of an orthologous binding element from those shared edges arising from occupancy of non-orthologous sequence within regulatory DNA of the orthologous target gene (Fig. 4a). In the former case, both sequence and circuitry are conserved; in the latter, circuitry only. Overall, 44% of the TF-to-TF regulatory connections are conserved between orthologous mouse and human cell types (P < 0.001) (Fig. 4b). However, >40% of these connections represent edges created by TF binding to a novel sequence element arising since mouse–human divergence (Fig. 4b). As such, conservation of functional regulatory circuitry is considerably greater than indicated by sequence conservation alone.
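The four categories of regulatory connection described above can be illustrated with simple set operations. This is a sketch only: it assumes that the edges of a mouse cell type and its human counterpart are given as sets of (regulator, target) pairs, and that a separate, hypothetical set `orthologous_site_edges` lists the shared edges supported by an orthologous binding element.

```python
def classify_edges(mouse_edges, human_edges, orthologous_site_edges):
    """Partition TF-to-TF edges into four categories for one orthologous cell-type pair."""
    mouse_edges, human_edges = set(mouse_edges), set(human_edges)
    shared = mouse_edges & human_edges
    # Shared edges supported by an orthologous element: sequence and circuitry conserved.
    conserved_site = shared & set(orthologous_site_edges)
    return {
        "mouse_specific": mouse_edges - human_edges,
        "human_specific": human_edges - mouse_edges,
        "shared_orthologous_site": conserved_site,
        # Shared edges without an orthologous element: circuitry conserved only.
        "shared_non_orthologous_site": shared - conserved_site,
    }


# Toy example with hypothetical edges.
categories = classify_edges(
    mouse_edges={("A", "B"), ("A", "C"), ("B", "C")},
    human_edges={("A", "B"), ("A", "C"), ("C", "A")},
    orthologous_site_edges={("A", "B")},
)
```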

Figure 4: Conservation of TF-to-TF regulatory circuitry.

a, Four categories of regulatory interactions identified by comparative analysis of mouse and human TF networks. Functionally conserved connections can be mediated by TF occupancy at orthologous (red) or non-orthologous (blue) binding sites. b, Categorization and overall conservation of TF-to-TF connections between orthologous mouse and human cell types. On average 44% of TF-to-TF edges are conserved (P < 0.001; empirically calculated using shuffled networks).

Comparative TF network architecture

We next compared the overall architecture of mouse and human TF networks. The architecture of complex networks can be analysed in terms of simple regulatory circuit ‘building blocks’ termed network motifs, such as the feed-forward loop (FFL)19. In human, despite the general selectivity of specific TF-to-TF edges for specific cell types, the pattern of utilization of three-node network motifs within each individual cell type network is nearly identical1. Computing network motif utilization within each of the 25 mouse TF networks also revealed uniform patterns across mouse cell/tissue type regulatory networks (Extended Data Fig. 5a). Strikingly, these patterns are nearly identical to those in human, indicating that mouse and human TF networks utilize virtually the same architecture (Fig. 5a and Extended Data Fig. 5).
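To make the notion of a three-node network motif concrete, the sketch below enumerates feed-forward loops in a small directed network. It assumes an FFL is defined as X→Y, X→Z and Y→Z with no other edges among the three nodes, and it labels X, Y and Z as driver, first passenger and second passenger, which is an assumed mapping onto the position names used in Fig. 5. The enrichment calculation against randomized networks is not shown, and the brute-force enumeration is only suitable for toy inputs.

```python
from itertools import permutations


def find_ffls(edges):
    """Return (driver, first_passenger, second_passenger) triples that form an FFL.

    An FFL here is X->Y, X->Z, Y->Z with no other edges among the three nodes.
    """
    edges = {(a, b) for a, b in edges if a != b}  # ignore self-loops
    nodes = sorted({n for edge in edges for n in edge})
    ffls = []
    for x, y, z in permutations(nodes, 3):
        required = {(x, y), (x, z), (y, z)}
        forbidden = {(y, x), (z, x), (z, y)}
        if required <= edges and not (forbidden & edges):
            ffls.append((x, y, z))
    return ffls


# Toy network (hypothetical edges).
toy_edges = {("CTCF", "SP1"), ("CTCF", "SOX2"), ("SP1", "SOX2"), ("SOX2", "NRF1")}
toy_ffls = find_ffls(toy_edges)
```

Each qualifying set of three nodes is reported once, because only one ordering of the nodes satisfies the required and forbidden edge pattern.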

Figure 5: Conserved organizing principles of mammalian TF regulatory networks.

a, Enrichment of three-node circuits in each mouse (red lines) and human (black lines) TF regulatory network (expanded in Extended Data Fig. 5). b, Left: frequency with which individual three-node circuits are identically maintained between the mouse and human Treg network. Middle: percentage of specific three-node circuits identically maintained between the mouse and human Treg network. Right: enrichment of three-node circuits in a network constructed using edges present in both mouse and human Treg networks. c, d, Frequency with which TFs from six functional classes occupy different positions (driver, first passenger, second passenger) within FFL (c) or RM (d) circuits in different mouse and human cell-type networks (hfBrain and hfHeart refer to human fetal brain and heart, respectively).

To analyse evolutionary conservation at the level of individual regulatory circuits, we identified all instances of each three-node network motif within each mouse cell type, extracted the constituent TFs, and computed how the same TFs were connected in orthologous human cell types. Despite the conservation of overall network architecture between mouse and humans, this analysis revealed that the specific combinations of TFs comprising individual regulatory circuits have undergone substantial remodelling between mouse and human (Fig. 5b and Extended Data Fig. 6). Overall, 39% of combinations of three TFs found within one or more three-node circuits in a given mouse cell type were also organized into at least one type of three-node circuit in an orthologous human cell type (Extended Data Fig. 6b). For example, >25% of three-TF combinations organized into ‘regulating mutual’ circuits were conserved between orthologous mouse and human cell types, whereas only 8% of three-TF combinations that form ‘mutual-and-three-chain’ circuits show such conservation. By contrast, 12% of three-TF combinations that form ‘mutual-and-three-chain’ circuits lose one cross-regulatory interaction, transforming them into FFL circuits in orthologous human cell types (Fig. 5b and Extended Data Fig. 6c). Collectively, TF circuits conserved between mouse and human were enriched in four major network motif types: (1) the FFL motif; (2) the ‘regulated mutual’ motif; (3) the ‘regulating mutual’ (RM) motif; and (4) the ‘clique’ motif (Fig. 5b and Extended Data Fig. 6c). As such, these circuits appear to comprise the most vital building blocks of mammalian TF regulatory architectures.

Conserved TF positions within networks

We next asked to what degree the position of a specific TF within a given network motif circuit was conserved between mouse and human. To analyse this, we focused on FFL and RM circuits, as these are both strongly conserved overall and have a clear top-down hierarchical organization (Fig. 5a, b). Computation of the propensity for each TF (of 586) to occupy each of the nodes within these network motifs revealed that the preferred position of a given TF within FFL and RM circuits is strongly conserved between orthologous human and mouse cell types (Fig. 5c, d). It also revealed conserved preferential positioning of entire classes of TFs within particular network motif positions. For example, TFs with ubiquitous cellular functions such as CTCF, SP1 and NRF1 systematically localize within the driver positions of FFL and RM circuits (Fig. 5c, d), while TFs involved in cell lineage fate decisions (for example, SOX2, NFE2 and FOXP3) preferentially localize within the final passenger positions (Fig. 5c, d and Extended Data Fig. 7a, b). We also found the passenger edges of FFL and RM motifs to be significantly more cell-selective than the driver edges (Extended Data Fig. 7c, d). These findings raise the possibility that one of the major functions of conserved mammalian network motifs may be to stabilize the expression of TFs that drive cell-type-specific regulatory programs via exploitation of stable cell-ubiquitous regulatory interactions.

A conserved developmental program

To explore how the TF regulatory network interacts with downstream non-TF structural/effector genes and to test for conserved interactions, we first quantified, for each TF, whether it preferentially regulates another TF gene(s) or a non-TF ‘structural’ gene(s) across different mouse and human cell types (Extended Data Fig. 8a). This parameter varied widely between different TFs; in general, TFs involved in development state specification such as HOXB1, OCT4 and SOX2 preferentially regulated other TF genes, while general transcriptional regulators such as NRF1, CTCF and SP1 preferentially regulated non-TF genes (Extended Data Fig. 8b, c). To test how these preferences varied by cell type, we averaged TF gene versus structural gene propensities for all TFs within each cell-type regulatory network. This revealed that the TF networks of pluripotent and early developmental cell types and tissues such as ES cells and fetal brain were globally significantly more oriented towards regulation of TF genes compared with the TF networks of more highly differentiated cell types (for example, B cells, T cells) and tissues (for example, adult brain) (Extended Data Fig. 8d). These TF versus structural gene preferences—both at the individual TF level and at the cell-type regulatory network level—were strongly conserved between mouse and human (Extended Data Fig. 8d, e). The above findings suggest the operation of a conserved global developmental regulatory program that directs a shift in the orientation of TF regulatory networks from TF genes to structural genes during the transition from primitive to definitive cells.

Taken together, our results expose several major organizing principles of mammalian gene regulation, and a fundamental hierarchy in the modes of evolutionary transmission of regulatory information, ranging from poor conservation of cis-acting sequence elements to the preservation of trans-acting and network-level regulatory features (Fig. 6). Conservation of trans-acting components is reflected both in the effective in vivo recognition repertoires of human and mouse TFs, which differ only slightly, and in the conserved patterns of TF-to-gene interactions. The dichotomy between cis- and trans-acting regulatory components is most apparent in the context of the core TF regulatory network. Whereas the individual DNA bases contacted by TFs in vivo have undergone extensive turnover since the last common ancestor of mouse and human, the repertoire of TFs regulating other TF genes is vastly more conserved. Notably, this cis-acting versus trans-acting disparity in mammals greatly eclipses that previously described for different Drosophila species20.

Figure 6: Hierarchy of evolutionary constraint on cis- versus trans-regulatory features.

Shown are: overall proportion of conserved DNA bases between mouse and human3; proportion of orthologous TF footprints (from data shown in Fig. 1c); average proportion of individual conserved TF-to-TF regulatory connections across orthologous mouse and human cell types (from data shown in Fig. 4); and similarity in overall TF regulatory network architecture (from data shown in Figs 2 and 5).

At the TF network level, organization of the regulatory circuitry in both mouse and human cell types appears to be governed by common principles that result in highly similar network architectures (Fig. 6). Conserved shifts in TF network orientation during the transition from primitive to definitive cells in both organisms suggest that the mammalian regulatory network architecture has converged around a central goal of guiding cell identity during development.

Collectively, our results indicate that evolutionary selection on gene regulation is targeted chiefly at the level of regulatory networks, and explain how essential features of the mammalian body plan and physiology have been maintained in the face of massive turnover of the cis-regulatory landscape.

Methods

Definition of DNase I footprint

Following the original description of ref. 21, DNase I footprints signify short polynucleotide segments over which the cleavage pattern induced by DNase I is attenuated by the presence of a ‘binding protein on the DNA sequence’. This concept was subsequently generalized to cover altered cleavage patterns, including both attenuation of cleavage and potentiation of cleavage due to the alteration in the minor groove resulting from TF–DNA engagement22. It is critical to recognize that DNase I footprints represent TF occupancy at specific positions along the genome. Recently, several publications have mistakenly confounded individual DNase I footprints with aggregated DNase I cleavage profiles for a given TF motif23,25. Aggregated DNase I cleavage plots were introduced by ref. 9 to visualize and summarize averaged per-nucleotide DNase I cleavage patterns across hundreds to thousands of instances of a given TF recognition sequence (typically within DHSs) genome-wide9,10. Because they encompass both occupied and unoccupied motifs, the morphology of the averaged profile depends greatly on the proportion of occupied elements. In the case of TFs with few high-affinity, highly occupied sites, such as the glucocorticoid receptor, aggregated cleavage profiles will dominantly reflect the unoccupied elements, and thus converge on intrinsic DNase I cleavage biases, which have now been well defined24. Failure to acknowledge this feature of the data has led to erroneous statements concerning DNase I footprinting of low-occupancy TFs, and to restating of previously published conclusions10,21.

Genomic footprinting

A description of each cell and tissue type used in this study can be found in Extended Data Table 1 and at https://genome.ucsc.edu/encode/dataSummaryMouse.html. IACUC approval for all mouse samples was obtained from the Fred Hutchinson Cancer Research Center. Mouse cell and tissue types were subjected to DNase I digestion and high-throughput sequencing, following previous methods26. Sequence tags of 36 bp were aligned to the reference genome, build NCBI37/mm9, using Bowtie, version 0.12.7, with parameters --mm -n 3 -v 3 -k 2 and --phred33-quals. DNase I footprint discovery and false discovery rate estimation (software available at https://github.com/StamLab/footprinting2012) were performed as previously described10 using 36-mer sequencing reads and unique mappability information for mouse, build NCBI37/mm9 (available at http://www.uwencode.org/proj/hotspot/). For clarity, we note that the footprint detection algorithm we employed differs substantially from (and greatly outperforms) an early algorithm9. A recently published modification of the algorithm of ref. 10 termed Wellington incorporates stranded cleavage information and specifically identifies high occupancy sites, although at the expense of greatly reduced sensitivity27. Of note, another recently published DNase I footprint detection algorithm25 was reported to have compared itself against the algorithm of ref. 10, but in fact compared itself against an ad hoc concoction of the ref. 9, ref. 10 and ref. 28 algorithms.

The number and proportion of all DNase I cleavages that fell within DNase I hotspot regions were calculated as previously described26 (Extended Data Table 1). To identify the total cohort of DNA elements contained within mouse FDR 1% DNase I footprints we first computed the multi-set union of all footprints across all cell types using BEDOPS29. For each element of the union, we then collected all significantly overlapping footprints, which were defined as those footprints with 65% or more of their bases in common with the element (bedmap --fraction-map 0.65). A footprint’s genomic coordinates were redefined to the minimum and maximum coordinates from its overlap set (bedmap --echo-map-range), which always included the footprint itself. All redefined footprints from the union then passed through a subsumption and uniqueness filter: when a footprint was genomically contained within another, the filter discarded the smaller of the two or selected just one footprint if identical. Footprints passing through the filter comprised the final set of 8.6 million combined footprints across all cell types. Unlike footprints from any single cell type, the combined set included overlapping footprints. We further computed the number of cell types from which each of these 8.6 million combined footprints was derived. To assess the reproducibility of a DNase I footprint, we calculated for every sample the proportion of DNase I footprints that were independently discovered in one or more other samples from the same species using an overlap criterion of 25% (bedmap --fraction-either 0.25).
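The combination steps described above (pooling, 65% overlap collection, coordinate redefinition, and the subsumption and uniqueness filter) can be sketched in plain Python. This is an illustrative re-implementation for toy inputs and is not the BEDOPS pipeline used in the study; the function name and the quadratic-time loops are choices made here for clarity, and intervals are assumed to be half-open (chrom, start, end) tuples.

```python
def combine_footprints(footprints, min_fraction=0.65):
    """Combine footprints pooled over cell types into one non-redundant set.

    footprints: list of (chrom, start, end) half-open intervals; duplicates allowed.
    """
    redefined = set()  # a set keeps one copy of identical intervals (uniqueness)
    for chrom, start, end in footprints:
        lo, hi = start, end
        for chrom2, start2, end2 in footprints:
            if chrom2 != chrom:
                continue
            overlap = min(end, end2) - max(start, start2)
            # Keep footprints having >= min_fraction of their own bases in the element.
            if overlap > 0 and overlap / (end2 - start2) >= min_fraction:
                lo, hi = min(lo, start2), max(hi, end2)
        redefined.add((chrom, lo, hi))
    # Subsumption filter: drop any interval contained within a different interval.
    kept = [
        f for f in redefined
        if not any(
            g != f and g[0] == f[0] and g[1] <= f[1] and f[2] <= g[2]
            for g in redefined
        )
    ]
    return sorted(kept)


pooled = [("chr1", 100, 120), ("chr1", 105, 122), ("chr1", 300, 310), ("chr1", 300, 310)]
combined = combine_footprints(pooled)
```

In the toy input, the first two footprints each have more than 65% of their bases in common with the other, so both are redefined to the same range and collapse into one entry.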

Accounting for intrinsic DNase I cleavage preferences

Differences in the rate of DNase I cleavage of phosphate bonds between different flanking base combinations were originally discussed by ref. 21, and have more recently been exhaustively quantified by ref. 24, who performed deep sequencing of DNase I-digested naked DNA from yeast and from human fetal lung fibroblast cells (IMR90). For each nucleotide $j$ within a genomic window $[i,l]$, the normalized expected cleavage rate is $\hat{a}_j = a_j / \sum_{k=i}^{l} a_k$. We define $a_k$ as the relative cleavage bias of the 6-mer spanning the positions $[k-3, k+2]$ as described in ref. 24. We redistributed the total observed cleavages ($N = \sum_{k=i}^{l} n_k$) in a window $[i,l]$ such that the observed and expected counts for each base $j$ are $n_j$ and $\hat{n}_j = N\hat{a}_j$, respectively. The per-nucleotide deviation from intrinsic sequence specificity was defined by comparing $n_j$ with $\hat{n}_j$. The sequence bias normalization was computed separately for each strand and then recombined for visualization purposes.
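As an illustration of the redistribution step, the sketch below computes expected per-base cleavage counts in a window. It assumes that the expected count at each base is the window's total observed count scaled by that base's share of the summed 6-mer bias; the function names are hypothetical, and the bias values in the example are made up rather than taken from ref. 24.

```python
def hexamer_at(sequence, k):
    """Return the 6-mer spanning positions [k - 3, k + 2] (0-based, inclusive)."""
    if k - 3 < 0 or k + 3 > len(sequence):
        raise ValueError("position too close to the end of the sequence")
    return sequence[k - 3:k + 3]


def expected_cleavage_counts(observed, bias):
    """Redistribute the total observed cleavages in a window according to bias.

    observed: per-base observed counts n_j in the window.
    bias: per-base relative cleavage bias a_j of the 6-mer around each base.
    """
    if len(observed) != len(bias):
        raise ValueError("observed and bias must have the same length")
    total_observed = sum(observed)
    total_bias = sum(bias)
    return [total_observed * a / total_bias for a in bias]


# Toy window with made-up bias values.
observed_counts = [4, 0, 2, 2]
relative_bias = [1.0, 1.0, 2.0, 4.0]
expected_counts = expected_cleavage_counts(observed_counts, relative_bias)
```

By construction, the expected counts sum to the same total as the observed counts, so any comparison between the two reflects how cleavage is distributed within the window rather than how deeply it was sequenced.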

Using deeply mapped DNase I cleavage preferences24, we analysed each FDR 1% footprint in all mouse and human cell/tissue types and counted the total number of mapped tags falling in each footprint and the left and right flanking regions. We then randomly assigned the same number of simulated tags to positions within these regions, using probabilities proportional to the DNase I cut-rate bias model for the sequence context surrounding each position. A new footprint-occupancy score (FOS) was calculated over the same L, C and R regions as before10 and compared to the FOS value of the original footprint. Footprints that showed smaller FOS values using the DNase I cut-rate bias model were considered potential false-positive footprints.
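The simulation test described above can be sketched as follows. The sketch assumes that the footprint-occupancy score takes the form FOS = (C + 1)/L + (C + 1)/R, where C, L and R are the mean tag counts per base in the central and flanking regions, which is our reading of the definition in ref. 10 and should be checked against that reference. The function names, the region representation and the fixed random seed are hypothetical choices made for illustration.

```python
import random


def fos(counts, left, center, right):
    """Footprint-occupancy score; regions are half-open (start, end) index pairs."""
    def mean(region):
        start, end = region
        return sum(counts[start:end]) / (end - start)

    l, c, r = mean(left), mean(center), mean(right)
    if l == 0 or r == 0:
        return float("inf")  # no flanking signal: treat as no footprint
    return (c + 1) / l + (c + 1) / r


def simulate_counts(n_tags, bias, rng):
    """Place n_tags at positions drawn with probability proportional to bias."""
    counts = [0] * len(bias)
    for position in rng.choices(range(len(bias)), weights=bias, k=n_tags):
        counts[position] += 1
    return counts


def is_potential_false_positive(observed, bias, left, center, right, seed=0):
    """True if the bias-only simulation yields a smaller FOS than the observed data."""
    rng = random.Random(seed)
    simulated = simulate_counts(sum(observed), bias, rng)
    return fos(simulated, left, center, right) < fos(observed, left, center, right)


# Toy example: flat observed signal, but a bias model that disfavours the centre.
flagged = is_potential_false_positive(
    observed=[5] * 6, bias=[1, 1, 0, 0, 1, 1],
    left=(0, 2), center=(2, 4), right=(4, 6),
)
```

A single simulation is used here for brevity; repeating the draw and summarizing the simulated scores would be a natural extension.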

Correspondence of DNase I footprints with ChIP-seq peaks

TF occupancy profiles generated by ChIP-seq represent a mixture of both direct (TFs directly contacting the DNA) and indirect (TFs contacting another protein or complex that is contacting the DNA) occupancy events. Of note, for the majority of TFs analysed to date, the indirect component predominates10. In contrast to ChIP-seq, DNase I footprinting provides information exclusively at sites of direct TF occupancy10. In Extended Data Fig. 3, motif models (from TRANSFAC, JASPAR Core, and UniPROBE) were used in conjunction with the FIMO motif scanning software30, version 4.6.1 using a P < 1 × 10−5 threshold, to find all motif instances of CTCF (Transfac model V_CTCF_01), GATA1 (Jaspar model MA0035.2-GATA1), MAX (Jaspar model MA0058.1-MAX), Myc (Jaspar model MA0147.1-Myc), and TBP (Transfac model V_TATA_01) within DNase I hotspots of the MEL cell line. We buffered (±30 nucleotides) discovered motif instances and counted at each base position within the buffered motif the number of uniquely mapping DNase I sequencing reads with a 5′ end mapping to that position. We sorted buffered motif instances by their total counts, and then normalized each instance’s counts to a mean value of 0 and variance 1. A heat map, with 1 row per motif instance, was generated using matrix2png31, version 1.2.1. A 46-species phyloP evolutionary conservation score heat map over the same ordered motif instances and bases was generated using the same processing techniques. Motif instances that overlapped DNase I footprints by at least 3 nucleotides were annotated. Uniformly processed mm9 MEL ChIP-seq peaks were downloaded from the UCSC Genome Browser website and motif instances overlapping ChIP-seq peaks by at least 3 nucleotides were also annotated.
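The counting and normalization used to build the heat map rows can be sketched as follows. The sketch assumes 0-based half-open motif coordinates on a single chromosome, a population variance for the normalization, and a descending sort by total counts; the sort direction and the function names are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def window_counts(cut_positions, motif_start, motif_end, pad=30):
    """Count read 5' ends at each base of a motif padded by `pad` on both sides."""
    lo, hi = motif_start - pad, motif_end + pad
    counts = [0] * (hi - lo)
    for position in cut_positions:
        if lo <= position < hi:
            counts[position - lo] += 1
    return counts


def standardize(counts):
    """Rescale one row to mean 0 and variance 1 (all zeros if the row is constant)."""
    mu, sd = mean(counts), pstdev(counts)
    if sd == 0:
        return [0.0] * len(counts)
    return [(c - mu) / sd for c in counts]


def heatmap_rows(instances):
    """instances: iterable of (cut_positions, motif_start, motif_end) per motif instance."""
    rows = [window_counts(cuts, start, end) for cuts, start, end in instances]
    rows.sort(key=sum, reverse=True)  # order rows by total cleavage before normalizing
    return [standardize(row) for row in rows]


toy_counts = window_counts([95, 100, 100, 200], 100, 110, pad=10)
toy_z = standardize([1, 2, 3])
```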

Identification of orthologous human sequence at mouse footprints

We aligned the coordinates for the 8.6 million combined mouse footprints to the human genome using the ‘over.chain’ best pairwise alignment file available from the UCSC Genome Browser. Mouse footprints with 50% or more of their constituent sequences aligned to the human genome, with at least half not aligned to insertions or deletions, were considered successfully aligned. For a description of the alignment procedure, see ref. 4.

Aggregated DNase I cleavage profiles

Mouse motif models from TRANSFAC32, version 2011.1, JASPAR Core33, and UniPROBE34 were used in conjunction with the FIMO motif scanning software, version 4.6.1, using a P < 1 × 10−5 threshold, to find predicted motif instances within hotspot regions as identified by the hotspot algorithm26. All motif instances identified for a given model were padded by 10 bp on each side, and aligned in a strand-sensitive manner. DNase I cleavages were averaged for each aligned nucleotide to create an aggregate profile for the motif model.
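A minimal sketch of the aggregation step is given below. It assumes per-base cleavage counts stored in a dictionary keyed by genomic position on a single chromosome, motif instances of equal width given as (start, end, strand) with half-open coordinates, and strand-sensitive alignment implemented by reversing the window for minus-strand instances. The function name and data layout are hypothetical, and strand-specific cleavage counts are not modelled.

```python
def aggregate_profile(cleavage, instances, pad=10):
    """Average per-base cleavage over padded, strand-aligned motif instances.

    cleavage: dict mapping genomic position -> cleavage count.
    instances: list of (start, end, strand) with strand '+' or '-'.
    """
    rows = []
    for start, end, strand in instances:
        row = [cleavage.get(pos, 0) for pos in range(start - pad, end + pad)]
        if strand == "-":
            row.reverse()  # orient minus-strand instances like plus-strand ones
        rows.append(row)
    if not rows:
        return []
    if any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("all motif instances must have the same width")
    return [sum(column) / len(rows) for column in zip(*rows)]


toy_cleavage = {10: 4, 11: 2}
toy_profile = aggregate_profile(toy_cleavage, [(10, 12, "+"), (10, 12, "-")], pad=1)
```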

De novo motif model discovery and comparison

The method for the identification of de novo motif models using mouse DNase I footprints was identical to that previously described using human DNase I footprints10. Across 25 mouse cell types, we identified 604 unique motif models within DNase I footprints.

We compared de novo motif models to models available as part of various experimentally grounded databases, including TRANSFAC, JASPAR Core, and UniPROBE using the TOMTOM software, version 4.6.1 (ref. 35). TOMTOM parameters were set to their default values during model comparisons with the exception of the min-overlap argument, which was set to 5. When partitioning the de novo motifs by assigning each to a single category, the order of match assignment preference was to TRANSFAC, JASPAR Core, UniPROBE and finally to the novel motif category. The novel motif models were further classified using previously published motif models derived from human DNase I footprinting experiments10. We also determined the proportion of motif models in each experimentally grounded database that matched to mouse de novo motif models using TOMTOM with the same parameter settings.

Analysis of nucleotide diversity (π)

To quantify the nature of selection operating on regulatory DNA, we surveyed nucleotide diversity (π) in DNase I footprints. Population genetics analyses were performed as previously described on 53 unrelated, publicly available human genomes released by Complete Genomics, version 1.10 (ref. 36). Relatedness was determined both by pedigree and with KING37. Variant sites were filtered by coverage (>20% of individuals must have calls). Additionally, Complete Genomics makes partial calls at some sites (that is, one allele is A and the other is N); these were counted as fully missing. Repeats were defined by RepeatMasker, downloaded from the UCSC Genome Browser (http://www.repeatmasker.org). CpGs and repeats were removed from all footprints before analysis. π for a single variant is 2pq, where p is the major allele frequency and q is the minor allele frequency. π was calculated for each cell type by summing 2pq over all variants and dividing by the total number of bases considered.

Although binding elements for mouse-selective motif models are enriched in mouse DNase I footprints, instances of these models are also present in human footprints, but to a significantly lesser degree. To identify instances of mouse-selective motif models in human regulatory elements, human DHSs were scanned using each of the novel mouse-selective motif models and the FIMO software tool (P < 1 × 10⁻⁵). Predicted motif instances in human DHSs were then filtered to those that overlapped human DNase I footprints identified in any human cell type by at least three nucleotides.
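The nucleotide diversity calculation in this section can be sketched as follows. The sketch assumes biallelic variants, so that q = 1 − p; the function names and the input layout are illustrative, not taken from the original analysis.

```python
def site_pi(major_allele_freq):
    """Per-variant diversity 2pq, assuming a biallelic site (q = 1 - p)."""
    p = major_allele_freq
    q = 1.0 - p
    return 2.0 * p * q


def nucleotide_diversity(major_allele_freqs, n_bases):
    """Sum 2pq over all variants and divide by the number of bases considered.

    major_allele_freqs: major allele frequency of each retained variant
    n_bases:            footprint bases remaining after CpG and repeat removal
    """
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return sum(site_pi(p) for p in major_allele_freqs) / n_bases
```

With two variants at major allele frequencies 0.5 and 0.9 across 100 considered bases, the per-variant values are 0.5 and 0.18, giving π = 0.0068.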

Calculation of cell-selective motif occupancy

We scanned for instances of a motif model using the FIMO software tool (P < 1 × 10⁻⁵) and filtered predicted motif instances to those that overlapped DNase I footprints identified in a particular cell type by at least three nucleotides. To derive a final occupancy value for a motif model in that cell type, we counted the total number of DNase I footprinted motif instances for that motif model and normalized it by the total number of bases contained within DNase I footprints in that cell type.
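A minimal sketch of the occupancy calculation follows. It assumes 0-based half-open intervals and non-overlapping footprints within a cell type; the function names are hypothetical, and the quadratic overlap search is chosen for clarity rather than speed.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def footprinted_instances(motif_instances, footprints, min_overlap=3):
    """Keep motif instances overlapping any footprint by >= min_overlap bases."""
    by_chrom = {}
    for chrom, start, end in footprints:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in motif_instances:
        candidates = by_chrom.get(chrom, ())
        if any(overlap_length(start, end, fs, fe) >= min_overlap
               for fs, fe in candidates):
            kept.append((chrom, start, end))
    return kept


def motif_occupancy(motif_instances, footprints, min_overlap=3):
    """Footprinted motif instances per footprinted base in one cell type."""
    footprint_bases = sum(end - start for _, start, end in footprints)
    if footprint_bases == 0:
        return 0.0
    n_footprinted = len(footprinted_instances(motif_instances, footprints, min_overlap))
    return n_footprinted / footprint_bases
```

Normalizing by footprinted bases rather than by genome size makes occupancy values comparable between cell types with different numbers of footprints.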

Calculation of promoter-proximal occupancy of motif models

We scanned for instances of a novel mouse-selective motif model using the FIMO software tool (P < 1 × 10⁻⁵) and filtered predicted motif instances to those that overlapped DNase I footprints identified in any cell type by at least three nucleotides. We classified those within 5 kb of a transcriptional start site using RefSeq annotations as ‘promoter-proximal’ and all others as ‘promoter-distal’.
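The promoter-proximal classification can be sketched as below. The text does not state how distance to a TSS was measured, so the sketch assumes the distance from the nearest edge of the motif instance to the TSS position; that choice, the function names and the input layout are assumptions.

```python
def distance_to_tss(start, end, tss):
    """Distance in bases from a half-open interval [start, end) to a TSS."""
    if start <= tss < end:
        return 0
    if tss < start:
        return start - tss
    return tss - (end - 1)


def classify_motif_instance(chrom, start, end, tss_by_chrom, max_distance=5000):
    """Label an instance by its proximity to any annotated TSS.

    tss_by_chrom: dict of chromosome name -> iterable of TSS positions
                  (for example, derived from RefSeq annotations)
    """
    for tss in tss_by_chrom.get(chrom, ()):
        if distance_to_tss(start, end, tss) <= max_distance:
            return "promoter-proximal"
    return "promoter-distal"
```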

TF regulatory network construction

Transcription factor (TF) regulatory networks were constructed as previously described1 using 5,000 nucleotide buffers anchored on canonical TF transcriptional start site (TSS) annotations. TF genes and motif models used for network construction were collected from the JASPAR Core, UniPROBE and TRANSFAC databases (Supplementary Information). To create genome-wide networks this method was extended to include all mm9 RefSeq genes, anchored using the 5′-most TSS annotation38.

Clustering and similarities of TF regulatory networks

We computed the pairwise Jaccard distances between TF regulatory networks and applied Ward clustering39 using the hclust and dendrogram functions in R. The heat map representation in Fig. 3d used the Jaccard index as the similarity measure. Importantly, all comparisons were made using the same subset of orthologous TF genes (567) with known, associated motif models in both species.
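The distance calculation that precedes clustering can be sketched as follows, treating each network as a set of directed (regulator, target) edges over the shared orthologous TF genes. This is an illustrative Python sketch, not the R workflow used in the study; the function names are hypothetical, and the clustering step itself is omitted.

```python
def jaccard_index(edges_a, edges_b):
    """Shared edges divided by the union of edges of two networks."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty networks are treated as identical
    return len(a & b) / len(union)


def pairwise_jaccard_distances(networks):
    """Return {(name_x, name_y): 1 - Jaccard index} for all network pairs.

    networks: dict of cell-type name -> iterable of (regulator, target) edges
    """
    names = sorted(networks)
    return {
        (x, y): 1.0 - jaccard_index(networks[x], networks[y])
        for x in names
        for y in names
    }
```

The resulting distances could then be passed to a hierarchical clustering routine with Ward linkage.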

TF regulatory edge conservation

To identify conserved regulatory connections that are also sequence conserved, we first collected all motif instances that overlapped a DNase I footprint by at least three nucleotides in a specific mouse cell type and that gave rise to a regulatory edge in that cell type's TF regulatory network. We then aligned the coordinates of each such mouse motif instance to the human genome using the ‘over chain’ best pairwise alignment file available from the UCSC Genome Browser. A mouse motif instance was considered successfully aligned if 50% or more of its underlying sequence aligned to the human genome, with at least half not aligned to insertions or deletions. If a footprinted mouse motif instance aligned to a motif instance of the same TF in an orthologous human cell type that also overlapped a footprint by three nucleotides or more, we checked whether the human motif instance gave rise to the same regulatory edge. If it did, the edge in the mouse regulatory network was classified as a shared edge between species arising from orthologous binding elements. Notably, an edge that connects two TFs within a regulatory network may arise from a single footprinted TF binding element or from multiple, distinct ones. Where multiple, distinct footprinted TF binding elements underlie a regulatory edge within a mouse cell-type TF regulatory network, the edge is considered to arise from an orthologous binding element so long as at least one of these TF binding elements is a shared connection arising from an orthologous binding element.

We calculated an empirical P value to evaluate the significance of the number of shared edges found between orthologous mouse and human cell types. We first generated 1,000 randomized human TF regulatory networks. When creating a randomized network, we ignored the usual requirement that a motif instance must significantly overlap a human footprint. The genomic space used to construct a random network was identical to that used in the observed case (within 5,000 nucleotides of a canonical TSS). A random subset of generated edges was chosen so that the in-degree of every TF gene node was identical to that in the observed human TF regulatory network (and, hence, the total number of edges was the same), and all edges were unique. We then determined the number of functionally conserved edges between the observed mouse TF regulatory network and each randomized human TF regulatory network. We counted the number of randomized networks in which the number of functionally conserved edges was at least as large as in the observed case. An empirical P value was calculated as one more than this count, divided by 1,000. This analysis was performed between every pair of orthologous cell types. No randomized network yielded a number of functionally conserved edges that reached or exceeded that of the observed TF regulatory networks.
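The final step of this procedure, the empirical P value, can be sketched as follows. The sketch takes the counts of functionally conserved edges from the randomized networks as given and reads the description as (count + 1) divided by the number of randomizations; the function name is hypothetical.

```python
def empirical_p_value(observed_shared_edges, randomized_shared_edges):
    """Empirical P value from randomized-network counts.

    observed_shared_edges:   conserved edges between the observed networks
    randomized_shared_edges: conserved edges for each randomized human network
    """
    n_random = len(randomized_shared_edges)
    if n_random == 0:
        raise ValueError("at least one randomized network is required")
    # Count randomizations that reach or exceed the observed value.
    n_at_least = sum(
        1 for count in randomized_shared_edges if count >= observed_shared_edges
    )
    return (n_at_least + 1) / n_random
```

With 1,000 randomized networks and none reaching the observed count, this gives the smallest attainable value, 0.001.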

Network motif architectures

We removed self-edges from every TF regulatory network and used the mfinder software tool for network motif analysis40. A z-score was calculated over each of 13 network motifs of size 3 (three-node network motifs), using 250 randomized networks of the same size for a null estimate. We vectorized z-scores from every cell type and normalized each to unit length to create triad significance profiles19.
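The normalization that turns per-motif z-scores into a triad significance profile can be sketched as below. The z-score helper is a generic formulation using the population standard deviation of the randomized counts; how mfinder estimates the null is not described here, so that helper and both function names are assumptions.

```python
import math
import statistics


def z_score(observed_count, randomized_counts):
    """Generic z-score of an observed motif count against randomized networks."""
    mean = statistics.mean(randomized_counts)
    sd = statistics.pstdev(randomized_counts)  # assumption: population SD
    if sd == 0:
        return 0.0
    return (observed_count - mean) / sd


def triad_significance_profile(z_scores):
    """Scale a vector of z-scores (one per network motif) to unit length."""
    norm = math.sqrt(sum(z * z for z in z_scores))
    if norm == 0:
        return [0.0] * len(z_scores)
    return [z / norm for z in z_scores]
```

Scaling to unit length removes the dependence of z-score magnitude on network size, so that profiles from networks of different sizes can be compared.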

Distribution of three-node network motifs

We enumerated all three-node circuits in a mouse TF regulatory network, and determined if and how each was connected in an orthologous human cell-type TF regulatory network. Software is available for download at https://github.com/StamLab/network-motifs.

Central-facing versus peripheral-facing TF enrichments

Enrichments were calculated by taking the log base 2 of the ratio of two proportions. The numerator was the proportion of outgoing edges from a TF node in the regulatory network that connected to another TF node, divided by the total number of input edges to all TFs. The denominator was the proportion of outgoing edges from a TF node that connected to any non-TF gene node, divided by the total number of input edges to all non-TF gene nodes.
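One reading of this calculation is sketched below: each proportion is taken as the TF's outgoing edge count to a node class divided by the total incoming edges to that class. This interpretation, the function name, the edge-list input and the handling of zero counts are assumptions; the sketch also assumes self-edges have already been removed.

```python
import math


def central_vs_peripheral_enrichment(tf, edges, tf_genes):
    """log2 ratio of TF-facing to non-TF-facing outgoing edge proportions.

    tf:       name of the TF node of interest
    edges:    iterable of directed (source, target) pairs
    tf_genes: set of gene names that are TFs
    """
    out_to_tf = out_to_non_tf = 0
    in_tf_total = in_non_tf_total = 0
    for source, target in edges:
        target_is_tf = target in tf_genes
        if target_is_tf:
            in_tf_total += 1
        else:
            in_non_tf_total += 1
        if source == tf:
            if target_is_tf:
                out_to_tf += 1
            else:
                out_to_non_tf += 1
    if 0 in (out_to_tf, out_to_non_tf, in_tf_total, in_non_tf_total):
        return None  # ratio undefined; a real analysis would need a policy here
    central = out_to_tf / in_tf_total
    peripheral = out_to_non_tf / in_non_tf_total
    return math.log2(central / peripheral)
```

Positive values indicate a TF whose outgoing edges are biased toward other TFs (central-facing), and negative values indicate a bias toward non-TF genes (peripheral-facing).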