The vast non-coding portion of the human genome is full of functional elements and disease-causing regulatory variants. The principles defining the relationships between these elements and distal target genes remain unknown. Promoters and distal elements can engage in looping interactions that have been implicated in gene regulation1. Here we have applied chromosome conformation capture carbon copy (5C2) to interrogate comprehensively interactions between transcription start sites (TSSs) and distal elements in 1% of the human genome representing the ENCODE pilot project regions3. 5C maps were generated for GM12878, K562 and HeLa-S3 cells and results were integrated with data from the ENCODE consortium4. In each cell line we discovered >1,000 long-range interactions between promoters and distal sites that include elements resembling enhancers, promoters and CTCF-bound sites. We observed significant correlations between gene expression, promoter–enhancer interactions and the presence of enhancer RNAs. Long-range interactions show marked asymmetry with a bias for interactions with elements located ∼120 kilobases upstream of the TSS. Long-range interactions are often not blocked by sites bound by CTCF and cohesin, indicating that many of these sites do not demarcate physically insulated gene domains. Furthermore, only ∼7% of looping interactions are with the nearest gene, indicating that genomic proximity is not a simple predictor for long-range interactions. Finally, promoters and distal elements are engaged in multiple long-range interactions to form complex networks. Our results start to place genes and regulatory elements in three-dimensional context, revealing their functional relationships.
Spatial proximity and specific long-range interactions between genomic elements can be detected using chromosome conformation capture (3C)-based methods5. Previous studies have been limited to analysis of single loci5,6,7,8, interactions that involve a single protein of interest9, or to analysis of genome-wide folding of chromosomes at a resolution that cannot detect specific looping interactions between genes and functional elements10. To overcome these limitations we previously developed 5C (ref. 2). 5C is a high-throughput adaptation of 3C and uses pools of reverse and forward 5C primers to detect long-range interactions between two targeted sets of genomic loci, for example, promoters and distal gene regulatory elements in this study. By targeting a specific part of the genome, 5C facilitates detection of interactions at single restriction fragment resolution.
To begin to define the principles of long-range gene regulation in the human genome we have used 5C to map interactions systematically between promoters and distal elements throughout the 44 ENCODE pilot project regions representing 1% (30 megabases (Mb), Supplementary Table 1) of the genome in three cell lines (Fig. 1a). The ENCODE regions, ranging in size from 500 kilobases (kb) to 1.9 Mb, were selected for comprehensive annotation by the ENCODE pilot project11. Here we analysed interactions between 628 TSS-containing restriction fragments and 4,535 ‘distal’ restriction fragments covering the ENCODE regions (Fig. 1a and Supplementary Tables 2 and 3; see also Methods).
5C libraries were generated for two biological replicates of GM12878, K562 and HeLa-S3 (Supplementary Tables 4–6). These cell lines are extensively annotated by the ENCODE consortium3,4. 5C interaction frequencies measured between ENCODE regions located on different chromosomes were used to quantify minor variations in interaction detection efficiencies due to technical biases related to 5C primer efficiency, restriction fragment length, or digestion efficiency. 5C interaction frequencies were then corrected for these biases (Methods and Supplementary Data).
An example of a 5C long-range interaction map representing TSS–distal fragment interactions along and between 14 ENCODE regions (ENm001–ENm014) is shown in Fig. 1b. 5C detects known general features of spatial chromatin organization. First, interactions within the same ENCODE region are more frequent than those between different ENCODE regions. Within one ENCODE region interaction frequencies are generally higher for pairs of loci located closer together in the linear genome. This inverse relationship between genomic distance and interaction frequency is as expected for a flexible chromatin fibre5,12. Second, interactions between ENCODE regions that are located on the same chromosome are more frequent than interactions between regions located on different chromosomes (arrow in Fig. 1b). This is consistent with 4C and Hi-C analyses6,10, and is due to the formation of spatially separated chromosome territories.
5C data sets were analysed to identify TSS–distal fragment pairs that interact more frequently than expected, indicating that they are relatively close in space. For each biological replicate we independently determined the average relationship between interaction frequency and genomic distance (solid red lines in Fig. 1c, d). We defined this as the expected interaction frequency. Next we identified interactions that occur significantly more frequently than expected for loci separated by a corresponding genomic distance by transforming 5C signals into a z-score (false discovery rate (FDR) = 1%; Methods). Specific long-range interactions are then defined as pairs of loci that interact significantly more frequently than expected in both replicates. By excluding interactions that are significant in only one replicate, we estimate that only around 10–18% of the significant long-range interactions identified by our approach might be false positives, as estimated from analysis of interactions in gene desert ENCODE regions (ENr112, ENr113 and ENr313) where no significant long-range interactions were expected (Methods). This application of stringent thresholds probably leads to a higher false-negative rate. Consistently, interaction frequencies that are found to be significant in only one replicate are still significantly elevated in the other replicate as compared to interactions that are never significant, but are just below the chosen 1% FDR threshold (Supplementary Fig. 1).
Our analysis correctly identified known interactions between TSSs and their cognate distal regulatory elements, providing validation of the approach (Supplementary Fig. 3). As an example, Fig. 1d shows the 5C interaction profile in K562 cells for a TSS located in the β-globin locus. We previously found that this TSS located just downstream of the γ-globin genes displayed prominent looping interactions with the distal locus control region (LCR) in K562 cells2. Our analysis accurately detected these looping interactions (HS3, HS4 and HS5). We identified additional known long-range interactions with DNase I hypersensitive sites (DHSs) near distal CTCF-bound elements (3′HS1 and HS-111)2,13,14. In K562 cells we also detected the known interactions between the γ-globin gene (HBG1) and the LCR (HS5) and between the α-globin genes and three distal regulatory elements including the α-globin enhancer HS40, and two CTCF-bound elements (HS46 and HS10), located 40, 46 and 10 kb upstream of the genes, respectively (Supplementary Fig. 3 and refs 15, 16). The importance of these distal elements in regulating globin gene expression through looping has been extensively documented14,16. As expected, these looping interactions in the globin loci were not detected in GM12878 or HeLa-S3 cells that express little or no globin (Supplementary Fig. 3). Additional examples of cell-type-specific TSS–distal element interactions are shown in Supplementary Fig. 4. Furthermore, 5C interaction frequencies are correlated with TSS–distal DHS pairs predicted to be functionally connected based on their highly correlated activity across a large panel of cell lines (P < 10−13, one-sided Mann–Whitney U-test17), providing independent validation of their biological significance.
In each cell line we identified large numbers of statistically significant TSS–distal fragment interactions, of which ∼60% were observed in only one of the three cell lines (Fig. 2a). These data point to intricate cell-type-specific three-dimensional folding of chromatin. 3C-based assays detect specific and functional interactions, for example, TSSs with gene regulatory elements8. In addition, the assay will detect ‘structural’ interactions, for example, close spatial proximity as a result of other nearby specific looping interactions (bystander interactions) or overall higher order folding of the chromatin fibre. To determine which looping interactions involved distal sites that displayed specific chromatin features associated with functional elements, we compared our data with data sets generated by the ENCODE consortium (Fig. 2b and Supplementary Table 7). We found that looping interactions in all cell lines were significantly enriched for distal fragments that are bound by CTCF—a protein known to mediate DNA looping18—contain open chromatin (as determined by FAIRE19 or DHS mapping17), and/or contain histones with modifications that are characteristic for active functional elements (H3K4me1, H3K4me2 and H3K4me3). Long-range interactions are also enriched for H3K9ac and H3K27ac, but are not enriched or significantly depleted for H3K27me3, a mark typically found at inactive or closed chromatin.
To gain more insight into the types of element present in the distal looping fragments, we made use of genome-wide and cell-line-specific segmentation analyses that identified seven distinct chromatin states based on histone modifications, the presence of DHSs and the localization of proteins such as RNA polymerase II and CTCF (ref. 4 and Fig. 2b). These states are: (1) enhancer (E); (2) weak enhancer (WE); (3) TSS; (4) predicted promoter flanking regions (PF); (5) insulator element (CTCF); (6) predicted repressed region (R); and (7) predicted transcribed region (T). The ENCODE consortium tested sets of the E elements in enhancer assays and confirmed that >50% display enhancer activity4. We found that looping interactions were significantly enriched for distal fragments that contained E, WE and CTCF elements, and the actively transcribed chromatin state (T), but were depleted for the repressed chromatin state (R). We note that some distal looping fragments contained elements classified as TSS or PF, even though they did not contain TSSs as defined by the GENCODE v7 annotation20. Possibly, these are yet-to-be-annotated TSSs.
Next, we used the seven-way segmentation data to categorize looping interactions into four broader functional groups (Fig. 2c, Supplementary Fig. 5 and Supplementary Data): those that involve a distal fragment that contains a putative enhancer (‘E’ (E or WE)), a putative promoter (‘P’ (TSS or PF)), or a CTCF-bound element (CTCF). The final class contains interactions with fragments that do not contain any of these three types of element, although they do contain T and R states (‘U’, unclassified). The last class is relatively large but is still significantly enriched in features that are characteristic for active functional elements such as H3K4me1, and over 60% of the unclassified fragments contain chromatin features found at active chromatin elements (Supplementary Fig. 7). Thus, these are not simply noise or false positives, but are probably the result of the conservative segmentation approach.
We found that TSS–E and TSS–P interactions are more cell-type specific than TSS–CTCF interactions: for the TSS–E and TSS–P categories, the ratio of interactions that is seen in only one cell line versus more than one cell line is ∼4:1, whereas it is close to ∼1:1 for the TSS–CTCF category (Supplementary Fig. 5). The cell-type-specific activity of some of these E elements was confirmed using transient reporter assays (Supplementary Fig. 10). Next, we determined whether looping of a TSS to any of the four categories of chromatin states is correlated with transcription. We used CAGE expression data21 to assign an expression level to each TSS. We found that looping interactions with fragments containing enhancer-like E elements were significantly enriched for those that involved expressed TSSs (Fig. 2d and Supplementary Fig. 6). In addition, the subset of TSSs that interact with fragments containing E elements was significantly more highly expressed compared to TSSs that do not interact with E elements. Interactions with other classes of element (CTCF, P and U) are significantly enriched for actively expressed genes in some, but not all, cell lines (Supplementary Fig. 6).
Active enhancers often express enhancer RNAs22. We used a comprehensive enhancer RNA data set generated by the ENCODE consortium to determine whether TSSs preferentially interact with active enhancer-like elements23. We found that E elements that are looping to TSSs are significantly more likely to express enhancer RNAs than E elements that are not looping (P < 5 × 10−5, hypergeometric test, Supplementary Fig. 10). We conclude that looping interactions preferentially involve active enhancer-like elements.
Next we analysed the distribution of long-range interactions upstream and downstream of TSSs. To generate this landscape of looping interactions we aligned all TSSs and calculated the average number of interactions that a TSS has with each class of distal element at increasing genomic distances upstream and downstream of the TSS. Figure 3a shows the resulting average long-range interaction profile across all three cell lines (similar results were obtained when each of the cell lines was analysed separately; Supplementary Fig. 8). Notably, we found that the long-range interaction landscape is asymmetric, with interactions of E, P and CTCF classes peaking around 120 kb upstream of the TSS. This asymmetry of interactions reveals an unanticipated directionality in long-range interactions with TSSs. This may indicate the presence of topological constraints imposed by the mechanism by which such interactions regulate target promoters. No such bias was observed for the set of unclassified elements, or for the complete set of interrogated interactions (Fig. 3a). Interestingly, previous analyses showed that conserved non-coding elements are also often found within similar distances of target genes24. Third, when we analysed expressed TSSs and non-expressed TSSs separately, we found that both have a similar interaction landscape but that expressed TSSs tend to have more interactions, especially with the E, P and CTCF classes. We cannot rule out the possibility that some TSSs classified as non-expressed based on the absence of CAGE tags are actually expressed at low levels.
Next we explored whether the relative order of elements in the genome affects which long-range interactions occur. It is often assumed that distal elements such as enhancers target the nearest TSS. Only ∼7% of the looping interactions are between an element and the nearest TSS (Fig. 3b). This number goes up to 22% when only active TSSs are included. Similarly, 27% of the distal elements have an interaction with the nearest TSS, and 47% of elements have interactions with the nearest expressed TSS. Thus, when predicting TSS–distal element interactions, choosing the nearest (active) gene is often not correct.
It has been suggested that CTCF sites located between an enhancer and a TSS may prevent enhancer–promoter interactions18,25, although in individual cases interactions over such sites have been observed14,26. To address this question we determined the frequency of identified long-range interactions between a TSS and a distal element that skip over one or more sites bound by CTCF. We found that 79% of long-range interactions are unimpeded by the presence of one or more CTCF-bound sites (Fig. 3c). Thus, the presence of a CTCF-bound site does not block physical long-range interactions. It has been reported that CTCF acts in conjunction with the cohesin complex to block promoter–enhancer interactions27. We found that 58% of looping interactions skip sites co-bound by CTCF and cohesin (Fig. 3c). We obtained similar results when the different categories of long-range interaction (TSS–E, TSS–P, TSS–CTCF and TSS–U) were analysed separately. Possibly, additional factors need to be recruited to CTCF-bound sites to acquire interaction-blocking activity.
The large number of long-range interactions that we discovered indicate that distal elements and TSSs are each engaged in multiple long-range interactions. To characterize this phenomenon in more detail we determined the interaction degree of TSSs and distal fragments. We found that ∼50% of TSSs display one or more long-range interaction, with some interacting with as many as 20 distal fragments (Fig. 4a). Expressed TSSs interact with slightly more fragments as compared to non-expressed TSSs (the mean for GM12878 is 1.88 versus 1.37, or 3.88 versus 3.25 when including only those TSSs with at least one interaction). Out of all distal fragments interrogated, ∼10% interacted with one or more TSS, with some interacting with more than 10 (mean of 2.15 (for GM12878) when including only those distal fragments with at least one interaction). The degree distribution of the four categories of distal elements was very similar (Supplementary Fig. 9).
Figure 4b shows an example of the complex long-range interaction networks formed by TSSs and distal fragments in the ENr132 region in K562 cells. It is unlikely that these interactions can all occur at the same time in the same cell, which is indicative of significant cell-to-cell variation. The data indicate that gene–element interactions are not exclusively one-to-one, and suggest that multiple genes and distal elements can assemble in larger clusters, as proposed for the β-globin locus14.
Our data provide new insights into the landscape of chromatin looping that bring genes and distant elements in close spatial proximity. In addition to generating a rich data set reflecting specific gene–element interactions, the average interaction profile of TSSs with surrounding chromatin reveals several general principles regarding the asymmetric relationships between genomic distance, the order of elements, and the formation of looping interactions. The bias for upstream interactions may indicate that the protein complexes on many TSSs may be asymmetric and may preferentially interact on one side with enhancer–protein complexes. It is also possible that the asymmetry of the long-range interaction landscape reflects a potential preference of looping to elements that are located in intergenic non-transcribed regions. Furthermore, although these average long-range interaction landscapes may facilitate computational prediction of long-range interactions throughout the genome, the fact that interactions skip genes and CTCF/cohesin sites indicates that additional mechanisms for target selection and gene insulation exist.
Although conventional 3C may still be the method of choice to study the folding of individual loci, the 5C design strategy and data analysis methods applied here may provide a general approach for systematically mapping gene–element interactions for large gene sets. With further development of 3C technology and increases in sequencing capacity, similar high-resolution studies should become feasible to map specific long-range interactions throughout the genome, which may uncover additional principles that guide chromatin looping. Such insights will also be critical for interpreting genome-wide association studies that often identify regions with regulatory elements but not their distally located target genes. Co-published ENCODE-related papers can be explored online via the Nature ENCODE explorer (http://www.nature.com/ENCODE), a specially designed visualization tool that allows users to access the linked papers and investigate topics that are discussed in multiple papers via thematically organized threads.
5C was performed using two pools of 5C primers: one for ENm001–ENm014 and ENr313, and one pool for all 30 randomly picked ENCODE regions (ENr111–ENr334)11 (Supplementary Tables 2 and 3). 5C libraries (two biological replicates per cell line) were sequenced on an Illumina GAIIx platform and sequence reads were mapped using Novoalign (http://www.novocraft.com), as described15. Interaction data for each experiment are available in GEO (accession number GSE39510). Statistically significant pair-wise interactions were identified (Methods) by converting each 5C signal into a z-score using the average 5C signal distribution versus genomic distance as a background estimate. Significant interactions (1% FDR) observed in both biological replicates were considered looping interactions. 5C looping interactions were compared to a variety of genome-wide data sets generated by the ENCODE consortium4 (Supplementary Table 7).
Cell growth conditions
GM12878 lymphoblastoid cells were procured from Coriell Cell Repositories and grown in RPMI 1640 medium supplemented with 2 mM l-glutamine, 15% fetal bovine serum (FBS) and antibiotic (1% penicillin–streptomycin). K562 (CCL-243), a CML cell line, and HeLa-S3 (CCL2.2), a cervical carcinoma cell line, were obtained from American Type Culture Collection (ATCC). K562 cells were cultured in similar media as GM12878 cells except with 10% FBS, whereas HeLa-S3 cells were maintained in ATCC recommended F-12K medium (Kaighn’s modification of Ham's F-12 medium) with 10% FBS and 1% penicillin–streptomycin. The culture densities and conditions were maintained as per recommendations of the repositories.
For suspension cells (GM12878, K562) a total of 1 × 108 freshly growing cells were centrifuged at 100g for 5 min. Cell pellets were re-suspended in 45 ml of respective growth medium in a 50-ml Falcon tube. Cells were fixed by addition of 1.25 ml of 37% formaldehyde (final concentration of formaldehyde 1%). The cell suspension was gently mixed by inverting the tube up and down 4–6 times at room temperature and the tubes were rotated on an end-to-end shaker for exactly 10 min. Crosslinking was stopped by addition of 2.5 M glycine (final concentration 125 mM) and cell suspensions were incubated at room temperature for 15 min using an end-to-end shaker. The crosslinked cells were then pelleted at 100g for 5 min and the cell pellet was stored at −80 °C. For HeLa-S3 cells, the adherent cells were first trypsinized and then the crosslinking was performed as described above.
5C analysis was carried out as previously described2,15 for the 44 ENCODE Pilot regions (ENCODE manual (ENm) and ENCODE random (ENr)). The chromosomal position and coordinates of the regions as per the February 2009 GRCh37/hg19 human genome assembly are listed in Supplementary Table 1. The 5C experiment is designed to interrogate looping interactions between HindIII fragments containing transcription start sites (TSSs) and any other HindIII restriction fragment (distal fragments) in the ENCODE pilot regions.
5C primer design
5C primers were designed at HindIII restriction sites (AAGCTT) using 5C primer design tools previously developed and made available online at My5C website (http://my5C.umassmed.edu)28. Reverse 5C primers were designed for HindIII restriction fragments overlapping a known TSS from GENCODE transcripts, or overlapping a start site as experimentally determined by CAGE tag data of the ENCODE pilot project (Supplementary Table 2). Forward 5C primers were designed for the remaining HindIII restriction fragments (Supplementary Table 3). For ENCODE regions that do not contain any TSS according to gene annotation in 2008 (ENr112, ENr113, ENr311 and ENr313), we used an alternative primer design. For these regions an alternating design of forward and reverse 5C primers was used in which forward and reverse primers are designed for alternating restriction fragments2. Note that ENr311 contains genes according to 2011 GENCODE v7 annotation20. Primers were excluded for highly repetitive sequences that prevented the design of a sufficiently unique 5C primer. Primers settings were as described before15: U-BLAST, 3; S-BLAST, 130; 15-MER, 1,320; MIN_FSIZE, 40; MAX_FSIZE, 50,000; OPT_TM, 65; OPT_PSIZE, 40. The 5C primers contained up to 40 bases that were specific for the corresponding restriction fragment. If a shorter sequence was sufficient to obtain a predicted annealing temperature of 65 °C, that shorter sequence was used, and random sequence was added to make a total of 40 bases. All of the 5C primers have an extension of universal tail sequences at the 5′ end for forward 5C primers and at the 3′ end for reverse 5C primers. DNA sequence of the universal tails of forward primers was 5′-CCTCTCTATGGGCAGTCGGTGAT-3′; DNA sequence for the universal tails of reverse primers was 5′-AGAGAATGAGGAACCCGGGGCAG-3′. A six-base barcode was included between the specific sequence of the primers and the universal tail to aid in mapping of the high-throughput short sequencing reads. The length of each primer was 69 bp. In total, 981 reverse primers and 5,321 forward primers were designed (corresponding to ∼77.1% (6,302 of 8,174) of all HindIII fragments in the 44 ENCODE regions).
Generation of 5C libraries
3C was performed with HindIII restriction enzyme as previously described15,29 for GM12878, K562 and HeLa-S3 cells separately with two biological replicates for each cell line. The 3C libraries were then interrogated by 5C. The 44 ENCODE regions were analysed in two groups using two separate 5C primer pools. The first group (ENm) contained the manually picked ENCODE regions ENm001–ENm014 and ENr313. The second group (ENr) contained the 30 randomly picked ENCODE regions. The two 5C primer pools were made by pooling 5C primers for interrogating long-range interactions in the two groups of ENCODE regions. In these pools each primer was present at a final concentration of 0.5 fmol μl−1.
The primer pool for the ENm group contained a total of 3,150 primers (476 reverse 5C primers and 2,674 forward 5C primers). This primer pool allows interrogation of a total of 1,272,824 interactions. Of these, 83,427 interactions were between fragments that were both located in the same ENCODE region. The primer pool for the ENr group contained a total of 3,152 primers (505 reverse 5C primers and 2,647 forward 5C primers). This primer pool allows interrogation of a total of 1,336,735 interactions. Of these, 34,859 interactions were between fragments that were both located in the same ENCODE region.
5C was performed in 10–15 reactions each containing an amount of 3C library that represents 200,000 genome equivalents and 0.5 fmol of each primer. The multiplex annealing reaction was performed overnight at 55 °C. Pairs of annealed 5C primers were ligated at the same temperature using Taq DNA ligase for 1 h. Ligated 5C primer pairs, which represent a specific ligation junction in the 3C library and thus a long-range interaction between the two corresponding loci, were then amplified using 28 cycles of PCR with universal tail primers that recognize the common tails of the 5C forward and reverse primers. At least four separate amplification reactions were carried out for each of 10–15 annealing reactions described above and all the PCR products were pooled together. This pool constitutes the 5C library. The libraries were concentrated using Qiaquick PCR purification kit and a 3′-A tailing reaction was done using dATP and Taq DNA polymerase in the presence of 1× standard Taq buffer (NEB) at 72 °C for 30 min.
To facilitate Illumina paired-end DNA sequence analysis of 5C libraries, Illumina paired-end adaptor oligonucleotides (Illumina) were ligated to the 5C library using the Illumina PE protocol. The linkered 5C library was then amplified by PCR (17 or 18 cycles, with Phusion High Fidelity DNA polymerase) using Illumina PCR primer PE 1.0 and 2.0. The 5C library was gel purified and sequenced on the Illumina GAIIx platform, generating 36-bp paired-end reads.
5C read mapping
Sequencing data was obtained from an Illumina GAIIx machine and was processed through a custom pipeline to map and assemble 5C interactions. We used 36-bp paired-end reads to sequence all 5C libraries. Owing to sequencing efficiency, some 5C libraries were re-sequenced as many as ten times to obtain the required read depth for our analysis.
The fastQ files were taken directly from the Illumina GAIIx and fed into our in-house 5C mapping pipeline. Each side of the paired end read was independently mapped to a pseudo-genome of all possible 5C primer sequences using the novoalign mapping algorithm (V2.05 http://novocraft.com). The default alignment settings for novoalign were used. After mapping, if both of the paired-end reads could be uniquely mapped to a 5C primer, a 5C interaction was assembled. Invalid interactions between the same primer or between primers of the same type were removed as these would represent a mapping artefact or an issue with the 5C technique. The number of invalid interactions detected across all libraries was <0.01%, which would be expected if solely due to random mapping errors.
Statistics regarding the 5C library quality, mapping efficiency, etc. can be found in Supplementary Table 4. Because it is only necessary to map the paired-end reads to the list of all possible 5C primers rather than to the entire genome, a higher percentage of mapped/usable reads can be achieved. We found that >90% of all paired-end reads (after Illumina chastity filtering) can be uniquely mapped to a single 5C interaction. For libraries where more than one lane was used to achieve adequate sequence depth, the interactions from each lane were summed to produce the complete 5C interaction data set. A table summarizing the read depth of each 5C library can be found in Supplementary Table 5. Pearson correlation coefficients between the biological replicates can be found in Supplementary Table 6.
Detection bias correction
5C experiments involve a number of steps that can locally differ in efficiency, thereby introducing biases in efficiency of detection of pairs of interactions. These biases could be due to differences in the efficiency of crosslinking, the efficiency of restriction digestion (related to crosslinking efficiency), the efficiency of ligation (related to fragment size), the efficiency of 5C primers (related to annealing and PCR amplification) and finally the efficiency of DNA sequencing (related to base composition). All of these potential biases—several of which are common to other approaches such as chromatin immunoprecipitation (for example, crosslinking efficiency, PCR amplification, base-composition-dependent sequencing efficiency)—will have an impact on the overall efficiency with which long-range interactions for a given locus (restriction fragment) can be detected. To determine this overall efficiency of interaction detection we have developed the following general strategy. To determine overall interaction detection efficiency for a given restriction fragment we analysed the large set of interchromosomal interactions that are detected for each fragment. We then defined the overall efficiency of interchromosomal interaction detection for a given fragment as the ratio of the average interchromosomal signal obtained with that fragment and the average interchromosomal signal of all fragments. We then corrected the frequency of each interrogated long-range intrachromosomal interaction using a correction factor that is the product of the overall efficiency of interchromosomal interaction detection for the two interacting fragments.
This procedure will correct for any of the biases in detectability of interactions for a given locus, as listed above, and will also adjust for copy number variation of a locus, which can vary in transformed cell lines such as K562 and HeLa-S3 cells, as these factors will also affect the level of interchromosomal interactions.
Detailed primer filtering
To approximate the relative 5C signal of each restriction fragment interrogated in the experiment we first calculated the average 5C signal for all trans interactions (interactions between different chromosomes). To remove any extreme outliers from the mean calculation (for example, due to primer failure) we first filtered down the distribution of 5C signals in trans for each restriction fragment by removing all signals beyond the mean ± 3 standard deviations (s.d.). After calculating the filtered mean for each restriction fragment in trans, we calculated the global mean of all interchromosomal interaction frequencies. We then calculated a correction factor for each restriction fragment that would normalize its set of trans interactions to the entire set. Once the correction factors were calculated, we then calculated the mean and s.d. correction factor and flagged any restriction fragments requiring a correction value beyond the mean ± 1.654 s.d. Fragments with a correction factor outside of this limit were flagged for removal as their trans signal is too above/below the expected signal by chance. Here, we assume that any variation in 5C signals detected within the trans space is due to experimental factors, differing primer efficiencies, ligation efficiencies, etc.
Detailed primer correction
Once the outlier fragments are removed from the 5C data set, we repeated the above-described steps to calculate the primer correction values required to normalize the 5C signals for the remaining restriction fragments. Then, for each 5C interaction within an ENCODE region in the data set, we used the product of the correction factors from the two restrictions fragments involved in the interaction as the final correction factor to apply to the 5C signal. 5C signals were then either increased or decreased by the correction factor to correct for varying signals from the fragments visibility in the trans interaction space.
To detect significant looping interactions from background looping interactions we developed an in-house ‘5C peak calling’ algorithm. We chose to call peaks in each 5C biological replicate separately and then take only the peaks that intersect across replicates as our final list of significant looping interactions.
5C signals represent the three-dimensional contact probabilities between pairs of loci. This relationship inversely scaled with genomic distance. To control properly for the varying genomic distances tested in the 5C data set, we first determined the relationship of 5C signals over genomic distance. Using a Lowess smoothing algorithm we found the weighted average and weighted s.d. of all 5C signals across the range of all interrogated genomic distances. We used the traditional tri-cubic weighting function and an α parameter of 0.01 to average the closest 1% of the 5C signals around each genomic distance. We assumed that the large majority of interactions are not significant looping interactions and thus we interpreted this weighted average as the expected 5C signal for any given genomic distance. The 5C signals were then transformed into a z-score by calculating the (obs − exp/s.d.). Where the obs value is the detected 5C signal for a specific interaction, exp is the calculated weighted average of 5C signals for a specific genomic distance, and s.d. is the calculated weighted standard deviation of 5C signals for a specific genomic distance. Once the z-scores were calculated, the distribution of z-scores was fit to a Weibull distribution. We found that the distribution of z-scores fits to the Weibull distribution with an R2 value of >0.939 for all cell lines. P values can then be mapped to each z-score and then also transformed into q values for FDR analysis. The ‘q value’ package from R (qvalue.cal [siggenes]) was used to compute the q values for the given set of P values determined from the fit to the Weibull distribution. Using an FDR cutoff of 1%, we selected all 5C interactions with a q value ≤0.01. We then took the intersection of all significant looping interactions across the two biological replicates as our final list of 5C looping interactions.
Estimation of frequency of false-positive looping interactions
We defined a false-positive 5C looping interaction as an interaction that is identified in the peak calling approach described above but is due to technical biases or noise and thus does not reflect a biologically meaningful long-range interaction. To estimate the frequency by which our approach detects significant looping interactions by chance, we analysed 5C data obtained for the three ENCODE regions that are devoid of genes and are almost devoid of active regulatory elements (according the ENCODE seven-way segmentation4). As described above, we used an alternating 5C primer design for these regions. As a result, long-range interaction profiles are not specifically anchored on any type of genomic element. Combined with the fact that these regions are largely devoid of any functional elements, we do not expect to detect any significant looping interactions. Thus, assessment of the number of looping interactions detected for these regions using our peak-calling pipeline provides an empirical approach to estimate the frequency by which significant looping interactions are detected by chance and thus represent false positives.
Supplementary Fig. 1a shows the number of peaks detected in the three gene desert ENCODE regions (ENr112, ENr113 and ENr313). We used these numbers to estimate the frequency with which we detect significant looping interactions by chance. For GM12878 cells we identified 17 significant looping interactions in both replicates. For these three ENCODE regions we interrogated 7,819 5C interactions. Thus, we estimate that the fraction of interrogated interactions that by chance scores as a significant long-range interaction: (17/7,819)100 = 0.217%. Assuming that this fraction is the same for the set of 82,545 interrogated TSS–distal element interactions throughout the ENCODE regions, we expect to detect (0.217 × 82,545)/100 = 179 false-positive looping interactions. We detected 1,011 significant looping interactions between TSSs and distal sites in GM12878 cells, which leads us to estimate that the false-positive detection rate is around 18% [(179/1,011)100]. Similar analyses of 5C data from K562 and HeLa-S3 cells lead to estimates of false-positive detection rates of 10% and 12%, respectively, corresponding to 147 out of 1,434 and 190 out of 1,620 looping interactions possibly being false positives. We note that these represent upper limit estimates, as some of the significant looping interactions detected in the gene desert regions may be real.
The false-positive detection rate for single replicates can be calculated in exactly the same way. We found that the fraction of significant looping interactions detected in one replicate that might be false positives ranges from 20% to 47%. Thus, by requiring interactions to be significant in both replicates, we greatly reduce the fraction of false-positive significant interactions (from 20–47% to 10–18% of the significant interactions). At the same time, many of the significant interactions detected in only one replicate were not false positives, and by excluding this subset of interactions from our analysis we introduce false negatives. Consistent with our interpretation that many of the peaks seen in only one replicate represent false negatives, we found that when we take the union of the peaks found in replicates 1 and 2, or analyse the set of peaks obtained with individual replicates separately, all of the results that we presented remain the same: (1) enrichment for distal elements that resemble active gene regulatory elements (Supplementary Fig. 1e); (2) asymmetry of the long-range interaction landscape with a peak around 120 kb upstream of the TSS (Supplementary Fig. 8); (3) skipping over CTCF sites; and (4) formation of interwoven interaction networks. The fact that all our results can be obtained using different peak sets (for example, the union of two replicates, or the intersection of the replicates) indicates that our basic findings are robust and not very sensitive to where the threshold for peaks is placed. By focusing exclusively on the set of peaks independently detected in both replicates we are being conservative, only report the strongest signals that display the strongest enrichments for active chromatin features (Supplementary Fig. 1), and reduce the false-positive rate.
In general we prefer false negatives over false positives.
To annotate the interrogated restriction fragments, a variety of ENCODE data sets were used to check for overlap with our list of restriction fragments. A list of all used ENCODE data sets can be found in Supplementary Table 7.
Gene Expression Omnibus
All data are publicly available at GEO (accession number GSE39510). 5C data has also been deposited in the public UCSC ENCODE database (http://encodeproject.org/ENCODE/). 5C data can be found at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUmassDekker5C/.
We thank the University of Massachusetts Medical School Deep Sequencing core for sequencing 5C libraries, and R. Thurman and J. Stamatoyannopoulos for discussion and help with peak calling analysis. We thank M. Walhout and members of the Dekker laboratory for discussions. This work was supported by grants from the National Institutes of Health, National Human Genome Research Institute (HG003143 and HG003143-06S1) and a W.M Keck Foundation Distinguished Young scholar in Medical Research award to J.D.
This zipped file contains Supplementary Tables 1-7 comprising: Table 1 Coordinates of the 44 ENCODE region (Hg19); Table 2 Reverse 5C primers; Table 3 Forward 5C primers; Table 4 Mapping Summary for 6 5C libraries; Table 5 Summary of 5C datasets; Table 6 Correlation of biological replicates (raw data); Table 7 List of all ENCODE datasets used.
This zipped file contains Supplementary Data set 1 containing 5C data (raw, filtered, corrected).
This zipped file contains Supplementary Data set 6 containing a list of all ‘peak’ called 5C interactions and relevant scores.
This zipped file contains Supplementary Data sets 2-5 and 7-9. The Supplementary Data sets include: Data Set 2 a list of manually removed primers (duplex); Data Set 3 a list of filtered primers; Data Set 4 Primer correction values for normalization; Data Set 5 a list of all fragment annotations (TSS+mark); Data Set 7 BED12 files for all 5C loops; Data Set 8 ‘WebPlots’ of all 5C loops in all 44 ENCODE regions (over 3 cell-lines) and Data Set 9 a list of all degree distributions per 5C primer.
About this article
European Journal of Human Genetics (2018)