Population-scale study of eRNA transcription reveals bipartite functional enhancer architecture

Enhancer RNAs (eRNA) are unstable non-coding RNAs, transcribed bidirectionally from active regulatory sequences, whose expression levels correlate with enhancer activity. We use capped-nascent-RNA sequencing to efficiently capture bidirectional transcription initiation across several human lymphoblastoid cell lines (Yoruba population) and detect ~75,000 eRNA transcription sites with high sensitivity and specificity. The use of nascent-RNA sequencing sidesteps the confounding effect of eRNA instability. We identify quantitative trait loci (QTLs) associated with the level and directionality of eRNA expression. High-resolution analyses of these two types of QTLs reveal distinct positions of enrichment at the central transcription factor (TF) binding regions and at the flanking eRNA initiation regions, both of which are associated with mRNA expression QTLs. These two regions—the central TF-binding footprint and the eRNA initiation cores—define a bipartite architecture of enhancers, inform enhancer function, and can be used as an indicator of the significance of non-coding regulatory variants.

Enhancers that contain tiQTLs and/or diQTLs (with) are enriched in eQTLs, compared to enhancers with no tior diQTL (without). The proportion of enhancer tTREs that contain eQTLs were calculated for tiQTL containing, diQTL containing, and exclusively diQTL containing (no tiQTLs) tTREs and compared to those without. Indicated p-values from two-sided Fisher's exact test. Source data are provided as a Source Data File. (B) Percent of SNPs with FDR < 0.1 when used as SNP set for eQTL discovery. All p-values calculated in comparison to percentage for all SNPs within 2kb of tTREs. Indicated p-values from two-sidedn Fisher's exact test. Source data are provided as a Source Data file. (C) Enrichment of tiQTLs +/-20 bp from TRE centers compared to overlap of background SNPs. Overlap of QTL and background are compared for each bar with two-sided Fisher's exact test. Source data are provided as a Source Data file. (D) As in (C) for diQTLs at TRE TSSs +/-25 bp. Source data are provided as a Source Data file. (E) Frequency of eQTLs among distance-fromcenter-matched SNPs from out and core regions of enhancer tTREs. Left: SNP density plotted as function of distance from TRE center. (F) Bootstrapped frequency of eQTLs in (E). Data are presented as mean eQTL frequency +/-standard deviation. Indicated p-value from bootstrapped Welch's two-sided t-test, n = 500 bootstrapped eQTL frequencies. Source data are provided as a Source Data file. (G) Frequency of eQTLs among distance-from-center-matched SNPs from core and non-core non-center regions of enhancer tTREs. Left: SNP density plotted as function of distance from TRE center. (H) Bootstrapped frequency of eQTLs in (G). Data are presented as mean eQTL frequency +/-standard deviation. Indicated p-value from bootstrapped Welch's twosided t-test, n = 500 bootstrapped eQTL frequencies. Source data are provided as a Source Data file.   For each dataset, we pre-processed the raw sequence reads (fastq) by trimming out 3´ RNA adaptor sequences using cutadapt, and tagging the reads with first 8mer of the sequences for unique molecular indices (UMIs). We aligned the reads to a human ribosomal RNA reference (NR_145819) and filtered out the reads that mapped to the ribosomal sequences using bowtie, allowing up to 2 mismatches. Then we aligned the remaining reads to the hg19 reference genome using bowtie, and used uniquely mapped reads allowing up to 2 mismatches. Duplicate reads, having the same mapped position and the same UMI, were reduced to a single read using custom scripts, and stored in bam format. We mapped the 5′ (PRO-cap) or 3′ (PRO-cap) ends of the reads in each individual and stored the read counts in bedgraph format using the bedtools suite (bedtools coverage).

Identification of nTSSs and tTREs.
To find nascent transcription start sites (nTSSs), we made 100 bp bins of the PRO-cap read counts at the 5′ end of the reads on the plus strands of the hg19 genome, and selected the top 5 percentile bins (1.5 million bins). To find bidirectional PRO-cap peaks in these bins, we extended the top 1.5 million 100 bp bins by 100 bps upstream and downstream, and picked the 1 bp position with the highest read count greater than 5 reads within the resulting 300 bp region.
We then selected antisense strand peaks that have the highest read counts (greater than 5 reads) within 300 bp upstream relative to the sense strand peak. We repeated this procedure starting with the minus strand and combined both to generate 491,289 bidirectional PRO-cap peak candidates.
Since transcription initiation can arise from a broader region up to several tens of base pairs, we made sums of PRO-cap read counts from the midpoint to +200 bp on the plus strand, and -200 bp to the midpoint on the minus strand, for each PRO-cap peak candidates. Then we removed PRO-cap peaks whose midpoints are within 150 bp of another peak's midpoint, keeping only the peak with the highest read counts (219,312 PRO-cap peaks). The read count sums for each PRO-cap peak were normalized by per million mapped reads (RPM).
To select PRO-cap peaks that are expressed above the background distribution of PRO-cap read counts, we repeated our analysis on randomly selected 1,000,000 genomic positions that are 1 kb   To mask out and exclude non-unique regions for further analyses, we generated tiled 30mer reads from the concatenated individual tTRE haplotype DNA sequences and assessed their unique mappability. We mapped the reads back to the hg19 reference genome and to the two tTRE haplotype sequences from the same individual. We removed three types of non-uniquely mappable tiled 30mers: 1) 30mers from the tTRE reference that mapped to more than 1 position in the hg19 genome allowing up to 2 mismatches -removes PRO-cap reads that are originally The PRO-cap reads were aligned to the two tTRE haplotype sequences of the same individuals separately allowing only perfect matches. Reads with the same UMIs mapped to the same coordinates are collapsed to a single read. We removed reads whose 5′ and 3′ ends are both within masked non-mappable regions. Allele specific PRO-cap reads will be mapped only once to the heterogeneous variant sites of one of the two tTRE haplotype sequences, while non-allele specific PRO-cap reads will be mapped to both haplotypes. We extracted reads that are covering the heterogeneous variant sites as the allele specific reads. We then generated total read counts and allele specific read counts for each haplotype for every tTRE regions (0 -+250 relative to the tTRE midpoints for the plus strand, -250 -0 relative to the tTRE midpoints for the minus strand). Total read counts were half the sum of read counts on haplotype 1, read counts on haplotype 2, allele specific read counts on haplotype 1, and allele specific read counts on haplotype 2.
Validation of tiQTLs using allele specific expression. For the allele specific analysis, we used 28,118 tiQTLs (fdr<0.1) whose SNP sites are within -250 -+250 bp from the tTRE midpoints.
Of these sites, we used the 5,317 sites that have more than 90 allele specific PRO-cap reads, and plotted the alternative allele read fractions (y-axis) as a function of the tiQTL effect sizes (x-axis) ( Supplementary Fig. 4A).
Validation of tiQTL and diQTL using WASP. We used WASP 1 mappability filtering pipeline as described in the tool suite (https://github.com/bmvdgeijn/WASP). Since WASP filtering was a computationally intense procedure, we applied WASP filtering test only on the reads that intersect with the tiQTL or diQTL SNPS, which were 34.2 million and 9.98 million reads respectively from the 67 LCL PRO-cap data rather than all the 1.4 billion reads. After mapping back to the genome post-filtering by WASP, we calculated the re-mapping fraction of the tiQTL and diQTL mapped reads per each individual sample. We also applied WASP filtering on all the reads that mapped to chromosome 22 in our pipeline, and re-tested the tiQTL associations at the same adjusted p-values thresholds to evaluate the re-discovery rate.
Average profiles of tiQTL effects. We selected the tiQTLs most likely to be causal variants: These tiQTLs 1) are significant at FDR<0.1 in 2 kb analysis, 2) the lead SNP is located in the target tTRE midpoint ±40 bp region, 3) and the lead SNP is 10 times more significantly associated (p-value 10 times smaller) than the 2nd SNP (n=1,226). We extracted RPKM normalized PRO-cap read profiles from ±1 kb regions around the SNPs for each individual, assigned them according to the individual's genotype class (high-activity/heterogeneous/lowactivity), and averaged the genotype class both the plus and the minus strands. The averaged profiles were fit to smoothing splines (R smooth.spline function with spar=0.3, Supplementary   Fig. 4B).
Distance-matched bootstrapping of eQTLs. SNPs were sampled from the core region and either the out or NCNC region, maintaining an equivalent distribution of distances from the tTRE midpoint between the samples. The resulting sample sizes are N = 5088 SNPs for core vs.
out and N = 4128 for core vs. NCNC and samplings are repeated 500 times. Each time the proportion of SNPs that pass the fdr < 0.05 threshold for gene expression association is computed (eQTL proportion). The mean and standard deviation of the 500 eQTL proportions was calculated, and a bootstrap-estimated P-value was computed by the fraction of the 500 eQTL proportions for which the core had a higher proportion than the mean for the comparison group (out or NCNC).