High-resolution genome-wide functional dissection of transcriptional regulatory regions and nucleotides in human

Genome-wide epigenomic maps have revealed millions of putative enhancers and promoters, but experimental validation of their function and high-resolution dissection of their driver nucleotides remain limited. Here, we present HiDRA (High-resolution Dissection of Regulatory Activity), a combined experimental and computational method for high-resolution genome-wide testing and dissection of putative regulatory regions. We test ~7 million accessible DNA fragments in a single experiment, by coupling accessible chromatin extraction with self-transcribing episomal reporters (ATAC-STARR-seq). By design, fragments are highly overlapping in densely-sampled accessible regions, enabling us to pinpoint driver regulatory nucleotides by exploiting differences in activity between partially-overlapping fragments using a machine learning model (SHARPR-RE). In GM12878 lymphoblastoid cells, we find ~65,000 regions showing enhancer function, and pinpoint ~13,000 high-resolution driver elements. These are enriched for regulatory motifs, evolutionarily-conserved nucleotides, and disease-associated genetic variants from genome-wide association studies. Overall, HiDRA provides a high-throughput, high-resolution approach for dissecting regulatory regions and driver nucleotides.

Figure legend (a): Evolutionary conservation of driver elements vs. adjacent control regions, shown for all driver elements and for driver elements containing motifs from TFs expressed or not expressed in K562 and HepG2.
Figure legend: Because HiDRA relies on random fragmentation of the genome, fragments carrying different alleles at a SNP might show differential activity due to the position of their ends rather than true allelic activity. In this hypothetical example, a SNP with no true allelic activity is mistakenly called active because the fragment containing the reference allele overlaps a driver element not present in the fragment carrying the alternate allele.
Figure 13: Proportion of reads lost by each processing filter for the HiDRA library.

Supplementary Notes
0) Note on HiDRA insert size: we recommend caution before trying very large fragments (e.g. 800nt and above) in HiDRA or STARR-seq. If sequencing on an Illumina machine, fragments of this size do not cluster efficiently and can lead to poor sequencing results. We tried a library with a wide length distribution of 600nt-1.5kb and found that large fragments (800nt and above) were very poorly represented in plasmid samples. Surprisingly, some of the strongest "active" regions in the RNA output were large fragments (1kb and above); this is likely an artificial signal caused by internal splicing of some large fragments, creating a smaller fragment that is sequenced very efficiently. A more detailed study of this artifact could yield insights into RNA splicing of non-genic regions and the evolution of new genes.
1) We performed 16 ATAC-seq reactions on 50,000 GM12878 cells each. We chose to perform extra ATAC-seq reactions to ensure high library complexity, but performing so many reactions is not necessary if initial cell/tissue amount is an issue.
2) We chose 16 reactions to maintain high library complexity and to allow a low PCR cycle number, based on slide 67 from https://www.broadinstitute.org/files/shared/illuminavids/SamplePrepSlides.pdf. We did not quantify library complexity with fewer reactions, but if reagents or time are an issue, reducing the number of reactions will probably have minimal effect on the library.
3) Mitochondrial fragment depletion was useful; however, for future studies we recommend designing a denser set of gRNAs to achieve greater depletion and save on high-throughput sequencing costs later.

4) In subsequent tests we found that, in our hands, the NEBuilder HiFi DNA Assembly enzyme (NEB #E5520) yielded approximately 8-10X more bacterial colonies per reaction using the same primers described here. Based on the manufacturer's literature, primers with longer homology arms (20-25nt overlaps) should yield even greater efficiency.

5) We recommend using MegaX DH10B T1R cells, or performing extensive tests before changing to a different bacterial strain. In our hands, we experienced substantially lower transformation efficiencies and a greater degree of arcing when using other electrocompetent cells (e.g. NEB 10-beta).
6) The most important consideration for HiDRA library preparation is the expected complexity (number of unique fragments). If the complexity is too low, there will be insufficient fragments in most regulatory regions for high-resolution mapping. If library complexity is too high, more cells may need to be transfected for reliable activity readings, and DNA and RNA libraries will need to be sequenced to greater depth. While developing HiDRA, we were able to generate plasmid libraries with over 30-50 million unique fragments (almost an order of magnitude greater than the data presented here); however, this would require very large sequencing runs. In our experience, library complexity can be controlled at the homology-based cloning step, which acts as the bottleneck. Before proceeding through the time-consuming and expensive transfection, RNA collection, and RNA sequencing steps, we recommend sequencing the plasmid library (on a MiSeq or by spiking into a larger run) to estimate library complexity and the proportion of reads within interesting regions (enhancers, promoters, etc.).
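Estimating library complexity from a shallow sequencing run can be done with a standard species-richness estimator on deduplicated fragment identifiers. The sketch below uses the Chao1 estimator; the identifier format and the choice of Chao1 (the note above does not specify an estimator) are assumptions for illustration.

```python
from collections import Counter

def chao1_complexity(fragment_ids):
    """Estimate total unique-fragment complexity from a shallow run.

    fragment_ids: one identifier per sequenced read pair, e.g. a
    hypothetical "chrom:start-end" string per aligned fragment.
    Chao1: S_obs + f1^2 / (2*f2), where f1/f2 are the numbers of
    fragments seen exactly once/twice.
    """
    counts = Counter(fragment_ids)
    s_obs = len(counts)                              # unique fragments seen
    f1 = sum(1 for c in counts.values() if c == 1)   # singletons
    f2 = sum(1 for c in counts.values() if c == 2)   # doubletons
    if f2 == 0:
        # bias-corrected variant avoids division by zero
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)

# toy example: 6 distinct fragments, 4 singletons, 1 doubleton, 1 tripleton
reads = ["a", "b", "c", "d", "e", "e", "f", "f", "f"]
est = chao1_complexity(reads)  # 6 + 4*4/(2*1) = 14.0
```

In practice the identifiers would come from the deduplicated alignment coordinates of the plasmid-library reads.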

7) As the Qiagen Oligotex mRNA kits are fairly expensive, another option is to synthesize biotin-labelled capture probes against the reporter gene transcript and perform streptavidin bead pulldown, as described by Tewhey et al. (Cell, 2016), modifying the probe sequences to match the sgGFP reporter gene on pSTARR-seq_human.
8) The Superscript III RT manual recommends using no more than 500ng of polyA RNA per reverse transcription reaction. As we are only reverse transcribing a single gene and not the entire transcriptome, we reasoned we could add more polyA RNA per reaction. We selected 2ug of polyA RNA after performing reverse transcription reactions with increasing amounts of RNA, followed by reaction cleanup and 6-cycle PCR to quantify yield.
9) If possible, we recommend balancing libraries either on a MiSeq or by spiking into a sequencing run. We have tried Kapa kits, Bioanalyzer, and Qubit/Nanodrop, and balancing by MiSeq/sequencing is the best option.
10) We always recommend filtering against the ENCODE blacklist regions, especially for ATAC-seq or ATAC-seq-derived libraries, due to the presence of a pseudo-mitochondrial region near the beginning of chr1 that will otherwise substantially affect downstream analyses. Adapter removal is also important if the read length is greater than the minimum fragment size.
11) The experimental approach we used to sequence the RNA output could conceivably also include false-positive results from promoters that have broad, dispersed initiation (i.e., initiation on the STARR-seq plasmid occurring immediately upstream of the inserted fragment rather than at the minimal promoter). In our dataset, promoters that appear able to drive transcription initiation across a broad upstream region (based on CAGE-seq data from the FANTOM consortium) do not show substantial activity (26 possible cases with FDR<0.05 out of ~66,000 active regions); however, this possibility should be considered in future studies using this experimental approach.

A1. Model specification

Basic model
We define a "tiled region" as a contiguous region in which each position is covered by at least one HiDRA fragment. Suppose that a tiled region containing $p$ positions is covered by $n$ fragments. The regulatory activity of each fragment $i$ with length $l_i$, $i \in \{1, \dots, n\}$, is measured by the ratio between the counts of sequenced RNA and DNA. For a design with multiple replicates, the ratio can be calculated from the average counts of RNA and DNA across the replicates. In this paper, we calculated RNA/DNA ratios for each fragment after normalization of RNA and DNA by DESeq2, with the library split into 100nt bins (100-200nt, 200-300nt, etc.). We expect the ratio for a fragment containing one or more functional driver elements to be larger than that of fragments not overlapping a driver element. For the downstream analysis, we use the transformed observation $y_i = \ln(\#\mathrm{RNA}_i/\#\mathrm{DNA}_i)$. For the HiDRA library described in the Methods section, we observed that the empirical distribution of $y_i$ across the whole genome (approximately 4 million fragments after quality control and filtering for minimum expression) is nearly symmetrically centered at zero but with heavy tails that indicate regulatory activity (Supplementary Figure 14).
In HiDRA, the length of a tiled region is generally much larger than the number of fragments ($p \gg n$). The basic idea of SHARPR-RE is to use a shrinkage prior to tackle this large-$p$-small-$n$ problem. We first compute a centered variable $z_i$ for each fragment $i$ by subtracting $\mu_b$, the mean of the background signal (i.e., $z_i = y_i - \mu_b$). The mean of the background signal $\mu_b$ is the average signal intensity from fragments not overlapping a driver element. We estimate $\mu_b$ by the mean of the observations taken from all tiled regions covered by <5 fragments across the whole chromosome, under the assumption that the majority of these tiled regions do not contain a driver element. More specifically, suppose that there are $M$ tiled regions on a chromosome and each tiled region $m$ is covered by $n_m$ fragments, each with an observation $y_{mi}$; then
$$\mu_b = \frac{\sum_{m \in S}\sum_{i=1}^{n_m} y_{mi}}{\sum_{m \in S} n_m},$$
where $S$ is the set of all tiled regions covered by <5 fragments ($S = \{m : n_m < 5\}$).
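The background estimate $\mu_b$ follows directly from this definition. In the sketch below, the data layout (a mapping from tiled-region id to the list of $\ln(\#\mathrm{RNA}/\#\mathrm{DNA})$ observations of its fragments) is hypothetical:

```python
def background_mean(regions, cutoff=5):
    """Estimate mu_b: the average log-ratio over all fragments in
    tiled regions covered by fewer than `cutoff` fragments."""
    vals = [y for obs in regions.values() if len(obs) < cutoff for y in obs]
    return sum(vals) / len(vals)

# hypothetical per-region observations for one chromosome
regions = {
    "r1": [0.1, -0.2, 0.3],           # 3 fragments -> background set
    "r2": [0.0, 0.2],                 # 2 fragments -> background set
    "r3": [1.5, 1.2, 1.8, 1.4, 1.6],  # 5 fragments -> excluded
}
mu_b = background_mean(regions)  # mean of [0.1, -0.2, 0.3, 0.0, 0.2]
```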
Within one tiled region, we assume that $z_i$ (we omit the region index $m$ whenever the formula only involves a specific tiled region) follows an independent normal distribution with mean equal to a scaled sum of the regulatory scores $\beta_j$ at the positions covered by fragment $i$, that is,
$$z \mid \beta \sim N(WX\beta, \Sigma_\varepsilon), \quad (1)$$
where $X \in \{0,1\}^{n \times p}$ is an indicator matrix, i.e., $X_{ij} = 1$ if position $j$, $j \in \{1, \dots, p\}$, is covered by fragment $i$, and $X_{ij} = 0$ otherwise, and $W \in \mathbb{R}^{n \times n}$ is a diagonal matrix for scaling each fragment. Note that this specification of $X$ assumes that each position in the tiled region contributes identically to the regulatory activity measurement of the fragments. If, for example, driver elements at the ends of a fragment contribute less to the regulatory activity, smaller weights can be assigned according to the distance of each position from the middle of the fragment. For the purpose of regularization, we impose an $\ell_2$ penalty on $\beta$, which is equivalent to a normal prior from the Bayesian perspective. Generalizing SHARPR-MPRA [1] from 5nt to 1nt resolution, the regulatory score $\beta_j$ at each position $j$, which is a latent variable, is assigned a univariate normal prior
$$\beta_j \sim N(0, \sigma_\beta^2), \quad (2)$$
where $\sigma_\beta^2$ is a hyper-parameter, which is defined by users and tested for specific values in SHARPR-MPRA [1]. In SHARPR-MPRA, it is assumed that $W_{ii} = 1/l$. Because each fragment has the same length $l$ in SHARPR-MPRA, we end up with $W = I/l$ and $z \sim N(X\beta/l, \Sigma_\varepsilon)$ [1], where $I$ is the identity matrix. In contrast, each fragment in HiDRA has a different length, ranging from 150nt to 500nt. In SHARPR-RE, we choose a uniform scale coefficient $W_{ii} = 1/\bar{l}$, where $\bar{l}$ is the average length of all fragments on the chromosome. Under this modeling of $W$, the signal of a fragment depends only on the sum of the regulatory scores at all positions the fragment covers, but not on the fragment length. $\Sigma_\varepsilon \in \mathbb{R}^{n \times n}$ is a covariance matrix with non-zero diagonal elements equal to $\sigma_\varepsilon^2$, which is set to the sample variance of $y_i$ in SHARPR-MPRA [1]. Thus, the marginal distribution of $z$
after integrating out $\beta$ from (1) follows
$$z \sim N(0,\, WX\Sigma_\beta X'W' + \Sigma_\varepsilon),$$
where $\Sigma_\beta = \sigma_\beta^2 I$ is a diagonal matrix and the prime stands for transpose. Thus, the ridge estimate, or the posterior mean of $\beta$ given the observed $z$, is
$$\hat{\beta} = \Sigma_\beta X'W'(WX\Sigma_\beta X'W' + \Sigma_\varepsilon)^{-1} z. \quad (4)$$
After some rearrangement to merge $\Sigma_\beta$ and $\Sigma_\varepsilon$, we end up with the following equation:
$$\hat{\beta} = (A'A + \lambda I)^{-1} A'z, \quad (5)$$
where $A = WX$ and $\lambda = \sigma_\varepsilon^2/\sigma_\beta^2$.

Selection of penalizing coefficient
Instead of letting $\sigma_\beta^2$, and thus the penalizing coefficient $\lambda$, be defined by users as in Ernst et al. [1], we select $\lambda$ in a data-driven way, because the choice of $\lambda$ substantially affects both the estimates and the performance of the subsequent hypothesis testing procedure, and should therefore be made carefully. Note that although formula (4) is essentially the same as the posterior mean in the Bayesian framework used in SHARPR-MPRA, in SHARPR-RE we instead regard (5) as a ridge estimate under the classical framework. In this case, we only assume that (1) is the true model, in which the $\beta_j$ are parameters rather than random variables, and (2) is used purely for regularization. The choice of $\lambda$ then has significant influence on the estimation of $\beta$: if $\lambda$ is too small, the estimates are unstable, while an overly large $\lambda$ introduces more bias. A handful of strategies have been proposed to select an optimal and stable $\lambda$, including cross-validation [2], the Hoerl-Kennard-Baldwin plug-in method [3-5], and a Markov chain Monte Carlo (MCMC) method [6]. In SHARPR-RE, we select $\lambda$ by following the strategy proposed by Cule and De Iorio (2013), which generalizes the idea of Hoerl et al. (1975) [5] to the large-$p$-small-$n$ problem and shows fast and stable estimation in simulation and real data studies. More specifically, we first perform a singular value decomposition (SVD) of $A = WX$:
$$A = UDV',$$
where $D$ is a diagonal matrix with $q$ non-zero diagonal elements $d_1, \dots, d_q$, and $q \le \min(n, p)$. We select $r^* \in \{1, \dots, q\}$ so that the first $r^*$ principal components explain a given proportion of the total variance, $\sum_{j=1}^{r^*} d_j^2 / \sum_{j=1}^{q} d_j^2$, and compute the principal-component estimate $\hat{\alpha}_{r^*} = D_{r^*}^{-1} U_{r^*}' z$, where $\hat{\alpha}_{r^*}$ is an $r^*$-vector of the first $r^*$ elements of $\hat{\alpha}$, and $U_{r^*}$ is the first $r^*$ columns of $U$.
The penalizing coefficient is then
$$\lambda_{r^*} = \frac{r^* \hat{\sigma}^2}{\hat{\alpha}_{r^*}'\hat{\alpha}_{r^*}},$$
and the estimate of $\beta$ in SHARPR-RE is
$$\hat{\beta} = (A'A + \lambda_{r^*} I)^{-1} A'z,$$
where $\hat{\sigma}^2$ is the residual variance from the principal-component regression with $r^*$ components. For HiDRA datasets, it is often the case that the number of fragments $n$ is much smaller than the length of a tiled region $p$. To make the computation more efficient, we apply the SVD to the hat matrix to avoid the inversion of a large-scale $p \times p$ matrix, so that we have
$$H(\lambda) = A(A'A + \lambda I)^{-1}A' = U\,\mathrm{diag}\!\left(\frac{d_j^2}{d_j^2 + \lambda}\right)U',$$
in which the computation is dramatically faster, as $D$ has at most $n$ non-zero diagonal elements. In the analysis of the example HiDRA library, we observed that this algorithm for selecting $\lambda_{r^*}$ produced stable estimates of the regulatory scores. We also noticed that the algorithm can produce an overly small $\lambda_{r^*}$ if two or more fragments in a tiled region map to almost the same position (differing by only a couple of nucleotides) but have large, opposite values of $\ln(\#\mathrm{RNA}/\#\mathrm{DNA})$. This phenomenon may indicate a potential data problem.
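The data-driven choice of $\lambda$ can be sketched as follows. This is a simplified reading of the Cule-and-De-Iorio-style plug-in: the 90% variance-explained rule for choosing $r^*$ and the residual-variance estimate are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def select_lambda(A, z, var_explained=0.9):
    """Plug-in ridge penalty: generalized Hoerl-Kennard-Baldwin
    estimator lambda = r * sigma^2 / ||alpha_r||^2 (sketch)."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    d = d[d > 1e-12]
    # r*: smallest r whose components explain >= var_explained of variance
    frac = np.cumsum(d**2) / np.sum(d**2)
    r = int(np.searchsorted(frac, var_explained) + 1)
    # principal-component regression with r components
    alpha_r = (U[:, :r].T @ z) / d[:r]
    fitted = U[:, :r] @ (d[:r] * alpha_r)
    resid = z - fitted
    sigma2 = resid @ resid / max(len(z) - r, 1)  # residual variance
    return r * sigma2 / (alpha_r @ alpha_r)
```

On real tiled regions the same SVD can be reused to form the hat matrix $H(\lambda)$ cheaply, as described above.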
Note that this algorithm estimates a unique $\lambda_{r^*}$ for each tiled region, and thus the estimated regulatory scores cannot be compared directly across tiled regions. If comparison across regions is the major concern (e.g., using the estimated regulatory scores as a training set for deep learning models such as convolutional neural networks (CNNs) in other downstream analyses), the studentized estimates $T_j$ can be used (described in the next section).

Accuracy of estimation
To measure the accuracy of the estimates, we compute the pointwise mean squared error (MSE) of $\hat{\beta}$. As we assume that (1) is the true model, $\hat{\beta}$ is a biased estimate of $\beta$ if $\lambda \ne 0$, and the MSE of $\hat{\beta}$ should take into account both variance and bias. That is, we are interested not only in $\mathrm{Var}(\hat{\beta})$ but also in $E(\hat{\beta} - \beta)^2$. Note that the MSE can be decomposed into
$$\mathrm{MSE}(\hat{\beta}) = \mathrm{Var}(\hat{\beta}) + \mathrm{Bias}(\hat{\beta})^2, \quad (6)$$
where $\mathrm{Bias}(\hat{\beta}) = E(\hat{\beta}) - \beta$ measures the bias between the true value of $\beta$ and the mean of $\hat{\beta}$. The bias term is given by
$$\mathrm{Bias}(\hat{\beta}) = -\lambda (A'A + \lambda I)^{-1} \beta. \quad (7)$$
The variance can be shown to be
$$\mathrm{Var}(\hat{\beta}) = \sigma_\varepsilon^2 (A'A + \lambda I)^{-1} A'A (A'A + \lambda I)^{-1}.$$
The true value of $\sigma_\varepsilon^2$ is unknown but can be estimated from the residuals,
$$\hat{\sigma}_\varepsilon^2 = \frac{(z - A\hat{\beta})'(z - A\hat{\beta})}{\nu},$$
where $\nu = n - \mathrm{tr}(2H - HH')$ is the residual degrees of freedom [7] and $\mathrm{tr}()$ stands for the trace. Plugging the ridge estimate (5) into $\beta$ and the sample estimate $\hat{\sigma}_\varepsilon^2$ into $\sigma_\varepsilon^2$ yields estimates of the bias and variance. Pointwise confidence intervals (CIs) can be calculated from $\mathrm{Var}(\hat{\beta})$, e.g., $95\%\ \mathrm{CI} \approx \hat{\beta}_j \pm 1.96\sqrt{\mathrm{Var}(\hat{\beta}_j)}$. Note that the bias term $\mathrm{Bias}(\hat{\beta})$ is non-zero if $\lambda$ or $\beta$ is non-zero; therefore, it is not straightforward to interpret the CIs obtained from $\mathrm{Var}(\hat{\beta})$ alone. Instead, the following adjusted 95% CI has been proposed [8], which adjusts for the bias:
$$95\%\ \mathrm{CI}_{\mathrm{adj}} \approx \hat{\beta}_j - \widehat{\mathrm{Bias}}(\hat{\beta}_j) \pm 1.96\sqrt{\mathrm{Var}(\hat{\beta}_j)}.$$
One problem with the adjusted CI is that the true bias is unknown, and its estimate $\widehat{\mathrm{Bias}}(\hat{\beta})$ might not be accurate.
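The variance, bias, and CI formulas above can be sketched together. A direct $p \times p$ inversion is used for clarity (fine for toy sizes; the SVD form from the previous section would be preferred for real tiled regions), and the toy data are illustrative only:

```python
import numpy as np

def ridge_inference(A, z, lam):
    """Pointwise variance, plug-in bias, and 95% CIs for the ridge
    estimate (equations (5)-(7)); a sketch, not the authors' code."""
    n, p = A.shape
    G = np.linalg.inv(A.T @ A + lam * np.eye(p))  # (A'A + lam I)^-1
    beta = G @ A.T @ z                            # eq. (5)
    H = A @ G @ A.T                               # hat matrix
    resid = z - A @ beta
    nu = n - np.trace(2 * H - H @ H.T)            # residual df
    sigma2 = (resid @ resid) / nu                 # residual variance
    var = sigma2 * np.diag(G @ A.T @ A @ G)       # Var(beta_j)
    bias = -lam * (G @ beta)                      # plug-in bias, eq. (7)
    se = np.sqrt(var)
    ci = np.stack([beta - 1.96 * se, beta + 1.96 * se])
    ci_adj = ci - bias                            # bias-adjusted CI
    return beta, se, bias, ci, ci_adj

# toy tiled region: 3 fragments over 10 positions
X = np.zeros((3, 10))
X[0, 0:5] = 1
X[1, 3:8] = 1
X[2, 5:10] = 1
beta, se, bias, ci, ci_adj = ridge_inference(X / 5.0,
                                             np.array([0.5, 1.0, 0.6]),
                                             lam=0.1)
```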

A2. Identifying high-resolution driver elements
Regional FWER controlling procedure
Given the estimated regulatory scores $\hat{\beta}_j$ for each nucleotide within a specific tiled region, we then aim at finding a regional threshold to declare significant regulatory regions, which we term high-resolution "driver" elements, at which an active motif is located. More specifically, we make an inference for each nucleotide $j$ by testing the hypothesis
$$H_{0j}: \beta_j = 0 \quad \text{vs.} \quad H_{1j}: \beta_j > 0.$$
For this hypothesis testing, we focus only on finding activating regulatory elements, not repressive ones; however, generalization to a two-sided test is straightforward. For a specific tiled region containing $p$ positions, we want to find a cutoff $c^*$ so that the family-wise error rate (FWER) is bounded below a given value $\alpha$ (e.g., 0.05). The value of $\alpha$ can be set differently among different tiled regions. This amounts to a multiple testing problem of performing $p$ one-sided tests of the estimated regulatory scores $\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_p)'$ simultaneously. One way would be to compute a p-value for each $\hat{\beta}_j$ and use the simple Bonferroni correction to obtain a local significance level $\alpha_j = \alpha/p$, from which $c^*$ can be computed. This approach would be overly conservative, as the $\hat{\beta}_j$ are not independent of each other in this case; a more accurate cutoff should take into account the correlation structure of the estimated regulatory scores. On the other hand, performing a permutation test for each tiled region, albeit more accurate, would be too time-consuming for a library comprising the whole genome. Following the strategy described in [9,10], we thus propose a fast multiple testing procedure based on a Gaussian copula to find region-specific cutoffs controlling the FWER at $\alpha$. Note that under the null hypothesis $\beta = 0$, the bias term in (7) disappears. We use the studentized estimate as the test statistic:
$$T_j = \frac{\hat{\beta}_j}{\sqrt{\mathrm{diag}(\widehat{\mathrm{Var}}(\hat{\beta}))_j}},$$
where $\mathrm{diag}()_j$ stands for the $j$th element of the vector of diagonal elements of a matrix. It has been shown that, under the null hypothesis, $T_j$
follows a Student t-distribution and can be approximated by a standard normal distribution for large sample sizes [11,12]. Cule et al. (2011) found through simulation studies that the type I error rate and the statistical power under the normality approximation are comparable to those from permutation tests for a wide range of $\lambda$. This observation motivates us to assume that, under the null hypothesis, $T = (T_1, \dots, T_p)'$ approximately follows a multivariate normal distribution
$$T \sim N(0, R), \quad (8)$$
where $R = \widehat{\mathrm{Var}}(\hat{\beta}) \odot (vv')$ is the corresponding correlation matrix, with $v_j = 1/\sqrt{\mathrm{diag}(\widehat{\mathrm{Var}}(\hat{\beta}))_j}$ and $\odot$ the Hadamard product. In the simulation studies provided in the next section, we investigate the empirical FWER based on this multivariate normal approximation under small-sample and high-dimensional cases. Denote by $F_j(t_j)$ the marginal cumulative distribution function (CDF) of $T_j$, which is continuous. According to Sklar's theorem [13], there exists a unique copula $C: [0,1]^p \to [0,1]$ such that the joint CDF of $T$ can be written as $C(F_1(t_1), \dots, F_p(t_p))$. Hence, for the one-sided test we have
$$\mathrm{FWER} = 1 - P(T_1 \le c_1, \dots, T_p \le c_p) = 1 - C(F_1(c_1), \dots, F_p(c_p)).$$
Under the multivariate normality approximation of (8), we have
$$\mathrm{FWER} = 1 - C^{\mathrm{Gauss}}_{R}(\Phi(c_1), \dots, \Phi(c_p)),$$
where $C^{\mathrm{Gauss}}_{R}(u_1, \dots, u_p)$ is a Gaussian copula with correlation parameter matrix $R$, and $\Phi(c)$ is the CDF of a standard normal distribution. Given a specific value of $\alpha$, there are infinitely many solutions $(c_1, \dots, c_p)$. If we treat each position as equally important and pursue a single-step common-quantile cutoff (Dudoit and van der Laan, 2008, Chapter 4), i.e., $c_1 = \dots = c_p = c$, we can find a unique solution $c^*$ satisfying
$$1 - C^{\mathrm{Gauss}}_{R}(\Phi(c^*), \dots, \Phi(c^*)) = \alpha. \quad (9)$$
So, we reject $H_{0j}$ for the positions in $\mathcal{H} = \{j \in \{1, \dots, p\}: T_j > c^*\}$. The common-quantile cutoff $c^*$ can be calculated, for example, by the function qmvnorm in the R package mvtnorm [14]. A similar idea can also be used to obtain adjusted p-values for controlling regional FWER, as shown in Conneely and Boehnke (2007) [15]. In real data analysis, the estimated covariance matrix $\widehat{\mathrm{Var}}(\hat{\beta})$ is often degenerate, and the estimates of adjacent positions are completely correlated when $p > n$.
Therefore, we trim the number of estimates by selecting one position from each group in which the estimates are completely correlated. This also dramatically reduces the computational intensity of finding the solution to (9). After identifying the driver elements, we can further attempt to pinpoint the most likely location of a 20nt "core" driver element (see section A4 below for the rationale for choosing ~20nt as the estimated "core" region). We predict the center position $j^*$ of a 20nt core driver element as the position with the highest sum of regulatory scores over its 20nt flanking window, i.e.,
$$j^* = \arg\max_j \sum_{k=j-10}^{j+10} \hat{\beta}_k.$$
Supplementary Figure 16 gives an illustration of the significant regulatory region and the predicted motif region. In this example, the true motif is located at positions 400-420nt and is covered by a significant driver element identified by SHARPR-RE (highlighted in red). The predicted core driver region (highlighted in purple) further pinpoints the location of the motif at ~400nt.
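The common-quantile cutoff $c^*$ in (9) can be found by bisection on the equicoordinate probability of a multivariate normal. The paper uses qmvnorm from the R package mvtnorm; the SciPy-based bisection below is an assumed stand-in, not the authors' implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def common_quantile_cutoff(R, alpha=0.05, lo=0.0, hi=6.0, tol=1e-4):
    """Bisection for the common-quantile cutoff c* solving
    1 - P(T_1 <= c, ..., T_p <= c) = alpha, where T ~ N(0, R)."""
    mvn = multivariate_normal(mean=np.zeros(R.shape[0]), cov=R,
                              allow_singular=True)
    while hi - lo > tol:
        c = 0.5 * (lo + hi)
        fwer = 1.0 - mvn.cdf(np.full(R.shape[0], c))
        if fwer > alpha:
            lo = c  # cutoff too low: FWER still above alpha
        else:
            hi = c
    return 0.5 * (lo + hi)

# sanity check: for p = 1 this reduces to the one-sided normal quantile,
# c* ~ 1.645 at alpha = 0.05
c_star = common_quantile_cutoff(np.eye(1))
```

Larger or more weakly correlated regions require a larger cutoff, as expected from (9).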
Global FDR controlling procedure
The above regional procedure calls significant driver elements within a specific tiled region. If we want to identify driver elements across an entire genome, it may be preferable to control the global false discovery rate (FDR). We thus propose a global multiple testing correction procedure for this purpose, taking into account the p-values observed from the whole genome. We first calculate the pointwise p-values for all positions in each tiled region across the genome based on the t-distribution,
$$P_j = P(t_{\nu} > T_j),$$
where $\nu = n - \mathrm{tr}(H)$ is the sample size minus the effective degrees of freedom. As in the regional controlling procedure, we select one position from each consecutive region in which the estimates are completely correlated. Then, we apply the Benjamini-Hochberg procedure to the pointwise p-values to control the global FDR at level $\alpha$. As p-values from different tiled regions are independent, the p-values across the genome can be regarded as dependent only within finite blocks, provided the size of the largest tiled region is bounded; more specifically, we assume that the ratio between the number of tests in the largest tiled region and the total number of tests across the genome is negligible (the number of tests in a tiled region is related to $n_m$ when $p_m \gg n_m$). Under this assumption, which is biologically reasonable, the estimate of the FDR is consistent [16,17].
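The Benjamini-Hochberg step-up procedure applied to the pointwise p-values can be sketched as:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of
    hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    # largest rank whose sorted p-value is <= q * rank / m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])  # reject the k smallest p-values

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.9]
rejected = benjamini_hochberg(pvals, q=0.05)  # indices [0, 1]
```

In the genome-wide setting, `pvals` would hold one p-value per retained (trimmed) position across all tiled regions.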

Simulation settings
We assessed the performance of the proposed SHARPR-RE algorithm in terms of the empirical statistical power estimated from our simulation studies. To mimic the current version of the HiDRA library, we randomly generated a number $n$ of fragments ($n$ between 25 and 100) in a tiled region of length 1kb. The length of each fragment was sampled from a uniform distribution $l_i \sim U(175, 450)$, $i \in \{1, \dots, n\}$. We randomly selected a 20nt driver element from a 400nt window in the middle of the tiled region. For any fragment that covers the driver element, we generated its signal from a normal distribution $N(\mu_{\mathrm{driver}}, \sigma_{\mathrm{driver}} = 0.1)$, where $\mu_{\mathrm{driver}}$ is the true signal, varying across simulation scenarios. For the rest of the fragments, we generated signals from a normal distribution $N(\mu_{\mathrm{noise}} = 0, \sigma_{\mathrm{noise}} = 1)$. We defined the signal-to-noise ratio (SNR) as $\mathrm{SNR} = \mu_{\mathrm{driver}}/\sigma_{\mathrm{noise}}$. We examined the empirical FWER and empirical statistical power under different SNRs and numbers of fragments. Under each simulation setting, we generated 500 replicates to obtain estimates of the empirical FWER and statistical power.
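These settings can be sketched as a small generator. The exact placement convention for the 400nt central window (positions 300-700 of a 1kb region is assumed here) and the seed handling are illustrative assumptions:

```python
import random

def simulate_tiled_region(n_frags, snr, region_len=1000, seed=0):
    """Generate one simulated tiled region: fragment lengths ~ U(175, 450),
    a 20nt driver element in the central 400nt window, driver-covering
    fragments ~ N(snr, 0.1), all other fragments ~ N(0, 1)."""
    rng = random.Random(seed)
    driver_start = rng.randint(300, 680)  # 20nt element inside 300-700
    fragments = []
    for _ in range(n_frags):
        length = rng.randint(175, 450)
        start = rng.randint(0, region_len - length)
        covers = start < driver_start + 20 and driver_start < start + length
        signal = rng.gauss(snr, 0.1) if covers else rng.gauss(0.0, 1.0)
        fragments.append((start, start + length, signal))
    return driver_start, fragments
```

Running this for each (n_frags, SNR) combination over 500 seeds reproduces the grid of settings described above.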

Evaluation of empirical type I error rate
Our results in Table 1 show that the empirical regional FWER was generally controlled at ~5%, the theoretical FWER, when the number of fragments was above 50. We observed mild inflation of the empirical FWER, especially for small sample sizes (e.g., 25), but the inflation diminished as the sample size increased in most situations. This inflation can be attributed to the discrepancy between the true null distribution of the statistics and the asymptotic multivariate normal distribution at the tails, as shown in [18]. This indicates that the error introduced by the multivariate normality approximation should be taken into account when the sample size is very small (for example, by using a scaling procedure similar to that proposed in [18], or by setting a more stringent cutoff for tiled regions covered by a small number of fragments).

Evaluation of empirical statistical power
Next, we examined the statistical power for pinpointing a driver element under the condition $\alpha = 5\%$, i.e., FWER < 5%. In this investigation, a true positive is counted if an identified driver element or a predicted 20nt core driver region overlaps the true driver element region. The results in Supplementary Figure 17 show that the statistical power for both regions consistently increases with the number of fragments and the SNR. If there are 100 fragments in a tiled region, SHARPR-RE can achieve more than 80% power at SNR = 1. When the number of fragments is small (e.g., 25), SNR > 1.5 is needed to achieve 80% power. Higher SNR requires biological experiments with higher precision and sensitivity, so that significantly more RNA molecules are sequenced when a DNA region covers a true driver element.

A4. Analysis of a HiDRA library
We applied SHARPR-RE to a HiDRA library prepared from the GM12878 lymphoblastoid cell line. The library contains 3,896,416 fragments after quality control, with fragment lengths ranging from 100-600nt (99% of fragments between 168-473nt). We first identified 645,936 tiled regions covered by at least two fragments across the whole genome, among which 28,092 regions were covered by more than 10 fragments. The distribution of the signals ($\ln(\#\mathrm{RNA}/\#\mathrm{DNA})$) of these fragments is almost symmetrically centered at zero with heavy tails (Supplementary Figure 14). The average and the variance of the signals are constant across the length of HiDRA fragments after normalization (Materials and Methods, Supplementary Figure 15).
We estimated the regulatory scores for the 22 chromosomes separately and called driver elements using a cutoff controlling the regional FWER at <0.05 for the positions in each tiled region. We found that tiled regions covered by larger numbers of HiDRA fragments were more likely to have a driver element called, likely reflecting a combination of greater statistical power and enrichment for regions more likely to contain drivers.
As shown in Figure 5C, most driver elements are found within active TSS, TSS Flanking Upstream and active enhancer chromatin states. The median size of driver elements identified from the tiled regions covered by >10 fragments was 52nt after filtering to remove drivers smaller than 5nt. The average size of drivers decreased with an increase in number of fragments in a tiled region, suggesting that more complex libraries with greater numbers of unique fragments should be able to detect shorter driver elements (Supplementary Figure 9). The average size of a driver element converges to ~18nt after the depth of unique HiDRA fragment coverage reaches 50 fragments/kb (Supplementary Figure 9).
Supplementary Figure 15: A): the distribution of fragment signals with respect to fragment length. In the plot, fragment length is categorized into five groups. B): the distribution of fragment length with respect to the size of the tiled region in which the fragment is located. The size of a tiled region is defined by the number of fragments in the tiled region. The plots are based on the library described in the main text, in which fragment lengths range between 100-600nt (with 99% between 168-473nt).
Supplementary Figure 16: An example of estimated regulatory scores from a simulated tiled region of 1200nt. The data are generated within a 1200nt tiled region with 50 unique HiDRA fragments ranging from 175nt to 450nt. The significant regulatory region (FWER<5%) is highlighted in red. The predicted motif region is highlighted in purple. The yellow dashed lines are the estimated scores ± MSE.
Supplementary Figure 17: Empirical statistical power for different numbers of fragments and different SNRs. Y-axis: empirical power (%). X-axis: signal-to-noise ratio, defined by $\mathrm{SNR} = \mu_{\mathrm{driver}}/\sigma_{\mathrm{noise}}$. The red bars are the empirical power for the predicted 20nt core driver element.
The blue bars are the empirical power for the identified drivers with a significant regulatory score based on a regional FWER=5%.