A large-scale binding and functional map of human RNA-binding proteins

Many proteins regulate the expression of genes by binding to specific regions encoded in the genome1. Here we introduce a new data set of RNA elements in the human genome that are recognized by RNA-binding proteins (RBPs), generated as part of the Encyclopedia of DNA Elements (ENCODE) project phase III. This class of regulatory elements functions only when transcribed into RNA, as they serve as the binding sites for RBPs that control post-transcriptional processes such as splicing, cleavage and polyadenylation, and the editing, localization, stability and translation of mRNAs. We describe the mapping and characterization of RNA elements recognized by a large collection of human RBPs in K562 and HepG2 cells. Integrative analyses using five assays identify RBP binding sites on RNA and chromatin in vivo, the in vitro binding preferences of RBPs, the function of RBP binding sites and the subcellular localization of RBPs, producing 1,223 replicated data sets for 356 RBPs. We describe the spectrum of RBP binding throughout the transcriptome and the connections between these interactions and various aspects of RNA biology, including RNA stability, splicing regulation and RNA localization. These data expand the catalogue of functional elements encoded in the human genome by the addition of a large set of elements that function at the RNA level by interacting with RBPs.


RBP expression in tissues
Tissue specificity was measured as the entropy deviation from a uniform distribution among all tissues as in 1  Most RBPs showed low specificity with broad expression across tissues ( Supplementary   Fig. 1a), with the 10 least specific including many ubiquitously expressed basic RNA processing factors (e.g. ribosomal proteins RPS11, RPS24, and RPL23A, translation machinery factors EIF4H and EEF2, and broad regulators of RNA processing including HNRNPC and HNRNPA2B1) ( Supplementary Fig. 1b, left). However, a small subset showed highly specific expression, including some with greatly increased expression in the K562 and HepG2 cancer  Fig. 1b, right).

Identification of RBP-dependent gene expression and splicing changes
For analyses in this manuscript, each knockdown/RNA-seq dataset was compared against a non-target shRNA control infected and with library preparation performed within the same experimental batch to minimize potential batch effects in single-RBP comparisons.

Single-RBP Gene Expression Analysis
Significant gene expression changes were determined using DESeq and Cuffdiff as described in Methods, and yielded as few as 1 to as many as 5905 significantly altered genes

Single-RBP Alternative Splicing Analysis
Significant alternative splicing changes were determined using rMATS as described in Methods, and yielded as few as 3 to as many as 35,000 (the RNA helicase and spliceosomal protein AQR 4 in K562 cells) significantly altered alternative splicing events (Supplementary Fig. 4

Batch correction for global analysis of knockdown/RNA-seq datasets
Supplementary Information for Van Nostrand et al.

Page S4
As alterations in gene expression or splicing patterns due to experimental batches are a common problem in large-scale genomics efforts, to enable further integrated analyses we further performed the following batch correction methods to enable cross-RBP studies of the entire knockdown/RNA-seq data resource:

Batch Correction for Gene Expression Analysis
Gene expression batch effects were reduced with ComBat 5 . The HepG2 and K562 samples were normalized separately. Genes whose expected counts were 0 on more than 80% samples of the set were filtered out prior to normalization. After ComBat normalization and Quantile

Experimental quality control of eCLIP experiments
Library complexity, defined as the number of unique molecules recovered during preparation of high-throughput sequencing library, is a critical measurement of experimental quality that effects many aspects of data quality and analysis, determines whether the resulting read density is quantitative, and whether PCR biases and other artifacts will be introduced.
Although library complexity is easily empirically calculated after sequencing and data processing, a quantitative metric for library complexity that can be applied prior to sequencing would enable rapid culling of poor quality experiments and could help guide a desired sequencing depth by estimating an upper bound on the number of recovered RNA fragments.
We previously introduced the extrapolated CT (eCT) metric that estimates the number of PCR cycles needed to obtain sufficient material for sequencing. This metric had appealing characteristics, as it was RBP-specific, showed high correlation with PCR duplication rate, and could be directly compared against eCLIP experiments performed with IgG isotype controls or antibodies in null cell lines 6,7 .
However, although the initial eCT calculation assumed an idealized 2-fold amplification rate per PCR cycle, we observed that this rate is frequently lower in practice. To properly estimate PCR efficiency during eCLIP, we noted that at our standard sequencing depths some experiments had saturated the discovery of unique fragments, which enabled us to accurately estimate the total number of pre-PCR unique fragments for these datasets. Using 6 datasets with a PCR duplication rate of greater than 90%, we observed that the best fit between the number of observed unique fragments and the estimated number of unique fragments occurred at a PCR efficiency of 1.84 (Supplementary Fig. 8a-b). We therefore defined an accurate-eCT (a-eCT) as the eCT calculated with 1.84-fold amplification per cycle instead of 2-fold.
To validate the a-eCT metric, we considered datasets that were beginning to saturate (PCR duplication rate greater than 60%). We observed that a-eCT showed strong predictive power for the number of unique RNA fragments observed (R 2 = 0.46, p < 7.1 × 10 -38 ) ( Supplementary Fig. 8c), an improvement on the prior eCT metric (MSE 0.19 versus 0.86), confirming that a-eCT provides a robust estimate of library complexity ( Supplementary Fig. 8d).
Thus, a-eCT enables prediction of unique fragments prior to sequencing and indicates that eCLIP of distinct RBPs can yield a range from hundreds of thousands to billions of unique fragments ( Supplementary Fig. 8e).
Supplementary Information for Van Nostrand et al.

Page S6
Next, we compared a-eCT against a manual annotation of experiment quality. We observed that experiments that pass manual quality assessment had significantly lower a-eCT values than experiments that failed manual quality assessment with mean a-eCTs of 13.3 versus 14.4 respectively ( Supplementary Fig. 8f, students two-sided t-test; p < 10 -7 ). Low a-eCT (corresponding to a highly complex library) did not always indicate high-quality eCLIP datasets, with failures due to poor reproducibility, lack of significant binding signal, and other failure modes. However, a high a-eCT value was a strong predictor of failure, typically due to a lack of the required number of unique fragments to produce reproducible peaks. To establish a maximum a-eCT threshold beyond which data are unreliable, we observed that the mean a-eCT for IgG control eCLIP experiments (which only immunoprecipitate background RNA) was 19.6.
With that threshold applied, 21 out of 24 datasets with an a-eCT > 19.6 also independently failed manual QC. In all datasets examined no successful experiment had an a-eCT > 20.7, while there were still 9 experiments that did not pass manual quality control that had a higher a-eCT ( Supplementary Fig. 8f).
In total, 331 out of 400 (83%) experiments had higher yield than this IgG-only value in both replicates, indicating successful immunoprecipitation of protein-bound RNA in the majority of experiments ( Supplementary Fig. 8g). As we did observe a small number of high-quality datasets with a-eCT values above this cutoff (typically RBPs with high specificity for a single or small number of RNA transcripts, such as SLBP for histone RNAs), in later experiments we queried those with high a-eCT values with low-depth sequencing prior to full analysis and abandoned 36 such experiments which showed no observable binding specificity, leaving 349 datasets for analysis (Extended Data Fig. 1c).

eCLIP -data processing and peak identification
Primary data analysis of eCLIP data (including adapter trimming, read mapping, cluster identification, and input normalization) was performed as previously published 6  step. Next, potential adapter sequences were removed using cutadapt (v1.8.1), performed in two steps to properly remove non-full length adapter sequences we observed to drive artifact peak identification. At this step, reads with less than 18 bases were removed from further analysis. Next, we mapped reads using STAR (v2.4.0i) 8 against a database of repetitive elements (derived from RepBase (v18.05) 9 with the addition of elements including the 45S ribosomal RNA precursor), and removed reads with identified mapping (an independent method was derived to quantify mapping to repetitive elements, described in the next section). Reads were then mapped against the human genome using STAR (v2.4.0i), requiring unique mapping (all analyses described in this manuscript used mapping to GRCh37 and GENCODE v19 annotations, but mapping to GRCh38 and GENCODE v24 annotations were also deposited at the ENCODE portal). PCR duplicate reads were then identified as those with the same mapped start position and unique molecular identifier and were removed using custom scripts to obtain unique fragments. Read clusters were identified using CLIPper 10 , which applies spine-fitting to identify clusters of enriched read density above local, transcript (both pre-mRNA and mRNA), and whole-genome background. Whole reads were used for this broad analysis of RBPs as the use of read 3′ ends only can cause decreased signal-to-noise for the subset of RBPs that do not crosslink in close proximity to their RNA motif 11 . However, we note that future re-analyses restricting to only read 3′ ends likely will provide increased resolution for motif and splicing regulatory map analyses for some RBPs. Finally, clusters identified in IP samples were compared against paired size-matched input to obtain significantly enriched peaks using a To identify reproducible and significantly enriched peaks across biological replicates, we used a modified Irreproducible Discovery Rate (IDR) method ( Supplementary Fig. 9c). IDR requires that peaks are ranked by an appropriate metric, but we found undesirable results ranking peaks by either significance (due to the dependence on underlying expression) or foldenrichment (due to the large variance of fold-enrichment when few reads are observed in input).
Thus, we adapted relative entropy to better estimate the strength of binding in IP relative to input by defining the relative information content of a peak as , where p i and q i are the fraction of total reads in IP and input respectively that map to peak i. To confirm that this clusters were reproducible when we ranked by information content (Supplementary Fig. 9d).
Given the increased number of reproducible clusters detected, we used information content to perform standard IDR analysis to identify reproducibly bound regions 12 . We then identified the set of non-overlapping peaks from both replicates that maximized information content to define a final set of reproducibly enriched peaks that corresponded to CLIPper-identified regions ( Supplementary Fig. 9c). We defined reproducible and significant peaks to be the set of replicate-merged peaks that met an IDR cutoff of 0.01 as well as p-value ≤ 0.001 and foldenrichment ≥ 8 (using the geometric mean of log 2 (fold-enrichment) and the least significant pvalue between the two biological replicates). This method revealed that an average of 53.1% of peaks identified as significantly enriched in individual replicates were significant and reproducible, indicating high reproducibility for most experiments (Extended Data Fig. 1g).
Furthermore, the number of reproducible peaks identified upon profiling the same RBP in K562 and HepG2 cells was highly correlated, providing further validation that this approach reproducibly captures RBP-specific signal (Extended Data Fig. 1h).

Automated QC Metrics to verify data quality
We next developed a set of metrics to assess the quality of ENCODE eCLIP experiments in an automated manner. We ultimately arrived at two metrics for individual replicates (a minimal unique fragment cutoff, and a "total information in peaks" cutoff) as well as a third metric to assess reproducibility across the two biological replicates (Extended Data Fig. 1d). To evaluate these metrics, we used manual quality assessment of datasets to define a reference set of highand low-quality eCLIP datasets.
The number of unique fragments per dataset varies widely, depending on library complexity and sequencing depth (as described above). We observed that a required number of 1.5M unique fragments maximized the predictive power for datasets passing manual quality assessment (f-score = 0.79) ( Supplementary Fig. 10a). Page S9 more than 7-fold more likely to fail manual quality assessment ( Supplementary Fig. 10b).
Conversely, 30 of 222 manually failed datasets do not meet the criteria (Supplementary Fig.   10c).
Next, we considered a metric based on whether the dataset contained significant binding signal. As described above, we observed that the relative information of a peak better captures the binding information of peaks across genes with widely varying expression levels. Thus, to validate that a dataset contains significant binding information, we calculated the sum of relative information across all peaks in the dataset. We observed that this total information content score maximized the f-score of manually annotated high-and low-quality datasets at a total information content of 0.042 bits (f-score = 0.81) (Supplementary Fig. 10d). The information content model was more accurate (AUC = 0.71) ( Supplementary Fig. 10e, accurately classifying 63% of ENCODE datasets with 0.36 specificity and 0.93 sensitivity ( Supplementary Fig. 10f).
Next, we developed criteria to assay biological reproducibility, using two metrics based upon the Irreproducible Discovery Rate (IDR) approach that has previously been used to assay reproducibility of ChIP-seq peaks: reproducibility between real and pseudo-replicates (Rescue Ratio) and confirmation that the number of reproducible peaks between both replicates is similar (Self-Consistency Ratio) 16 . We found that cutoffs previously used for ChIP-seq data could be similarly applied to eCLIP 16 , and observed that 81.9% of experiments have a passing rescue ratio of <2 ( Supplementary Fig. 10g) and 71.1% of experiments have a passing self-constancy ratio of <2 ( Supplementary Fig. 10h). 223 experiments pass both thresholds, while 88 are borderline (passing one of the two thresholds), and 38 fail both thresholds ( Supplementary Fig.   10i). Notably, these IDR metrics have high specificity, as 151 out of 196 (77%) of experiments that meet unique fragment and total information content cutoffs and were manually judged to be high quality passed this IDR criteria. In contrast, IDR detects potential false positives by correctly failing 9 out of 56 (16%) datasets that met read depth and information content metrics, but failed manual inspection ( Supplementary Fig. 10j).
Finally, we combined these metrics into one overall automated quality call requiring that each experiment passes minimum read and information content cutoffs as well as either being classified as passing or borderline based on IDR metrics ( Supplementary Fig. 10k). Overall our model accurately classified 77% of eCLIP datasets with a sensitivity of 0.84 and a specificity of 0.62 (Extended Data Fig. 1e), better than any individual classification scheme.

Effect of sequencing depth on eCLIP peak identification
Supplementary Information for Van Nostrand et al.
Page S10 How deeply to sequence a CLIP-seq dataset is a major consideration (particularly at large scale), as samples must be sequenced sufficiently to robustly detect true binding signals while balancing experimental cost. To query whether the ENCODE eCLIP datasets were sequenced to sufficient depth, we considered two questions: first, how does sequencing depth affect identification of true binding sites, and second, how many reads are required to detect binding sites in any gene when accounting for variability in gene expression.
First, we asked whether peaks discovered at deeper sequencing depths were still likely to be biologically relevant. To do this we looked at RBFOX2, which is known to bind to the GCAUG motif. Overall, we observed significant enrichment for RBFOX2 binding to its motif, with 36% of RBFOX2 peaks overlapping the motif versus a mean of 6% of peaks overlapping the motif in all other datasets ( Supplementary Fig. 11a). We then down-sampled the unique genomic fragments, re-called peaks, and asked how many peaks discovered at each downsampling step overlapped the RBFOX2 motif. We observed that peaks discovered using only 10% of unique genomic fragments showed the highest motif overlap (38% on average), whereas peaks that were only discovered when going from 90% to 100% of unique genomic fragments were less likely to contain GCAUG (27% on average) ( Supplementary Fig. 11b).
Although this suggests that signal to noise is highest among the most abundantly covered peaks, we note that later discovered peaks were still 3.0-to 7.4-fold enriched above non-RBFOX2 datasets, indicating they still contain true binding signal ( Supplementary Fig. 11b).
Supporting this, we observed that conservation of later-discovered peaks was similar to those discovered earlier with a mean phastcons conservation score of 0.136 versus 0.132 ( Supplementary Fig. 11c). Considering an independent dataset, PRPF8, we observed similar results when testing its known association with the 5′ splice site: although peaks discovered at low sequencing depth were less enriched for true signal, we continued to see true positive signal throughout the range of down-sampling, indicating that it is true that deeper sequencing allows for the continued discovery of high quality peaks ( Supplementary Fig. 11d-e).
Second, we considered the identification of peaks as a function of transcript abundance.
To explore if there was a correlation between sequencing depth and the discovery of peaks in lowly expressed genes, we calculated the correlation between gene expression and the number of reads in each peak for RBFOX2. We observed that lowly expressed genes had fewer reads per peak (as expected), whereas highly expressed genes displayed a large variation in the number of reads per peak, with only a weak correlation overall for both RBFOX2 (R 2 = 0.03) ( Supplementary Fig. 11f). All other RBPs showed a similar week correlation (mean R 2 = 0.13) ( Supplementary Fig. 11g). Next, we asked whether peaks at lowly expressed genes could be Supplementary Information for Van Nostrand et al.
Page S11 detected at standard sequencing depths. Surprisingly, we found that lowly expressed genes (defined as those with TPM < 1) need on average only 670,000 unique genomic fragments to allow for detection of a peak in the gene, and this estimate was similar when varying the fraction of peaks required to be discovered or TPM thresholds ( Supplementary Fig. 11h-j). As ENCODE eCLIP datasets have a mean sequencing depth of over 4.3 million unique genomic fragments, these results suggest that an inability to detect peaks on lowly expressed genes is not a major concern in eCLIP data sequenced to standard depths.
Our analysis above indicates that continued sequencing until fragment saturation can recover true peaks even at extremely high read depths. However, sequencing until fragment saturation is not typically economically reasonable. Thus, we set out to quantify diminishing returns upon deeper sequencing to identify whether we were observing saturation of detected peaks at typical eCLIP sequencing depths ( Supplementary Fig. 11k-m). First, we developed a metric to quantify the diminishing returns of deeper sequencing in eCLIP datasets. Considering the discovery of significant peaks, we queried how many peaks were newly discovered when comparing peaks observed when 90% or 100% of fragments in a dataset were used to identify peaks. We observed that 67% of experiments passing manual QC saturated the discovery of significant peaks (defined as the discovery of fewer than 5% new peaks in the above metric), suggesting that simple peak detection was saturating for most but not all high-quality datasets ( Supplementary Fig. 11k).
Next, we considered whether binding information by total information content was saturating even when peak discovery was not. Summing the total information content across all peaks, we observed that information recovered saturated for 97% of manually accepted datasets (using the same 5% or less discovery metric between 90% and 100% of fragments used to call peaks in a dataset) ( Supplementary Fig. 11k). Exploring downsampling experiments further, we found that 90% of all eCLIP datasets that passed manual quality assessment had saturated information discovery by 8.5M unique fragments (corresponding to 4.3M unique genomic fragments) ( Supplementary Fig. 11l-m). Thus, these results suggest that although additional peaks can be identified, the majority of peak information content is already captured at current sequencing depths for the majority of eCLIP experiments described here.

Transcriptome coverage of eCLIP and knockdown/RNA-seq events
Considering the total coverage of gene expression changes across all 472 knockdown/RNA-seq datasets, observed that 20,542 genes were differentially expressed in at least one knockdown experiment, including 92.1% of genes expressed in both cell types and 91.8% of those Page S12 expressed in at least one of the two (Extended Data Fig. 2g & Supplementary Fig. 12a-b).
Similarly, 17,839 genes had a peak in at least one eCLIP dataset, including 84.2% of genes expressed in both cell types and 92.0% of those expressed in at least one. However, only 4,889 genes had eCLIP peaks from and were responsive to knockdown of the same RBP (Extended Fig. 2g) Fig. 2i & Supplementary Fig. 12d). While profiling a new RBP identified more novel sites than re-profiling the same RBP in the other (K562 or HepG2) cell type, re-profiling the same RBP in a more distinct cell type (H1 or H9 stem cells) yielded even greater increases, suggesting that many additional RBP binding sites remain to be detected (Supplementary Fig.   12e-g). While these results are consistent with previous work suggesting that RNAs are often densely coated by RBPs 17 , we note that many of the RBPs with large numbers of peaks include proteins that coat or transiently interact with RNAs as part of basic RNA processing functions, such as interaction of spliceosomal components (e.g. U2AF1/2, SF3B4) with the 5′ and 3′ splice sites, RNA Polymerase II component POLR2G with pre-mRNAs, ribosomal subunit RPS3 with translating RNAs, and various others, rather than likely marking alternative RNA processing regulatory sites.

Analysis of eCLIP correlation with knockdown-perturbed splicing changes (splicing maps)
Splicing maps depicting differential eCLIP enrichment at stereotypical positions relative to RBP knockdown-altered splicing changes were generated using the RBP-Maps methodology 11 . First, the set of differentially alternatively spliced events of the desired type (cassette/skipped exons (SE), alternative 5′ splice site (A5SS), or alternative 3′ splice site (A3SS) events were identified ( Supplementary Fig. 13a) Next, for each splicing event, per-position input probability densities were subtracted from IP probability densities to attain position-level enrichment or depletion, for regions extending 50nt into each exon and 300nt into each intron composing the event. Subtracted read densities were then normalized to sum to 1 across each event in order to equally weigh each event, creating tracks referred to as 'Normalized eCLIP enrichment' (Supplementary Fig. 13b).  Fig. 13d). To calculate significance, 1000 random samplings were performed from the native cassette exon set using the number of events in the knockdown-included (or excluded), and Supplementary Information for Van Nostrand et al.
Page S14 significance was set as being either lower than the 0.5 th or higher than the 99.5 th percentile for each position.