A Large-Scale Binding and Functional Map of Human RNA Binding Proteins

Genomes encompass all the information necessary to specify the development and function of an organism. In addition to genes, genomes also contain a myriad of functional elements that control various steps in gene expression. A major class of these elements function only when transcribed into RNA as they serve as the binding sites for RNA binding proteins (RBPs), which act to control post-transcriptional processes including splicing, cleavage and polyadenylation, RNA editing, RNA localization, stability, and translation. Despite the importance of these functional RNA elements encoded in the genome, they have been much less studied than genes and DNA elements. Here, we describe the mapping and characterization of RNA elements recognized by a large collection of human RBPs in K562 and HepG2 cells. These data expand the catalog of functional elements encoded in the human genome by addition of a large set of elements that function at the RNA level through interaction with RBPs. Highlights 223 eCLIP datasets for 150 RBPs reveal a wide variety of in vivo RNA target classes. 472 knockdown/RNA-seq profiles of 263 RBPs reveal factor-responsive targets and integration with eCLIP indicates RNA expression and splicing regulatory patterns. 78 RNA Bind-N-Seq profiles of in vitro binding motifs reveal links between in vitro and in vivo binding and indicate that eCLIP peaks that contain in vitro motifs are more strongly associated with regulation. 274 maps of RBP subcellular localization by immunofluorescence indicate widespread organelle-specific RNA processing regulation. 63 ChIP-seq profiles of DNA association suggest broad interconnectivity between chromatin association and RNA processing.


Introduction
RBPs have emerged as critical players in regulating gene expression, controlling when, where, and at what rate RNAs are processed, trafficked, translated, and degraded within the cell. They represent a diverse class of proteins involved in co-and post-transcriptional gene regulation 1,2 .
RBPs interact with RNA to form ribonucleoprotein complexes (RNPs), governing the maturation and fate of their target RNA substrates. Indeed, they regulate numerous aspects of gene expression including pre-mRNA splicing, cleavage and polyadenylation, RNA stability, RNA localization, RNA editing, and translation. In fact, many RBPs participate in more than one of these processes. For example, studies on the mammalian RBP Nova using a combination of crosslinking and immunopreciptation (CLIP)-seq and functional studies revealed that Nova not only regulates alternative splicing, but also modulates poly(A) site usage 3 . Moreover, in contrast to regulation at the transcriptional level, post-transcriptional regulatory steps are often carried out in different sub-cellular compartments of the nucleus (e.g. nucleoli, nuclear speckles, paraspeckles, coiled bodies, etc.) and/or cytoplasm (e.g. P-bodies, endoplasmic reticulum, etc.) by RBPs that are enriched within these compartments. These regulatory roles are essential for normal human physiology, as defects in RBP function are associated with diverse genetic and somatic disorders, such as neurodegeneration, auto-immune defects, and cancer [4][5][6][7][8][9][10] .
Traditionally, RBPs were identified by affinity purification of single proteins 11,12 . However, several groups have recently used mass spectrometry-based methods to identify hundreds of proteins bound to RNA in human and mouse cells [13][14][15][16] . Recent censuses conducted by us and others indicate that the human genome may contain between 1,072 (ref. 17) and 1,542 (ref. 1) RBPencoding genes. This large repertoire of RBPs likely underlies the high complexity of posttranscriptional regulation, motivating concerted efforts to systematically dissect the binding properties, RNA targets and functional roles of these proteins.
The dissection of RBP-RNA regulatory networks therefore requires the integration of multiple data types, each viewing the RBP through a different lens. In vivo binding assays such as CLIP-seq provide a set of candidate functional elements directly bound by each RBP.
Assessments of in vitro binding affinity help understand the mechanism driving these interactions, and (as we show) improve identification of functional associations. Functional assays that identify targets whose expression or alternative splicing is responsive to RBP perturbation can then fortify evidence of function. For example, observation of protein binding by CLIP-seq within introns flanking exons whose splicing is sensitive to RBP levels in RNA-seq provides support for the RBP as a splicing factor and for the binding sites as splicing regulatory Van Nostrand et al. 4 elements. In vivo interactions of RBPs with chromatin can also be assayed to provide insight into roles of some RBPs as transcription regulators and can provide evidence for cotranscriptional deposition of RBPs on target RNA substrates. The regulatory roles of RBPs are also informed by the subcellular localization properties of RBPs and of their RNA substrates.
Furthermore, these data resources comprised of multiple RBPs profiled using the same methodology may be integrated to enable the identification of factor-specific regulatory modules, as well as a factor's integration into broader cellular regulatory networks, through novel integrated analyses.
Van Nostrand et al. 5

Overview of data and processing
To develop a comprehensive understanding of the binding and function of the human RBP repertoire, we used five assays to produce 1,076 replicated datasets for 352 RBPs (Fig. 1, Supplementary Table 1). The RBPs characterized by these assays have a wide diversity of sequence and structural characteristics and participate in diverse aspects of RNA biology (Fig.   1). Functionally, these RBPs are most commonly known to play roles in the regulation of RNA splicing (96 RBPs, 27%), RNA stability and decay (70, 20%), and translation (69, 20%), with 161 RBPs (46%) having more than one function reported in the literature. However, 82 (23%) of the characterized RBPs have no known function in RNA biology other than being annotated as involved in RNA binding (Fig. 1). Although 57% of the RBPs surveyed contain wellcharacterized RNA binding domains (RNA recognition motif (RRM), hnRNP K homology (KH), zinc finger, RNA helicase, ribonuclease, double-stranded RNA binding (dsRBD), or pumilio/FBF domain (PUM-HD)), the remainder possess either less well studied domains or lack known RNA-binding domains altogether (Fig. 1). Each of the five assays used focused on a distinct aspect of RBP activity: Transcriptome-wide RNA binding sites of RBPs: We identified and validated hundreds of immunoprecipitation-grade antibodies that recognize human RBPs 17 and developed enhanced CLIP (eCLIP) followed by sequencing to efficiently identify RNA targets bound by these RBPs at scale 18 . We performed eCLIP to profile 97 RBPs in K562 cells and 84 RBPs in HepG2 cells, for a total of 126 RBPs (of which 55 were characterized in both cell types). This effort identified 717,765 significantly enriched peaks (relative to size-matched input controls for each RBP) that cover 16.5% of the annotated mRNA transcriptome and 2.4% of the pre-mRNA transcriptome In vitro RBP binding motifs: To identify the in vitro RNA sequence and structural binding preferences of RBPs, we developed a high-throughput version of RNA Bind-N-Seq (RBNS) 19 that assays binding of purified RBPs to pools of random RNA oligonucleotides. This effort identified the binding specificities of 78 RBPs. Short oligonucleotides of length k=5 (kmers) with high RBNS affinity clustered into a single motif for about half of the RBPs assayed (37/78). The remaining RBPs had more complex patterns of binding, best described by two motifs (32/78), or even three or more motifs (9 RBPs). These data also indicate that many RBPs are sensitive to the sequence and RNA structural context in which motifs are embedded. To facilitate integrated analyses, all data for each data type were processed by the same data processing pipeline, and consistent, stringent quality control metrics and data standards were uniformly applied to all experiments. Although only 8 RBPs were investigated using all five assays, 249 of the 352 RBPs (71%) were studied using at least two different assays and 129 (37%) were subjected to at least three different assays, providing opportunities for integrated Van Nostrand et al. 7 analysis using multiple datasets. As an example of how these complementary datasets provide distinct insights into RNA processing regulation, we considered the Vascular Endothelial Growth Factor A gene (VEGFA) (Fig. 2a). Although the RBPs IGF2BP1 and IGF2BP3 both showed significant binding to the VEGFA 3'UTR, knockdown of IGF2BP1 increased VEGFA mRNA levels, while knockdown of IGF2BP2 decreased them, pointing to opposing regulatory effects.
Knockdown of HNRNPK also yielded decreased VEGF mRNA, as well as a significant decrease in inclusion of exon 6. This splicing event is likely directly regulated by HNRNPK, as the flanking introns contain multiple hnRNP K eCLIP peaks, some of which contain the preferred binding motif for HNRNPK (GCCCA) identified in RBNS experiments. Similar integrated analysis can provide insight into mechanisms of cryptic exon repression, illustrated by HNRNPL binding to a region downstream of a GTPBP2 cryptic exon that contains repeats of HNRNPL's top RBNS motif, contributing to production of GTPBP2 mRNA with a full-length open reading frame (Extended Data Fig. 1).
The scale of data available enabled us to query the degree to which we have saturated the discovery of RBP binding sites on RNA and RBP-associated RNA processing events. In total, 14,281 genes were differentially expressed in at least one knockdown experiment (Fig. 2b), including 72.8% of genes expressed in both cell types and 71.2% of those expressed in at least one of the two (Extended Data Fig. 2a-b). Similarly, 11,211 genes were bound in at least one eCLIP dataset, representing 86.1% of genes expressed in both cell types and 75.6% of those expressed in at least one (Fig. 2b, Extended Data Fig. 2a-b). However, only 2,827 genes were both bound by and responsive to knockdown of the same RBP, suggesting that a large fraction of knockdown-responsive expression changes result from indirect effects and that many binding events do not directly affect gene expression levels in the conditions assayed here (Fig. 2b, Extended Data Fig. 2a-b). Similar analysis of alternative splicing changes revealed that differentially spliced events were saturated to a lesser degree than differentially expressed genes. (Greater variability in the cumulative number of unique events resulted from the large number of splicing changes in one knockdown dataset, the RNA helicase and spliceosomal protein AQR 22 (Extended Data Fig. 2c).) Considering RBP binding, we observed a total of 23.4 Mb of annotated pre-mRNA transcripts covered by at least one reproducible RBP binding site, representing 9.1 Mb of exonic and 13.4 Mb of intronic sequence (Fig. 2c). This total represents only 1.5% of annotated intronic sequence (1.1% of distal intronic, 2.5% of proximal intronic, and 8.9% of splice site), but 16.5% of annotated exonic sequences (22.9% of 5' UTR, 19.3% of CDS, and 11.4% of 3' UTR, respectively) were covered by at least one peak (Fig. 2d). Restricting our analysis to only genes Van Nostrand et al. 8 expressed (TPM>1) in both cell types resulted in substantial increases in these percentages (Extended Data Fig. 2d). We found that profiling a new RBP often resulted in greater increases in covered bases of the transcriptome than did re-profiling the same RBP in another cell line, with the marginal 181st dataset averaging a 0.38% (for newly profiled RBPs) or 0.32% (for RBPs profiled in a second cell type) increase (Extended Data Fig. 2e-g). We additionally observed an average 1.4% increase in covered bases when we added eCLIP datasets from H1 and H9 embryonic stem cells for RBFOX2 and IGF2BP1-3 (all of which were also profiled in either K562 or HepG2), suggesting that substantial numbers of additional RBP binding sites remain to be detected in cell-types distinct from K562 and HepG2 (Extended Data Fig. 2g).
Although these results correspond well with previous work suggesting that RNAs are often densely coated by RBPs 23 , it remains to be seen what fraction of these peaks mark alternative regulatory interactions versus constitutive RNA processing. Indeed, many may mark relative association of proteins which coat or broadly interact with RNAs as part of their basic function (such as association of RNA Polymerase II component POLR2G with pre-mRNAs, or spliceosomal component association with splice sites).

In vivo binding is largely determined by in vitro binding specificity
Binding of an RBP to RNA in vivo is determined by the combination of the protein's intrinsic RNA binding specificity and other influences such as RNA structure and protein cofactors. To compare the binding specificities of RBPs in vitro and in vivo, we calculated the enrichment or R value of each 5mer in RBNS-bound sequences relative to input sequences and compared it to the corresponding enrichment of the 5mer in eCLIP peaks relative to randomized locations in the same genes (R eCLIP ). Significantly enriched 5mers in vitro and in vivo were mostly in agreement, with 17 of the 26 RBPs having significant overlap in the 5mers that comprise their motif logos (Fig. 3a, left). The top RBNS 5mer for an RBP was almost always enriched in eCLIP peaks (Fig. 3a, center). In most cases, similar degrees of enrichment and similar motif logos were observed in eCLIP peaks located in coding, intronic or UTR regions, suggesting that RBPs have similar binding determinants in each of these transcript regions (Fig. 3a, center; Extended Data Fig. 3a). Strikingly, the single most enriched RBNS 5mer occurred in 30% or more of the peaks for several RBPs including SRSF9, TRA2A, RBFOX2, PTBP3, TIA1, and HNRNPC. For most RBPs, at least half of eCLIP peaks contained one of the top five RBNS 5mers. Therefore, instances of these 5mers provide candidate nucleotide-resolution binding locations for the RBP Van Nostrand et al. 9 ( Fig. 3a, right). Such precise binding locations have a number of applications, e.g., they can be intersected with databases of genetic variants to identify those likely to alter function at the RNA level. When two or more distinct motifs were enriched in both RBNS and eCLIP, the most enriched motif in vitro was usually also the most enriched in vivo (5 out of 7 cases). These observations are consistent with the idea that intrinsic binding specificity observed in vitro explains a substantial portion of in vivo binding preferences for most RBPs studied to date.
For a minority of RBPs (12/26), the top five RBNS 5mers explained less than half of eCLIP peaks. Some of these factors appear to have affinities to RNA structural features or to more extended RNA sequence elements not well represented by 5mers (Dominguez et al., in preparation), while the sequence-specific binding of others may be driven to a large extent by cofactors. In some cases, RBNS revealed affinity to a subset of the motifs that were enriched in eCLIP peaks. For example, C-rich 6mers were most enriched in RBNS data for PCBP2 and also in PCBP2 eCLIP peaks (Fig. 3b). In this example, and in several others, a subset of similar eCLIP-enriched kmers were not enriched at all by RBNS (e.g., the G-rich 6mers in Fig. 3b).
Such "eCLIP-only" motifs, which were often G-, GC-, or GU-rich (Extended Data Fig. 3b), may represent RNA binding of cofactors in complex with the targeted RBP (e.g., G-rich motifs enriched near RBFOX2 peaks may represent sites bound by HNRNPF, HNRNPH and HNRNPM in complex with RBFOX2) 24,25 , or could represent biases in crosslinking or in genomic sequences near eCLIP peaks 26,27 .
The extent to which strength and mode of binding correlates with eCLIP read density and regulatory activity is not well understood. We focused on regulation of splicing because a large proportion of the available cell type/RBP combinations that included KD, eCLIP, and RBNS data involved RBPs with known roles in splicing, and splicing changes could be readily detected in the KD/RNA-seq data. For most datasets involving KD of known splicing RBPs (18/28), eCLIP binding to one or more specific regions near alternative exons was associated with increased splicing changes upon KD of the RBP. In contrast, this association was observed for only one of the seven datasets involving RBPs that lacked known splicing functions (Fisher exact P<0.05, Extended Data Fig. 4a). To explore the relationship between sequence-specific binding and regulation, we classified eCLIP peaks as RBNS+ or RBNS-depending on whether they contained the highest-affinity RBNS motif (Supp. Methods). We then asked whether these classes of peaks differed in their association with splicing regulation. Examining exon-proximal regions commonly associated with splicing regulation, we found that RBNS+ eCLIP peaks conferred stronger regulation of exon skipping -with an average ~25% increase in |ΔΨ| -than did RBNS-peaks (Fig. 3c). Thus, sequence-specific binding appears to confer stronger 1 0 regulation than non-sequence-specific binding, and RBNS motifs can be used to distinguish a subset of eCLIP peaks that has greater regulatory activity. The in vitro data were needed to make this distinction, because a similar analysis of eCLIP peaks classified by presence/absence of the top eCLIP-only 5mer exhibited minimal differences in splicing regulatory activity (Extended Data Fig. 4b). In general, RNA binding directed by intrinsic RNA affinity may involve longer-duration interactions that more consistently impact recruitment of splicing machinery.

Functional Characterization of the RBP Map
Analysis of the knockdown/RNA-seq data enables inference of the function of some RNA elements identified by eCLIP and RBNS. Regulation of RNA stability, which alters steady-state mRNA levels, can be observed by an increase or decrease in mRNA expression upon knockdown of an RBP. We compared differentially expressed genes upon RBP knockdown with eCLIP binding to three regions of mRNAs: 5'UTR, CDS, and 3'UTRs. We observed that binding of 17 RBPs correlated with increased expression upon knockdown, including RBPs with previously identified roles in induction of RNA decay (such as UPF1, XRN2, and DDX6) (  Fig. 5b-c). We further observed a trend in which increasing METAP2 eCLIP foldenrichment correlated with progressively stronger increases in RNA expression upon knockdown, supporting an RNA regulatory role (Fig. 4b).
Additionally, we observed 11 RBPs for which binding correlated with decreased RNA levels following knockdown (Fig. 4a), including the stress granule component TIA1. Surprisingly, although our transcriptome-wide analysis indicated that transcripts with 3' UTR binding of TIA1 decreased upon knockdown in K562 cells (suggesting a globally stabilizing role for TIA1) (Fig.   4c), little to no stabilization activity was observed for 3' UTR-bound mRNAs in HepG2 cells (Extended Data Fig. 5d). Using TIA1 RBNS motif content in 3' UTRs rather than eCLIP binding sites, we additionally observed cell-type specific enrichment of TIA1 motifs in destabilized transcripts upon KD in K562, with no significant effect (though a slight motif enrichment in stabilized genes upon KD) in HepG2 (Extended Data Fig. 5e-g). This distinction is reminiscent of previous studies, which indicate that TIA1 can either induce RNA decay when tethered to a 3' 1 1 UTR 28 , or stabilize target mRNA levels through competition with other RBPs including HuR 29 .
Thus, our results provide further evidence that TIA1 can regulate mRNA stability through dynamic cell type-specific interactions.

RBP association with splicing regulation
Next, we considered how localized RBP binding was associated with splicing regulation. To do this, we generated a 'splicing map' for each RBP 30 , which depicts the average RBP binding at relative positions in the regulated exon and flanking introns of exons that show RBP-responsive splicing changes, relative to binding at non-responsive alternative exons (Fig. 5a, Extended Data Fig. 6). Considering 57 RBPs with eCLIP and knockdown/RNA-seq performed in the same cell line, we observed a wide variety of RNA maps (Fig. 5b). Binding of SR proteins typically correlated with decreased inclusion upon knockdown and hnRNP protein binding correlated with increased inclusion upon knockdown, consistent with classical models of antagonistic effects of SR and hnRNP proteins on splicing 31 (Extended Data Fig. 7a-b).
Surprisingly, in addition to canonical alternative splicing regulators such as RBFOX2 and PTBP1, we observed significant differential association of spliceosomal components near knockdown-sensitive alternative splice sites (Fig. 5b, Extended Data Fig. 7c-d). For example, we observed that comparing cassette exons to constitutive exons revealed strikingly different binding patterns, with cassette exons characterized by increased association of 5' splice site machinery such as PRPF8, EFTUD2, and RBM22 at the upstream 5' splice site but depletion at the cassette exon 5' splice site (Extended Data Fig. 7c). Branch point recognition factors such as SF3B4 and SF3A3 showed similar depletion for the cassette exon branch point and enrichment at the downstream branch point, leading to a distinctive pattern of spliceosomal association at the cassette exon (Extended Data Fig. 7c). Moreover, when considering nonspliceosomal RBPs we similarly observed that while RBP association was higher at cassettebordering proximal intron regions relative to constitutive exons, the upstream 5' splice site and proximal intron showed an even greater enrichment (Extended Data Fig. 7d). Together, these observations for non-spliceosomal RBPs suggest that the upstream 5' splice site of alternative exons represent a greater source of regulatory RBP binding than previously believed.
Splicing maps were also constructed for alternative 5' (A5SS) and alternative 3' splice site (A3SS) events (Fig. 5c). Again, we noticed differential association of spliceosomal components (Fig. 5c, Extended Data Fig. 8a). Focusing on A3SS events, we noted a particularly prominent pattern of association for branch point factors SF3B4 and SF3A3, which typically bind Van Nostrand et al. 1 2 to the branch point region ~50 nt upstream of the 3' splice site. For both, we observed that enrichment was shifted downstream towards the 3' splice site for the set of A3SS events where the distal (downstream) 3' splice site is preferred upon knockdown ( Fig. 5c- splice site scanning and recognition originates from the branch point and can be blocked if the branch point is moved close to the 3' splice site AG 32 , and suggest that regulated branch point recognition plays a key role in A3SS regulation by restricting recognition by the 3' splice site machinery 33 (Extended Data Fig. 8c).
In summary, the RBPs we have surveyed that participate in alternative splicing display a wide diversity of regulatory modes. Moreover, although the target splicing events differ, the splicing map of a given RBP is highly consistent between cell types (Fig. 5b,c, Extended Data

RBP Association with Chromatin
Increasing evidence suggests that regulatory RNAs, both coding and non-coding, are broadly involved in gene expression through their interaction with chromatin 34,35 . Based on the expectation that these regulatory events must enlist specific RBPs participating in cotranscriptional RNA processing, we surveyed 58 RBPs in HepG2 and 45 RBPs in K562 cells for their association with chromatin by ChIP-seq (Supplementary Table 1). We selected RBPs for analysis based on their complete or partial localization in the nucleus and on the availability of antibodies that could efficiently immunoprecipitate each factor. These RBPs belong to a wide range of functional categories, including SR and hnRNP proteins, spliceosomal components and RBPs that have been generally considered to function as transcription factors, such as POLR2G and GTF2F1. Interestingly, 30 of 58 RBPs (52%) in HepG2 cells and 29 of 45 RBPs (64%) in K562 cells showed specific interactions with chromatin, with several hundred to more than ten thousand specific binding peaks identified for each RBP.
We next characterized the RBP-chromatin interactions with respect to established chromatin activities, including DNase I hypersensitive sites and various histone marks (Fig. 6a).
This analysis revealed a general preference of RBPs for euchromatic relative to heterochromatic regions, especially gene promoters, although individual RBPs showed distinct preferences. Collectively, even this moderately sized set of RBPs occupied ~30% of all DNase 1 3 hypersensitive or open chromatin regions and ~70% of annotated gene promoters, which was similar between the two cell types, suggesting broad involvement of RBPs in chromatin activities in the human genome. These data provide a foundation for future investigation of direct roles of specific RBPs in transcriptional control, exemplified by recent studies on SR proteins 21 and RBFOX2 36 .
It is well established that a variety of RNA processing events are co-transcriptionally coupled 20 . It is therefore possible that some RBP-chromatin association events are coupled with their direct RNA binding activities in cells. To explore this possibility, we intersected ChIP-seq peaks with eCLIP peaks for individual RBPs in HepG2 cells, revealing specific RBPs with relatively high degrees of overlap in these two types of interactions (Fig. 6b). Interestingly, the distribution of the overlapped regions varies among individual RBPs. For example, the chromatin and RNA binding activities of SR proteins, as exemplified by SRSF1, are predominantly coincident near gene promoters, while most other RBPs show such overlapping activities further into the gene body (Fig. 6c). These data revealed coordinated actions of RBPs at the chromatin and RNA levels, suggesting that a fraction of their RNA binding events are chromatin-associated. We next sought to quantify the similarity of each RBP pair's chromatin and RNA binding activities by computing the overlap in ChIP-seq and eCLIP-seq signal in nonpromoter regions (Fig. 6d). Clustering of the data indicated that many RBPs might function in conjunction with other RBP(s) to coordinate their chromatin and RNA binding activities (red signals in Fig. 6d). Particularly interesting is the high correlation between poly(rC) binding proteins HNRNPK and PCBP1/2, which share a common evolutionary history and domain composition yet perform diverse functions 37 (red box in Fig. 6d). To illustrate the relationship between RNA and chromatin interactions, we plotted the ChIP-seq density of these three RBPs relative to PCBP2 and HNRNPK eCLIP peaks in non-promoter regions. We found that chromatin binding signals were typically centered around eCLIP peaks, although HNRNPK and PCBP1 to a lesser degree were slightly biased for chromatin binding upstream of RNA binding indicative of a specific topological arrangement of these potential RBP complexes on chromatin in a manner dependent on the direction of transcription (Fig. 6e, top panels). We next asked whether the signal overlapping eCLIP peaks was specific to the DNA binding activity of these RBPs or if it also held for related RNA binding activities. For this purpose, we plotted the eCLIP density of multiple RBPs relative to another RBP's eCLIP peaks. This analysis revealed a significant degree of overlap among the RNA binding activities of HNRNPK and PCBP2, which contrasts to a randomly selected RBP for comparison (Fig. 6e, bottom panels). Combined, Van Nostrand et al. 1 4 these data strongly reinforce coordinated chromatin and RNA binding activities of specific RBPs.
To investigate whether there is correlation between gene expression and RBP association, we clustered the probability of each RBP's association with genes binned by increasing expression levels, revealing a pervasive positive correlation between gene expression and RBP association for the majority of RBPs with the exception of SAFB2 and SFPQ which have only a few binding sites (Fig. 6f, left panels). These data implicated regulatory roles of RBPs in gene expression through chromatin association. To pursue this point, we compared the frequency that a gene is differentially expressed upon RBP knockdown depending on whether or not the RBP was chromatin-associated at that gene, controlling for the different expression levels of these two groups. We found that chromatin association of HNRNPLL, HNRNPL, HNRNPK, and AGO1 correlated with a significantly increased chance of gene expression change upon knockdown (Fig. 6f, right panels). To explore the connection between alternative splicing regulation and RBP-chromatin association, we calculated similar ratios for each RBP and found that RBM22 chromatin association was associated with a significant increase in alternative splicing changes upon knockdown (Fig. 6f, right panels).
Together, these data suggest that chromatin-association of RBPs affects RNA processing and gene expression.

RBP regulatory features in subcellular space
The systematic imaging screen revealed that RBPs display a broad diversity of localization patterns (Fig. 7a), with most factors exhibiting targeting to multiple structures in the nucleus and cytoplasm (Fig. 7b). Next, we integrated RBP localization data with other datasets generated here. First, we considered the nuclear relative to cytoplasmic ratios for each RBP. As expected, we observed a significant shift towards increased binding to unspliced transcripts for nuclear RBPs, whereas cytoplasmic RBPs were enriched for binding to spliced transcripts (Extended Data Fig. 9a-b). Next, we identified a collection of 80 RBPs that exhibit robust localization to SC35 (SRSF2)-labeled nuclear speckles, a class of subnuclear structures enriched for proteins involved in pre-mRNA splicing 38 . Consistent with the established function of speckles in splicing, analysis of splicing changes associated with RBP depletion revealed that speckle-localized RBPs impact larger numbers of splicing events compared to non-speckle proteins (Fig. 7c). Notably, the top 12 (and 19 of the top 20) RBP knockdowns with the strongest impact on splicing (in terms of number of altered splicing events) corresponded to specklelocalized RBPs, including spliceosomal components U2AF1, U2AF2, SF3B4, and SF3A1 as Van Nostrand et al. 1 5 well as splicing regulators HNRNPK and SRSF1 (Fig. 7c). By contrast, nucleolar RBPs had significantly less impact on pre-mRNA splicing than non-nucleolar RBPs, as expected (Fig. 7c).
Furthermore, when we queried the enrichment for RNA binding modalities in eCLIP data, we observed similarly striking enrichment for speckle-localized RBPs binding to proximal introns as well as snRNAs (notably RNU2 and RNU6) (Extended Data Fig. 9c), again consistent with nuclear speckles playing a key role in pre-mRNA splicing.
Focusing on localization to specific cytoplasmic organelles, we noted that 42 RBPs exhibited localization to mitochondria, an organelle with unique transcriptional and RNA processing regulation 39

Preservation of RBP regulation across cell types
Next we evaluated whether RBP regulation is preserved across cell types. Analyzing the 55 RBPs profiled by eCLIP in both K562 and HepG2 cell lines, we found that only a small portion (averaging 11.0%) of peaks were found in genes with cell type-specific expression (TPM (defined as fold-difference ≤ 1.2), and 31.6% of peaks were within genes that changed by at Van Nostrand et al. 1 6 least two-fold (Fig. 8a, Extended Data Fig. 10a). To illustrate, we observed that RBFOX2 eCLIP peaks at least 8-fold enriched in HepG2 were typically also enriched in K562 (average enrichment of 6.1-fold) if the bound gene was expressed within five-fold of the level in HepG2 cells (Covering the "Unchanged", "Weakly differential", and "Moderately differential" classes in Fig. 8a). In contrast, 89.8% of HepG2 peaks in genes with cell type-specific expression were not enriched in K562 (Fig. 8a). Indeed, 49.7% of RBFOX2 HepG2 peaks that were not enriched in K562 occurred in genes with cell type-specific expression whereas only 5.5% occurred in genes within a two-fold change.
Expanding this analysis to all RBPs profiled in both cell types, we observed similar results: an average of 84.3% of peaks in genes with cell type-specific expression were not enriched in the other cell type, whereas 68.0%, 65.5%, and 67.6% of peaks in unchanging, weakly, or moderately differentially expressed genes were enriched by at least 4-fold in the second cell type, respectively (Fig. 8b). Requiring the strict IDR thresholds for peak identification, an average of 44.1% of peaks in genes with similar expression were preserved across cell types (Extended Data Fig. 10c). 48.9% of all peaks that showed no enrichment in the second cell type occurred in genes with cell type-specific expression (a 4.4-fold enrichment), whereas only 21.0% occurred in unchanging, weakly, or moderately differentially expressed genes, respectively (a 3-fold depletion) (Extended Data Fig. 10c). Thus, these results suggest that most RBP binding is preserved across cell types for similarly expressed genes, and that much of differential eCLIP signal between HepG2 and K562 likely reflects underlying gene expression differences more than differential binding.
Next, we asked whether an RBP's positional pattern of splicing regulation tended to be conserved across cell types. Considering splicing maps of cassette exons, we observed that binding of many RBPs had highly similar correlations with either inclusion or exclusion upon knockdown, including alternative splicing regulators HNRNPL and SRSF1 (Fig. 8c-d). We observed that the splicing maps for the same RBP across cell types had significantly higher correlation than random pairings of RBPs, when comparing across all 16 RBPs with both eCLIP and RNA-seq datasets (with sufficient splicing changes) in both K562 and HepG2 (Fig. 8d, Extended Data Fig. 10d).
Cell type-specific regulation of RNAs may also be achieved through differential modulation of RBP levels. To assess which RBPs confer regulation in such a manner, we calculated the expression of each RBP in HepG2 and K562 cells in addition to 31 diverse human tissues from the GTEx Project (Extended Data Fig. 11). Many RBPs had high expression in ENCODE cell lines and across a broad range of human tissues, including Van Nostrand et al. 1 7 ribosomal proteins (RPL23A, RPS11, RPS24), translation factors (EIF4H, EEF2), and ubiquitously expressed splicing factors (HNRNPC, HNRNPA2B1) among the 10 least tissuespecific RBPs (Fig. 8e). However, several other RBPs had highly tissue-specific expression exhibiting either a pattern of high expression in one or a small number of human tissues (e.g., LIN28B, IGF2BP1/3) or were differentially expressed by orders of magnitude across several human tissues (e.g., IGF2BP2 and APOBEC3C), indicating that the RNA targets and regulatory activity of these RBPs are likely modulated through cell type-specific gene expression programs. Of course, even RBPs with similar mRNA levels across tissues may have different protein levels or activity because of post-translational modifications, which are widespread among RBPs.

Discussion
Our study represents the largest effort to date to systematically study the roles of human level, immunohistochemical analysis with our extensive repository of RBP-specific antibodies place these molecular interactions within subcellular contexts. We confirm localization of many RBPs to nuclear speckles, mitochondria and other compartments, and identify many new proteins resident to these sites, emphasizing the necessity of localization data for interpreting RBP-RNA regulatory networks.
Here, we have surveyed the in vivo binding patterns of 126 RBPs, comprising roughly 10% of the human genes predicted to encode proteins that interact directly with RNA. Our observation that additional mapping of new RBPs leads to a greater increase in the functional RNA element coverage than mapping of the same RBPs in additional cell lines argues that expansion of these approaches to additional RBPs will be particularly informative. Additionally, the binding modes in vivo are highly preserved across genes expressed similarly in our two cell We expect that the data generated here will provide a useful framework upon which to build analyses of other aspects of RNA regulation, such as microRNA levels, RNA editing and modifications such as pseudouridylation and m6A methylation, translation efficiency, and mRNA half-life measurements.
We have yet to integrate in vivo RNA structure probing data to evaluate how RBP-mediated RNA processing are influenced by local 41 and long-range RNA structures 42 .
As we continue to embark on comprehensively characterizing all functional RNA elements, genome-scale CRISPR/Cas9-inspired genome-editing 43 and RNA modulation 44 technologies will ultimately provide opportunities to study the impact on cellular and organismal phenotypes resulting from disruption of these RNA elements.

RNA binding protein annotations and domains
RBPs were chosen from a previously described list of 1072 known RBPs, proteins containing RNA binding domains, and proteins characterized as associated with polyadenylated RNA, based on the availability of high quality antibodies 17

eCLIP -experimental methods
Antibodies for eCLIP were pre-screened using a set of defined metrics 17

eCLIP -data processing and peak identification
Data processing, including adapter trimming, repetitive element removal, unique genomic mapping, PCR duplicate removal, and peak calling versus paired size-matched input were performed as previously described in a detailed Standard Operating Procedure 18 . Unless otherwise noted, reproducible and significant peaks were identified by merging peaks identified in each replicate, requiring that the peak meet an irreproducible discovery rate cutoff of 0.01 as well as p-value ≤ 0.001 and fold-enrichment ≥ 8 (using the geometric mean of log 2 (foldenrichment) and -log 10 (p-value) between the two biological replicates). For submission to the ENCODE portal, eCLIP datasets were required to pass several quality metrics, including minimum read depth, reproducibility, and peak saturation metrics. To quantify binding to snRNA and other multi-copy elements, a separate pipeline was developed to require unique mapping to element families instead of unique genomic positions. All analyses described in this manuscript used mapping to GRCh37 and GENCODE v19 annotations, but mapping to GRCh38 and GENCODE v24 annotations were also deposited at the ENCODE portal. To summarize relative enrichment between IP and input, information was defined as the Kullback-Leibler divergence (relative entropy) p i * log 2 (p i / q i ) , where p i is the fraction of total reads in IP that map to a queried element i (peak, gene, or repetitive element), and q i is the fraction of total reads in input for the same element.

Knockdown followed by RNA-seq (KD/RNAseq)-experimental methods
Individual RBPs were depleted from HepG2 or K562 cells by either RNA interference or CRISPR-mediated gene disruption.

KD/RNA-seq -data processing
Reads were aligned to both the hg19 assembly of the genome using the Gencode v19 annotation, respectively, using both TopHat2 46 and STAR 47 . Gene expression levels were quantitated using RSEM 48 and Cufflinks 49 and differential expression levels determined using DESeq 50 and Cuffdiff2 51 . Alternative splicing was quantitated using rMATS 52 and alternative poly(A) site use quantitated with MISO 53 . All analyses described in this manuscript used mapping to GRCh37 and GENCODE v19 annotations, but mapping to GRCh38 and GENCODE v24 annotations were also deposited at the ENCODE portal.

RNA Bind-N-Seq (RBNS) -experimental methods
Van Nostrand et al. RBNS motif logos were made using the following iterative procedure for k=5: the most enriched 5mer was given a weight equal to its excess enrichment over the input library (=R-1), and all occurrences of that 5mer were masked in both the pulldown and input libraries to eliminate subsequent counting of lower-affinity 'shadow' 5mers (e.g., GGGGA shifted by 1 from GGGGG).
All enrichments were then recalculated on the masked read sets to obtain the resulting most enriched 5mer and its corresponding weight, with this process continuing until the enrichment Zscore (calculated from the original R values) was less than 3. All 5mers determined from this procedure were aligned to minimize mismatches to the most enriched 5mer, with a new motif initiated if the number of mismatches + offsets exceeds 2. The frequencies of each nucleotide in the position weight matrix, as well as the overall percentage of each motif, were determined from the weights of the individual aligned 5mers that went into that motif; empty unaligned positions before or after each aligned 5mer were given pseudocounts of 25% each nucleotide, and outermost positions of the motif logo were trimmed if they had >75% unaligned positions.
To improve the robustness of the motif logos, the pulldown and input read sets were each divided in half and the above procedure was performed independently on each half; only 5mers identified in corresponding motif logos from both halves were included in the alignments to make the final motif logo. In Fig. 3a, only the top RBNS motif logo is shown if there were Van Nostrand et al. i.e., j≥4), and the top score with its corresponding alignment offset was used. The matrix of these scores was normalized to the maximum score over all RBP pairs and clustered using the linkage function with centroid method in scipy.cluster.hierarchy to obtain the dendrogram shown in Fig. 3a.   7b) was generated by drawing one line between every pair of categories for each RBP that shared both localization annotations. Nuclear annotations are indicated in purple, cytoplasmic in red, and lines between nuclear and cytoplasmic annotations are indicated in yellow.

ChIP-seq -experimental methods
Van Nostrand et al. 2 5 Chromatin immunoprecipitation was implemented according to ChIP Protocol optimized for

ChIP-seq -data processing
Data processing was performed in accordance with ENCODE uniform transcription factor ChIPseq pipeline (https://www.encodeproject.org/chip-seq/transcription_factor/) and using hg19 as the reference human genome. All datasets containing >10 million usable reads from each replicate, passing IDR, and generating >200 peaks were used for final analysis.

Saturation Analysis
Saturation analysis of eCLIP and KD/RNA-seq data was performed by randomly shuffling the order of datasets 100 times, subsampling 1 through all datasets, and calculating the desired metrics. Gene level saturation analysis of RBP binding was calculated first by taking all unique genes that were bound by an IDR filtered peak in an eCLIP experiment. Then, each eCLIP experiment was iteratively added to the previous experiment, counting only unique genes in any experiment. Saturation analysis of differentially expression genes from KD/RNA-seq was similarly performed, based on differentially expressed genes identified with DESeq2. Genes Van Nostrand et al. 2 6 were identified as differentially expressed if they had a |log 2 (fold-change)| > 1 and an adjusted p-value < 0.05 between knockdown and control. Alternative versions of this analysis used (  Fig. 2d), and plotted as either the total number of bases covered (Fig. 2c), or the fraction of covered bases divided by the total number of bases in that annotation across all genes (Fig. 2d, Extended Data Fig. 2d).
The ratio of bases covered was calculated by dividing the number of bases covered in subsampling of N+1 datasets divided by the number covered in subsampling N datasets.
Van Nostrand et al. 2 7 Analysis of the fold-increase between one and two datasets (Extended Data Fig. 2f) was determined by first taking all 55 RBPs profiled in both HepG2 and K562, and calculating the fold-increase in covered bases by considering 110 comparisons including HepG2 followed by K562 and K562 followed by HepG2. Then, for each of the 110 datasets, 10 random other datasets were chosen from the same cell type, and for each of the 10 the fold-increase in covered bases from adding that dataset to the first was calculated.
To compare the fold-increase between profiling new RBPs in additional cell lines, eCLIP datasets profiling RBFOX2, IGF2BP1, IGF2BP2, and IGF2BP3 in H9 human embryonic stem cells were obtained from the Gene Expression Omnibus (GSE78509) 54

Motif comparisons between RBNS and eCLIP
eCLIP 6mer Z-scores in Fig. 3b were calculated as previously described 55  To produce eCLIP logos in a similar manner for comparison with RBNS logos, an analogous procedure was carried out on the eCLIP peak sequences (only eCLIP peaks at least 2-fold enriched were used): the two halves of the RBNS pulldown read set were replaced with the two eCLIP replicate peak sequence sets (each peak was extended 50 nt upstream of its 5' end as some RBPs have motif enrichments symmetrically around or only upstream of the peak starts), and the input RBNS sequences were replaced by random regions within the same gene as each peak that preserved peak length and transcript region (5' and 3' UTR peaks were chosen randomly within that region; intronic and CDS peaks were shuffled to a position within the same gene that preserved the peak start's distance to the closest intron/exon boundary to match sequence biases resulting from CDS and splicing constraints). The enrichment Z-score threshold for 5mers included in eCLIP logos was 2.8, as this threshold produced eCLIP logos containing the most similar number of 5mers to that of the Z≥3 5mer RBNS logos. Each eCLIP motif logo was filtered to include only 5mers that occurred in both of the corresponding eCLIP Van Nostrand et al. 2 8 replicate logos. eCLIP motif logos were made separately for all eCLIP peaks, only 3'UTR peaks, only CDS peaks, and only intronic peaks, with the eCLIP logo of those 4 (or 8 if CLIP was performed in both cell types) with highest similarity score to the RBNS logo shown in Fig. 3a, where the similarity score was the same as previously described to cluster RBNS logos (eCLIP logos for all transcript regions shown in Extended Data Fig. 3a)

Splicing regulatory effects of RBNS+ and RBNS-eCLIP peaks
To assess the splicing regulatory effects of RBNS+ and RBNS-eCLIP peaks for Fig. 3c If the eCLIP+ vs eCLIP-comparison was significant, the eCLIP peaks were divided into those that did and did not contain the top RBNS 5mer. The

Δ Ψ
values for all RBPs in each of the 6 SE direction/eCLIP regions were combined for comparison in Fig. 3c; see Extended Data. Fig. 4a Van Nostrand et al. 2 9 for RBPs that were significant in each region (12 included/4 excluded upon KD, upstream intron eCLIP peak; 11 included/2 excluded upon KD, SE eCLIP peak; 7 included/7 excluded upon KD, downstream intron eCLIP peak). To assess eCLIP peaks with or without the top 'eCLIP-only'

Overlaps between RBP binding and gene expression perturbation upon KD/RNA-seq
To increase sensitivity for gene expression analysis, significant binding was determined at the level of transcript regions (including 5'UTR, CDS, 3'UTR, and introns) instead of using peaks.
To identify significant enrichment between binding and expression changes, genes with significantly enriched binding to regions (p Next, for each splicing event, read densities were normalized across all regions separately for IP and input, in order to equally weigh each event. Per-position input probability densities were then subtracted from IP probability densities to attain position-level enrichment or depletion, for regions extending 50nt into each exon and 300nt into each intron composing the event, referred to as 'Normalized eCLIP enrichment' (Extended Data Fig. 6b). For shorter exons (<100 nt) and introns (<600nt), densities were only counted until the boundary of the neighboring feature. Van Nostrand et al. 3 1 Plots of eCLIP signal enrichment (referred to as 'splicing maps') were then created by calculating the mean and standard error of the mean over all events after removing the highest (2.5%) and lowest (2.5%) outlying signal at each position, referred to as 'Average eCLIP enrichment' (Extended Data Fig. 6c). Splicing maps were only considered for RBPs with 100 or more altered cassette exon events, or 50 or more alternative 5' or 3' splice site events. To calculate the number of RBPs bound per exon, the set of spliceosomal RBPs was taken from manual annotation of RBP functions (described above and listed in Supplementary Table 1).
The number of reproducible (IDR) peaks at each position relative to splice sites was summed across all RBPs, and divided by the total number of cassette or constitutive exons respectively. Van Nostrand et al. 3 2 To explore the possibility that some RBP chromatin association events might be coupled with their direct RNA binding activities in cells, RNA binding peaks were compared with DNA binding signals as assayed by ChIP-seq to quantify enrichment. Only eCLIP peaks in gene body regions (excluding promoter and terminator regions, defined as the 1kb surrounding regions of TSS and TTS) were considered. The ChIP-seq signals were calculated for each eCLIP peak, together with surrounding regions that are 10 times the length of eCLIP peak on each side. Wilcoxon rank-sum tests were then performed to see whether ChIP-seq signal were enriched at the middle third regions.

Comparison of DNA-and RNA-binding properties of RBPs
To see whether those differentially-expressed genes after RBP knockdown were enriched in RBP binding at chromatin level, equal number of genes with similar expression level either with or without binding were randomly sampled, and the number of differentially-expressed genes after knockdown of the RBP were counted (fold change > 1.5 or < 2/3, adjusted p-value <0.05 by DESeq2), and one-tailed Fisher's exact tests were then performed to test the dependence of RBP binding and differential expression. The above procedure was performed 100 times to give the distribution of the odds ratio. A significant dependence was defined when the null hypothesis was rejected at level of 0.05 for at least 95 times. The correlation between RBP association and genes with regulated alternative splicing events (A3SS, A5SS, RI, MXE and SE events) were investigated similarly.

Analysis of RBP regulatory features in subcellular space
Localization annotations and calculation of nuclear versus cytoplasmic ratio were generated from immunofluorescence imaging as described above. "Nuclear RBPs" were defined as those with nuclear/cytoplasmic ratio ≥ 2, and "Cytoplasmic RBPs" were defined as those with nuclear / cytoplasmic ratio ≤ 0.5. Spliced reads were defined as reads mapping across an annotated GENCODE v19 splice junction (extending at least 10 bases into each exon) and unspliced reads were defined as reads that overlapped an exon-intron junction (extending at least 10 bases into both the exon and intron regions). Significance between groups was determined by Wilcoxon rank sum test.

Preservation of RBP regulation across cell types
Van Nostrand et al. 3 3 To consider binding across cell types, first the highest expressed transcript for each gene was identified using transcript-level quantifications) from the same rRNA-depleted RNA-seq experiments described above (K562, accession numbers ENCFF424CXV,ENCFF073NHK; HepG2, accession numbers ENCFF205WUQ, ENCFF915JUZ) and used as representative for that gene. Next, genes were categorized based on their absolute fold-difference (FD) between K562 and HepG2: unchanged (FD Analysis of preservation of binding across cell types was considered in three ways. First, for each peak identified in one cell type, the fold-enrichment for that region in the other cell type was calculated and considered for each gene type (as in Fig. 8a). Two groups of peaks were then identified: those that were ≥ 4-fold enriched in the other cell type, and those that were not enriched in the other cell type. The fraction of peaks associated with a gene class that were either ≥ 4-fold or not enriched were then considered for each gene class separately (Fig. 8b).
Second, the set of peaks ≥ 4-fold enriched (and the set not enriched) was compiled across all genes, and the fraction associated with each gene class were then reported (Extended Data   Fig. 10b). Finally, binding preservation was calculated by determining the fraction of IDR peaks identified in one cell type that overlap (requiring at least 1nt overlap) IDR peaks identified in the second cell type (Extended Data Fig. 10c).
Correlation between splicing maps was performed as described above, considering all RBPs profiled by eCLIP and RNA-seq in both K562 and HepG2 that had at least 100 differentially included or excluded cassette exon events in both cell types.

RBP expression in tissues
Tissue specificity was measured as the entropy deviation from a uniform distribution among all tissues as in 1 . For each RBP, the log 2 (TPM+1) was calculated for each of the 42 samples Van Nostrand et al. 3 5 Overall project management was carried out by the senior authors X-DF, EL, CBB, BRG and GWY.

Author Information.
Data sets described here can be obtained from the ENCODE project website at http://www.encodeproject.org via accession numbers in Supplementary Table 2  Van Nostrand et al. 3 7 Van Nostrand et al. 4 2 Van Nostrand et al. 4 3   (c) Lines indicate the mean of 100 random orderings of datasets for the number of (red) Van Nostrand et al. 5 3 Van Nostrand et al. 5 4 Extended Data Figure 3  Van Nostrand et al. 5 5 Van Nostrand et al. 5 6 Extended Data Figure 4 | Splicing regulatory activity of RBNS+ and RBNS-eCLIP peaks. (b) Same set of RBPs and corresponding eCLIP+ peak region/SE splicing change types as used in Fig. 3c, but separating eCLIP peaks on whether they contain the top 'eCLIP-only' 5mer (based on the motifs from Extended Data Fig. 3b) instead of the top RBNS 5mer.
Van Nostrand et al. Van Nostrand et al. 5 9 Van Nostrand et al. Van Nostrand et al. 6 2 Van Nostrand et al. Van Nostrand et al. 6 4 Van Nostrand et al. 6 5 Extended Data Figure 9 | eCLIP binding patterns in subcellular space (a) Points indicate nuclear versus cytoplasmic ratio from immunoflourescence (IF) imaging (xaxis) versus ratio of spliced versus unspliced exon junction reads, normalized to paired input (yaxis). RBPs profiled by eCLIP and IF in HepG2 are indicated in blue, and RBPs profiled by eCLIP in K562 (in purple) were paired with IF experiments performed in Hela cells.
(b) Points indicate values as in (a), with RBPs separated into nuclear (nuclear / cytoplasmic ratio > 2) and cytoplasmic (nuclear / cytoplasmic ratio < 0.5). Significance was determined by Kolmogorov-Smirnov test, and red line indicates mean.
(c) Points indicate eCLIP relative information for all profiled RBPs for the indicated snRNA or pre-mRNA region, separated by the observation of co-localization at nuclear speckles.
Significance was determined by Wilcoxon rank-sum test.
(d) Cumulative distribution curves indicate total relative information content for the mitochondrial genome for RBPs with mitochondrial localization by IF (red) and all other RBPs (black).
Significance was determined by Kolmogorov-Smirnov test. Van Nostrand et al. 6 6 Extended Data Figure 10 | Preservation of binding across cell types.  Van Nostrand et al. 6 8 Van Nostrand et al. 6 9 Extended Data Figure 11 | Expression of RBPs across tissues and cell types.

Expression of the 352 RBPs (in Transcripts Per Million) investigated in this study in ENCODE
cell lines HepG2 and K562 as well as 40 human tissues measured by the GTEx project. RBPs sorted by decreasing expression in HepG2.  pyrophosphatase digestion in initiation complexes. Nucleic acids research 7, 15-29 (1979).