Transcriptome-wide high-throughput mapping of protein–RNA occupancy profiles using POP-seq

Interaction between proteins and RNA is critical for post-transcriptional regulatory processes. Existing high throughput methods based on crosslinking of the protein–RNA complexes and poly-A pull down are reported to contribute to biases and are not readily amenable for identifying interaction sites on non poly-A RNAs. We present Protein Occupancy Profile-Sequencing (POP-seq), a phase separation based method in three versions, one of which does not require crosslinking, thus providing unbiased protein occupancy profiles on whole cell transcriptome without the requirement of poly-A pulldown. Our study demonstrates that ~ 68% of the total POP-seq peaks exhibited an overlap with publicly available protein–RNA interaction profiles of 97 RNA binding proteins (RBPs) in K562 cells. We show that POP-seq variants consistently capture protein–RNA interaction sites across a broad range of genes including on transcripts encoding for transcription factors (TFs), RNA-Binding Proteins (RBPs) and long non-coding RNAs (lncRNAs). POP-seq identified peaks exhibited a significant enrichment (p value < 2.2e−16) for GWAS SNPs, phenotypic, clinically relevant germline as well as somatic variants reported in cancer genomes, suggesting the prevalence of uncharacterized genomic variation in protein occupied sites on RNA. We demonstrate that the abundance of POP-seq peaks increases with an increase in expression of lncRNAs, suggesting that highly expressed lncRNA are likely to act as sponges for RBPs, contributing to the rewiring of protein–RNA interaction network in cancer cells. Overall, our data supports POP-seq as a robust and cost-effective method that could be applied to primary tissues for mapping global protein occupancies.

Interaction of proteins with RNA is crucial for post-transcriptional gene regulation such as capping, splicing, polyadenylation and localization which is indispensable for cellular homeostasis and survival 1,2 . Despite the increasingly appreciated role of protein-RNA interactions, the global occupancy profiles of proteins in a cellular environment is not fully elucidated. For instance dysregulated expression of RNA binding proteins (RBPs) has been associated with a broad spectrum of human pathologies including cancers, neurological and hereditary diseases [3][4][5][6] . Therefore, it is critical to investigate the diverse protein occupancy sites and their functional impact on physiology and diseases.
Experimental approaches such as crosslinking followed by immunoprecipitation (CLIP) have been widely used to identify the binding pockets of specific RBPs across the transcriptome 7,8 . CLIP based methods exploit the stability of crosslinked protein-RNA complexes by ultraviolet (UV) irradiation followed by immunoprecipitation and sequencing of the co-purified RNA [9][10][11] . However, these methods are reported to contribute to biases in the interaction profiles due to the inherent nature of UV crosslinking [12][13][14] . Other methods that employ antibody pulldown of protein-RNA complexes such as RIP-seq and DO-RIP-seq [15][16][17] are difficult to scale up for detecting hundreds of RNA interacting proteins at the same time. Methods like POPPI-seq 18 and others have enabled the capture of protein-RNA interaction sites by incorporation of photoreactive nucleosides in UV irradiated cells followed by poly-A pull down and sequencing of the bound RNA. Although, the use of photoreactive nucleosides www.nature.com/scientificreports/ POP-seq illustrates a significant enrichment for protein-RNA interactions. To rigorously evaluate whether POP-seq identifies any non-RBP binding events, we performed three comprehensive analyses as summarized below (i) To estimate non-RBP interactions, we compared the POP-seq peaks with ChIP-seq data of 67 proteins and CLIP-seq data of 79 proteins available for K562 cells (from ENCODE project). We observed a significantly higher overlap (p value < 2.2e−16) of POP-seq peaks with CLIP-seq peaks compared to ChIP-seq peaks as shown in Figure S2A, (ii) To estimate the protein-DNA interactions (false positives) that could be captured by POP-seq, we systematically compared the POP-seq peaks (and 5 random peak profiles separately) with the binding profile of 18 proteins for which both ChIP-seq and CLIP-seq data was available in K562 cells (from ENCODE project). Our results showed a significant enrichment (Odds ratio > 20 averaged across 5 random controls, p value < 2.2e−26, Fishers Exact test) of POP-seq signals overlapping with the CLIP-seq profile than the ChIP-seq profile of these 18 proteins ( Figure S2B), indicating that POP-seq peaks are enriched for protein-RNA interactions. Among the 18 tested proteins, NONO (Non-POU domain-containing octamer-binding protein), which is known to bind both DNA and RNA 30,31 , expectedly demonstrated relatively similar significance of binding to both DNA and RNA compared to random locations. Overall, our analysis shows that irrespective of POP-seq protocols, signals are underrepresented in ChIP-seq data while overrepresented in the CLIP-seq data indicating a clear enrichment for RNA binding events compared to publicly available protein-DNA maps. (iii) To estimate the ribosomal protein interactions captured by POP-seq, we compared the POP-seq peaks with publicly available ribo-seq data in K562 cells. Our analysis showed that ~ 20% of the total POP-seq peaks (with peak length ≤ 50 bp) exhibited 50% end-to-end overlap with the ribo-seq peaks indicating that some ribosomal protein-RNA interactions are captured by POP-seq (Table S3).
Comparison of POP-seq data with Formaldehyde and UV crosslinked RBPs reveals high quality of POP-seq peaks. POP-seq is a technique which provides occupancy levels for proteins on a global transcriptome-wide scale. The functionality of thousands of binding sites that are generated resulting from this method could correspond to scores of RBPs and RNP complexes. Since there are no large-scale assays currently available to validate the RNA-protein interaction sites globally, and the low throughput assays will not cover large number of interaction sites, therefore, we employed orthogonal methods using publicly available data for www.nature.com/scientificreports/ 24 formaldehyde crosslinked RBPs 32 with respect to five random non-peak files (See "Methods") and eCLIP profile of 97 RBPs 28 in the K562 cells from ENCODE (see "Methods"). Our analysis indicates a significant enrichment of POP-seq peaks in both CLIP-seq ( Fig. 3A for top 24 RBPs, supplementary Figure S3) and fRIP-seq profile (Fig. 3B) of individual RBP's compared to 5 random non-peak profiles. Confusion matrix for this analysis is documented in supplementary Table S1 and S2. Overall, our analysis shows that POP-seq can recover high quality peaks corresponding to specific RBPs identified from individual crosslinking protocols.
CRISPR knock out RNA-sequencing data of RBPs supports the functionality of POP-seq peaks. Development of targeted genome editing using CRISPR has revolutionized the genomic research 33 , particularly to understand the molecular mechanism involved in gene regulation and expression 34,35 . A recent study by our research group demonstrated that the functional relevance of protein-RNA interactions can be estimated by the expression of the exons upon perturbation of RBP binding site in their neighborhood using CRISPR-Cas9 system 36 . Therefore, we aimed to interrogate the functional impact of POP-seq captured protein-RNA interactions by its systematic comparison with the eCLIP profile of RBP's, for which knockout data is publicly available in ENCODE project 28,37 . For this analysis, we used two RBPs; (a) DGCR8 (DiGeorge Syndrome Critical Region 8), which is involved in microRNA processing and is implicated in the pathogenesis of cancer 38,39 and (b) IGF2BP1 (insulin like growth factor 2 mRNA binding protein 1) which is a critical post-transcriptional regulator of various mRNA involved in cancer progression 40 . First, we identified the POP-seq peaks from the individual protocol that showed > 50% base-to-base overlap with eCLIP profile of respective RBP (obtained from ENCODE 28 ). Next, we extracted the expression levels of exons 'proximal' (< 1000 bp) to overlapped peaks from CRISPR knock out data set (Material and Methods). We observed that the cumulative expression level of 'proxi-   www.nature.com/scientificreports/ mal' exons was significantly dysregulated with respect to the non-targeting control. We observed that there was a significant reduction in the expression of 'proximal' exons in DGCR8 KO and a significant increase in IGF2BP1 KO with respect to their non-targeting CRISPR control (Fig. 4). However, there could be alternative hypotheses such as the contribution of other binding sites from same or different RBPs that could account for the compensatory effects in expression levels in our CRISPR analysis, which could explain why not all the proximal exon levels are altered. More importantly, RBP binding does not always alter the expression of the target exon/transcript but instead may contribute to editing, structure and localization of bound RNA. However, despite these alternate possibilities, it is promising to observe that the loss of binding sites has a significant impact on the target exons.
Overall, this analysis suggests that POP-seq can capture the functionally relevant protein bound sites and indicates that the dysregulation of exons proximal to the functional binding site occur in an RBP dependent manner.
POP-seq supports the protein-RNA binding sites across regulatory gene families. To obtain a detailed perspective of the peaks captured by POP-seq, we examined the occurrence of the protein-RNA binding sites across different regulatory genes, classified as RNA binding protein, ENO1 (Enolase1) 41 , lncRNA MALAT1 (metastasis associated lung adenocarcinoma transcript 1) 42 and a transcription factor, Jun 43 . ENO1 is a crucial glycolytic enzyme involved in cell growth and is also reported as an oncogene that promotes metastasis by facilitating cell proliferation in multiple cancers including colorectal, lung, and prostate cancer [44][45][46][47][48] . We observed the abundance of protein bound pockets in the genomic loci of ENO1 across the POP-seq protocols (Fig. 5A). We also investigated the protein-RNA interactions in MALAT 1, a highly conserved lncRNA that governs a variety of functions including regulation of gene expression, alternative splicing, neural development and vascular growth [49][50][51] . Several studies report the abundant expression of MALAT1 in multiple cancers such as lung cancer, bladder cancer, breast cancer, colorectal cancer and others 49,[52][53][54][55] . Therefore, identification of regulatory sites targeted by proteins in MALAT1 is crucial for understanding its pathogenesis in cancers. Our data suggest that all the three protocols can capture the abundance of regulatory sites targeted by proteins in the genomic boundary of MALAT1 (Fig. 5B). We observed relatively enhanced signals for protein occupancy sites in MALAT1 locus in UPOP-seq compared to the NPOP-seq and FPOP-seq (Fig. 5B). To our observation, NPOPseq also captured the physiologically relevant protein-RNA interactions in MALAT1 suggesting its application for unbiased protein occupancy profiling (Fig. 5B). We further explored the regulatory binding sites in AP1 transcription factor subunit 'Jun' which is a protooncogene actively involved in cell proliferation, apoptosis, inflammation and carcinogenesis 56-60 . We found  POP seq identifies germline and somatic variants that potentially contribute to altered post transcriptional regulation. Single nucleotide polymorphisms (SNPs) are reported as the most common form of somatic variations and are widely associated with metabolism, cell cycle regulation and DNA mismatch repair 61,62 . In past years, SNPs has emerged as a potential diagnostic biomarker for several cancer types [63][64][65][66] . Therefore, it is imperative to investigate the somatic variations arising due to SNPs and their effect on transcriptome wide protein RNA interaction sites. In order to detect the somatic variations captured by POP-seq, we calculated the proportion of known somatic variations (see "Materials and methods") occurring in the equivalent genomic loci of the peaks. We tested the enrichment of genomic variations from GWAS catalog 67 and Ensembl Variation database 68 (PhenVar, ClinVar and somatic variations) in POP-seq peaks than expected by chance (i.e. 5 random non-peak profiles) using Fisher's exact test. For this analysis, random 'non-peak' files were generated as described previously. Our results indicate a significant enrichment for each SNP cohort with relatively lesser enrichment for GWAS SNP (averaged odds ratio = 1.45, 1.42, 1.35 for NPOP, FPOP and UPOP-seq respectively, p value < 2.2e−16). Similar test for other genomic variations including PhenVar, SomaticVar and ClinVar indicated relatively higher enrichment (Odds ratio ~ 22 averaged across 5 random controls for each cohort, p value < 2.2e−16, Fishers Exact test) in POP-seq peaks compared to non-peaks (Fig. 6A). This observation provides support for the enrichment of both germline and somatic SNPs including those reported with clinical significance to be prominent on protein RNA interaction sites, implying the need for deeper understanding of their functional consequences. Indeed, we identified the occurrence of two clinically relevant genetic risk loci from GWAS; rs45461499 in CDC20 (Cell-division cycle protein 20) and rs7578199 in HDLBP (High Density Lipoprotein Binding Protein) genes that are reported in acute and chronic lymphoblastic leukemia respectively 69,70 (Fig. 6B,C) to harbor protein-RNA interaction sites. Thus, our implementation of POP-seq in K562 cells demonstrate a novel and robust approach to elucidate the occurrence of somatic variants in leukemic patients.

Highly expressed lncRNAs exhibit abundance of POP-seq peaks in K562 cells. Long non-coding
RNAs (lncRNAs) have been widely documented with diverse roles in the transcriptional and post-transcriptional regulation of gene expression 71,72 . Aberrant expression of lncRNA is associated with the pathogenesis of various diseases including cancer 73 and have been profoundly recognized as pivotal targets in cancer therapeutics 74 . However, the mechanism underlying lncRNA regulation is not well elucidated 75 . Therefore, we speculated that associating the expression of lncRNA with the occurrence of POP-seq peaks would provide an insight into the transcriptome wide regulation of lncRNAs. We observed that the highly expressed lncRNAs exhibit abundance of regulatory binding sites in K562 cells (Fig. 7A, "Materials and methods"), suggesting that lncRNAs dynami-  www.nature.com/scientificreports/ cally interact with RBPs in pathological conditions. In general, highly expressed RNAs are expected to be more available for binding by proteins and therefore exhibit higher RNA binding events. Therefore, we tested the association between the expression levels (high and low) with the number of POP-seq peaks per unit length for both lncRNA and non-lncRNA genes across the technical replicates. We found that the replicates exhibited a reproducibility in the trend ( Figure S4A)       www.nature.com/scientificreports/ also be observed for non-lncRNA, we carried out the same analysis across the replicates. We observed that there is tendency for even non-lncRNAs to exhibit higher expression with more binding sites, however the trend is not as robust with lower significance ( Figure S4B) compared to lncRNA as shown in Figure S4A. Several studies have reported that the majority of the lncRNAs exhibit a 'sponge effect' to titrate the abundance of regulatory proteins such as RBPs in a cell type specific manner [76][77][78] . These studies proposed that lncRNA sponges can extensively rewire the post-transcriptional gene regulatory networks by altering the protein-RNA interaction landscape in cell-type and phenotype specific manner. Based on this finding we tested the hypothesis that highly expressed lncRNAs could sponge the regulatory RBPs that would further provide an insight into lncRNA mediated regulation of RBPs in disease context. We uncovered RP11 − 301G19.1, a highly expressed lncRNA in leukemia 79,80 , to illustrate the 'sponge effect' in K562 cells. We found an abundance of this lncRNA in AML patients from TCGA 81 compared to its expression level in normal whole blood from GTEx cohort 82 (Fig. 7B, inset). In order to predict whether the expression of this lncRNA contributes to the survival of AML patients, we employed Kaplan-Meier survival analysis 83 for RP11 − 301G19.1. We found that this lncRNA exhibits a significant (False Discover Rate (FDR) < 0.00059) prognostic impact in AML patients (Fig. 7B). Next, we interrogated the regulatory sites captured by POP-seq in the genomic loci of RP11 − 301G19.1 and observed a consistent occurrence of peaks across all POP-seq protocols. Our results demonstrate that majority of the POPseq peaks overlapped with the eCLIP profile of multiple RBPs ( Figure S5) which further supports the "sponge effect" for RP11 − 301G19.1. A subset of highly expressed RBPs in K562 cells illustrates a general agreement of their eCLIP profile with POP-seq peaks in RP11 − 301G19.1 gene (Fig. 7C). To further elaborate the sponge effect, we investigated a well-studied lncRNA MALAT1 which has been shown to interact with numerous RBPs [84][85][86][87][88] . As illustrated in Fig. 7D, we observed that POP-seq captured peaks were sparsely distributed along the length of MALAT1 in contrast to the fairly uniform distribution of total RNA-seq reads showing higher expression of MALAT1 in K562 cells. We also included a track of ENCODE eCLIP peaks for RNA binding proteins HNRNPA1 and PTBP1 with known binding to MALAT1 84 shown in Fig. 7D. The results suggest that HNRNPA1 and PTBP1 are most likely being sponged by highly expressed lncRNA MALAT1 in K562 cells. Additionally, we also observed that other RBPs with transcriptome wide interaction maps available from ENCODE project exhibited several binding sites overlapping with POP-seq peaks along the length of the MALAT1 and NEAT1 lncRNAs (Supplementary Figure S6A and B). Overall, the results suggest that POP-seq can capture the occupancy sites of RBPs on lncRNAs and thus advance our understanding of lncRNA regulation in diseases.

Discussion
Protein-RNA interaction is a vital phenomenon regulating crucial transcriptional and post-transcriptional processes starting from intercalation of the DNA-RNA juncture to RNA metabolism, translation and decay 89,90 . RNA binding proteins have been widely recognized as the key regulatory proteins for these processes 91 . Although the past few decades have seen a surge in the number of methods for capturing protein-RNA interaction sites occupied by RBPs on a transcriptome-wide scale 18,19,21 . Majority of these protocols employ UV or formaldehyde cross linking and poly-A pull followed by sequencing of RNA. However, this excludes their application to non-polyadenylated RNAs and so far, a systematic analysis to depict the effect of crosslinking on the captured interaction sites has not been investigated.
In this study, we present POP-seq to assess the protein bound RNA fragments using trizol based phase separation. POP-seq can uniquely map the protein bound RNA pockets in a transcriptome wide manner. We implemented three versions of POP-seq to generate an unbiased profile of protein-RNA interactions in K562 cells. We demonstrated that POP-seq captured peaks generally agreed with the eCLIP profile of several RBPs. The abundance of protein-RNA interactions captured by POP-seq was mostly observed in exonic followed by intronic regions, with consistent overlap across the POP-seq protocols in K562 cells. Further analysis of publicly available CRISPR KO dataset of RBPs from ENCODE project 92 , illustrate that POP-seq can capture the functionally relevant protein bound pockets implying that the dysregulation of exons proximal to the functional binding site occur in RBP dependent manner. POP-seq also enables the identification of clinically relevant somatic variations associated to leukemia. Further, POP-seq provides a comprehensive evidence of the potential protein binding sites in most of the regulatory gene families such as TFs, RBPs and lncRNAs.
Since ribosomal protein complexes constitute the basic translation machinery and hence are expected to be highly abundant in cells. To evaluate the prevalence of ribosomal protein occupied sites in our data, we compared POP-seq peaks with publicly available ribo-seq data 93 in K562 cells. We observed that ~ 20% of the total POP-seq peaks (with peak length ≤ 50 bp) exhibited 50% end-to-end overlap with the ribo-seq peaks and this fraction was significantly reduced with increase in % peak overlap as shown in supplementary Table S3. These observations suggest that a fraction of POP-seq peaks correspond to the regions occupied by ribosomal complexes indicating a potential for further optimization of the protocol to enhance the stringency for selectively capturing sites occupied by regulatory RBPs under physiological conditions. Interestingly, POP-seq provides evidence for the 'sponge effect' depicted by lncRNA on multiple RBPs thus advancing our understanding of the post-transcriptional regulatory mechanism controlled by lncRNAs interacting with RBPs in cancer. In summary, POP-seq is a cost-effective and robust approach to elucidate the binding sites of proteins in a transcriptome-wide manner. Thus, it should stand as a generic framework for mapping the global protein-RNA interactions, widening the scope and application of this technique to primary tissues for rapid profiling of protein occupancies.

Materials and methods
Cell culture. K562 cells were obtained from the American Type Culture Collection (ATCC). Cells were cultured in Dulbecco's minimal essential medium (DMEM, Gibco) supplemented with 10% heat-inactivated fetal bovine serum (FBS, Atlanta Biologicals) along with 1% antibiotics (penicillin 5000 Units/ml, Streptomycin 5000 μg/ml). All cells were maintained at 37 °C and 5% CO 2 in a humidified incubator and fresh media was replenished every alternate day until confluent.
Crosslinking. Cells were cultured in T-175 flask until a maximum of 90% confluency was reached. A total of 20 million cells per replicate of each sample were used for UV, formaldehyde, and no-crosslinking approaches. Cells were washed twice with 1X PBS, the supernatant was removed by pipetting and cells were resuspended in 1X PBS. For UV crosslinking, cells in PBS suspension were transferred to 100 mm dishes and UV irradiated at 254 nm with 400 mJ/cm 2 dosage (UV Stratalinker 1800). Immediately after crosslinking, cells were collected in 15 ml tubes and pelleted at 1500 rpm for 5 min. Supernatant was discarded and cells were lysed in 1 ml trizol reagent (Life Technologies) by pipetting up and down to obtain a homogenous cell lysate. For Formaldehyde crosslinking, cells were first washed in 1X PBS twice until all the media is removed. Next, cells were resuspended in 30 ml PBS and crosslinked with 0.5% formaldehyde for 10 min by gentle shaking at room temperature. To stop crosslinking, 1 M Glycine was added to the cell suspension for 5 min by gentle shaking at room temperature. Cells were pelleted down at 1500 rpm for 5 min, lysed in 1 ml trizol and homogenized by pipetting up and down. For non-crosslinked samples, cells were pelleted down, washed in 1X PBS and immediately lysed in trizol for phase separation.
Guanidinium thiocyanate-phenol-chloroform (TRIZOL) extraction. Trizol lysed cells were incubated at room temperature for 5 min to dissociate the weak RNA-protein interactions. Phase separation was achieved by adding 200 µl chloroform and thoroughly mixed by vortexing, followed by incubation at room temperature for 5 min. Samples were then centrifuged at 12,000g for 10 min at 4 °C to obtain three phases: aqueous phase (top), interphase (middle) and organic phase (bottom). The aqueous layer was discarded by pipetting and the organic layer was discarded by passing the tip through the interphase leaving behind up to 100 µl of the organic layer. The interphase was resolubilized in trizol followed by phase separation with chloroform three times. After the third phase separation, the interphase was precipitated by adding 1 ml methanol, spun down to remove supernatant containing methanol.
POP-seq strategy. Following the trizol lysis and phase separation, POP-seq was implemented on the three versions in replicates. Interphase pellet was subjected to RNase A/T1 (Thermo Scientific) degradation in RNase buffer (10 mM Tris-HCl, pH 7.5, 300 mM NaCl and 5 mM EDTA, pH 7.5). 2 µg of RNase A/T1 mix was added to the interphase pellet, mixed by pipetting, and incubated at 37 °C for 1 h. Interphase-RNase mixture was resolubilized in trizol reagent to recover the RNA-protein complexes. The aqueous and organic layers were discarded as described previously. Interphase pellet was precipitated in 1 ml methanol. Next, the interphase was mixed with Proteinase K (Ambion) in appropriate buffer (0.1 M NaCl, 10 mM Tris-HCl, pH 8, 1 mM EDTA, 0.5% SDS and 200 µg/ml proteinase K) and incubated at 50 °C for 2 h. After proteinase K digestion, free RNA was recovered from the aqueous layer by trizol extraction as described previously.
Purified RNA concentration was estimated using Nanodrop and up to 1 µg of RNA was incubated with DNase I, 1U (Thermo Scientific) at 37 °C for 30 min to remove any traces of DNA contamination. 1 µl of 50 mM EDTA was added to the reaction mixture and incubated at 65 °C for 10 min to terminate the reaction. RNA was purified using trizol extraction from the aqueous layer. At this point, r-RNA depletion was performed with 1 µg input RNA using Ribo-cop kit (Lexogen) as per manufacturer instructions. Further, the ends of r-RNA depleted RNA were modified by treating with Calf intestine alkaline phosphatase (CIAP, Invitrogen) and T4 polynucleotide kinase (T4 PNK, Thermo Scientific) as per manufacturer protocol. The end modification enabled the library preparation of these RNA fragments.
RNA integrity, library preparation and sequencing. RNA purity and concentration were assessed at each step using Nanodrop, based on the absorbance ratio 260/280 > 2. RNA integrity was evaluated using Agilent 2100 bioanalyzer system. At least 50 ng of r-RNA depleted RNA was used to generate sequencing libraries using the True-seq small RNA library prep kit (Illumina). All libraries were barcoded and sequenced in parallel on a Next-seq platform for 400 million reads to obtain 75 bp single end reads.
Data processing and statistical analysis of POP-seq peaks. We implemented NGS data processing pipeline to facilitate the analysis of the POP-seq data as shown in Fig. 2A. Firstly, we investigated the quality of sequenced reads using FASTQC (http://www.bioin forma tics.babra ham.ac.uk/proje cts/fastq c/) and deployed FASTX-toolkit (http://hanno nlab.cshl.edu/fastx _toolk it/) for removal of adapters and low quality read fragments wherever applicable. Next, we aligned the high quality reads onto human reference genome (GRCh38. p12) using HISAT 94 followed by post processing using Samtools 95 . To ensure the reproducibility between the replicates, we compared the aligned reads per 10 kb genome using deepTools 'plot correlation' module 26 . We employed Piranha 96 for peak calling with default parameters and obtained the resulting POP-seq peaks in bed format. Source code of the POP-seq data processing pipeline is accessible at GitHub (https ://githu b.com/Janga -Lab/POP-seq). We merged the replicate bed files of respective POP-seq protocols and used several tools such as bedtools 97 , HOMER 29 (annotatePeaks.pl), and R (https ://www.r-proje ct.org/) for annotation, statistical testing and other downstream analysis. We also downloaded the publicly available formaldehyde RNA ImmunoPrecipi- www.nature.com/scientificreports/ tation (fRIP-seq) data 32 for 24 RBPs from GSE67963 and raw ribo-seq data 93 from GSE125218, both generated in K562 cells and processed them using the same pipeline and identified the peaks. Next, we computed the fraction of POP-seq peaks overlapping with the identified peaks from both the datasets independently using bedtools 97 . Similarly, we downloaded the eCLIP 98 profiles of 97 RBPs from ENCODE project 28 and concatenated the unique coordinates into a bed file. Resulting coordinates of the binding sites of RBPs were compared with the POP-seq peaks using bedtools 97 . All POP-seq data have been deposited under Gene Expression Omnibus (GEO) accession number GSE142460. We also downloaded the raw FATSQ reads of total RNA (accession no. ENCLB822JYE from ENCODE project) and processed using the standard NGS pipeline as described in below section.
Comparison of POP-seq peaks with publicly available CLIP-seq and ChIP-seq data. We downloaded the ChIP-seq and CLIP-seq profile of available proteins for K562 cells from ENCODE project 28 . We computed the overlap of POP-seq peaks with CLIP-seq peaks and ChIP-seq peaks using bedtools 97 . This dataset also includes 18 proteins for which both ChIP and CLIP-seq data is available for K562 cells. We generated 5 random peak profiles for each POP-seq protocols using bedtools 'shuffle' function. Each random peak profile contains equal but "non-peak" locations (with consistent peak width distribution), within the gene boundary. Then, we compared the POP-seq derived peaks (and the random peaks separately) with the ChIP and CLIP-seq profile of 18 proteins. We employed Fisher's exact test to estimate the level of significance for the enrichment of POP-seq in protein bound DNA(ChIP-seq)/RNA(CLIP-seq) locations compared to non-peaks.
Integrated data analysis of POP-seq peaks with the CRISPR knock out studies. CRISPR/ Cas9 Knock Down (KD) followed by expression profiling on several RBPs have been conducted as part of the ENCODE project 28 to facilitate the understanding of downstream biological processes associated to loss of function of the respective RBP. We downloaded the raw RNA-sequencing dataset of CRISPR experiments of DGCR8 (n = 6), IGF2BP1 (n = 2) where gRNAs were used to deplete the functional form of RBPs and their non-specific CRISPR control (n = 8) in K562 cells 92 . We processed the raw sequencing reads using standard NGS data analysis pipeline (as described previously). Briefly, we filtered the low-quality reads (Phred Score < 30) using FASTQC tool and aligned them onto human reference genome (hg38) using HISAT 94 . After post processing (using samtools), we used StringTie 99 to quantify the expression levels in Transcripts Per Million (TPM) reads for all the genes annotated in human genome (hg38). Thereafter, we used an ad-hoc script to calculate the exon levels from the resulting gtf files of StringTie and converted the resulting files into an exon expression matrix.
To further investigate the functional relevance of protein-RNA interactions, we identified the POP-seq peaks from the individual protocols that exhibited at least 50% overlap with an eCLIP profile of the respective RBP (as described previously). Next, we extracted the expression levels of exons, proximal (< 1000 bp) to overlapped peaks, from exon expression matrix of the knock down experiments. We compared the distribution of the expression levels of these proximal exons in non-targeting control to that in respective KD and statistically examined the condition specific expression differences using Wilcoxon test 100 . Identification of somatic variants in protein-RNA interacting sites. Several studies suggest that single nucleotide variations (SNVs) play an important role in gene regulation via riboSNitches' 101 i.e. by altering RNA secondary structure or TAM (Transcript associated mutation) 102 that further contribute to transcriptome complexity in higher eukaryotes. Therefore, it is imperative to investigate the genomic variations occurring in protein-RNA interaction sites identified by POP-seq protocols. Hence, we downloaded the somatic variants reported in the GWAS catalog 67 and Ensembl Variation database 68 including phenotype and clinically associated somatic variations (ftp://ftp.ensem bl.org/pub/relea se-97/varia tion/vcf/homo_sapie ns/). In order to detect the somatic variations captured by POP-seq, we tested the enrichment of genomic variations from GWAS catalog 67 and Ensembl Variation database 68 (PhenVar, ClinVar and somatic variations) in POP-seq peaks than expected by chance (i.e. 5 random non-peak profiles) using Fisher's exact test. For this analysis, random 'non-peak' files were generated as described previously. To gain disease specific understanding of the role of SNPs in impacting protein-RNA interactions, we also investigated the POP-seq peaks overlapping with SNPs associated with leukemia (from GWAS) and generated genomic tracks in Integrative Genomics Viewer (IGV) 103 for CDC20 and HDLBP genes.
Comparative analysis of POP-seq peaks across lncRNAs and their association with lncRNA expression. We mapped the protein-RNA interaction sites captured by POP-seq protocols onto known lncRNAs using bedtools. For each lncRNA, the number of POP-seq peaks normalized by respective gene length was calculated. To obtain the expression levels, we processed the raw RNA sequencing dataset (paired end reads, n = 5, in replicates) of K562 cells from ENCODE using a standard NGS data analysis pipeline described earlier and generated a gene expression matrix. TPM values of known lncRNAs 104 were extracted from the resulting matrix and averaged for each lncRNA across the replicates. Further, we binned all the expressed lncRNAs into two groups based on their median TPM value. We compared the number of POP-seq peaks (normalized per unit length of the lncRNA) mapped to the two groups of lncRNAs categorized based on low and high median expression in K562 cells. The difference in normalized peak counts between the two groups was statistically tested using Wilcoxon test 100 .
Additionally, we downloaded the raw RNA sequencing dataset of 'whole blood' cohort from 141 individuals from the GTEx 105 and 174 AML patient samples from The Cancer Genome Atlas (TCGA) 81 . We processed the dataset using the NGS data processing pipeline, to generate expression levels for all human genes annotated in the human genome (hg38). We extracted the expression level of lncRNA-RP11 − 301G19.1 (ENSG00000227706) www.nature.com/scientificreports/