easyCLIP analysis of RNA-protein interactions incorporating absolute quantification

Quantitative criteria to identify proteins as RNA-binding proteins (RBPs) are presently lacking, as are criteria to define RBP target RNAs. Here, we develop an ultraviolet (UV) cross-linking immunoprecipitation (CLIP)-sequencing method, easyCLIP. easyCLIP provides absolute cross-link rates, as well as increased simplicity, efficiency, and capacity to visualize RNA libraries during sequencing library preparation. Measurement of >200 independent cross-link experiments across >35 proteins identifies an RNA cross-link rate threshold that distinguishes RBPs from non-RBPs and defines target RNAs as those with a complex frequency unlikely for a random protein. We apply easyCLIP to the 33 most recurrent cancer mutations across 28 RBPs, finding increased RNA binding per RBP molecule for KHDRBS2 R168C, A1CF E34K and PCBP1 L100P/Q cancer mutations. Quantitating RBP-RNA interactions can thus nominate proteins as RBPs and define the impact of specific disease-associated RBP mutations on RNA association.


RNA-seq Library Preparation
Lentivirus (pLEX-based) expressing wild-type or R168C KHDRBS2, with a uORF to lower expression, was produced as described for shRNA production. Similarly, lentivirus expressing wild-type or R429C FUBP1 with a uORF, PCBP1 wild-type with a uORF, and both PCBP1 wild-type and PCBP1 L100Q without a uORF were produced as described for shRNA production. A375 and HEK293T cells were grown in DMEM with 10% FBS.
HCT116 cells were grown in McCoy's 5A media with 10% FBS. 293T cells were sequentially infected with shRNA lentivirus (if any were used) targeting the endogenous 3'UTR, selected using Puromycin or Blasticidin for at least 3 days, then infected with lentivirus expressing protein or empty vector control (which lack the endogenous 3'UTR). HCT116 cells were infected with protein-expressing vector first, followed by shRNA, and harvested 3 days after shRNA infection without selection, as the essential nature of PCBP1 caused fluctuating expression levels with longer knock-downs. Qiagen RNeasy Mini Plus kit (Cat # 74134) was used to extract RNA, poly(A) libraries were constructed using NEBNext Ultra II RNA Library Prep Kit for Illumina, and libraries were sequenced on an Illumina NovaSeq 6000 using paired-end sequencing.

RNA-seq Analysis
RNA-seq libraries were sequenced on an Illumina NovaSeq PE150 at a depth of 25 million reads per sample. Paired-end reads were mapped to the hg38 reference genome with GRCh38 Ensembl annotations using STAR aligner 2 (version 2.5.4b), followed by generation of genes-by-samples count matrices with RSEM (version 1.3.0). BAM files generated from RSEM 3 were further analyzed with the featureCounts() function in Rsubread (v 1.32.4) to generate exons-by-samples count matrices in R 3.5.1. The genes-by-samples count matrices were further analyzed with the DESeq2 package 4 (v 1.24.0) in R 3.6.1 to calculate differential expression and associated p-values across samples. Each cell line was analyzed separately. Differential exon usage from the exons-by-samples count matrices was determined using DEXSeq 5,6 (v 1.30.0) with the recommended parameters from the tutorials available on Bioconductor. RNA-seq differential expression heatmaps were generated with the pheatmap package (v 1.0.12) in R using log2-normalized transcript counts from the normTransform() function in DESeq2. Gene ontology (GO) analysis was performed using DAVID v6.8 7 (Huang et al., 2009) on the top differentially expressed genes with adjusted p-value < 0.05 as calculated using DESeq2.

Virus infection with shRNA
Lentivirus was produced in Lenti-X 293Ts by transfecting 5 µg p8.91 vector, 1.6 µg pMDG vector, and 5 µg target vector into a >60% confluent 10 cm plate using Lipofectamine 3000 (ThermoFisher L3000015). Medium was changed after incubation overnight, and virus was harvested after two and three days of expression. The two harvests were concentrated using Lenti-X Concentrator (Takara, 631231), combined, and resuspended in 500 µL PBS. For infection, 500,000 HCT116 cells per well were seeded into 6-well plates. 30 µL virus (6% of the yield from a 10 cm plate) was added per well, followed by polybrene. Media was changed the next morning. On the third day Puromycin was added, and selection was performed for two days.

Virus infection for protein expression
Lentivirus was produced as described for shRNA production. 100,000 HCT116 cells were seeded per well of a 6-well plate, followed by 1-10 µL virus and polybrene. Puromycin was added on the third day after transfection and cells selected for two days.

Comparison with RBFOX2 eCLIP
RBFOX2 eCLIP replicates and input controls were downloaded from GEO (GSE77629) as BigWig files, which were then converted to bedgraph files, and coordinates were converted from hg19 to hg38 using liftOver. The few regions that mapped problematically during coordinate conversion were identified and excluded from comparisons with easyCLIP. eCLIP files were then converted back to BigWig. Since the eCLIP files were in reads per million, signal from easyCLIP replicate bam files was normalized to reads per million before comparison. eCLIP peaks were obtained from the published list, and we followed the authors in subsetting to peaks with SMInput-normalized p-values (log10) above 8 and CLIPper p-values (log10) above 5. 1000 random eCLIP peaks were expanded by 1000 bp on each side of the peak; signal within each region was smoothed in 200 nt windows and evaluated by Spearman correlation between replicates of easyCLIP, eCLIP and eCLIP input controls. For easyCLIP peaks, we subset to peaks with a gene-based P value (exon or intron) below 0.01. We expanded the peak position by 1000 bp on each side and subset to peaks with some position of easyCLIP signal with a read pileup density of at least 4 reads per million in that window. Spearman correlations were calculated in the same way as for eCLIP peaks.
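The window smoothing and rank correlation described above can be illustrated with a minimal Python sketch. It assumes that smoothing means averaging in consecutive, non-overlapping windows (the text does not specify the kernel), and the function names `bin_means`, `average_ranks` and `spearman` are illustrative rather than taken from the easyCLIP scripts.

```python
from math import sqrt


def bin_means(signal, window=200):
    """Average a per-nucleotide signal in consecutive, non-overlapping windows.

    Assumption: 'smoothed in 200 nt windows' is read here as simple binning.
    """
    return [
        sum(signal[i:i + window]) / len(signal[i:i + window])
        for i in range(0, len(signal), window)
    ]


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

In this sketch, two replicates would be compared by passing the binned signal of the same 2000 bp region from each replicate to `spearman`.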

Microscopy of transiently transfected cells
8-well plastic chamber slides (Lab-Tek Permanox, Sigma #C7182) were coated with 0.01% poly-L-lysine (Sigma #P4707) for 15 minutes, then washed twice with PBS before use. HCT116 cells were plated in 24-well plates and grown for at least a day before transfection. 1 µg plasmid, 1 µL Lipofectamine 3000, and 2 µL P3000 reagent were mixed together in Opti-MEM in wells of a 96-well plate, and then added to HCT116 cells growing in 24-well plates. After six hours, the media was changed. The next day cells were moved to chamber slides and allowed to grow for at least another 24 hours before imaging. Cells were washed once with PBS, then fixed for 10 minutes in 4% formaldehyde (in PBS) at room temperature, rinsed three times with PBS, and then permeabilized with PBS containing 0.5% Triton X-100 and 10% goat serum. After permeabilization, cells were stained for at least 1 hour at room temperature with HA Tag Monoclonal Antibody 16B12 conjugated to Alexa Fluor 488 (ThermoFisher #A21287) at 1:250 dilution in PBS containing 0.05% Triton X-100 and 1% goat serum. After staining, cells were washed three times with PBS containing 0.05% Triton X-100, and the slide chamber was removed. After drying the cells by aspiration, one drop of DAPI mounting solution was added to each well and a coverslip was added and sealed with acetone.

AAVS1 microscopy of PCBP1 integrants
4-well plastic chamber slides (Lab-Tek Permanox, Sigma #C6932-1PAK) were coated with 0.01% poly-L-lysine (Sigma #P4707) for 15 minutes, then washed twice with PBS, left dry for 5-30 minutes, and then either stored under PBS or used immediately. HCT116 cells were plated at <20% confluency and grown at least 24 hours before staining. Cells were washed 1-2 times with PBS, then fixed for 10 minutes in 4% formaldehyde (in PBS) at room temperature, rinsed three times with PBS, and then permeabilized with PBS containing 0.5% Triton X-100 and 10% goat serum. After permeabilization, cells were stained for 1 hour at room temperature with HA Tag Monoclonal Antibody 16B12 conjugated to Alexa Fluor 488 (ThermoFisher #A21287) at 1:200 dilution in PBS containing 0.05% Triton X-100 and 1% goat serum. After staining, cells were washed three times with PBS containing 0.05% Triton X-100, then 2-3 times in PBS without detergent, and the slide chamber was removed. After letting the cells dry for a few minutes, one drop of DAPI mounting solution was added to each well and a coverslip was added and sealed with acetone.

AAVS1 integrated FHH-tagged protein purification
15 µL anti-HA magnetic beads and 2-4 mg clarified lysate were used per immunopurification. Immunopurifications were carried out at 4° for 1 hour in 1 mL of CLIP lysis buffer.

GST-tagged protein constructs
pGEX-6P-1 vector was digested with BamHI and CSRP1-FLAG-HA was cloned in using In-Fusion (Takara). Amplification primers for CSRP1-FLAG-HA were:
A second construct, GST-FLAG-HA-HIS-CSRP1 (GST-FHH-CSRP1), was created in order to move the HA tag into the interior of the protein so that degradation of the protein at the ends could not lead to confusion. The resulting 461 amino acid (51585 Da) construct is below, with the FHH tag in bold and CSRP1 underlined:
GST-FHH-CSRP1 was characterized and employed in the same way as GST-CSRP1-FLAG-HA.
The GST-hnRNP C construct (54 kDa) was cloned into the same vector but did not include HA or FLAG tags. The resulting sequence is below:

GST-tagged protein purification
E. coli BL21 cultures transformed with pGEX-6P-1 were grown in 500 mL at 37° until OD600 ~0.8, at which time isopropyl-1-thio-β-D-galactopyranoside (IPTG) was added to a final concentration of 0.5 mM, and cultures were grown for another ~1.5 h before harvesting. Cells were harvested by the method of S. Harper et al. 8, namely centrifuging at 4,000 rcf for 20 min at 4°, resuspending in ~50 mL LB, and centrifuging again at 4,000 rcf for 20 min at 4°. Cell pellets were frozen in dry ice until purification. When thawed, the cell pellet was resuspended in 20 mL of lysis buffer (50 mM Tris pH 8.0, 10 mM β-mercaptoethanol, 50 mM NaCl, 5 mM EDTA, 1% Triton X-100, Roche protease inhibitor, 5% glycerol). Lysozyme was added to approximately 1 mg/mL, the pellet was frozen again in dry ice, then thawed in a water bath and lysed by sonication. The lysate was clarified by centrifugation at ~21,000 rcf, 4°, for 15 min. 4 mL of 50% glutathione-agarose (Pierce) was washed with resin wash buffer (Dulbecco PBS with 10 mM β-mercaptoethanol), and then incubated at 4° in a 50 mL Falcon tube with clarified lysate for ~30 min before loading on a column. The column was washed with 50 mL of 4° wash buffer (Dulbecco PBS with 10 mM β-mercaptoethanol, 5% glycerol and Roche protease inhibitor). Samples were eluted in batch with three incubations at 4° with 1.5-2 mL elution buffer (100 mM Tris pH 8.0, 150 mM NaCl, 10 mM β-mercaptoethanol, 5% glycerol, 10 mM glutathione).

GST-tagged protein quantification
Following the method of K. Janes 9, BSA standards were run on a gel at 10, 5, 2.5, 1.3, 0.6, 0.3, and 0.15 µg, along with purified protein. Following the method of S. Luo et al. 10, gels were washed for 10 minutes in water, stained for 10 minutes with staining buffer (50% methanol, 10% acetic acid, 0.02% Coomassie R250) at room temperature, destained for 10 minutes with destaining buffer (40% methanol, 7% acetic acid), and washed twice for 10 minutes with water. A third wash was performed overnight. Protein was then visualized by scanning the 700 nm channel on a LI-COR Odyssey scanner. A hyperbolic curve of band fluorescence vs input protein weight was fit to the BSA standards. Specifically, the parameters 'a' and 'b' in the equation y = a*x/(b+x), where 'x' is protein weight and 'y' is fluorescence, were fit using least-squares regression. This curve was used to determine the concentration of purified protein.
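As an illustration of the curve fit, the following minimal Python sketch fits y = a*x/(b+x) by least squares and inverts the curve to estimate protein weight from band fluorescence. The fitting routine used by the authors is not stated; this sketch assumes a simple approach in which 'a' is solved in closed form for each trial 'b' and 'b' is found by a golden-section search on a log scale. The function names `fit_hyperbola` and `estimate_weight` are hypothetical.

```python
from math import exp, log, sqrt


def fit_hyperbola(weights, signals, iterations=200):
    """Least-squares fit of y = a*x/(b+x); returns (a, b).

    Assumes all weights are positive and the squared error has a single
    minimum in b within the searched range.
    """
    def best_a(b):
        # For fixed b the model is linear in a, so a has a closed form.
        f = [x / (b + x) for x in weights]
        return sum(y * fi for y, fi in zip(signals, f)) / sum(fi * fi for fi in f)

    def sse(log_b):
        b = exp(log_b)
        a = best_a(b)
        return sum((y - a * x / (b + x)) ** 2 for x, y in zip(weights, signals))

    # Search log(b) over a wide range around the standards (an assumption).
    lo = log(min(weights) / 1000.0)
    hi = log(max(weights) * 1000.0)
    ratio = (sqrt(5) - 1) / 2
    for _ in range(iterations):
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if sse(c) < sse(d):
            hi = d
        else:
            lo = c
    b = exp((lo + hi) / 2)
    return best_a(b), b


def estimate_weight(a, b, signal):
    """Invert y = a*x/(b+x) to get x = b*y/(a-y); valid only for 0 <= y < a."""
    if not 0 <= signal < a:
        raise ValueError("signal must lie below the fitted plateau 'a'")
    return b * signal / (a - signal)
```

Because the curve saturates at 'a', estimates become unstable as the measured fluorescence approaches the plateau, which is one reason to keep the sample within the range of the standards.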

BCA
For BSA standards, 105 µL PBS was combined with 20 µL BSA (2 mg/mL stock) and 3 µL lysis buffer for the highest concentration of BSA, and 115 µL PBS, 10 µL BSA, and 3 µL lysis buffer for the second highest concentration. For lysate samples, 3 µL lysate was combined with 125 µL PBS. For both standards and samples, serial dilutions were made by a factor of three into PBS with 0.024% lysis buffer. Duplicate wells were used for each sample. 25 µL of each well was transferred to a second 96-well plate and combined with 200 µL working reagent (Pierce BCA kit, 50:1 A:B). Plate was incubated for 20-30 minutes at 37°. Absorbance was measured at 562 nm.
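The standard concentrations implied by these volumes can be worked out with a short Python sketch. It assumes volumes are additive and treats the lysis buffer as part of the diluent; the number of dilution steps is not given in the text, so `steps` is a placeholder, and both function names are hypothetical.

```python
def diluted_concentration(stock_conc, stock_volume, diluent_volume):
    """Concentration after mixing stock with diluent (same volume units)."""
    return stock_conc * stock_volume / (stock_volume + diluent_volume)


def dilution_series(start_conc, steps, factor=3.0):
    """Concentrations of a serial dilution, starting with the undiluted mix."""
    return [start_conc / factor ** i for i in range(steps)]


# Highest standard: 20 µL of 2 mg/mL BSA in 105 µL PBS + 3 µL lysis buffer.
highest = diluted_concentration(2.0, 20, 105 + 3)
# Second-highest standard: 10 µL BSA in 115 µL PBS + 3 µL lysis buffer.
second = diluted_concentration(2.0, 10, 115 + 3)
# 'steps=4' is an arbitrary illustrative value, not taken from the protocol.
series = dilution_series(highest, steps=4)
```

Both top standards have a total volume of 128 µL, so the lysis buffer is present at the same proportion as in the lysate samples.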

FHH-hnRNP C F54A comparison
Tagged FHH-hnRNP C F54A could only be compared with FHH-hnRNP C by minimal region RNA because both purify the endogenous hnRNP C, which is heavily cross-linked in either case.

Histograms of binding frequency
For each protein, RNAs with no reads were removed before determining the histogram (hence the leftmost bin varies by dataset size). RNAs that would be placed outside the rightmost bin were placed in the rightmost bin.
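A minimal Python sketch of this binning rule follows. The bin edges are supplied by the caller, since the actual edges are not given here; the sketch assumes the first edge lies at or below the smallest non-zero value, and the function name `binding_histogram` is hypothetical.

```python
from bisect import bisect_right


def binding_histogram(values, edges):
    """Count values per bin, dropping zeros and clamping high values.

    edges: strictly ascending bin edges defining len(edges) - 1 bins.
    Values at or beyond the last edge are placed in the rightmost bin.
    """
    if len(edges) < 2 or any(b <= a for a, b in zip(edges, edges[1:])):
        raise ValueError("edges must be strictly ascending")
    counts = [0] * (len(edges) - 1)
    for value in values:
        if value == 0:
            continue  # RNAs with no reads are excluded
        if value < edges[0]:
            raise ValueError("value lies below the leftmost bin edge")
        index = min(bisect_right(edges, value) - 1, len(counts) - 1)
        counts[index] += 1
    return counts
```

For example, with edges 1, 2, 4 and 8, a value of 100 is counted in the last bin rather than discarded.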
6.4 µL 2 M KCl was added to proteinase K-digested samples, and SDS was precipitated on ice for 15 minutes. SDS was spun out at 16,000 rcf for 10 minutes. The prepared Streptavidin Dynabeads with 10 pmol biotin-anti-L5 RNA oligonucleotide in 50 µL BIB were then added to PK reactions and diluted to a total volume of 1 mL with BIB. The purification was carried out at 4° for 20 minutes. Beads were washed three times with BIB, twice with PBS, and eluted for 2 minutes at 95° in 15-20 µL water with 100 nM biotin.
10X NT2 was added to 1X final concentration, and PEG to 16% final concentration. 1 µL 100 U/µL RNAse ONE was added and samples incubated for 40 minutes at 37°. RNAse ONE was inactivated by adding 10% SDS to 0.1%. Shift buffer was added to 1X (25 mM Tris pH 7.5, 10 mM MgCl2, and 16% PEG400). 300-400 fmol labelled antisense oligos were added and samples were processed further as described for the ligation efficiency test by anti-sense oligo shift.

Recurrent missense mutations in RBPs
A few proteins were left off Fig 1D because we did not obtain data on them (e.g., BCLAF1).
Repetitive elements were handled in two ways: "repeats-first" or "separate". The details of each approach are described in github.com/dfporter/easyCLIP/README_genome.md.
For "repeats-first" mapping, an alignment file was downloaded from http://www.repeatmasker.org/. This was parsed to extract representatives, which were placed in an artificial chromosome separated by poly(N), and a gtf file for each representative was generated. A STAR index was built with --genomeSAindexNbases 5.
The parameter genomeSAindexNbases must be set well below the default of 14 or building will be very slow. When mapping to the repeats chromosome, --alignIntronMax 1 was used to prevent the insertion of introns by STAR. For "repeats-first" mapping, reads were first mapped to a custom-built chromosome of repetitive elements using STAR and "--alignEndsType EndToEnd". Unmapped reads from this stage were then mapped to the regular genome using default parameters. Reads mapping to the genome were filtered to remove multimapping reads and MAPQ < 10 reads.
For "separate" mapping, the method from RepEnrich2 was used 11, specifically RepEnrich2 from github.com/nerettilab/RepEnrich2. The RepEnrich2 method maps every read to a bowtie2 genome composed of the genomic instances of each type of repeat. All reads were mapped using the RepEnrich2 method and, separately, using STAR to the genome in the same manner as "repeats-first". Reads mapping to the genome were filtered to remove multimapping reads and MAPQ < 10 reads. After mapping, reads that mapped, via RepEnrich2, to rRNA, scRNA, snRNA, or tRNA were assigned to those elements (in that priority order). Reads not mapping to those elements, if they mapped uniquely to the genome by STAR, were assigned to the genome. Those reads not mapping uniquely to the genome, but which mapped via RepEnrich2 to an element other than the priority ncRNA (rRNA/scRNA/snRNA/tRNA), were then assigned to a repetitive element in a priority based on the number of instances of the given repeat element class in the genome. The "separate" mapping was used in general, with some exceptions, including Fig. 2J-K, 6C-D, 7G and biotype analysis.
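The assignment order for "separate" mapping can be summarized in a short Python sketch. This is not the authors' code: the function name `assign_read` and its inputs are hypothetical, and because the text does not state whether repeat classes with more or fewer genomic instances take priority, the sketch assumes that classes with more instances win.

```python
PRIORITY_NCRNA = ("rRNA", "scRNA", "snRNA", "tRNA")


def assign_read(repeat_classes, unique_genome_locus, instance_counts):
    """Assign one read following the 'separate' mapping priority.

    repeat_classes: set of repeat classes the read mapped to via RepEnrich2.
    unique_genome_locus: locus of a unique, filtered STAR alignment, or None.
    instance_counts: dict of repeat class -> number of genomic instances.
    """
    # 1. Priority ncRNA classes, in the stated order.
    for ncrna in PRIORITY_NCRNA:
        if ncrna in repeat_classes:
            return ("repeat", ncrna)
    # 2. Unique genomic alignment.
    if unique_genome_locus is not None:
        return ("genome", unique_genome_locus)
    # 3. Any other repeat class, ranked by instance count.
    #    Assumption: the class with the most genomic instances is chosen.
    others = [c for c in repeat_classes if c not in PRIORITY_NCRNA]
    if others:
        best = max(sorted(others), key=lambda c: instance_counts.get(c, 0))
        return ("repeat", best)
    return None  # read is left unassigned
```

Sorting the candidate classes before taking the maximum makes ties resolve alphabetically, so the result does not depend on set iteration order.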

CLIP analysis: read processing
Custom Python scripts (github.com/dfporter/easyCLIP) were used for all analysis. Raw fastq files were split by L5 and L3 barcodes, allowing a one-nucleotide mismatch to the expected barcodes. Mapping results from repetitive elements and the genome were combined, read mates removed, results converted to BED format, and PCR duplicates removed using the random hexamer UMI on the L5 adapter. Software packages samtools (v 1.1) and bedtools (v 2.27.1) were used during CLIP analysis.
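The two filtering ideas in this step, barcode matching with one tolerated mismatch and UMI-based duplicate removal, can be sketched in Python as follows. This is an illustration rather than the code in the easyCLIP repository: the function names and the read dictionary layout are hypothetical, and the sketch assumes that reads sharing mapped coordinates, strand and UMI are PCR duplicates.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, expected_barcodes, max_mismatches=1):
    """Return the single expected barcode within the mismatch limit, else None.

    Reads matching zero barcodes, or more than one, are left unassigned.
    """
    hits = [
        barcode for barcode in expected_barcodes
        if len(barcode) == len(observed)
        and hamming(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def remove_pcr_duplicates(reads):
    """Keep the first read for each (chrom, start, end, strand, UMI) key.

    Assumption: a duplicate is a read with identical coordinates and UMI.
    """
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["end"],
               read["strand"], read["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept
```

Rejecting reads that match more than one barcode keeps samples from being cross-assigned when two expected barcodes differ at only one or two positions.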

CLIP analysis: read assignment
If reads mapped to multiple RNAs, but only one was an exon, reads were assigned to the exon. If reads overlapped with the exons of multiple RNAs, the reads were considered ambiguous. The strand was ignored for repetitive elements. Only transcripts with a "transcript_support_level" tag of "1" or "NA" (the latter is used for ncRNA) in the genomic annotation GTF were used. If a gene had multiple transcripts after filtering, the longest transcript (as in the longest genomic distance between the beginning of the first exon and the end of the last exon) was used.
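The transcript selection rule can be illustrated with a minimal Python sketch. The data layout (a dictionary with `tsl` and `exons` keys) and the function name `select_transcript` are hypothetical and stand in for however the annotation GTF is parsed in practice.

```python
def select_transcript(transcripts):
    """Pick one transcript per gene following the filtering rule above.

    transcripts: list of dicts with
      'tsl'   - transcript_support_level tag as a string, e.g. '1' or 'NA'
      'exons' - list of (start, end) genomic coordinates
    Returns the supported transcript with the longest genomic span,
    or None if no transcript passes the support-level filter.
    """
    supported = [t for t in transcripts if t["tsl"] in ("1", "NA")]
    if not supported:
        return None

    def genomic_span(transcript):
        # Distance from the start of the first exon to the end of the last.
        starts = [start for start, _ in transcript["exons"]]
        ends = [end for _, end in transcript["exons"]]
        return max(ends) - min(starts)

    return max(supported, key=genomic_span)
```

Note that the span is measured on the genome, so a transcript with long introns can be chosen over one with more exonic sequence.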

CLIP analysis: EdgeR
EdgeR (v. 3.30.0) was run to compare the wild-type and mutant forms of RBPs. The design was "model.matrix(~batch+group)", where group denotes wild-type or mutant, and batch denotes samples processed together. The functions glmQLFit and glmQLFTest were run with the default parameters, and outputs are in Supplementary Data 6.

FBL normalized snoRNA binding
For viewing FBL binding to an average snoRNA (Fig. 2K), cross-link locations were defined as the sites of deletions. Frequencies were given as fractions of the nucleotide in the normalized snoRNA with the highest deletion frequency.

Most GENIE sequencing was targeted at high-priority cancer-associated genes, resulting in many fewer recurrent missense mutations for RBPs compared to TCGA data. SMAD4, SF3B1, U2AF1 and EIF1AX identify the same mutation as the TCGA data as the highest, or one of the highest, frequency mutations. Only a handful of PCBP1 missense mutations were identified in GENIE data, but L100 mutations were the only recurrent mutations observed (2 patients). The TCGA recurrent mutations in DICER1, FUBP1 and DDX3X were also observed in GENIE data, but only FUBP1 R429C was recurrent (4 patients) and was not the most prominent FUBP1 missense mutation.

figure 3. Comparison of easyCLIP to eCLIP. a) The comparison used the same amount of the same anti-RBFOX2 antibody, the same cell line, and the same number of cells to perform easyCLIP on RBFOX2. eCLIP produced 72 fmol of library after 16 PCR cycles per replicate, as reported 1; easyCLIP produced ~13,000 fmol of library after the same number of cycles per replicate (n=3, extrapolating from PCR amplification of 16% of RT reactions). E.L. Van Nostrand et al. note that at 100% PCR efficiency their largest replicate would reach 100 fmol after 13 PCR cycles 1. Dividing 100 fmol by 2^13 gives an initial library size of 12 amol for eCLIP (7 million molecules) and a PCR efficiency of 86%. The subsequent information on RBFOX2 mapping in E.L. Van Nostrand et al. 1 may not have come from this benchmark sample, as the authors report 85% unique reads at 20 million reads sequencing depth, which appears impossible with a starting library of 7 million. eCLIP performed a size selection on the amplified library before sequencing, so the fraction of the input 12 amol that was usable is unknown.
This easyCLIP sample did not undergo size selection before sequencing, resulting in many inserts too small to map, but 16% of reads were mappable. If easyCLIP PCR was 96% efficient (vs 86% for eCLIP), the starting pool would still be 370 amol. RBFOX2 data was obtained without substantial optimization (three RNAse concentrations were tried), suggesting RBFOX2 represents a typical case rather than an optimal one. b) Spearman correlations of read density within 1000 nt of an RBFOX2 eCLIP peak for easyCLIP RBFOX2, eCLIP RBFOX2, and eCLIP input controls. c) Same as panel B, but for a random 1000 peak subset of easyCLIP RBFOX2 peaks, limiting to one easyCLIP peak per gene, with peaks defined relative to randomly chosen non-RBPs. d) The fraction of reads mapping to the genome for each set of CLIP-seq replicates, after short inserts were removed (A1CF and KHDRBS2 n=4; PCBP1, CELF1, SF3B1 and HNRNPC n=2; others n=3). Data, mean ± s.d. e) Unique mapped reads. All data was obtained from 293T cells, except that PCBP1 data was obtained from the colon cancer cell line HCT116. Cellular inputs ranged from below 10 million cells (hnRNP C, exact number not recorded), to 10 million (one RBFOX2 replicate), to 20 million (two RBFOX2 replicates), to a maximum of a 15 cm plate. RBFOX2, FBL, and hnRNP C libraries were obtained using antibodies to the endogenous proteins; the others were obtained from FLAG tag purifications from constructs either integrated at the AAVS1 locus (PCBP1) or transiently over-expressed from a vector (the others). f) The average read length for the indicated datasets (n=10,000 reads randomly selected from fastq). HNRNPC and RBFOX2 libraries were digested more than would have been optimal. Boxplots show quartiles, center line shows the median, and whiskers show the maxima and minima, except that values beyond 1.5 times the interquartile range are plotted individually.

HA-hnRNP D is the p45 isoform. Experiments were performed once.
b) PCBP1 WT and mutant forms were integrated into HCT116 cells using an AAVS1 safe harbor locus. ∆KH2 PCBP1 lacks the second KH domain, so it runs at a lower molecular weight, but a second form, possibly a dimer, also appears (∆KH2-b

figure 11. Quantification of purified recombinant protein and its application to absolute quantitation of immunopurified protein in CLIP. FHH: Flag-HA-His tag. IB: immunoblot. a) Quantification of immunopurified endogenous hnRNP C using a GST-hnRNP C standard. The gel is a western blot probed with antibodies to hnRNP C. Endogenous hnRNP C is smaller than GST-hnRNP C but is shown at the same vertical position in this panel as GST-hnRNP C for visualization. In the graph, black dots represent GST-hnRNP C standards, the blue line is a best-fit hyperbolic curve, and the red dot is immunopurified endogenous hnRNP C. b) Quantification of purified GST-hnRNP C expressed in E. coli. GST-tagged hnRNP C was purified from E. coli using glutathione resin, and then run next to a standard curve of BSA protein on an SDS-PAGE gel. The gel was stained with Coomassie and fluorescence was measured at 700 nm. In the graph, black dots represent BSA standards, the dotted line is a fitted hyperbolic curve, and the red dot represents the purified GST-hnRNP C, its position on the y-axis determined from the standard curve. One graph is focused on the lower quantities of GST-hnRNP C, while the other is the same graph zoomed out to include all standards. c) Quantification of GST-hnRNP C using a tryptophan-reactive dye (Bio-Rad Stain-Free Gel). The gel was subsequently stained with Coomassie to confirm that Coomassie staining of GST-hnRNP C and BSA was not biased. d) Coomassie quantification of purified, recombinant GST-FLAG-HA-His-CSRP2 (GST-FHH-CSRP2), the HA standard. CSRP2 was used in this construct because this fusion protein purifies in very high quantities. The hyperbolic curve fit is as in panel B. For panels a-d, experiments were performed at least twice.
e) Quantification of GST-FHH-CSRP2 using a tryptophan-reactive dye to test for a bias in Coomassie staining of the HA standard. No bias was observed. f) Comparison of the quantification standards for HA and hnRNP C. Dilutions of each standard were run on the same gel and western blotted for GST. The standard curve of each protein stock was used to estimate the quantities of the other stock. The proximity of the dots to the 45° line indicates good agreement. Experiment was performed once. g) The 4F4 anti-hnRNP C antibody shows little bias between cross-linked and non-cross-linked hnRNP C. Recombinant GST-hnRNP C (made in-house) was incubated with a poly(U)10 RNA oligonucleotide (IDT) and UV cross-linked. The resulting mixture, along with GST-hnRNP C (Abnova) standards, was run on a denaturing SDS-PAGE gel and transferred to a nitrocellulose membrane for immunoblotting against hnRNP C (4F4) or GST. No significant difference between anti-GST and anti-hnRNP C antibodies in the ratio of cross-linked to non-cross-linked hnRNP C was observed. Experiment was performed once. h) Coomassie quantification of purified, recombinant FBL. Purified FBL protein (Prospec, enz-566) consisted of FBL amino acids 83-321 with an added 23 amino acid tag, and the FBL antibody (Bethyl, A303-891A) was made against an immunogen between amino acids 271-321 of FBL. As a result, the purified FBL runs faster than endogenous FBL, but both share the entire immunogen used for immunoblotting. Experiment was performed once. i) Immunoblot quantification of immunopurified FBL using the recombinant FBL visualized in panel H. Experiment was performed three times.

The choice of salt has no consistent effect. Higher PEG concentrations are better blocking agents. PEG400 and PEG8000 perform similarly as blocking agents. g) The choice of 50 mM NaCl or 10 mM MgCl2 has no effect on oligonucleotide loss during dilution (retention) or on signal per fmol.
Data is the mean ± 95% confidence interval for n=48 (left) or n=72 (right) samples. Retention samples are 24 samples serially diluted twice for 48 measurements. Using only one dilution for either panel does not affect the conclusion. h) It is safe to run DNA duplexes on 20% polyacrylamide TBE gels (NuPAGE, 12 well, ThermoFisher) at 16.7% PEG400, but higher concentrations lead to fluorescence loss in the duplex, probably due to unfolding of the DNA duplex. Data is mean ± s.d. for n=3 (3.3% and 43.3% PEG) or n=2 (others) across two experiments.

a) The IR800CW and IR680RD dyes decrease in fluorescence when tethered to the same complex. An excess of αL5 and αL3 was mixed with 50 fmol of an oligonucleotide bearing one copy each of the L5 and L3 sequences, termed the staple oligonucleotide. αL5 was paired with either labelled or unlabelled αL3 to determine the effect of tethering αL3 near αL5, and the reciprocal case was applied to αL3. Complexes were run on a TBE gel in TBEN buffer (0.5X TBE plus 50 mM NaCl) and transferred to a nylon membrane for quantification. Data is the mean ± 95% confidence interval from n=6 independent samples over 2 experiments. b) Labelled complexes always traveled higher on the gel (right panel). Each dye shifts ~6 nucleotides higher on a TBE gel. Experiment was repeated twice. c) L5 and L3 adapters were ligated together in vitro, run on a TBE-urea gel, gel extracted, purified using streptavidin beads (MyOne C1, ThermoFisher), and then eluted by the indicated method. This image shows an example of eluates dot blotted on nitrocellulose. Note the peculiar shape of formamide dots. No fluorescence is observed in buffer alone. Water+biotin elution used 100 nM biotin. Formamide elution was 95% formamide with 10 mM EDTA (as suggested by ThermoFisher, who state elution is >95% by this method). DNAse elution used an excess of DNAse I (Ambion) in the buffer supplied by the manufacturer.
d) Fluorescence quantification of the same linker-linker dimers depicted in panel A after each elution method. "TBE-urea gel" indicates fluorescence in the TBE-urea gel before extraction and streptavidin purification. Heating in water with 100 µM biotin was effectively complete, as it yielded similar L5 (700 nm) fluorescence to DNAse elution, which is likely to be complete, and similar overall fluorescence to formamide elution, which is complete according to the manufacturer (ThermoFisher). Data, mean ± s.d. for n=9 independent samples, except DNAse I n=3. e) Water, formamide and TBE-urea gels all affect relative L5/L3 fluorescence (IR680RD/IR800CW). The ratio of dye molecules is 1:1 in all cases, as all cases represent linker-linker dimers. Data, mean ± s.d. for n=9 independent samples. f) Fluorescence of the αL5 oligonucleotide in the staple-αL5-αL3 complex as a function of staple oligonucleotide quantity. Signal fits a linear model (solid line). g) Fluorescence of the αL3 oligonucleotide in the same complexes as A. Signal is again highly linear (solid line is a linear fit). h) Known concentrations of L5 and L3 adapters and staple oligonucleotide were shifted by αL5 and αL3 and fit to a linear model. As with staple oligonucleotides, data is linear: the solid line represents a perfect fit, and dashed lines represent ±3 fmol. i) Error in the estimates made in panel C. The method is reasonably accurate, with average errors around 20%. The parameters (slope and intercept) from panel C were then used to estimate oligonucleotide concentrations for ligation efficiency determinations, after applying a scaling factor based on the fluorescence of αL5/αL3 oligonucleotides in 50 fmol staple complexes. The calculation is described in github.com/dfporter/easyCLIP/doc/ in the README_fluorescence.md file.

Fig 4H) is very similar; in both cases, there is less than a 2-fold change in ligation efficiency between any two concentrations. Bars represent the mean.
Supplementary figure 15. Definition of terms and theoretical basis for study. Three possible statements are given and illustrated; we suggest only the third is likely. Cross-link rate is the binding frequency multiplied by the complex cross-link efficiency. We argue that cross-link rate is proportional to binding frequency as long as the complex cross-link efficiency is not exactly inversely related to binding frequency. There is no known mechanism to support the possibility of complex cross-link efficiency being inversely proportional to binding frequency, and we suggest it is implausible. As a result, cross-link rates are proportional to binding frequency overall, but the largely unknown nature of complex cross-link efficiencies suggests that exact levels of cross-linking should probably not be taken as proportionally exact measures of binding frequency.

Supplementary figure 18. Quantification of cross-link rates for endogenous hnRNP C by immunoblot shift. Cells were UV cross-linked, then hnRNP C was immunopurified. The change in western blot signal corresponding to monomeric hnRNP C was compared between RNAse concentrations (panels A-C). Because this change in signal is specifically for what can be collapsed with RNAse to monomeric hnRNP C, not for the un-collapsible higher molecular weight complexes spread throughout the lane, it should agree with the cross-linking number derived from dividing the RNA quantified in the minimal region by the monomeric hnRNP C signal (Figure 4C) and be lower than that derived from all RNA across the gel. Western blot quantification is complicated by the fact that absolute quantification requires protein in single bands of at least 5 ng, the narrow region of linear signal in immunoblots, and the fact that protein cross-linked to an over-digested 1-3 base fragment of RNA (~0.3-1 kDa) will run so close to un-cross-linked protein that it would not be distinct for a ~70 kDa protein 2.
a) RNAse digestion series of immunopurified hnRNP C (immunoblot, anti-hnRNP C). Experiment performed twice. b) Example replicate of +/-RNAse gels used to quantify the amount of shifted hnRNP C. Experiment performed twice. c) Quantification of the amount of shifted immunoblot signal comparing +/-RNAse gel lanes, as in panel B. The change in western blot signal was ~20%, close to the 22% cross-link number from Figure 4C. A more exact comparison was then performed, deriving the amount of hnRNP C protein dependent on both UV cross-linking and RNAse digestion by absolute quantification of a western blot (panels D-F). Data is mean ± 95% CI for n=4 samples from two experiments. d) Gel used for absolute quantification of UV- and RNAse-dependent monomeric hnRNP C signal. Experiment performed once with 3 replicates. e) Standards used for absolute quantification of gel data as in panel D. f) Quantification of the absolute amount of protein present in the bands in replicates like that in panel D. Bars represent the mean (n=3). g) The amount of hnRNP C cross-linked to RNA that is collapsible into the monomeric hnRNP C band, as determined by the absolute quantification data in panel F (n=3). This method also gave a cross-link rate of ~20%, again similar to the 22% observed in Figure 4C. It was concluded that this method of determining cross-link rates using absolute quantification of RNA and protein (Figures 2 and 3) was reasonably accurate. This verification was only possible for hnRNP C because of its very high cross-link rate and small size.

Supplementary figure 20. a) Total purified cross-linked RNA positively correlates with protein size for randomly selected non-RBPs. b) Read counts (per million reads) of the non-RBPs vs their own RNAs show that each non-RBP enriches for its respective RNA, a consequence of each non-RBP being expressed from a plasmid.
This shows each library was generated from cells over-expressing the respective protein-of-interest, despite the fact that barcodes for multiple over-expression experiments were combined after each ligation. It also shows that if you express an RNA highly, it will show up in CLIP data, regardless of the purified protein. Counts were capped at 5,000 reads-per-million for visualization. Libraries for CAPNS6 were extremely small and were not included. c) Spearman correlations of easyCLIP binding in reads-per-gene (counting exons and introns separately) for non-RBPs.

[Figure panel labels: wild-type and mutant pairs for DCP1B (Q252H), NUFIP1 (R424W), RBFOX1 (A69T), EIF1AX (G9D), YTHDC2 (E185K, E634K) and U2AF1 (S34F).]

Two-sided t-test. Boxplots show quartiles, center line shows the median, and whiskers show maximum and minimum, except for those points beyond 1.5 * interquartile range, which are plotted individually. NS: not significant. b) Scatterplot of the relation between differential binding in CLIP-seq (via EdgeR) and RNA abundance (via DESeq2) for wild-type and L100Q PCBP1. CLIP binding was determined in HCT116 cells with WT and L100Q PCBP1 integrated into the genome. Red dots represent RNAs with FDR <0.05 for both a decrease in abundance and an increase in binding by L100Q PCBP1. c) Among RNAs with FDR <0.05 for both a decrease in abundance and an increase in binding by L100Q PCBP1, a majority had their primary peak location (by maximum signal density) placed outside the 3'UTR, but 23/32 with the cell-cell adhesion GO term had a primary peak in the 3'UTR (P<4E-5 for enrichment by Fisher's exact test). This rises to 23/27 of those with a peak assigned at all (in some cases, if signal was diffuse enough, the RNA was not assigned a peak). Visual inspection showed nearly all 32 had a 3'UTR peak of some kind.