High-throughput, Efficient, and Unbiased Capture of Small RNAs from Low-input Samples for Sequencing

MicroRNAs hold great promise as biomarkers of disease. However, there are few efficient and robust methods for measuring microRNAs from low input samples. Here, we develop a high-throughput sequencing protocol that efficiently captures small RNAs while minimizing inherent biases associated with library production. The protocol is based on early barcoding such that all downstream manipulations can be performed on a pool of many samples thereby reducing reagent usage and workload. We show that the optimization of adapter concentrations along with the addition of nucleotide modifications and random nucleotides increases the efficiency of small RNA capture. We further show, using unique molecular identifiers, that stochastic capture of low input RNA rather than PCR amplification influences the biased quantitation of intermediately and lowly expressed microRNAs. Our improved method allows the processing of tens to hundreds of samples simultaneously while retaining high efficiency quantitation of microRNAs in low input samples from tissues or bodily fluids.

However, the barcode and fixed adapter ends introduces a high risk of ligation bias and, therefore, we introduced optimizations to reduce this bias. Also, we optimized methods to reduce contamination of libraries with self-ligated adapters, significantly enhancing the number of usable reads following sequencing. Finally, by introducing unique molecular identifiers (UMIs), we show that preferential PCR amplification of specific miRNA does not appear to be a significant contributor to miRNA bias. Instead, stochastic capture of miRNAs at very low input concentrations requiring high PCR cycle numbers does lead to poor and irreproducible calling of miRNA expression levels, especially of intermediately and lowly expressed miRNAs. Together, our optimized high-throughput method of library production results in the highly efficient and reproducible capture of small RNAs for downstream sequencing.

Results
Optimization of small RNA library preparation. To improve upon the original Tuschl method 18 , we included two alterations that have been shown to improve ligation efficiency in other protocols (Fig. 1A). The addition of random nucleotides has been shown to increase ligation efficiency while also reducing ligation bias 5,19 . Therefore, we introduced four random nucleotides at the 3′ end of the 5′ adapter and the 5′ end of the 3′ adapter.
In addition, we introduced 4 random nucleotides just 3′ to the barcodes to reduce risk of a fixed secondary structures. Together, these 12 random nucleotides also provide a unique barcode. The addition of polyethylene glycol at each ligation step has also been shown to improve ligation and thus was included in the protocol 5,20 . In our experience, a major contributor to sequencing reads using traditional approaches of library preparation on low RNA input samples are the adapters themselves, especially 5′ adapter, consistent with ligation of multiple consecutive copies of the adapter (Fig. 1B). Therefore, we tested decreasing concentrations of 3′ and 5′ adapters starting with the concentrations used by Tuschl et al.; 0.85 μM and 3.3 μM respectively on RNA prepared from plasma. Titration of the 3′ adapter from 0.85 μM to 0.05 μM, holding the 5′ adapter steady at the original 3.3 μM, showed an increase from 7% to 18% miRNA reads ( Supplementary Fig. S1). A titration of 5′ adapter from 3.3 μM down to 0.33 μM increased the percentage of reads mapping to miRNAs from 7% to 19% ( Supplementary Fig. S1). There is evidence from Shore 20 and others 21 that blocking modifications on one or both adapters decreases adapter dimer formation. We next asked whether a blocking amino modification on the 5′ adapter could also increase the reads mapping to miRNAs, by suppressing ligation of 5′ adapter to itself. Blocking modification increased miRNA reads from 7% to 14% at the high concentrations of adapters ( Supplementary  Fig. S1). In all cases the increase in reads mapping to mature miRNA sequences came primarily at the expense of reads mapping to adapters ( Supplementary Fig. S1). To confirm that the optimized conditions reproducibly result in a higher percentage of reads mapping to miRNAs, we prepared two library pools in parallel using either the original conditions or the optimized conditions using input RNA from two independent plasma samples (Plasma2, Plasma3) and RNA prepared from an established cell line, PC3 (Cell1). The combination of optimized adapter concentrations and 5′ modification led to a four-fold or better increase in the fraction of reads mapping to miRNAs in all samples (Fig. 1C). Analysis of the reads arising from these optimized conditions showed that the increase in mature miRNA reads again came primarily from a decrease in adapter sequences (Fig. 1D).
Once the optimal adapter ligation conditions were determined, we explored the impact of input RNA concentration on miRNA capture efficiency. Based on previous titrations using independent samples, we repeated the highest and lowest inputs in the protocol comparison experiment. We found that capture was improved such that a similar percentage of miRNAs were captured using the optimized protocol at five to ten times less input than the original protocol (50 ul/250 ul Plasma 50 ng/500 ng Cell) in all three biological samples; Plasma 2, Plasma 3 and Cell1 ( Fig. 2A), again due to a reduction in adapter contamination (Fig. 2B).
Examination of the miRNAs in the high input samples revealed that the optimized protocol identified over a hundred more miRNA species than the original protocol in each case (Plasma 2 976/844; Plasma 3 895/713; Cell1 1120/976). Thus, reduction of adapter contamination for low input RNA samples vastly improves quality of resulting small RNA libraries.
Ligation Bias. Ligation bias can lead to differential miRNA capture 10,11,21 . The inclusion of random nucleotides at the ligated ends of the adapters allowed us to evaluate this bias, including which nucleotides and nucleotide positions within ligated adapters have the greatest influence on ligation bias for any particular miRNA. When averaged across all miRNAs, there was little evidence for a strong nucleotide bias at any position (Fig. 3A). However, examination of individual miRNAs showed substantial and distinct biases for particular nucleotides (Fig. 3A). For example, when annealed to BarcodeA, the highly expressed miR-21and the lowly expressed miR-151 showed a bias for GTT at positions 5 to 7, while the intermediately expressed miR-96, showed a bias toward G at position 5 and 6 ( Fig. 3A). These biases were further influenced by the barcode (compare BarcodeA in Fig. 3A to BarcodeB in Supplementary Fig. S2). For example, with BarcodeB, miR-21 now showed a bias toward a C at position 5, miR-96 a bias for an A at position 5, and miR-151 a bias for C at position 5 ( Supplementary Fig. S2). The intermediate expressed Let-7i also showed distinct adapter biases associated with each barcode (Fig. 3A, Supplementary Fig. S2).
Next, we asked how far nucleotide biases deviated from the expected equal distribution across all four nucleotides at each position. To do this we measured cumulative divergence of each nucleotide from its expected 25% contribution at each position. Looking across all miRNAs, the greatest bias was at positions 5 and 6 ( Fig. 3B, top plot). However, once again this effect differed among individual miRNAs (Fig. 3B, lower plots). It also differed slightly between barcodes ( Supplementary Fig. S2). Importantly, libraries produced using the same input, but  Table 1). Together these data show that the usage of random nucleotides at the ligation ends of the adapter, especially the 3′ adapter, insures the unbiased and reproducible capture of all miRNAs even when using distinct barcodes to multiplex sample using the many to one sample handling strategy.

PCR Duplication.
Another potential source of bias in library production is differential PCR amplification of individual miRNAs. For example, it has been proposed that GC content can influence amplification 13,14 . Differential amplification could be further complicated by the specific pool of miRNAs existing in any one sample. Using the 12 random nucleotides in the adapter as unique molecular identifiers (UMIs), we asked whether differential PCR amplification exists and if so, how great is its impact. First, we evaluated whether the 12 random nucleotides provided a great enough diversity of sequences to cover the potential number of input miRNAs at our sequencing depth of over four million reads. If all positions are represented equally there are over 16 million combinations of the four bases at twelve positions, over 65 thousand at eight positions and 256 combinations at four positions. However, ligation bias reduces these numbers. Thus, to empirically determine how many random Ns are required to capture the full complement of input miRNAs, we evaluated the impact of increasing the number of random Ns on either side of the insert and collapsing resulting reads (Fig. 4A-C). The analysis was performed on all miRNAs combined as well as on two individual miRNAs, miR-21 and miR-96, representing a high and intermediate expressed miRNA, respectively. The miRNA counts showed a plateau starting at 10 Ns with miR-96 and 11 Ns for miR-21. Therefore, 12 Ns is predicted to capture the complete or near-complete complement of the most highly expressed miRNAs, even at a high sequencing depth. Next, to determine whether there is preferential amplification of specific miRNAs, we compared total read counts to read counts collapsed using the full 12nt UMI. This analysis showed a linear relationship between total and collapsed reads indicating equal amplification across all miRNAs at low PCR amplification cycles (Fig. 4D). However, dilution (500×) of the same samples followed by high PCR amplification cycle numbers showed a decrease in the linear relationship (Fig. 4E). Comparison of collapsed reads from the 14 and 24 cycle runs showed an even greater difference (Fig. 4F). This later finding suggests stochastic capture at low input plays a more significant role than amplification bias. To further visualize the impact of input and PCR cycle number on expression, we plotted total counts, collapsed counts, and percent unique reads for each miRNA of a sample present in three libraries differentially amplified for 14, 16, or 24 cycles (Fig. 4G-I). Sixteen cycles overlapped well with fourteen cycles whether evaluating total or collapsed reads, even though the duplication rate was more than two-fold greater at 16 cycles (approximately 55% unique reads at 14 cycles versus approximately 25% unique reads 16 cycles). In contrast, 24 cycles showed overlap for the top third of expressed miRNAs, but than a dramatic drop off and large variation for the remaining miRNAs. The overlap in the more highly expressed miRNAs was quite surprising given the extremely high duplication rate (less than 5% unique reads). These findings show that the stochastic failure to capture miRNAs in low input libraries plays a much greater role than differential amplification in influencing the accuracy of evaluating miRNA expression values.

Discussion
Here we describe an approach for analyzing the composition of miRNAs across many low-input samples in a sensitive, high-throughput and cost-effective fashion. A significant problem plaguing small RNA sequencing library production is that the adapter ligation can be inefficient, errant and/or biased resulting in sequencing data that does not accurately represent the ratios of miRNAs in the raw sample. These problems are amplified in low RNA input samples like plasma and urine 7 . Various strategies have been employed to increase efficiency of ligation including choice of ligase 10,11 , addition of PEG 5,20 , and addition of blocking modifications on either or both adapters 20 . Furthermore, a few studies have found that adapters designed with degenerate bases at the ends 18,19,21,22 or internally 19 reduce sequence bias in the ligation product. These strategies have resulted in step-wise improvements in sequencing. The method we present incorporates each of these improvements while using barcodes to allow sample pooling after the first ligation 18 . This many-to-one sample approach was further optimized to obtain miRNA expression data from human plasma samples.
There are few commercially available miRNA sequencing kits optimized for low-input RNA samples. The most commonly used kits (Illumina TruSeq, New England Biochemical's NEBNext for example) recommend 1 ug of total RNA input. Trilink Cleantag 20 and Bioo Scientific Nextflex 23 have protocols for 1 ng input. With the optimized method presented here, miRNA libraries were successfully prepared from plasma samples with RNA levels undetectable by the Thermo Scientific Nanodrop One (<1.6 ng/μL) or Agilent Bioanalyzer RNA 6000 Pico kit ( < 50 pg/μL). The 5′ and the 3′ adapters included four degenerate bases at the ligation site of each. In addition to the blocking modification already present on the 3′ adapter 18 , a blocking amino modification was added to the 5′ end of the 5′ adapter. The concentration of each adapter was then titrated using the low input plasma samples to minimize adapter dimer contamination. These changes increased the efficiency of miRNA capture in the library five-fold.
Unbiased miRNA capture across samples was achieved regardless of nucleotide preferences by particular miR-NAs. Consistent with our findings, Giraldez et al. recently showed that the addition of four degenerate bases to adapters of the TruSeq protocol greatly improved miRNA capture 22 . However, their library production strategy is based on the individual preparation of samples. Our use of the many to one sample approach greatly simplifies the building of libraries from large numbers of samples. In addition, it enables the combination of many low input samples.
Importantly, we show that increasing the number of PCR cycles to compensate for low sample input results in high duplication rates and loss of information for lowly and intermediately expressed species. Thus, combining multiple low input samples through early barcoding not only increases efficiency of library preparation, but also reduces artifacts associated with high PCR cycle number. A disadvantage of this pooled preparation is the potential for over-representation or under-representation of reads from a subset of samples. This can especially be the case when it is difficult to quantify and normalize RNA levels between samples prior to building of libraries. However, we have found it straightforward to rescue the under-represented samples, by repeating their sequencing in a separate pool with other lowly represented samples.
Our optimizations allow for robust screening of miRNAs from low-input biofluids of tens to hundreds of patients simultaneously. These methods should allow for increasingly accurate quantification of miRNAs in blood and tissue samples enabling the discovery of novel biomarkers of disease. Future improvements in RNA extraction and size selection methods will further enhance the utility of this method for small RNA biomarker discovery.

Methods
Ethics Statement. Blood samples were collected, and plasma prepared by standard clinical protocol between 2015 and 2016. Informed consent was obtained from all subjects. All samples were de-identified therefore this study was exempt from review by the UCSF Committee on Human Research. are referred to as Cell1, Cell2. Concentrations were determined by nanodrop (Thermo Fisher Scientific). Low-input RNA was collected from human plasma that was collected following standard clinical procedures using EDTA-coated tubes (Becton Dickinson). The RNA was prepared by thawing the plasma on ice, then following the micro miRNeasy kit (Qiagen 217084) with the change of adding 5x the volume of QIAzol lysis reagent to the plasma and using a final elution volume of 7 μl of RNase/DNase free water per 200 μl input plasma. Nanodrop One (Thermo Fisher Scientific), Bioanalyzer RNA 6000 Pico kit (Agilent) or Qubit (Life Technologies) analysis was employed to assess RNA quantity and quality. RNA from 200 μl of plasma was inconsistent or undetectable by any method. Biological Replicates of plasma samples from independent blood draws of healthy volunteers are referred to as Plasma1, Plasma2 and Plasma3.
Small RNA-seq. Small RNA-seq libraries were made following the general workflow of Hafner et al.
( Fig. 1A 18 ). Input RNA for the cellular sample was 500 ng total RNA while input for the plasma samples was total RNA from 200 μl plasma unless otherwise noted. Sequencing adapters were synthesized by Integrated DNA Technologies using the codes provided in SupplementaryTable 2.xlsx. 3′ adapters were polyadenylated as described in the detailed protocol available in supplementary methods. The adapter concentrations were titrated as noted in the figure legends. 3′ adapters were used at 0.05, 0.2 or 0.85 μΜ while 5′adapters were added at 0.33, 1.0 or 3.3 μΜ. The experiments using the optimal conditions were done using 0.05 μΜ 3′ adapter, 0.33 μΜ amino-modified 5′ adapter. Fragments of the correct size were PAGE purified after each ligation step and after the PCR amplification. Unless otherwise noted, PCR was performed for 14 cycles. The 14-cycle and 24-cycle libraries presented in Fig. 3 were prepared from the same reverse transcription (RT) reaction. The cDNA was diluted 500 times with water prior to 24-cycles of PCR. Sequencing was performed on Illumina HiSeq 2000 sequencers. A detailed protocol is available at blellochlab.ucsf.edu/Protocols and in SupplementaryInformation.pdf.