Unique Molecular Identifiers reveal a novel sequencing artefact with implications for RNA-Seq based gene expression analysis

Attaching Unique Molecular Identifiers (UMI) to RNA molecules in the first step of sequencing library preparation establishes a distinct identity for each input molecule. This makes it possible to eliminate the effects of PCR amplification bias, which is particularly important where many PCR cycles are required, for example, in single cell studies. After PCR, molecules sharing a UMI are assumed to be derived from the same input molecule. In our single cell RNA-Seq studies of Physcomitrella patens, we discovered that reads sharing a UMI, and therefore presumed to be derived from the same mRNA molecule, frequently map to different, but closely spaced locations. This behaviour occurs in all such libraries that we have produced, and in multiple other UMI-containing RNA-Seq data sets in the public domain. This apparent paradox, that reads of identical origin map to distinct genomic coordinates may be partially explained by PCR stutter, which is often seen in low-entropy templates and those containing simple tandem repeats. In the absence of UMI this artefact is undetectable. We show that the common assumption that sequence reads having different mapping coordinates are derived from different starting molecules does not hold. Unless taken into account, this artefact is likely to result in over-estimation of certain transcript abundances, depending on the counting method employed.


Detailed Methods: RNA preparation, reverse transcription and PCR
For run 170420 and run 171108, total RNA was extracted from P. patens as described above and serially diluted to 25 pg/µl for 1 cell equivalents and 250 pg/µl for 10 cell equivalents. 1 ul of the diluted RNA was resuspended in 9 µl RNase free water and 0.56 µl of Takara RNase Inibitor (Takara cat. 2313A) was added. The RNA was freeze-thawed three times at -80 • C and room temperature,respectively, to mimic the lysis conditions used for single protoplasts. Six technical replicates were processed at a time, three 1 cell equivalents and three 10 cell equivalents.
For run 171108, six individual protoplasts were harvested as described above in 5 µl of Hank's Balanced Salt Solution (HBSS) and 0.56 µl of Takara RNase Inibitor was added. The protoplasts were lysed by freezing and thawing three times at -80 • C and room temperature, respectively. Following the freeze-thaw cycles, the RNA and protoplasts were placed on ice.
For RT, 8.2 µl of RNA or protoplast lysate was mixed with 1.8 µl of 10 µM template switching RNA oligo , TSO3 (Supplementary Table S Table S 1). After the RNA/TSO3 mixture was incubated for two minutes at 42 • C, the RT master mix was also placed on the thermocycler and incubated at 42 • C for an additional minute. Next, 10 µl of RT master mix was added to each of the replicate RNA/TSO3 mixtures and mixed by pipetting. The RT reactions were incubated at 42 • C for 90 minutes. Then, the RT reactions were incubated at 70 • C for 10 minutes to inactivate the RT enzyme. The RT reactions were stored overnight at -20 • C.
The following day, RT reactions were thawed on ice and used for PCR optimization. For each replicate, a 10 µl aliquot of cDNA was mixed with 2.5 µl of 10X Advantage 2 PCR buffer (Takara 639201), 1.1 µl dNTP mix 10 mM each (Invitrogen 18427013), 1.3 µl of 10 µM MODIPCR primer (Supplementary Table S 1), 9.1 µl of nuclease free H 2 O, and 1 µl of 50X Advantage 2 polymerase mix (Takara 639201). The PCR reactions were placed in a thermocycler with a heated lid and run with the following cycling parameters: Step 1: 95 • C for two minutes, Step 2: 98 • C for 30 seconds, Step 3: 60 • C for 20 seconds, Step 4: 70 • C for four minutes, Step 5: Repeat Steps 2-4 for 28 cycles, Step 6: 70 • C for five minutes, and Step 7: hold at 4 • C. During Step 4, at cycles 21, 23, 25, 27, and 29, 5 µl aliquots were collected and placed on ice. The aliquots were placed back on the thermocycler at Step 6 for a final extension step. The PCR aliquots for each replicate were run in a 1% TAE agarose gel, stained with a 0.4 ng/µl ethidium bromide/H 2 O solution for 20 minutes and de-stained with distilled H 2 O for 15 minutes. The gel was imaged on a Bio-rad Gel Doc XR+ imaging system with Image Lab Software v.5.2.1. From the gel images, 27 cylces was determined to be the optimal number of PCR cycles for purified RNA and protoplasts.
Following PCR optimization, the remaining cDNA for each replicate, 10 µl each, was used to repeat the PCR reactions described above. The cycling parameters were also repeated as described above but with 27 PCR cycles and without collecting aliquots. The PCR reactions were cleaned by adding 17.5 µl of AmPure XP beads (Beckman Coulter A63881) to each 25 µl PCR reaction. The PCR/bead slurry was mixed until homogeneous by pipetting or gentle flicking of the PCR tubes. The slurry tubes were centrifuged briefly to collect droplets and were incubated at room temperate for five minutes. Following incubation, the PCR tubes containing the slurry were placed on a DyaMag-96 side magnet (ThermoFisher Scientific 12331D) for five minutes until the supernatant was clear. While on the magnet, the cleared supernatant was discarded, 200 µl of 80% ethanol was added to the beads, and incubated on the magnet for 30 seconds. The ethanol was discarded and the beads were washed a second time with 80% ethanol, as described above. The ethanol was discarded and the PCR tubes were centrifuged briefly to collect residual ethanol. The tubes were placed back on the magnet and the residual ethanol was removed by pipetting. The tubes were left on the magnet for 7-15 minutes with the caps open until the beads were completely dry. The tubes were removed from the magnet. The beads were resuspended in 14 µl of EB buffer (Qiagen 19086) and mixed until homogeneous. The EB/bead slurries were incubated at room temperature for five minutes to elute the PCR product and then placed back on the magnet for two minutes until the supernatant was clear. 12 µl of cleared supernatant was transferred to new 200 µl PCR tubes.
1 µl of PCR product for each replicate was quantified, as specified by the manufacturer, using a Qubit DNA High Sensitivity Assay Kit (ThermoFisher Scientific Q32851) and the Qubit3.0 fluorometer (ThermoFisher Scientific Q33216). To assess the size distribution of the libraries , 1 µl of each library was run on a 2100 Bioanalyzer (Agilent G2939A) using a High Sensitivity DNA Analysis kit (Agilent 5067-4626) as specified by the manufacturers' instructions. The size distribution of the libraries ranged from 200-1000 bps. The libraries were stored overnight at -20 • C.

Detailed Methods: Tagmentation and library amplification.
The following day, the PCR product was thawed on ice. The PCR product was simultaneously fragmented and barcoded using Tn5 DNA transposase to transfer adaptors to the target DNA. For each replicate, 10 µl of PCR product was mixed with 4 µl of tagmentation buffer (50mM TAPS-NaOH pH 8.5 (Sigma T130-25G), 25mM MgCl 2 (Ambion Am9530G), and 40% PEG 8000 (Sigma 83271-500ml-F) in nuclease free H 2 O), 2 µl of 10X Tn5 assembly, with each replicate receiving an assembly containing a uniquely barcoded adapter, and 4 µl of nuclease free H 2 O in 200 µl PCR tubes. The reactions were incubated in a thermocycler at 55 • C for seven minutes and then transfered to ice. 0.5 µl of 20 mg/ml proteinase K (Ambion AM2546) was added to each reaction. The reactions were mixed by gently flicking the tubes and centrifuged briefly to collect droplets. The samples were placed back in a thermocycler and incubated at 55 • C for seven minutes and then placed on ice.
For six replicates, 120 µl of Dynabeads M-280 Streptavidin beads (Invitrogen 11206D) were pipetted into a 1.5 ml LoBind Eppendorf tube and placed on a DyanMag-2 magnet (ThermoFisher Scientific 12321D) until the supernatant cleared. The supernatant was discarded by pipetting. The tube was removed from the magnet and 120 µl of BWT (10 mM Tris-HCl , pH 7.5 (Invitrogen 15567-027)), 1 mM EDTA (Ambion AM9260G), 2 M NaCl (Ambion AM9760G), 0.02% Tween-20 (Sigma P9416-50ML)) was added to the beads and mixed by pipetting until the mixture was homogeneous. The bead slurry was placed back on the magnet until the supernatant cleared and the supernatant was discarded. The beads were removed from the magnet and were resuspended in 120 of BWT. 20 µl of washed Dynabeads M-280 Streptavidin beads were added to each 20 µl tagmentation reaction. The bead slurries were mixed by gently flicking the PCR tubes and centrifuged briefly to collect droplets. Then, the bead slurries were rotated at room temperature for 10 minutes to allow the tagmentation libraries to bind to the beads. The bead slurries were centrifuged briefly to collect droplets and placed on a DyaMag-96 side magnet for five minutes until the supernatant cleared. The supernatant was discarded and the PCR tubes were removed from the magnet. Then, the beads were resuspended in 100 µl of PB buffer (Qiagen 19066) and mixed by pipetting. The bead slurry was place on a DyaMag-96 side magnet until the supernatant cleared. The supernatant was discarded. The bead containing tubes were removed from the magnet and the beads were resuspended in 100 µl of TNT buffer (20 mM Tris-HCl , pH 7.5 (Invitrogen 15567-027)), 50 mM NaCl (Ambion AM9760G), 0.02% Tween-20 (Sigma P9416-50ML)) and mixed by pipetting. The bead slurry was place back on the magnet for five minutes until the supernatant cleared. The supernatant was discarded and the TNT wash was repeated again.
This time, the beads were resuspended in 100 µl of restriction mix (1X NEB Buffer 4 (New England Biolabs B7004S) and 0.4 U/µl PvuI-HF (New England Biolabs R3150S) in H 2 O) and mixed by pipetting. The bead slurry was place in a thermocycler and incubated at 37 • C for one hour to digest and remove unwanted 3 fragments carrying the PvuI recognition site. After incubation, the bead slurry was placed on a DyaMag-96 side magnet until the supernatant cleared. The supernatant was discarded and the beads were washed three times with TNT as described above. After the third TNT wash, the beads were resuspended in 20 µl of EB buffer. The eluted tagmenation library was placed on ice.
Next, the tagmentation libraries were amplified by PCR to obtain enough library for Illumina sequencing. For each replicate, 20 µl of tagmentation library was mixed with 5 µl of 10X Advantage 2 PCR buffer (Takara 639201), 1 µl dNTP mix 10 mM each (Invitrogen 18427013), 0.5 µl of 10 µM P5 primer (Supplementary Table S  Step 6: 70 • C for five minutes, and Step 7: hold at 4 • C overnight. The following morning, the amplified libraries were removed from the thermocycler and placed at room temperature.
The amplified libraries were cleaned by adding 35 µl of AmPure XP beads to each 50 µl PCR reaction. The PCR/bead slurry was mixed until homogeneous by pipetting or gentle flicking of the PCR tubes. The slurry tubes were centrifuged briefly to collect droplets and were incubated at room temperature for five minutes. The PCR/bead slurries were placed on a DyaMag-96 side magnet for five minutes until the supernatant was clear. While on the magnet, the cleared supernatant was discarded by pipetting. 200 µl of 80% ethanol was added to the beads and the beads were incubated for 30 seconds. The ethanol was discarded and the beads were washed a second time with 80% ethanol, as described above. The PCR tubes were centrifuged briefly to collect residual ethanol and placed back on the magnet. The residual ethanol was removed by pipetting. The tubes were left on the magnet for 7-15 minutes with the caps open until the beads were completely dry. The tubes were removed from the magnet and were resuspended in 14 µl of EB buffer and mixed until homogeneous. The EB/bead slurries were incubated at room temperature for five minutes to elute the libraries and then placed back on the magnet for two minutes until the supernatant was clear. 12 µl of cleared supernatant was transferred to new 200 µl PCR tubes.
1 µl of each library was quantified using a Qubit DNA High Sensitivity Assay Kit and the Qubit3.0 fluorometer, as specified by the manufacturer. To assess the size distribution of the libraries , 1 µl of each library was run on a 2100 Bioanalyzer using a High Sensitivity DNA Analysis kit (see manufacturers' instructions). The size distribution of the libraries ranged from 200-700 bps. The libraries were stored at -20 • C until used for sequencing. Sequence  FSP  AAT GAT ACG GCG ACC ACC GAT CGT TTT TTT TTT TTT TTT TTT TTT TTT TTT TTT  P5 AAT   (7463)). R packages ggplot2 11 and dplyr 12 were used. This software is available and documented (including dependencies) on GitHub. DESeq2 analysis was done with modifications to a template provided by https://gist.github.com/stephenturner/f60c1934405c127f09a6# file-deseq2-analysis-template-r-L62

Figure S 1.
Comparison of RNA-Seq UMI and read counts from sequencing data derived from two P. patens protoplasts. A: read counts per gene. B: UMI counts per gene. UMI and read counts are shown for the same data. Each dot plots a gene having two counts of reads or UMI as indicated by the X and Y axes.