Targeted sequencing of both DNA strands barcoded and captured individually by RNA probes to identify genome-wide ultra-rare mutations

Next Generation Sequencing (NGS) has been widely implemented in biological research and has made a profound impact on patient care. One of the essential NGS applications is to identify disease-causing sequence variants, where high coverage and accuracy are needed. Here, we report a novel NGS pipeline, termed a Sequencing System of Digitalized Barcode Encrypted Single-stranded Library from Extremely Low (quality and quantity) DNA Input with Probe-based DNA Enrichment by RNA probes targeting DNA duplex (DEEPER-Seq). This method combines an ultra-sensitive single-stranded library construction with barcoding error correction, termed DEEPER-Library, and a DNA capture approach using RNA probes targeting both DNA strands, termed DEEPER-Capture. DEEPER-Seq can create NGS libraries from as little as 20 pg of DNA with PCR error-correcting capabilities, and can capture target sequences at an average ratio of 29.2% by targeting both DNA strands simultaneously, with over 98.6% coverage. Our method tags and sequences each of the two strands of a DNA duplex independently and only scores mutations that are found at the same position in both strands, which allows us to identify mutations with allelic fractions down to 0.03% in a whole exome sequencing (WES) study with a background error rate of one artificial error per 4.8 × 10^9 nucleotides.

Density plots of read depths. Density plots were created to show GC content against normalized mean read depth for (A) DEEPER-Seq WES study with normal tissue DNA; (B) DEEPER-Seq WGS study with normal tissue DNA (without enrichment for whole exome).

SUPPLEMENTARY TABLES (attached as separate Excel documents)
Table S1. 298-gene panel real-time PCR parameters. The 298 cancer-related gene targets, primer pairs, amplicon sequences, amplicon GC% and amplification efficiency constant for each real-time PCR detection are listed.
Table S2. Mutation and ultra-rare mutation detection by DEEPER-Seq. Sequence variants detected by DEEPER-Seq, validation results by Sanger sequencing and ultra-rare mutation re-detection results are shown and ranked by Mutant Allele Fraction.

Real-time PCR assay
Real-time PCR assays with SYBR Green detection were carried out using an ABI PRISM 7500 Sequence Detection System (Applied Biosystems). Briefly, each reaction consisted of 500 ng of genomic DNA or DNA library products, 0.2 μM primers, and SYBR Green Real-Time PCR Master Mix (ThermoFisher Scientific) in a final volume of 20 μl. Each cycle consisted of denaturation at 95°C for 15 seconds, annealing at 58.5°C for 5 seconds and extension at 72°C for 20 seconds. Gene-specific primers were designed using Primer3 1 and their sequences are provided in Supplementary Table S1. Reactions were run in triplicate in three independent experiments. The standard amplification curve of the primer pair for each gene was established using serial dilutions of the "+" clone constructs containing the amplicon sequence, which were originally created to generate the DEEPER-Capture RNA probes. Amplification efficiencies for the 298 target amplicons were established and are listed in Supplementary Table S1. Gene abundance ratios between different samples were calculated by raising the gene-specific amplification efficiency (AE) to the power of the ΔCt value between the samples. For example, the ratio (r) of gene abundance in sample A vs sample B can be calculated through real-time PCR assay by:
$$r = AE^{\Delta Ct}$$
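The ratio calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study: the function name, the example Ct values, and the sign convention ΔCt = Ct_B − Ct_A (so that a lower Ct in sample A yields r > 1) are assumptions made here for clarity.

```python
def abundance_ratio(ae: float, ct_a: float, ct_b: float) -> float:
    """Ratio of gene abundance in sample A vs sample B, r = AE ** delta_Ct.

    ae   : per-cycle amplification efficiency of the primer pair
           (assumed here to be the fold increase per cycle, 2.0 = perfect doubling)
    ct_a : threshold cycle measured in sample A
    ct_b : threshold cycle measured in sample B
    Assumption: delta_Ct = Ct_B - Ct_A, so a lower Ct in A gives r > 1.
    """
    if ae <= 1.0:
        raise ValueError("amplification efficiency must be greater than 1")
    return ae ** (ct_b - ct_a)


# Hypothetical values: sample A crosses the threshold 3 cycles earlier than B
r = abundance_ratio(ae=2.0, ct_a=22.0, ct_b=25.0)
print(r)  # 8.0
```

Using the gene-specific AE from the standard curve, rather than a fixed value of 2, is what allows ratios to be compared across amplicons with different amplification efficiencies.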

Build a highly accurate reference exome for ultra-rare mutation identification
To accurately assess the baseline mutation frequency of the DEEPER-Seq pipeline, we constructed six replicate standard NGS DNA libraries in parallel, each using 100 ng of normal DNA input. We used these six replicate exome datasets to rebuild our own reference exome database for this particular patient by requiring that an SNV be observed in ≥ 5 out of 6 independent datasets before we considered it a germline variant and updated our reference exome sequence database. For a standard NGS pipeline, the error rate is 1%, and the chance of seeing exactly the same random error at a fixed position 5 times is (⅓ × 1%)^5 = 4.12 × 10^-13. This number means that if we use this approach to sequence the whole human genome once, we would expect roughly one artificial error, because 3 × 10^12 human genome bases × (4.12 × 10^-13) = 1.24. However, we are enriching and sequencing the human exome, which occupies only 1.5% of the human genome; therefore the chance of seeing a single artificial error within the entire human exome is only 1.86% (= 1.5% × 1.24). An updated, highly accurate normal exome reference database of the patient was built accordingly.
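The arithmetic in this estimate can be reproduced with a short script. The sketch below simply mirrors the calculation as stated in the text, using the same input figures (including the genome size figure used there); the variable names are ours. Small differences in the last digit relative to the quoted values come from rounding the intermediate probability to 4.12 × 10^-13 in the text.

```python
PER_BASE_ERROR = 0.01    # 1% error rate quoted for a standard NGS pipeline
WRONG_BASES = 3          # a random error can produce any of 3 alternative bases
AGREEING_DATASETS = 5    # the same SNV must appear in >= 5 of 6 datasets
GENOME_BASES = 3e12      # genome size figure used in the text's calculation
EXOME_FRACTION = 0.015   # exome taken as 1.5% of the genome

# Chance that the same specific wrong base appears at one position 5 times
p_same_error = (PER_BASE_ERROR / WRONG_BASES) ** AGREEING_DATASETS

# Expected number of such artificial errors over the genome and the exome
expected_genome = GENOME_BASES * p_same_error
expected_exome = EXOME_FRACTION * expected_genome

print(f"{p_same_error:.3g}")     # about 4.12e-13
print(f"{expected_genome:.3g}")  # about 1.23
print(f"{expected_exome:.3g}")   # about 0.0185, i.e. roughly 1.9%
```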

DEEPER-Library offers the ultimate ability to detect ultra-rare mutation in limited amount of samples or damaged samples
The DEEPER-Library creates a large number of barcoded DNA read families (URFs), where each family arises from a single-stranded DNA molecule. After sequencing the library, DNA molecules within a URF can be identified and grouped based on the fact that they all share an identical barcode sequence (Figure 1). Only a URF with at least 3 reads and with at least 95% of its members sharing the same sequence at any given position is adopted as a read family to generate the consensus sequence (a super read). This step efficiently removes artificial PCR errors that occur during repeated rounds of library amplification.
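The family-level filter described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual implementation: the function name and input layout are hypothetical, and reads within a family are assumed to be already aligned and of equal length, which real data would not guarantee. The two thresholds (at least 3 reads, at least 95% agreement at every position) are taken from the text.

```python
from collections import Counter, defaultdict

MIN_READS = 3         # minimum reads per family, as stated in the text
MIN_AGREEMENT = 0.95  # minimum fraction of reads agreeing at every position


def build_super_reads(reads):
    """Collapse (barcode, sequence) pairs into one consensus per barcode.

    Assumption: sequences sharing a barcode are pre-aligned and equal in
    length; families that violate this are skipped in this sketch.
    """
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)

    super_reads = {}
    for barcode, seqs in families.items():
        if len(seqs) < MIN_READS:
            continue  # too few reads to form a consensus
        if len({len(s) for s in seqs}) != 1:
            continue  # unequal lengths are out of scope for this sketch
        consensus = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            if count / len(seqs) < MIN_AGREEMENT:
                consensus = None  # family fails the uniformity filter
                break
            consensus.append(base)
        if consensus is not None:
            super_reads[barcode] = "".join(consensus)
    return super_reads


# Hypothetical reads: family AAC is uniform, GGT disagrees at one position,
# and TTA has only a single read.
example = (
    [("AAC", "ACGT")] * 3
    + [("GGT", "ACGT"), ("GGT", "ACGT"), ("GGT", "ACTT")]
    + [("TTA", "ACGT")]
)
print(build_super_reads(example))  # {'AAC': 'ACGT'}
```

In this sketch a family that fails the uniformity threshold at any single position is discarded entirely, which matches the description of abandoning the URF rather than masking the discordant base.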
If an artificial error occurs at the very first step of PCR amplification, it will propagate to at most 50% of the PCR products of that sequence. Artificial variants that arise from PCR errors or sequencing errors can be removed because such errors accumulate over multiple rounds of PCR amplification and are therefore observed in only a subgroup of the reads sharing the same unique barcode. A filter can be adopted to abandon any URF whose sequence uniformity is lower than a threshold; for this study the threshold was set at 95%. A higher threshold can further improve the sequencing accuracy, but will lead to a lower number of super reads. With a large number of high-fidelity super reads collected, each super read, bearing a unique barcode, is aligned to its complementary super read by virtue of sharing a complementary consensus sequence while being differently barcoded. By mapping the super reads arising from both DNA strands individually, artificial errors in super reads can be removed: a sequence variant at a position is considered real only if a matched sequence variant is observed at the same position in the complementary DNA strand super read with a different barcode. The probability that any artificial sequence variant has a matched artificial variant at the same position on the complementary DNA strand is < 6.45 × 10^-14 per base.
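The two-strand confirmation rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: variant calls from super reads are assumed to be already mapped to reference coordinates and expressed on the reference (+) strand, so that matching can be done by position and base rather than by pairing complementary consensus sequences. The function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def duplex_confirmed_variants(calls):
    """Keep variants seen in super reads from both strands with different barcodes.

    calls: iterable of (barcode, strand, position, alt_base) tuples, one per
    variant observed in a super read; strand is '+' or '-'.
    Assumption: alt_base is reported relative to the reference (+) strand.
    """
    support = defaultdict(lambda: {"+": set(), "-": set()})
    for barcode, strand, position, alt_base in calls:
        support[(position, alt_base)][strand].add(barcode)

    confirmed = []
    for variant, by_strand in support.items():
        plus, minus = by_strand["+"], by_strand["-"]
        # Both strands must support the variant, and the supporting super
        # reads must include at least two distinct barcodes.
        if plus and minus and len(plus | minus) >= 2:
            confirmed.append(variant)
    return sorted(confirmed)


# Hypothetical calls: position 100 is seen on both strands, 250 on one only.
calls = [
    ("bc1", "+", 100, "T"),
    ("bc2", "-", 100, "T"),
    ("bc3", "+", 250, "A"),
]
print(duplex_confirmed_variants(calls))  # [(100, 'T')]
```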

DEEPER-Capture based DNA capture enables ultimately the best capture efficiency
In 1960, DNA-RNA hybridization was reported for the first time, before the term was invented 2. Since then, numerous studies have reported that an RNA probe can bind to its complementary DNA target sequence with a much stronger affinity than a DNA probe 3,4. In DEEPER-Capture, capture efficiency is greatly improved by using a large amount of RNA probes to capture both DNA strands of the same DNA duplex molecule simultaneously (Figure 4A). DEEPER-Capture achieves an unprecedented 29.2% capture ratio on average, and this remarkably high efficiency is presumably due to two reasons:
1) The large number of single-stranded RNA probes used in DEEPER-Capture may improve the hybridization reaction. The excess of RNA probes will push the equilibrium of the binding reaction towards forming RNA-DNA duplexes, and any RNA duplexes that form can then be easily removed by RNase treatment if needed. The logistics of standard single-stranded RNA probe based capture (Half-DEEPER-Capture) and DEEPER-Capture can be illustrated by the following equations with the assumption that RNA-Probe
2) DEEPER-Capture may improve the hybridization reaction by depleting one DNA strand, thus helping to expose the other DNA strand to a large amount of complementary RNA probes; together these effects may synergistically increase the reaction constant k2 to be significantly larger than k1. When a DNA duplex is placed in a heated environment around its Tm, the two complementary DNA strands are either separated (for strands or regions with low GC content) or loosely associated (high GC regions). When one of the two complementary DNA strands is captured by an RNA probe, the other DNA strand becomes more accessible to its complementary RNA probes. Therefore, DEEPER-Capture may improve target capture efficiency by achieving a much larger k2 than k1.
It has been reported that in NGS capture methods, overlapping baits improve sensitivity and are superior to an immediately adjacent or spaced design, and that relatively long baits and RNA-based baits can increase capture efficiency 5. The DEEPER-Capture method we report here uses massive amounts of randomly sheared RNA probes, ranging from 100 to 150 nt in length, which overlap heavily and cover the target DNA regions thousands of times. We demonstrated the superior capture efficiency of the RNA probes designed and synthesized by our pipeline. Our findings support the previous observations on RNA probes in capture operations. More importantly, we report for the first time that when the overlapping RNA probes are in excess (compared to DNA molecules) and target both DNA strands simultaneously, a significantly improved capture efficiency can be achieved.
Off-target enrichment was one of the biggest concerns in DEEPER-Capture. The highly efficient DEEPER-Capture approach relies on a large amount of RNA probes that overlap with each other and are complementary to both DNA strands of the same targeted genomic region. A major side reaction would be the formation of RNA duplex molecules, RNA(+) : RNA(-), from two complementary RNA single strands. However, this side reaction may have only a limited negative impact on the formation of DNA(+/-) : RNA-Probe(-/+) hybrids. Furthermore, RNA duplex molecules as well as the excess RNA probes can be removed by RNase treatment if necessary, without affecting the captured target DNA sequences. A major concern for off-target enrichment in NGS is the proportion of unwanted genomic DNA fragments that are enriched through nonspecific hybridization. To address this issue, we optimized the capture conditions, including buffer systems, capture reaction temperatures, incubation times, and blocking primer sequences and concentrations. Under the optimized conditions, an average of only 16.4% of the total reads from DEEPER-Capture were off-target reads in a WES study (Supplementary Figure S1B). This is an acceptable ratio for most NGS applications and is lower than that of other target enrichment methods 5-14. Further improvements may be achieved with additional optimization of conditions or procedures.
There are several widely used commercial kits designed to capture DNA subgenomic regions. Agilent, NimbleGen and Illumina are three major vendors in this field. Based on the chemical nature of their probes, these commercially available approaches can be classified into two categories: 1) RNA probes: Agilent's SureSelect; 2) DNA probes: Roche's NimbleGen SeqCap, Illumina's TruSeq and Nextera. Several studies have compared these capture methods in terms of their performance in WES 5,6,8,15-17. All the platforms mentioned above can capture over 90% of the unique sequences in a WES study with a minimal sample input ranging from 50 ng (Illumina Nextera) to 1.1 μg (NimbleGen). Agilent SureSelect offers the only RNA probe-based (single-stranded RNA probes) capture method on the market, and has been reported to perform successful capture with as little as 6.25 ng of input DNA, achieving ~300X mean depth of coverage with an SNV detection sensitivity >96% for high-prevalence SNVs (allelic fractions >15%) 7. As introduced above, RNA baits have distinct advantages over DNA baits: they bind to target DNA much more strongly than DNA probes, do not interfere with downstream PCR reactions, and can be easily removed. Like us, Agilent adopted RNA probes, but their RNA baits target only one strand of the DNA targets and are used in very limited amounts. In DEEPER-Capture, by contrast, we capture both DNA strands simultaneously with an excess of probes, thus achieving a more than 3-fold improvement in efficiency compared to a single-stranded capture approach.