High-frequency, low-coverage “false positives” mutations may be true in GS Junior sequencing studies

The GS Junior sequencer provides simplified procedures for library preparation and data processing. Errors in pyrosequencing generate some biases during library construction and emulsion PCR amplification. False-positive mutations are identified by related characteristics described in the manufacturer’s manual, and some detected mutations may have ‘borderline’ characteristics when they are detected in few reads or at low frequency. Among these mutations, however, some may be true positives. This study aimed to improve the accuracy of identifying true positives among mutations with borderline false-positive characteristics detected with GS Junior sequencing. Mutations with the borderline features were tested for validity with Sanger sequencing. We examined 10 mutations detected in coverages <20-fold at frequencies >30% (group A) and 16 mutations detected in coverages >20-fold at frequencies < 30% (group B). In group A, two mutations were not confirmed, and two mutations with 100% frequency were confirmed as heterozygous alleles. No mutation in group B was confirmed. The two groups had significantly different false-positive prevalences (p = 0.001). These results suggest that mutations detected at frequencies less than 30% can be confidently identified as false-positives but that mutations detected at frequencies over 30%, despite coverages less than 20-fold, should be verified with Sanger sequencing.

false-positive variant calls with in-house-developed variant interpretation pipeline software. These authors suggested that when 99.99% detection sensitivity is required, cut-off values should be a 10-fold coverage range and 20% variation. In addition, the resulting variants with these values must be confirmed with Sanger sequencing, and a 5-fold coverage range is expected to be sufficient when screening for only homozygous variations 5 . The fifth study used Reference Mapper software, showing that this approach requires that reads harbouring the mutation should exceed 30% of the total reads 6 .
Six studies involving the GS Junior platform have provided a basis for the adopted standards of coverage and frequency. The first study used some web tools, including BWA-MEM, Cutadapt, VarScan, and a series of functions and filters that they programmed. These authors suggested thresholds for optimal accuracy and recommended a 38-fold coverage range to detect heterozygous alleles with a minimum 25% allele frequency for a sensitivity of 99.9% 7 . The second study used Amplicon variant analyser software. This group considered that a sample was truly mutated only when mutations were present in at least 1% of the consensual reads and at least 10 total reads were performed 8 . The third study did not mention the software used, and the investigators accepted the same standards used in the second study 9 . The final three of the six studies applied the Reference Mapper software. Of these, the fourth study showed that sequence variants were further prioritised according to the percentage (over 20%) of reads that contained a given variant 10 . The authors of the fifth study suggested a cut-off of 20-fold sequence coverage with 30% to 70% total variation in a heterozygous form and >90% in a homozygous form 11 . However, the authors of the sixth and final study indicated that only variants detected within a 10-fold coverage range with a 20% frequency should be considered when a detection power of 99.99% was required in the context of molecular diagnostics 12 .
In summary, the required coverage and frequency can differ according to the requirements of sensitivity, research purpose, and software used. That said, a 10-fold coverage range and 20% frequency may be the minimum requirements under special conditions. Some mutations that are detected with only a few reads or at a low frequency and are considered false positives may be true positives. One study using the Reference Mapper software available in the GS Junior platform identified two variants with less than 20-fold depth or <30% total variation that were true-positive variants by Sanger sequencing verification 12 . In the last 3 years, mutation screening studies applying Reference Mapper software in the GS Junior platform also found several mutations with coverages less than 20-fold and frequencies over 30%, but no other false-positive characteristics, and Sanger sequencing also confirmed these to be true positives. In the present study, we collected 10 mutations with coverages <20-fold and frequencies >30% and 16 mutations with coverages >20-fold and frequencies <30%, and tested their validity with Sanger sequencing to determine the false-positive prevalence. We also investigated whether a false-positive detection was more likely to be identified by a coverage <20-fold or by a frequency <30% when the mutation had no other false-positive characteristics.

Results
In group A, mutations were detected in the 10 exons with a coverage range less than 20-fold and at frequencies greater than 30% ( Table 1). None of the target regions were GC-rich sequences. Two mutations (A2 and A10) were not confirmed with Sanger sequencing, although one mutation appeared with 100% frequency. Two other mutations with 100% frequencies (A7 and A9) were confirmed to be heterozygous alleles (electropherograms of A7, Fig. 1), and they were also considered to be false positives in the study when calculating the prevalence of false-positive mutations. The electropherograms of representative A3 are shown in Fig. 2.
In group B, 16 mutations were detected with a coverage range greater than 20-fold and at frequencies less than 30% ( Table 2). None of these 16 mutations were confirmed with Sanger sequencing.
The prevalences of false-positive mutations were 40% in group A and 100% in group B. The false-positive prevalences were significantly different between the two groups (p = 0.001).

Codes Gene and exon Chr
Reference position

Discussion
The 454 GS Junior sequencers may produce false positives, which could lead to misleading conclusions. To avoid such pitfalls, only mutations that fulfil the standards for a positive identification are used for result analyses. Thus, the minimum read depth and minimum percentage of mutated reads are set to eliminate random sequencing errors. In some studies, only the variants detected in over 10-fold or 30-fold coverage ranges 3,4,8,9 or variants with mutations detected in over 30% of the total reads 4,6 have been used to draw a conclusion. However, in a study that used the GS Junior System and GS Reference Mapper software, some true variants were skipped because they were detected in coverages less than 20-fold or at frequencies less than 30% 12 . In the present study, eight true variants were considered false positives based on the cut-off values; however, these were later confirmed as true positives with Sanger sequencing. In contrast, none of the mutations detected at frequencies less than 30% were confirmed with Sanger sequencing. These results confirmed the importance of being aware of different potential generation mechanisms of artifacts, which can result in false-positive calls. Bias can have an influence at several steps, including amplicon-based library preparation, fragment enrichment, and sequencing. False-positive mutations may be generated by chimeric sequences that form during the PCR amplification step. Chimeras arise through a PCR-mediated recombination, which creates an artificial, false-positive haplotype. Chimera formation may be induced by high cycle numbers, high initial template concentrations, and polymerases [13][14][15][16] . Reported chimera formation rates range from 1% to 5% across a variety of polymerases [14][15][16] , and rates are much higher in some other studies [17][18][19] . PCR biases can also be induced by suboptimal primers 20 . Consequently, to reduce biases, optimum template concentrations, fewer cycles, high-fidelity polymerases, and optimal primers require consideration.
Templates with high GC base compositions are frequently difficult to amplify with PCR. GC contents over 65% may induce aberrant amplification from non-target regions, producing multiple bands on gel electrophoresis 21 . Unusual sequence characteristics, like high GC content or the presence of repetitive elements, can also cause poor enrichment in sequence capture enrichment approaches, reducing sequencing efficiency 4 . A high GC content in the 5′-untranslated region of an exon can also hinder efficient target sequence capture 22 . On the other hand, the 454 NGS sequencing has some technical limits in detecting mutations in homopolymers 23 . In the present study, GC content was calculated in the PCR products of the amplicon libraries and in the exons of the custom-designed SeqCap EZ Choice libraries. We found no GC effects in the regions targeted in this study.
Enrichment approaches that capture single DNA fragments in solution have the advantage that many regions can be targeted in parallel; however, some targeted regions may not be captured and other, unwanted regions may be. When the beads fail to sufficiently capture one type of fragment or are washed off the fragment, that sequence will be insufficiently amplified and few reads achieved (Fig. 3A). Alternatively, when mistaken fragments generated during the PCR amplification are captured and amplified but the true fragments at the same position are The read image from the GS Junior system; the sequence is shown in sense sequence. There were three reads, and the reads showed a homologous mutation. A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA. Contig 00053 was the GenBank accession number in the software for the reads. (B) The electropherogram from Sanger sequencing, and the sequence is shown in sense sequence. The mutation was verified to be heterozygous. not sufficiently amplified, the result may be detection of a high-frequency, false-positive mutation (Fig. 3B). Only 5-fold coverage has been identified as sufficient for screening homozygous variations 5 . In another study, though, a mutation with a cut-off of 20-fold sequence coverage and >90% frequency was considered homozygous 11 , and the remaining <10% frequency reads were considered false reads. However, when only the mutated fragments are amplified and the normal fragments are not, the final results may appear to indicate a homozygous variant. In   the present study, three types of 100% frequency variants were detected with GS Junior sequencing; one was not confirmed and two were confirmed as heterozygous alleles with Sanger sequencing (Table 1). Biased amplifications in the emPCR process might result in producing artificial duplicate reads. Artificial duplicates can be generated when several fragments are captured by one bead in one droplet or when some droplets are broken during the emPCR process 24 . The presence of artificial duplicates thus may result in the detection of a low-frequency false-positive mutation (Fig. 3C). In this study, the 16 mutations in group B were low-frequency false positives that could not be confirmed with Sanger sequencing. Similarly, a previous study that used the GS amplicon variant analyser software and the 454 FLX platform reported two mutations that appeared at frequencies of 18.6% and 22.9%, which could not be confirmed with Sanger sequencing 25 . Two other studies using Reference Mapper software and GS Junior sequencing reported two heterozygote mutations detected at frequencies of 22.6% with a 53-fold coverage and 21.3% with a 61-fold coverage, and neither could be confirmed with Sanger sequencing 10 . In addition, variants detected at frequencies of about 30% could not be confirmed with Sanger sequencing 11 . Thus, the characteristic of 'detection at low frequency' appears to be a strong indicator for a false-positive mutation. Clonal amplification of single DNA on one bead in one droplet, which would result in a unique read. If capture beads did not capture enough fragments or the beads were washed off for insufficient amplification, the result may be few final reads. (B) Some mistaken fragments generated during library construction may be captured and amplified, whereas the true fragments of the same position may fail to be amplified, which could result in some kind of high-frequency false positive. If both artificial and natural duplicates were captured and amplified, the result can be some type of low-frequency false positives. (C) If several fragments were captured by one bead in one droplet or some droplets were broken during the process of emPCR, the result also may be a false positive with low frequency.
The GS 454 sequencing platform was reportedly not good at detecting indel mutations in a study using the NGS platform to detect such mutations 26 . In our studies, we did not detect any indels, but we cannot attribute this outcome to the platform because our samples may have harboured no indels. The sequencing errors were not only the result of random chance but also of systematic errors in the technique itself or artifacts, especially in homopolymeric regions. Despite the occurrence of false positives in studies using the Reference Mapper software with the GS Junior platform, the GS Junior platform remains a powerful method for large-scale genome studies. Even when the detected mutations must be verified with Sanger sequencing, the GS Junior platform requires less time than Sanger sequencing to perform the same work.
One shortcoming of this study is the small number of enrolled mutations, and more mutations with coverage <20 and frequency >30% to ensure data validation are needed. In addition, not all types of DNA regions were covered, such as GC-rich regions. We found that even when the quality control met standards recommended in the manufacturer's manual, some mutations occurred when the coverage was <20 and the frequency >30%. The results of this study indicated that mutations detected over a few reads (less than 20-fold coverage) and at high frequencies (over 30%) should be verified with Sanger sequencing in studies that use Reference Mapper software and GS Junior sequencing. Because of unavoidable differences in using apparatuses and variable proficiency in manipulating them, researchers will benefit from establishing their own thresholds for true mutations. For this study, we identified 26 target regions in 8 genes (groups A and B) that were shown in two earlier studies to harbour mutations. One study was a mutation screening study in exons of circadian-relevant genes using a custom-designed SeqCap EZ Choice library 27 , and the other was a mutation screening study in exons of MPDZ (multiple PDZ domain protein) using rapid amplicon library (shotgun). All participants fulfilled the diagnostic criteria for Autism Spectrum Disorder as listed in the Diagnostic and Statistical Manual of Mental Disorders: Fourth Edition (DSM-IV). All 137 patient samples were sequenced for mutation screening of MPDZ, and 28 of these samples were sequenced for mutation screening of circadian-relevant genes. All patients had intellectual disability (ID), and their IQs ranged from 14 to 75.

Materials and Methods
The statistical significance for detected mutations was analysed between the patient and control groups. The hypothesised effects of the detected mutations on respective protein functions were analysed using the Polymorphism Phenotyping v2 (PolyPhen-2) prediction tool (http://genetics.bwh.harvard.edu/pph2/), SIFT (http:// sift.jcvi.org/), and Mutation Taster (mutation t@sting, http://www.mutationtaster.org/). If there was no statistical significance between the groups for a detected mutation, then that mutation was considered a polymorphism.
All gene sequences were obtained from http://www.ncbi.nlm.nih.gov/nucleotide/. Lymphocyte samples were obtained from patients who provided informed consent. Samples were transfected with the Epstein-Barr virus (to establish lymphoblasts) and cultured in vitro. Genomic DNA was extracted from lymphoblasts with the salting-out method. DNA concentrations were determined with a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific).
Amplicon rapid library (shotgun) preparation. Primers were designed to amplify exons and their vicinities for all 46 exons of the MPDZ gene to yield amplicons of 250-700 bp. All PCR was carried out under the following conditions: 25 ng of human genomic DNA, 0.1 μM of each primer, 1.5 μl of 10 × PCR buffer (Takara, Shiga, Japan), 1.2 μl of 200 mM dNTPs (Takara, Shiga, Japan), and 0.1 μl of rTaq DNA polymerase (Takara, Shiga, Japan). The final volume was adjusted to 15 μl with water. Thermocycling conditions were one initial incubation at 94 °C for 3 min, followed by 36 cycles at 94 °C for 30 s, the appropriate annealing temperature for 30 s or 45 s, and 72 °C for 30 s. A final extension step at 72 °C for 10 min was added at the end of the last cycle. The reactions were performed in a Gene Amp PCR system 9700 (PE Applied Biosystems). The products were then purified and ligated with different multiplex identifiers (MIDs; we used12 MIDs; Titanium Rapid Library MID adaptor; Roche, Pleasanton, CA, USA). The MIDs enabled the use of the pooled sequences as a library, according to the GS Junior Titanium Series Rapid library (shotgun) Preparation Manual, June 2012.
NimbleGen SeqCap EZ Choice library preparation. Exons  Emulsion PCR and GS Junior Sequencing. The libraries were pooled in equimolar amounts and combined with capture beads at a ratio of two molecules from each DNA library per capture bead. The pooled DNA was then amplified with emulsion PCR (emPCR). The bead-attached DNAs were denatured, eluted, and quantified with the provided bead counter. All were performed according to the GS Junior Titanium Series emPCR (Lib-L) Manual (June 2012).
A total of 500,000 enriched DNA beads were mixed with Packing Beads. Then, the Pico Titer Plate was sequentially loaded with Prelayer Beads, DNA-Packing Beads, Postlayer Beads, and PPiase Beads. Finally, the Pico Titer Plate was mounted in the 454 GS Junior Sequencer, and the program was run in full processing mode for shotgun sequencing, according to the GS Junior Titanium Series Sequencing Method Manual (June 2012). The resulting reads were aligned, and variants were compared to the reference genome with the 454 integrated software (GS Reference Mapper; Roche, Pleasanton, CA, USA). (Table 3) were designed to target exons that harboured mutations identified in previous studies. PCR was performed to amplify each exon and its neighbouring introns. The PCR products were purified with the MinElute ® PCR purification kit (Qiagen) and checked with 2% agarose gel electrophoresis. Sequencing was performed by applying the same forward and reverse primers that were used for PCR amplification. We used the BigDye ® terminator sequencing kit, version 3.1 (Life Technologies), and a 3730XL DNA analyser (Life Technologies). The sequencing results were interpreted with Sequence Scanner, version 1.0 (Applied Biosystems). Data analysis. Statistical significance was analysed with Fisher's exact test. Analyses were performed with IBM SPSS statistics 21 software, and a p-value less than 0.05 was used to define statistical significance. Data availability. The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.