Parallel Analysis of 124 Universal SNPs for Human Identification by Targeted Semiconductor Sequencing

SNPs, which are abundant in the human genome and have a low mutation rate, are attractive for genetic applications such as forensic, anthropological and evolutionary studies. Universal SNPs, which show little allelic frequency variation among populations while remaining highly informative for human identification, were obtained from previous studies. However, existing genotyping tools target only dozens of markers simultaneously, which limits their application. Here, 124 SNPs were tested simultaneously using AmpliSeq technology on the Ion Torrent PGM platform. A concordance study between NGS and Sanger sequencing was performed with the two reference samples 9947A and 9948. Full concordance was obtained except for the genotype of rs576261 in 9947A. The parameter FMAR (%) was introduced for NGS data analysis for the first time and was used to evaluate allelic performance, sensitivity testing and mixture testing. FMAR values should range from 50% to 60% for accurate heterozygotes and should be above 90% for homozygotes and Y-SNPs. The SNPs rs7520386, rs4530059, rs214955, rs1523537, rs2342747, rs576261 and rs12997453 were recognized as poorly performing loci, showing either allelic imbalance or lower coverage. Sensitivity testing demonstrated that all genotypes were correct with DNA inputs from 10 ng down to 0.5 ng. For mixture testing, a clear linear correlation (R² = 0.9429) was observed between the expected and observed FMAR values of the mixtures.

Scientific Reports | 5:18683 | DOI: 10.1038/srep18683

beta assays. HID_SNP_v1.0 (containing 103 autosomal SNPs and 33 Y-SNPs), the first beta panel, was tested by Budowle et al. 14. HID_SNP_v2.2 (containing 136 autosomal SNPs and 33 Y-chromosome markers), the second beta panel for human identification, was first tested by the Morling group 15 and then evaluated in an inter-laboratory study by six laboratories 10. Based on these test data, 34 upper Y-clade SNPs 13 and 90 autosomal SNPs 11,12 with high heterozygosity and a low fixation index (FST) were included in the first commercially available panel, named HID-Ion AmpliSeq™ SNP-124. Since no data on this panel have been published yet, we evaluated the panel and explored its application in the Chinese Han population.

Materials and Methods
The main experiments were conducted in the Forensic Genetics Laboratory of the Institute of Forensic Science, Ministry of Justice, P.R. China, a laboratory accredited under ISO 17025, in accordance with quality control measures. All methods were carried out in accordance with the approved guidelines of the Institute of Forensic Science, Ministry of Justice, P.R. China.
Sample preparation. Control Supplementary Table S1.

Library Preparation, Quantification and Emulsion PCR (emPCR). AmpliSeq technology and the corresponding chemistry of the Ion AmpliSeq Library Kit 2.0-96 LV (Life Technologies) were applied for library preparation. AmpliSeq technology delivers simple and fast library construction for affordable targeted sequencing of genomic regions. The Ion AmpliSeq™ workflow simplifies ultrahigh-multiplex PCR amplification and library construction. Using low-input DNA, this single-tube workflow is as simple as setting up a PCR reaction and helps to avoid contamination. The library PCR system contained 4 μL of 5X Ion AmpliSeq HiFi Master Mix and 10 μL of 2X HID-Ion AmpliSeq™ SNP-124 Panel. Except for the sensitivity and mixture testing of the panel, the initial DNA input was 10 ng for each library construction. The library PCR parameters were as follows: 2 min at 99 °C, 18 cycles of 15 s at 99 °C and 4 min at 60 °C, followed by a 10 °C hold. For sensitivity testing, 21 cycles were used when the DNA amount was lower than 1 ng in order to obtain sufficient amplification. The resulting amplicons were treated with 2 μL of FuPa reagent (Life Technologies) to partially digest the primers. All libraries were barcoded using Ion Xpress™ Barcode Adapters (Life Technologies). After ligation of the barcodes, the libraries were purified with Agencourt AMPure XP Reagents (Beckman Coulter, Brea, CA). qPCR with the Ion Library Quantitation Kit (Life Technologies) was then used for accurate library quantification.
The accurately quantified and diluted library pool was then used to generate template-positive Ion Sphere™ Particles (ISPs) containing clonally amplified DNA by emPCR, which was performed on the Ion OneTouch 2 (OT2) (Life Technologies) using the Ion PGM Template OT2 200 Kit. The quality of the emPCR products was evaluated with the Ion Sphere™ Quality Control Kit (Life Technologies). The optimal amount of library corresponds to the library dilution point that gives a percentage of templated ISPs between 10% and 30%. The emPCR products were then enriched on the Ion OneTouch™ ES (Life Technologies).
Evaluation of HID-Ion AmpliSeq™ SNP-124 Panel. For concordance and accuracy testing, the female sample 9947A (Life Technologies) and the male sample 9948 (Promega) were used as reference samples. The 124 SNPs of the two reference samples were sequenced by both NGS and Sanger technologies, with the Sanger method serving to validate the NGS results. For NGS fragments above 150 bp, the same primer pair was used for Sanger sequencing; for NGS fragments below 150 bp, a different primer pair was designed for Sanger sequencing (listed in Supplementary Table S1). For the evaluation of the panel and the forensic performance of the 124 SNPs, 45 unrelated healthy Chinese Han individuals were included in the study. For sensitivity testing of the panel, serial dilutions of an in-house male control sample were prepared to give DNA concentrations of 10, 5, 2, 1, 0.5 and 0.2 ng/μL, and 1 μL of each concentration was added to the library PCR setup. In other words, the DNA input for sensitivity testing ranged from 10 ng to 0.2 ng. The libraries of the 6 different concentrations were tested twice on the Ion 314 chip. For the mixture study, DNA mixtures of the control samples 9947A and 9948 were prepared at ratios of 100:1, 10:1, 5:1, 1:1, 1:5, 1:10 and 1:100. For the 1:1 ratio, 5 ng of each DNA was mixed. For the ratios of 100:1 and 10:1, 50 pg and 500 pg of 9948, respectively, were added to 5 ng of 9947A. For the ratios of 1:10 and 1:100, 500 pg and 50 pg of 9947A, respectively, were added to 5 ng of 9948. For the 5:1 ratio, 2.5 ng of 9947A was mixed with 500 pg of 9948. The DNA input for library preparation therefore ranged from 3 ng to 10 ng, and the 7 mixture libraries were tested twice on the Ion 314 chip.
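The stated input range of 3 ng to 10 ng for the mixture libraries can be checked with a little arithmetic. The following is only an illustrative sketch: the dictionary name and helper function are hypothetical, and the 1:5 mixture is left out because its individual amounts are not stated in the text.

```python
# Amounts in picograms as (9947A, 9948) for the mixtures whose inputs are
# stated in the text. The 1:5 mixture is omitted: its amounts are not given.
MIXTURES_PG = {
    "100:1": (5000, 50),
    "10:1": (5000, 500),
    "5:1": (2500, 500),
    "1:1": (5000, 5000),
    "1:10": (500, 5000),
    "1:100": (50, 5000),
}


def total_input_ng(amounts_pg):
    """Return the total DNA input of one mixture in nanograms."""
    return sum(amounts_pg) / 1000


# Total input per mixture ratio, in ng.
totals = {ratio: total_input_ng(amounts) for ratio, amounts in MIXTURES_PG.items()}

for ratio, total in totals.items():
    print(f"{ratio}: {total} ng")
print("range:", min(totals.values()), "to", max(totals.values()), "ng")
```

The smallest total comes from the 5:1 mixture and the largest from the 1:1 mixture, which is consistent with the range quoted above.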

Results and Discussion
Concordance study. The control samples 9947A (Life Technologies) and 9948 (Promega) were chosen for the concordance study. The 124 SNPs (listed in Supplementary Table S1) of these samples were sequenced by NGS and Sanger technologies. The HID_SNP_Genotyper.42 (v4.2) plug-in and Chromas were used for genotyping analysis of the NGS data and the Sanger sequencing data, respectively. NGS technology offers ultra-high throughput, but its read length is remarkably short compared to conventional Sanger sequencing. In this study, the shortest PCR product of a targeted SNP for NGS was 77 bp and the longest was 244 bp. Thus, for fragments below 150 bp, different primer pairs were designed for Sanger sequencing (Supplementary Table S1). The sequencing results of 9947A and 9948 by NGS and Sanger sequencing are listed in Supplementary Table S2-1 and Table S2-2, respectively. Except for rs576261 (SNP No. 77) of control DNA 9947A (Supplementary Table S2-1), there was complete concordance between the NGS and Sanger sequencing results of the two reference samples. For SNP rs576261, the NGS result of 9947A was 'AC' while the Sanger sequencing result was 'C' (Fig. 1). Inspection of the BAM file with IGV software indicated that the correct genotype at SNP rs576261 of sample 9947A is 'C'. The sequence context surrounding SNP rs576261 is TCTGTCACCA[A/C]CCCTGGCCTC. The SNP is followed by a homopolymer stretch (CCC), and one of its possible alleles (C) is identical to the base of that stretch. Misalignment of reads and incorrect allele calls therefore led to the wrong NGS genotype (Fig. 1A). The read counts for base A (193) and base C (1422) also differ markedly.
A parameter termed FMAR (Frequency of Major Allele Reads) was adopted here. Analysis with the HID_SNP_Genotyper.42 (v4.2) plug-in provides the detailed read counts of each base (A, C, G and T) for each SNP. FMAR was calculated as the largest read count among the four bases divided by the total number of detected reads. For homozygotes, the optimal FMAR (%) is 100, while for heterozygotes, the optimal FMAR (%) is 50. In previous studies, intra-locus balance (the lower peak height divided by the higher peak height for each locus) was used to measure the balance of heterozygous alleles. According to Eduardoff et al., 50% is a 'perfect balance', and a 40% threshold (60:40 heterozygote ratio) gives a better equilibrium between obtaining the highest proportion of reliable genotypes and balanced SNP signals. This means that the FMAR (%) for accurate heterozygotes should be 50%-60%. For STRs, an intra-locus balance above 70% is desired to ensure accurate heterozygote genotyping and to facilitate mixture interpretation 16. If the same principle is adopted for SNP calling, the FMAR (%) for accurate heterozygotes should range from 50% to 59% (1/(0.7 + 1)). Therefore, the boundaries of FMAR (%) were set at 50-60% for ideal allelic balance of heterozygotes in this study. According to Eduardoff et al., SNPs with major allele read frequencies of 90% or greater were deemed to be homozygous for that allele 2, as the presence of other bases at a low proportion in the Ion Torrent PGM data arises from non-specific incorporation, but the proportion of a     Table S3). The Sanger method was also used to validate the NGS results. For the incorrect genotype of rs7520386 in sample '78#', an abnormal FMAR value (79.05%) was detected; for the incorrect genotype of rs214955 in sample 'A12_045', lower coverage (<100) was observed. These results suggest that attention should be paid to data with lower coverage or abnormal FMAR values.
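The FMAR definition and its relation to intra-locus balance can be sketched in a few lines of Python. This is a minimal illustration rather than the plug-in's own computation: the function names are hypothetical, and for the rs576261 example only the A and C read counts are reported in the text, so the G and T counts are assumed to be zero.

```python
def fmar_percent(read_counts):
    """Return FMAR (%): largest base read count divided by total reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads detected at this SNP")
    return 100.0 * max(read_counts.values()) / total


def fmar_from_balance(balance):
    """Convert intra-locus balance (minor/major) to FMAR (%).

    With major reads M and minor reads m = balance * M,
    FMAR = M / (M + m) = 1 / (1 + balance).
    """
    return 100.0 / (1.0 + balance)


# rs576261 in 9947A: A and C counts are taken from the text;
# G and T are ASSUMED to be 0 because they are not reported.
rs576261_9947a = {"A": 193, "C": 1422, "G": 0, "T": 0}

print(round(fmar_percent(rs576261_9947a), 2))  # falls between the two accepted ranges
print(round(fmar_from_balance(0.7), 2))        # upper bound implied by 70% balance
```

Under these assumptions the rs576261 call in 9947A has an FMAR of roughly 88%, outside both the 50%-60% heterozygote range and the 90% homozygote threshold, which fits its discordant genotype. A balance of 0.7 maps to an FMAR just under 59%, matching the 1/(0.7 + 1) bound quoted above.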
By analyzing all the NGS data of the 45 individuals, heterozygote imbalance was found at the SNPs rs7520386, rs4530059, rs214955, rs1523537, rs2342747 and rs576261 (Supplementary Table S4). Among these 6 SNPs, rs7520386 and rs4530059 showed higher imbalance, with mean FMAR (%) values above 60%. With these 6 SNPs excluded, all detected FMAR (%) values of the 45 individuals are plotted in Fig. 2. The FMAR (%) values ranged from 50% to 60% for heterozygotes and from 90% to 100% for homozygotes and Y-SNPs. A minimum threshold of 100× coverage is recommended in this study. Coverage was consistently high, with little variation between samples. However, coverage varied among the SNPs, and each SNP generally showed similarly high or low coverage across the samples. Among the auto-SNPs, lower coverage always occurred at rs2342747 and rs12997453, which was also observed when genotyping 9947A and 9948 (Supplementary Tables S2-1 and S2-2). The differences in coverage may primarily be related to differences in PCR amplification efficiency. Modifying the primer concentration of SNP rs12997453 (lower coverage but ideal heterozygote performance) in the pool may provide a higher library yield. For the SNPs listed in Supplementary Table S4, modification of the primers and/or primer concentrations may provide better balance and higher yield across the SNPs of the panel. Therefore, NGS results analyzed with the HID_SNP_Genotyper.42 (v4.2) plug-in that show heterozygote FMAR (%) values above 60% or total coverage below 100× should be checked or discarded. In this study, the 6 SNPs detected with abnormal FMAR values (rs7520386, rs4530059, rs214955, rs1523537, rs2342747 and rs576261) and the 2 SNPs with lower coverage (rs2342747 and rs12997453) were recognized as poorly performing SNPs. SNP rs2342747, which was consistently detected with both abnormal FMAR values and lower coverage, should perhaps be removed from the panel.
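The checking rule described above (coverage below 100×, or an FMAR value between the heterozygote and homozygote ranges) could be applied to each SNP call along the following lines. This is a hedged sketch only: the function name, the flag strings and the coverage value in the example are hypothetical, and the thresholds are simply those proposed in this study.

```python
def qc_flag(fmar, coverage, min_coverage=100, het_max=60.0, hom_min=90.0):
    """Classify one SNP call by FMAR (%) and total coverage.

    Thresholds follow the text: at least 100x coverage, FMAR of at most
    60% for heterozygotes and at least 90% for homozygotes and Y-SNPs.
    """
    if coverage < min_coverage:
        return "check: low coverage"
    if fmar >= hom_min:
        return "homozygote range"
    if fmar <= het_max:
        return "heterozygote range"
    return "check: abnormal FMAR"


# FMAR of 79.05% was reported for rs7520386 in sample '78#';
# the coverage of 500 is a made-up value for illustration.
print(qc_flag(79.05, 500))
print(qc_flag(55.0, 80))
print(qc_flag(98.7, 1200))
```

Checking coverage first means a low-coverage call is always flagged, even when its FMAR happens to fall inside an accepted range.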
In the previous inter-laboratory evaluation of HID_SNP_v2.2 with 169 markers for ancestry inference, discordant genotypes detected in 5 SNPs (rs1979255, rs1004357, rs938283, rs2032597 and rs2399332) indicated that these loci should be excluded from the panel 10. Two of these SNPs (rs1979255 and rs938283) are also included in the present panel. For these 2 SNPs, the genotyping results of the tested samples were correct and the FMAR (%) values were within the ranges for heterozygotes and homozygotes. Therefore, modification of the primers of 'problematic SNPs' may effectively improve the performance of a new panel.
poorly performing SNPs: rs7520386, rs4530059, rs214955, rs1523537, rs2342747, rs576261 and rs12997453). DNA inputs of 2-10 ng gave optimal FMAR (%) values for heterozygotes. Some strongly deviating FMAR values for heterozygotes were observed with DNA inputs of 0.2-1 ng, especially below 0.5 ng (Fig. 3). Although the default settings of the HID_SNP_Genotyper.42 plug-in gave correct genotypes for all called samples, further analysis is essential for data with low coverage or abnormal FMAR values. These results demonstrated that the optimal amount of DNA in the PCR seemed to be above 0.5 ng, comparable to current STR analysis requirements 16,17. It seems likely that the sensitivity can be improved by further optimization of the primer pool or the PCR, or by removing some poorly performing SNPs from the panel.
Mixture study. In this study, mixtures of the two reference samples (9947A and 9948) at ratios of 100:1, 10:1, 5:1, 1:1, 1:5, 1:10 and 1:100 were studied. Table 1 lists the theoretical FMAR values of the mixtures for all possible genotype combinations, except those in which the two control samples have the same genotype. Figure 4 shows the theoretical and observed FMAR values of the mixtures for the genotype combinations in Table 1. The 7 poorly performing auto-SNPs (rs7520386, rs4530059, rs214955, rs1523537, rs2342747, rs576261 and rs12997453) were excluded from this analysis. There was a clear linear correlation between the expected and observed FMAR values of all the mixtures (R² = 0.9429), which indicated that the assay generated a faithful representation of the DNA samples. Detection of mixtures with auto-SNPs is therefore possible by analyzing the FMAR values of NGS data. NGS data can give balanced heterozygous genotypes, providing a more secure basis for analyzing mixtures. With the commonly used SNaPshot system, it is vital to reliably recognize whether SNP data originate from a mixture or from a single profile. The genotyping results of the 34 Y-SNPs of the mixtures are listed in Table S5. In the 100:1, 10:1 and 5:1 mixtures of 9947A/9948, 14.71%, 97.06% and 100% of the Y-SNPs were detected, respectively. For the other mixture ratios, 100% of the Y-SNPs were detected.
Genetic analysis of 124 SNPs in Chinese Han population. A total of 45 unrelated individuals (17 females and 28 males) of the Chinese Han population were sequenced twice with the HID-Ion SNP124 panel on the Ion 318 chip. All genotypes obtained at the 7 poorly performing SNPs were checked by Sanger sequencing. Genetic analysis of these 124 SNPs was performed with SNP Analyzer Software 18. After Bonferroni correction, no significant deviation from HWE expectations was detected for the 90 auto-SNPs in the Han population (N = 45). The allelic frequencies and forensic parameters of the 90 auto-SNPs are listed in Table 2.
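For the mixture study, one plausible way to obtain theoretical FMAR values for an autosomal SNP is shown below. It is a sketch under a stated assumption, namely that each contributor's reads are proportional to its DNA input and to its number of copies of each allele, and it is not necessarily how Table 1 was computed. The function name and the example genotypes are hypothetical.

```python
from collections import Counter


def expected_fmar(genotype_1, genotype_2, part_1, part_2):
    """Theoretical FMAR (%) of a two-person mixture at one autosomal SNP.

    genotype_1 and genotype_2 are two-letter genotypes such as "AA" or
    "AC"; part_1:part_2 is the mixture ratio of the two contributors.
    ASSUMPTION: reads scale with DNA input and allele copy number.
    """
    weighted = Counter()
    for genotype, part in ((genotype_1, part_1), (genotype_2, part_2)):
        for allele in genotype:
            weighted[allele] += part
    total = sum(weighted.values())
    return 100.0 * max(weighted.values()) / total


# Hypothetical genotype combinations at the mixture ratios used in the study.
for ratio in ((100, 1), (10, 1), (5, 1), (1, 1), (1, 5), (1, 10), (1, 100)):
    print(ratio, round(expected_fmar("AA", "AC", *ratio), 2))
```

In this model a 1:1 mixture of a homozygote and a heterozygote sharing one allele gives an FMAR of 75%, which lies in neither accepted single-source range. That is why FMAR values can signal a mixture.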
Based on the data of the auto-SNPs investigated among the Han individuals, LD analysis was explored. Pairwise LD calculation and Gabriel's method 18 (Supplementary Fig. S1) showed that no LD existed among the 90 auto-SNPs. Therefore, the CDP (Cumulative Discrimination Power) was 1 − 5.2192 × 10⁻²³ for the 90 auto-SNPs in the Chinese Han population. For the Y-SNP analysis, 6 haplotypes were found in the 28 unrelated male individuals. These results suggest that the HID-Ion SNP124 panel is suitable for personal identification in the Han population of China.
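Because the loci are treated as independent (no LD), the cumulative discrimination power can be obtained from per-locus values with the standard product formula CDP = 1 − ∏(1 − DPᵢ). The sketch below assumes Hardy-Weinberg genotype frequencies and uses a made-up allele frequency of 0.5 for every locus; neither the function names nor the numbers are taken from the paper. It returns the product term itself, since subtracting such a small number from 1 rounds to exactly 1.0 in double precision.

```python
import math


def snp_discrimination_power(p):
    """DP of a biallelic SNP with allele frequency p, ASSUMING HWE.

    DP = 1 - sum of squared genotype frequencies.
    """
    q = 1.0 - p
    genotype_freqs = (p * p, 2.0 * p * q, q * q)
    return 1.0 - sum(f * f for f in genotype_freqs)


def cumulative_match_probability(dp_values):
    """Return prod(1 - DP_i) over independent loci; CDP = 1 - this value."""
    return math.prod(1.0 - dp for dp in dp_values)


# Hypothetical panel: 90 independent SNPs, each with allele frequency 0.5.
dps = [snp_discrimination_power(0.5)] * 90
print(f"CDP = 1 - {cumulative_match_probability(dps):.4e}")
```

Reporting the result in the form "1 − x", as the text does, avoids losing the informative part of the number to floating-point rounding.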

Conclusion
NGS combined with AmpliSeq technology has the capacity to sequence targeted regions of multiple DNA samples simultaneously with high coverage. Compared with Sanger sequencing, this technology can also reduce labor and cost on a per-nucleotide basis and indeed on a per-sample basis. In this study, high coverage and high throughput were achieved for 124 specified targets with the commercially available SNP panel. The parameter FMAR (%) was applied to evaluate allelic performance, sensitivity testing and mixture testing, making the NGS data easy to interpret. Further modification of the panel can be explored based on the data obtained. This pilot study has demonstrated the considerable potential of the Ion Torrent PGM Sequencer for SNP detection as a low- to medium-throughput NGS platform. Although capillary electrophoresis remains the gold standard and the most cost-effective option for human identification with short tandem repeats (STRs) 16,17, the PGM Sequencer System extends forensic analysis capabilities. SNPs related to bio-geographical ancestry (BGA) or externally visible characteristics (EVC), as well as STRs, have also been explored with the PGM 10,15. These features make marker typing on an NGS platform particularly appealing.