Sequence investigation of 34 forensic autosomal STRs with massively parallel sequencing

STRs vary not only in the length of the repeat units and the number of repeats but also in the region with which they conform to an incremental repeat pattern. Massively parallel sequencing (MPS) offers new possibilities in the analysis of STRs since they can simultaneously sequence multiple targets in a single reaction and capture potential internal sequence variations. Here, we sequenced 34 STRs applied in the forensic community of China with a custom-designed panel. MPS performance were evaluated from sequencing reads analysis, concordance study and sensitivity testing. High coverage sequencing data were obtained to determine the constitute ratios and heterozygous balance. No actual inconsistent genotypes were observed between capillary electrophoresis (CE) and MPS, demonstrating the reliability of the panel and the MPS technology. With the sequencing data from the 200 investigated individuals, 346 and 418 alleles were obtained via CE and MPS technologies at the 34 STRs, indicating MPS technology provides higher discrimination than CE detection. The whole study demonstrated that STR genotyping with the custom panel and MPS technology has the potential not only to reveal length and sequence variations but also to satisfy the demands of high throughput and high multiplexing with acceptable sensitivity.


Results and Discussion
In this study, the custom panel allows for simultaneously detection of up to 34 polymorphic forensic autosomal STRs and the sex determination locus of Amelogenin. Detail information of the 34 STRs and corresponding primers were attached as Supplementary Table S1. The Bed file and a Params file for analysis were accordingly programmed based on the coordinate positions and motif structures. We explored the MPS performance from sequencing reads analysis, concordance study and sensitivity testing, to document the performance capabilities and limitations of the custom panel.
Sequencing reads analysis. 9947A and 9948 (Promega, USA) were adopted as reference samples in this study. Libraries of them were pooled for two separate emPCRs and corresponding emPCR products which correspond to the library dilution point of 17% and 23% of template ISPs were sequenced on individual Ion 314 Chips. The sequencing data yielded concordant genotype results between each replicate and with the CE genotyping results. No significant differences in allele coverage (p = 0.0751) and allele coverage ratio (ACR) values (p = 0.1864) at the 34 STRs were observed between sequencing replicates, thus we combined the two batches of data together for analysis. The averaged depth of coverage (DoC) among the 34 STRs ranged from 1652 to 2760, with ACR values of heterozygotes ranging from 0.67 to 0.94. And Isoalleles, i.e., alleles of the same length but differing in sequence, were observed at D8S1179 of 9947A. Genotype of homozygote 13 was displayed with CE technology ( Supplementary Fig. S1-A), while sequence heterozygote of [TCTA] 1 [TCTG] 1 [TCTA] 11 and [TCTA] 13 was recognized with MPS technology ( Supplementary Fig. S1-B). And Sanger sequencing was conducted to verify the sequences (Supplementary Fig. S1-C).
Sequencing reads observed at each locus can been divided into allele, stutter, and noise. Here, we analysed the allele/stutter/noise information of the 34 STRs with the double sequenced data from the 200 tested samples. Since a maximum of 25 samples can been sequenced on the Ion 318 v2 chip with this custom panel, the double tested 200 samples were sequenced on 16 separate runs. After filter out data from empty wells, polyclonal, tested fragments, adapter dimer and low quality signals, the usable reads of the 16 runs ranged from 68% to 87%. Although the run-to-run variation is unavoidable and do affect the constitution of sequencing reads, sequencing genotypes between each replicate of the 200 samples were concordant. Figure 1 shows the averaged composition ratios at the 34 STRs with the double-sequencing reads from the 200 individuals. The stutter ratios ranged from 2.425% (Penta D) to 12.686% (D22S1045), the noise ratios ranged from 0.097% (D17S1301) to 4.731% (D1S1656), while the allele ratios ranged from 83.888% (FGA) to 97.468% (Penta D). The averaged allele, stutter, and noise percentages were 91.64%, 6.808%, and 1.551%, respectively. Compared with sequencing data of corresponding STRs from the newest commercial MPS kits of Precision ID GlobalFiler TM NGS STR Panel 9 and Illumina ® ForenSeq TM DNA Signature Prep Kit 7,8 , significant differences were observed with the constitution ratios (data not shown). No noise signals was observed at TPOX, D6S1043, D2S1776, D3S1358, D16S539, and D7S820 with Precision ID GlobalFiler TM NGS STR Panel 9 , while no noise signals was observed at Penta E with Illumina ® ForenSeq TM DNA Signature Prep Kit 8 .
Since the evaluation of Precision ID GlobalFiler TM NGS STR Panel was also performed on the Ion Torrent PGM platform, and same kits for emPCR and sequencing were used, the worse data of the custom panel presented here indicate that further optimization of the library primers should been explored in future studies.
With the allele sequencing reads, we used the averaged values of Doc and ACR to evaluate the performance of the 34 STRs. Figure 2A illustrates the Doc information, while Fig. 2B shows the ACR values from the observed heterozygous balance at the 34 STRs. The mean DoC for the 34 loci ranged from a low value of 1144x ± 576.5 at D19S433 to a high value of 3284x ± 1163 at D14S1434. The mean ACR values ranged from 0.6418 ± 0.0998 (D12ATA63) to 0.9350 ± 0.0887 (D2S1338).

Concordance study.
A concordance study of the 200 DNA samples was first performed by comparing the genotypes from identical samples prepared and run in different sequencing reactions. No inconsistent genotype calling were observed between the double sequencing, although the Doc and ACR values of heterozygotes varied due to run-to-run variations. By Fisher's exact test, no significant differences in ACR values (p = 0.1011) at the 34 STRs between each replicate sequencing were observed, indicating the variations of heterozygotes performance with different runs can be ignored.  (Table 1). MPS technology did not identify additional alleles for 18 STR loci, with locus D21S11 showing the highest degree of variation. The additional 72 alleles were identified based on sequence differences in the same PCR fragment.
Among the 200 samples, three samples were detect as homozygote "19" with Powerplex 21 (Promega, Wisconsin, USA) and Goldeneye TM 20A (Goldeneye, co, Ltd, China) kits at D2S1338 locus; however, heterozygotes of 19/24, 17/19 and 19/24 were detected by MPS sequencing, respectively. In other words, allele drop-out may occurred at D2S1338 locus with the primers from the Powerplex 21 and Goldeneye TM 20A kits by CE technology. Sanger sequencing were verified our conjecture ( Supplementary Fig. S2). Since the primer information for the two commercial kits is confidential, we assume that the PCR primers may fail to amplify a particular allele due to variation in the STR flanking regions or primer binding site mutations of these samples.
We also found a sample genotyped with discordant homozygotes "11" and "11.1" at the CSF1PO locus with the Powerplex 21 kit (Promega, Wisconsin, USA) and Goldeneye TM 20A kit (Goldeneye, co, Ltd, China), respectively. The MPS sequencing genotype of this sample was "11". This phenomenon occurred due to insertion of a cytosine 128 bp downstream of the motif region. The results suggested that the reverse primer for the CSF1PO locus in the Goldeneye TM 20A kit (Goldeneye, co, Ltd, China) should be moved to the region between the motif and the mutation site.
And a four-base deletion (rs561167308) in the 3′ flanking region of D13S317 was observed in two samples. Since the Indel was present outside the motif structure but within the amplified region, the PCR-CE detection gave alleles with one less repeat than MPS sequencing.
Above results reveal that no actual inconsistent results were existed among the 6800 genotypes. And STR sequencing information produced with MPS technology, include motif and flanking areas, can obtain a better understanding of STRs and reveal the flaws of current commercial STR kits. Sensitivity testing. Sensitivity study can be defined as the ability to produce reliable profiles from a range of DNA quantities. Initial DNA of 10 ng was recommended for library preparation in the protocol 'Ion AmpliSeq ™ Library Preparation Revision A.0' with our custom panel. To evaluate the sensitivity of this panel, libraries from a serial dilution (10 ng, 5 ng, 2 ng, 1 ng, 0.5 ng, 0.25 ng, and 0.125 ng) of control DNA 9948 were pooled in duplicate and sequenced on an Ion 316 v2 chip. The total obtained sequencing data was 373.28 MB. The CE genotyping results of 9948 at the 34 STRs were obtained by amplification of 0.5 ng DNA with Powerplex 21 (Promega, Wisconsin, USA), Goldeneye TM 20A and Goldeneye TM 22NC (Goldeneye, co, Ltd, China) kits individually. The 14 emPCR products were barcoded as 1-14. Concordant results were obtained between each replicate and with the CE genotyping results at the 34 STRs except when less than 0.5 ng of DNA was used. When 0.25 ng of DNA or less was used, allele drop-out was observed. Within the correct genotypes detected with 0.25 ng and 0.125 ng of DNA, heterozygous imbalance (ACR < 0.6) was observed. The mean DoC was 2208x ± 534 for the 10 ng library and 353x ± 139 for the 0.125 ng library (Fig. 3A). Figure 3B shows the performance of ACRs from the heterozygotes at the seven different concentrations (24 loci detected as heterozygotes). The variation of allelic balance of heterozygotes was greater in the experiments with lower amounts of DNA. The mean ACRs of heterozygotes were all greater than 60% when DNA ranged from 10 ng to 0.5 ng. When 0.25 ng and 0.125 ng of DNA were used, averaged ACR values of detected heterozygous are 69.58% and 59.71%, respectively. Above results demonstrated that the minimum DNA amount for this panel was 0.5 ng. The average DoC of the 34 STRs was 832x, while the average ACR value was 81.22% when 0.5 ng of DNA was used. Alleles and forensic parameters. STRs detected with MPS technology could provide sub-repeat variants that were undetected by PCR-CE typing. Among the 34 STRs, D21S11 was the most sequence-polymorphic locus, with 30 alleles ranging from 9 to 34.2 repeat units. In our study, isoalleles were detected at 16 STRs (Supplementary Table S2). The alleles and corresponding frequencies of the 418 sequenced alleles at the 34 STRs among the 200 individuals were listed in Supplementary Table S2. The forensic parameters obtained from both CE and MPS data were listed in Table 1. The statistical data were investigated regarding the increase in the number of effective alleles due to the presence of sequence variations in the STR repeat regions. The most significant changes in Power of Exclusion in trios (PEtrios) and Power of Exclusion in duos (PEduos) between MPS and CE were observed at D12S391, with +0.1657 and +0.234, respectively. As all 34 autosomal STR loci were independent from each other with linkage disequilibrium analysis, the combined forensic efficiency parameters were calculated based on the allelic frequencies. The Cumulative Power of Exclusion in duos (CPEduos) and in trios (CPEtrios) were 0.999 999 999 999 975 and 0.999 999 982 454 768 with CE methods, while the CPEduos and CPEtrios were 0.999 999 999 999 999 and 0.999 999 999 026 630 with MPS technology. Since the information obtained from MPS technology provided higher discrimination than those obtained from the CE detection, MPS method was expected to be particularly useful for parentage testing, by enabling the resolution of isoalleles, as well as distinguishing variants in flanking regions. However, as indicated by the ISFG guidelines, the current allelic frequencies obtained via MPS technology are not sufficient to quantify any new variations, thus comprehensive MPS databases is required to characterize the extent of the STR sequence variations for estimating the STR allele frequencies 4 .  The start and stop coordinate positions (GRCh38 human reference genome) of motifs and structures referred to the newest forensic genetic nomenclature recommendations of ISFG 4 and NIST STRbase 13 . AmpliSeq Designer was adopted for multi-primer designing, and the candidate targets were submitted to Thermo Fisher AmpliSeq primer design tool (http://www.ampliseq.com). The locus of SE33 was excluded from the final multi-primer designing process, remaining 34 STRs. Primers of the 34 targets were listed in Supplementary Table S1. The library length of them ranged from 243 to 310 base pairs (bps).  Ion Torrent PGM TM sequencing. Samples sequenced on the same chip were pooled in equimolar ratios prior to sequencing. Considering the chip content and coverage depth, we sequence a maximum of 25, 14, and 3 samples in parallel on the Ion 318 v2, 316 v2 and 314 v2 chips. Each sample was sequenced twice in this study. Sequencing was performed on the Ion Torrent PGM TM platform (Thermo Fisher Scientific, USA) using the Ion Torrent PGM TM Hi-Q TM Sequencing Kit (Thermo Fisher Scientific, USA); the number of flows was "850", and the nucleotide flow order was "Samba Gafieira, " which improved the end-to-end success rates and signal-to-noise ratios for STR sequencing. In this study, 200 individual samples were involved for sequencing performance evaluation of the custom panel. For these samples, we pooled 25 different DNA samples for each emulsion PCR (emPCR) and each sequencing reaction; thus, a total of 16 Ion 318 v2 chips were used. For sensitivity testing, purchased 9948 of 10 ng/µL (Promega, USA) was serially diluted to generate DNA concentrations of 5, 2, 1, 0.5, 0.25 and 0.125 ng/µL, and 1 µl of each concentration was added to the library preparation system. Thus, the DNA input for sensitivity testing ranged from 10 ng to 0.125 ng. The samples were subject to 25 target amplification cycles for <1 ng of DNA input. Above DNA were prepared by two independent operators in parallel, thus 14 libraries were generated. To avoid variations in emPCR and sequencing runs, libraries were pooled for one emPCR and labeled by different barcodes to conduct template preparation and then sequenced on one Ion 316 v2 chip.

Samples and library preparation.
Data processing. Raw data were processed with Ion Torrent Suite Software v4.4.0 (Thermo Fisher Scientific, USA), and STR sequence calling was handled with the HID STR Genotyper v4.0 plugin equipped with a self-programmed BED file and a Param file. The analytical threshold of 250 sequencing reads was applied. After the analysis, a PDF report with detailed genotypes, coverage, sequences and coverage plots for each sample at each STR locus and an Excel file listing the barcode, sample, locus, allele, status, coverage and sequence information were obtained.
The MPS-STR reads observed at each locus were divided into allele, stutter and noise reads. Stutters were defined as sequences in which one or two motifs were shorter or longer than the parent allele. Noise was defined as reads that were not alleles or stutters, i.e., PCR/sequence errors. Allele/stutter/noise percentages were determined by dividing the number of reads containing the allele/stutter/noise by the total number of reads for each locus. DoC and ACR parameters were used to evaluate the STR sequencing performance. The ACR parameter was determined by dividing the lower-coverage allele by the higher-coverage allele at heterozygous genotypes.
Sequence allelic frequencies were assessed with direct counting methods. Statistical parameters of polymorphism information content (PIC), exclusion power in duos (PEduos) and exclusion power in trios (PEtrios) to evaluate the forensic efficiency were calculated using the formulas listed in references 15,16 .