Identification of sequence polymorphisms at 58 STRs and 94 iiSNPs in a Tibetan population using massively parallel sequencing

Massively parallel sequencing (MPS) has rapidly become a promising method for forensic DNA typing, due to its ability to detect a large number of markers and samples simultaneously in a single reaction, and sequence information can be obtained directly. In the present study, two kinds of forensic genetic markers, short tandem repeat (STR) and identity-informative single nucleotide polymorphism (iiSNP) were analyzed simultaneously using ForenSeq DNA Signature Prep Kit, a commercially available kit on MPS platform. A total of 152 DNA markers, including 27 autosomal STR (A-STR) loci, 24 Y chromosomal STR (Y-STR) loci, 7 X chromosomal STR (X-STR) loci and 94 iiSNP loci were genotyped for 107 Tibetan individuals (53 males and 54 females). Compared with length-based STR typing methods, 112 more A-STR alleles, 41 more Y-STR alleles, and 24 more X-STR alleles were observed at 17 A-STRs, 9 Y-STRs, and 5 X-STRs using sequence-based approaches. Thirty-nine novel sequence variations were observed at 20 STR loci. When the flanking regions were also analyzed in addition to target SNPs at the 94 iiSNPs, 38 more alleles were identified. Our study provided an adequate genotype and frequencies data of the two types of genetic markers for forensic practice. Moreover, we also proved that this panel is highly polymorphic and informative in Tibetan population, and should be efficient in forensic kinship testing and personal identification cases.


Results
Sequencing quality and concordance analysis. The average cluster density was 740 k/mm 2 , while the average total number of reads was 96,239 and 110,292 for each sample for the two runs (Supplementary  Table S1). Concordance analysis of two software packages (UAS and STRait Razor v2s) showed the same allele calling based on length. The LB alleles were in concordance with corresponding CE results for the 23 shared STR loci from the Forenseq system and Goldeneye DNA ID System 25A amplification system (Peoplespot SciTech Incorporation, Beijing, China) except D22S1045. Allele imbalance was observed at D22S1045 and all of the ACRs from D22S1045 were lower than 0.50 (range from 0.0113 to 0.4918) (Fig. 1), which led to some miscalling of heterozygotes as homozygotes. Considering that allele genotypes of the D22S1045 locus were questionable, D22S1045 were discarded for the following statistics. Sequence-based average ACRs of the other 26 A-STRs ranged from 0.6996 (Penta E) to 0.9572 (TH01) (Fig. 1).
A-STRs dataset were verified and accepted by STRidER (https ://strid er.onlin e/) with the assigned accession number STR000149 22 . Novel alleles and STR allele sequence variations. Thirty-nine novel alleles were detected at 20 loci in the Tibetan population compared with the records in STR Sequencing Project (STRseq) 44 (Table 1). As shown in Supplementary Table S2 (descending order), a total of 353, 166 and 83 alleles were identified through sequencebased approaches, while 241, 125 and 59 alleles were identified by length-based approaches for A-STRs, Y-STRs and X-STRs, respectively. An increase in the allele number by sequencing was observed at 17 A-STRs, 11 Y-STRs, and 5 X-STRs, in which 10 STRs showed greater than 100% (from 100 to 216.67%) increases. In concordance with that of Wang  www.nature.com/scientificreports/ complex repeat STRs. We categorized the sequence variations into three groups, i.e., repeat region variants only (RRVO), flanking region variants only (FRVO) and repeat region plus flanking region variants (RRFR) (Fig. 2). We found that RRVO accounted for the largest number of variations that contributed to the increased number of alleles (Supplementary Table S2). Twenty-five SNPs and two InDels were detected in the flanking region of 21 STRs (Table 2), in which four SNPs and one deletion from four STRs (DYS437, Y-GATA-H4, DYS460, and DYS448) had not been previously reported in the 1,000 Genomes dataset (1,000 Genomes, https ://genom e.ucsc.edu/). The highest increased allele number due to flanking region variations were observed at D7S820, whose alleles with SNPs accounted for 80% of the total kinds of SB alleles. In addition, in the 214 alleles detected in D7S820, 94.39% were observed with flanking region SNPs. D13S317 presented the second highest increase with an FRV ratio of 62.50% and 55.14% of the alleles being observed with flanking SNPs.
Allele frequencies and population genetic parameters of STRs. Both LB and SB allele frequencies and other parameters for each STR locus are listed in Supplementary Tables S3-S8. The frequencies of X-STR alleles in females and males were analysed together since no significant differentiation was observed (p > 0.05). For autosomal loci and X-chromosome loci (females only), both LB and SB allele data met Hardy-Weinberg equilibrium (HWE) expectations after Bonferroni correction (A-STRs: α′ = 0.05/26, X-STRs: α′ = 0.05/7) (Supplementary Table S4). No significant linkage disequilibrium (LD) was detected in 26 A-STRs for both LB data and SB data after Bonferroni correction (α′ = 0.05/325) (Supplementary Table S5). For X-STRs and A-STRs, 20 and 23 pairs showed significant LD with SB and LB data for female samples (p < 0.05), respectively, and no LD was detected after Bonferroni correction (α′ = 0.05/528) (Supplementary Table S5).
Loci with increased number of alleles in SB data compared with LB data also showed gains in observed heterozygosities (H obs ) (Supplementary Table S6), which is a sign of an increase of genetic diversity at these loci. The top three loci that had the largest percentage of increase in H obs were D3S1358 (17.15%), D2S441 (13.90%) and D5S818 (10.39%) successively.
As shown in Supplementary Table S7, a distinct increase in total discrimination power (TDP) and a decrease in cumulative random match probability (CMP) could be observed due to the increasing diversity of SB alleles. The CMP of SB approaches for 26 A-STRs were four orders of magnitude lower than those of LB approaches. The cumulative probability of exclusion of duos (CPE duo ) and trios (CPE trio ) using SB data were higher than those using LB data (Supplementary Table S8). The high strength of evidence indicated the reliability of the 26 A-STRs in both personal identification and parentage testing of duos and trios (Table 3).
For Y-STRs loci, the value of genetic diversity (GD) ranged from 0.2046 (DYS391) to 0.8672 (DYS481) and 0.2046 (DYS391) to 0.9296 (DYF387S1) when using LB data and SB data, respectively. The increased percentage   Table S13). These loci were all positioned on different chromosomes, thus, we considered the 94 iiSNPs as independent for the following statistics.  www.nature.com/scientificreports/ The related forensic parameters for the SNPs were shown in Supplementary Table S14. The combined parameters for the data based on target SNPs and full sequences can be referred in Table 3. The strength of evidence was higher when adjacent flanking region variations of target SNPs in full sequences were taken into consideration. Observed heterozygosities showed improvements in data of full sequences compared with data of target SNPs (Supplementary Table S15). The effective number of alleles (Ae), an important and effective index for evaluation of the selection of microhaplotypes for mixture detection 46 , also showed some level of increases in data based on full sequences when compared with the Ae of corresponding data of target SNPs only (Supplementary Table S15). For data of full sequences, ten loci had an Ae value > 2.00, of which rs1109037 and rs10776839 had values > 3.00, which was a necessary criterion for microhaplotypes being applied to mixture detection 46 . For the target SNP data, in contrast, only five loci showed an Ae > 2.00.
When we combined the length-based data of 26 A-STRs with data from the target SNPs of 94 iiSNPs, eight of 7,140 pairwise comparisons still showed LD (p < 0.00001) after Bonferroni correction (α' = 0.05/7,140) (Supplementary Table S16). Regarding the combination of sequence-based data from 26 A-STRs and data from full sequences of 94 iiSNPs, the same number of pairwise comparisons showed LD (p < 0.00001) after Bonferroni correction (α' = 0.05/7,140) (Supplementary Table S16). None of these pairwise comparisons with significant LD were syntenic. Relative forensic parameters were shown in Table 3. The power for personal identity of the  www.nature.com/scientificreports/ 94 iiSNPs was three to five magnitudes higher than the 26 A-STRs, while in the case of the ability of parentage testing of duos and trios, the 26 A-STRs were higher than the 94 iiSNPs. Moreover, when combining A-STR and SNP markers, the power of the system efficiency was much higher (about 30 to 35 magnitudes lower for CMP and four to ten times higher for CPE) than detection using one category of markers only.

Discussion
The Tibetan population described in this study exhibited many sequence variations in repeat regions and flanking regions based on MPS data. A total of 33 STRs showed a higher diversity of alleles when considering sequence variations rather than considering length-based alleles only, while 25 loci showed no increase in allele number by the SB method. Thirty-nine novel alleles were detected, although only 107 samples were studied. Twenty-five SNPs and two InDels were detected in the flanking regions of 21 STRs. InDels existing in the flanking regions of sequences may influence the length call definitions of alleles. Variants with a substantially differences in frequency distributions between different populations is an indicator of the ancestry-informative value of the locus 47  Five pairs and three pairs of loci showed LD after Bonferroni correction when data of target SNPs and full sequences of iiSNPs were considered, respectively. Referring to previous similar studies 10,38,51,52 , it was not surprising to observe the LD phenomenon for iiSNPs pairs. These loci were considered as independent when calculating forensic parameters since these iiSNP pairs were located on different chromosomes 38,52 . We suppose it was the special population structure of the aimed population group that caused the disequilibrium. Meanwhile, considering the small sample size of this study, failure of LD testing may also result from random effect.
Aiming to correctly interpret the complex data produced by MPS platforms, a convenient and sophisticated software package for data analysis may promote the use of MPS platforms for this type of forensic genetic study. The two software packages we used here for concordance analysis have their own characteristics. As an offline software with customized web browser interface for forensic use, UAS v1.2.16173 can report the LB allele callings for STRs and genotypes of target SNPs, and can calculate RMP and TDP for specific populations. Although this version of UAS doesn't support flanking region analysis, it has been improved and allowed the analysis in the upgraded version (UAS v1.3 or later). STRait Razor v2s can analyse more than 300 loci (including STRs, SNPs and InDels), and focus on both repeat regions and flanking regions of the investigated loci, and can report a SB allele in concordance with the minimal nomenclature requirements recommended by ISFG. Moreover, a much more informative form of SNPs (alleles based on full sequences) can be obtained using an MPS platform rather than alleles based on target SNPs only, which may facilitate mixture deconvolution in the future.
In order to adapt to the backward compatibility of the CE typing system, the nomenclature recommended by ISFG 21 contains repeat number information based on allele length. Similar to the principles of the CE method, the 'CE callings' for SB alleles were determined by comparing the length of the fragment with the same structure length relative to a reference sequence. It is important to note that the CE callings may not represent the actual numbers of repeat units of an allele, especially in alleles with flanking region variations. The annotation of the flanking variants in the nomenclature can indicate the true status of a sequence, which is important for researchers to quickly determine the diversity of a given sequence. An InDel that exists in a flanking region may not be identified but can influence the length of a fragment. In this study, an InDel at D7S820 could explain the influence of InDels on nomenclature (Fig. 3). Through the sequence structure, a [T/TA] insertion (provisionally rs1463708262 (GRCh38 7:84,160,205-84,160,212)) was identified, which resulted in the allele length recognized by STRait Razor v2s as 1 nt longer than the length of allele 10. Thus, the CE number was termed 10.1. Meanwhile, a T-A transversion (rs7789995) was identified close to the insertion (GRCh38 7:84,160,204), which had allele frequencies of 99%, 91%, 94%, 87%, 92% in African, American, East Asian, European, South Asian, respectively   53 . Similar allele callings were observed in 11 loci (D7S820, D13S317, PentaD, DYS385, DYS460, DYS448, DXS10135, DXS10074, HPRTB and DXS7423). This type of inconsistency between alleles and sequences was reported by Novroski et al. 34 Moreover, discrepancies in SB allele nomenclature could be observed when using different coordinates of sequence guides and analysis tools. In previous studies of SB alleles, some researchers followed the nomenclature recommended by ISFG 21 , while others used custom-defined nomenclature 36,37 . Discordant nomenclatures can lead to inconsistent allele calling between laboratories, further confusing precise allele and InDel calling between populations. An optimised consolidation of allele nomenclature and reference genome coordinates for SB alleles should be addressed for a convenient method for data communication between laboratories is urgently needed.

T A T C A A T C T G T C [TATC]13 G T T A -
Lastly, a high system efficiency of the selected 58 STRs and 94 iiSNPs was demonstrated in this Tibetan population. MPS methods make the combined detection of STRs and SNPs more convenient, thereby improving the system efficiency dramatically. Cases involving personal identification and parentage testing of duos and trios can thus be solved with reliable results. The combination of detection of STRs and SNPs may help to solve problems in complex kinship analysis more efficiently. A lofty goal of this field would be reducing the number of markers (or removing loci with low diversity) when conducting duo and trio testings at an acceptable performance level. Furthermore, exploring customized marker subsets for different identification purposes is also an area of future interest.

Conclusions
This study investigated sequence polymorphisms of 58 STRs and 94 iiSNPs in a Tibetan population using massively parallel sequencing, and provided an accurate sequence-based allele frequencies dataset for these loci. Distinct sequence variations were observed in both repeat and flanking regions of these loci, which indicated that the ForenSeq DNA Signature system is highly polymorphic and informative in the studied population. Our study also demonstrated a potential capability for this system to be applied in kinship testing and personal identification cases.

Materials and methods
Sample collection. Peripheral blood samples were collected using FTA cards from 107 unrelated individuals (53 males and 54 females) from Wei Tibetan population in the Tibetan Autonomous Region of western China.
Library preparation and sequencing. DNA libraries were constructed using the ForenSeq DNA Signature Prep Kit according to the manufacturer's recommendations 54 . Briefly, 1.2 mm diameters of FTA cards were punched directly as an input template without DNA extraction. Target amplification and tagging were performed under advised thermal cycling parameters. Index 1 (i7) and index 2 (i5) adapters were added for target enrichment purposes. Then the libraries were purified using Sample Purification Beads (SPB) and normalized using Library Normalization Beads 1 (LNB1). Finally, 5 μL of the normalized library from each sample was pooled into a single microcentrifuge tube. Seven microliters of pooled libraries were added into 591 μL Hybridization Buffer (HT1) and mixed with 2 μL of diluted Human Sequencing Control (HSC) mixture. Sequencing was performed on a MiSeq FGx instrument (Illumina, San Diego, CA) using the MiSeq FGx Reagent Kit (Illumina, San Diego CA) following the manufacturer's instructions. Two runs were performed to cover all samples in this study. Data analysis, allele nomenclature, and sequence variant identification. The raw sequencing data of STRs was first analysed using ForenSeq UAS (v1.2.16173) with default analytical and interpretation thresholds (AT = 1.5%, IT = 4.5% in general, respectively) for allele calling 17 . The intralocus balance threshold was measured as the intensity (number of reads) of the minimum intensity typed allele divided by the intensity of the maximum intensity typed allele and was set as 0.60 (default setting) for all loci except for D22S1045 (intralocus balance threshold = 0.10), which was suggested to be analysed with caution by the manufacture due to the extreme heterozygote imbalance. Alleles were reported using length calling and sequence calling, which contain the repeat region of the locus.
The actual ACR was determined for heterozygote loci by dividing the lower number of reads by the higher number of reads. Then, the FASTQ files were re-analysed by STRait Razor v2s 12 software with the default analytical threshold (AT = 2 reads) and heterozygote threshold (HT = 0.40, the same meaning with intralocus balance threshold). In-house workbooks (written by VBA using Microsoft Excel) were developed to modify the format of the nomenclature produced by STRait Razor v2s so as to completely conform to the requirements recommended by ISFG 21 and the revised Forensic STR Sequence Guide_v3 53 . Manual corrections were also performed to verify the nomenclature of SB alleles identified as 'novel' by STRait Razor v2s. The 94 iiSNPs were genotyped using ForenSeq UAS with default settings (AT = 1.5%, IT = 4.5%), from which we obtained the alleles and genotypes according to the target SNPs. Comprehensive nomenclature following Parson et al. 21 were obtained using STRait Razor v2s 12 , from which we obtained the alleles and genotypes considering the full sequences of SNPs.
The allele sequence variants of STRs were classified into three categories: repeat region variants only (RRVO), flanking region variants only (FRVO), and repeat region plus flanking region variants (RRFR) (Fig. 4) Capillary electrophoresis and concordance analysis. All samples were genotyped using the Goldeneye DNA ID System 25A amplification system. The system contained the 20 expanded Combined DNA Index System (CODIS) core loci plus Penta E, Penta D, D6S1043, a Y indel and Amelogenin for sex identification. DNA amplification was performed according to the manufacturer's instructions. The PCR products were detected using CE on an ABI 3500xL Genetic Analyzer (Applied Biosystems, USA). The results were analysed with Gen-eMapper ID-X Analysis Software (Applied Biosystems, USA). Concordance analysis was performed between the LB alleles produced by Miseq and the corresponding CE results. The comparison of genotyping results from UAS and STRait Razor v2s was also performed.  www.nature.com/scientificreports/ Allele frequencies and forensic parameters. A counting method was utilized to obtain the LB and SB allele frequencies. For A-STRs, a HWE test and pairwise LD analysis was performed using Arlequin 3.5.2.2 55 . The expected and observed heterozygosity (H exp , H obs ), polymorphism information content (PIC), discrimination power (DP), random match probability (RMP), and power of exclusion (PE) for both duos and trios of the 27 A-STRs were calculated with Cervus 3.0.7 56 . For X-STR loci, the differentiation of the X-STR allele frequency distribution of females and males was performed using Arlequin 3.5.2.2 55 . HWE test and LD analysis were performed for X-STRs of females. The mean exclusion chance for father/daughter duos (MEC duo ) and father/ mother/daughter trios (MEC trio ) were calculated on the ChrX-STR.org.2.0 website (https ://chrx-str.org/) 57 .
Finally, an LD test for the 27 A-STRs combined the 7 X-STRs of females was performed using Arlequin 3.5.2.2 55 . Relevant Y-STR parameters, which included genetic diversity (GD), haplotype diversity (HD), allele frequencies and haplotype frequencies, were calculated using an in-house workbook (written by VBA using Microsoft Excel). The formulas were as follows: where n represents the number of alleles, p i represents the allele frequency, N represents the number of haplotypes, and p j represents the haplotype frequency. HWE and LD for both target SNPs and full sequences of iiSNPs were performed using Arlequin 3.5.2.2 55 . Cervus 3.0.7 56 was used to calculate allele/full sequence frequencies, H exp , H obs , PIC, DP, RMP, PE duo and PE trio . The effective number of alleles (A e ) was defined as the reciprocal of the homozygosity: where p i represents the frequency of the ith allele according to Kidd 46 . We also evaluated the performance of combination of LB A-STRs and target iiSNPs, SB A-STRs and full sequence of iiSNPs with the same methods.
Ethics statement. All of the experimental process in this study were strictly followed the ethical research principles, and all methods following were performed in accordance with the relevant guidelines and regulations. All samples were anonymously collected after informed consent was obtained. This study was approved by the Ethics Committee of Sun Yat-sen University with the approval number of No. 11 [2012].

Data availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.