Sequence variations, flanking region mutations, and allele frequency at 31 autosomal STRs in the central Indian population by next generation sequencing (NGS)

Capillary electrophoresis (CE)-based analysis does not capture the full allelic variation at STR loci because it provides no data on sequence variation in the repeat region or on SNPs in the flanking regions. This study reports the length-based and sequence-based allelic data of 138 central Indian individuals at 31 autosomal STR loci obtained by NGS. The sequence of each allele was compared to the hg19 reference sequence. The length-based allelic results were concordant with the CE-based results. Twenty of the 31 autosomal STR loci showed an increase in the number of alleles owing to sequence variation in the repeat region and/or SNPs in the flanking regions. The highest gain in heterozygosity and allele numbers was observed at D5S2800, D1S1656, D16S539, D5S818, and vWA. rs25768 (A/G) at D5S818 was the most frequent SNP in the studied population. Allele 15 of D3S1358, allele 19 of D2S1338, and allele 22 of D12S391 each showed 5 isoalleles of the same size but with different intervening sequences. Length-based analysis showed Penta E to be the most useful of the 31 STRs studied in the central Indian population; however, sequence-based analysis indicated D2S1338 to be the most useful marker in terms of various forensic parameters. Population genetics analysis showed a shared genetic ancestry of the studied population with other Indian populations. To the best of our knowledge, this is the first study on sequence-based STR analysis in the central Indian population, and it is expected to support the use of NGS in forensic casework and in forensic DNA laboratories.

of populations to represent mini-India. Understanding the genetic diversity of the central Indian population gives a representation of the pan-India genetic print. The study aimed to generate sequence-based allele frequency data, population-specific characteristics, sequence variations, and SNPs in the flanking regions for forensic casework applications in the studied population.

Results and discussion
Sequencing performance of Precision ID NGS STR Panel v2. Quality control parameters, namely locus balance (LB), heterozygous balance (HB), and stutter ratio, of the 31 autosomal STR markers are shown in Fig. 2. Among all the STR markers, D4S2408 showed the average LB value closest to ideal (0.992), whereas D16S539 showed the greatest deviation from the ideal LB value ( 722). None of the markers exceeded the threshold set for the stutter ratio, i.e., 0.14. Stutter products were most numerous for D1S1656, and no stutter product was observed for D3S4529. The average stutter ratio ranged from 0.104 (D16S539) to 0.127 (D6S474). As NGS technology is still at a nascent stage in forensic DNA applications, quality issues of some STR markers need to be addressed by kit manufacturers before the technology can be used efficiently in routine forensic casework.
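The three quality-control metrics can be illustrated with a short sketch. The definitions used here are common conventions and are assumptions rather than the exact procedure of this study, which follows Avila et al. and Brookes et al.: heterozygous balance is taken as the read count of the lower-coverage allele divided by that of the higher-coverage allele, stutter ratio as stutter reads divided by parent-allele reads, and locus balance as the read count of a locus divided by the mean read count across loci. All read counts and function names below are hypothetical; only the stutter threshold of 0.14 comes from the methods.

```python
from statistics import mean

STUTTER_THRESHOLD = 0.14  # stutter-ratio threshold stated in the methods


def heterozygous_balance(reads_a, reads_b):
    """Lower-coverage allele reads divided by higher-coverage allele reads (1.0 is ideal)."""
    low, high = sorted((reads_a, reads_b))
    return low / high


def stutter_ratio(stutter_reads, parent_reads):
    """Stutter reads divided by the reads of the parent allele."""
    return stutter_reads / parent_reads


def locus_balance(locus_reads):
    """Map each locus to its read count divided by the mean read count across loci."""
    avg = mean(locus_reads.values())
    return {locus: reads / avg for locus, reads in locus_reads.items()}


# Hypothetical read counts for one sample (not data from the study).
coverage = {"D4S2408": 1000, "D16S539": 700, "D1S1656": 1300}
lb = locus_balance(coverage)
hb = heterozygous_balance(480, 520)
sr = stutter_ratio(60, 500)

print(lb)
print(round(hb, 3), sr, sr <= STUTTER_THRESHOLD)
```

Under these assumed definitions, a value of 1.0 is ideal for both LB and HB, and a stutter product would be filtered only if its ratio exceeded the threshold.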
Concordance study, allele frequency, forensic and paternity parameters. To the best of our knowledge, this is the first report of sequence-based analysis of these 31 STR markers in any Indian population. It is also the first allelic report on nine STR markers, i.e., D12ATA63, D14S1434, D1S1677, D2S1776, D3S4529, D4S2408, D5S2800, D6S1043, and D6S474, in the Indian population. The calculated length-based allele frequencies are given in Supplementary Table S1. Forensic and paternity parameters of the length-based and sequence-based alleles are provided in Table 1. The average number of alleles per marker was 9.26; the highest number of size-based alleles (18) was observed at Penta E, whereas D1S1677, D4S2408, and D6S474 showed the lowest number of alleles, i.e., 6 (Fig. 3). The newly analyzed markers, i.e., D12ATA63, D14S1434, D1S1677, D2S1776, D3S4529, D4S2408, D5S2800, D6S1043, and D6S474, generated 8, 7, 6, 8, 7, 6, 8, 11, and 6 alleles, respectively. Penta E showed the highest power of discrimination (0.978), polymorphic information content (0.90), and expected heterozygosity (0.905), and the lowest matching probability (0.022), whereas FGA showed the highest power of exclusion (0.778), typical paternity index (4.60), and observed heterozygosity (0.891). These findings suggest the usefulness of the Penta E and FGA markers in the central Indian population based on the length-based analysis of alleles. D2S441 was the least useful marker in terms of polymorphic information content (0.64), power of exclusion (0.329), typical paternity index (1.35), and observed and expected heterozygosity (0.630 and 0.690). Similarly, D5S818 showed low usefulness in the studied population in terms of power of discrimination (0.855) and matching probability (0.145).
In contrast, when sequence-based forensic and paternity parameters were calculated for the 31 autosomal STR markers, D2S1338 emerged as the most useful marker in the studied population, with the highest power of discrimination (0.984), polymorphic information content (0.920), power of exclusion (0.822), and typical paternity index (5.75), and the lowest matching probability (0.016). This suggests that individual markers should be assessed on the basis of sequence-based alleles to get a clear idea of their usefulness in a specific population. Previous studies also suggested the utility of the Penta E marker, with higher forensic and paternity parameters, in the Indian population 16-18. This marker has already been established as having high forensic efficiency for personal identification in the Portuguese population 19, the Austrian Caucasian population 20, the Northern Italian population 21, and the Mexican population 22. The newly inducted STR markers, i.e., D12ATA63, D14S1434, D1S1677, D2S1776, D3S4529, D4S2408, D5S2800, D6S1043, and D6S474, showed an allelic range and other statistical parameters similar to those in the limited published literature from Inner Mongolia, China 23 and the Tujia population 24.
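The relationship between observed heterozygosity and the paternity parameters reported above can be checked with a short sketch. The formulas below are the conventional ones and are an assumption, since this section does not state which formulas or software were used: matching probability as the sum of squared genotype frequencies, power of discrimination as 1 - PM, power of exclusion as H²(1 - 2Hh²), and typical paternity index as 1/(2h), where H and h are the heterozygote and homozygote proportions. Function names are illustrative.

```python
def power_of_exclusion(obs_het):
    """PE = H^2 * (1 - 2 * H * h^2), with h = 1 - H the homozygote proportion."""
    h = 1.0 - obs_het
    return obs_het ** 2 * (1.0 - 2.0 * obs_het * h ** 2)


def typical_paternity_index(obs_het):
    """TPI = 1 / (2 * h), with h the homozygote proportion."""
    return 1.0 / (2.0 * (1.0 - obs_het))


def matching_probability(genotype_counts):
    """PM = sum of squared genotype frequencies; input maps genotype -> count."""
    n = sum(genotype_counts.values())
    return sum((count / n) ** 2 for count in genotype_counts.values())


def power_of_discrimination(genotype_counts):
    """PD = 1 - PM."""
    return 1.0 - matching_probability(genotype_counts)


# Observed heterozygosities reported in the text for length-based alleles.
reported_het = {"FGA": 0.891, "D2S441": 0.630}
for locus, h_obs in reported_het.items():
    print(locus,
          round(power_of_exclusion(h_obs), 3),
          round(typical_paternity_index(h_obs), 2))
```

With the rounded heterozygosities as input, the results come out close to the reported values (0.778 and 4.60 for FGA; 0.329 and 1.35 for D2S441); the small differences are consistent with rounding of the heterozygosity to three decimals.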
Of the 81 male samples, four were found to be AMELY deletion cases, in which AMELY could not be amplified but positive amplification was present at three alternative sex-determining markers, i.e., DYS391, SRY, and Y InDel. This result was consistent with the corresponding CE data. At DYS391, allele 10 was predominant (63 samples), followed by allele 11 (16 samples) and allele 9 (2 samples). Similarly, Y InDel showed allele 2 in 74 samples and allele 1 in only 7 male samples. AMELY deletion is a global problem 25, and simultaneous amplification of the alternative sex-determining markers 26,27 is highly useful in assigning the sex of a sample correctly, as evidenced by four samples of the current study.
length, majorly contribute to such an increase in the allele numbers 28. A substantial gain in allele numbers was detected at D13S317, D16S539, D1S1656, D5S2800, D5S818, D7S820, and vWA; D5S2800 showed a significant increase in allele numbers due to variation in the flanking region, and D3S1358 showed the highest allele gain due to differences in the repeat sequence. In contrast, the markers that showed no gain in allele numbers, either from SNPs in flanking regions or from sequence variation, were CSF1PO, D18S51, D19S433, D1S1677, D22S1045, D3S4529, D6S1043, FGA, Penta D, Penta E, and TPOX. The markers that showed an increase in allele number only due to SNPs in flanking regions were D10S1248, D13S317, D14S1434, and D7S820. The increased allele number at D12ATA63, D12S391, D21S11, D2S1338, D3S1358, D4S2408, D8S1179, and TH01 was due to variation in the repeat sequences only. Single nucleotide polymorphisms (SNPs) in the flanking regions of STRs have been widely reported across the globe 13,29,30.
An SNP-STR marker links SNPs with the STR polymorphism, which allows an STR allele subtype to be defined on the basis of the SNP allele observed in the flanking region. Although other marker combinations, such as deletion-insertion polymorphisms amplified with STRs (DIP-STR), are widely used, a recent study advocated the use of SNP-STRs for forensic applications in which an imbalanced DNA mixture is expected 31. In this regard, the current study revealed many SNPs in the flanking regions of STRs in the studied population (Table 2). rs25768, located upstream of the D5S818 marker, showed the highest occurrence in the central Indian population, whereas rs73250432, rs369257353, and rs561924992, located upstream of D13S317, downstream of D5S818, and downstream of D16S539, respectively, showed the lowest occurrence.
Detection of alleles with identical size but different internal sequence has been acknowledged as one of the advantages of using NGS for studying STRs 32,33. The marker-wise isoalleles observed in the central Indian population are reported in Table S2. Of the 31 autosomal STR markers analyzed in this study, the isometric heterozygous pattern was observed at only 16 loci, i.e., D3S1358, D21S11, vWA, D5S2800, D6S474, D2S441, D12ATA63, D2S1338, D1S1656, D16S539, D8S1179, D12S391, D2S1776, TH01, D5S818, and D4S2408. Allele 15 of D3S1358, allele 19 of D2S1338, and allele 22 of D12S391 showed the maximum number of isoalleles (five each) with the same size and different intervening sequences (Fig. 4).
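The notion of isoalleles can be sketched as follows: sequences are grouped by their length-based allele designation, and any designation backed by more than one distinct sequence yields isoalleles. The repeat structures below are hypothetical toy strings in bracketed-motif shorthand, not sequences observed in this study; the helper names are illustrative, and the sketch assumes that every motif has the same length so that the allele number equals the total repeat count.

```python
import re
from collections import defaultdict


def repeat_count(bracketed):
    """Total number of repeat units in a string such as '[TCTA]1 [TCTG]2 [TCTA]12'."""
    return sum(int(n) for n in re.findall(r"\](\d+)", bracketed))


def isoalleles(sequences):
    """Return {allele number: sorted distinct sequences} for alleles with >1 sequence."""
    by_length = defaultdict(set)
    for seq in sequences:
        by_length[repeat_count(seq)].add(seq)
    return {allele: sorted(seqs) for allele, seqs in by_length.items() if len(seqs) > 1}


# Hypothetical sequences for a single locus (illustration only).
observed = [
    "[TCTA]1 [TCTG]2 [TCTA]12",
    "[TCTA]1 [TCTG]3 [TCTA]11",
    "[TCTA]1 [TCTG]1 [TCTA]13",
    "[TCTA]1 [TCTG]2 [TCTA]13",
]

result = isoalleles(observed)
length_based = len({repeat_count(s) for s in observed})   # alleles seen by size
sequence_based = len(set(observed))                       # alleles seen by sequence

print(result)
print(length_based, sequence_based)
```

In this toy example, size-based typing distinguishes two alleles while sequence-based typing distinguishes four, which is the kind of gain in allele number described above.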
A previous report suggested a correlation between the allele number of an STR marker and its paternity and forensic parameters, such as total possible genotypes, power of discrimination, matching probability, polymorphic information content, power of exclusion, typical paternity index, and gene diversity 18. In view of this, the substantial increase in sequence-based allele numbers observed in the present study increases the evidentiary value of these STRs and, with it, their potential forensic and paternity applications. An increase in allele number has also been correlated with an increase in the heterozygosity of an STR marker, which in turn increases its informativeness 9.
Population genetics. When the observed size-based allelic data were compared at 15 STR markers common to the different populations and a neighbor-joining (NJ) tree was constructed (Fig. 5a), the dendrogram showed two distinct branches of population clusters. One cluster included the populations of Tibet, Nepal, the Chinese Han population from Yunnan Province, Southwest China, the northeastern Thai people of Thailand, Hainan Li popula-  (Fig. 5b), where clustering of populations from Madhya Pradesh (Gond), Jharkhand, Uttar Pradesh, Tamil Nadu, Rajasthan, Himachal Pradesh, and Odisha states was observed. Therefore, the genetic sharing largely mirrored the geographical clustering. The heat map drawn using Nei's DA distance matrix is shown in Fig. 6. The overall result of the heat map was concordant with the outcomes of the NJ tree and the PCA plot for the interpopulation comparison.
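For readers unfamiliar with the distance underlying the heat map, Nei's DA distance between two populations is conventionally defined as DA = 1 - (1/r) Σ_j Σ_i sqrt(x_ij · y_ij), where x_ij and y_ij are the frequencies of allele i at locus j in the two populations and r is the number of loci. The sketch below implements that definition; the function name, locus labels, and frequencies are hypothetical, and the study's own software may differ in detail.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's DA distance; each argument maps locus -> {allele: frequency}."""
    loci = pop_x.keys() & pop_y.keys()
    if not loci:
        raise ValueError("the two populations share no loci")
    total = 0.0
    for locus in loci:
        fx, fy = pop_x[locus], pop_y[locus]
        # Only alleles present in both populations contribute a non-zero term.
        total += sum(sqrt(fx[a] * fy[a]) for a in fx.keys() & fy.keys())
    return 1.0 - total / len(loci)


# Hypothetical allele frequencies at one locus (illustration only).
pop_a = {"L1": {"10": 0.5, "11": 0.5}}
pop_b = {"L1": {"10": 0.5, "11": 0.5}}
pop_c = {"L1": {"12": 1.0}}

print(nei_da(pop_a, pop_b))  # identical frequencies
print(nei_da(pop_a, pop_c))  # no shared alleles
```

The distance is 0 for populations with identical allele frequencies and 1 for populations that share no alleles, so a pairwise matrix of such values can be rendered directly as a heat map.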

Conclusions
To the best of our knowledge, this is the first report of sequence-based allelic data on the central Indian population, and it is of considerable use in forensic casework. The data obtained in this study further support the implementation of NGS-based studies of STRs for forensic applications. The size-based alleles were concordant between the CE analysis and the NGS data. Some STR markers demonstrated substantial variation in the repeat motifs as well as SNPs in the STR flanking regions. A significant increase in allele number further increased the values of the studied forensic and paternity parameters of the STRs, thus increasing their usefulness in forensic applications. As per the recommendations of the ISFG, it is of utmost importance to enrich the allelic data of sequence-based STR genotypes. The increase in allele number evidenced in the present study also supports the need for population-specific, sequence-based studies of STR markers. In this context, the present study provides pioneering sequence-based data on the central Indian population.

Precision ID Chef Reagents along with other recommended plasticware and reagents at the designated places onto the Ion Chef™ system. The Ion Chef System automated all template preparation steps, including creating the emulsion mixture, performing the PCR, carrying out the post-PCR purifications, and finally loading the purified templated beads onto the two Ion 530 chips, following the manufacturer's guidelines.
Sequencing. A sequencing run on the Ion S5 system was initiated by loading a reagent cartridge, buffer, cleaning solution, and waste container as per the manufacturer's Ion S5™ Precision ID Sequencing Kit protocol. The Ion S5 chip was then loaded and the run started using 200 bp chemistry with 650 flows according to the human identification GlobalFiler™ NGS STR sequencing format.
The raw data were extracted from the S5 Torrent Server v5.10.0 (Thermo Fisher Scientific) and imported into the Converge™ software v2.1 (Thermo Fisher Scientific) for sequence analysis against the Homo sapiens hg19 genome. The HID Genotyper plugin v2.1 (Thermo Fisher Scientific) was applied at the default thresholds, in which the relative analytical and stochastic thresholds were both 0.05 and the stutter ratio was set as 0.14. Sequencing performance of the Precision ID NGS STR Panel v2 was further assessed by analyzing locus balance (LB), heterozygous balance (HB), and stutter ratio of the obtained sequences following Avila et al. 34 and Brookes et al. 35.
Concordance analysis with capillary electrophoresis (CE). All 138 samples were studied to assess the concordance between CE-STR data and NGS-STR data. The samples were analyzed using the PowerPlex Fusion 6C System (Promega, USA) following the manufacturer's guidelines. Genomic DNA (0.5-1.0 ng) was amplified on a Veriti 96-well Thermal Cycler (Thermo Scientific, USA). Capillary electrophoresis of the amplified DNA fragments was performed using a 3500xL Genetic Analyzer (Thermo Scientific, USA). The generated STR fragments were analyzed using GeneMapper ID-X v.