Introduction

Human population studies rely heavily on mitochondrial DNA (for example, Smith1). Complete mitogenome sequences have recently accumulated for many contemporary human populations worldwide, and differences can be detected even among closely related populations. Because the mitogenome is haploid, recombination essentially does not occur. Population histories have therefore been inferred through analyses such as the reconstruction of demographic history using Bayesian Skyline Plots and the estimation of the time to the most recent common ancestor using the Markov chain Monte Carlo method (for example, Mizuno et al.2; Gojobori et al.3). In addition, ancient mitogenome sequences are expected to provide direct evidence of past events such as migration, demographic change and the relationships among populations. However, most human remains have been excavated from temperate and subtropical regions, where buried remains have often undergone microbial attack and the conditions for DNA preservation are generally poor.4, 5 To overcome this problem, we previously proposed a unified method in which emulsion PCR is coupled with target enrichment, followed by next-generation sequencing (NGS).6 This unified method determines non-duplicated target sequences more efficiently than shotgun NGS, and with it we achieved deep and reliable sequencing of the ancient mitogenome, even from poorly preserved archeological samples. Nevertheless, poorly preserved samples remain challenging.
Environmental conditions such as humidity, temperature, salinity, pH and microbial attack strongly influence the degree of DNA preservation.7 NGS data with a depth of coverage of more than 15-fold are considered necessary to obtain reliable complete nucleotide sequences of haploid-like genomes such as the mitogenome and Y chromosome.8 However, it is difficult to satisfy this requirement at every nucleotide position because a uniform depth of coverage is hard to obtain, particularly for poorly preserved materials such as human remains and archeological samples. The mitogenome sequences obtained are therefore often fragmentary. To fill in the missing nucleotides in the mitogenome, we propose an imputation approach. Imputation has been used widely in genome-wide association studies to predict the genotypes at single-nucleotide polymorphism (SNP) sites.9, 10, 11 Its advantage is that population genetic analyses can use the maximum number of samples with the longest possible sequences. We applied the imputation approach to mitogenome data from 1500-year-old human remains excavated from the Moon Pyramid at Teotihuacan, Mexico, for which the depth of coverage was very low owing to poor preservation of the DNA.

Materials and methods

Imputation

Missing nucleotides were inferred using a k-nearest neighbor (KNN)-based algorithm.12 On the basis of 22,638 worldwide mitogenome sequences, we prepared four types of mitogenome panel: Panel 1, a worldwide panel comprising 292 mitogenome sequences of 29 haplogroups (L, L3, M, C, E, G, Q, Z, D, N, A, I, O, S, W, X, Y, R, B, F, J, P, T, R0, HV, H, V, U and K); Panel 2, a worldwide haplogroup A panel comprising 731 mitogenome sequences; Panel 3, an indigenous American panel comprising 390 mitogenome sequences; and Panel 4, an indigenous American haplogroup A panel comprising 175 mitogenome sequences. The haplogroup nomenclature is based on PhyloTree Build 17 (http://www.phylotree.org/).13 In addition to complete mitogenome sequences from PhyloTree and MitoTool (http://www.mitotool.org/),14 sequences of indigenous Americans were obtained from Mizuno et al.,2 Tamm et al.,15 Achilli et al.,16 Perego et al.17 and Kumar et al.18 We used the sequences that belong to haplogroups A, B, C and D. Sequence alignment was performed using MAFFT (http://mafft.cbrc.jp/alignment/software/).19 Imputation was carried out using the software of Huang et al.12 (http://www.ncgr.ac.cn/RiceHapMap). As there is no need to consider recombination in the mitochondrial genome, we set a large window size (w=500). For the other parameters (p, the penalty for a different genotype in the pairwise sequence similarity calculation; k, the number of most similar sequences used for imputation; and f, the allele frequency threshold used to determine a genotype), we used the best set of parameters reported by Huang et al.12 (p=−7, k=5, f=0.7). As a check, we evaluated imputation accuracy using different k values (k=4, 5 and 7); k=4 and k=5 both showed the highest accuracy. Furthermore, we examined the reliability of the imputation algorithm using simulated data. We produced data sets with missing data by randomly removing 10, 20, 30, 40 and 50% of the nucleotides from each of the randomly selected mitogenome sequences. We then conducted imputation and evaluated the sensitivity (filling rate) and specificity (accuracy). The filling rate was calculated as the percentage of nucleotides inferred, and accuracy was defined as the percentage of nucleotides correctly inferred. We repeated this procedure 100 times.
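
The roles of the parameters w, p, k and f can be illustrated with a minimal sketch of KNN-based imputation. This is not the software of Huang et al.; the function names, the scoring rule (+1 for a match, p for a mismatch, 0 when either nucleotide is missing) and the use of "N" for a missing nucleotide are assumptions made here for illustration only.

```python
from collections import Counter

MISSING = "N"  # assumed symbol for a missing nucleotide

def similarity(target, ref, lo, hi, p=-7):
    """Score `ref` against `target` over sites lo..hi-1.

    Assumed scoring: +1 per matching site, p per mismatching site,
    and 0 when either sequence is missing at that site.
    """
    score = 0
    for t, r in zip(target[lo:hi], ref[lo:hi]):
        if t == MISSING or r == MISSING:
            continue
        score += 1 if t == r else p
    return score

def knn_impute(target, panel, w=500, p=-7, k=5, f=0.7):
    """Fill missing sites of `target` from the k most similar panel sequences."""
    result = list(target)
    n = len(target)
    for i, base in enumerate(target):
        if base != MISSING:
            continue
        # Window of up to w sites on each side of the missing site.
        lo, hi = max(0, i - w), min(n, i + w + 1)
        ranked = sorted(
            panel,
            key=lambda ref: similarity(target, ref, lo, hi, p),
            reverse=True,
        )
        # Alleles carried at site i by the k nearest neighbours.
        votes = Counter(ref[i] for ref in ranked[:k] if ref[i] != MISSING)
        if not votes:
            continue
        allele, count = votes.most_common(1)[0]
        # Call the allele only if its frequency among the voting
        # neighbours reaches the threshold f; otherwise leave it missing.
        if count / sum(votes.values()) >= f:
            result[i] = allele
    return "".join(result)
```

With a window as large as w=500 applied to variant sites only, the window typically spans most or all of the sequence, which is consistent with treating the mitogenome as a single non-recombining block. A site stays missing when the nearest neighbours disagree too much, which is why the filling rate can fall below 100%.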

Mitogenome sequence from 1500-year-old human remains

DNA was extracted from a human sacrifice (PPL99_3A) excavated at the Moon Pyramid in the Teotihuacan archeological site, Mexico.20 During all of the steps of DNA extraction, purification and NGS library construction, we took all possible precautions to prevent contamination. The experiments were performed in a laboratory that is dedicated exclusively to ancient DNA work and is physically isolated from other molecular research laboratories. All manipulations were performed in a laminar flow cabinet that is routinely irradiated with ultraviolet light. Surfaces were cleaned routinely before and after work. A facemask, head cap and clean laboratory coat were always worn, and gloves were replaced frequently. All of the procedures were performed using gamma ray-irradiated disposable tubes and filter pipette tips. All non-disposable glassware and metallic materials were dry heat-sterilized at 160 °C for 2–6 h. All the reagents were molecular biology grade, and ultrapure water was used. A series of NGS libraries was constructed by emulsion PCR and target enrichment according to our previously described protocol.6 The library was sequenced in 2 × 101 cycle runs on an Illumina HiSeq1500. Adapter sequences were trimmed using cutadapt,21 and reads measuring <28 nt in length were removed. Because the Burrows–Wheeler Aligner (BWA) does not consider the circularity of the mitogenome, we joined the chrM sequence from the UCSC hg19 assembly to a haplogroup L sequence with a 9-bp deletion at nucleotide positions (np) 8281–8289 (accession number: EU092665) and used this as the reference sequence for mapping. The sequence reads obtained were mapped to this reference sequence using BWA with the default parameters.22 The mapped sequence reads were then aligned with the hg19 sequence using Novoalign (Novocraft, http://novocraft.com/) with the default parameters. After alignment, duplicated reads were removed using Picard (http://picard.sourceforge.net). SNPs were identified using SAMtools.23

To examine the authenticity of the ancient DNA sequences obtained using NGS techniques, we considered two types of DNA degradation pattern, which differ between authentic ancient DNA and contamination with modern DNA. The first was the increased misincorporation frequency of thymine residues at positions where a cytosine is found in the reference sequence at the 5′-end of the mapped sequence reads, as well as the increased misincorporation frequency of adenine residues at positions where a guanine is found in the reference sequence at the 3′-end of the mapped sequence reads. The second was the increased frequency of purine residues at one base upstream of the 5′-end of the mapped sequence reads, as well as the increased frequency of pyrimidine residues at one base downstream of the 3′-end of the mapped sequence reads. The former is due to nucleotide misincorporation caused by the deamination of cytosine residues into uracil, a chemical analog of thymine, whereas the latter is due to depurination as a driving force of postmortem DNA fragmentation. The sequence reads obtained exhibited typical patterns of postmortem DNA degradation, which were consistent with previous studies.6, 24, 25, 26, 27, 28
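
The first degradation pattern can be quantified as the frequency of C-to-T mismatches at each position counted from the 5′-end of the mapped reads. The following is a minimal sketch of that calculation, not the procedure actually used in this study; the function name and the input format (pairs of ungapped reference segment and read, both written 5′ to 3′ on the read's strand) are assumptions for illustration.

```python
def c_to_t_by_position(alignments, max_pos=10):
    """Return the C->T mismatch frequency at each of the first
    `max_pos` read positions, counted from the 5' end.

    `alignments` is an iterable of (reference_segment, read) pairs of
    equal length (hypothetical, ungapped input format).
    A position with no reference cytosine yields None.
    """
    ref_c = [0] * max_pos    # reads whose reference base is C
    c_to_t = [0] * max_pos   # of those, reads showing T
    for ref, read in alignments:
        for i in range(min(max_pos, len(read))):
            if ref[i] == "C":
                ref_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [t / c if c else None for t, c in zip(c_to_t, ref_c)]
```

In authentic ancient DNA the returned frequencies are expected to be highest at the first positions and to decline toward the read interior. The G-to-A pattern at the 3′-end could be tallied in the same way by counting positions from the 3′-end instead.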

Results

The use of a larger and more diverse reference panel is considered to improve the accuracy of imputation.29 However, the computational burden is a major concern, especially that of the alignment step of imputation. To reduce this burden, we used the nucleotide sequences at variant sites alone rather than the entire mitogenome sequences. According to genome-wide association studies, the supplementation of missing data by imputation depends greatly on the panels used.30 We therefore compared the results of imputation among different types of panel (mitogenome data file): a panel comprising all of the 29 major haplogroups from around the world (Panel 1), a panel comprising worldwide mitogenome sequences belonging to the estimated haplogroup alone (Panel 2), a panel comprising mitogenome sequences from the population most closely related to the individual under investigation (Panel 3), and a panel comprising mitogenome sequences belonging to the estimated haplogroup from the population most closely related to the individual under investigation (Panel 4).
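
Restricting a panel to its variant sites can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and treating both "N" and the gap character "-" as missing is an assumption that would need revisiting if deletions are to be kept as informative variants.

```python
def variant_site_indices(aligned_seqs, missing="N-"):
    """Return the indices of alignment columns in which more than one
    nucleotide is observed (missing symbols are ignored)."""
    indices = []
    for i, column in enumerate(zip(*aligned_seqs)):
        alleles = {base for base in column if base not in missing}
        if len(alleles) > 1:
            indices.append(i)
    return indices

def reduce_to_variant_sites(aligned_seqs, indices):
    """Keep only the columns listed in `indices` for every sequence."""
    return ["".join(seq[i] for i in indices) for seq in aligned_seqs]
```

Because invariant columns carry no information for distinguishing panel sequences, dropping them shortens every sequence without changing the similarity ranking of the panel members, provided the sample under study carries no nucleotide that is absent from the panel at those columns.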

For the 1500-year-old human remains (PPL99_3A) excavated at the Moon Pyramid in the Teotihuacan archeological site, Mexico, 89.3% of the mitogenome sequence was covered by 786 non-duplicated, uniquely mapped reads with quality scores >20. On the basis of the available nucleotides, the haplogroup was assumed to be A2, although both the diagnostic site of haplogroup A (np 663 of the revised Cambridge Reference Sequence) and three positions (np 16 290, 16 319 and 16 362) in the hypervariable segment 1 (HVS1) were missing. The average depth of coverage was 4.0-fold, and 76.3% (12 640 sites) of the mitogenome sequence was covered by at least two non-duplicated reads with no discrepancies. The remaining 3931 sites (23.7% of the whole mitogenome sequence) could not be determined for one of the following reasons: the depth of coverage was <2, the quality score was <20, or the reads obtained were discordant. We designated these sites as missing nucleotides. The haplogroup of the PPL99_3A mitogenome sequence was estimated after checking the validity of all sites against PhyloTree Build 17.13
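
The three criteria for designating a site as missing can be expressed as a simple per-site rule. The sketch below is an assumption-laden illustration rather than the pipeline used here: the function names and input format are hypothetical, and it assumes that observations with a quality score <20 are discarded before the depth and concordance checks are applied.

```python
def call_site(observations, min_depth=2, min_quality=20):
    """Return the consensus base for one site, or "N" if the site
    must be treated as missing.

    `observations` is a list of (base, quality) pairs taken from
    non-duplicated reads (hypothetical input format).
    """
    bases = [base for base, quality in observations if quality >= min_quality]
    if len(bases) < min_depth:
        return "N"  # fewer than two reliable reads cover the site
    if len(set(bases)) > 1:
        return "N"  # the reads are discordant
    return bases[0]

def call_sequence(pileup):
    """Apply call_site to every site of a per-site list of observations."""
    return "".join(call_site(obs) for obs in pileup)
```

Applied to every site, a rule of this kind yields a consensus sequence in which each "N" marks a position to be passed to imputation.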

When we used Panel 1 (worldwide), 3927 of the 3931 sites were filled in, leaving four sites unfilled: np 152, 248, 10 873 and 15 301. Imputation using the worldwide haplogroup A panel (Panel 2) filled in 3930 sites, leaving one site not imputed. The panel of indigenous American mitogenome sequences (Panel 3) likewise filled in 3930 sites, but the remaining site differed: np 153 for Panel 2 and np 152 for Panel 3. The panel comprising indigenous American haplogroup A mitogenome sequences (Panel 4) filled in all 3931 missing sites. The numbers of sites still missing were thus four, one, one and zero for Panels 1, 2, 3 and 4, respectively (Table 1), showing that imputation can efficiently reduce the number of missing sites. The missing haplogroup A diagnostic site (np 663) and the three missing positions in HVS1 (np 16 290, 16 319 and 16 362) were filled in with the expected nucleotides. We confirmed the imputation result by PCR and direct sequencing, which showed that np 663 was G (Supplementary Information).

Table 1 Results of imputation for 1500-year-old human remains using different types of panel

To validate the reliability of our imputation approach, we produced simulated data sets by randomly removing 10, 20, 30, 40 and 50% of the nucleotides from each of the randomly selected mitogenome sequences for Panels 2 (worldwide haplogroup A mitogenome sequences), 3 (indigenous American mitogenome sequences) and 4 (indigenous American haplogroup A mitogenome sequences). As shown in Figure 1, accuracy was highest at every level of missing data when using Panel 4, although the filling rate with Panel 4 was lower than those with Panels 2 and 3. The filling rate with Panel 3 was relatively high, but its accuracy was the lowest at every level of missing data. In particular, its accuracy deteriorated rapidly as the percentage of missing nucleotides increased. This might be due to the higher sequence diversity of Panel 3 (which consists of sequences of haplogroups A, B, C and D) compared with Panels 2 and 4 (which consist of sequences of haplogroup A alone).
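
The simulation and its two measures can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes that "N" marks a missing nucleotide and that accuracy is computed over the nucleotides that were actually inferred, which is one reading of the definition given in Materials and methods.

```python
import random

def mask_random_sites(seq, fraction, rng):
    """Replace a random `fraction` of the sites in `seq` with "N".

    Returns the masked sequence and the set of masked indices.
    """
    n_mask = round(len(seq) * fraction)
    masked_idx = set(rng.sample(range(len(seq)), n_mask))
    masked = "".join("N" if i in masked_idx else b for i, b in enumerate(seq))
    return masked, masked_idx

def filling_rate_and_accuracy(truth, imputed, masked_idx):
    """Filling rate: percentage of masked sites that received a call.
    Accuracy: percentage of those calls that match the true nucleotide
    (assumed denominator: inferred sites only)."""
    filled = [i for i in masked_idx if imputed[i] != "N"]
    correct = [i for i in filled if imputed[i] == truth[i]]
    filling_rate = 100.0 * len(filled) / len(masked_idx) if masked_idx else 0.0
    accuracy = 100.0 * len(correct) / len(filled) if filled else 0.0
    return filling_rate, accuracy
```

One trial would mask a sequence with, for example, `mask_random_sites(seq, 0.3, random.Random(seed))`, pass the masked sequence to an imputation routine, and score the output with `filling_rate_and_accuracy`; averaging over 100 such trials would correspond to one plotted point per panel and missing percentage.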

Figure 1

Validation of the imputation approach using simulated data. Filling rate (sensitivity: closed square) was calculated as the percentage of nucleotides inferred, and accuracy (specificity: open circle) was defined as the percentage of nucleotides correctly inferred. 10%N, 20%N, 30%N, 40%N and 50%N designate data sets in which 10, 20, 30, 40 and 50% of the nucleotides, respectively, were randomly removed. (a) Worldwide haplogroup A mitogenome sequences (Panel 2), (b) indigenous American mitogenome sequences (Panel 3) and (c) indigenous American haplogroup A mitogenome sequences (Panel 4). The values are the means of 100 trials.

Discussion

The results of imputation differed considerably according to the panel used (Table 2). We found that np 248, 10 873 and 15 301, which were still missing after imputation with Panel 1, were filled in using Panels 2, 3 and 4, with no discrepancies among the nucleotides imputed. However, nine sites filled in using Panel 1 (np 4248, 4824, 8794, 12 007, 16 111, 16 183, 16 290, 16 319 and 16 362) were filled in with different nucleotides when using Panels 2, 3 and 4, whereas the nucleotides imputed by these three panels were identical to one another. Intriguingly, Panels 2 and 4 successfully filled in np 152, whereas Panels 1 and 3 failed. For np 146, 235 and 663, the imputed nucleotides differed between Panels 1/3 and 2/4: 146T, 235A and 663A with Panels 1/3, and 146C, 235G and 663G with Panels 2/4. The nucleotide at np 152 is known to vary among haplogroups. Furthermore, np 146 is a hotspot,31 while 235A and 663A are observed more commonly among haplogroups than 235G and 663G (663G is a diagnostic nucleotide for haplogroup A). The failure or success of imputation and the inconsistency of the imputed nucleotides among panels are probably attributable to frequent parallel mutations. The use of a haplogroup-specific panel was highly advantageous for imputation. Imputation using Panels 1 and 3 filled in np 153 with nucleotide A, whereas Panel 4 filled it in with G and Panel 2 failed to impute it. The nucleotide at np 153 also varies among haplogroups, with 153A observed more commonly than 153G. Indeed, 153G is one of the characteristic variants of sub-haplogroup A2,32 which probably explains why the worldwide haplogroup A panel failed at imputation. Because the missing np 153 was imputed as nucleotide G using the indigenous American haplogroup A panel, PPL99_3A was assigned to sub-haplogroup A2. This result is consistent with the classification of haplogroup A of the indigenous American people into sub-haplogroup A2.

Table 2 Imputed nucleotides obtained using different types of panel

Together with the results of the simulation analysis, our results showed that imputation using a common ancestral panel, comprising mitogenome sequences that belong to the estimated haplogroup and come from the population most closely related to the individual under investigation, provided more valid and reliable results than imputation using a panel comprising all of the major haplogroups from around the world, a panel comprising worldwide mitogenome sequences belonging to the estimated haplogroup alone, or a panel comprising mitogenome sequences from the population most closely related to the individual under investigation. A larger and more diverse reference panel is thought to ensure more accurate imputation,29 but the present study demonstrated the risk that nucleotides are imputed incorrectly because of recurrent mutations. The mitogenome contains highly mutable nucleotides called hotspots, which undergo recurrent mutations,31 and missing nucleotides at such sites may be imputed incorrectly. However, by estimating the possible haplogroup at the initial step, imputation can be performed more effectively. The use of an appropriate panel is essential: a panel of mitogenome sequences belonging to the estimated haplogroup from the population most closely related to the individual under investigation is expected to give the best imputation results.