An in-depth analysis of the mitochondrial phylogenetic landscape of Cambodia

Cambodia harbours a variety of human aboriginal populations that have scarcely been studied in terms of genetic diversity of entire mitochondrial genomes. Here we present the matrilineal gene pool of 299 Cambodian refugees from three different ethnic groups (Cham, Khmer, and Khmer Loeu) deriving from 16 Cambodian districts. After establishing a DNA-saving high-throughput strategy for mitochondrial whole-genome Sanger sequencing, a HaploGrep based workflow was used for quality control, haplogroup classification and phylogenetic reconstruction. The application of diverse phylogenetic algorithms revealed an exciting picture of the genetic diversity of Cambodia, especially in relation to populations from Southeast Asia and from the whole world. A total of 224 unique haplotypes were identified, which were mostly classified under haplogroups B5a1, F1a1, or categorized as newly defined basal haplogroups or basal sub-branches of R, N and M clades. The presence of autochthonous maternal lineages could be confirmed as reported in previous studies. The exceptional homogeneity observed between and within the three investigated Cambodian ethnic groups indicates genetic isolation of the whole population. Between ethnicities, genetic barriers were not detected. The mtDNA data presented here increases the phylogenetic resolution in Cambodia significantly, thereby highlighting the need for an update of the current human mtDNA phylogeny.

www.nature.com/scientificreports/ maternal lineages, and most of the common Cambodian haplogroups probably originated locally before expanding to the surrounding areas during prehistory 2 . The aim of our study was to provide a high quality mtDNA data set from Cambodia in order to understand the complex history of Southeast Asian mtDNA evolution and diversity. This was accomplished by genotyping the so far largest population set of complete mitochondrial genomes of 299 Cambodian individuals, with a cost-efficient, DNA-saving, and reliable high-throughput strategy for mitochondrial whole-genome Sanger sequencing, and by distilling ~ 7,000 mtDNA samples from studies on neighbouring-and global populations. Post-processing of the data was carried out in HaploGrep 2, estimating haplogroups and performing additional quality control steps. Phylogenetic trees were directly generated within HaploGrep 2, data exported in multiple alignment format for maximum likelihood phylogenetic tree inference with RamXL-NG 23 , Bayesian evolutionary analysis with BEAST 2 24 , Median-Joining Networks with Network.exe (v.4) 25 as well as Maximum Parsimony trees.

Results
The study included 299 DNA samples from Cambodian citizens, who lived in Northern Thailand at the time of sample collection. Blood samples were originally collected from 300 individuals, but one sample (C228) showed obvious signs of contamination and was therefore excluded from further analysis. Detailed descriptions of the samples including sex, age, places of birth of the sampled individuals and of their mothers, the mtDNA haplogroup affiliations and the GenBank accession numbers are given in Table S1. Our strategy enabled a 2.8-fold coverage of each mitochondrial profile with diverse electropherograms, generated with 55 forward primers and 41 reverse primers (see Supplemental Material Table S2, and Supplemental Fig. S1 for workflow). In comparison to our previously published whole-mtDNA sequencing strategy 26 , six forward primers and 20 reverse primers were new and replaced old sequencing primers. In addition, all liquid handling steps were performed by liquid handling platforms, thereby reducing potential pitfalls such as contamination or sample mix-up. Finally, the strategy was also economic in terms of DNA consumption, as less than 1 µg of DNA (for some samples only 400 ng, some had to be repeated) was needed for generating a high-quality mtDNA profile.
Out of the 299 Samples, the majority of individuals (79.3%) were born in four provinces of Cambodia ( Fig. 1 13.7%). The remaining 20.7% of samples came from twelve different provinces. In 95.3% of individuals, the places of birth of the sampled persons were identical to the maternal places of birth, thereby underscoring the stability of the geographic origin of samples (see Table 1). Figure 1 represents the place of birth of the investigated samples, the sampling locations in Zhang et al. 2 and gives an overview of the geographic origin of samples analysed.
The largest part of participating refugees declared themselves as belonging to the ethnic group of Khmer (76.9%, Table 1 and Supplemental Table S1). However, according to the official statistics, more than 90% of Cambodians are of Khmer ethnicity 1 . The large proportion of Cham individuals sampled in this study (20.4%)  www.nature.com/scientificreports/ could therefore be explained by the fact that the samples were collected in refugee camps in Thailand, where ethnic minorities, who were prosecuted or expelled by the Khmer rouge, found shelter. We adapted HaploGrep 2 to automatically generate DOT graphs, which is represented in Fig. 2. The 299 samples fall into 224 unique haplotypes and into 90 different haplogroups. The samples can be grouped into macrohaplogroups R (n = 167, 55.9%), M (n = 119, 39.8%), and N (n = 13, 4.3%). Figure 2 illustrates that macrohaplogroup M samples show only few larger clusters (with M17c1a1a being the largest clade), while 76.6% of samples in macrohaplogroup R predominantly fall into haplogroups B5a1 (75 samples), F1a1 (38 samples) and R22 (15 samples), according to the latest PhyloTree version 17 27 . This indicates a lack of phylogenetic resolution, especially in B5a1 and F1a1. Because some refugees were related to each other within the Cambodian samples presented herein, we performed the subsequent analyses by only analysing one sample per family, which left us with 264 samples (see Table S3, samples with relations marked with red-background and Table S4, where our samples are denoted by (Dataset = This)). Figures S5 and S6 represent alternative phylogenetic trees, which were generated with maximum-likelihood (RaxML-NG) and Bayesian (BEAST 2) methods respectively.
Phylogenetic characterisation of mtDNA genomes in macrohaplogroups N and R. The 13 samples in lineage N can be classified into eight different haplogroups according to PhyloTree 17, and the 167 samples in R can be assigned to 29 distinct haplogroups (see Fig. 2). Restricting to unrelated individuals, 164 samples remained for further analysis for both N and R lineages. The samples in lineage N belong to N7, N8, N9, N21, N22, and W3b with two basal N samples.
Samples from macrohaplogroup R are mostly part of B (B4′5, B6), R9 (F1 and R9b), as well as R22 and R23 (see Fig. 2). One exception is sample C149 classified as U2b1, including several private mutations. This haplotype was observed only once in a different dataset (GenBank GQ337626, Popset 255366113), referring to an unpublished sequence, describing samples from the northeast Indo-China corridor. After removing samples with direct relationships, 60 out of 75 samples can be placed in B5a1 (20%), 32 out of 38 in F1a1 and 14 out of 15 in R22. Subsequently, we emphasized our focus on the B5a1 and F1a1 haplogroups. PhyloTree 17 lists 10 subhaplogroups to B5a1, which are grouped into the four lineages B5a1a (including B5a1a1), B5a1b (including B5a1b1), B5a1c (with B1a1c1 down to B5a1c1a1 and B1a1c2) and B5a1d. We generated maximum parsimony as well as maximum likelihood and Bayesian trees based on all previously published sequences defining B5a1 and its subgroups, by further including our B5a1 samples. Most of the Cambodian samples in B5a1 could be placed in B5a1a (77.3%), and B5a1d (17.3%), while only three samples could be classified into B5a1b and one into B5a1c. Supplemental Material Figure S2 gives an overview of the alternative trees and the extended phylogeny. www.nature.com/scientificreports/ Additionally, we performed the same procedure to fine-scale haplogroup F1a1, which also includes four subgroups F1a1a to F1a1d, totalling to nine subgroups. Most samples herein could be placed in F1a1a (36 out of 38 samples, see Fig. 2). Supplemental Figure S3 gives an overview of the Bayesian phylogenetic tree, while Supplemental Table S5 provides the maximum parsimony tree for the N subset, including F1a1. While currently only two samples define haplogroup R22, our data set includes 15 samples from this lineage. Our distilled sample set for population comparison contains a total of 106 haplogroups belonging to R22, most from Cambodia (75 samples including the control region sequences from Zhang et al. 2 comprising our data), as well as in the complete mtDNA genomes described in Thailand 10-12 , in Vietnam 13 (only one sample out of 607 sequences), Myanmar 18 and Laos 16 . Supplemental Table S7 gives an overview of all haplogroups identified in the data sets grouped by country. Supplemental Table S8 provides the rho ± sigma for all haplogroups for the complete mtDNA (np 1-16,569) variants and only synonymous substitutions, respectively.  Table S6 provides the maximum parsimony tree for the M subset. Additionally, median-joining networks were generated and presented in Supplemental Fig. S4.
Relationship to populations in Southeast Asia. We compiled a data set comprising 7,178 complete mtDNA sequences as a representative data set of populations in Southeast Asia (Supplemental Table S4). The data set included 2,504 samples from the 1000 Genome Project Phase 3 28 , which we had analysed before 29 . Figures 1 and 3 give an overview of the sample locations as well as the haplogroup distributions of the differ- www.nature.com/scientificreports/ ent datasets. Figure 3A includes the populations from the 1000 Genome Project, with the super populations reported, (see the 1000 Genomes Project site for details https:// www. inter natio nalge nome. org/ categ ory/ popul ation/), with East Asian ancestries (EAS) are listed individually. Figure 3A highlights the observation that Cambodian samples lack the presence of haplogroup A, and exhibit only a few M7, M8 and M9 haplogroup samples compared to other Southeast Asian populations. Based on the HaploGrep results (see Supplemental Table S3 for detailed haplotypes), the Cambodian ethnic groups were compared to previously published mtDNA data sets from populations in Southeast Asia. For this purpose, we adapted the haplogroup classification to Zhang et al. 2 . Figure 4 shows the Principal Component Analysis plots including the global 1000 Genomes Phase 3 data. As can be seen, the first two principle components of the macrohaplogroups in Fig. 4A allow for sufficient discrimination power, with haplogroups grouped on macro-level (see Fig. 3 and Supplemental Material Table S7 for details). We observed some outliers for the Thailand samples (predominantly Northern Thailand 11 , Khon Mueang 10 , and Mon 12 ) when clustering on macrohaplogroups. The close relation between Southeast Asian populations (EAS cluster) and Americans (AMR cluster) is based on the mutual presence of haplogroup lineages A, B, C, and D in both EAS and AMR super populations.
In addition, we generated a correlation matrix based on the data represented in Fig. 4A. The samples for the correlation plot (Fig. 5) of the compiled dataset (n = 7,178) (Supplemental Table S4) were grouped according to the countries in Southeast Asia and on a continental level for other populations, in order to give a more compact overview. Overall, the highest correlation was found between our data set and the Cambodian samples presented by Zhang et al. 2 . The correlation plot (Fig. 5) is ordered by the first principle component, with samples from Vietnam, Thailand and Laos showing the highest correlations to each other. Interestingly, the 1000 Genome samples from Vietnam (KHV) showed a lower correlation than expected to those three neighbouring countries. Due to the lack of macrohaplogroups M7, M8, M9, and D, the Cambodian samples showed a more profound correlation to South Asians.
Additionally, we prepared the data for MitoBench 30 (see https:// github. com/ mitob ench/ for different repositories), which allowed the calculation of pairwise F ST values based on a dataset compiled from the complete mtDNA sequences. Supplemental Material Table S9 provides the complete dataset with 5,983 multiple aligned

Discussion
The aim of the present study was to corroborate and extend previous phylogenetic analyses on the evolution of mitochondrial genomes in Cambodia. By whole-mitochondrial genome sequencing of 299 Cambodian individuals living in Thailand, we could define several new mtDNA haplogroups (see Supplementary Tables S5 and  S6 for details). The compilation of samples was certainly biased by the fact that the samples were not collected in mainland Cambodia, but in refugee camps close to the Cambodian border. The sample bias is reflected by the overrepresentation of individuals of Cham ethnicity in comparison to today's ethnic composition of Cambodia. However, our careful documentation of the geographic origin of every individual, including the maternal place of birth, renders this population sample a valuable and reliable contribution for appraising the genetic diversity of aboriginal Cambodians. Our data strongly correlates to the previously published data from Zhang et al. 2 , indicating the feasibility of the present mtDNA haplotypes to better understand the genetic matrilineal makeup in Cambodia.
At present, our study provides the largest collection of entire mitochondrial genomes from Cambodia. Therewith, we were able to detect and define new mitochondrial haplogroups, predominantly under lineage B5a1 and F1a1 (see Supplementary Figs. S2 and S3 for the B5a1 and F1a1 haplogroup refinements). On this account, we emphasize the significance of further subjecting larger population samples from geographic white spots on the human evolutionary map to whole-mtDNA-sequencing projects, but also the need to update the underlying database as well as the phylogenetic representation of human mtDNA.
When considering the samples-to-haplogroup ratio, the M clade presents a lower ratio compared to the R clade. The samples-to-haplogroup ratio indicates the quotient of the number of samples divided by the number of subhaplogroups. A lower samples-to-haplogroup ratio means that there are more subhaplogroups under a specific super-haplogroup. This indicates that there are fewer pronounced clusters with many nearly identical samples. There are two explanations for this: On the one hand, the phylogeny has a higher resolution under M. This can be observed by the number of haplogroups defined in PhyloTree 17 (M includes 685 subhaplogroups at different levels), compared to R (n = 165), including B (n = 322) and F (n = 104). Nevertheless, in phylogenetic studies of Southeast Asia new deep rooting M-haplogroups with many seemingly private mutations are frequently found, indicating a still significant undersampling of macrohaplogroup M. On the other hand, the large clusters under B5a1 and F1a1 correspond to 38% of all samples, indicating that Cambodia harbours autochthonous populations with relatively recent expansions, as previously observed by Zhang and colleagues 2 . The higher number of complete sequences generated within the current project allows for a far better refinement of the haplogroups B5a1 and F1a1, extending the phylogenetic sub-clades significantly (Supplemental Figs. S2 and S3).
By generating a correlation matrix (Fig. 5) as well as F ST statistics (Supplemental Table S10), we detected a bias in two studies, where both control region samples and complete mtDNA sequences were analysed. This applies to the Cambodian samples from Zhang et al. 2 as well as our previously presented Myanmar data in Summerer et al. 18 . Supplemental Figure S7 gives an overview of the haplogroup distributions in all analysed data sets, with Supplemental Table S10  Our group provided the mtDNA for two larger ethnicities in Myanmar 18 , which was missing according to Zhang et al., for further evidence that Cambodia was a dispersal centre of mitochondrial lineages in east Asia 2 . The population structure in Bamar showed an extraordinarily diverse haplogroup composition. In contrast, the Karen people displayed a more profound sign of genetic isolation. While both fitted well into the Southeast Asian cluster, the two larger ethnicities in Cambodia (Cham and Khmer) showed an extremely similar haplogroup distribution, which we did not expect to this degree. As the samples derived from different regions of Cambodia, this indicates a stable gene flow between the Cambodian ethnicities, and a lower diversity than expected, compared to populations in Mainland Southeast Asia. Considering that the data derived from Cambodian refugees, it nevertheless matches the general population from different regions (see Fig. 1 for sample origins, Figs. 3 and 5 for population comparison).
Besides the phylogeographic analysis, an important aspect of this work was to ease the handling of mtDNA data. Therefore we adapted HaploGrep 2 to directly generate haplogroup overview plots such as in Fig. 2, which can be further adapted in vector graphic software (e.g. Inkscape). Thereby HaploGrep 2 provides several different lineage outputs (see https:// github. com/ seppi nho/ haplo grep-cmd for more details).
Finally, the presented dataset allows a much better fine scaling of the phylogenetic tree based on the complete mtDNA sequences. Most of the new haplogroups explored within this work are defined by variants in the coding region, with only a few new haplogroups including control region variants. The presented data also highlights the need for an update of the underlying phylogenetic tree.

Methods
DNA samples. Blood samples were collected from 300 Cambodian individuals, who came from 16 of the 22 regions of Cambodia (Fig. 1), but lived in Thailand at the time of the sample collection. The study was approved by the Thai Ministry of Public Health in accordance with international ethical standards. According to the Declaration of Helsinki, participation in the study was on a voluntary basis and informed consent was obtained from all donors (Chiang Mai University, Thailand, January 11, 2007). Detailed information on the geographical origin and ethnic background of the samples including the maternal region of birth was collected from all samples and can be found in Table 1 and Supplemental Table S1. Genomic DNA was extracted on a BioRobot EZ1 advanced Workstation (QIAGEN, Hilden, Germany) Table S2). In sum, 24 sequencing plates were cycled for obtaining full mtDNA genomes from 96 samples, after the two negative controls from PCR amplification have been replaced by PCR products from the last DNA plate (only partially filled). Cycle sequencing products were purified with MultiScreen SEQ384 Filterplates (Merck Millipore Corporation, Darmstadt, Germany). Electrophoretic separation was carried out on an ABI3730 capillary sequencer using POP-7 and a 50 cm capillary array.
Sequence evaluation, quality assurance and data analysis. Sequence electropherograms were aligned to the revised Cambridge Reference Sequence (rCRS; NC_012920) 31 and evaluated independently by two different mtDNA technicians with the sequence analysis software Sequencher (v5.0, GeneCodes, Ann Arbor, MI). Validation was performed by a senior mtDNA scientist using the mtDNA management software eCOMPAGT 32 . MtDNA haplotypes were assigned to haplogroups based on PhyloTree Build 17 27,33 with Hap-loGrep 2 34 . In addition, the phylogenetic tree was generated directly with HaploGrep 2. HaploGrep 2 was also used for data exports in multiple alignment format, for which we also applied MAFFT on a dataset of 6,500 fasta sequences. Additionally different trees for the 264 samples (35 samples removed due to direct kinship relations) including the Reconstructed Sapiens Reference Sequence (RSRS) 35 were generated with maximum-likelihood approaches (modeltest-ng and RAxML-NG 23 ), maximum parsimony (mtphyl-https:// sites. google. com/ site/ mtphyl/) and Bayesian evolutionary analysis (BEAUTi, BEAST 2, TreeAnnotator) 24 . The RSRS sequence was included for rooting the tree, representing the most common maternal ancestor, while we always use the rCRS for representing mutations with their according alternative nucleobase. The data was prepared for storage in MitoDB and MitoBench 30 . The latter was used for the analysis of molecular variance (AMOVA-pairwise F ST ) in Supplemental Table S9, which was also performed on VCF files with vcftools 36 . The analysis of the data was performed in R, with libraries corrplot, corrr, ggbiplot, ggpubr and reshape2. Also the PCA was carried out in R with the standard library and the ggbiplot package. Visualization of the trees were additionally generated in the interactive Tree of Life 37 web-application (https:// itol. embl. de). Mtphyl was used for the Ro(Sigma) calculations (Supplemental Table S8) and the generation of phylogenetic trees for N and M (Supplemental Tables S5 and S6 respectively).

Data availability
The 299 mitochondrial genome sequences reported in this study are deposited in NCBI GenBank under accession numbers KT587350-KT587648, PopSet 1020878286. The latest HaploGrep 2 version is available on https:// github. com/ seppi nho/ haplo grep-cmd