Introduction

The distribution of human genetic diversity has long been a subject of interest, and it has important implications for human evolution, forensics, and the distribution of genetic diseases in populations. Genetic diversity in human populations is low relative to that in many other species, attesting to the recent origin and small size of the ancestral human population (Li and Sadler 1991; Crouau-Roy et al 1996; Kaessmann et al 1999). Since the seminal study of Cann et al (1987), mitochondrial DNA (mtDNA) data have proven to be extremely useful for studying human evolution, including prehistoric migrations and demographic events such as sudden population expansions or extreme bottlenecks (Sherry et al 1994).

Human mtDNA is a closed circular genome of ~16.5 kb (Anderson et al 1981) that includes a 1.1 kb noncoding (control) region with a highly variable sequence (Greenberg et al 1983; Melton et al 1997). The variable sites in the control region are concentrated in hypervariable region I (HVR I) and hypervariable region II (HVR II), each ~400 bp, which correspond to the origin of replication and the D-loop. Since mtDNA is maternally inherited, exists in a high copy number (1,000–10,000 copies) in each cell, and evolves rapidly (5–10 times faster than nuclear DNA), mtDNA sequence polymorphism has proved useful in population and evolutionary studies (Gresham et al 2001; Ingman and Gyllensten 2001; Roychoudhury et al 2001; Yao et al 2002; Kivisild et al 1999; Metspalu et al 2004; Forster and Matsumura 2005; Thangaraj et al 2005; Macaulay et al 2005), anthropology (Derbeneva et al 2002; Houck and Budowle 2002; Koyama et al 2002; Yao and Zhang 2002) and forensic science (Budowle et al 2002; Seo et al 2002).

India comprises one of the largest ethnic populations, with more than 1 billion people drawn from diverse cultures, languages and geographical backgrounds. A number of studies have provided some insights into the maternal genetic structure of Indian populations (Bamshad et al 2001; Kaur et al 2002; Basu et al 2003; Kivisild et al 2003; Rajkumar and Kashyap 2003; Metspalu et al 2004; Palanichamy et al 2004; Rajkumar et al 2005; Thangaraj et al 2005). However, many questions remain unanswered, and more studies are required to better understand the genetic structures of diverse Indian population groups. This preliminary study set out to analyse the nature of the variations in the hypervariable regions (HVR I and II) and to perform phylogenetic analyses on individuals from Uttar Pradesh (UP), Bihar (BI) and Punjab (PUNJ), who belong to the Indo-European linguistic group, and on individuals from South India (SI), who have their linguistic roots in the Dravidian language family, in order to derive the structures of founder female lineages for these regions of India. We also aimed to provide an overview of the maternal gene pool, the regional maternal population expansion times and patterns, and the phylogenetic relationships of these groups with each other and with other world populations.

Materials and methods

Subjects

Blood samples were collected in RBC lysis buffer, after obtaining the required consent, from unrelated healthy donors from different population groups of India. A total of 111 samples were collected: UP (n=33), BI (n=26), PUNJ (n=35) and SI (n=17). Genomic DNA was isolated following the routine protocol of Kunkel et al (1977).

Amplification of mtDNA

Two mtDNA regions were amplified by PCR. HVR I was amplified in all 111 individuals with the designed primer set AB-6F (5′-ACC CAA TCC ACA TCA AAA CC-3′) and AB-6R (5′-TCA AGG GAC CCC TAT CTG AG-3′). HVR II was amplified in 87 individuals with the designed primer set F (5′-GGT CTA TCA CCC TAT TAA CCA C-3′) and R (5′-CTG TTA AAA GTG CAT ACC GCC A-3′). Reactions were performed in a 12.5 μl volume containing 50 ng of template DNA, 6.25 pmol of each primer, 200 μM dNTPs, 1.5 mM MgCl2, 1× reaction buffer and 0.3 U of Taq polymerase (Bangalore Genei, India). Cycling consisted of 30 cycles of denaturation at 94 °C for 1 min, annealing at 62 °C for 1 min and extension at 72 °C for 1 min, followed by a final extension step at 72 °C for 5 min. PCR products were first checked on a 2% agarose gel and then sequenced using an ABI 3100 sequencer (USA). The sequences obtained were compared with the revised Cambridge Reference Sequence (CRS) (Anderson et al 1981) to identify mutations.
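The comparison against the reference sequence can be illustrated with a minimal Python sketch. It assumes the sample read has already been aligned to the reference fragment without gaps, so only substitutions are reported. The function name `call_variants` and the two short sequences are invented for illustration and are not real CRS data; only the starting position 16023 is taken from the HVR I range analysed here.

```python
def call_variants(reference, sample, start_position):
    """List substitutions in an aligned sample relative to a reference.

    Assumes both strings are gap-free and of equal length, so indels
    are not handled. Positions are numbered from start_position.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    pairs = zip(reference.upper(), sample.upper())
    for offset, (ref_base, obs_base) in enumerate(pairs):
        if obs_base == "N":
            # Uncalled base in the read: skip rather than report a mutation.
            continue
        if ref_base != obs_base:
            variants.append(f"{ref_base}{start_position + offset}{obs_base}")
    return variants


# Invented toy fragments (NOT real mtDNA sequence), numbered from 16023.
toy_reference = "ACCTGTAGTA"
toy_sample = "ACCTATAGCA"
print(call_variants(toy_reference, toy_sample, 16023))
```

In practice, insertions and deletions in the control region would require a proper alignment step before a comparison of this kind.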

Phylogenetic and statistical analysis

The regional/linguistic descriptions and HVR I haplotypic motifs of the samples are given in Appendix 1. The haplotypic motifs were assigned to putative haplogroups by comparing them with the HVR I data of Metspalu et al (2004), and the coalescence ages of these haplogroups and their standard errors (SE) were calculated as described by Forster et al (1996). Mitochondrial hypervariable regions (HVR I and II) were analysed for statistical and phylogenetic patterns. The software DNASP 4.0.0.4 (Rozas and Rozas 1999) was used to identify the number of polymorphic sites and the number of mutations; to calculate nucleotide diversity, the mean number of mismatches, Fu’s Fs statistics (Fu 1997), Tajima’s D values (Tajima 1989) and raggedness statistics; and to draw the mismatch distributions (Rogers and Harpending 1992). The initial theta (θa) and tau (τ) values obtained from DNASP 4.0.0.4 were used to calculate the effective population size, Ne = θa/(2μ), and the population expansion age in years, AYa = A × τ/(2μ) (Rogers and Harpending 1992). An average mutation rate of μ = 0.00124 per site per generation (Forster et al 1996) and an average generation time of A = 20 years were used for the calculations.
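As a worked illustration of the two formulas above, the following Python sketch computes Ne = θa/(2μ) and AYa = A × τ/(2μ). The function names and the example values of θa and τ are hypothetical and are not taken from Table 1; μ and A are the values stated in the text. Whether μ must be rescaled to the whole sequence depends on how θa and τ are reported by the software, so μ is left as a plain parameter here.

```python
def effective_population_size(theta_initial, mutation_rate):
    """Ne = theta_a / (2 * mu), as given in the text."""
    return theta_initial / (2.0 * mutation_rate)


def expansion_age_years(tau, mutation_rate, generation_time_years=20.0):
    """AYa = A * tau / (2 * mu), as given in the text."""
    return generation_time_years * tau / (2.0 * mutation_rate)


MU = 0.00124          # mutation rate quoted in the text
GENERATION_TIME = 20  # years, as quoted in the text

# Hypothetical inputs for illustration only (not values from Table 1).
example_theta = 2.0
example_tau = 5.0

ne = effective_population_size(example_theta, MU)
age = expansion_age_years(example_tau, MU, GENERATION_TIME)
print(f"Ne = {ne:.1f}, expansion age = {age:.0f} years")
```

Because both quantities scale with 1/μ, any uncertainty in the mutation rate carries over proportionally to the estimated population size and expansion age.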

The neighbour-joining (NJ) trees for the studied population groups were generated using MEGA 2.1 software. Pairwise genetic distances between the studied populations were computed as the linearised distance FST/(1-FST) (Slatkin 1995) using ARLEQUIN 2.000 software (Schneider et al 2000). These linearised distance values were used to create an MDS plot based on HVR I data using SPSS 10.0.5 (Chicago, IL, USA), and an NJ tree based on HVR II data using PHYLIP (Version 3.5c) (Felsenstein 1989; PHYLIP home page) and TreeView (version 1.6.1) (see http://taxonomy.zoology.gla.ac.uk/rod/treeview.html) for the studied population groups. Median-joining (MJ) networks (Bandelt et al 1999) for HVR I and II were constructed using the software NETWORK 4.1.0.8.
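The linearisation FST/(1-FST) is simple enough to sketch directly. The Python functions below are illustrative only and are not part of ARLEQUIN; the function names are hypothetical, and treating negative FST estimates as zero is an assumption of this sketch rather than something stated in the text.

```python
def slatkin_linearised(fst):
    """Slatkin's linearised distance D = FST / (1 - FST)."""
    if fst >= 1.0:
        raise ValueError("FST must be below 1 for the distance to be finite")
    # Assumption of this sketch: negative estimates are treated as zero.
    fst = max(fst, 0.0)
    return fst / (1.0 - fst)


def fst_from_linearised(distance):
    """Invert the transformation: FST = D / (1 + D)."""
    if distance < 0.0:
        raise ValueError("a linearised distance cannot be negative")
    return distance / (1.0 + distance)


for fst in (0.0, 0.05, 0.2, 0.5):
    print(fst, round(slatkin_linearised(fst), 4))
```

For small FST the linearised value is close to FST itself, and it grows without bound as FST approaches 1.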

Results

Analysis of mitochondrial HVR I (nucleotide positions: 16023–16414) and HVR II (nucleotide positions: 53–424) showed the presence of a large number of variations, both known and novel mutations, in the samples analysed. Some of the sequences with mutations were deposited in the GenBank database and accession numbers were obtained: HVR I (AF467445–450, AF542192–198 and AY899182–94) and HVR II (AY642000–023).

The analysis of the putative haplogroups classified on the basis of mtDNA HVR I haplotypic motifs (Appendix 1) showed that 53.15% of the individuals from the studied population groups belonged to haplogroup M lineages with a coalescence age of 66,043.6 ± 9,995.7 years. Both U2 and U7 lineages of haplogroup U, each 10.8% of the total, showed deep but different coalescence ages of 58,297.7 ± 23,729 and 35,875 ± 10,984.6 years, respectively. Individuals represented in haplogroup R lineages (9%), most of them (5.4%) belonging to R5, showed a coalescence age of 52,468 ± 17,592.5 years. Other observed lineages, known to be of European origin (F, H, HV, J, U1 and U5), represented 9.9% of the total population, whereas 6.3% remained undefined.

Using the HVR I sequence data, the number of polymorphic sites, the number of mutations, the mean number of mismatches, the nucleotide diversities, the initial theta values, the raggedness statistic values, Fu’s Fs statistic values, Tajima’s D values, expansion ages and initial effective population sizes of all the studied population groups were calculated and are given in Table 1. Figure 1 depicts the mismatch distribution patterns and respective NJ trees for the studied population groups based on mitochondrial HVR I data. The unimodal nature, the smoothness (as revealed by the very small values for the raggedness statistics) of the mismatch distribution curves, the reasonably good fits with the expected distributions of the observed mismatch distributions, the branching patterns in the NJ trees, the significantly large negative values of Fu’s Fs statistics, and the highly significant values of Tajima’s D (Table 1) clearly indicate that there were significant expansions of the different population groups studied, which is also supported by the star-like configurations in the MJ networks based on HVR I (Fig. 2a) and HVR II (Fig. 2b) data. An extreme sharing of haplotypes with no population-specific differentiation was also observed in both of the MJ networks.
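To make the mismatch-based reasoning concrete, the following Python sketch computes a mismatch distribution, its mean, and a raggedness value for a set of aligned sequences. It is not the DNASP implementation. It assumes the raggedness index is the sum of squared differences between successive frequency classes (Harpending's r), and the four toy sequences are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Count sites at which two aligned, equal-length sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def mismatch_distribution(sequences):
    """Relative frequency of each number of pairwise differences (0..max)."""
    counts = Counter(
        pairwise_differences(a, b) for a, b in combinations(sequences, 2)
    )
    total_pairs = sum(counts.values())
    return [counts.get(k, 0) / total_pairs for k in range(max(counts) + 1)]


def mean_mismatches(frequencies):
    """Mean number of pairwise differences."""
    return sum(k * f for k, f in enumerate(frequencies))


def raggedness(frequencies):
    """Sum of squared steps between successive classes (assumed definition)."""
    padded = list(frequencies) + [0.0]  # class beyond the maximum is empty
    return sum((padded[i] - padded[i - 1]) ** 2 for i in range(1, len(padded)))


# Invented toy alignment (not real data).
toy_sequences = ["AAAA", "AAAT", "AATT", "ATTT"]
distribution = mismatch_distribution(toy_sequences)
print(distribution, mean_mismatches(distribution), raggedness(distribution))
```

A smooth, single-peaked distribution yields a small raggedness value, which is the pattern interpreted above as a signature of population expansion.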

Table 1 Descriptive statistics based on HVR I in Indian population groups belonging to various linguistic backgrounds
Fig. 1 Observed (dashed line) and expected (solid line) mismatch distribution curves and respective neighbour-joining trees showing population expansion patterns based on mtDNA HVR I data

Fig. 2a–b (a) Median-joining network of the studied populations based on mtDNA HVR I haplotypes. (b) Median-joining network of the studied populations based on mtDNA HVR II haplotypes

An MDS plot based on the genetic distances of Slatkin linearised FSTs from mtDNA HVR I data (Fig. 3a) showed that the studied population groups form a compact group and this cluster includes Mongol, Egyptian and sub-Saharan as the nearest population groups. An NJ Tree (Fig. 3b) based on the genetic distances of Slatkin linearised FSTs (given as UP and BI = 0.00, UP and SI = 0.00, BI and SI = 0.00, PUNJ and BI = 0.05, PUNJ and SI = 0.04, PUNJ and UP = 0.09) from HVR II data revealed that the population groups UP and BI formed a single cluster whereas PUNJ branched out, probably depicting genetic affinity among the UP, BI and SI population groups as compared to PUNJ.
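The pattern in the six HVR II distances quoted above can be checked with a short Python sketch. It only rearranges the reported values into a symmetric lookup and averages each group's distance to the other three. The helper names are hypothetical, and the mean distance is used here as an informal summary, not as the method behind the NJ tree.

```python
populations = ["UP", "BI", "PUNJ", "SI"]

# Slatkin linearised FST values exactly as reported in the text (HVR II).
reported = {
    ("UP", "BI"): 0.00,
    ("UP", "SI"): 0.00,
    ("BI", "SI"): 0.00,
    ("PUNJ", "BI"): 0.05,
    ("PUNJ", "SI"): 0.04,
    ("PUNJ", "UP"): 0.09,
}


def distance(pop_a, pop_b):
    """Symmetric lookup of the reported pairwise distance."""
    if pop_a == pop_b:
        return 0.0
    if (pop_a, pop_b) in reported:
        return reported[(pop_a, pop_b)]
    return reported[(pop_b, pop_a)]


def mean_distance_to_others(pop):
    """Average distance from one group to the remaining three."""
    others = [p for p in populations if p != pop]
    return sum(distance(pop, other) for other in others) / len(others)


for pop in populations:
    print(pop, round(mean_distance_to_others(pop), 4))

most_distinct = max(populations, key=mean_distance_to_others)
print("Most distinct group:", most_distinct)
```

On these values PUNJ has the largest mean distance (0.06), while UP, BI and SI are separated from one another by distances of 0.00, which matches the branching described for the NJ tree.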

Fig. 3a–b (a) MDS plot based on Slatkin’s linearised FST values obtained from HVR I data showing the genetic relationship between the studied and other world population groups. (b) Neighbour-joining tree of the studied population groups based on the genetic distances of Slatkin linearised FSTs obtained from HVR II data

Discussion

The geography of India has played a decisive role in the peopling of India. Populations within India have been subjected to foreign invasions and migrations from time to time, resulting in no single apparent origin for any present day population group and a conglomeration of different Y-chromosomal lineages (Quintana-Murci et al 2001; Saha et al 2005). The maternal gene flow in and out of India has been limited since the initial settling of Indian maternal lineages (Metspalu et al 2004). Indian mtDNA lineages belong to either Asian-specific haplogroup M or western Eurasian-specific haplogroups H, I, J, K, U, W and others that were not established anywhere (Kivisild et al 1999). The high frequency and diversity of mtDNA haplogroup M, the major contributor to the Indian maternal gene pool, have been associated with a southwest-Asian origin (Roychoudhury et al 2000, 2001; Richards et al 2003; Rajkumar et al 2005). In contrast, the presence of lineage M1 in Africa (Quintana-Murci et al 1999) and the lack of L3 lineages other than M and N in India make an east-African origin of haplogroup M the most parsimonious view. This view has been supported by the most recent proposal of a single rapid coastal settlement of Asia by three major mtDNA haplogroups, M, N and R, as the founding female lineages of Indian population groups (Palanichamy et al 2004; Macaulay et al 2005; Thangaraj et al 2005; Forster and Matsumura 2005). However, the restricted presence of M as M1 and the phylogeography of M1 in Africa, predominantly in the Afro-Asiatic linguistic phylum (Metspalu et al 2004), leave the question of the origin of haplogroup M unanswered.

This study of human mtDNA attempts to answer a variety of questions regarding the structures and compositions, the genetic relationships, the ancestries and the effects of migrations on present day females within the population groups studied in India, and it also throws light on their genetic relationships with each other and with other world populations. The observed deep coalescence ages, distribution patterns and high diversities of haplogroup M in all of the studied population groups support the concept of a common founder of the Indian maternal gene pool acting as a major contributor, irrespective of its place of origin. Further, the overlapping coalescence ages within the standard deviation intervals for the U2 and R5 lineages suggest that they coexisted with haplogroup M lineages (Metspalu et al 2004), although at comparatively lower frequency and diversity, which could also be due to independent migration events in India. In contrast, the coalescence age of U7 was observed to be far younger, indicating that it differentiated later and that its spread across the Indian subcontinent also occurred later. The mismatch distribution patterns, respective NJ trees, various statistical analyses (Table 1), star-like configurations of the clusters with extreme sharing and the lack of population-specific differentiation in the MJ networks (Fig. 2a, b) suggest that all of the population groups underwent expansions in ancient times and share a fundamental maternal similarity. There were also indications of some recent mtDNA migrations, apparent in the form of some small clusters without star-shaped phylogeny in the MJ network (Fig. 2a), that could be associated with various recent invasions of and migration events to India. A greater maternal genetic proximity was revealed in MDS plot analysis (Fig. 3a) for the studied population groups when compared with other world populations, which also indicated the conservation of the east-Asian mtDNA components and the presence of West Eurasian and African/sub-Saharan lineages. The clustering in the NJ tree (Fig. 3b) and the FST values indicate a close maternal relationship between all of the Indian populations, whereas the branching out of the Punjab supports the interpretation that there was probably a considerable inflow of genes from Indo-European-speaking populations from central and possibly from West Asia into the Punjab (Passarino et al 1996; Kaur et al 2002). The results obtained also suggest that linguistic/ethnic differences evolved later on, by the process of acculturation, and that the recent demic diffusion (Quintana-Murci et al 2001) also brought in some western-Eurasian mtDNA components.

To conclude, the present study supports an ancient common ancestry for the studied population groups through common founder female lineages, but it also indicates a maternal gene flow in ancient times with further ethnic differentiation occurring subsequently through a series of demographic expansions, geographical dispersals, social groupings and later Eurasian admixture.

In future studies, a larger sample size and mitochondrial coding region SNP markers could provide more information and lead to a better comprehension of the evolutionary history of the studied and other Indian population groups.