Introduction

Islam was first brought to the Indian Subcontinent in 711 CE, when the Arab military forces conquered Sindh, the lower Indus valley, and incorporated it into the Arabian Empire.1 Subsequently, Sindh not only became an Indo-Muslim state but also an Islamic outpost, where Arabs established trade links with the Middle East and were later joined by mystic teachers, or Sufis. By the end of the tenth century, dramatic changes took place when the Central Asian Turkic tribes accepted both the message and mission of Islam. These aggressively expansive invaders first began to move into Afghanistan and Iran and later into India through the northwest. In the thirteenth century, a Turkic kingdom was established in Delhi, which enabled Persian and Afghan Muslim invaders to further spread across India. Within the next 100 years, the Muslim empire extended its sway east to Bengal and south to the Deccan and remained dominant in the Indian Subcontinent until 1707 CE.1, 2 These last few centuries of expansion of Muslim populations into India were accompanied by extensive religious conversion. Furthermore, the exodus of people from Western Asia, especially from Iran, in the form of mercenaries and businessmen led to significant cultural diffusion of Muslim traditions among the ethnic Indian populations. These Muslim immigrants, who were mostly males, reportedly married local Hindu females and generated a new admixed genetic pool, perhaps with sex-specific differences.2, 3 At present, Islam is the second most practiced religion in India after Hinduism, encompassing 13.4% (138 million) of the total Indian population (Census of India, 2001).

Classical genetic marker studies have revealed that most Indian Muslims are closely related to their neighboring non-Muslim populations, suggesting that they descend primarily from local Hindu converts.4, 5 The exceptions are some Northern and Northwestern Indian Muslims who differ from indigenous Hindu populations, likely because of a higher proportion of genetic lineages of external origin.4, 5, 6, 7 Consistent with historical data, which predict significant local female contributions, the only mitochondrial DNA study reported so far showed that North Indian Muslims exhibit the highest affinity to local Indian regional populations.8 Similarly, yet in contrast to the expected higher male contribution of outsiders, the Y-chromosomal evidence available so far has revealed predominantly local South Asian-specific lineages among Indian Muslims.8, 9 However, in our recent study based on autosomal STR markers, we detected genetic signatures characteristic of populations of the Middle East in some of the contemporary Indian Muslim populations.10

According to historical evidence, the Indian Subcontinent has been exposed to several waves of human migrations from the Arabian Peninsula and Iran, the homelands of Indian Muslim rulers.2 The Arabian Peninsula (where Islam was propagated) served as a hub for human migrations, hence the merged genetic signatures of Eurasian and African origin, which have been detected in both maternal11 and paternal12 lineages from the region. Besides Arabia, Iran is a second plausible genetic source for Indian Muslims. It is positioned in the tricontinental nexus and its populations genetically show close proximity to those from the Near East, although with a lesser genetic input from Africa than the populations of the Arabian Peninsula.13, 14 Unlike mtDNA and the Y chromosome, which show relatively low levels of differentiation between these two potential sources, recent studies of lactose tolerance have revealed that Iranian and Arabian populations differ significantly in genetic patterns at this locus.15, 16 The Arabian populations are characterized by a 50–60% frequency of a G−13915 allele, purportedly related to their consumption of camel milk.15 This allele has not been detected so far among Iranian populations, who instead, like populations from Europe and the Near East, show a moderate frequency of the T−13910 allele, which occurs at a significantly lower frequency in Arabia.15

The extent of gene flow associated with the spread of Islam in the Indian Subcontinent is still largely unknown. Two previous studies have assessed mtDNA and the Y-chromosome haplogroup composition of Indian Muslim communities from Uttar Pradesh and Andhra Pradesh and concluded that the spread of Islam in India was mainly a cultural phenomenon and was not accompanied by significant levels of gene flow from West or Central Asia.8, 9 However, a study of more Muslim populations with a wider geographical coverage, larger sample size and high-resolution informative genetic markers would be required to detect signals of minor genetic contribution. To assess the genetic ancestry of contemporary Indian Muslims, we screened six Muslim populations who follow Shia or Sunni faiths from three different geographical regions of India (Figure 1) with ancestry-informative markers from mtDNA, the Y chromosome and the LCT/MCM6 region.

Figure 1

Map of India showing the geographical location of the six Indian Muslim populations included in this study.

Materials and methods

Samples

In total, 472 Indian Muslim mtDNAs, 431 Indian Muslim Y chromosomes and 747 Indian Muslim and non-Muslim MCM6 gene profiles were used in this study. Samples were obtained with informed consent. We compared the mtDNA diversity in Indian Muslims with 15 949 mtDNA profiles from Indian non-Muslims,17, 18, 19 as well as from Pakistan,17 the Middle East,20 Central Asia,13 East Asia21, 22, 23 and Europe.24 We used 3696 previously published Y-chromosomal haplotypes of populations from India,25, 26, 27, 28, 29, 30 Pakistan,26 the Middle East,12, 14, 31 Central Asia,32 East Asia33 and Europe34 to compare with the studied Indian Muslim Y chromosomes. MCM6 gene variants in Indian populations were compared with 581 variants from Pakistan16 and the Middle East.15

mtDNA typing

The first hypervariable segment (HVS-I) of mtDNA was sequenced directly in all samples and variable positions were determined from nps 16 001 to 16 450. The second hypervariable segment (HVS-II) and haplogroup confirmatory diagnostic coding regions were sequenced for 472 samples on the basis of their haplotype information (Supplementary Table 1). In all, 12 samples were selected for whole mtDNA sequencing. The haplotypes defined by control region sequences and coding regions were haplogrouped by their mutational motifs (Supplementary Table 1), following previously published haplogroup trees.35, 36, 37, 38, 39 Complete mtDNA genomes and segments including diagnostic positions were amplified using 24 sets of primers.40 PCRs were carried out with 10 ng of template DNA in a 10 μl reaction volume with 10 pM of each primer, 100 μM dNTPs, 1.5 mM MgCl2 and 1 U of Taq DNA polymerase. Thirty-five cycles were performed with 30-s denaturation at 94°C, 30-s annealing at 58°C and 2-min extension at 72°C. The annealing temperature and time were slightly modified for a few sets of primers. PCR products were directly sequenced using the BigDye Terminator cycle sequencing kit and an ABI Prism 3730XL DNA Analyzer (Applied Biosystems, Foster City, CA, USA), following the manufacturer's protocol. The individual mtDNA sequences were compared with rCRS41 using AutoAssembler – ver 2.1 (Applied Biosystems). The sequences generated in this study have been deposited in the GenBank database (accession nos. FJ157366-FJ157837 (mtDNA HVS-I sequences), FJ157838-FJ157849 (complete mtDNA sequences)).

Y-chromosome typing

A total of 431 samples were typed with 23 Y-chromosomal markers (M89, YAP/M145, M96, M35, M78, M130, M356, M9, M45, M304, M172, M410, M69, M82, Apt, M170, M201, M173, M17, M124, M11, M214 and M175). The thermal cycling programs were set up with an initial denaturation at 95°C for 5 min, followed by 30–35 cycles at 94°C for 30 s, at a primer-specific annealing temperature of 52–60°C for 30 s and 72°C for 45 s, followed by a final extension at 72°C for 7 min. PCR products were directly sequenced using the BigDye Terminator cycle sequencing kit (Applied Biosystems) and the ABI Prism 3730XL DNA Analyzer, following the manufacturer's protocol.

LCT/MCM6 gene typing

A 400-bp fragment including the −13.9-kb region of the gene was PCR amplified with primers MCM6i13 and LAC-CL2, as detailed elsewhere.42 PCR products were sequenced using the MCM6i13 or LAC-CL2 primer and the BigDye Terminator cycle sequencing kit (Applied Biosystems) on an ABI Prism 3730XL DNA Analyzer.

Statistical analyses

Phylogenetic trees were constructed using Network 4.2.0.1 (www.fluxus-engineering.com).43, 44 The program Admix 2.0 (http://web.unife.it/progetti/genetica/Isabelle/admix2_0.html)45 was used to calculate the admixture proportions of samples on the basis of the frequency of haplogroups. The ages of the L0a2a2 and M52 lineages were estimated using a molecular clock46, 47 based on the synonymous mutation rate given by Kivisild et al47 and recalibrated by Soares et al,46 assuming a mutation rate of one synonymous mutation per 7884 years. PC plots were generated with MVSP 3.1 (http://www.kovcomp.co.uk/mvsp/index.html).48 Arlequin 3.1 (http://cmpg.unibe.ch/software/arlequin3)49 was used to evaluate the genetic structure of the populations by performing analysis of molecular variance (AMOVA), as well as to calculate genetic diversities of mtDNA and the Y chromosome on the basis of haplogroup frequencies.
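As a rough illustration of the molecular clock calculation, the sketch below converts an average count of synonymous mutations into an age using the rate of one synonymous mutation per 7884 years quoted above. It assumes the age is derived from the rho statistic (the mean number of mutations separating sampled sequences from the clade root), which the text does not state explicitly; the function name, input format and example counts are hypothetical and are not data from this study.

```python
from statistics import mean

# Rate quoted in the text: one synonymous mutation per 7884 years.
YEARS_PER_SYNONYMOUS_MUTATION = 7884


def rho_age(syn_mutations_per_lineage):
    """Return (rho, age_in_years) for a clade.

    syn_mutations_per_lineage holds, for each sampled sequence, the number
    of synonymous mutations separating it from the clade root
    (hypothetical input format, assumed for this sketch).
    """
    if not syn_mutations_per_lineage:
        raise ValueError("need at least one lineage")
    rho = mean(syn_mutations_per_lineage)
    return rho, rho * YEARS_PER_SYNONYMOUS_MUTATION


# Illustrative counts only, not taken from Figure 2 or Figure 3.
example_counts = [0, 1, 1, 2]
rho, age = rho_age(example_counts)
print(f"rho = {rho:.2f}, age ~ {age:.0f} years")
```

The sketch reports a point estimate only; the standard error usually reported alongside rho is omitted.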

Results

mtDNA comparisons of Indian Muslim and non-Muslim populations

We analyzed 472 samples for variation in mtDNA control regions and haplogroup-diagnostic coding region sites. Pooled haplogroup frequencies are shown in Table 1 and detailed haplogroup frequencies and definitions are given in Supplementary Tables 1 and 2 and Supplementary Figures 1 and 2. Altogether, haplogroups restricted to the Indian Subcontinent were observed at an average frequency of 63% in Indian Muslim populations as compared with 74% among the non-Muslim neighbors (Table 1). The average contribution of haplogroups of West Eurasian origin to Indian Muslims was 18%, which is not significantly higher than the value observed in non-Muslim populations (14%). In contrast, Iranian Shia Muslims exhibit a high frequency (54%) of West Eurasian lineages. It is interesting that the sub-Saharan African- and Arabian-specific L0a2a2 and R01 lineages were found only in Dawoodi Bohras (TN and GUJ), whereas these lineages were generally absent in Indian non-Muslims, although a related L0a2a2 lineage has been detected previously among the Sindhi population of Pakistan (Figure 2). The Central Asian lineages were found at a lower average frequency of 6% and the haplogroups U7 and W, which exist in similar frequencies in India and Iran, were observed at an average frequency of 6 and 3%, respectively, in Muslim populations. The gene diversity in Muslim populations ranged from 0.80±0.05 to 0.93±0.02, which is slightly higher than that among non-Muslim populations, 0.74±0.02 to 0.86±0.02 (Table 2), and reveals the prevalence of a comparatively high genetic diversity among Indian Muslims. We completely sequenced the mtDNA genome of nine M* samples, which harbor 16223–16275 substitutions in hypervariable segment I (HVS-I), to determine their potential source region. 
All nine samples were found to share common coding region variants, which enabled us to define a new autochthonous South Asian-specific haplogroup M52, which turned out to share a common origin with one of its sister branches, labeled here as M52a (Figure 3), detected among Indian non-Muslims. The same haplogroup has been recently reported in the Tharus of Nepal and in the Andhra Pradesh population.50 All nine sequences of Muslims are nested within the M52 lineage (Figure 3). Considering this phylogenetic structuring, the newly characterized haplogroup M52 is most likely to have an Indian rather than West Asian or Arabian origin. AMOVA yielded no statistically significant results for any group distinctions on the basis of religion (Indian Muslims and non-Muslims), geography (North India, South India and West India) or other criteria investigated (Supplementary Table 3).

Table 1 mtDNA haplogroup frequencies in six Indian Muslims and potential source populations
Figure 2

The phylogenetic tree of mtDNA haplogroup L0a2a2. Synonymous positions are marked with ‘s’ suffix and highlighted in bold. The age estimate is based on the molecular clock determined by Soares et al,46 assuming a mutation rate of one synonymous mutation per 7884 years. Dawoodi Bohra samples are reported in this study, whereas data for other populations are presented by the phylogenetic tree in Behar et al,39 and by the references to the original sources given therein. Indian denotes the Indian Subcontinent.

Table 2 Genetic diversity based on mtDNA and Y-chromosome analysis of Muslim populations of India
Figure 3

Phylogenetic tree reconstruction of the newly defined South Asian-specific haplogroup M52, based on 10 complete mtDNA genomes. This tree was redrawn manually from the output of median joining/reduced network obtained using NETWORK program (version 4.5) http://www.fluxus-engineering.com. The published sequence is shown with the initials of the first author (CS).37 Coalescent times were calculated by a calibration method described elsewhere.46 16182C, 16183C and 16519 polymorphisms were omitted. Suffixes A, C, G and T indicate transversions. Synonymous (s) and nonsynonymous (ns) mutations are distinguished. Recurrent mutations are underlined.

Y-chromosomal haplogroup profiles of Indian Muslim and non-Muslim populations

We genotyped 23 Y-chromosomal biallelic markers in a total of 431 Indian Muslims. All paternal lineages could be assigned to branches of the major haplogroups C, F and K (Figure 4 and Supplementary Table 4) according to Y-DNA haplogroup tree 2008,51 which are the three founder haplogroups commonly found in all continents outside Africa.52 Among the 17 Y haplogroups observed in Indian Muslims, as among the non-Muslims, R1a1 showed the highest frequency (31%), followed by haplogroup H (20%). The sub-Saharan African- and Arabian-specific paternal lineages E1b1b1a and J*(xJ2) were present in three Muslim populations (Indian Shia, Indian Sunni and Mappla) with an average frequency of 2 and 8%, respectively, whereas they were rare or absent among non-Muslim populations. Haplogroup G, which is common in the Middle East and rare or absent in Indian non-Muslim populations, was also present in three Muslim populations with an average overall frequency of 5%. The Y-chromosomal gene diversity in Muslim populations ranged from 0.58±0.06 to 0.86±0.01 and from 0.80±0.02 to 0.87±0.004 in non-Muslim populations (Table 2). When the paternal genetic structure of Indian Muslims was investigated by AMOVA, the geographical difference between Indian populations (North, South and West) was significant (5.08%, P<0.001), but the differences between religions (Muslims and non-Muslims) within India were not (P=0.08) (Supplementary Table 3). This reflects the large ‘among population within group’ variation in the analysis of Indian religious groups. There is notable variation between different Indian Muslim populations, some being highly similar to local Indian populations and others having similarities with external populations, so that when they are all grouped together as ‘Indian Muslims’, the group as a whole does not differ significantly from non-Muslims.
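The gene diversity values reported here and in Table 2 are computed from haplogroup frequencies. A minimal sketch of such a calculation follows, assuming Nei's unbiased estimator, H = n/(n−1) × (1 − Σp_i²); the exact options used in Arlequin are not given in the text, and the function name and sample labels below are illustrative only.

```python
from collections import Counter


def gene_diversity(haplogroups):
    """Nei's unbiased gene diversity for a list of haplogroup labels.

    H = n / (n - 1) * (1 - sum(p_i ** 2)), where p_i is the sample
    frequency of haplogroup i and n is the sample size.
    """
    n = len(haplogroups)
    if n < 2:
        raise ValueError("need at least two samples")
    counts = Counter(haplogroups)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Illustrative sample, not data from this study.
sample = ["R1a1"] * 5 + ["H"] * 3 + ["J2"] * 2
print(f"gene diversity = {gene_diversity(sample):.3f}")
```

A sample carrying a single haplogroup gives a diversity of 0, and the value approaches 1 as haplogroups become more numerous and evenly distributed.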

Figure 4

Rooted maximum parsimony tree of Y-chromosome haplogroups defined by binary markers, along with their frequency in six Muslim populations of India and their potential source populations. K*=K* (xL,M,NO,P); 3 of the 12 individuals with a K* affiliation were tested for M184 and all three showed the presence of the ancestral allele. Haplogroups E1b1b1a, G and J*(xJ2) are not specific to Indian populations. Markers shown in italics were not genotyped and are included in context for comparison populations. Comparative data are from *Sengupta et al,26 ¶Regueiro et al14 and §Cadenas et al.12

Analysis of the LCT gene

A total of 747 samples of Indian Muslim and non-Muslim populations were sequenced for a 400-bp fragment, which is ∼14 kb upstream of the LCT gene (Table 3). The C/T−13910 variant was widely observed among both the Indian Muslim (Shia 10%, Sunni 10%, Dawoodi Bohra (TN) 14%, Dawoodi Bohra (GUJ) 11%, Mappla 2% and Iranian Shia 4%) and non-Muslim populations (North India 19%, West India 23% and South India 10%). The Iranian population also exhibits the same mutation with 10% frequency.15 The Saudi Arabian-specific T/G−13915 variant15 was completely absent from the Indian population, yet at the same position, we observed a new T/C−13915 variant (Mappla 1% and South India 1%), which is likely to be an Indian-specific mutation.

Table 3 Allele frequencies of LCT/MCM6 variants in South and West Asia

Population affinities and admixture estimates

Genetic distance-based PC analyses of Indian Muslim and non-Muslim groups, compared with other world populations for both mtDNA and the Y chromosome, are shown in Figures 5a and b, respectively. In the mtDNA PCA plot (Figure 5a), Shia, Sunni, Dawoodi Bohra (GUJ) and Mappla were found to cluster together with Indian non-Muslim populations, whereas Dawoodi Bohra (TN) seems to be an outlier and Iranian Shia cluster with populations from the Middle East. The East Asian, Central Asian, Middle Eastern and European populations clustered separately according to their geography. In the Y-chromosomal plot (Figure 5b), Shia, Sunni, Dawoodi Bohra (GUJ) and Mappla form a group with their neighboring Indian non-Muslim populations and Europeans, whereas the Dawoodi Bohra (TN), again found as an outlier, and Iranian Shia Muslims seem to be genetically closer to the Middle Eastern group.

Figure 5

(a) Principal component analysis plot based on mtDNA haplogroup frequencies. UP, Uttar Pradesh; GUJ, Gujarat; AP, Andhra Pradesh; TN, Tamil Nadu. Comparative data references: N-India, W-India and S-India (17–19); Pakistan (17); Middle East (20); Central Asia (13); East Asia (21–23); Europe (24). (b) Principal component analysis plot based on Y haplogroup frequencies. UP, Uttar Pradesh; GUJ, Gujarat; AP, Andhra Pradesh; TN, Tamil Nadu. Comparative data references: N-India, W-India and S-India (25–30); Pakistan (26); Middle East (12, 14, 31); Central Asia (32); East Asia (33); Europe (34).

To obtain quantitative estimates of the Iranian versus Arabian contribution among Indian Muslim groups, admixture analysis was carried out with three putative parental populations, including (i) the geographically closest Indian Hindu population, and a pool of populations from (ii) Arabia or (iii) Iran. With these three putative parental populations, admixture analyses were carried out in two phases. Each phase comprised two parental populations, that is, (i) the geographically closest Indian non-Muslim population and Arabian population and (ii) the geographically closest Indian non-Muslim population and Iranian population. In the case of Dawoodi Bohra (TN) Muslims, admixture contributions were estimated with local populations from both Tamil Nadu and Gujarat because these Muslims are recent migrants from Gujarat settled in Tamil Nadu. The results of admixture analyses are given in Tables 4 and 5.
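The two-parent design described above can be illustrated with a simple frequency-based estimator. The sketch below is not the estimator implemented in Admix 2.0; it swaps in an ordinary least-squares fit of the model h = (1 − m) × local + m × external across haplogroups, where m is the external contribution. The function name, haplogroup labels and frequencies are hypothetical.

```python
def admixture_fraction(hybrid, local, external):
    """Least-squares estimate of the external contribution m.

    Each argument maps haplogroup label -> frequency. The model is
    hybrid = (1 - m) * local + m * external, fitted across haplogroups.
    The result is clamped to the interval [0, 1].
    """
    groups = set(hybrid) | set(local) | set(external)
    num = 0.0
    den = 0.0
    for g in groups:
        diff = external.get(g, 0.0) - local.get(g, 0.0)
        num += (hybrid.get(g, 0.0) - local.get(g, 0.0)) * diff
        den += diff * diff
    if den == 0:
        raise ValueError("parental populations have identical frequencies")
    return min(1.0, max(0.0, num / den))


# Illustrative frequencies only, not values from Tables 4 and 5.
local = {"M": 0.7, "U": 0.2, "H": 0.1}
external = {"M": 0.1, "U": 0.3, "H": 0.6}
hybrid = {"M": 0.58, "U": 0.22, "H": 0.20}  # built as 80% local + 20% external
print(f"external contribution = {admixture_fraction(hybrid, local, external):.2f}")
```

Running the estimator once with an Arabian pool and once with an Iranian pool as the external parent mirrors the two phases described above.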

Table 4 mtDNA – admixture proportions
Table 5 Y chromosome – admixture proportions

Both the maternal and paternal admixture contributions from the closest Hindu parental populations to the respective Shia, Sunni, Dawoodi Bohra (GUJ), Dawoodi Bohra (TN) and Mappla Muslim populations seem to be the highest, with only a minimal contribution from either Iran or Arabia (Tables 4 and 5). The exception is the group of Iranian Shias, who show major maternal (71%) and paternal (65%) contributions from Iranian populations (Tables 4 and 5). The sub-Saharan African- and Middle Eastern-specific lineages, such as L0a2a2 (mtDNA) and E1b1b1a (Y haplogroup), were observed among Dawoodi Bohra (TN) and Shia Muslim populations, with a frequency of 5 and 2%, respectively. These significant maternal and paternal lineages, atypical of Indian populations, can be attributed to the nominal Arabian and Iranian admixture contributions. The correlation between the admixture contributions from Arabia and Iran is positive, with high coefficient values (R²=0.982 for mtDNA and R²=0.939 for Y-chromosome biallelic markers), reflecting the similarity of the genetic composition of the two source pools and thus the poor power of the analysis to distinguish between the admixture contributions from the two (Figures 6a, b, 7a and b).
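For readers who wish to reproduce this kind of comparison, the sketch below computes the squared Pearson correlation between two sets of admixture estimates, one obtained with an Arabian parental pool and one with an Iranian pool. It assumes the reported values are squared Pearson coefficients, which the text does not state; the function name and the example estimates are hypothetical and are not taken from Tables 4 and 5.

```python
from math import fsum


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = fsum(x) / n
    mean_y = fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Illustrative estimates for six populations, not data from this study.
arabian_contribution = [0.05, 0.08, 0.02, 0.12, 0.04, 0.60]
iranian_contribution = [0.06, 0.10, 0.03, 0.15, 0.05, 0.70]
print(f"R^2 = {r_squared(arabian_contribution, iranian_contribution):.3f}")
```

A value close to 1 means that the two parental pools give nearly proportional estimates, which is why they cannot be distinguished as sources on this evidence alone.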

Figure 6

(a) The correlation coefficient between admixture contributions from Arabia and Iran to Indian Muslim populations based on mtDNA. (b) mtDNA – admixture proportions: correlation coefficient graph.

Figure 7

(a) The correlation coefficient between admixture contributions from Arabia and Iran to Indian Muslim populations based on Y chromosome. (b) Y chromosome – admixture proportions: correlation coefficient graph.

Discussion

Historical evidence suggests that Indian Muslims could have originated in two distinct ways: (i) military invasions that led to the establishment of Muslim kingdoms and subsequent immigration of mercenaries, businessmen and political emissaries from Middle Eastern countries, Iran and Arabia, followed by admixture with the local population; and (ii) cultural diffusion as a result of absorption and dominance that resulted in a sizeable population embracing Islam.1, 2, 3 In a nutshell, Indian Muslims could be either the descendants of Iranian and Arabian men who married local Hindu women or the descendants of local converts. We therefore sought to examine contemporary Indian Muslim populations for the occurrence of Middle Eastern genetic signatures, expecting them to be manifested primarily in the male line. For this, we chose six Muslim populations from three different geographical regions of India (Figure 1) that witnessed several human migrations, military invasions from the Middle East and proselytizing of native Hindu populations.1, 2, 3 Despite reported marriages between Muslim males and Hindu females,2, 6 the expected higher Y-chromosomal contribution from the Middle East to contemporary Indian Muslims was not found in this study. Unlike Muslim communities in China and Central Asia,53, 54, 55 which show a marked presence of Western Y chromosomes, Indian Muslims derive most of their Y chromosomes from local neighboring non-Muslim populations, suggesting a regional genetic affinity among Indian Muslim and non-Muslim populations. This suggests that the expansion of Islam in India happened mainly through religious conversion of local populations. In comparison with Indian Muslims following the Shia faith, recent Muslim immigrants from Iran (see Supplementary Text 1 for population history) who also follow Shiism show a genetic proximity to Middle Eastern populations.
This shows that this Muslim community maintains its native genetic pool with less genetic affinity to Indian populations. It is interesting that Dawoodi Bohras (TN) were found to exist as a separate genetic entity, with mtDNA lineages L0a2a2 (African specific) and B4a1a1 (Polynesian specific), when compared with other Indian Muslim groups. The sub-Saharan African/Arabian mtDNA lineage L0a2a2 can be linked to historical information (Supplementary Text 1) that Dawoodi Bohras belong to a Shia sect of Islam that purportedly migrated to India from Yemen, an area which is known to have a considerable frequency (3%) of African mtDNA lineages, including haplogroup L0a2.56 The alternative interpretation, that L0a2a2 could have persisted in South Asia since the out-of-Africa migration, is undermined by the young age estimate of L0a2a2 (Figure 2) and by the absence of this clade among Indian non-Muslim populations. The occurrence of the Polynesian mtDNA lineage B4a1a1 is in accordance with the oral history of the Dawoodi Bohras, which claims that some of their ancestors migrated to India from Thailand. Furthermore, detectable frequencies of other East Asian mtDNA haplogroups, F1a, F1b, F3b, MD, MD5a2 and MG2a, in some contemporary Indian Muslim groups are consistent with historically attested movements of Muslims from Central Asia and contacts with Southeast Asian Muslim communities.55, 57

The paternal haplogroups E1b1b1a, G and J*(xJ2), which are widespread across the Middle East and Arabia,12, 58 from where Islam was propagated, were found to occur at notable frequencies among some of the Indian Muslim groups. Although both maternal and paternal admixture estimates show maximal contribution from the local Indian non-Muslim parental populations, the contribution from Iranian and/or Arabian parental populations cannot be neglected (Tables 4 and 5). The wide spread of the LCT/MCM6 gene C/T−13910 variant among all Indian Muslim populations and the complete absence of the respective Arabian marker in this gene are consistent with gene flow occurring predominantly through Iran rather than through Arabia. Furthermore, these observations based on uniparental markers are congruent with our recent study on biparental STR markers,10 thus providing a comprehensive view of the genetic heritage of Indian Muslim populations.

Conflict of interest

The authors declare no conflict of interest.