Introduction

The colonization of the islands of the Pacific and Indian Oceans that started approximately 6,000 years ago represents one of the major human dispersals (Cann 2001). There are many theories surrounding the peopling of these islands and no single one enjoys full approval (Jin et al. 1999). Both archeological and linguistic data indicate that Austronesian-speaking populations were responsible for this expansion (Diamond 1988). Currently, there are 1,200 Austronesian languages and 200 million native speakers spanning approximately two-thirds the circumference of the globe (Bellwood 1991; Diamond 2000). Austronesian languages are found as far west as Madagascar, off the coast of Mozambique in east Africa, to the reaches of Easter Island in the remote east Pacific, and include the languages of the islands of Southeast Asia, such as those of the aborigines of Formosa (Kirch 2000). The discontinuous pattern of insular habitation led to social and linguistic isolation which resulted in unique and autonomous cultural evolution in distant islands. In turn, the art, linguistics, physical anthropology, and genetics of the present day native populations can be used to illuminate the Austronesian diaspora, a past that is complicated to reconstruct since no preserved Austronesian writings are dated earlier than 670 AD near the end of the language family’s expansion (Diamond 2000). From the compiled archeological evidence, the Express Train to Polynesia theory was formulated to explain the Austronesian expansion. According to this theory, in this major human spread, people migrated from southern China to Formosa, the Philippines, Indonesia, and finally to the coast of New Guinea and Polynesia (Diamond 1988). Yet, there is still some debate over whether the actual homeland is in Taiwan, southern China, or elsewhere, including eastern Indonesia (Oppenheimer 1998; Richards et al. 1998; Su et al. 2000; Lum 2001; Oppenheimer and Richards 2001). 
Depending on the actual source population, the island of Formosa may represent a gateway to the Austronesian expansion.

Taiwan, also known as Formosa from colonial times, has been home to Austronesian tribal groups since before 4,000 BC (Ruhlen 1994). Furthermore, Taiwan is the location of the greatest Austronesian linguistic variety with nine of the ten subgroups present, and thus, according to linguistic theory, is believed to be near the origin of the language family (Diamond 2000). The oldest archeological sites are in Taiwan and are known to belong to the Ta-p‘en-k‘eng culture, the earliest ceramic phase (Kirch 1997, 2000). This ceramic cultural tradition then diffused from Formosa through the Philippines and into the equatorial islands of Southeast Asia. The phylogeny of Austronesian languages derived from previous linguistic studies parallels the pattern of island settlement supported by this archeological data with Taiwan as the projected homeland (Gray and Jordan 2000).

There are nine tribal groups that currently constitute the aboriginal population of Formosa: Ami, Atayal, Bunun, Paiwan, Puyama, Rukai, Saisiat, Tsou, and Yami (Jin et al. 1999). They comprise approximately 1.5% of Taiwan’s total population, which currently numbers 21 million (Chu 1997; Jin et al. 1999). The remainder of the population consists mainly of the Han Chinese, who since ancient times have been instrumental in the displacement and redistribution of the native groups to the rugged inland mountains and the eastern coastline (Kao 1958; Chai 1967). The Atayal are Taiwan’s second largest tribe after the Ami. Currently, there are about 90,000 Atayal tribal members residing in a large area in northern Formosa.

There are few genetic studies involving the aborigines of Formosa, providing a limited understanding on the origins of each tribe and the relationship between the tribes and other worldwide populations (Sewerin et al. 2002). Some investigations have examined the blood types and their frequencies among the tribes (Ikemoto et al. 1942; Huang 1964; Nakajima and Ohkura 1971). Tajima and collaborators studied all nine of the aboriginal groups from Formosa and concluded that specific mtDNA lineages were introduced into Taiwan 11,000–16,000 years ago suggesting ancient migrations of two mtDNA lineages (Horai et al. 1995; Tajima et al. 2003). In addition, Horai’s group delineated three distinct clades of tribes representing the northern mountain (Atayal and Saisiat), southern mountain (Rukai and Paiwan), and the middle mountain/east coast (Bunun, Tsou, Ami, Puyama, and Yami) geographical zones (Tajima et al. 2003). Melton and colleagues found evidence supporting a common source for the tribes in south-central China and hypothesized that in 6,000–4,000 BC Neolithic proto-Austronesians migrated from mainland China to Taiwan bringing with them the mtDNA 9 base pair deletion (Melton et al. 1998). Melton et al. (1998) further suggested that since the origin of the “Polynesian motif” can be traced to Taiwan, the island of Formosa may be the origin of the expansion. The genetic influence of these aboriginal groups from Formosa in Melanesia has also been supported by other mtDNA investigations (Merriwether et al. 1999). Other mtDNA studies have suggested that the Indonesian archipelago may be the origin of the expansion some 17,000 years ago (Richards et al. 1998). Along these lines, Y-chromosome data have been interpreted to suggest that the origin for the Austronesian-speaking populations of insular Southeast Asia and Oceania may be the pre-Neolithic groups that first populated the area (Capelli et al. 2001). 
Some researchers have postulated that the Taiwanese aborigines may represent population isolates outside the mainstream of the Austronesian dispersal (Oppenheimer 1998; Su et al. 2000). ABO blood group data indicate strong genetic affinities among Austronesian groups from Melanesia and Polynesia that are not shared with non-Austronesian populations from the Pacific (Ohashi et al. 2004). It is evident that these studies provide conflicting results and portray a history that remains fragmented and vague.

More recently, Sewerin and colleagues focused on the genetic polymorphisms at six point-mutation loci in the Ami tribe and provided evidence for the genetic uniqueness of the group (Sewerin et al. 2002). The results of studies such as Sewerin’s have generated a series of questions involving the intertribal associations and the relationships between the aborigines and other Austronesian-speaking populations. In the present study, using the same six genetic markers, we provide information on the genetic diversity and phylogenetic affinities of the Austronesian-speaking Madagascar population. In addition, we report on the genetic relationship between the Atayal and the Ami, along with the relationships of both tribes to other Austronesian populations from Bali and Java. Genotypic and allelic frequencies at these loci are reported for the first time for the populations of Bali, Java, and Madagascar and for the Atayal tribe of Taiwan, and are compared to data from eight other geographically targeted worldwide reference groups.

Materials and methods

Populations studied

Sixty-nine unrelated individuals from the east-African island of Madagascar were sampled. This collection from Madagascar represents the general population of the island. In addition, 40 unrelated individuals from the Atayal tribe of Formosa were collected. The Atayalan samples were procured from several villages within their traditional territory, as illustrated in Fig. 1. Also studied were 24 individuals from Java and 34 from Bali. The population from Bali was collected in sites from throughout the 2,200-square-mile island. In Java, the sampling was performed in locations all over the island. Java and Bali are both islands within the borders of Indonesia whereas Madagascar is located about 250 miles off the southeast coast of Africa and approximately 6,000 miles from the island of Formosa. Individuals from Madagascar are known to speak Malagasy, an Austronesian language. Figure 1 illustrates the geographical position of the populations studied. These four populations were compared with the data from eight other worldwide populations previously reported in the literature. Individuals were identified as Atayal members by tracing back biographical information for at least two generations. Table 1 lists the populations examined and their location along with their reference.

Fig. 1 Geographical map of the populations studied with expanded map of Taiwan and the distribution of its aboriginal groups

Table 1 Name and origin of populations

Blood collection and DNA purification

The samples were collected as whole blood in EDTA Vacutainer tubes in adherence to the guidelines set forth by Florida International University’s Institutional Review Board. Cells were lysed and leukocyte nuclei were isolated by centrifugation, followed by digestion of nuclei with proteinase K as previously described (Antunez de Mayolo et al. 2002). Total genomic DNA was then isolated by a standard phenol/chloroform extraction and ethanol precipitation (Luis et al. 2003).

DNA amplification and genotyping

The extracted DNA was amplified by polymerase chain reaction (PCR) using the AmpliType PM-DQα1 PCR Amplification and Typing Kit (Perkin Elmer Corp, Norwalk, CT, USA) using the conditions specified by the manufacturer. PCR was performed using a Perkin–Elmer 480 thermal cycler. Following amplification, samples were screened for successful amplification by electrophoresis in a 1× TAE, 2% agarose gel followed by ethidium bromide staining and ultraviolet (UV) visualization. The HLA-DQα1, LDLR, GYPA, HBGG, D7S8, and GC loci were then genotyped for each sample. The chromosomal locations of the loci are: HLA-DQA1, 6p21.3; LDLR, 19p13.1-13.3; GYPA, 4q28-31; HBGG, 11p15.5; D7S8, 7q22-31.1 and GC, 4q11-13 (Sewerin et al. 2002). Typing of these samples involves reverse dot blot technology with allele-specific oligonucleotide probes bound to strips that allow the typing of multiple loci at one time. All alleles can be typed from a single PCR reaction. These loci have been extensively investigated in population genetics studies and were found to satisfy Hardy–Weinberg equilibrium expectations (Sewerin et al. 2002).

Phylogenetic and statistical analyses

For all loci, genotypic and allelic frequencies were calculated using the gene-counting method (Li 1976). To ascertain the phylogenetic relationships between populations, Maximum Likelihood (ML) phylogenies based on the allelic frequency distributions of the loci were generated using the PHYLIP 3.52c program (Felsenstein 1993). Bootstrap consensus phylogenies (1,000 replications) were generated with the SEQBOOT and CONTML programs of PHYLIP. The CONTML and CONSENSE programs determined the best-fit tree.
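As an illustration of the gene-counting step described above, the following is a minimal sketch in Python. The function names and the sample genotypes are hypothetical and are not taken from the study data; the sketch only shows how allelic and genotypic frequencies follow from directly counting the two alleles carried by each diploid individual.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies by direct gene counting.

    genotypes: iterable of (allele, allele) tuples, one per diploid individual.
    Returns a dict mapping each allele to its relative frequency.
    """
    counts = Counter()
    for first, second in genotypes:
        counts[first] += 1
        counts[second] += 1
    total = sum(counts.values())  # equals 2N for N individuals
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


def genotype_frequencies(genotypes):
    """Relative frequency of each unordered genotype."""
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no genotypes supplied")
    return {g: c / n for g, c in counts.items()}


# Hypothetical sample of 10 individuals at a biallelic locus (alleles "A" and "B").
sample = [("A", "A")] * 3 + [("A", "B")] * 5 + [("B", "B")] * 2
allele_freqs = allele_frequencies(sample)
genotype_freqs = genotype_frequencies(sample)
```

In this hypothetical sample, allele A is counted 11 times out of 20 gene copies, so its estimated frequency is 0.55.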

A Principal Component (PC) analysis was performed using the numerical taxonomy and multivariate analysis system (NTSys) PC program to summarize genetic relationships among the populations (Rohlf 2002). Centroid analysis was conducted to examine the relative gene flow experienced by populations and/or effective population size (Harpending and Ward 1982). The centroid model assumes an island model of population structure and expects a linear relationship between heterozygosity and genetic distance from the centroid. The centroid is defined as the overall mean allelic frequency of the populations. The theory is particularly useful in detecting and analyzing outliers. If a population is receiving gene flow from elsewhere at a higher-than-average rate, then the heterozygosity would be higher than expected (Sewerin et al. 2002). Those populations would plot above a linear regression line. Conversely, if the population has remained genetically isolated, the heterozygosity would be lower than expected due to less-than-average gene flow and would segregate below the linear regression line (Sewerin et al. 2002). Alternatively, populations plotting above or below the regression line may be indicative of higher- or lower-than-average, respectively, effective population size. The BIOSYS II program was used to generate expected heterozygosities and to detect any deviations from Hardy–Weinberg equilibrium expectations using the chi-square deviation test. Power of discrimination (PD) values were also determined for all six loci. G-tests were performed with 2×2 contingency tables to ascertain statistical significance and determine whether the populations are homogeneous with each other (Lewontin and Felsenstein 1965).
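The logic of the centroid analysis can be sketched as follows. This is an illustrative Python sketch, not the procedure actually run in the study: it assumes the commonly cited form of the Harpending and Ward (1982) model, in which the expected heterozygosity of population i is H_T(1 - r_i), where H_T is the heterozygosity at the centroid and r_i is the mean, over alleles, of (p_i - p̄)² / (p̄(1 - p̄)). The function names, the unweighted centroid, and the allele frequencies are all hypothetical. A positive residual corresponds to a population plotting above the expected line, a negative residual to one plotting below it.

```python
def heterozygosity(loci):
    """Mean expected heterozygosity, 1 - sum(p^2), averaged over loci."""
    return sum(1.0 - sum(p * p for p in locus) for locus in loci) / len(loci)


def centroid_analysis(populations):
    """Compare observed heterozygosity with the centroid-model expectation.

    populations: dict mapping a population name to a list of loci, each locus
    being a list of allele frequencies (same loci and allele order throughout).
    Assumption: the centroid is the unweighted mean allele frequency.
    """
    names = list(populations)
    n_pops = len(names)
    n_loci = len(populations[names[0]])

    # Centroid: mean frequency of every allele across populations.
    centroid = []
    for l in range(n_loci):
        n_alleles = len(populations[names[0]][l])
        centroid.append([
            sum(populations[name][l][a] for name in names) / n_pops
            for a in range(n_alleles)
        ])
    total_h = heterozygosity(centroid)

    results = {}
    for name in names:
        terms = []
        for locus, mean_locus in zip(populations[name], centroid):
            for p, p_bar in zip(locus, mean_locus):
                if 0.0 < p_bar < 1.0:  # skip alleles fixed or absent overall
                    terms.append((p - p_bar) ** 2 / (p_bar * (1.0 - p_bar)))
        distance = sum(terms) / len(terms) if terms else 0.0
        observed = heterozygosity(populations[name])
        expected = total_h * (1.0 - distance)
        results[name] = {
            "distance": distance,
            "observed": observed,
            "expected": expected,
            "residual": observed - expected,
        }
    return results


# Hypothetical allele frequencies at one biallelic locus in three populations.
example = {
    "A": [[0.5, 0.5]],
    "B": [[0.3, 0.7]],
    "C": [[0.7, 0.3]],
}
centroid_results = centroid_analysis(example)
```

Note that this sketch uses the theoretical expectation as the reference line, whereas the plot in this study is read against a fitted linear regression; the interpretation of points above and below the line is the same in both cases.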

Results

Genetic parameters of Atayal, Bali, Java and Madagascar populations

Table 2 displays the allelic and genotypic frequencies for the LDLR, GYPA, HBGG, D7S8, and GC loci for all four populations. Allelic and genotypic frequencies for the HLA-DQα1 locus are shown in Table 3. The expected and observed heterozygosities for all six loci in the four populations are presented in Table 4. The observed heterozygosities range from 10% (HBGG) to 83% (HLA-DQα1), both exhibited within the Atayal population. With one exception, all loci in the four populations were found to be in Hardy–Weinberg equilibrium (Table 5). The one exception was GYPA in the population from Java, which, after a Bonferroni correction (significance threshold of 0.008), no longer deviated significantly from Hardy–Weinberg expectations. PD values are presented in Table 6. The Atayal population exhibits the largest range in PD, from 0.176 at the HBGG locus to 0.913 at the HLA-DQα1 locus. The six-locus G-test indicated that the differences are statistically significant (P<0.0001) for all pair-wise comparisons involving the four populations reported in the present study as well as the eight reference groups.
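The reasoning behind the Bonferroni-corrected result can be illustrated with a short Python sketch. It assumes that the 0.008 threshold corresponds to a nominal 0.05 level divided by the six loci tested, which the text does not state explicitly, and it handles only a biallelic locus with one degree of freedom. The function name and the genotype counts are hypothetical and are not the Java data from Table 2; the study itself used the BIOSYS II program.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic locus.

    Returns (chi-square statistic, p-value for 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no individuals supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # allele frequency by gene counting
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


ALPHA = 0.05   # assumed nominal significance level
N_LOCI = 6     # number of loci tested per population
bonferroni_alpha = ALPHA / N_LOCI  # about 0.0083

# Hypothetical genotype counts (AA, AB, BB) for 48 individuals.
chi2, p_value = hwe_chi_square(16, 16, 16)
deviates_nominally = p_value < ALPHA
deviates_after_correction = p_value < bonferroni_alpha
```

With these hypothetical counts the deviation is significant at the nominal 0.05 level but not at the corrected threshold, which is the same pattern reported for GYPA in the Java sample.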

Table 2 Genotypic and allelic frequencies for loci LDLR, GYPA, HBGG, D7S8, and GC
Table 3 HLA-DQα1 genotypic and allelic frequencies
Table 4 Expected and observed heterozygosities for all six loci and all four populations
Table 5 Hardy–Weinberg equilibrium tests for all six loci
Table 6 Power of discrimination values for all six loci

Phylogenetic analyses

Figure 2 displays the ML phylogeny and bootstrap values. In the resulting dendrogram, four clusters were observed: (1) Zimbabwe, African American, and Madagascar; (2) Caucasian; (3) Ami and Native American; and (4) Atayal and Han Chinese. The Atayal group segregates with the Han Chinese whereas the Ami tribe clusters with the Native Americans. The Madagascar population segregates intermediate between the east-African and east-Asian groups. All bootstrap values, except for one, were at or above 50%.

Fig. 2 Maximum-likelihood (ML) tree illustrating human phylogenetic relationships. Tree was generated using allelic frequency data of ten populations, using PHYLIP 3.52c. NS North Slope Borough population

Figure 3 displays the PC analysis. PC 1 (the x-axis) and PC 2 (the y-axis) represent 37% and 27% of the total variability, respectively. The Caucasian populations segregate in the upper right region of the map. PC 1 separates the African American, Zimbabwe, and Madagascar groups from the rest of the populations. The Native American groups cluster together at the extreme right of the plot and include Na-Dene and Eskaleut populations. The Han Chinese, Bali, and Atayal plot loosely together at the lower center of the map, with the population from Java at the periphery. The Ami represent an outlier, segregating by themselves in the lower right-hand quadrant. Within Fig. 3, the populations of Bali and Java segregate apart from each other, with Bali plotting closer to the Atayal and Ami. Java, on the other hand, plots at the fringes of the Caucasian group. Interestingly, the Madagascar population plots midway between the African groups and the loose cluster of Atayal, Bali, and Han Chinese.

Fig. 3 Principal component (PC) map. Worldwide populations from diverse geographical areas were used in the analysis. PC1 and PC2 represent the first and second PC values, respectively. PC1 and PC2 account for 37% and 27% of the variability, respectively

Figure 4 depicts the centroid analysis of the Atayal, Bali, Java, and Madagascar groups along with the eight other reference populations. The Caucasian and African populations plot above the linear regression whereas the Native American populations are located beneath the regression line. As outliers, the Ami and Atayal are the most distant groups under the regression line. The Han Chinese, Java, and Zimbabwe map nearly on the regression line while Madagascar plots above.

Fig. 4 Centroid gene flow analysis plot. The heterozygosity of each population (y-axis) is plotted versus the centroid (overall mean allelic frequency of population) (x-axis). Plot includes eight worldwide reference populations

Discussion

In this study, six polymorphic loci containing point mutations were examined in four geographically targeted populations from the Austronesian language family. Because the island of Formosa has been hypothesized as the potential Austronesian homeland, studies of the country’s aborigines are of paramount importance (Bellwood 1991). It is likely that the diversification of the Austronesian language family occurred in Taiwan (Diamond 2000) and possibly only one of the groups is responsible for the successive colonization of other islands during the diaspora. This idea emphasizes the importance of ascertaining the relationships among tribes as well as between the tribes and other Austronesian groups and worldwide populations (Tajima et al. 2003).

Figure 2 depicts an ML tree with four clusters. One clade contains populations from Africa (Zimbabwe and Madagascar) and of African descent (African American). In another cluster, the Native Americans segregate with the Ami, while the Atayal group clusters with the mainland Han Chinese in a third clade. Caucasians are found together in a fourth cluster. It is not surprising that the Zimbabwe, African American, and Madagascar populations are found together within a cluster. The island of Madagascar lies just off the east coast of Africa, in close proximity (about 500 miles) to Zimbabwe. Yet, the relationship exhibited by the groups from Africa and the East Asian/Native American populations underscores two important issues. First is the separation of the Madagascar group away from the Zimbabwe and the African Americans in the African clade and its proximity to the Atayal and Han Chinese. It is significant that the Zimbabwe, an East African population, is genetically closer to African Americans, an admixed group with a major West African genetic contribution, than to the sample from Madagascar just off the coast of east Africa. It is possible that the Madagascar population’s genetic affinity to the Atayal and the continental East Asians (Hans) may be the result of the Austronesian migration into Madagascar approximately 3,200 years ago (Ruhlen 1994). The Ami aborigines, on the other hand, segregate farther from the Madagascar group, in a cluster with two Native American groups. These results point to the Atayal and not the Ami aborigines as a stronger candidate for the Formosan source population responsible for the westward Austronesian dispersal. The Navajo and the Alaskan Eskimos, which group close to each other, are recent arrivals (approximately within the last 2,000 years) to the New World representing the distinct language groups Na-Dene and Eskaleut, respectively. The segregation of these African and East Asian/Native American groups in the ML dendrogram mirrors the genetic affinities reflected in the PC plot discussed below.

The segregation of the Atayal tribe with the Han Chinese supports the connection between the Austronesian language family and Southeast Asia. Previous studies, using 13 classical loci, indicate that the Toroko, a branch of the Atayal, have higher affinities to the general populations from the Philippines and Thailand than to the groups from southern China and Vietnam (Chen et al. 1985). It is interesting to note that the two Taiwanese aboriginal groups segregate into different clades. In phylogenetic studies using mtDNA haplogroups, the Ami and the Atayal also segregate into distinct clades (Tajima et al. 2003). The Ami and the Atayal have been living for thousands of years in close proximity in adjacent territories (see Fig. 1) on the island of Formosa. There are two possible explanations for this. One is that their differences may reflect separate origins from diverse populations that arrived in successive waves to inhabit Formosa followed by isolation, and the second is that these variations may simply be due to subsequent geographical partitioning and genetic differentiation of tribes from a common origin. The known strong cultural and linguistic differences could have generated barriers capable of preventing gene flow between the two groups preserving their uniqueness.

The PC plot, a two-dimensional illustration of allelic variability between the populations, is depicted in Fig. 3. The first and second PCs together account for 64% of the total variation. Within the plot, four groups are evident: (1) Caucasians; (2) a scattered set including the Atayal, Bali and Han Chinese; (3) Native Americans; and (4) Africans and populations of African descent. As with the ML analysis, the population from Madagascar clusters away from the African Americans and the Zimbabwe group, and toward the Atayal, Bali and Han Chinese. Also as observed in the ML study, the intermediate position of the Madagascar group between the African groups and the East Asian populations may reflect the Austronesian contribution to the island population. The large genetic distance between the Ami and the Atayal belies the fact that these tribal groups are close neighbors and corroborates the ML data. Again, independent source populations in mainland Asia and/or extreme genetic drift and/or genetic isolation may be at least partially responsible for their phonetic differences. It is surprising that the Bali and Java groups plot distantly from each other in spite of their geographic proximity. These populations are clearly genetically distinct even though they inhabit adjacent islands only approximately 10 miles apart and belong to the same Austronesian language family. Compared with Bali, the island of Java is approximately 23 times larger and more culturally diverse. Subsequent to the Austronesian diaspora, Hindus, Buddhists, and Muslims have invaded Java. Today, Bali is predominantly Hindu. Greater gene flow from different groups and/or a larger effective population size in Java may explain the observed differences between these two islands. The position of the Java population above and the Bali group below the linear regression in the centroid analysis (see next paragraph) corroborates this scenario. In addition, these populations may have undergone extreme drift and/or genetic isolation subsequent to migration into the two islands.

In the centroid analysis (Fig. 4), populations that plot above the regression line are expected to possess larger effective size and/or experience more gene flow than those below the regression line. The African American group plots above the regression line, which is consistent with the fact that this population is highly admixed. Madagascar may also have experienced high levels of gene flow from diverse sources. It is likely that the Austronesian groups that made it to Madagascar admixed with the populations originally from mainland Africa. The position of the Zimbabwe group slightly above the linear regression most likely is the result of the greater diversity and heterozygosity of sub-Saharan African populations (Cavalli-Sforza et al. 1994). Within the plot, all Caucasian populations fall above the regression line. The Han Chinese and Java lie above, but nearly on, the line. The Native Americans plot below the regression line, with the Alaskan population closest to the line. The Bali population is positioned below the linear regression line as well. The location of the Java group above the linear regression and the Balinese below is in agreement with their relative effective population sizes and corroborates the contention that Java has experienced greater gene flow from different source populations. The Ami and Atayal are outliers located farthest below the line, proximal to each other. These data, together with the ML and PC results, indicate that while these two groups are genetically different, they both experienced little gene flow and have remained separated. Population size may have also had an influence, as both populations are small and limited in diversity and heterozygosity. Data from several serum protein loci corroborate the low genetic diversity among Taiwanese aborigines (Yuasa et al. 2001). These genetic data support the idea that the aboriginal tribes of Taiwan have remained isolated from the Han Chinese population and from each other. The fact that these two tribes have different material cultures and social organizations substantiates this conclusion (Chai 1967).

As demonstrated by the ML and PC analyses, the Atayal and the Ami aboriginal groups are genetically distinct from each other. These results possibly indicate that the Ami and Atayal had different ancestral source populations originating in mainland Asia, followed by cultural isolation. The centroid analysis argues for a small effective population size along with a limited amount of gene flow for both the Ami and the Atayal. Genetic drift due to founder effect and/or isolation may also be responsible for the differences between the two aboriginal groups. Also, based on the dendrogram and the PC analysis, the Madagascar population exhibits an intermediate genetic relationship between the African/African descent groups and the Atayal/Han Chinese cluster, which may be due to an Austronesian influence on the island some 3,200 years ago. It is significant that Madagascar segregates in the direction of the East Asians in the PC plot and away from Zimbabwe, which is geographically closer by an order of magnitude. Madagascar displayed genetic affinity with the Atayal while maintaining a larger genetic distance from the Ami. These data are consistent with previous archeological and linguistic evidence and support a westward expansion of Austronesians, originating from or near Formosa, that reached the island of Madagascar.