Introduction

Taiwan is ethnically diverse1 and approximately 23.5 million people live there. While Mandarin is the official language, over 16 languages are spoken, principally belonging to two linguistic families: Austronesian (AN) and Sino-Tibetan (Mandarin, Minnan, and Hakka). During the last glacial maximum (LGM) over 20,000 years ago (20 kya), ice sheets covered much of the Northern Hemisphere, the sea level was low, and Taiwan still connected the East Asian continent. The discovery of Paleolithic artifacts of a tool industry at the “Changbin culture” site in Taitung and ancient human remains (bones and teeth of the Tso-Chen Man) in present-day southeast Tainan allowed archaeologists to show that Taiwan entered the Paleolithic era approximately 28 kya2,3,4.

The rising temperature toward the late Paleolithic era resulted in the rapid receding of ice and rising sea level to approximately 100 meters. Most locations in western Taiwan, a lowland region with diverse plains and hilly landscapes (Fig. 1), were once coastal locations until the sea level lowered to the present-day coastline. Along with climate change, the birth of agriculture, the abundance of food, organized communities, human migrations, and the progress of navigation began5.

Fig. 1: Location of Non-Austronesian-speaking groups of Taiwan (NAN_TW).
figure 1

A NAN_TW are dots colored pink; they comprise lowland and offshore Taiwanese groups (LL_Tw) and urban Taiwanese (Ur_Tw, composed of Minnan and Hakka individuals from New Taipei city, Kaohsiung city and Neipu Township in Pingtung County). The crosshatched area denotes locations of the Austronesian-speaking indigenous people of Taiwan (AN_Tw). B Basal haplogroup sharing across eleven groups. Detailed haplogroup frequencies are shown in Supplementary Table S1.

The Neolithic era of Taiwan (4,000 BC) saw the first settlement of sedentary groups3 from eastern South China. These groups spoke mutually intelligible dialects derived from Proto-Austronesian (PAN)6 and settled separately all over the island. Isolation by distance or geographic landscape caused cultures and languages to drift apart. Today, linguists recognize nine distinct local branches of PAN. These are still actively used by the indigenous peoples of Taiwan (or Austronesian speakers of Taiwan, AN_Tw), who represent only 2.3% of the Island population1,7. A tenth branch of the Austronesian language family, the Malayo Polynesian1,7, developed in the Batanes Islands (Ivatan) and Luzon in the Philippines (2200 BC)7,8. It rapidly divided further into subbranches found nearby in Malaysia, North Vietnam, Indonesia, the Bismarck Archipelago, and the Pacific Ocean, then reached New Zealand less than a thousand years ago, and finally Madagascar in the Indian Ocean8.

This linguistic diaspora was most likely also the result of trading networks. It started as early as 2800–2200 BC with the nearby Penghu Islands (Pescadores Islands) in the Taiwan Strait because of the need of the late Tabenkeng (or TPK) people in Taiwan to obtain material from the Penghu Island quarries to build their basalt adzes8.

Along with these trading networks, new arrivals soon introduced millet and wet rice agriculture8,9,10,11. The improved food production contributed to population density. Ultimately, from 500 BC to 500 AD, these trading networks included the Philippines, mainland Southeast Asia, India, and even widespread trades within Taiwan. Imports, such as metal technology, all came from mainland Southeast Asia (MSEA). Fujian, Guangdong and Taiwan were not yet part of the Han, and no contact with “Han-China” existed yet12. A millennium later, Taiwan became part of an even more active East and Southeast Asia maritime network involving Fujian merchants, pirates, and Japanese traders and spread further in Southeast Asia (SEA) and MSEA13. Along with these traders, fishers from the China Sea and the Taiwan Strait often visited Matsu, Penghu Island, and the west coast of Taiwan, where some settled13.

The end of six millennia of a uniquely Austronesian-speaking place and the beginning of Taiwan’s recorded history started with the Dutch in 1,624. To expand their deerskin trade, they established a colonial regime among more than 130 lowland villages of Indigenous peoples. The Spanish later established a base in northern Taiwan14. From 1895 to 1945, the Japanese instituted fifty years of colonial rule. Interestingly, the elements of Japanese culture are still part of Taiwan’s everyday life.

In the past few decades, molecular geneticists have characterized the genetic diversity of most Asian populations by first analyzing the highly polymorphic autosomal HLA system15. These studies allowed scientists to use the polymorphism of East Asian populations to investigate a possible relationship between genetics and Austronesian language families in Southeast Asia16. Then, geneticists used the maternal mitochondrial DNA (mtDNA) and paternal Y chromosomal (Y-SNP and Y-STR) gene systems. These uni-parental gene systems never recombine; they are exclusively inherited maternally or paternally. In addition to the linguistic relationship with genetics, many studies have used these gene systems to characterize gene flow, mixture, and the origin of many autochthonous groups17,18 and have more recently used DNA from ancient human remains to retrace the origin of archaic humans in East Asia19,20.

To provide more insights into the distribution and evolutionary history of East Asian populations, several research groups compared high-coverage whole-genome sequencing and genome-wide SNP autosomal data of East Asians, Taiwanese Han, and indigenous people of Taiwan and Island Southeast Asia21,22,23. The three research groups21,22,23 showed that Taiwanese Han had a low level of admixture with autochthonous Taiwanese and were descendants of several lines of ancestry spatially distributed throughout continental Asia. Their finding was generally consistent with the self-information given by the participants. Importantly, they produced documentation about known disease risk variants potentially affecting specific groups21.

However, only a few complete mtDNA sequences from non-Austronesian-speaking Taiwanese (NAN_Tw) people are available in the literature19. Rapid advances in complete genome sequencing also allow sequencing and analysis of large numbers of individuals. We used these methods to obtain 672 partial mitochondrial genomes of unrelated NAN_Tw individuals from main urban centers (Ur_Tw) and lowland locations all over the island (LL_Tw). Of these, 84 were used for complete mitochondrial genome sequencing. This study proposes to analyze the human demography, genetic diversity, and matrilineal ancestry of NAN_Tw (Ur_Tw and LL_Tw) individuals to characterize possible late Neolithic settlements and local expansion of non-Austronesian speakers in Taiwan. In doing so, this project is the first comprehensive mitochondrial DNA analysis of non-Austronesian-speaking Taiwanese individuals.

Materials and methods

Whole blood or buccal swabs were collected by L.M. from 622 (499 LL_Tw and 123 Ur_Tw) healthy, unrelated NAN_Tw individuals (Table 1, Fig. 1, Supplementary Table S1) whose grandparents were from the same locality. We performed DNA preparation using the QIAamp DNA Blood Mini Kit (Qiagen Inc. Chatsworth, California, United States) according to the procedure recommended by the manufacturer with minor adjustments. Complete mtDNA genomic and partial sequencing (HVS-I at nps 16001 to 16569 and coding regions at nps 8001 to 9000 and 9801 to 10900) were performed on both strands using the Perkin-Elmer/Applied Biosystems Division (ABI Taiwan) DyeDeoxy Terminator Cycle Sequencing Kit (Foster City, California, United States) according to the recommendations of the manufacturer on an automated DNA sequencer (ABI Model 377). With samples from the literature (Table 1 and Supplementary Table 1), the NAN_Tw group comprised 767 individuals (i.e., 218 Ur_Tw and 499 LL_Tw, including offshore locations from this study and 50 LL_Tw from Ko et al. 2014)19. Individuals from the Matsu archipelago (n = 50)24 and the Fujian province (n = 148)25,26 represented East Coast China. Complete mitochondrial genomes from the literature included Vietnam (n = 609), Thailand (n = 560), Japan (n = 664), Northeast China (n = 257), Southeast China (n = 65), Malaysia (n = 87), and 260 individuals from the Philippines19,27,28,29,30,31,32 (Table 1).

Table 1 Sample data.

Sequence regions (coding regions nps 8001 to 9000 and 9801 to 10900, and HVS-I region nps 16001 to 16519) from this study and samples from the literature were concatenated and aligned. Assignation of mtDNA to haplogroups of the samples was conducted using HaploGrep 2.0 software along with PhyloTree build 1733. Ambiguous haplogroup assignments were resolved using complete mtDNA genome sequencing (Supplementary Table S1).

Statistical analyses

Population structure and demographic history

The program Arlequin 3.534 was used to determine intrapopulation haplotype frequencies, pairwise fixation indices (Fst), gene diversity (H), nucleotide diversity (π), and the gene flow exchanged between groups (M = Nem) (Table 3). The analysis was performed using the aligned concatenated coding regions at nps 8001 to 9000 and 9801 to 10900 and HVS-I at nps 16051 to 16400 from this study and the matching regions from complete genome datasets obtained from the literature (Table 1).

Gene contributions from two putative parental populations (Han, and Austronesian speaker of Taiwan) as well as the unshared gene portion in the hybrid populations were inferred using the analysis of shared haplogroups between populations35 (Supplementary Table S2). The putative Han parental group comprised Northeast, East and South China, and the putative Austronesian group comprised all AN_Tw haplogroups not shared with the putative Han parental group (Supplementary Table S1).

Past population expansions were tested using Tajima’s D and Fu’s Fs neutrality tests36,37 with p values generated using 1,000 coalescent simulations under a model of selective neutrality. Multidimensional scaling analysis (MDS) of the groups was performed with SPSS software version 1738 using pairwise fixation indices Fst (Arlequin 3.5)34 (Fig. 2A). To explore the genetic variability within individuals, we applied a discriminant analysis of principal components (DAPC) using R software with the adegenet package v. 3.3.139.

Fig. 2: Multidimensional Scaling (MDS) and Discriminant Analysis of Principal component (DAPC).
figure 2

A MDS of population groups: Distances are Fst distances calculated from nucleotide differences. Groups with fewer than 20 individuals were removed. B DAPC plots: Each dot represents an individual. Groups are inertia ellipses color-coded as in the MDS plot. Each group around the central DAPC plot is compared to the LL_Tw. MSEA, Ur_Tw and China showed the most affinity with LL_Tw. Note: East Coast China includes individuals from the Fujian province of China and the Matsu Archipelago.

Coalescence time estimates

mtPhyl software (https://sites.google.com/site/mtphyl/) was used to construct a phylogenetic tree comprising 77 unique complete mtDNA sequences from NAN_Tw. Coalescence time estimates of all clades comprising more than three subbranches, including literature data (Table 1), were calculated using the ρ statistic-based method40 with Gompert’s function for the complete mtDNA genome, as follows:

$$\left[ {{{{\mathrm{Time}}}}\left( {{{{\mathrm{ka}}}}} \right) = \left[ {{{{\mathrm{m}}}}\rho \,{{{\mathrm{exp}}}} - \left( {{{{\mathrm{exp}}}}\left[ { - 0.0263\left( {\rho + 40.28} \right)} \right]} \right)} \right]} \right.$$

where m is the substitution rate estimate of one site per 3.624 kya41. Furthermore, a combined substitution rate of 5.27 E−8 per nucleotide per year (or one substitution per 8,940 years) was used when dating the 2550 concatenated nucleotides (nps 8001 to 9000; 9801 to 10,900 and 16,051 to 16,400) used to construct median joining networks42. It should be noted that this rate was inferred from the combined rate of the HVS-I control region (1.602 E−7 substitutions per site per year, or one site per 17,343 years)41, and the substitution rate of the 2121 nps of coding region (3.42 E−8 per site per year, or one site per 13,786 years). The last was inferred from haplogroup E1 of the 8000-year-old human remains found in Liangdao19,41.

Mismatch distribution and demographic history

Determination of spatial range expansion and stationary population history were tested using mismatch distribution with Arlequin 3.534 along with Harpending’s raggedness index43 to quantify smoothness and the sum of squared deviations (SSD) between observed and expected mismatches (Supplementary Fig. S2). Population expansion was inferred when a significant negative Fu’s Fs value and a nonsignificant SSD were obtained43. Furthermore, the demographic history of the population groups (Supplementary Fig. S2) was inferred using the Bayesian Skyline Plot (BSP) model with BEAST version 1.744 and the following parameters: HKY model, relaxed lognormal molecular clock model, and coalescent Bayesian Skyline with 4 groups. The first 25% of the generations were discarded as burn-in. All analyses were run to achieve an effective sample size (ESS) greater than 200 for all estimated parameters; on average, runs with Markov-Chain-Monte-Carlo (MCMC) chain lengths were greater than 1 × 107, and trees were sampled every 1000 generations. The effective population size for the posterior distribution of the estimated parameter values was determined using Skyline plot analysis with TRACER version 1.7.144.

Population gene flow and ancestry sharing

The population genetic grouping was characterized using Bayesian population mixture analysis with Bayesian Analysis of Population Structure v. 6.1 (BAPS)45. To accurately detect the uppermost hierarchical level of structure for the BAPS analysis, we used the Evanno statistic46 based on the rate of change in the log marginal probability of data between successive K values. Mixture analysis with K = 2 to 22 was first carried out to determine the Delta K (Supplementary Fig. S7). We then used K = 2 to 30 to produce a heatmap showing gene flow between source and target groups using the BAPS option “Admixture of individuals based on mixture clustering” (Supplementary Fig. S3).

Ethics statement

All participants gave written informed consent before biosample collection for subsequent population analysis. The study was authorized by the ethical committee of the Institutional Review Board (IRB) of the Mackay Memorial Hospital in Taipei, Taiwan (# 11MMHIS180).

Results

The study comprised 767 mtDNA samples from NAN Taiwanese individuals, often referred to as Han Taiwanese, or Minnan and Hakka. This group included thirteen lowlander/islander-Taiwanese groups (LL_Tw) and an urban Taiwanese group (Ur_Tw) composed of Minnan and Hakka residents from Kaohsiung City, Pingtung County (thereby named Minnan_Ko and Hakka_Ko), and Minnan and Hakka from Taipei19 (Table 1 and Fig. 1). Samples were initially screened by sequencing 2,669 nucleotide positions (nps) of the mtDNA of the coding and control (HSV-I) regions (nps 8,001 to 9,000, nps 9,801 to 10,900, and nps 16,001 to 16,569). This protocol identified 653 different haplogroups, 278 among NAN_Tw, of which 125 were shared by two Taiwan groups or more. MtDNA haplogroups were assigned using Haplogrep 2 software47. There were minor differences between LL_Tw and Ur_Tw. Basal haplogroups and their most representative subgroups (Fig. 1 and Supplementary Table S1) in NAN_Tw comprised haplogroups A (3.67%), B (23.34% with B4 = 16.14%, and B5 = 6.28%), C (3.28%), D (15.86% with D4 = 8.91%, D5 = 6.42%, and D6 = 0.53%), E (1.57% with E1a = 0.92%, and E2b = 0.52%), F (13.88% with F1a = 7.33%, F2 = 3.28%, F3 = 1.31%, and F4 = 0.78%), G (3.54%, with G1 = 0.92% and G2 = 2.36%), M (24.90% with M7b1, 8.54%, M7c1, 4.33%; M8, 3.66%, and M10a, 3.02%;), N9a (4.06%) and R9 (2.49%) (Fig. 1). To obtain better haplogroup assignments, the complete genome of poorly characterized haplotypes was sequenced. This procedure allowed us to define 62 novel subhaplogroups, of which 31 were exclusive to NAN_Tw (Table 2, Supplementary Fig. S1, and Supplementary Text 1). When naming a novel subhaplogroup, the exclusive nucleotide substitution closest to the tip of its phylogeny was added to the basal haplogroup name. We preferably used an SNP matching our present screening protocol and an underline for naming unreported haplogroups (e.g., M7b2a_8389) (Supplementary Fig. S1 and Table S3)47.

Table 2 Time of most recent common ancestor (TMRCA) of novel haplogroups and their origins.

Genetic structure

Gene diversity indices calculated from haplogroup frequencies generally showed higher values in mainland China, LL_Tw, Thailand, and Vietnam (H > 0.841) groups than in Austronesian-speaking groups, the Philippines, Indonesia, Malaysia, and AN_Tw (Table 3, Supplementary Table S1). To the exception of unreliable results obtained from groups with a small sample size (i.e., Pingtung n = 12, Heping Island = 10, and Xiaoliuqiu n = 11), the nucleotide diversity indices (π) were evenly distributed throughout Taiwan and showed no distinct differences among groups in continental Asia (π = 0.004 to 0.006) (Table 3). Similarly, except for Pingtung showing a mean number of pairwise differences (MNPd = 9.48), MNPd in LL_Tw and Ur_Tw were within the range seen in continental Asia (MNPd = 14.34 to 19.96) (Table 3 and Supplementary Fig. S2).

Table 3 Molecular Diversity.

The gene flow (M = Nem) exchanged between groups (Table 3) is affected by the effective population size, Ne, and the migration rate m. It is a measure of spatial expansion indicating how a population becomes subdivided when mating preferentially with neighbor groups48. Indigenous groups such as AN_Tw, Malaysians & Philippines tend to have a smaller Ne; thus, a smaller M is expected when compared to NAN_Tw. As expected, the low gene flow (M), shown with AN_Tw indigenous groups (M = 14.09), contrasts with the high to very high gene flow (M = 145.55 to 10781) shown with NAN_Tw (Table 3). However, the comparatively low intragroup gene flow in Green Island, Heping Island, and Yilan (M = 27.94 to 45.53) suggests that these groups had low Ne and likely remained relatively homogeneous, isolated and had a low migration rate (low m). We see a similar pattern for the Philippines and Malaysia (M = 12.8 and 11, respectively). A significant negative Tajima’s D value (p < 0.01) was seen in all NAN_Tw (LL_Tw and Ur_Tw) groups with a sample size greater than 20 (Table 3). These results suggest sudden expansion. However, to be adopted, these results need support by the combination of a significant and more powerful Fu’s Fs negative test (p < 0.01) along with a nonsignificant sum of square difference test (SSD, p > 0.02)43, as shown in mismatch distribution analysis (Supplementary Fig. S2, Columns 1 and 3). Although most mismatch plots appeared multimodal, none of the raggedness indices (Ri) was significantly different from the expectation. The NAN_Tw Bayesian Skyline distribution plots (Supplementary Fig. S2, Columns 2 and 4) show the effective population size (y-axis) as a function of time (x-axis). All groups show a higher effective population size (Ne) presently than in the past. Although many groups do not show expansion in the last few 1000 years, they show expansion until the last two to three thousand years BP. For example, while Fujian, Taipei, Changhua and Kaohsiung show an increase in Ne 30 times greater at present than in the past, smaller groups such as Green Island and the Matsu archipelago show population expansion to a much lower level (8 times). Similarly, except for small island groups such as Matsu, Heping, and Green Island showing a very low effective population size (Ne = 250), all other groups showed an Ne ranging between 1700 and 2500. Interestingly, Xiaoliuqiu shows a stationary effective population size. This is not surprising, as this island is very small (2 km wide). Its population often changed and never reached more than a few hundred individuals. However, caution should be taken when analyzing groups with a sample size of less than 20 individuals (Xiaoliuqiu, Heping, and Pingtung).

Mixture between groups

Inference of haplogroup contributions from putative parental populations was obtained using the analysis of haplogroup proportion from molecular data35. We pooled the Fujian, Northeast Asia, and South China groups to constitute a putative Han parental gene source. For the ancestral Austronesian putative gene source, we pooled all Taiwan indigenous peoples (AN_Tw) and kept all exclusive haplogroups (i.e., haplogroups not shared with the Han putative haplogroups) (Supplementary Table S1). When estimating the mixture (Supplementary Table S2), the percentage of unshared haplogroups (Supplementary Table S2, third column) may represent gene flow from other regions (i.e., MSEA or ISEA). Alternatively, it can represent within-group genetic variation resulting from a long period of isolation. Interestingly, on average, LL_Tw and Ur_Tw showed similar profiles, with 23.86% and 20.07% of haplogroups shared with Austronesian indigenous peoples, 33.27% to 34.27% shared with Han peoples, and 42.87% and 45.65% shared exclusive haplogroups (or haplogroups from an undefined source).

Diversity and demographic history

Multidimensional Scaling (MDS, Fig. 2A)

In the following, the East Coast of China is represented by Fujian (individuals from Xiapu County in China), self-declared individuals from the Fujian province who moved to Taiwan, and Matsu individuals24,25,26. The MDS plot (Fig. 2A) shows the affinity of the Ur_Tw groups (blue dots) with mainland China (black filled dots). The AN_Tw (red dots), the Philippines (orange dots in the upper right quadrant), and Japan (gray dots in the upper left quadrant) show much less affinity with NAN_Tw. Due to the low sample size (n < 20) of Pingtung, Miaoli, Xiaoliuqiu, and Heping, these LL_Tw groups appeared as out-groups on the MDS and were removed. The rest of the LL_Tw occupies a more central position, with Changhua, Papura, Kaohsiung, Penghu, and Yunlin showing affinity with different regions of Indonesia49. Last, as expected from Supplementary Table S2, Makatao with a 45% mixture with AN_Tw stands up in the same cluster as the Austronesian speakers.

Discriminant Analysis of Principal Components (DAPC, Fig. 2B)

DAPC (Fig. 2B) places NAN_Tw individuals (blue and purple-filled dots) in a central position. They include Ur_Tw groups (Minnan_Ko, Hakka_Ko, and Taipei in blue) and LL_Tw (purple) individuals and suggest overall homogeneity (Fig. 2B). Similarly, and consistent with previous complete human genome studies21,22,23, LL_Tw individuals show a significant ancestral relationship with mainland China (black dots). On the other hand, the Philippines (orange dots) and AN_Tw (red dots) groups show less mixture, suggesting a low amount of mixture between the NAN_Tw and AN_Tw groups and supporting results from Lo et al. 202023. These results are further supported by Supplementary Table S2, which shows a 3.49% to 5.53% mixture between AN_Tw and Changhua, Miaoli, Taipei, and Minnan_Ko groups and a more than 39% mixture with Makatao and the Hakka_Ko group from Neipu Township in Pingtung County. Additionally, the closeness of AN_Tw individuals with the Philippines group supports the Out of Taiwan hypothesis (OOT)50. As shown in Fig. 2A, Indonesian individuals (green) and MSEA individuals (brown) show partial affinity with NAN_Tw, likely reflecting traces of an ancient trading network between Taiwan, ISEA, and MSEA. In agreement with complete human genome autosomal studies21,23,51, the inertia ellipse representing individuals from China (black) overlaps with Japan (gray in the upper left). Similarly, the lower end of the ellipse (Black) overlaps with mainland Southeast Asia, likely the result of the southward expansion of the Han under the previous Qin Dynasty52.

Bayesian analysis of population structure (BAPS)

Mixture partition using Bayesian analysis (Fig. 3A, C)

We used a visualization tool of Bayesian Analysis of Population Structure (BAPS) to analyze the genetic diversity, mixture, and relative average of the maternal ancestry between NAN_Tw and groups of East Asia53. We did not use tests for linguistic affiliation, as all NAN_Tw belong to the Sinitic family of languages. However, we applied the analysis to test for geographically associated clusters.

Using the average log-likelihood of K (Delta K)46, the BAPS analysis characterized the most significant cluster variation at K = 7 and K = 19 (Supplementary Fig. S7). At K = 7, population clusters matched geography, except for Malaysia differentiating from MSEA and ISEA. The largest group (Light blue) includes all NAN_Tw (Fig. 3A) and China, and at K = 7, there is a homogeneous mixture between the groups21,23,51. Interestingly, this cluster lost significant homogeneity at K = 19 (Fig. 3A) but still includes Japan. The maternal legacy with Japan will be conserved until K = 23 when Japan separates from Northeast China, East Coast China, Taipei, Changhua, Kaohsiung, and Yunlin (Supplementary Fig. S6). East coast China (Fujian), Taipei, and Yunlin will show strong affinity until K = 28 (Supplementary Fig. S6). Finally, Makatao is the only LL_Tw group that remains separated from all other NAN_Tw. From K = 2 to K = 7, Makatao shows undifferentiated affinity with AN_Tw and the Philippines. From K > 9 to K = 23, its affinity will be restricted to AN_Tw, in agreement with Supplementary Table S2 showing Makatao with only 14.68% mixtures with putative parental Han, but 45.5% sharing with AN_Tw and 39.8% not shared with either putative group.

Fig. 3: Mixture partition and map of gene flow.
figure 3

Plots representing the uppermost hierarchical level of structure were constructed using the cluster assignments inferred from the Delta k results (K = 7, and K = 19) shown in Supplementary Fig. S746. Plots are color-coded according to their K value (i.e., K = 7 matches B, and K = 19 matches C). A Mixture partition for K7 and K19. Each vertical line represents a single individual and is colored according to the highest proportion of genetic variation assigned to each BAPS group. The order of populations is set according to the geographic orientation from north to south via Taiwan. B, C Network of gene flow (p = 0.01) for K7 and K19 and n > 20. Colors in (B) and (C) are associated with the mixture plot of the same K value in (A). Arrows indicate the average fraction of sequence variation obtained from the source cluster to the target cluster. Feedback arrows beside clusters indicate the fraction of gene sources arising within the BAPS group. Groups/clusters with less than 20 individuals (i.e., Heping, Miaoli, Pingtung, and Xiaoliuqiu) are not shown in the network of gene flow. Note: A complete heatmap of gene flow for all groups (K = 30) is shown in Supplementary Fig. S3.

Gene flow and levels of relatedness between groups (Fig. 3B, C)

BAPS was further used to estimate the extent of gene flow and the level of relatedness between groups of more than 20 individuals. At K = 7 (Fig. 3B, p < 0.01), four clusters showed exclusive within-group gene sources greater than 79% (Fig. 3B). Malaysia appeared as a group that separated very early in the genetic history of mainland Southeast Asia; however, likely in the last few hundred years, it received 16% of gene flow from Thailand, Vietnam, and Indonesia. Except for Malaysia, the overall gene flow between groups at K = 7 did not exceed 6%. However, at K = 19 (Fig. 3C), clusters without exclusive within-group gene sources (i.e., NAN_Tw, and Sumatra) are targets of gene flows primarily coming from China or other groups from Taiwan. Note that at K = 24 (data not shown), the urban Taiwanese group shows a level of gene source within a group of 82%, likely representing the diversity from all NAN_Tw target groups or resulting from recent demographic movements of lowland individuals toward central and attractive urban centers such as Taipei or Kaohsiung.

We used K = 30 to construct a heatmap of pairwise gene flow between all source and target groups (Supplementary Fig. S3). Japan, Northeast China, East Coast China, and mainland Southeast Asia represent the major sources of maternal heritage for NAN_Tw, compared to the contributions from the Philippines and AN_Tw.

The gene flow from Indonesia and MSEA, inferred in Fig. 2, was not seen in Fig. 3C because of the 15% gene flow pruning applied to obtain a clear visual. However, it is seen at less than 10% in Supplementary Fig. S3, with western Indonesia (Java) appearing as a moderate source compared to MSEA. These patterns of gene sources from Indonesia and MSEA suggest possible northward movements of populations or traders from the South China seas and supplement the apparent affinity between NAN_Tw, Indonesia, the Philippines, and MSEA previously determined with the MDS and DAPC analyses (Fig. 2 and Supplementary Table S1). Furthermore, the sharing of haplogroups between western Indonesia and groups in China (10.2%) also suggests gene flows resulting from past back-and-forth migrations of traders between East China and ISEA54 (Supplementary Tables S1, S2).

In summary, the source populations, shown in the top section of Supplementary Fig. S3 (East Asia), represent gene pools significantly distinct from MSEA- and Austronesian-speaking groups. These populations have high within-group variation. Except for South China and East Indonesia, they receive little from elsewhere and are likely the result of long periods of local dispersal and expansion before they reached Taiwan. In contrast, except for Taipei, Yunlin, Kaohsiung, and Changhua, who belong to a major source cluster (Fig. 3B, C), all other NAN_Tw groups have low within-group variation. Most are the targets of gene flow from mainland East Asia and at a lower level from Island Southeast Asia. Finally, except for Makatao and the Hakka_Ko group from Neipu Township in Pingtung County19, which received gene flow from AN_Tw (45.51% and 39.82%, respectively), the general mixture distribution between AN_Tw and NAN_Tw was less than 5% (Supplementary Table S2).

Discussion

The precise time and mode of the colonization of Taiwan by NAN groups remain a disputed issue. This study sheds light on the diversity, distribution, and origin of mtDNA haplogroups among 672 NAN individuals living in different offshore and inshore locations in Taiwan (Table 1 and Supplementary Table S1). The homogeneous distribution of high polymorphism, high genetic diversity (Table 3), and mixture throughout all groups suggests important gene flow between lowland and urban Taiwanese individuals (Fig. 2). The analysis of haplogroup sharing (Supplementary Table S2) showed little AN_Tw mixture among NAN_Tw, except for Hakka_Ko and Makatao. Interestingly, the matrilineal heritage of Makatao (45.5% sharing with AN_Tw, less than 15% with mainland Asia, and 40% of undefined sources) characterized this LL_Tw group as strongly mixed with AN_Tw (Supplementary Table S2). While strongly Sinicized, the Makatao may be a former indigenous group of Taiwan.

Recent studies51 used genome-wide analyses to date the common ancestor of extant Han Chinese, Korean and Japanese back to 3000 years ago and to characterize recent admixture from the surrounding populations21,22,51. Similar approaches23,55 reported that AN_Tw share ancestry with Tai-Kadai and Austroasiatic speakers and proposed that AN_Tw are genetically associated with Yangtze River Valley agriculturists. Larena et al. (2021)20 used 2.3 million genotypes from 118 ethnic groups in the Philippines to show that modern humans in ISEA interbred with archaic Denisovans. Genome-wide data analysis facilitates our understanding of the evolutionary history of human populations, gene flows, and their genetic diversity. However, more is still required to characterize subtle parental fingerprints within groups in more restricted geographical areas.

The screening method used in this study (using sequencing of nps 8001–9000, 9801–10,900, and the HVS-I of the control region 16,051–16,400) produced sufficient diversity to represent a statistically significant population structure of the NAN-speaking Taiwanese individuals and their relationships with other groups in East Asia. We identified 271 haplotypes in Taiwan, and complete genome sequencing characterized 62 novel mtDNA haplogroups (Table 2, Supplementary Tables S4, S5). Overall, most new variants were found close to the tips of the phylogenetic tree. Furthermore, the study revealed various maternal contributions from Northeast Asia, Fujian, South China, and MSEA individuals (Fig. 2 and Supplementary Table S1), consistent with previous complete human genome studies21,22,23. Three groups of relationships were determined (Supplementary Text S1 and Table 2): (a) A Northeast Asian group (~30%) comprising haplogroups A (A5b1b and A5b1c), C, D (D4, and D5), G, M8, M10a, and N9a1. (b) A group derived from mainland Southeast Asia and Island Southeast Asia (~60%) composed of haplogroups B (B4, and B5), E (E1a1), F (F1a, and F2), M (M7b1, and M7c1), and R9. (c) Finally, out of the 62 novel haplogroups, 16 had an immediate origin in Fujian, and 31 were seen exclusively among NAN Taiwanese (Table 2). Estimates of the time of the most recent common ancestor (TMRCA) using a single genetic system (mtDNA) may produce inconsistent estimates of the population divergence time. Accordingly, our attempts to uncover the prehistory of NAN_Tw settlement in Taiwan should be interpreted with caution. The TMRCA was calculated using complete mtDNA genomes. The TMRCA of haplogroups exclusive to Taiwan (Table 2) showed six haplogroups with a TMRCA higher than 5 kya. The remaining ranged from 1.0 to 4.0 kya. Interestingly, 14 haplogroups exclusive to NAN_Tw showed an age range of 1.0 to 2.6 kya, suggesting long isolation, genetic expansion within Taiwan, and prehistoric settlements of some NAN_Tw groups in Taiwan predating the substantial demographic movements of individuals from Fujian and Guangdong in the last 200 years. To support this hypothesis, we reviewed Wang et al. (2021)55 analysis of complete human genome sequencing of nine ancient human remains (at ~1.5 kya) from the Hanben archaeological site in northeastern Taiwan55. Its most recent dating could be from the early iron age (1.5 to 0.4 kya). Among their genetic dataset, three ancient remains carried the Y haplogroup O3a2b*-M7, and six carried O3a2c*-P164 (now officially renamed O2a2a1a2-M7 and O2a2b-P164, respectively). These Y-chromosome haplogroups are scarce among Austronesian speakers of insular Asia and are mainly associated with NAN individuals from mainland Asia. In agreement with our observation (exclusive mtDNA haplogroups with an age range of 1.0 to 2.6 kya), these findings corroborate late Neolithic era to early metal age (1.6 kya) settlements of non-Austronesian-speaking individuals in Taiwan54. Our previous study on archaeological human remains of the Neolithic Ling-Ding site II near Hualien in Taiwan (Fig. 1)56 characterized four human remains with mtDNA haplogroups (B4b, C4a2, N9a1, and Z). Similarly, these haplogroups, commonly seen in modern urban Taiwanese and continental Asia, support the findings of our study and the possibility of early settlements in Taiwan of non-Austronesian speakers from continental Asia. The BAPS analyses (Fig. 3) arranged the studied groups into four clusters of shared ancestry: (a) a cluster comprising Japan, China groups, and NAN Taiwan groups principally represented by Ur_Tw and LL_Tw, (b) a Malaysian cluster receiving 15% gene flow from Indonesia and MSEA (Thailand and Vietnam), less than 3.4% from China groups and NAN_Tw, and 2% from AN_Tw and the Philippines, (c) an Austronesian cluster comprising AN_Tw, the Philippines, and Makatao groups showing gene flow (2%) toward NAN_Tw, (d) an Indonesia and MSEA cluster indicating a distant relationship (6%) with NAN_Tw.

The northward gene flow from ISEA/MSEA toward Taiwan (6%) seen in Fig. 3 was also suggested by the MDS and DAPC analyses (Figs. 2 and 3) and is evidenced by the sharing of NAN haplogroups (B4a1, B4c1b2a, B4c2, B5a, F1a, F1a1a, M7c1, and M20) between Indonesia, MSEA, South China, and Fujian (Supplementary Table S1). Interestingly, haplogroups M7c1, R9, and M20 were also reported in India and Eurasia and imply ancient gene flows of NAN individuals throughout Southern and Eastern Asia, likely representing fingerprints of migrations and/or trading networks throughout the China Sea12. Although only indicating an undated gene flow through the China sea channel, this observation is in agreement with reports on archaeological sites suggesting trading networks of Chinese ceramic from the 12th century AD characterized in Penghu Island and the north and south of Taiwan12. The evidence of trading systems and cross-regional cultural exchanges between groups of Southern Asia, Eastern Asia, and Taiwan since 400 BC was further suggested by deposits containing glass beads and metal age materials of this period found in the archaeological sites of Jiuxianglan in Southeast Taiwan54. Complete mtDNA genome analysis of these haplogroups and their distribution through MSEA, East Asia, ISEA, and Taiwan and complete human genome analysis should help give more support to these observations.

Conclusion

The substantial distribution of mtDNA diversity found among non-Austronesian speakers of Taiwan and their relationship with neighboring Asian groups offers a better understanding of the matrilineal structure of Taiwan. This is likely the result of repeated cultural influences from various non-Austronesian human settlements from mainland Asia over prehistoric and historic periods. To a lesser extent, it is also the result of Malayo–Polynesian interactions through trade between Western Indonesia and mainland Southeast Asia. Rapid progress in molecular genetics will reduce the stochastic effect from the analysis of uni-parental systems. However, this study contributes to a better genetic characterization of NAN-speaking Taiwanese individuals. It exposes the importance of using other gene systems and analyzing other ethnic groups on the island before pertinent information needed to unravel their genetic heritage becomes unreachable.