Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

# Inference of multiple-wave admixtures by length distribution of ancestral tracks

## Introduction

Admixture among previously isolated populations has been a common phenomenon throughout the evolution of modern humans (Li et al., 2008; Reich et al., 2009; Wall et al., 2009). The history of population admixture has a strong influence on the landscape of genetic variation in individuals from admixed populations. Therefore, the population history of admixed populations can be reconstructed by utilizing genetic variation information (Xu et al., 2008; Pool and Nielsen, 2009; Price et al., 2009; Moorjani et al., 2011; Pugach et al., 2011; Gravel, 2012; Jin et al., 2012; Loh et al., 2013; Hellenthal et al., 2014; Jin et al., 2014; Pickrell et al., 2014; Ni et al., 2016; Pugach et al., 2016).

A few methods have been developed to infer admixture history based on ancestral tracks information (Pool and Nielsen, 2009; Pugach et al., 2011; Gravel, 2012; Jin et al., 2012; Jin et al., 2014; Ni et al., 2016; Pugach et al., 2016). Pool and Nielsen firstly used the length of ancestral tracks to infer population history (Pool and Nielsen, 2009). They introduced a theoretical framework describing the length distribution of ancestral tracts and proposed a likelihood inference method to estimate the parameters related to historical change in migration rates. In addition, Pugach et al. performed wavelet transforms on the ancestral tracks in an admixed population to obtain the dominant frequency of ancestral tracks to estimate the admixture time (Pugach et al., 2011; Pugach et al., 2016). Jin et al. further explored admixture dynamics by comparing the empirical and simulated distribution of ancestral tracks under three typical two-way admixtures models: the hybrid isolation (HI) model, the gradual admixture (GA) model, and the continuous gene flow (CGF) model (Jin et al., 2012). They later deduced the theoretical distributions of ancestral tracks under HI and GA models (Jin et al., 2014). Gravel extended these studies to multiple ancestral populations and discrete migrations, and provided a numerical estimation of tract length distribution (Gravel, 2012).

However, all of the aforementioned methods have a significant shortcoming. Before estimating the parameters of admixture history, a prior admixture model was required. The method by Pool and Nielsen considered the model in which a target population received migrants from a source population (Pool and Nielsen, 2009). Pugach et al.’s method involved an HI model, and Jin et al.’s methods involved HI, GA, and CGF models (Pugach et al., 2011; Jin et al., 2012; Jin et al., 2014). While Gravel considered models of multiple ancestral populations and discrete migrations, a prior admixture model was also required when dealing with the problem of admixture history inference (Gravel, 2012). However, in real data analysis, we always have little information of admixture history, and the admixture model is often uncertain for some complex admixed populations (Xu et al., 2008; Xu and Jin, 2008; Lipson et al., 2014; Bryc et al., 2015). Therefore, when the prior model deviates from real history, these methods might be unreliable.

In our previous work (Ni et al., 2016), we proposed some general principles in parameter estimation and model selection with the length distribution of ancestral tracks under a general model. However, with the increase of the number of parameters, it is complex and time-consuming to find the optimal solution, and too many parameters can lead to over-fitting. Thus, we only developed a method to infer admixture history under three typical two-way admixtures models.

In this work, we introduced a new method to select the optimal admixture model and estimate the corresponding parameters under a general model. Firstly, we proposed a general discrete admixture model with an arbitrary number of ancestral populations and an arbitrary number of admixture events. This was similar to the general model in our previous work (Ni et al., 2016). Then, we deduced the theoretical distribution of ancestral tracks with some reasonable approximations under the general discrete admixture model. We selected an optimal admixture model based on the length distribution of ancestral tracks. Specifically, we used a likelihood ratio test (LRT) (Wilks, 1938) to determine the number of admixture waves, and employed a method of exhaustion to determine the order of admixtures. We then applied an expectation–maximization (EM) algorithm (Dempster et al., 1977) to estimate the parameters under the optimal model. In our method, no prior knowledge about the admixture history is required, and the admixture model and its corresponding parameters could both be inferred by ancestral tracks. Finally, we conducted simulation studies to demonstrate the effectiveness of our method, and then applied our method to African Americans and Mexicans from the HapMap project phase III data set (International HapMap et al. 2010), and Uyghurs and Hazaras from the Human Genome Diversity Project (HGDP) data set (Li et al., 2008).

## Materials and methods

Now, denote I k  = {i:k i  = k}, then, I k (j) represents the wave ordinal of the jth admixture from ancestral population k. Let n k denote the number of admixture waves from ancestral population k, and thus we have n k  = |I k | and $$\mathop {\sum }^K_{k = 1} n_k = n + 1$$, where n is the total number of admixture waves. The general discrete admixture model is determined by the admixture order O = (k0, k1,…, k n ), the admixture proportion {α i }0 i n , and the admixture time {t i }0 i n+1. If we set

$$m_{k_i}\left( {t_i} \right) = \left\{ {\begin{array}{*{20}{c}} {1 - \alpha _1,} & {if{\kern 1pt} i = 0} \\ {\alpha _i,} & {if{\kern 1pt} i \ge 1} \end{array}} \right.,$$

and set the rest of the elements of the admixture proportion matrix {m i (t)}1 i K,1 t T as 0, we can get the admixture proportion matrix of the previous general model (Ni et al., 2016). This shows that our new model is similar to the previous model. Furthermore, this new model can also cover all of the scenarios of an admixed population with an arbitrary number of ancestral populations and an arbitrary number of admixture events.

### Length distribution of ancestral tracks

Next, we deduced the length distribution of the ancestral tracks from ancestral population k. The deduction process is similar to that of our previous work (Ni et al., 2016). The wave ordinals of the admixture from ancestral population k are I k (1), I k (2), …, I k (n k ), respectively. Define s i as the survival proportion of the ancestral tracks from the ith admixture at generation T. Then,

$$s_i = \alpha _i\mathop {\prod}\limits_{k = i + 1}^n {\left( {1 - \alpha _k} \right)} .$$
(1)

Denote H k (t) as the total ancestry proportion of the kth ancestral population in the admixed population at t generation, and we have

$$H_k\left( t \right) = \mathop {\sum }\limits_{j = 1}^w \alpha _{I_k\left( j \right)}\mathop {\prod }\limits_{i = I_k\left( j \right) + 1}^h \left( {1 - \alpha _i} \right),$$
(2)

where $$w = {\mathrm{max}}\left\{ {j:t_{I_k\left( j \right)} \le t,1 \le j \le n_k} \right\}$$ and $$h = \max \left\{ {i:t_i \le t,1 \le i \le n} \right\}$$.

For simplicity, we assumed that the chromosome length was infinite and there was no genetic drift. In addition, we defined the recombination among tracks from different ancestral populations as effective recombination because we only observed these recombination events among different ancestries. The length of the ancestral tracks was changed by these recombination events. For the tracks from ancestral population k, the effective recombination rate is 1 − H k (t) at t generation. Let u k (j) be the total effective recombination rate for the ancestral tracks from the jth admixture of ancestral population k. Then, we have

$$u_k\left( j \right) = \mathop {\sum }\limits_{h = I_k\left( j \right)}^n \left( {1 - H_k\left( {t_h} \right)} \right)\left( {t_{h + 1} - t_h} \right).$$
(3)

The length distribution of the ancestral tracks from the jth admixture of ancestral population k is an exponential distribution with a rate of u k (j) (Pool and Nielsen, 2009; Gravel, 2012; Ni et al., 2016). A chromosome from the jth admixture of ancestral population k is expected to be split into u k (j) pieces per unit length (units in Morgan). Thus, for the admixed population at T generation, the number of ancestral tracks from the jth admixture of ancestral population k is proportional to $$s_{I_k\left( j \right)}u_k\left( j \right)$$. Let X k be the length of the ancestral tracks from ancestral population k at generation T, and f k (x) is the probability density of X k . Then,

$$\begin{array}{ccccc}\\ f_k\left( x \right) =\!\!\!\! & \mathop {\sum }\limits_{j = 1}^{n_k} P\left( {\begin{array}{*{20}{c}} {{\mathrm{ancestral}}\,{\mathrm{tracks}}\,{\mathrm{are}}\,{\mathrm{from}}\,{\mathrm{the}}\,jth\,{\mathrm{admixture}}} \\ {{\mathrm{of}}\,{\mathrm{ancestral}}\,{\mathrm{population}}\,k} \end{array}} \right)\ u_k\left( j \right){\mathrm{exp}}( - u_k\left( j \right)x)\\ \\ & = \mathop {\sum }\limits_{j = 1}^{n_k} \frac{{s_{I_k\left( j \right)}u_k\left( j \right)}}{{\mathop {\sum }\nolimits_{j = 1}^{n_k} s_{I_k\left( j \right)}u_k\left( j \right)}}u_k\left( j \right){\mathrm{exp}}( - u_k\left( j \right)x).\\ \end{array}$$
(4)

The length distribution of the ancestral tracks was a mixed exponential distribution, and consisted with the results from our previous study (Ni et al., 2016).We deduced the length distribution of the ancestral tracks without drift and assumed infinite chromosome length. These assumptions had little influence on the length distribution. In our previous study, simulations had demonstrated that these assumptions were reasonable and accurate (Ni et al., 2016).

### Model selection and parameter estimation

If the admixture model is determined, the length distribution of the ancestral tracks can be written as Formula (4), which is a mixed exponential distribution. The EM algorithm (Dempster et al., 1977) can be used to estimate the parameters in this distribution. However, the admixture model is often unclear in real situations, which means that the number of admixture waves (n k ) and the order of admixtures (O) are unknown. Thus, we must first determine n k and O. Here we used LRT (Wilks, 1938) to select the optimal n k . (Details of the LRT procedures are in Supplementary Text S1) After that, we used the method of exhaustion to validate the accuracy of O. For any order of admixtures, we estimated the admixture proportion {α i }0 i n and admixture time {t i }0 i n+1 using the EM algorithm. However, these parameter estimations must satisfy the following constraint conditions:

(a) $$0 \le \alpha _i \le 1,\,{\mathrm{for}}\,1 \le i \le n$$; and

(b) $$t_1 \le t_2 \le \cdots \le t_n \le t_{n + 1}$$.

If the estimations do not satisfy these conditions, the order is incorrect. After traversing all of the admixture orders, we could determine the correct ones.

The detailed procedures are as follows:

Step 1: Estimate the total admixture proportion m k of ancestral population k. With the inferred ancestral tracks, divide the total length of tracks from population k by the total length of tracks in the admixed population, then we get the estimation of total admixture proportion m k .

Step 2: Determine the number of admixture waves (n k ) for each ancestral population and estimate the parameters of the mixed exponential distribution. For ancestral population k, use LRT to select the optimal number of admixture waves n k and then estimate the parameters $$\left\{ {\left( {\omega _{k1},\lambda _{k1}} \right),\left( {\omega _{k2},\lambda _{k2}} \right), \ldots ,(\omega _{kn_k},\lambda _{kn_k})} \right\}$$ of the mixed exponential distribution using the EM algorithm, where $$\omega _{kj} = \frac{{s_{I_k\left( j \right)}u_k\left( j \right)}}{{\mathop {\sum }\nolimits_{j = 1}^{n_k} {\kern 1pt} s_{I_k\left( j \right)}u_k\left( j \right)}}$$ and λ kj = u k (j). Details of the EM algorithm and LRT procedures are in Supplementary Text S1.

Step 3: Select admixture order O without a replacement from the set $${\mathrm{\Omega }} = ( {O:{\mathrm{a}}\,{\mathrm{permutation of sequence}})( {\underbrace {1, \ldots ,1}_{n_1}, \ldots \underbrace {k, \ldots ,k}_{n_k}, \ldots \underbrace {K, \ldots ,K}_{n_K}} ,{\mathrm{where}}\,O(1) \ne O(2)} ).$$ Get I k for each k base on the selected O.

Step 4: Determine {s i }0 i n and {u k (j),1 ≤ j ≤ n k ,1 ≤ k ≤ K} from the following equations:

$$\left\{ {\begin{array}{*{20}{c}} {u_k\left( j \right) = \lambda _{kj},} \\ {\begin{array}{*{20}{c}} {\frac{{s_{I_k\left( j \right)}u_k\left( j \right)}}{{\mathop {\sum }\nolimits_{j = 1}^{n_k} s_{I_k\left( j \right)}u_k\left( j \right)}} = \omega _{kj},} \\ {\mathop {\sum }\limits_{j = 1}^{n_k} s_{I_k\left( j \right)} = m_k,} \end{array}} \end{array}} \right.$$

where $$1 \le j \le n_k,1 \le k \le K$$.

Step 5: Determine {α i }0 i n from the following equations:

$$s_i = \alpha _i\mathop {\prod }\limits_{k = i + 1}^n \left( {1 - \alpha _k} \right),$$

where $$0 \le i \le n$$.

Step 6: Determine {t i }0 i n+1 from the following equations:

$$\begin{array}{l}H_k\left( t \right) = \mathop {\sum }\limits_{j = 1}^w \alpha _{I_k\left( j \right)}\mathop {\prod }\limits_{i = I_k\left( j \right) + 1}^h \left( {1 - \alpha _i} \right),{\mathrm{and}}\\ u_k\left( j \right) = \mathop {\sum }\limits_{i = I_k\left( j \right)}^n \left( {1 - H_k\left( {t_i} \right)} \right)\left( {t_{i + 1} - t_i} \right),\end{array}$$

where 1 ≤ j ≤ n k , 1 ≤ k ≤ K

Step 7: Judge whether {α i }0≤in and {t i }0 i n+1 satisfy the following conditions:

(a) 0 ≤ α i  ≤ 1, for 1 ≤ i ≤ n; and

(b) $$t_1 \le t_2 \le \cdots \le t_n \le t_{n + 1}$$.

If these conditions are satisfied, record the corresponding admixture order O, admixture proportion {α i }0 i n , and admixture time {t i }0 i n+1. Then return to Step 3 until all of the possible admixture orders are checked.

Through these above procedures, we obtained all of the reasonable admixture orders O, the corresponding estimators of admixture proportion {α i }0 i n , and admixture time {t i }0 i n+1. Based on the estimations of these parameters, we could recover the history of the admixed population.

However, due to a lack of accuracy in local ancestry inference, only these relatively long tracks are reliable (Pool and Nielsen, 2009; Gravel, 2012). Therefore, we are interested in the conditional length distribution of ancestral tracks longer than specific threshold C. As we know, the length distribution of ancestral tracks from each ancestral population is a mixed exponential distribution. When we consider only tracks larger than C, the length distribution from ancestral population k becomes:

$$f_k( {x|x \ge C} ) = \mathop {\sum }\limits_{j = 1}^{n_k} \frac{{\omega _{kj}}}{{\mathop {\sum }\nolimits_{j = 1}^{n_k} \omega _{kj}{\rm exp}( - u_k( j )C)}}u_k ( j ){\mathrm exp}( - u_k( j )x),$$

where $$\omega _{kj} = \frac{{s_{I_k\left( j \right)}u_k\left( j \right)}}{{\mathop {\sum }\nolimits_{j = 1}^{n_k} s_{I_k\left( j \right)}u_k\left( j \right)}}$$. However, since this condition distribution is not a mixed exponential distribution, we cannot use the EM algorithm to estimate the parameters. Fortunately, when we consider the random variable Y k  = X k C, we find that the distribution of Y k is a mixed exponential distribution, which can be written as follows:

$$f_k\left( y \right) = \mathop {\sum }\limits_{j = 1}^{n_k} \frac{{\omega _{kj}{\rm exp}( - u_k\left( j \right)C)}}{{\mathop {\sum }\nolimits_{j = 1}^{n_k} \omega _{kj}{\rm exp}( - u_k\left( j \right)C)}}u_k\left( j \right){\mathrm{exp}}( - u_k\left( j \right)y).$$

To take the threshold C into consideration, we must change the procedures of aforementioned Step 2. We can easily obtain samples of Y k from samples of X k . Then, we can use the EM algorithm and LRT to obtain the distribution parameters of Y k . Furthermore, by the relationship between f k (x) and f k (y), we can obtain the parameters of the mixed exponential distribution of X k . Then, the following steps are the same as those aforementioned in Steps 3–7. These procedures were all implemented in our MultiWaver, which can be downloaded at http://www.picb.ac.cn/PGG/resource.php.

In the MultiWaver software, two estimations of admixture time for the first wave were output. One was an estimation of t0, while the other was an estimation of t1. In theory, t1 is equal to t0, but in real data analysis the estimations may be not equal because of random errors and tracks inference errors. Thus, we presented two estimations of admixture time for the first admixture wave in our results. When t1 is proximate or equal to t0, we regard t0 or t1 as a successful estimation of time of the first wave of admixture.

### Simulation studies

We conducted simulations to evaluate the performance of MultiWaver. The simulation data were generated by the forward-time simulator AdmixSim (Yang et al., 2016) (AdmixSim can be downloaded at http://www.picb.ac.cn/PGG/resource.php). General settings of our simulation were the same as those in our previous study (Ni et al., 2016).

Here we divided multiple-wave admixture models into two different types of models. We denoted the model as a simple model if each ancestral population could contribute only once to the admixed population. The others were denoted as a complex model. In the complex model, at least one ancestral population donates admixture more than once. It is important to note that when we infer the admixture history under the complex model, it is very challenging to distinguish the different admixture waves from the same ancestral population.

We focused on evaluating the performance of MultiWaver under these two types of models. For the simple model, we considered a scenario of three ancestral populations (Supplementary Figure S1, Scenario (I)) and a scenario of five ancestral populations (Supplementary Figure S1, Scenario (II)). For the complex model, we considered a scenario of two ancestral populations with two-wave admixtures (Supplementary Figure S1, Scenario (III)). We evaluated the performance of MultiWaver with different admixture times and admixture proportions. For simplicity, we supposed the admixture proportions (α i ,1 ≤ i ≤ n) were equal. We set three different values of admixture proportions for each scenario: 0.1, 0.3, and 0.5. For Scenario (I), the admixture time was set as two different cases: (a) t2 = 20, T = 40, and (b) t2 = 40, T = 60. For Scenario (II), the admixture time was also set as two different cases: (a) t2 = 20, t3 = 40, t4 = 60, T = 80, and (b) t2 = 40, t3 = 80, t4 = 120, T = 140. For Scenario (III), the admixture time was set as four different cases: (a) t2 = 20, T = 40, (b) t2 = 40, T = 60, (c) t2 = 60, T = 80, and (d) t2 = 80, T = 100. Each case was repeated 10 times for a total of 240 simulations across the three scenarios. MultiWaver was applied to the simulated data with the default settings; the results were recorded and summarized.

In real situations, due to the limitations of local ancestry inference, only the ancestral tracks longer than a special threshold can be accurately inferred. Thus, to make our method more available to real situations, we chose the thresholds ranging from 0 cM to 2 cM in steps of 0.25 cM, and then evaluated the robustness of our method under the different thresholds.

We evaluated the performance of MultiWaver with different sample sizes. We considered Scenario (I) and Scenario (III) in which the admixture proportion was set as 0.3 and admixture time was set as t2 = 40, T = 60. At the end of the simulation, we sampled 20, 40, 100, 200, 300 and 400 “individuals”, corresponding to 2, 4, 10, 20, 30, and 40 human samples.

### Application to real data sets

First, we applied our method to some real data sets of African Americans and Mexicans. These two populations are typical admixed populations and their histories are relatively clear. Therefore, they could be used to test the performance of our method for real data. We obtained the data sets of African Americans (ASW), Mexicans (MEX), and the reference populations African (YRI) and European (CEU) from the HapMap Project Phase III data set (International HapMap et al., 2010). Meanwhile, Maya and Pima populations represented American Indian ancestry, which were obtained from the HGDP data set (Li et al., 2008). According to prior studies, African Americans and Mexicans have more than two ancestries (Kidd et al., 2012; Jin et al., 2014). However, the proportion of Native American ancestry of African Americans is <5% (Baharian et al., 2016), and thus, we only considered two dominant ancestries (African and European ancestry) of African Americans. For Mexicans, we considered three ancestries: African, European, and American Indian ancestry (Moreno-Estrada et al., 2013).

Then, our method was used to reconstruct the population history of Uyghurs and Hazaras. The histories of these two populations are more complex. Uyghurs and Hazaras populations were obtained from the HGDP data set. Previous studies have shown that Uyghurs and Hazaras had admixed ancestries mainly from East and West (Li et al., 2008; Hellenthal et al., 2014). Here we used Han and French as the proxies of Eastern ancestry and Western ancestry, respectively (Loh et al., 2013). These reference populations were also obtained from the HGDP data set.

To enhance the reliability of our analysis, HAPMIX (Price et al., 2009) was selected as the local ancestry inference method since it shows good performance in admixture break points inference (Hinch et al., 2011). However, HAPMIX can only be used to detect ancestral tracks for two-way admixtures, and thus it might not be proper for the Mexican population. PCAdmix (Brisbin et al., 2012) has shown great power in inferring the local ancestry of Mexican populations (Moreno-Estrada et al., 2013), and thus we used PCAdmix in this study. The generations pre-set in HAPMIX inference were 10 for African Americans and 80 for Uyghurs and Hazaras. The window size set in PCAdmix was the default. Since phasing data were required for both HAPMIX and PCAdmix, SHAPEIT 2 (Delaneau et al., 2012) was used to infer the haplotype phase. Finally, MultiWaver was used to determine the optimal model and estimate the admixture time accordingly with tracks longer than 1 cM.

## Results

### MultiWaver performed well under simple and complex models

We compared the admixture histories inferred by MultiWaver with the histories set in simulations, and then evaluated the performance of our method in model selection and parameters estimation. For Scenario (I) and (II), the results showed that the estimations of admixture time were highly consistent with the time simulated if we pre-set the admixture model as the simple model (-s option in MultiWaver) (Fig. 2). Our method also performed well when we did not pre-set this option (-s) (Supplementary Figure S2). The model was correctly selected in 85% of the simulations, and only in a few simulations the incorrect model was selected. When the model was correctly selected, the admixture time estimated was consistent with the simulated time.

For the complex model, we found that our method could select the right model with high accuracy (Fig. 3). Model selection was incorrect for only 3 in 40 simulations. In these three cases, the number of admixture waves was wrongly estimated, which led to the inaccurate estimation of admixture time. Thus, selecting a correct model is of crucial importance for admixture history inference. When the admixture model could be correctly selected, only a slight overestimation occurred for the admixture time.

We also evaluated the performance of MultiWaver with different admixture proportions. We found that the overestimation of admixture time in the complex model was related to the admixture proportions (Supplementary Figure S3). When the proportions of each admixture wave became smaller, the estimation error decreased. However, for the simple model, our method performed well for all of the situations.

In conclusion, regardless of which type of admixture model, our method performed well for model selection. Furthermore, the admixture time was estimated well for the simple model, and with only slight overestimation for the complex model.

### Robustness for different thresholds of track length

We tested the robustness of our method for different thresholds of track length. The results showed that our method was robust to thresholds for both the simple model and complex model (Fig. 4). Due to the limitations of our method, the local ancestry inference was not so accurate for short ancestral tracks. Thus, in real data analysis, we had to discard tracks smaller than a threshold. However, short ancestral tracks contain ancient admixture information, and if the threshold was too large, lots of information would be lost. Therefore, we had to balance the trade-off between information and accuracy. In our real data analysis, we set the threshold as 1 cM.

### Good performance with different sample sizes

The results showed that MultiWaver performed well with different sample sizes. For both scenarios, even with 20 “individuals” (corresponding to two human samples), it could select the right model and estimate the admixture time with high accuracy (Supplementary Figure S4). Therefore, our method was insensitive to sample sizes.

### Bootstrapping results with simulation data

To assess uncertainty in the optimal model and parameter estimation, we performed an analysis of the simulation data with bootstrapping sampling, which provided a supporting rate of the optimal model and 95% confidence interval (CI) of the estimations of admixture time (Table S1). We found that our inferred model was consistent with the optimal model based on bootstrapping analysis, giving the highest supporting rate and indicating high accuracy of our method as well as reliability in model selection and parameter estimation. Details of bootstrapping procedures are described in Supplementary Text 1(d).

### Real data analysis

For Mexicans, we used PCAdmix to infer the local ancestries, and a two-wave admixtures model was inferred (Fig. 5b). Each ancestral population contributed once to the admixed population. The time of the first admixture wave was about 18 generations ago, which was close to previous findings (Price et al., 2007; Tian et al., 2007; Wang et al., 2008; Kidd et al., 2012; Moreno-Estrada et al., 2013). The time of the second admixture was 12 generations ago. This time period (12–18 generations ago) was consistent with the time of the exploration of the new world. For our analysis of African Americans and Mexicans, the admixture histories inferred by our method were consistent with recorded histories, thus showing the power of our method in real data analysis.

Conclusively, Uyghurs and Hazaras had a similar admixture history. The ancient admixture might have been caused by the migrations of Indo-Aryan speaking people into the Indian subcontinent (1500 BC). Uyghurs mainly settled in West China, and Hazaras mainly settled in Afghanistan and Pakistan. The residences of these two populations were all near the Silk Road, and thus we thought the recent multiple admixtures might have been caused by the trades or migrations along the Silk Road. In fact, the real history of Uyghurs and Hazaras might be more complex than inferred. However, our method could detect some admixture events and provide some useful information to understand the origin and development of these complex populations.

## Discussion

In our method, it is possible that more than one optimal admixture model satisfied the constraint conditions and should be recorded. However, for all of the simulations we conducted, this phenomenon did not appear. This was reasonable because the admixture history had a one-to-one correspondence with the length distribution of ancestral tracks. If one situation had more than one optimal model, it implied the ancestral tracks were not accurately inferred. To assess uncertainties in the optimal model and parameter estimation, we applied a bootstrapping technique to provide a supporting rate of the chosen model and the CI of estimations of admixture time.

The efficiency of our method was also influenced by the validity of the local ancestry inference. We tested the performance of our method with the inferred ancestral tracks (Supplementary Text S1). We found that MultiWaver tended to overestimate the number of waves, and thus led to overestimating the admixture time with the ancestral tracks inferred by HAPMIX (Supplementary Figure S6 (a)). For multiple-way admixtures, the inaccuracy of ancestral tracks inferred by PCAdmix led to underestimating the time of the first admixture wave (Supplementary Figure S6 (b)). It was very difficult to obtain relatively accurate ancestral tracks with a small length for all of the local ancestry inference methods. To improve the effectiveness of the inference, we suggest using the ancestral tracks longer than a certain threshold C in our method. However, when the threshold became large, ancient admixture information would be lost rapidly. The development of sequencing technology and computational methods is expected to improve the detection of short ancestral tracks in the near future. Then, our method would be promising in recovering even more ancient admixture history, such as the admixture between modern humans and ancient humans (Prüfer et al., 2014; Sankararaman et al., 2014).

## Conclusions

Complex admixture history inference has long been a challenging problem in population genetics. In this work, we proposed a general discrete admixture model to describe admixture history with multiple ancestral populations and multiple-wave admixtures. We deduced that the length distribution of ancestral tracks was a mixed exponential distribution. On the basis of this distribution, we developed a new method, MultiWaver, to infer the multiple-wave admixture histories. We used LRT to select the number of admixture waves, and implemented a method of exhaustion to determine the order of admixtures. When the admixture model was determined, we applied the EM algorithm to estimate the parameters. Simulations and real data analysis showed that MultiWaver was precise and efficient in inferring admixture history.

### Data archiving

All of the data used to perform the application described in this paper are available for free. The data of African Americans and Mexicans were obtained from the HapMap project phase III data set; and the data of Uyghurs and Hazaras were obtained from the Human Genome Diversity Project (HGDP) data set.

## References

1. Baharian S, Barakatt M, Gignoux CR, Shringarpure S, Errington J, Blot WJ et al. (2016) The great migration and African-American genomic diversity. PLoS Genet 12:e1006059

2. Brisbin A, Bryc K, Byrnes J, Zakharia F, Omberg L, Degenhardt J et al. (2012) PCAdmix: principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations Hum Biol 84:343–64

3. Bryc K, Durand EY, Macpherson JM, Reich D, Mountain JL (2015) The genetic ancestry of African Americans, Latinos, and European Americans across the United States Am J Hum Genet 96:37–53

4. Delaneau O, Marchini J, Zagury JF (2012) A linear complexity phasing method for thousands of genomes Nat Methods 9:179–81

5. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38

6. Gravel S (2012) Population genetics models of local ancestry Genetics 191:607–19

7. Hellenthal G, Busby GB, Band G, Wilson JF, Capelli C, Falush D et al. (2014) A genetic atlas of human admixture history Science 343:747–51

8. Hinch AG, Tandon A, Patterson N, Song Y, Rohland N, Palmer CD et al. (2011) The landscape of recombination in African Americans Nature 476:170–75

9. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA et al. International HapMap C (2010) Integrating common and rare genetic variation in diverse human populations Nature 467:52–8

10. Jin W, Li R, Zhou Y, Xu S (2014) Distribution of ancestral chromosomal segments in admixed genomes and its implications for inferring population history and admixture mapping Eur J Hum Genet 22:930–7

11. Jin W, Wang S, Wang H, Jin L, Xu S (2012) Exploring population admixture dynamics via empirical and simulated genome-wide distribution of ancestral chromosomal segments Am J Hum Genet 91:849–62

12. Kidd JM, Gravel S, Byrnes J, Moreno-Estrada A, Musharoff S, Bryc K et al. (2012) Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation Am J Hum Genet 91:660–71

13. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S et al. (2008) Worldwide human relationships inferred from genome-wide patterns of variation Science 319:1100–4

14. Liang M, Nielsen R (2014) The lengths of admixture tracts Genetics 197:953–67

15. Lipson M, Loh PR, Patterson N, Moorjani P, Ko YC, Stoneking M et al. (2014) Reconstructing Austronesian population history in Island Southeast Asia Nat Commun 5:4689

16. Loh PR, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D et al. (2013) Inferring admixture histories of human populations using linkage disequilibrium Genetics 193:1233–54

17. Moorjani P, Patterson N, Hirschhorn JN, Keinan A, Hao L, Atzmon G et al. (2011) The history of African gene flow into Southern Europeans, Levantines, and Jews PLoS Genet 7:e1001373

18. Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, Gignoux CR et al. (2013) Reconstructing the population genetic history of the Caribbean PLoS Genet 9:e1003925

19. Ni X, Yang X, Guo W, Yuan K, Zhou Y, Ma Z et al. (2016) Length distribution of ancestral tracks under a general admixture model and its applications in population history inference Sci Rep 6:20048

20. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y et al. (2012) Ancient admixture in human history Genetics 192:1065–93

21. Pickrell JK, Patterson N, Loh PR, Lipson M, Berger B, Stoneking M et al. (2014) Ancient west Eurasian ancestry in southern and eastern Africa Proc Natl Acad Sci USA 111:2632–7

22. Pool JE, Nielsen R (2009) Inference of historical changes in migration rate from the lengths of migrant tracts Genetics 181:711–9

23. Price AL, Patterson N, Yu F, Cox DR, Waliszewska A, McDonald GJ et al. (2007) A genomewide admixture map for Latino populations Am J Hum Genet 80:1024–36

24. Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I et al. (2009) Sensitive detection of chromosomal segments of distinct ancestry in admixed populations PLoS Genet 5:e1000519

25. Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S et al. (2014) The complete genome sequence of a Neanderthal from the Altai Mountains Nature 505:43–9

26. Pugach I, Matveev R, Spitsyn V, Makarov S, Novgorodov I, Osakovsky V et al. (2016) The Complex Admixture History and Recent Southern Origins of Siberian Populations Mol Biol Evol 33:1777–95

27. Pugach I, Matveyev R, Wollstein A, Kayser M, Stoneking M (2011) Dating the age of admixture via wavelet transform analysis of genome-wide data Genome Biol 12:R19

28. Reich D, Thangaraj K, Patterson N, Price AL, Singh L (2009) Reconstructing Indian population history Nature 461:489–94

29. Sankararaman S, Mallick S, Dannemann M, Prufer K, Kelso J, Paabo S et al. (2014) The genomic landscape of Neanderthal ancestry in present-day humans Nature 507:354–7

30. Tian C, Hinds DA, Shigeta R, Adler SG, Lee A, Pahl MV et al. (2007) A genomewide single-nucleotide-polymorphism panel for Mexican American admixture mapping Am J Hum Genet 80:1014–23

31. Wall JD, Lohmueller KE, Plagnol V (2009) Detecting ancient admixture and estimating demographic parameters in multiple human populations Mol Biol Evol 26:1823–7

32. Wang S, Ray N, Rojas W, Parra MV, Bedoya G, Gallo C et al. (2008) Geographic patterns of genome admixture in Latin American Mestizos PLoS Genet 4:e1000037

33. Wilks SS (1938) The large-sample distribution of the likelihood ratio for testing composite hypotheses Ann Math Stat 9:60–2

34. Xu S, Huang W, Qian J, Jin L (2008) Analysis of genomic admixture in Uyghur and its implication in mapping strategy Am J Hum Genet 82:883–94

35. Xu S, Jin L (2008) A genome-wide analysis of admixture in Uyghurs and a high-density admixture map for disease-gene discovery Am J Hum Genet 83:322–36

36. Yang X, Ni X, Zhou Y, Guo W, Yuan K and Xu S (2016) AdmixSim: a forward-time simulator for various and complex scenarios of population admixture. bioRxiv 037135

## Acknowledgements

This work was supported by the Strategic Priority Research Program (XDB13040100) and Key Research Program of Frontier Sciences (QYZDJ-SSW-SYS009) of the Chinese Academy of Sciences (CAS), the National Natural Science Foundation of China (NSFC) (grants nos. 91331204, 91731303, 31771388, 11426237 and 31711530221), the National Science Fund for Distinguished Young Scholars (31525014), the Program of Shanghai Academic Research Leader (16XD1404700), the 973 Project (2011CB808000), the Fundamental Research Funds for the Central Universities (2017JBM071 and 2015IBM099), the National Excellent Doctoral Dissertation Foundation of PR China (201213), the National Center for Mathematics and Interdisciplinary Sciences of CAS, and the Key Laboratory of Random Complex Structures and Data Science, CAS (2008DP173182). S.X. also gratefully acknowledges the support of the National Program for Top-Notch Young Innovative Talents of the “Wanren Jihua” Project. We thank LetPub (www.LetPub.com) for its linguistic assistance during the preparation of this manuscript.

## Author information

Authors

### Corresponding authors

Correspondence to Zhiming Ma or Shuhua Xu.

## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest.

## Rights and permissions

Reprints and Permissions

Ni, X., Yuan, K., Yang, X. et al. Inference of multiple-wave admixtures by length distribution of ancestral tracks. Heredity 121, 52–63 (2018). https://doi.org/10.1038/s41437-017-0041-2

• Revised:

• Accepted:

• Published:

• Issue Date:

• ### AdmixSim 2: a forward-time simulator for modeling complex population admixture

• Rui Zhang
• Chang Liu
• Shuhua Xu

BMC Bioinformatics (2021)

• ### Refining models of archaic admixture in Eurasia with ArchaicSeeker 2.0

• Kai Yuan
• Xumin Ni
• Shuhua Xu

Nature Communications (2021)

• ### MultiWaver 2.0: modeling discrete and continuous gene flow to reconstruct complex population admixtures

• Xumin Ni
• Kai Yuan
• Shuhua Xu

European Journal of Human Genetics (2019)