Introduction

Admixture history inference is a fundamental problem in studies of admixed populations [1]. Several methods have been developed to address this problem based on various kinds of population admixture information, such as recombination break points [2], admixture linkage disequilibrium [3,4,5,6], and ancestral tracts [7,8,9,10,11,12]. The length distribution of ancestral tracts provides direct information about the decay of ancestral segment length, which is closely related to the admixture history. Therefore, many methods have been developed based on this type of information [7,8,9,10,11, 13, 14]. The histories of several classical admixed populations (African Americans, Mexicans, and Uyghurs) have been well studied using these methods [15,16,17,18,19,20,21]. However, these methods have two shortcomings. First, a prior admixture model is required before the parameters of the admixture history can be estimated. Second, the prior admixture model is often an overly simplified scenario, such as the hybrid-isolation (HI), gradual admixture (GA), or continuous gene flow (CGF) model. Knowledge of admixture history is often lacking when real data are analyzed, especially for complex admixed populations [2, 15, 22,23,24]. Therefore, results can be unreliable when the selected prior model deviates from the real admixture history.

In our previous work [13], we proposed some principles of parameter estimation and model selection under a general model, but that method could handle only three typical two-way admixture models. To overcome this limitation, we developed the MultiWaver program [25], which selects the optimal admixture model and estimates the corresponding parameters under a general discrete model. That method can deal with complex admixture scenarios involving multiple ancestral populations and multiple admixture events. In principle, the model could also be used to analyze a population with continuous admixture, since it can accommodate admixture events at any generation. However, if the true admixture model is continuous, the number of parameters can be very large (each wave has two parameters, one admixture time and one admixture proportion), and the model can therefore become very complex. In MultiWaver, we applied a likelihood ratio test (LRT) to select the best-fit model. Under this test, more parameters incur a greater penalty. Thus, the method tends to select a multiple-wave (discrete) model with few waves rather than a continuous model, and the continuous model is often overlooked.

In this work, we extend the MultiWaver software to MultiWaver 2.0, which can handle both discrete and continuous models. In the new method, we consider four different models (the HI, GA, CGF, and multiple-wave models) (Fig. 1). Our new method can automatically select the optimal model among those candidate models, and the confidence interval (CI) of admixture time and supporting rate for each candidate model can be obtained via a bootstrapping procedure. We conducted simulation studies to demonstrate the effectiveness of our method. Finally, we applied our method to African Americans from the HapMap project phase III dataset [26] and to Uyghurs and Hazaras from the Human Genome Diversity Project (HGDP) dataset [27].

Fig. 1

Four different types of admixture model. a Hybrid isolation (HI) model; b Gradual admixture (GA) model; c Continuous gene flow (CGF) model. POP1: the reference population 1; POP2: the reference population 2; m is the proportion of population 1 and α = 1 − m^(1/T). d The multiple-wave model, where POPki is the ancestral population of the ith admixture, αi is the admixture proportion of the ith admixture, and ti is the admixture time of the ith admixture

Materials and methods

Model selection and parameter estimation

To infer the admixture history, we need to select an optimal model from the four models listed above. For the discrete admixture models (the HI model and multiple-wave models), we apply the LRT to select the optimal discrete model. The results are the same as those obtained using the MultiWaver software [25]. Next, we compare the optimal discrete model with the continuous admixture models (GA and CGF). However, the GA, CGF, and discrete models are not nested in any pairwise combination, which means that no model is a special case of another. The LRT is therefore not applicable in this case, and we instead apply the Bayesian Information Criterion (BIC) [28, 29] to select the optimal model. The value of the BIC is calculated by the formula:

$${\mathrm {BIC}} = k\,\ln(n) - 2\,\ln\left( {L_{{\mathrm {max}}}} \right),$$

where k is the number of parameters, n is the sample size, and Lmax is the maximized value of the likelihood function. Details of the model selection procedure are illustrated in Fig. 2.
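As an illustration of how the formula ranks candidate models, the following minimal sketch computes the BIC from a parameter count, a sample size, and a maximized log-likelihood, and then picks the model with the smallest value. The function name `bic`, the sample size, and all log-likelihood values are hypothetical and are not taken from MultiWaver 2.0; the parameter counts follow the text (two parameters for GA and CGF, two per wave for a multiple-wave model).

```python
import math


def bic(num_params, sample_size, max_log_likelihood):
    """Return BIC = k * ln(n) - 2 * ln(L_max); smaller values are preferred."""
    return num_params * math.log(sample_size) - 2.0 * max_log_likelihood


# Hypothetical inputs: (number of parameters, maximized log-likelihood).
candidates = {
    "GA": (2, -1520.4),
    "CGFR": (2, -1518.9),
    "multi-wave (2 waves)": (4, -1515.2),
}
n_segments = 1000  # hypothetical sample size

scores = {name: bic(k, n_segments, ll) for name, (k, ll) in candidates.items()}
best_model = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: BIC = {score:.2f}")
print("Selected model:", best_model)
```

In this made-up example the two-wave model has the highest likelihood, but its two extra parameters cost more than the likelihood gain, so a two-parameter model is selected.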

Fig. 2

Flow chart of the algorithm for model selection. Lmax (GA), Lmax (CGFR), Lmax (CGFD), and Lmax (multi-wave) are the maximized values of the likelihood function under GA, CGFR, CGFD, and multiple-wave models, respectively. Best M is the optimal model, where M is the set of GA, CGFR, CGFD, and multiple-wave models

Whether one uses the LRT or BIC method, it is necessary to calculate Lmax and to estimate the parameters of the admixture models. In our previous study [13], we employed the length distributions of ancestral tracks under the HI, GA, and CGF models. These models involve only two parameters, the admixture proportion m and the admixture time T (Fig. 1a−c). Thus, we can easily calculate Lmax and the estimates of m and T via a binary search algorithm. For multiple-wave models, the length distribution of the ancestral tracks can be written as a mixed exponential distribution [25]. We can then use the expectation–maximization algorithm [30] to calculate Lmax and to estimate the admixture time (ti) and proportion (αi) (Fig. 1d); this produces the same estimates as the MultiWaver method [25]. After obtaining Lmax for the four models, we then select an optimal model via the LRT and BIC methods. The optimal model and the corresponding estimators of admixture time and proportion can then be used to describe the inferred admixture history. The details concerning the LRT are described in Supplementary Text 1.
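The expectation–maximization step for a mixed exponential distribution can be sketched as follows. This is a generic textbook EM for a mixture of exponential densities, written under the assumption that track lengths are given as positive numbers; it is not the MultiWaver implementation, and the conversion from fitted rates and weights to admixture times (ti) and proportions (αi) described in [25] is not shown. The function name, initialization, and simulated rates are all hypothetical.

```python
import math
import random


def em_exponential_mixture(lengths, num_components, num_iter=200):
    """Fit weights and rates of a mixture of exponentials by EM.

    Returns (weights, rates, log_likelihood).
    """
    n = len(lengths)
    mean_length = sum(lengths) / n
    # Simple initialization: equal weights, rates spread around 1 / mean.
    weights = [1.0 / num_components] * num_components
    rates = [(j + 1) / mean_length for j in range(num_components)]

    for _ in range(num_iter):
        resp_sum = [0.0] * num_components      # sum of responsibilities
        resp_length = [0.0] * num_components   # responsibility-weighted lengths
        # E-step: posterior probability that each track belongs to each component.
        for x in lengths:
            dens = [w * r * math.exp(-r * x) for w, r in zip(weights, rates)]
            total = sum(dens)
            for j in range(num_components):
                g = dens[j] / total
                resp_sum[j] += g
                resp_length[j] += g * x
        # M-step: closed-form updates for the weights and rates.
        weights = [s / n for s in resp_sum]
        rates = [s / sl for s, sl in zip(resp_sum, resp_length)]

    log_likelihood = sum(
        math.log(sum(w * r * math.exp(-r * x) for w, r in zip(weights, rates)))
        for x in lengths
    )
    return weights, rates, log_likelihood


# Hypothetical data: tracks drawn from two exponential components.
random.seed(0)
tracks = [random.expovariate(10.0) for _ in range(1000)]
tracks += [random.expovariate(40.0) for _ in range(1000)]

weights, rates, loglik = em_exponential_mixture(tracks, num_components=2)
print("weights:", weights)
print("rates:", rates)
print("maximized log-likelihood:", loglik)
```

The returned log-likelihood plays the role of ln(Lmax) in the BIC formula; with a single component, the update reduces to the usual estimate of one over the mean track length.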

Bootstrapping procedures

To assess the uncertainty in optimal model selection and estimated parameter values, we also apply the bootstrapping technique in MultiWaver 2.0 to obtain a degree of support for the chosen model and the CI of the admixture time. We conduct the bootstrapping by resampling the same number of segments with replacement and use these resampled segments to infer the admixture model and its corresponding admixture time. Details of the bootstrapping procedure are described in Supplementary Text 1. MultiWaver 2.0 can be downloaded at http://www.picb.ac.cn/PGG/resource.php.
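The resampling scheme described above can be sketched in a few lines. The sketch assumes an inference routine that takes a list of segment lengths and returns a model label and a time estimate; here a toy placeholder, `toy_infer`, stands in for the real model selection, and its "time" is simply the reciprocal of the mean segment length. The function names, the percentile rule for the CI, and the simulated segments are assumptions for illustration only; the procedure actually used by MultiWaver 2.0 is the one given in Supplementary Text 1.

```python
import math
import random
from collections import defaultdict


def bootstrap_support(segments, infer, num_replicates=100, level=0.95, seed=0):
    """Resample segments with replacement and summarize the inferences.

    Returns {model: {"support": fraction of replicates, "ci": (low, high)}}.
    """
    rng = random.Random(seed)
    n = len(segments)
    times_by_model = defaultdict(list)
    for _ in range(num_replicates):
        # Same number of segments as the original data, drawn with replacement.
        resampled = rng.choices(segments, k=n)
        model, time = infer(resampled)
        times_by_model[model].append(time)

    summary = {}
    for model, times in times_by_model.items():
        times.sort()
        last = len(times) - 1
        low = times[math.floor((1 - level) / 2 * last)]
        high = times[math.ceil((1 + level) / 2 * last)]
        summary[model] = {
            "support": len(times) / num_replicates,
            "ci": (low, high),
        }
    return summary


def toy_infer(segs):
    """Hypothetical stand-in for model selection: one label, one estimate."""
    return "toy-model", len(segs) / sum(segs)


random.seed(1)
segments = [random.expovariate(20.0) for _ in range(500)]
result = bootstrap_support(segments, toy_infer)
print(result)
```

With a real inference routine, different replicates may select different models, and the support value for each model is then the fraction of replicates in which it was chosen.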

Simulation studies

We conducted simulations to evaluate the performance of MultiWaver 2.0. The simulation data were generated by the forward-time simulator AdmixSim [31]. AdmixSim can be downloaded at http://www.picb.ac.cn/PGG/resource.php. The population size of the admixed population was arbitrarily set to 5000 and remained constant in our simulations, and the length of the simulated chromosome was 3.0 Morgans, which approximates the length of chromosome 1 of the human genome. At the end of the simulation, 100 “individuals” (pairs of chromosomes) were sampled, and the ancestral tracks were recorded.

For the symmetric admixture models (HI and GA) (Fig. 1a, b), the admixture proportions varied from 20 to 50% in steps of 10%. For the asymmetric admixture model (CGF) (Fig. 1c), we divided the analysis into two sub-models: if population 1 was the gene flow recipient, we denoted it as the CGFR model; otherwise, we denoted it as the CGFD model. We set the admixture proportions in the CGF model from 20 to 90% in steps of 10%, and we set the admixture time to 20, 40, 60, 80, and 100 generations. For the multiple-wave model (Fig. 1d), we considered a scenario of two ancestral populations with two-wave admixture. For simplicity, we assumed that the proportions (αi, 1 ≤ i ≤ n) were equal in each wave of admixture. We used four values of the admixture proportion: 0.2, 0.3, 0.4, and 0.5. The admixture times were set according to five cases: (a) t2 = 10, T = 20, (b) t2 = 20, T = 40, (c) t2 = 40, T = 60, (d) t2 = 60, T = 80, and (e) t2 = 80, T = 100. Each simulation setting was repeated ten times, for a total of 1400 simulations across the four admixture models. MultiWaver 2.0 was applied to the simulated data with the default settings, and the results were recorded and summarized.
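The total of 1400 simulations can be checked by enumerating the parameter grid. The sketch below assumes that the five admixture times (20 to 100 generations) apply to the HI and GA models as well as to CGF, and that the CGFR and CGFD sub-models each use the full CGF grid; under this reading the count matches the stated total. The variable names are illustrative only.

```python
from itertools import product

times = [20, 40, 60, 80, 100]                               # generations
symmetric_props = [0.2, 0.3, 0.4, 0.5]                      # HI, GA, multi-wave
cgf_props = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]        # CGFR, CGFD
two_wave_times = [(10, 20), (20, 40), (40, 60), (60, 80), (80, 100)]  # (t2, T)
replicates = 10

settings = []
for model in ("HI", "GA"):
    settings += [(model, m, t) for m, t in product(symmetric_props, times)]
for model in ("CGFR", "CGFD"):
    settings += [(model, m, t) for m, t in product(cgf_props, times)]
settings += [
    ("multi-wave", a, tt) for a, tt in product(symmetric_props, two_wave_times)
]

total_simulations = len(settings) * replicates
print(len(settings), "settings,", total_simulations, "simulations")
```

This gives 20 settings each for HI, GA, and the two-wave model, and 40 each for CGFR and CGFD.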

Application to analysis of real datasets

The histories of several real admixed populations were analyzed with our method. First, we applied our method to African Americans. The data for African Americans and the reference populations CEU and YRI were obtained from HapMap Project Phase III [26]. Next, we applied our method to reconstruct the population histories of Uyghurs and Hazaras. We used the Han and French populations as proxies for Eastern and Western ancestry, respectively [4]. The data used in this analysis were obtained from the HGDP dataset. Haplotype phasing was performed with SHAPEIT 2 [32], and local ancestry was inferred with HAPMIX [33]. MultiWaver 2.0 was used to select the optimal model and to estimate the admixture time and proportion using tracks longer than 1 cM.

Results

MultiWaver 2.0 performed well in parameter estimation and model selection

Using the extensive simulated data, we systematically evaluated the performance of our method with regard to parameter estimation and model selection. The model was correctly selected in 88% of the simulations. For the simulations under the HI and GA scenarios, our method identified the correct model in nearly all cases; for the CGFR, CGFD, and multiple-wave models, our method identified the correct model with an accuracy of 82.0% (Table 1). The simulations in which our method failed were often those with a very recent admixture time and a small admixture proportion.

Table 1 The accuracy of our method in model selection

We also evaluated the performance of our method for time estimation. Our method was able to estimate the admixture time with high accuracy (Fig. 3). Figure 3 shows one set of simulations and the corresponding bootstrap results for the CGFR, CGFD, GA, HI, and multiple-wave models. For the HI, CGF, and GA models, the estimates were highly consistent with the simulated times, while there was a slight overestimation for the multiple-wave model. We conclude that our method performed well in both model selection and parameter estimation.

Fig. 3

Admixture time estimated under four different types of admixture models. a CGFR model; b CGFD model; c GA model; d HI model; and e multiple-wave model. The x-coordinate is the admixture time in generations ago, with 0 being the present time. The y-coordinate is the density of admixture time estimated from 100 bootstrap-resampling datasets. There are five subgraphs for each model. Each subgraph represents the result from one simulation assuming a certain admixture time. The red dashed lines represent the given admixture time for the simulation. The admixture proportion was 0.3

Real data analysis

For African Americans, the program inferred the GA model, with an admixture time of 12 generations (Fig. 4a, Table S1). In our previous studies [13, 25], the African American population was inferred as a GA scenario with AdmixInfer and as a two-wave admixture model with MultiWaver. While both results are supported by various historical records, a single best-fit model is desirable. MultiWaver 2.0 resolves this ambiguity through its model selection framework. We compared the likelihoods of the two models using the BIC and found that the GA model was the most likely scenario. In other words, the GA model appears to fit African Americans better than multiple-wave admixture models.

Fig. 4

Inferred admixture history based on analysis of real datasets. Inferred admixture history of a African Americans (ASW), where CEU and YRI were taken as the representative ancestral source populations of ASW; b Uyghurs, and c Hazaras. Han is the Han population representing Asian ancestry; Fre is the French population representing European ancestry

In addition, we applied our method to reconstruct the admixture histories of Uyghurs and Hazaras. These two admixed populations were inferred as GA types by AdmixInfer [13] and inferred as multiple-wave types by MultiWaver [25]. The results of MultiWaver 2.0 confirmed that the admixture pattern of Uyghurs and Hazaras followed a multiple-wave admixture model, rather than a GA or a CGF model (Fig. 4b, c, Table S1).

Discussion

MultiWaver 2.0 is an improved version of MultiWaver in that it can consider both discrete and continuous admixture models simultaneously. In MultiWaver 2.0, we apply the principles of the BIC to select the optimal model. Simulation studies suggest that our method is precise and efficient in model selection and parameter estimation.

The admixture history of a real population is often very complex. Previous methods have required strong prior knowledge of the admixture pattern. If the admixture pattern is wrongly specified, the inferred admixture history may deviate from the actual history. Here, we provide a general framework to address this problem. MultiWaver 2.0 can automatically select the best-fit model from the candidates. Admittedly, when the true admixture history deviates from all of the candidate models, our method might not yield a good inference. However, the models we provide cover most admixture cases encountered in real data analysis, and the framework of our method is flexible. In the future, if a new representative model is proposed, it can easily be introduced into this framework.

However, some problems remain. First, we found that the BIC penalty for the number of parameters was not sufficient for model selection in our setting; thus, simulations under the CGFR and CGFD models were often wrongly identified as multiple-wave models. This was especially true for populations with a recent admixture time and a small admixture proportion. Second, the admixture time was overestimated when inferring admixture history under the multiple-wave model. This problem also occurred in MultiWaver. However, the overestimation was related to the admixture time and the admixture proportion of each wave. In our previous study, we used this relationship to adjust the estimate of the admixture time [25].

Like other methods based on ancestral tract information, our method is sensitive to the accuracy of local ancestry inference (LAI). For existing LAI methods, short ancestral tracts are very difficult to detect. To remove the influence of short ancestral segments, we suggest using only the ancestral tracts longer than a certain threshold C in our software. Besides the effect of short tracts, the admixture model used by the LAI method is also a strong prior assumption, and the inferred history might tend to resemble the model assumed by the LAI method. To overcome this problem, joint inference of ancestral tracts and admixture history may be implemented in the future.