Abstract
Since anatomically modern humans dispersed Out of Africa, the evolutionary history of Eurasian populations has been marked by introgressions from presently extinct hominins. Some of these introgressions have been identified using sequenced ancient genomes (Neanderthal and Denisova). Other introgressions have been proposed for still unidentified groups using the genetic diversity present in current human populations. We built a demographic model based on deep learning in an Approximate Bayesian Computation framework to infer the evolutionary history of Eurasian populations including past introgression events in Out of Africa populations fitting the current genetic evidence. In addition to the reported Neanderthal and Denisovan introgressions, our results support a third introgression in all Asian and Oceanian populations from an archaic population. This population is either related to the NeanderthalDenisova clade or diverged early from the Denisova lineage. We propose the use of deep learning methods for clarifying situations with high complexity in evolutionary genomics.
Introduction
All modern humans are genetically related to each other at a time depth of up to 300 thousand years ago (Kya)^{1,2} and share a common African root^{3,4,5}. The migratory routes used by anatomically modern humans (AMH) after the African diaspora and aspects of the interbreeding between AMH and presently extinct hominins living at the time in Eurasia (here referred as Eurasian Extinct Hominins, EEH) are still under debate. Recently, several genetic studies have argued that there was only one main Out of Africa (OOA) event^{6,7,8} that happened less than 100 Kya, with earlier dispersals, which most likely does not contribute much to presentday variation^{9,10,11}. Sequencing of ancient Neanderthal and Denisovan fossils supported introgression events into AMH out of Africa;^{12,13} however, recent studies also support the presence of gene flow from AMH into Neanderthals^{14,15}, thus suggesting a complex hominin evolution.
A complex demographic model considering archaic introgressions out of Africa is needed to properly address two main points. First, the number of introgression events from Neanderthal and Denisova into AMH, which remains controversial, and their impact on the AMH gene pool. A difficult point is to account for the higher amount of Neanderthal introgression in Asia than in Europe^{8}. Moreover, the geographic limits of Denisovan introgression are not well established^{14,16}. Interestingly, Australian Aborigines also harbour high amounts of EEH due to several introgression events^{7,9,12}. Although the evidence for Denisovan introgression in Oceanian populations is clear, the amount of introgression varies between 3 and 6% depending on the methodology^{14,16,17}. Also, there is disagreement about the number of introgression events necessary to explain patterns of admixture in modern Australian Aborigines^{7,8,11}. The second main point is to identify all the cryptic ghost archaic populations that have introgressed with hominins with known genomes (modern humans from the Old World, Neanderthals and Denisovans). It has been suggested that other archaic hominins could have interbred with Denisovans^{14,16}. Furthermore, introgression of extinct hominins beyond those of Neanderthals and Denisovans into some AMH populations, both within Africa^{2,18}, as well as in South Asia^{6} and Tibetan populations^{19} has been proposed. Recently it has been also found that there could have existed two different pulses of Denisovan admixture in East Asia^{20}.
However, comparing the likelihood of competing complex demographic models given the observed data is cumbersome^{21}, requires manual expertise^{22} and/or fitting^{23}. Consequently, all of these different demographic models have not been tested together using the same methodology. Furthermore, it would be interesting to have an estimation of the parameters that define a demographic model given the data. However, this is not available in commonly applied algorithms (i.e. Admixtools^{23}). Within this context, Approximate Bayesian Computation (ABC) is a flexible statistical framework that allows estimating the posterior distribution of a parameter/model through the generation of simulated datasets for cases when there is no closeform of the likelihood of the function^{24}. ABC is widely used in the field of population genetics^{25}. However, identifying the proper transformations of the data to generate informative summary statistics (SS) is model and parameter dependent, and including noninformative redundant SS can jeopardize the performance of the ABC approach^{22}. Jiang et al.^{26} recently proposed a theoretical improvement of the ABC framework by considering Deep Learning (DL) for generating informative SS from a set of (raw) SS. DL is a type of Artificial Neural Network (ANN) topology based on multiple hidden neural layers, and it has been used for population genetics inference in nonABC frameworks^{27}. Since ANN performs a nonlinear transformation of the input features towards the output^{28}, DL provides an optimal way of extracting and maximizing the nonredundant information out of the raw SS^{26}. Thus, a DL can be trained using a large number of raw SS as input features to predict the parameters that generated the simulated data, or to classify a simulated data in a set of models, and use this classification prediction or parameter estimation as a new SS (SSDL).
In the present study, we considered the multidimensional absolute site frequency spectrum (SFS) as raw SS. SFS accounts for the number of single nucleotide polymorphisms (SNPs) present in a particular combination of derived alleles in the sampled populations^{29}. This statistic is the core of many frequency based statistics applied in population genetics for distinguishing among competing topological trees and detecting archaic introgression (i.e. Dstatistics, F4, F4 ratio etc^{23}.), determining migration rates, time of split and/or differences in the effective population sizes in models of two populations that split and migrate (i.e. Fst^{29}). Moreover, the SFS is not affected by recombination^{30}. SFS has been shown to allow robust demographic estimates^{21,29,31} and the folded SFS has been previously used in classical ABC frameworks for estimating the population size^{32}. However, SFS is high dimensional and it combinatorically increases on the considered number of samples and populations^{29}. Our proposed methodology takes advantage of the nonlinear nature of the DL for compressing the SFS to extract the underlying patterns defined by all the dimensions of the SFS that allow distinguishing among proposed models or obtaining the value of the demographic parameters of interest. This SSDL is then used in an ABC framework.
We apply our newly developed method in a large set of whole genome sequences of present and ancient remains to estimate the posterior probability of competing complex demographic models with a focus on introgression events. We find that ABCDL is powerful for distinguishing between plausible introgression models and propose a third archaic introgression event happened (in addition to Neanderthal introgression to all OOA populations and Denisova introgression for Oceanian population), which is common to all Asia and Oceanian populations.
Results
Dstatistics and F4 ratio analysis
Dstatistics and F4 ratio statistics, tests of frequent use in population genetics, have consistently demonstrated the existence of Neanderthal and Denisovan ancestry in out of African populations^{16}. Using a common variant calling (see Methods and also Supplementary Methods), we found with Dstatistics and F4 ratio test that all Asian populations (East Asians, Andamanese and Indian Tribal populations) have higher amounts of Neanderthal ( > 1%) and Denisovan ancestry ( > 1%) compared to Europeans (Table 1). Interestingly, when trying to calculate the excess of Neanderthal ancestry in Asian populations compared to Europeans using the F4 ratio test, different outgroups (ancestral alleles mainly constituted from the chimpanzee reference genomes^{33} and Denisovan^{12}) gave different results (Fig. 1). This was unexpected, as the F4 ratio test and Dstatistics should not be affected by changing the outgroup under the simplistic topology used in these tests. When using the ancestral alleles from the 1000 Genomes as outgroup^{33}, the excess of Neanderthal ancestry in Asian populations is around 0.5% (s.e. 0.15%) and statistically significant (Z score > 3). In contrast, when using Denisovan as outgroup, this estimate diminishes and loses statistical significance (~0% ± 0.15%). We then used simulations considering different demographic models (Supplementary Figure 1), varying the percentage of introgression while keeping constant all the other parameters (Supplementary Table 3), to fit the observed Dstatistics and F4 ratio statistics. The simulation results suggest that a simple model for the excess of Neanderthal and Denisova ancestry in Asian populations is not enough for fitting these statistics and other ancestral component is needed (Supplementary Figure 3). Moreover, our analyses suggest that Dstatistics and F4 ratio are sensitive to the outgroup (Denisova or chimpanzee) used at least for archaic introgression model. In Oceania, results get more complex to interpret due to the derived allele sharing between Neanderthals and Denisovans. Given the possible biases due to the reference genome (mostly of European origin) used to calculate Dstatistics, we remapped all the genomes to the chimpanzee reference and recalculated the results (Supplementary Table 2). Although most of the Neanderthal and Denisovan ancestry calculations remained the same, we note a decrease of African ancestry by changing the reference from human to chimpanzee in Indian Tribal and Oceanian populations. Overall, these results show a complex situation that needs a broader approach to distinguish among competing demographic models.
ABCDL analysis
In order to quantify the posterior probability of different demographic models given the observed genetic diversity in current populations and archaic individuals, which cannot be achieved by the Dstatistics and F4 ratio analyses, we applied an ABCDL analysis (see Methods and Supplementary Note 2).
We considered several plausible competing demographic models which describe scenarios that have been proposed by different studies to explain the observed genetic diversity of introgression patterns among current old world AMH populations, Neanderthals and Denisovans (see Fig. 2 and Supplementary Table 5). The implemented demographic models are variations from an initial model with many accepted features (Model A): OOA origin of modern humans, with a Eurasian split between Europeans and the group comprising two subgroups, East Asians, Indian and Andamanese on one hand, and Papuans and Australians on the other. Introgression of Neanderthals is found in all OOA populations, and that of Denisovans in the Oceanian populations. Introgression from an extinct group of distant hominins into Denisovans^{14,16} has been considered in all models, and this introgression does not affect directly our core results. From this basic model, several variations have been proposed to better fit the genetic data; here we consider the following. Model B: decrease of European Neanderthal ancestry due to admixture with a modern human, “basal Eurasian”, a ghost population nonintrogressed by Neanderthals (Xn)^{34}. Model C: two introgression events from Neanderthals, the first affecting all OOA populations and the second only affecting Asia^{35,36}. Model D: two introgression events both from Neanderthals and Denisovans, with a Denisovan introgression on Asians before the split of Oceanian populations^{14,16}. The remaining models have a single introgression of Neanderthals and Denisovans (as in Model A) but with an introgression to Asian and Oceanian populations from an extinct (Xe) hominin population that is a sister group of Neanderthals (Model E), or is a sister group of Denisova (Model F;^{20}), or is an outgroup of both Neanderthals and Denisovans (Model G;^{6}). Given the recently discovered Neanderthal–Denisovan hybrid^{37}, we also considered a scenario in which this extinct hominin is an admixed population between Neanderthals and Denisovans (Model H)^{37}, a similar model of a trichotomy as will be seen below. Furthermore, Model E and Model F can be considered particular cases of Model H when the percentage of Denisovan admixture is 0 or 100%, respectively. In all the models, the basic tree structure remained constant although all the demographic parameters—including the amount of introgression were estimated.
We first evaluated the performance of the proposed ABCDL methodology for distinguishing between the proposed eight models. We used simulated data as pseudoobserved data and run the full ABCDL pipeline; we applied a ‘hard’ model classification to assign the simulated data to the model with the highest posterior probability (see Supplementary Note 2). The confusion matrix (Supplementary Table 6) summarizing the model classification shows that the proposed ABC approach identifies the model that generated the data (Model_{sim}) with posterior probability P(Model_{sim} = Model_{abc}  Data) > 50% in all the models. When comparing the eight considered models by ABCDL with the observed data, the ABC model inference supports model H (P(Model = HData) = 0.46), namely the introgression of genes from an archaic DenisovanNeanderthal admixed extinct population (Xe) into Asian AMH populations, before the split of Oceanian groups, closely followed by Model F (P(Model = FData) = 0.38), which models Xe as a sister population of Denisovans. Model H is slightly more likely than Model F (1.2 times more likely), and much more likely than models E (5 times more likely), G (7.6 times) and D (107 times). Models not considering the presence of an extinct archaic ‘ghost’ population introgressing AMH A, B, C and D obtained a posterior probability of ~0 (see Supplementary Figure 4). Therefore, only models with admixture of an extinct population in the Neanderthal–Denisovan clade, but with significant differences to each of them, are supported by the ABCDL framework.
Next, we implemented ABCDL to estimate the posterior distribution of each parameter for Model H (Fig. 3b and Supplementary Figure 5). Given that Model F is also similarly supported, for completion we also estimated the posterior distribution of the parameters in this model (Fig. 3a). As internal check of the performance of the DL in Model H, we first estimated the Pearson correlation between the parameter values used for the simulation and the predicted parameter value by the DL (Supplementary Table 7). For almost all the parameters from the Model H, we obtained a high correlation between the real parameter and the predicted one by the DL with the exception of the effective population sizes of the unknown ghost populations, which is a very acceptable expectation as we do not have direct information of the genetic variation in these populations. When applied to the observed data (Table 2 and Supplementary Table 8), our ABCDL approach for estimating parameters of Model H suggests that (i) the introgression of Neanderthals with AMH out of Africa was 1.3% (CI from 0.18% to 2.5%) and occurred 69 Kya (CI from 56 Kya to 88 Kya; assuming a generation time of 29 years^{38}), whereas the split between African and OOA populations took place 121 Kya (CI from 78 Kya to 167 Kya); (ii) the Denisovan introgression to Oceanian populations was 1.6% (CI from 0.4% to 2.5%) and took place 43 Kya (CI from 29 Kya to 50 Kya); finally, (iii) the estimated amount of introgression in the AMH ancestor of current Asian populations by the extinct archaic population Xe was 2.6% (CI from 0.7% to 4.6%) at 51 Kya (CI from 45 Kya to 58 Kya). According to our analyses, Xe appeared 304 Kya (CI from 211 Kya to 375 Kya) from an admixture event of Denisovans and Neanderthals that would have occurred only 14 Kya after the divergence of Denisovans and Neanderthals (314 Kya; CI from 300 Kya to 343 Kya). The estimated divergence between Denisovan and Neanderthals is relatively recent compared to the estimated in^{16} (381–473 Kya when considering a mutation rate of 0.5e09 per bp per year or 1.45e08 per generation assuming 29 years by generation) but higher than in^{16} (190–236 Kya when considering a mutation rate of 1e09 per bp per year or 2.9e08 per generation). Since the number of nonshared variants depends on the time of divergence, the genetic drift at each population and the mutation rate, a possible explanation for the differences in time of divergence could be the mutation rate, which substantially variate among studies. Demographic parameters estimated for Model F are similar in magnitude to the ones estimated in Model H, with credible intervals that overlap the ones obtained for Model H (Supplementary Table 9).
In order to see whether the proposed Model H fit observed Dstatistics, we generated simulations using the mean of the estimated posterior probability of each parameter, and computed Dstatistics assuming different population tree topologies for stressing the impact of archaic introgression in Asian populations. Simulation results from ABCDL for Model H replicate the empirical results of the Dstatistics (Table 1, last column) by reproducing the excess of Neanderthal and Denisova ancestry in Asian populations compared to Europeans.
Taken together, we have an overwhelming support for the existence of a third extinct branch of the Neanderthal–Denisovan clade; considering the credible interval of the time of split between the archaic populations, a model with a trichotomy is a good consensus for Model DH as in all the models the Xe separates from Neanderthal or Denisova at a similar time depth of Neanderthal and Denisova split (Supplementary Table 10).
Discussion
We have developed a new methodology for statistically comparing complex demographic models and estimating demographic parameters with focus on applying it to the complex demographic history of introgressions in Eurasia. This methodology takes advantage of the nonlinear capabilities of DL for identifying patterns and extracting features, and of the wellestablished statistical ABC framework for properly estimating the posterior distribution of models and parameters. Overall, the computational methods used here take full advantage of machine learning technologies through DL to improve the performance of the ABC statistical framework. This new approach overcomes some previous limitations for comparing demographic models, minimizes manual fitting and allows estimating parameters in complex demographic models.
The ABCDL analysis supports that all Asian and Oceanian populations, including East Asians, received an introgression from an unknown extinct archaic hominin population. However, the exact relationship of this extinct hominin population with the other known archaic populations is not completely disentangled. The two models mostly supported by our ABCDL model comparison suggest that Xe corresponds to one of the two sisterDenisovan lineages (Model F) or this hominin was closely related to the Neanderthals and Denisovans with a slightly greater similarity to Denisovans than to Neanderthals (Model H). In our analyses, Model H is slightly better supported than Model F. Nevertheless, the Bayes factor between Model H and F is so close to one that the difference in the posterior probability of the two models is hardly worth mentioning^{39}. Furthermore, in both models ABCDL parameter estimation suggests that Xe had a similarly long independent evolutionary history (Table 2, Supplementary Table 8, 9 and 10). Therefore, we can only claim with confidence that this population had separated early and close to the Neanderthal–Denisova split time. The consensus of the best models is represented by a trichotomy of the three branches. Given the results from the confusion matrix, supporting a similar posterior probability for identifying by our ABCDL approach the model that generated the simulation, it is unlikely that this model uncertainty is due to the relatively recent time of divergence that we estimated between Neanderthal and Denisovans. In contrast, this lack of confidence for distinguishing among F and H could be due to a limitation of using a wholegenomic SFS approach, which does not take into account additional information from the spatial distribution within the genome of introgressed regions (i.e. Sprime^{20}).
Independently of this limitation, both F and H models can explain some previously contradictory results. For example, they would explain the dearth of African ancestry in some populations (like Andamanese, Indian and Oceanian populations^{6} or TibetoBurman). The admixed Neanderthal–Denisova archaic ghost population of Model H that introgresses in Asia also explains previous results showing that Asian populations have a higher amount of Neanderthal and Denisovan introgression than Europeans^{12,16}. Furthermore, both most supported models agree with the conclusion recently reached by Browning et al.^{20} that proposed two Denisovanlike introgressions in Asian and Oceanian populations using a completely independent methodology. However, Browning et al.^{20} proposed that Altai Denisovanlike admixed with East Asians whereas the same sisterDenisovan population would have introgressed with Asian and Papuan populations. In our model comparisons we had not considered such model given that East Asian, South and SouthEast Asia populations showed a similar amount of Denisovan ancestry in our analyses; we considered a shared Denisovanlike ancestry from the same population source as a more parsimonious scenario than the described in^{20}. Similarly, Neanderthal dilution in Europeans is also suggested before^{40}. In our analyses, the only model considering Neanderthal dilution in Europeans (Model B) is not supported compared to models that consider introgression from unknown hominin. This result does not necessarily exclude Basal Eurasian admixture, as we have not considered more complex models considering Neanderthal dilution in Europeans and archaic introgression in Asian and Oceanians, which we disregarded a priori as less parsimonious than the considered ones. Nevertheless, we would like to point out that the less Neanderthal ancestry in European compared to Asians can be partially explained (if not all) by the Model F/H due to shared ancestry (and thus shared derived alleles) between all EEH populations.
In the same context, there are other demographic models that have not been considered here, since they are not relevant for our conclusions. The exact relationship of Oceanian populations with Eurasian populations is contested. Some studies^{7,11} suggest that European and East Asian have more recent ancestry than Oceanian populations. In contrast, other studies argued that East Asian and Oceanian populations have a recent ancestry^{6,8}. In the present study we have considered the later model as it is gaining recent support^{41,42}. Nevertheless, an early split of Australian and Papuan populations from Eurasians would not have directly affected the percentage of estimated Xn introgression in East Asia and South Asia but only the proportions of Denisovan introgression in Oceania. In fact, we observed that the percentage of Denisovan introgression in Oceania significantly drop to 1.6% (95% CI 0.4%, 2.5%) in Model H compared to previous estimates^{7,8}. However, taking into account the percentage of Xn introgression in Oceania, the total amount of Denisovanlike introgression is 4.2%, within the range of proposed values in Oceania^{7,8}. Similarly, it has been suggested that the introgressed EEH populations split off from the Altai and Denisovan samples (that we used as proxies in our analyses) quite early^{16}. In principle, this could affect the estimated percentage of Neanderthal introgression out of Africa and the Denisovan introgression in Oceania. Nevertheless, this sample bias cannot explain the relative increase of both Neanderthal and Denisova ancestry in Asian populations compared to Europeans (Table 1 and Fig. 1). The drift, which these EEH populations had faced after their separation of the Neanderthal and Denisovan populations from the proxy samples, would be at random and independent of the relative increase of derived alleles shared between Asian and EEH populations compared to Europeans. This suggests that we need at least one more introgression event to explain this differential increase. Moreover, we have not modelled an ancestral AMH population introgressed into Neanderthals and admixture between Neanderthal and Denisova^{14,43}, as this would affect all OOA populations equally and will have negligible effect on them, thus not affecting differences between Neanderthal and Denisova ancestry in Asians. Sankararaman et al.^{44} proposed a back migration from Oceanian populations to Asia to explain the different proportions of archaic ancestry. However, such a model implies that such Oceanian back migration had a similar genomic impact on populations across a vast region (Indian, Andamanese and East Asian), which seems highly unlikely. Similarly, it has been proposed that the Neanderthal ancestry is diminished over time by differential negative selection acting stronger in Europe. Although it is accepted that Neanderthal ancestry decreases over time in European populations most likely by negative selection^{45}, it is unclear why this process would not be similar in Asian populations^{46}. We also have not inspected a model where a population that is an outgroup of both human and Neanderthal–Denisova lineages has introgressed in Asian populations. A similar population has been hypothesized to have introgressed into Denisovans^{14,16}, but an introgression from such a population into Asians will not result in the observed Asian excess of Neanderthal and Denisova ancestry compared to Europeans as detected by Dstatistics and can be omitted from the analysis.
In conclusion, we have implemented a method that couples ABC with DL that allows to use all the data at the same time and properly comparing complex demographic models. Our novel analytical approach based on ABCDL supports a model of human evolution in which the OOA populations have had introgression not only from Neanderthals and Denisovans, but also from a third, related group, still genetically and archaeologically uncharacterized. However, the whole genome SFS approach that we used in our ABCDL implementation does not consider additional information—such as the length and nature of the introgressed fragments—that can be important for distinguishing between some of the models. Further development of the ABCDL framework will be required to include the fragment length of the ancestry of the genomic fragments. Similarly, new analyses and ancient genome sequences will be required to better understand the nature and consequences of this third related group in the genome of modern humans.
Methods
Samples and variant calling
We used already published and publicly available data from 1000 Genome Project 3rd Phase high coverage data^{47}, Simons Genome Diversity Project (SGDP)^{8} and Mondal et al. 2016^{6} (available at European Nucleotide Archive: PRJEB11455 and PRJEB16019) (see Supplementary Table 1). We downloaded four Europeans (EUR), four East Asians (ASN) and four Africans (AFR) BAM files from 1000 Genome Project; two Papuans and two Australian Aboriginal (PAC) sequences (FASTQ files) from SGDP data set; two Irula, two Birhor (IND); two Jarawa and two Onge (AND) bam files from our previous work^{6}. We also downloaded two Neanderthals (NEAN)^{13,16,43} and one Denisova (DENI)^{12} BAM files from their respective sites. In case the BAM files were not accessible, we mapped the FASTQ files ourselves to create BAM files. All the information on the samples and their respective origin is mentioned in Supplementary Table 1. The number of individuals were kept similar for EUR, ASN, AFR, IND, AND and PAC to avoid unwanted sample size bias for Dstatistics, F4ratio test and ABC analysis.
Dstatistics and F4 ratio test
We have used the mapability mask as suggested previously^{5} thus removing repeated regions (although without this filter the results are same). We have also used two Irulas from SGDP data set to map on the chimpanzee genome reference to exclude any biases from the reference genome (see also Supplementary Methods). We used Admixtools1.4^{23} to calculate Dstatistics and F4 ratio test. We developed a pipeline (Supplementary Figure 2) using the observed values of Dstatistics (EUR, AFR, NEAN, Ancestral), Dstatistics (ASN, AFR, NEAN, Ancestral), Dstatistics (PAC, AFR, DENI, Ancestral), Dstatistics (ASN, EUR, NEAN, Ancestral) and Dstatistics (ASN, EUR, DENI, Ancestral) to estimate the percentage of archaic introgression using simulated data generated with ms^{48} (Supplementary Table 4) in different simple demographic models (Supplementary Figure 1) while keeping all the other demographic parameters constant (Supplementary Table 3). In case of models with a single introgression parameter estimation, the optimize function from R was used; in case of two parameters optim function was used with LBFGSB methods with default parameters^{49}. Both optimize and optim functions are iterative methods for solving nonlinear optimization problems. These are used to find better solution for a given function (See Supplementary Note 1 for more details).
ABCDL method
In order to statistically test among competing demographic models and estimating the posterior distributions of the parameters of a given model, we developed an ABCDL approach using as raw summary statistic the SFS (see also Supplementary Note 2). For each population we considered two samples, except in the case of Neanderthal and Denisovan, where we only considered the Altai Neanderthal and Denisovan individuals. Simulations were generated using FastSimcoal2^{21}. Each simulation comprised simulating 9643 genomic regions, comprising 651 Mb, and one diploid individual per population using the demographic topology of one of the proposed trees (see Fig. 2). We used the option multiSFS of FastSimcoal2 to generate the SFS. We considered a generation time of 29 years per generation^{38}, a uniform recombination rate of 1.0e8 and a mutation rate of 1.61e8 ± 0.13e8^{50}. At each simulation and genomic region, we sampled the mutation rate μ assuming a normal distribution with mean 1.61e8 and standard deviation of 0.13e8. However, since the SNP density at each region depends on the fraction of considered fragments out of genes and CgP islands defined in^{51} (see Supplementary Note 2 for data preprocessing), we scaled the mutation rate μ_{r} of each region r by the fraction of included fragments (L_{c}) over the total length of the region (L_{r}):
In our implementations of the ABC algorithm, we considered an error threshold of 1000/100,000 (i.e. retaining the parameters/models of the 1000 simulations out of 100,000 showing the smallest error with the observed SS). For model comparison, we considered running a multinomial logistic regression on accepted simulations; for parameter estimation, a lineal local regression on accepted simulations. All of these analyses were conducted with the R package abc^{52}, using the script postpr with option mnlogistic for model comparison and the script abc with option loclinear for parameter estimation.
For the DL topology, we generated a supervised four layer feedforward DL network using Encog3.4^{53} (see Supplementary Note 2). An innovative aspect of our implementation of DL is the development of SFSlike noise injection. The latest hyperparameter is motivated by the fact that the DL is trained with data simulated from the proposed models of interest, whereas the real model that generated the observed genomic data is by definition more complex and (at best) partially overlaps/generalizes with (some) of the considered models. In this case, it can be expected that a trained DL with data from models different from the real one will produce biased SSDL with regards to the real data. In our study, we propose overcoming this problem by applying an innovative modified version of the noise injection algorithm^{4}. In the classical injection algorithm, white noise (i.e. from a normal distribution with mean 0 and a given standard deviation) is dynamically added at each iteration to the elements of the training data set in order to force model generalization. However, properly adding white noise requires independence between the input variables, which is not the case of the considered SS. In order to perform a noise injectionlike approach in the case of the SFS statistic, we took advantage that we have two individuals for each population—with the exception of Neanderthal and Denisovan. We split the observed data set in two, one comprising samples that will be used to add “observed datalike” noise to the simulated SFS, and a second one that will be used for the estimation of the posterior distributions using the ABC approach. At each iteration of the DL training algorithm, we define for the j cell of the SFS of simulation i from model s:
where α is sampled from a uniform distribution in the range [0,0.2] and r corresponds to the observed “real” data. Introducing noise from an independent observed data set offers an additional advantage. By analogy with image detection, we can imagine that the simulated training data corresponds to a sharp highresolution image of an object from the known simulated model/parameter, whereas the observed r data is at best an image showing one of the model/parameter objects at low resolution on a foggy environment (i.e. the parts of the model generating the data not considered in our models). This difference between the data used for training and the final observed data that we want to classify can introduce biases in the final model classification. By generating new SFS patterns by means of noise injection, increasing robustness in the network classification against model departures can be expected.
Code availability
We implemented a JAVAbased pipeline available at https://github.com/oscarlao/ABC_DL to (i) generate FastSimcoal2 input files from the prior distributions of each parameter and read the SFS from the simulation, (ii) implement feedforward DL using the JAVA Encog3.4^{53} and (iii) generate input files for ABC analyses with the R package abc^{52}. Details of the full implementation of the DL and ABCDL algorithm are provided in the Supplementary Note 2.
Reporting Summary
Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All the sequence data came from their respective publicly available data sources and all other downstream data analysis are available on request to the corresponding authors.
References
Schlebusch, C. M. et al. Southern African ancient genomes estimate modern human divergence to 350,000 to 260,000 years ago. Science 358, 652–655 (2017).
Skoglund, P. et al. Reconstructing prehistoric african population structure. Cell 171, 59–71.e21 (2017).
Soares, P. et al. Correcting for purifying selection: an improved human mitochondrial molecular clock. Am. J. Hum. Genet. 84, 740–759 (2009).
Poznik, G. D. et al. Punctuated bursts in human male demography inferred from 1,244 worldwide Ychromosome sequences. Nat. Genet. 48, 593–599 (2016).
Li, H. & Durbin, R. Inference of human population history from individual wholegenome sequences. Nature 475, 493–496 (2011).
Mondal, M. et al. Genomic analysis of Andamanese provides insights into ancient human migration into Asia and adaptation. Nat. Genet. 48, 1066–1070 (2016).
Malaspinas, A. S. et al. A genomic history of Aboriginal Australia. Nature 538, 207–214 (2016).
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
Rasmussen, M. et al. An aboriginal Australian genome reveals separate human dispersals into Asia. Science. 334, 94–98 (2011).
Grün, R. et al. Useries and ESR analyses of bones and teeth relating to the human burials from Skhul. J. Hum. Evol. 49, 316–334 (2005).
Pagani, L. et al. Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538, 238–242 (2016).
Meyer, M. et al. A highcoverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012).
Green, R. E. et al. A draft sequence of the neandertal genome. Science 328, 710–722 (2010).
Kuhlwilm, M. et al. Ancient gene flow from early modern humans into Eastern Neanderthals. Nature 530, 429–433 (2016).
Posth, C. et al. Deeply divergent archaic mitochondrial genome provides lower time boundary for African gene flow into Neanderthals. Nat. Commun. 8, 16046 (2017).
Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014).
Reich, D. et al. Genetic history of an archaic hominin group from Denisova cave in Siberia. Nature 468, 1053–1060 (2010).
Hsieh, P. H. et al. Modelbased analyses of wholegenome data reveal a complex evolutionary history involving archaic introgression in Central African Pygmies. Genome Res. 26, 291–300 (2016).
Lu, D. et al. Ancestral origins and genetic history of Tibetan highlanders. Am. J. Hum. Genet. 99, 580–594 (2016).
Browning, S. R., Browning, B. L., Zhou, Y., Tucci, S. & Akey, J. M. Analysis of human sequence data reveals two pulses of archaic denisovan admixture. Cell 173, 53–61.e9 (2018).
Excoffier, L., Dupanloup, I., HuertaSánchez, E., Sousa, V. C. & Foll, M. Robust demographic inference from genomic and SNP Data. PLoS Genet. 9, e1003905 (2013).
Prangle, D. in Handbook of Approximate Bayesian Computation (ABC) (eds Sisson, S. A., Fan, Y. & Beaumont, B.) 320 (Taylor & Francis, Abingdon, 2108).
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012).
Sunnåker, M. et al. Approximate Bayesian computation. PLoS Comput. Biol. 9, e1002803 (2013).
Bertorelle, G., Benazzo, A. & Mona, S. ABC as a flexible framework to estimate demography over space and time: Some cons, many pros. Mol. Ecol. 19, 2609–2625 (2010).
Jiang, B., Wu, T., Zheng, C. & Wong, W. Learning summary statistic for approximate Bayesian computation via deep neural network. Stat. Sin. 27, 1595–1618 (2017).
Sheehan, S. & Song, Y. S. Deep learning for population genetic inference. PLoS Comput. Biol. 12, e1004845 (2016).
Lecun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5, e1000695 (2009).
Lapierre, M., Lambert, A. & Achaz, G. Accuracy of demographic inferences from the site frequency spectrum: he case of the yoruba population. Genetics 206, 139–449 (2017).
Lukic, S. & Hey, J. Demographic inference using spectral methods on SNP data, with an analysis of the human outofAfrica expansion. Genetics 192, 619–639 (2012).
Boitard, S., Rodríguez, W., Jay, F., Mona, S. & Austerlitz, F. Inferring population size history from large samples of genomewide molecular data—an approximate bayesian computation approach. PLoS Genet. 12, e1005877 (2016).
Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
Vernot, B. & Akey, J. M. Complex history of admixture between modern humans and neandertals. Am. J. Hum. Genet. 96, 448–453 (2015).
Wall, J. D. et al. Higher levels of Neanderthal ancestry in east Asians than in Europeans. Genetics 194, 199–209 (2013).
Simonti, C. N. et al. The phenotypic legacy of admixture between modern humans and Neandertals. Science 351, 737–741 (2016).
Slon, V. et al. The genome of the offspring of a Neanderthal mother and a Denisovan father. Nature 561, 113–116 (2018).
Fenner, J. N. Crosscultural estimation of the human generation interval for use in geneticsbased population divergence studies. Am. J. Phys. Anthropol. 128, 415–423 (2005).
Gilks, W. R., Richardson, S. & Spiegelhalter, D. J. Markov Chain Monte Carlo in Practice. 1st edn., (CRC Press, Boca Raton, Florida, 1995).
Lazaridis, I. et al. Genomic insights into the origin of farming in the ancient Near East. Nature 536, 419–424 (2016).
Wall, J. D. Inferring human demographic histories of nonafrican populations from patterns of allele sharing. Am. J. Hum. Genet. 100, 766–772 (2017).
Narasimhan, V. M. et al. The Genomic Formation of South and Central Asia. Preprint at bioRxiv (2018). https://doi.org/10.1101/292581.
Prüfer, K. et al. A highcoverage Neandertal genome from Vindija Cave in Croatia. Science 358, 655–658 (2017).
Sankararaman, S., Mallick, S., Patterson, N. & Reich, D. The combined landscape of denisovan and neanderthal ancestry in presentday humans. Curr. Biol. 26, 1241–1247 (2016).
Fu, Q. et al. The genetic history of Ice Age Europe. Nature 534, 200–205 (2016).
Kim, B. Y. & Lohmueller, K. E. Selection and reduced population size cannot explain higher amounts of Neandertal ancestry in East Asian than in European human populations. Am. J. Hum. Genet. 96, 454–461 (2015).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Hudson, R. R. Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics 18, 337–338 (2002).
R Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, Vienna, 2017).
Lipson, M. et al. Calibrating the human mutation rate via ancestral recombination density in diploid genomes. PLoS Genet. 11, e1005550 (2015).
Bock, C., Walter, J., Paulsen, M. & Lengauer, T. CpG island mapping by epigenome prediction. PLoS Comput. Biol. 3, e110 (2007).
Csilléry, K., François, O. & Blum, M. G. B. Abc: An R package for approximate Bayesian computation (ABC). Methods Ecol. Evol. 3, 475–479 (2012).
Heaton, J. Encog: Library of interchangeable machine learning models for Java and C#. J. Mach. Learn. Res. 16, 1243−1247 (2015).
Acknowledgements
We thank Iago Maceda Porto for useful comments on the manual and testing of the JAVA suite. M.M was supported by the European Union through the European Regional Development Fund (Project No. 20142020.4.01.160030). For J.B, this study has been possible thanks to grant BFU201677961P (AEI/FEDER, UE) awarded by the Agencia Estatal de Investigación (MINECO, Spain) and with the support of Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2017 SGR 702). Part of the “Unidad de Excelencia María de Maeztu”, funded by the MINECO (ref: MDM20140370). O.L. was supported by a Ramón y Cajal grant from the Spanish Ministerio de Economia y Competitividad (MINECO) with reference RYC201314797, a BFU201568759P (MINECO/FEDER) grant and the support of Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2017 SGR 937). O.L. also acknowledges the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) to the EMBL partnership; Centro de Excelencia Severo Ochoa; CERCA Programme/Generalitat de Catalunya; the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) through the Instituto de Salud Carlos III; Generalitat de Catalunya through Departament de Salut and Departament d’Empresa i Coneixement; the cofinancing by the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) with funds from the European Regional Development Fund (ERDF) corresponding to the 20142020 Smart Growth Operating Program.
Author information
Authors and Affiliations
Contributions
M.M., J.B. and O.L. conceived the study. M.M. and O.L. performed analyses. M.M., J.B. and O.L. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mondal, M., Bertranpetit, J. & Lao, O. Approximate Bayesian computation with deep learning supports a third archaic introgression in Asia and Oceania. Nat Commun 10, 246 (2019). https://doi.org/10.1038/s41467018080897
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467018080897
This article is cited by

Harnessing deep learning for population genetic inference
Nature Reviews Genetics (2023)

Congruent evolutionary responses of European steppe biota to late Quaternary climate change
Nature Communications (2022)

Widespread Denisovan ancestry in Island Southeast Asia but no evidence of substantial superarchaic hominin admixture
Nature Ecology & Evolution (2021)

Refining models of archaic admixture in Eurasia with ArchaicSeeker 2.0
Nature Communications (2021)

Modeling colorectal cancer evolution
Journal of Human Genetics (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.