Since anatomically modern humans dispersed Out of Africa, the evolutionary history of Eurasian populations has been marked by introgressions from presently extinct hominins. Some of these introgressions have been identified using sequenced ancient genomes (Neanderthal and Denisova). Other introgressions have been proposed for still unidentified groups using the genetic diversity present in current human populations. We built a demographic model based on deep learning in an Approximate Bayesian Computation framework to infer the evolutionary history of Eurasian populations including past introgression events in Out of Africa populations fitting the current genetic evidence. In addition to the reported Neanderthal and Denisovan introgressions, our results support a third introgression in all Asian and Oceanian populations from an archaic population. This population is either related to the Neanderthal-Denisova clade or diverged early from the Denisova lineage. We propose the use of deep learning methods for clarifying situations with high complexity in evolutionary genomics.
All modern humans are genetically related to each other at a time depth of up to 300 thousand years ago (Kya)1,2 and share a common African root3,4,5. The migratory routes used by anatomically modern humans (AMH) after the African diaspora and aspects of the interbreeding between AMH and presently extinct hominins living at the time in Eurasia (here referred as Eurasian Extinct Hominins, EEH) are still under debate. Recently, several genetic studies have argued that there was only one main Out of Africa (OOA) event6,7,8 that happened less than 100 Kya, with earlier dispersals, which most likely does not contribute much to present-day variation9,10,11. Sequencing of ancient Neanderthal and Denisovan fossils supported introgression events into AMH out of Africa;12,13 however, recent studies also support the presence of gene flow from AMH into Neanderthals14,15, thus suggesting a complex hominin evolution.
A complex demographic model considering archaic introgressions out of Africa is needed to properly address two main points. First, the number of introgression events from Neanderthal and Denisova into AMH, which remains controversial, and their impact on the AMH gene pool. A difficult point is to account for the higher amount of Neanderthal introgression in Asia than in Europe8. Moreover, the geographic limits of Denisovan introgression are not well established14,16. Interestingly, Australian Aborigines also harbour high amounts of EEH due to several introgression events7,9,12. Although the evidence for Denisovan introgression in Oceanian populations is clear, the amount of introgression varies between 3 and 6% depending on the methodology14,16,17. Also, there is disagreement about the number of introgression events necessary to explain patterns of admixture in modern Australian Aborigines7,8,11. The second main point is to identify all the cryptic ghost archaic populations that have introgressed with hominins with known genomes (modern humans from the Old World, Neanderthals and Denisovans). It has been suggested that other archaic hominins could have interbred with Denisovans14,16. Furthermore, introgression of extinct hominins beyond those of Neanderthals and Denisovans into some AMH populations, both within Africa2,18, as well as in South Asia6 and Tibetan populations19 has been proposed. Recently it has been also found that there could have existed two different pulses of Denisovan admixture in East Asia20.
However, comparing the likelihood of competing complex demographic models given the observed data is cumbersome21, requires manual expertise22 and/or fitting23. Consequently, all of these different demographic models have not been tested together using the same methodology. Furthermore, it would be interesting to have an estimation of the parameters that define a demographic model given the data. However, this is not available in commonly applied algorithms (i.e. Admixtools23). Within this context, Approximate Bayesian Computation (ABC) is a flexible statistical framework that allows estimating the posterior distribution of a parameter/model through the generation of simulated datasets for cases when there is no close-form of the likelihood of the function24. ABC is widely used in the field of population genetics25. However, identifying the proper transformations of the data to generate informative summary statistics (SS) is model and parameter dependent, and including non-informative redundant SS can jeopardize the performance of the ABC approach22. Jiang et al.26 recently proposed a theoretical improvement of the ABC framework by considering Deep Learning (DL) for generating informative SS from a set of (raw) SS. DL is a type of Artificial Neural Network (ANN) topology based on multiple hidden neural layers, and it has been used for population genetics inference in non-ABC frameworks27. Since ANN performs a nonlinear transformation of the input features towards the output28, DL provides an optimal way of extracting and maximizing the non-redundant information out of the raw SS26. Thus, a DL can be trained using a large number of raw SS as input features to predict the parameters that generated the simulated data, or to classify a simulated data in a set of models, and use this classification prediction or parameter estimation as a new SS (SS-DL).
In the present study, we considered the multidimensional absolute site frequency spectrum (SFS) as raw SS. SFS accounts for the number of single nucleotide polymorphisms (SNPs) present in a particular combination of derived alleles in the sampled populations29. This statistic is the core of many frequency based statistics applied in population genetics for distinguishing among competing topological trees and detecting archaic introgression (i.e. D-statistics, F4, F4 ratio etc23.), determining migration rates, time of split and/or differences in the effective population sizes in models of two populations that split and migrate (i.e. Fst29). Moreover, the SFS is not affected by recombination30. SFS has been shown to allow robust demographic estimates21,29,31 and the folded SFS has been previously used in classical ABC frameworks for estimating the population size32. However, SFS is high dimensional and it combinatorically increases on the considered number of samples and populations29. Our proposed methodology takes advantage of the nonlinear nature of the DL for compressing the SFS to extract the underlying patterns defined by all the dimensions of the SFS that allow distinguishing among proposed models or obtaining the value of the demographic parameters of interest. This SS-DL is then used in an ABC framework.
We apply our newly developed method in a large set of whole genome sequences of present and ancient remains to estimate the posterior probability of competing complex demographic models with a focus on introgression events. We find that ABC-DL is powerful for distinguishing between plausible introgression models and propose a third archaic introgression event happened (in addition to Neanderthal introgression to all OOA populations and Denisova introgression for Oceanian population), which is common to all Asia and Oceanian populations.
D-statistics and F4 ratio analysis
D-statistics and F4 ratio statistics, tests of frequent use in population genetics, have consistently demonstrated the existence of Neanderthal and Denisovan ancestry in out of African populations16. Using a common variant calling (see Methods and also Supplementary Methods), we found with D-statistics and F4 ratio test that all Asian populations (East Asians, Andamanese and Indian Tribal populations) have higher amounts of Neanderthal ( > 1%) and Denisovan ancestry ( > 1%) compared to Europeans (Table 1). Interestingly, when trying to calculate the excess of Neanderthal ancestry in Asian populations compared to Europeans using the F4 ratio test, different outgroups (ancestral alleles mainly constituted from the chimpanzee reference genomes33 and Denisovan12) gave different results (Fig. 1). This was unexpected, as the F4 ratio test and D-statistics should not be affected by changing the outgroup under the simplistic topology used in these tests. When using the ancestral alleles from the 1000 Genomes as outgroup33, the excess of Neanderthal ancestry in Asian populations is around 0.5% (s.e. 0.15%) and statistically significant (Z score > 3). In contrast, when using Denisovan as outgroup, this estimate diminishes and loses statistical significance (~0% ± 0.15%). We then used simulations considering different demographic models (Supplementary Figure 1), varying the percentage of introgression while keeping constant all the other parameters (Supplementary Table 3), to fit the observed D-statistics and F4 ratio statistics. The simulation results suggest that a simple model for the excess of Neanderthal and Denisova ancestry in Asian populations is not enough for fitting these statistics and other ancestral component is needed (Supplementary Figure 3). Moreover, our analyses suggest that D-statistics and F4 ratio are sensitive to the outgroup (Denisova or chimpanzee) used at least for archaic introgression model. In Oceania, results get more complex to interpret due to the derived allele sharing between Neanderthals and Denisovans. Given the possible biases due to the reference genome (mostly of European origin) used to calculate D-statistics, we re-mapped all the genomes to the chimpanzee reference and recalculated the results (Supplementary Table 2). Although most of the Neanderthal and Denisovan ancestry calculations remained the same, we note a decrease of African ancestry by changing the reference from human to chimpanzee in Indian Tribal and Oceanian populations. Overall, these results show a complex situation that needs a broader approach to distinguish among competing demographic models.
In order to quantify the posterior probability of different demographic models given the observed genetic diversity in current populations and archaic individuals, which cannot be achieved by the D-statistics and F4 ratio analyses, we applied an ABC-DL analysis (see Methods and Supplementary Note 2).
We considered several plausible competing demographic models which describe scenarios that have been proposed by different studies to explain the observed genetic diversity of introgression patterns among current old world AMH populations, Neanderthals and Denisovans (see Fig. 2 and Supplementary Table 5). The implemented demographic models are variations from an initial model with many accepted features (Model A): OOA origin of modern humans, with a Eurasian split between Europeans and the group comprising two subgroups, East Asians, Indian and Andamanese on one hand, and Papuans and Australians on the other. Introgression of Neanderthals is found in all OOA populations, and that of Denisovans in the Oceanian populations. Introgression from an extinct group of distant hominins into Denisovans14,16 has been considered in all models, and this introgression does not affect directly our core results. From this basic model, several variations have been proposed to better fit the genetic data; here we consider the following. Model B: decrease of European Neanderthal ancestry due to admixture with a modern human, “basal Eurasian”, a ghost population non-introgressed by Neanderthals (Xn)34. Model C: two introgression events from Neanderthals, the first affecting all OOA populations and the second only affecting Asia35,36. Model D: two introgression events both from Neanderthals and Denisovans, with a Denisovan introgression on Asians before the split of Oceanian populations14,16. The remaining models have a single introgression of Neanderthals and Denisovans (as in Model A) but with an introgression to Asian and Oceanian populations from an extinct (Xe) hominin population that is a sister group of Neanderthals (Model E), or is a sister group of Denisova (Model F;20), or is an outgroup of both Neanderthals and Denisovans (Model G;6). Given the recently discovered Neanderthal–Denisovan hybrid37, we also considered a scenario in which this extinct hominin is an admixed population between Neanderthals and Denisovans (Model H)37, a similar model of a trichotomy as will be seen below. Furthermore, Model E and Model F can be considered particular cases of Model H when the percentage of Denisovan admixture is 0 or 100%, respectively. In all the models, the basic tree structure remained constant although all the demographic parameters—including the amount of introgression- were estimated.
We first evaluated the performance of the proposed ABC-DL methodology for distinguishing between the proposed eight models. We used simulated data as pseudo-observed data and run the full ABC-DL pipeline; we applied a ‘hard’ model classification to assign the simulated data to the model with the highest posterior probability (see Supplementary Note 2). The confusion matrix (Supplementary Table 6) summarizing the model classification shows that the proposed ABC approach identifies the model that generated the data (Modelsim) with posterior probability P(Modelsim = Modelabc | Data) > 50% in all the models. When comparing the eight considered models by ABC-DL with the observed data, the ABC model inference supports model H (P(Model = H|Data) = 0.46), namely the introgression of genes from an archaic Denisovan-Neanderthal admixed extinct population (Xe) into Asian AMH populations, before the split of Oceanian groups, closely followed by Model F (P(Model = F|Data) = 0.38), which models Xe as a sister population of Denisovans. Model H is slightly more likely than Model F (1.2 times more likely), and much more likely than models E (5 times more likely), G (7.6 times) and D (107 times). Models not considering the presence of an extinct archaic ‘ghost’ population introgressing AMH -A, B, C and D- obtained a posterior probability of ~0 (see Supplementary Figure 4). Therefore, only models with admixture of an extinct population in the Neanderthal–Denisovan clade, but with significant differences to each of them, are supported by the ABC-DL framework.
Next, we implemented ABC-DL to estimate the posterior distribution of each parameter for Model H (Fig. 3b and Supplementary Figure 5). Given that Model F is also similarly supported, for completion we also estimated the posterior distribution of the parameters in this model (Fig. 3a). As internal check of the performance of the DL in Model H, we first estimated the Pearson correlation between the parameter values used for the simulation and the predicted parameter value by the DL (Supplementary Table 7). For almost all the parameters from the Model H, we obtained a high correlation between the real parameter and the predicted one by the DL with the exception of the effective population sizes of the unknown ghost populations, which is a very acceptable expectation as we do not have direct information of the genetic variation in these populations. When applied to the observed data (Table 2 and Supplementary Table 8), our ABC-DL approach for estimating parameters of Model H suggests that (i) the introgression of Neanderthals with AMH out of Africa was 1.3% (CI from 0.18% to 2.5%) and occurred 69 Kya (CI from 56 Kya to 88 Kya; assuming a generation time of 29 years38), whereas the split between African and OOA populations took place 121 Kya (CI from 78 Kya to 167 Kya); (ii) the Denisovan introgression to Oceanian populations was 1.6% (CI from 0.4% to 2.5%) and took place 43 Kya (CI from 29 Kya to 50 Kya); finally, (iii) the estimated amount of introgression in the AMH ancestor of current Asian populations by the extinct archaic population Xe was 2.6% (CI from 0.7% to 4.6%) at 51 Kya (CI from 45 Kya to 58 Kya). According to our analyses, Xe appeared 304 Kya (CI from 211 Kya to 375 Kya) from an admixture event of Denisovans and Neanderthals that would have occurred only 14 Kya after the divergence of Denisovans and Neanderthals (314 Kya; CI from 300 Kya to 343 Kya). The estimated divergence between Denisovan and Neanderthals is relatively recent compared to the estimated in16 (381–473 Kya when considering a mutation rate of 0.5e-09 per bp per year or 1.45e-08 per generation assuming 29 years by generation) but higher than in16 (190–236 Kya when considering a mutation rate of 1e-09 per bp per year or 2.9e-08 per generation). Since the number of non-shared variants depends on the time of divergence, the genetic drift at each population and the mutation rate, a possible explanation for the differences in time of divergence could be the mutation rate, which substantially variate among studies. Demographic parameters estimated for Model F are similar in magnitude to the ones estimated in Model H, with credible intervals that overlap the ones obtained for Model H (Supplementary Table 9).
In order to see whether the proposed Model H fit observed D-statistics, we generated simulations using the mean of the estimated posterior probability of each parameter, and computed D-statistics assuming different population tree topologies for stressing the impact of archaic introgression in Asian populations. Simulation results from ABC-DL for Model H replicate the empirical results of the D-statistics (Table 1, last column) by reproducing the excess of Neanderthal and Denisova ancestry in Asian populations compared to Europeans.
Taken together, we have an overwhelming support for the existence of a third extinct branch of the Neanderthal–Denisovan clade; considering the credible interval of the time of split between the archaic populations, a model with a trichotomy is a good consensus for Model D-H as in all the models the Xe separates from Neanderthal or Denisova at a similar time depth of Neanderthal and Denisova split (Supplementary Table 10).
We have developed a new methodology for statistically comparing complex demographic models and estimating demographic parameters with focus on applying it to the complex demographic history of introgressions in Eurasia. This methodology takes advantage of the nonlinear capabilities of DL for identifying patterns and extracting features, and of the well-established statistical ABC framework for properly estimating the posterior distribution of models and parameters. Overall, the computational methods used here take full advantage of machine learning technologies through DL to improve the performance of the ABC statistical framework. This new approach overcomes some previous limitations for comparing demographic models, minimizes manual fitting and allows estimating parameters in complex demographic models.
The ABC-DL analysis supports that all Asian and Oceanian populations, including East Asians, received an introgression from an unknown extinct archaic hominin population. However, the exact relationship of this extinct hominin population with the other known archaic populations is not completely disentangled. The two models mostly supported by our ABC-DL model comparison suggest that Xe corresponds to one of the two sister-Denisovan lineages (Model F) or this hominin was closely related to the Neanderthals and Denisovans with a slightly greater similarity to Denisovans than to Neanderthals (Model H). In our analyses, Model H is slightly better supported than Model F. Nevertheless, the Bayes factor between Model H and F is so close to one that the difference in the posterior probability of the two models is hardly worth mentioning39. Furthermore, in both models ABC-DL parameter estimation suggests that Xe had a similarly long independent evolutionary history (Table 2, Supplementary Table 8, 9 and 10). Therefore, we can only claim with confidence that this population had separated early and close to the Neanderthal–Denisova split time. The consensus of the best models is represented by a trichotomy of the three branches. Given the results from the confusion matrix, supporting a similar posterior probability for identifying by our ABC-DL approach the model that generated the simulation, it is unlikely that this model uncertainty is due to the relatively recent time of divergence that we estimated between Neanderthal and Denisovans. In contrast, this lack of confidence for distinguishing among F and H could be due to a limitation of using a whole-genomic SFS approach, which does not take into account additional information from the spatial distribution within the genome of introgressed regions (i.e. Sprime20).
Independently of this limitation, both F and H models can explain some previously contradictory results. For example, they would explain the dearth of African ancestry in some populations (like Andamanese, Indian and Oceanian populations6 or Tibeto-Burman). The admixed Neanderthal–Denisova archaic ghost population of Model H that introgresses in Asia also explains previous results showing that Asian populations have a higher amount of Neanderthal and Denisovan introgression than Europeans12,16. Furthermore, both most supported models agree with the conclusion recently reached by Browning et al.20 that proposed two Denisovan-like introgressions in Asian and Oceanian populations using a completely independent methodology. However, Browning et al.20 proposed that Altai Denisovan-like admixed with East Asians whereas the same sister-Denisovan population would have introgressed with Asian and Papuan populations. In our model comparisons we had not considered such model given that East Asian, South and South-East Asia populations showed a similar amount of Denisovan ancestry in our analyses; we considered a shared Denisovan-like ancestry from the same population source as a more parsimonious scenario than the described in20. Similarly, Neanderthal dilution in Europeans is also suggested before40. In our analyses, the only model considering Neanderthal dilution in Europeans (Model B) is not supported compared to models that consider introgression from unknown hominin. This result does not necessarily exclude Basal Eurasian admixture, as we have not considered more complex models considering Neanderthal dilution in Europeans and archaic introgression in Asian and Oceanians, which we disregarded a priori as less parsimonious than the considered ones. Nevertheless, we would like to point out that the less Neanderthal ancestry in European compared to Asians can be partially explained (if not all) by the Model F/H due to shared ancestry (and thus shared derived alleles) between all EEH populations.
In the same context, there are other demographic models that have not been considered here, since they are not relevant for our conclusions. The exact relationship of Oceanian populations with Eurasian populations is contested. Some studies7,11 suggest that European and East Asian have more recent ancestry than Oceanian populations. In contrast, other studies argued that East Asian and Oceanian populations have a recent ancestry6,8. In the present study we have considered the later model as it is gaining recent support41,42. Nevertheless, an early split of Australian and Papuan populations from Eurasians would not have directly affected the percentage of estimated Xn introgression in East Asia and South Asia but only the proportions of Denisovan introgression in Oceania. In fact, we observed that the percentage of Denisovan introgression in Oceania significantly drop to 1.6% (95% CI 0.4%, 2.5%) in Model H compared to previous estimates7,8. However, taking into account the percentage of Xn introgression in Oceania, the total amount of Denisovan-like introgression is 4.2%, within the range of proposed values in Oceania7,8. Similarly, it has been suggested that the introgressed EEH populations split off from the Altai and Denisovan samples (that we used as proxies in our analyses) quite early16. In principle, this could affect the estimated percentage of Neanderthal introgression out of Africa and the Denisovan introgression in Oceania. Nevertheless, this sample bias cannot explain the relative increase of both Neanderthal and Denisova ancestry in Asian populations compared to Europeans (Table 1 and Fig. 1). The drift, which these EEH populations had faced after their separation of the Neanderthal and Denisovan populations from the proxy samples, would be at random and independent of the relative increase of derived alleles shared between Asian and EEH populations compared to Europeans. This suggests that we need at least one more introgression event to explain this differential increase. Moreover, we have not modelled an ancestral AMH population introgressed into Neanderthals and admixture between Neanderthal and Denisova14,43, as this would affect all OOA populations equally and will have negligible effect on them, thus not affecting differences between Neanderthal and Denisova ancestry in Asians. Sankararaman et al.44 proposed a back migration from Oceanian populations to Asia to explain the different proportions of archaic ancestry. However, such a model implies that such Oceanian back migration had a similar genomic impact on populations across a vast region (Indian, Andamanese and East Asian), which seems highly unlikely. Similarly, it has been proposed that the Neanderthal ancestry is diminished over time by differential negative selection acting stronger in Europe. Although it is accepted that Neanderthal ancestry decreases over time in European populations most likely by negative selection45, it is unclear why this process would not be similar in Asian populations46. We also have not inspected a model where a population that is an outgroup of both human and Neanderthal–Denisova lineages has introgressed in Asian populations. A similar population has been hypothesized to have introgressed into Denisovans14,16, but an introgression from such a population into Asians will not result in the observed Asian excess of Neanderthal and Denisova ancestry compared to Europeans as detected by D-statistics and can be omitted from the analysis.
In conclusion, we have implemented a method that couples ABC with DL that allows to use all the data at the same time and properly comparing complex demographic models. Our novel analytical approach based on ABC-DL supports a model of human evolution in which the OOA populations have had introgression not only from Neanderthals and Denisovans, but also from a third, related group, still genetically and archaeologically uncharacterized. However, the whole genome SFS approach that we used in our ABC-DL implementation does not consider additional information—such as the length and nature of the introgressed fragments—that can be important for distinguishing between some of the models. Further development of the ABC-DL framework will be required to include the fragment length of the ancestry of the genomic fragments. Similarly, new analyses and ancient genome sequences will be required to better understand the nature and consequences of this third related group in the genome of modern humans.
Samples and variant calling
We used already published and publicly available data from 1000 Genome Project 3rd Phase high coverage data47, Simons Genome Diversity Project (SGDP)8 and Mondal et al. 20166 (available at European Nucleotide Archive: PRJEB11455 and PRJEB16019) (see Supplementary Table 1). We downloaded four Europeans (EUR), four East Asians (ASN) and four Africans (AFR) BAM files from 1000 Genome Project; two Papuans and two Australian Aboriginal (PAC) sequences (FASTQ files) from SGDP data set; two Irula, two Birhor (IND); two Jarawa and two Onge (AND) bam files from our previous work6. We also downloaded two Neanderthals (NEAN)13,16,43 and one Denisova (DENI)12 BAM files from their respective sites. In case the BAM files were not accessible, we mapped the FASTQ files ourselves to create BAM files. All the information on the samples and their respective origin is mentioned in Supplementary Table 1. The number of individuals were kept similar for EUR, ASN, AFR, IND, AND and PAC to avoid unwanted sample size bias for D-statistics, F4-ratio test and ABC analysis.
D-statistics and F4 ratio test
We have used the mapability mask as suggested previously5 thus removing repeated regions (although without this filter the results are same). We have also used two Irulas from SGDP data set to map on the chimpanzee genome reference to exclude any biases from the reference genome (see also Supplementary Methods). We used Admixtools1.423 to calculate D-statistics and F4 ratio test. We developed a pipeline (Supplementary Figure 2) using the observed values of D-statistics (EUR, AFR, NEAN, Ancestral), D-statistics (ASN, AFR, NEAN, Ancestral), D-statistics (PAC, AFR, DENI, Ancestral), D-statistics (ASN, EUR, NEAN, Ancestral) and D-statistics (ASN, EUR, DENI, Ancestral) to estimate the percentage of archaic introgression using simulated data generated with ms48 (Supplementary Table 4) in different simple demographic models (Supplementary Figure 1) while keeping all the other demographic parameters constant (Supplementary Table 3). In case of models with a single introgression parameter estimation, the optimize function from R was used; in case of two parameters optim function was used with L-BFGS-B methods with default parameters49. Both optimize and optim functions are iterative methods for solving nonlinear optimization problems. These are used to find better solution for a given function (See Supplementary Note 1 for more details).
In order to statistically test among competing demographic models and estimating the posterior distributions of the parameters of a given model, we developed an ABC-DL approach using as raw summary statistic the SFS (see also Supplementary Note 2). For each population we considered two samples, except in the case of Neanderthal and Denisovan, where we only considered the Altai Neanderthal and Denisovan individuals. Simulations were generated using FastSimcoal221. Each simulation comprised simulating 9643 genomic regions, comprising 651 Mb, and one diploid individual per population using the demographic topology of one of the proposed trees (see Fig. 2). We used the option --multiSFS of FastSimcoal2 to generate the SFS. We considered a generation time of 29 years per generation38, a uniform recombination rate of 1.0e-8 and a mutation rate of 1.61e-8 ± 0.13e-850. At each simulation and genomic region, we sampled the mutation rate μ assuming a normal distribution with mean 1.61e-8 and standard deviation of 0.13e-8. However, since the SNP density at each region depends on the fraction of considered fragments out of genes and CgP islands defined in51 (see Supplementary Note 2 for data preprocessing), we scaled the mutation rate μr of each region r by the fraction of included fragments (Lc) over the total length of the region (Lr):
In our implementations of the ABC algorithm, we considered an error threshold of 1000/100,000 (i.e. retaining the parameters/models of the 1000 simulations out of 100,000 showing the smallest error with the observed SS). For model comparison, we considered running a multinomial logistic regression on accepted simulations; for parameter estimation, a lineal local regression on accepted simulations. All of these analyses were conducted with the R package abc52, using the script postpr with option mnlogistic for model comparison and the script abc with option loclinear for parameter estimation.
For the DL topology, we generated a supervised four layer feedforward DL network using Encog3.453 (see Supplementary Note 2). An innovative aspect of our implementation of DL is the development of SFS-like noise injection. The latest hyper-parameter is motivated by the fact that the DL is trained with data simulated from the proposed models of interest, whereas the real model that generated the observed genomic data is -by definition- more complex and (at best) partially overlaps/generalizes with (some) of the considered models. In this case, it can be expected that a trained DL with data from models different from the real one will produce biased SS-DL with regards to the real data. In our study, we propose overcoming this problem by applying an innovative modified version of the noise injection algorithm4. In the classical injection algorithm, white noise (i.e. from a normal distribution with mean 0 and a given standard deviation) is dynamically added at each iteration to the elements of the training data set in order to force model generalization. However, properly adding white noise requires independence between the input variables, which is not the case of the considered SS. In order to perform a noise injection-like approach in the case of the SFS statistic, we took advantage that we have two individuals for each population—with the exception of Neanderthal and Denisovan. We split the observed data set in two, one comprising samples that will be used to add “observed data-like” noise to the simulated SFS, and a second one that will be used for the estimation of the posterior distributions using the ABC approach. At each iteration of the DL training algorithm, we define for the j cell of the SFS of simulation i from model s:
where α is sampled from a uniform distribution in the range [0,0.2] and r corresponds to the observed “real” data. Introducing noise from an independent observed data set offers an additional advantage. By analogy with image detection, we can imagine that the simulated training data corresponds to a sharp high-resolution image of an object from the known simulated model/parameter, whereas the observed r data is at best an image showing one of the model/parameter objects at low resolution on a foggy environment (i.e. the parts of the model generating the data not considered in our models). This difference between the data used for training and the final observed data that we want to classify can introduce biases in the final model classification. By generating new SFS patterns by means of noise injection, increasing robustness in the network classification against model departures can be expected.
We implemented a JAVA-based pipeline available at https://github.com/oscarlao/ABC_DL to (i) generate FastSimcoal2 input files from the prior distributions of each parameter and read the SFS from the simulation, (ii) implement feedforward DL using the JAVA Encog3.453 and (iii) generate input files for ABC analyses with the R package abc52. Details of the full implementation of the DL and ABC-DL algorithm are provided in the Supplementary Note 2.
Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.
All the sequence data came from their respective publicly available data sources and all other downstream data analysis are available on request to the corresponding authors.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank Iago Maceda Porto for useful comments on the manual and testing of the JAVA suite. M.M was supported by the European Union through the European Regional Development Fund (Project No. 2014-2020.4.01.16-0030). For J.B, this study has been possible thanks to grant BFU2016-77961-P (AEI/FEDER, UE) awarded by the Agencia Estatal de Investigación (MINECO, Spain) and with the support of Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2017 SGR 702). Part of the “Unidad de Excelencia María de Maeztu”, funded by the MINECO (ref: MDM-2014-0370). O.L. was supported by a Ramón y Cajal grant from the Spanish Ministerio de Economia y Competitividad (MINECO) with reference RYC-2013-14797, a BFU2015-68759-P (MINECO/FEDER) grant and the support of Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2017 SGR 937). O.L. also acknowledges the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) to the EMBL partnership; Centro de Excelencia Severo Ochoa; CERCA Programme/Generalitat de Catalunya; the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) through the Instituto de Salud Carlos III; Generalitat de Catalunya through Departament de Salut and Departament d’Empresa i Coneixement; the co-financing by the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) with funds from the European Regional Development Fund (ERDF) corresponding to the 2014-2020 Smart Growth Operating Program.