Temporal mapping of derived high-frequency gene variants supports the mosaic nature of the evolution of Homo sapiens

Large-scale estimations of the time of emergence of variants are essential to examine hypotheses concerning human evolution with precision. Using an open repository of genetic variant age estimations, we offer here a temporal evaluation of various evolutionarily relevant datasets, such as Homo sapiens-specific variants, high-frequency variants found in genetic windows under positive selection, introgressed variants from extinct human species, as well as putative regulatory variants specific to various brain regions. We find a recurrent bimodal distribution of high-frequency variants, but also evidence for specific enrichments of gene categories in distinct time windows, pointing to different periods of phenotypic changes, resulting in a mosaic. With a temporal classification of genetic mutations in hand, we then applied a machine learning tool to predict what genes have changed more in certain time windows, and which tissues these genes may have impacted more. Overall, we provide a fine-grained temporal mapping of derived variants in Homo sapiens that helps to illuminate the intricate evolutionary history of our species.

The past decade has seen a significant shift in our understanding of the evolution of our lineage. We now recognize that anatomical features used as diagnostic for our species (globular neurocranium, small, retracted face, presence of a chin, narrow trunk, to cite only a few of the most salient traits associated with "anatomical modernity") did not emerge as a package, from a single geographical location, but rather emerged gradually, in a mosaic-like fashion across the entire African continent and quite possibly beyond [1][2][3] . Likewise, behavioral characteristics once thought to be exclusive of Homo sapiens (funerary rituals, parietal art, 'symbolic' artefacts, etc.) have recently been attested in some form in closely related (extinct) clades, casting doubt on a simple definition of 'cognitive/behavioral' modernity 4 . We have also come to appreciate the extent of repeated (multidirectional) gene flow between Homo sapiens and Neanderthals and Denisovans, raising interesting questions about speciation [5][6][7][8] . Last, but not least, it is now well established that our species has a long history. Robust genetic analyses 9 indicate a divergence time between us and other hominins for whom genomes are available of roughly 700kya, leaving perhaps as many as 500ky between then and the earliest fossils displaying a near-complete suite of modern traits (Omo Kibish 1, Herto 1 and 2) 10 . Such a long period of time is likely to contain enough opportunities for multiple rounds of evolutionary modifications. Taken together, these findings render completely implausible simplistic narratives about the 'modern human condition' that seek to identify a specific geographical location or genetic mutation that would 'define' us 11 .
Genomic analysis of ancient human remains in Africa reveal deep population splits and complex admixture patterns among populations [12][13][14] . At the same time, reanalysis of fossils in Africa 15 points to the extended presence of multiple hominins on this continent, together with real possibilities of admixture 16,17 . Lastly, our deeper understanding of other hominins points to derived characteristics in these lineages that make some of our species' traits more ancestral (less 'modern') than previously believed 18  www.nature.com/scientificreports/ In the context of this significant rewriting of our deep history, we decided to explore the temporal structure of an extended catalog of single nucleotide changes found at high frequency (HF ≥ 90%) across major modern populations we previously generated on the basis of 3 high-coverage "archaic" genomes 19 , that is, Neanderthal/ Denisovan individuals, used as outgroups. This catalog aims to offer a richer picture of molecular events setting us apart from our closest extinct relatives. In order to probe the temporal nature of this data, we took advantage of the Genealogical Estimation of Variant Age (GEVA) tool 20 . GEVA is a coalescence-based method that provides age estimates for over 45 million human variants. GEVA is non-parametric, making no assumptions about demographic history, tree shapes, or selection (for additional details on GEVA, see "Methods"). Our overall objective here is to use the temporal resolution afforded by GEVA to estimate the age of emergence of polymorphic sites, and gain further insights into the complex evolutionary trajectory of our species.
Our analysis reveals a bimodal temporal distribution of modern human derived high-frequency variants and provides insights into milestones of Homo sapiens evolution through the investigation of the molecular correlates and the predicted impact of variants across evolutionary-relevant periods. Our chronological atlas allows us to provide a time window estimate of introgression events and evaluate the age of variants associated with signals of positive selection, tissue-specific changes, and specifically an estimate of the age of emergence of (enhancer) regulatory variants associated with different brain regions. Our enrichment analysis uncovers GOterms unique to specific temporal windows, such as facial and behavioral-related terms for a period (between 300 and 500 k years) preceding the dating of human fossils like that of Jebel Irhoud. Our machine learning-based analyses predicting differential gene expression regulation of mapped variants (through 21 ) reveals a trend towards downregulation in brain-related tissues and allowed us to identify variant-associated genes whose differential regulation may specifically affect brain structures such as the cerebellum.

Results
The distribution of derived alleles over time follows a bimodal distribution (Fig. 1a,b; see also Fig. S2 for a more elaborated version), with a global maximum around 40 kya (for complete allele counts, see "Methods"). The two modes of the distribution of HF variants likely correspond to two periods of significance in the evolutionary history of Homo sapiens. The more recent peak of HF variants arguably corresponds to the period of population dispersal and replacement following the last major out of Africa event 22,23 , while the older distribution contains the period associated with the divergence between Homo sapiens and other Homo species 9,24 .
In order to divide the data into smaller temporal clusters for downstream analysis we considered a k-means clustering analysis (at k = 3 and k = 4 , Fig. S1). This clustering method yields a division clear enough to distinguish between "early" and "late" Homo sapiens "specimens" 10 , with a protracted period overlapping with the split with other Homo species. (The availability of ancient DNA from other hominins would yield a better resolution of that period.) However, we reasoned that such a k-means division is not precise enough to represent key milestones used to test specific time-sensitive hypotheses. For this reason, we adopted a literature-based approach, establishing different cutoffs adapted to the need of each analysis below. Our basic division consisted of three

Variant subset distributions.
In an attempt to see if specific subsets of variants clustered in different ways over the inferred time axis, we selected a series of evolutionary relevant sets of data publicly available, such as genome regions depleted of "archaic" introgression (so-called 'deserts of introgression') 27,28 , and regions under putative positive selection 29 , and mapped the HF variants from 19 falling within those regions. We also examined genes that accumulate more HF variants than expected given their length and in comparison to the number of mutations these genes accumulate on the Neanderthal/Denisovan lineages ('length' and 'excess' lists from 19 -see "Methods"). Finally, we also examined the temporal distribution of introgressed alleles 27,30 . A bimodal distribution is clearly visible in all the subsets except the introgression datasets (Fig. 2b). Introgressed variants peak locally in the more recent period (0-100 kya). The distribution roughly fades after 250 kya, in consonance with the possible timing of introgression events 6,16,28,31 . As a case study, we focused on those introgressed variants associated with phenotypes highlighted in Table 1 of 32 . As shown in Fig. S3, half of the variants cluster around the highest peak, but other variants may have been introduced in earlier instances of gene flow. We caution, though, that multiple (likely) factors, such as gene flow from Eurasians into Africa, or effects of positive selection affecting frequency, influence the distribution of age estimates and make it hard to draw any firm conclusions. We also note that the two introgressed variant counts, derived from the data of 27,30 , follow a significantly different distribution over time ( p < 2.2-16, Kolmogorov-Smirnov test) (Fig. 2c). www.nature.com/scientificreports/ Finally, we examined the distribution of putatively introgressed variants across populations, focusing on lowfrequency variants whose distributions vary when we look at African vs. non-African populations (Fig. S4). As expected, those variants that are more common in non-African populations are found in higher proportions in both of the Neanderthal genomes studied here, with a slightly higher proportion for the Vindija genome, which is in fact assumed to be closer to the main source population of introgression 33 . We detect a smaller contribution of Denisovan variants overall, which is expected on several grounds: given the likely more frequent interactions between modern humans and Neanderthals, the Denisovan individual whose genome we relied on is likely part of a more pronounced "outgroup". Gene flow from modern humans into Neanderthals also likely contributed to this pattern.
In the case of the regions under putative positive selection, we find that the distribution of variant counts has a local peak in the most recent period (0-100 kya) that is absent from the deserts of introgression datasets, pointing to an earlier origin of alleles found in these latter regions. Also, as shown in Fig. 2d, the distribution of variant counts in these regions under selection shows the greatest difference between the two peaks of the bimodal distribution. Still, we should stress that our focus here is on HF variants, and that of course, not all HF variants falling in selective sweep regions were actual targets of selection. Figure S5 illustrates this point for two genes that have figured prominently in early discussions of selective sweeps since 5 : RUNX2 and GLI3. While recent HF variants are associated with positive selection signals (indicated in purple), older variants exhibit such associations as well. Indeed some of these targets may fall below the 90% cutoff chosen in 19 . In addition, we are aware that variants enter the genome at one stage and are likely selected for at a (much) later stage 34,35 . As such our study differs from the chronological atlas of natural selection in our species presented in 36 (as well as from other studies focusing on more recent periods of our evolutionary history, such as 37 ). This may explain some important discrepancies between the overall temporal profile of genes highlighted in 36 and the distribution of HF variants for these genes in our data (Fig. S6).
Having said this, our analysis recaptures earlier observations about prominent selected variants, located around the most recent peak, concerning genes such as CADPS2 38 (Fig. S7). This study also identifies a set of old variants, well before 300kya, associated with genes belonging to putative positively-selected regions before the deepest divergence of Homo sapiens populations 39 , such as LPHN3, FBXW7, and COG5 (Fig. S8).
Finally, focusing on the brain as the organ that may help explain key features of the rich behavioral repertoire associated with Homo sapiens, we estimated the age of putative regulatory variants linked to the prefrontal (PFC), temporal (TC), and cerebellar cortices (CBC), using the large scale characterization of regulatory elements of the human brain provided by the PsychENCODE Consortium 40 . We did the same for the modern human HF missense mutations 19 . A comparative plot reveals a similar pattern between the three structures, with no obvious differences in variant distribution (see Fig. S9). The cerebellum contains a slightly higher number of variants assigned to the more recent peak when the proportion to total mapped variants is computed. This may relate to the more recent modifications reported for this brain region 41 , which contributed to the globularized shape of our brain(case). We also note that the difference of dated variants between the two local maxima is more pronounced in the case of the cerebellum than in the case of the two cortical tissues, whereas this difference is more reduced in the case of missense variants (Fig. S9). We caution, though, that the overall number of missense variants is considerably lower in comparison to the other three datasets.   Table 1, linked to DAAM1, has been identified as a putative causal variant affecting the precuneus. All these brain structures have been independently argued to have undergone recent evolution in our lineage 41,[48][49][50] , and their associated variants are dated amongst the most recent ones in the table.

Discussion
Deploying GEVA to probe the temporal structure of the extended catalog of HF variants distinguishing modern humans from their closest extinct relatives ultimately aims to contribute to the goals of the emerging attempts to construct a molecular archaeology 52 and as detailed a map as possible of the evolutionary history of our species 53 . Like any other archaeology dataset, ours is necessarily fragmentary. In particular, fully fixed mutations, which have featured prominently in early attempts to identify candidates with important functional consequences 52 , fell outside the scope of this study, as GEVA can only determine the age of polymorphic mutations in the presentday human population. By contrast, the mapping of HF variants was reasonably good, and allowed us to provide complementary evidence for claims regarding important stages in the evolution of our lineage. This in and of itself reinforces the rationale of paying close attention to an extended catalog of HF variants, as argued in 19 .
While we wait for more genomes from more diverse regions of the planet and from a wider range of time points, we find our results encouraging: even in the absence of genomes from the deep past of our species in Africa, we were able to provide evidence for different epochs and classes of variants that define these. But whereas different clusters can be identified, the emerging picture is very much mosaic-like in its character, in consonance with recent work 1,3 . In no way do we find evidence for earlier evolutionary narratives that relied on one or a handful of key mutations.
Our analysis shows a bimodal distribution of the age of modern human-derived high-frequency variants (in consonance with the findings of 54 on a more limited set of variants ). The two peaks likely reflect, on the one hand, the point of divergence between Homo sapiens and other Homo species and, on the other, the period of population dispersal and replacement following the last major out of Africa event.
Our work also highlights the importance of a temporal window right before 300 ky that may well correspond to a significant behavioral shift in our lineage, such as increased ecological resource variability 55 , and evidence of long-distance stone transport and pigment use 56 . Other aspects of our cognitive and anatomical make up emerged much more recently, in the last 150 k years, and for these our analysis points to the relevance of gene expression regulation differences in recent human evolution, in line with [57][58][59] .
Lastly, our attempt to date the emergence of mutations in our genomes points to multiple episodes of introgression, whose history is likely to turn out to be quite complex.

Methods
Homo sapiens variant catalog. We made use of a publicly available dataset 19 that takes advantage of the Neanderthal and Denisovan genomes to compile a genome-wide catalog of Homo sapiens-specific variation. The original complete dataset is available at https:// doi. org/ 10. 6084/ m9. figsh are. 81840 38. As described in the original article, this catalog includes "archaic"-specific variants and all loci showing variation within modern populations. The 1000 genomes project and ExAc data were used to derive frequencies and the human genome version hg19 as reference. As indicated in the original publication 19 , quality filters in the "archaic" genomes were applied (specifically: sites with less 5-fold coverage and more than 105-fold coverage for the Altai individual, or 75-fold coverage for the rest of "archaic" individuals were filtered out). In ambiguous cases, variant ancestrality was determined using multiple genome aligments 60 and the macaque reference sequence (rheMac3) 61 .
In addition to the full data, the authors offered a subset of the data that includes derived variants at a ≥ 90% global frequency cutoff. Since such a cutoff allows some variants to reach less than 90% in certain populations, as long as the total is ≥ 90%, we also considered including a metapopulation-wide variant ≥ 90% frequency cutoff dataset to this study (Fig. S2). All files (including the original full and high-frequency sets and the modified, Table 1. Big40 Brain volume GWAS 46 top hits with high predicted gene expression in ExPecto ( log > 0.01 , RPKM), along with dating as provided by GEVA. 'Functional connectivity' is a measure of temporal activity synchronization between brain parcels at rest (originally defined in 51 ). www.nature.com/scientificreports/ stricter high-frequency one) are provided in the accompanying code. Controls in 1 were obtained through a probabilistic permutation approach with sets of random variants (100 sets, 50,000 variants each).

GEVA. The Genealogical Estimation of Variant Age (GEVA) tool 20 uses a hidden Markov model approach to
infer the location of ancestral haplotypes relative to a given variant. It then infers time to the most recent ancestor in multiple pairwise comparisons by coalescent-based clock models. The resulting pairwise information is combined in a posterior probability measure of variant age. We extracted dating information for the alleles of our dataset from the bulk summary information of GEVA age predictions. The GEVA tool provides several clock models and measures for variant age. We chose the mean age measure from the joint clock model, that combines recombination and mutation estimates. While the GEVA dataset provides data for the 1000 genomes project and the Simons Genome Diversity Project, we chose to extract only those variants that were present in both datasets. Ensuring a variant is present in both databases implicitly increases genealogical estimates (as detailed in Supplementary document 3 of 20 ), although it decreases the amount of sites that can be looked at. We give estimated dates after assuming 29 years per generation, as suggested in 62 . While other measures can be chosen, this value should not affect the nature of the variant age distribution nor our conclusions.
Out of a total of 4,437,804 for our total set of variants, 2,294,023 where mapped in the GEVA dataset (51% of the original total). For the HF subsets, the mapping improves: 101,417 (74% of total) and 48,424 (69%) variants were mapped for the original high-frequency subset and the stricter, meta-population cutoff version, respectively.
ExPecto. In order to predict gene expression we made use of the ExPecto tool 21 . ExPecto is a deep convolutional network framework that predicts tissue-specific gene expression directly from genetic sequences. ExPecto is trained on histone mark, transcription factor and DNA accessibility profiles, allowing ab initio prediction that does not rely on variant information training. Sequence-based approaches, such as the one used by Expecto, allow to predict the expression of high-frequency and rare alleles without the biases that other frameworks based on variant information might introduce. We introduced the high-frequency dated variants as input for ExPecto expression prediction, using the default tissue training models trained on the GTEx, Roadmap genomics and ENCODE tissue expression profiles.
gProfiler2. Enrichment analysis was performed using gProfiler2 package 42 (hypergeometric test; multiple comparison correction, 'gSCS' method; p values 0.01 and 0.05). Dated variants were subdivided in three time windows (0-300 kya, 300-500 kya and 500 kya-1 mya) and variant-associated genes (retrieved from 19 ) were used as input (all annotated genes for H. sapiens in the Ensembl database were used as background). Following 21 , variation potential directionality scores were calculated as the sum of all variant effects in a range of 1 kb from the TSS. Summary GO figures presented in Fig. S11 were prepared with GO Figure 63 .
For enrichment analysis, the Hallmark curated annotated sets 64 were also consulted, but the dated set of HF variants as a whole did not return any specific enrichment.