Latin Americans show wide-spread Converso ancestry and imprint of local Native ancestry on physical appearance

Historical records and genetic analyses indicate that Latin Americans trace their ancestry mainly to the intermixing (admixture) of Native Americans, Europeans and Sub-Saharan Africans. Using novel haplotype-based methods, here we infer sub-continental ancestry in over 6,500 Latin Americans and evaluate the impact of regional ancestry variation on physical appearance. We find that Native American ancestry components in Latin Americans correspond geographically to the present-day genetic structure of Native groups, and that sources of non-Native ancestry, and admixture timings, match documented migratory flows. We also detect South/East Mediterranean ancestry across Latin America, probably stemming mostly from the clandestine colonial migration of Christian converts of non-European origin (Conversos). Furthermore, we find that ancestry related to highland (Central Andean) versus lowland (Mapuche) Natives is associated with variation in facial features, particularly nose morphology, and detect significant differences in allele frequencies between these groups at loci previously associated with nose morphology in this sample.


Supplementary Figure 1. Birthplace of the CANDELA individuals included in this study
Circles are centred on birthplace with size proportional to the number of individuals born at that location (scale provided on the left). A total of 6,589 individuals across five countries are shown. 6

Supplementary Figure 2. Approximate geographic location of the 117 reference population samples used in this study
Populations have been color-coded as: blue (38 Native American), red (42 European), yellow (15 East/South Mediterranean), green (15 Sub-Saharan African) and purple (7 East Asians). Numbers inside the dots correspond to those used in Supplementary Table 1 with additional information on these samples.

Supplementary Figure 3. Tree relating the 56 surrogate clusters defined by fineSTRUCTURE and retained for ancestry inference
Brackets on the right indicate the 35 groups of clusters displayed in Figure 1.

Supplementary Figure 4. Percentage of Native American ancestry sub-components for 377 Brazilians with >5% Native American ancestry, as inferred by SOURCEFIND
The figure was made with ggplot2. The centre line corresponds to the median, the bounds of the box represent the first (Q1) and third (Q3) quartiles, and the whiskers extend approximately to Q1 - 1.5*Inter-Quartile Range and Q3 + 1.5*Inter-Quartile Range. Outlying points are plotted individually.

Supplementary Figure 5. Geographic distribution of Sub-Saharan African ancestry subcomponents in CANDELA individuals, as inferred by SOURCEFIND
This pie map follows the same convention of Figure 2. Each pie represents an individual, with pie location corresponding to birthplace. Since many individuals share birthplace, jittering has been performed based on pie size and how crowded an area is. Pie size is proportional to total ancestry from all sources depicted in that specific figure, and only individuals with >5% of such total ancestry are shown. Coloring of pies represents the proportion of each sub-continental component estimated for each individual (color-coded as in Fig. 1).

Supplementary Figure 6. Differences in Sub-Saharan African sub-continental ancestry between former Portuguese and Spanish colonies
Average sub-continental ancestry proportion for the 1,472 individuals with >5% Sub-Saharan African ancestry in Brazil and the four Spanish American countries sampled (Chile, Colombia, Mexico and Peru) as inferred by SOURCEFIND.

Supplementary Figure 7. Geographic distribution of East Asian ancestry subcomponents in CANDELA individuals, as inferred by SOURCEFIND
This pie map follows the same convention of Figure 2. Each pie represents an individual, with pie location corresponding to birthplace. Since many individuals share birthplace, jittering has been performed based on pie size and how crowded an area is. Pie size is proportional to total ancestry from all sources depicted in that specific figure, and only individuals with >5% of such total ancestry are shown. Coloring of pies represents the proportion of each sub-continental component estimated for each individual (color-coded as in Fig. 1).

Supplementary Figure 10. Differences in allele frequencies between Mapuche and Central Andean populations
-log10(p-values) from a two-sample t-test across all SNPs, testing whether each SNP's allele frequency in inferred Mapuche segments is the same as that in inferred Central Andean segments. Dashed blue lines denote the six GWAS hit SNPs associated with facial features, and red lines the mean across these.

Supplementary Figure 11. Average -log10(p-values) for the differences in allele frequencies between Mapuche and Central Andean populations across sets of six randomly selected SNPs
10,000 sets of six randomly selected SNPs that match the number of inferred Mapuche and Central Andean segments at the six GWAS hit SNPs, and also match the minor allele frequency of either the inferred (A) Central Andean or (B) Mapuche segments at the six GWAS hit SNPs. Dashed red lines give the average -log10(p-value) across the six GWAS hit SNPs, and the p-value gives the proportion of random samples with average -log10(p-values) greater than the dashed red line.
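The resampling scheme in this caption can be illustrated with a minimal Python sketch. It is not the code used in the study: the function name `empirical_p_value` and its inputs are hypothetical, and it assumes that the per-SNP -log10(p-values) and the lists of matched SNPs have already been computed elsewhere.

```python
import random

def empirical_p_value(observed_scores, candidate_scores_per_hit,
                      n_sets=10_000, seed=0):
    """Proportion of random matched sets whose mean score exceeds the observed mean.

    observed_scores: -log10(p-value) at each GWAS hit SNP.
    candidate_scores_per_hit: for each GWAS hit, the -log10(p-values) of the
        SNPs matched to it (hypothetical input; matching is done upstream).
    """
    rng = random.Random(seed)
    observed_mean = sum(observed_scores) / len(observed_scores)
    exceed = 0
    for _ in range(n_sets):
        # Draw one matched SNP per GWAS hit to build one random set.
        draw = [rng.choice(candidates) for candidates in candidate_scores_per_hit]
        if sum(draw) / len(draw) > observed_mean:
            exceed += 1
    return exceed / n_sets
```

A small returned value indicates that the GWAS hit SNPs are, on average, more differentiated than matched random SNPs.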

Supplementary Figure 12. Global overview of the analysis strategy
Yellow trapezoids represent the data used and/or generated, blue rectangles the implemented analyses/approaches and green diamonds the results obtained from the different analyses/approaches.

Supplementary Figure 13. Overview of the analysis strategy for the definition of homogeneous clusters of reference population individuals
Yellow trapezoids represent the data used, blue rectangles the implemented analyses/approaches and green diamonds the results obtained from the different analyses/approaches.

Supplementary Figure 14. Example CANDELA individual for which GLOBETROTTER infers admixture involving three sources at about the same time
Dots are the inferred relative probabilities that a pair of DNA segments in this individual are inherited from: (A) Native American (NAM) and European (EUR) sources, (B) NAM and Sub-Saharan African (SSA) sources, (C) EUR and SSA sources, as a function of the genetic distance between the DNA segments. The fitted exponential curves result in an estimated admixture date of 11 generations ago.

129
Argentina.3(7/13)+Argentina.4(2/2) 9 34
fS Clust: Cluster assigned by fineSTRUCTURE. Decision: Some reference samples were used only as donors for the subsequent ancestry inference. Others are also used as surrogates for the ancestral populations in SOURCEFIND analyses. Some were removed from the reference set. Donor/Surrogate: This is the final grouping used for generating the copying vectors used for the sub-continental ancestry analyses. Groups without the Out.* prefix are the ones selected as surrogates as described in Supplementary Table 3.

Number of SNPs matching the MAF (within <1%) of the given population at each GWAS hit SNP, and matching the counts (within <20) of inferred Native haplotypes assigned to each of the Mapuche and Central Andes populations. SNPs were randomly selected from these matching SNPs to provide 10,000 sets of 6 SNPs, one matching each GWAS hit SNP, in order to provide an empirical p-value (last column) testing whether the allele frequencies at the 6 GWAS hit SNPs were more differentiated, on average, between our assigned Central Andes and Mapuche haplotypes than is typical.

Supplementary Note 1. ADMIXTURE and Principal Components Analysis (PCA)
The merged dataset was pruned to select SNPs not in Linkage Disequilibrium (LD) using PLINK v1.9 with the option --indep 50 5 2, resulting in 150,858 SNPs being retained. Supervised ADMIXTURE analyses were carried out in order to obtain continental ancestry estimates independent from those obtained with SOURCEFIND. For this, the same reference individuals included in the SOURCEFIND analyses were grouped into continental reference groups, considering three scenarios:
We also applied unsupervised ADMIXTURE and PCA to the same dataset, providing results for K = 2-10 clusters for ADMIXTURE (Supplementary Fig. 8) and 10 PCs for PCA (Supplementary Fig. 9). Below we describe some major features from these analyses, which are relevant for the discussion of the SOURCEFIND results.
ADMIXTURE analysis (Supplementary Fig. 8) at K = 3 detects three major continental ancestry components, reaching 100% frequency in certain European, Native American and Sub-Saharan African reference populations. CANDELA individuals show highly variable proportions of these three components.
At K = 4, another major continental ancestry component is inferred, reaching 100% frequency in certain East Asians. This Asian component is found at low frequency in Native Mexicans and in CANDELA samples from Mexico, possibly reflecting a closer genetic affinity of Natives from Mexico to East Asians, compared to Native Americans further South, as has been inferred from other analyses 6.
At K = 6 a component is seen at high frequency in the Colombian CANDELA data. Comparing this profile with results for the same CANDELA samples at K = 5, it is apparent that this component corresponds mainly to the inferred European ancestry and possibly reflects extensive drift in the Colombians (the PCA results described below also provide suggestive evidence for this interpretation).
At K = 7 a third Native American component is observed, reaching 100% frequency in the Chilean Mapuche Natives. This component reaches high frequency in Chilean CANDELA samples.
At K = 8 a minor component specific to Sub-Saharan Africans is detected, which reaches highest frequency in East African samples.
At K = 9 a component reaching frequencies close to 100% in North East Europe is observed. This component shows a gradient of decreasing frequency from North West Europe to Iberia. It is also observed in the CANDELA samples, reaching highest frequency in Brazilians.
At K = 10 a component reaching maximum frequencies in Western Europeans is detected, distinct from a component seen mostly in Southern Europe but which reaches maximum frequencies in East/South Mediterraneans. These two components are detected at variable frequencies in the CANDELA samples.
Of the 10 total components at K = 10, six reach frequencies close to 100% in certain reference population groups (from East Asia, Sub-Saharan Africa, North East Europe, the Andes, Mesoamerica and the Mapuche). Of these, two have a close correspondence with components defined by the SOURCEFIND analyses (the Mapuche and North East European components). The other three components detected by ADMIXTURE in the reference data are further subdivided by the fineSTRUCTURE analyses. In addition, most ADMIXTURE components show gradients in frequency across many reference samples, which are recognized as distinct population clusters in the SOURCEFIND analyses (Fig. 1).
Some basic observations from the PCA analyses (Supplementary Fig. 9) are as follows: PC1-PC3 represent axes of differentiation between continental populations (PC1 distinguishing Africans from Non-Africans, PC2 Europeans from Native Americans and PC3 East Asians from Native Americans). The CANDELA individuals are spread out mostly along the European-Native American axis, consistent with their mostly Native American-European admixture. Certain CANDELA individuals also show evidence of some African or East Asian ancestry.
PCs from PC4 onwards detect sub-continental axes of genetic differentiation. PC4 detects genetic variation within Africa, distinguishing West Africans from South Africans.
PC5-PC7 represent axes of differentiation between Native Americans (PC5 corresponding to a Mexican-Southern Chilean Natives axis, PC6 to a Mapuche-Chibchan axis and PC7 to a Mapuche-Central Andean axis). The CANDELA samples place themselves along these axes of Native American variation in accordance with the Natives from the corresponding geographic region. These observations illustrate how Native American population structure is being reflected in the CANDELA samples, a pattern standing out even more clearly in the SOURCEFIND results shown in Figure 1.
Interestingly, the Colombian samples are placed somewhat at an offset along the Chibchan (i.e. Native Colombians) axis, consistent with the component detected by ADMIXTURE at K = 6 representing a case of drift specific to the Colombian sample.
PC8 corresponds to an axis of South/East Mediterranean-North East European differentiation, PC9 to an axis of West African-East African differentiation, and PC10 to an axis of Japan-China/Vietnam differentiation.
Overall, on the first 10 PCs, PCA revealed four axes of continental and six axes of sub-continental genetic differentiation, placing CANDELA individuals along some of these axes. From PC11 onwards, there are no discernible patterns of genetic structure.
Simulations were performed modelling the admixture in Latin America in order to assess the robustness and accuracy of sub-continental ancestry estimations (NNLS and SOURCEFIND) as well as the estimated dates of admixture (GLOBETROTTER). Since the precision of sub-continental ancestry estimates is affected by the relatedness of surrogate clusters, and their level of genetic drift, these simulations also allowed us to explore which sub-continental ancestries cannot be reliably distinguished. Subsets of some of the 56 surrogate clusters were used to generate simulated admixed individuals following the procedures described in e.g.

The SOURCEFIND approach described in Methods is computationally expensive, due in part to having to run 50 independent runs in order to sample the parameter space effectively (as assessed by our simulations). Therefore, for some analyses reported here we used an alternative, more computationally efficient version of SOURCEFIND that uses the same likelihood function, but which removes and replaces the prior on the surrogate contribution values with a truncated Poisson (mean=3) prior on the number of contributing surrogates S'. At each MCMC iteration, this alternative SOURCEFIND allows at most S' surrogates to have a contribution > 0, and restricts the contribution of each of these S' surrogates to the values 0.01,…,1 in increments of 0.01.

The proposed move at each MCMC iteration is as follows. Some percentage X<100 of the current contribution from a randomly chosen surrogate group s (currently contributing > 0) is distributed across the other currently included surrogates. (This set of other included surrogates contains up to S' members, with new randomly chosen surrogates added if the total number of surrogates is less than S'.) With probability 0.1, X = 100 * (the current contribution of s); otherwise, with probability 0.9, X is instead chosen from a Binomial(n,q) distribution with number of draws n = 100 * (the current contribution of s) and probability of each draw q = 0.5. Then with probability 0.5, X/100 is added to the contribution of a single other surrogate; otherwise it is distributed randomly across the other currently included surrogates. This proposal is then accepted or rejected using a Metropolis-Hastings step.

Here we used S'=6 and performed 100,000 total MCMC iterations, sampling posterior values of the surrogate contributions every 5000 iterations after discarding the initial 50,000 iterations as burn-in. This approach ran much more quickly and gave qualitatively similar conclusions in applications to simulated and non-simulated data, as described in this section and Supplementary Note 6.
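The proposal move described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SOURCEFIND source code: contributions are stored as integer percentages summing to 100 (matching the 0.01 grid), the function name `propose_move` is hypothetical, "distributed randomly" is interpreted as assigning each percentage point to a uniformly chosen recipient, and the Metropolis-Hastings acceptance step (which requires the likelihood) and the prior on S' are omitted.

```python
import random

def propose_move(contrib, all_surrogates, max_surrogates=6, rng=random):
    """Propose new contributions.

    contrib: dict surrogate -> integer percent (> 0), summing to 100.
    all_surrogates: every surrogate that may be included.
    """
    active = [s for s, c in contrib.items() if c > 0]
    donor = rng.choice(active)
    recipients = [s for s in active if s != donor]
    # Assumption: pad with new random surrogates up to max_surrogates in total.
    unused = [s for s in all_surrogates if contrib.get(s, 0) == 0]
    rng.shuffle(unused)
    recipients += unused[: max(0, max_surrogates - len(active))]
    if not recipients:
        return dict(contrib)

    c = contrib[donor]
    if rng.random() < 0.1:
        x = c  # move the donor's whole contribution
    else:
        x = sum(rng.random() < 0.5 for _ in range(c))  # Binomial(c, 0.5) draw

    proposal = {s: v for s, v in contrib.items() if v > 0}
    proposal[donor] = c - x
    if rng.random() < 0.5:
        # Give everything to one other surrogate.
        target = rng.choice(recipients)
        proposal[target] = proposal.get(target, 0) + x
    else:
        # Spread the moved amount point by point across the recipients.
        for _ in range(x):
            target = rng.choice(recipients)
            proposal[target] = proposal.get(target, 0) + 1
    return {s: v for s, v in proposal.items() if v > 0}
```

Working in integer percentage points keeps every proposed value on the 0.01 grid and guarantees that the contributions still sum to 1 after each move.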
In particular, in each of simulations (i)-(iv) described below, we provide plots illustrating the accuracy of both the initial SOURCEFIND version (called SOURCEFIND1 in this section) and the computationally efficient version of SOURCEFIND (called SOURCEFIND2). For these simulations, accuracy is only very slightly reduced when using SOURCEFIND2 relative to SOURCEFIND1. Regarding computation time, each run of the initial SOURCEFIND version on a single CANDELA individual took ~10 minutes with 200K MCMC iterations, hence 10x50=500 minutes for 50 independent runs. In contrast, a single run of the more computationally efficient version took ~25 seconds with 100K MCMC iterations. We note that additional independent runs of the computationally efficient version may improve performance, while reducing the gains in computation time (e.g. doing 50 independent runs would make it only ~20x faster than the initial SOURCEFIND). The computationally efficient version of SOURCEFIND is available at www.paintmychromosomes.com.
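The timing comparison above reduces to simple arithmetic, sketched here in Python with the figures quoted in the text as defaults (the function name `speedup` is hypothetical).

```python
def speedup(minutes_per_run_v1=10, runs_v1=50, seconds_per_run_v2=25, runs_v2=1):
    """Ratio of total run time: initial SOURCEFIND / efficient SOURCEFIND."""
    total_v1_seconds = minutes_per_run_v1 * 60 * runs_v1
    total_v2_seconds = seconds_per_run_v2 * runs_v2
    return total_v1_seconds / total_v2_seconds

single_run_gain = speedup()            # one run of the efficient version
fifty_run_gain = speedup(runs_v2=50)   # 50 runs of the efficient version
```

With these inputs the 50-run comparison gives a factor of 24, in line with the approximate ~20x quoted above.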

Simulations to assess the accuracy of sub-continental ancestry estimates
For each set of simulations in this section, we generated 100 simulated individuals as mixtures of three surrogate clusters intermixing 15 generations ago. From the clusters selected for the simulations, we used less than half the individuals in a cluster to simulate admixed individuals. The remaining individuals in a cluster were used for the CHROMOPAINTER/SOURCEFIND inference. Simulations were as described in Price et al. 2009 8 and assume a model of instantaneous admixture followed by random mating among the admixed individuals. Briefly, each simulated haploid genome consists of a mosaic of blocks, with each block of size M (in Morgans) sampled from an exponential distribution (of rate=15). For each block, the SNP data exactly matched that of a randomly sampled haplotype from one of the surrogate clusters, with the probabilities for selecting a haplotype from each of the three surrogate clusters specified by the admixture proportions being simulated as indicated below. This random selection process was repeated independently for each block. Two haploid genomes were randomly combined to generate each simulated diploid individual.
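The block-sampling step described above can be sketched in Python. This is a simplified illustration rather than the simulation code used here: it records only the source cluster of each block, whereas the actual procedure copies the SNP data of a randomly sampled haplotype from that cluster. The function name `simulate_mosaic` and the chromosome length argument are assumptions made for the example.

```python
import random

def simulate_mosaic(chrom_length_morgans, proportions, generations=15, rng=random):
    """Return (start, end, source) blocks covering one haploid chromosome.

    proportions: dict source cluster -> simulated admixture proportion.
    """
    sources = list(proportions)
    weights = [proportions[s] for s in sources]
    blocks, pos = [], 0.0
    while pos < chrom_length_morgans:
        # Block size in Morgans, exponential with rate = generations since admixture.
        length = rng.expovariate(generations)
        end = min(pos + length, chrom_length_morgans)
        # The source of each block is chosen independently, by admixture proportion.
        source = rng.choices(sources, weights=weights)[0]
        blocks.append((pos, end, source))
        pos = end
    return blocks
```

For simulation (i), for example, `proportions` would be `{"CentralSouthSpain": 0.4, "NorthWestEurope2": 0.3, "SouthMexico1": 0.3}`, and two such haploid mosaics would be paired to form one diploid individual.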
SOURCEFIND1 analyses were performed with 20 independent runs using 200,000 iterations each run as described in Methods, with SOURCEFIND2 analyses run as described in the previous section. NNLS was performed using the procedure encoded in GLOBETROTTER described in Hellenthal et al. 2014 7, which uses the non-negative linear least squares function (nnls) in R. As with the real data analysis, for each SOURCEFIND1 run the results with the highest posterior probability were chosen, and inferred ancestry proportions were averaged across the 20 runs using this probability as a weight. We note that accuracy of both NNLS and SOURCEFIND depends in part on the number of individuals used in each surrogate group, so that removing ~30% of the individuals from each simulating group when performing inference may decrease accuracy.
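The weighted averaging across independent runs mentioned above might look like the following Python sketch. It assumes each run supplies a non-negative weight on the natural (not log) scale together with its best set of ancestry proportions; how the weights are actually scaled is not specified here, and the function name `combine_runs` is hypothetical.

```python
def combine_runs(runs):
    """Weighted average of ancestry proportions across runs.

    runs: list of (weight, {surrogate: proportion}) pairs, one per run.
    """
    total_weight = sum(weight for weight, _ in runs)
    combined = {}
    for weight, proportions in runs:
        for surrogate, value in proportions.items():
            # Surrogates absent from a run contribute zero for that run.
            combined[surrogate] = combined.get(surrogate, 0.0) + weight * value / total_weight
    return combined
```

Because each run's proportions sum to 1, the combined proportions also sum to 1.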
Four sets of simulations with different admixture percentages were performed and these are described below. The values in parenthesis indicate the fraction of individuals from a cluster that were used to generate the admixed individuals in that simulation.
(i) 40% CentralSouthSpain (16/48), 30% NorthWestEurope2 (32/101), 30% SouthMexico1 (5/16)
When using NNLS as described in e.g. Leslie et al. 2015 9, ancestry from SouthMexico1 is inferred with high accuracy, showing little marginal uncertainty and little misassignment even to Nahua1, a striking result considering that these two surrogate clusters are closely related as shown in the fineSTRUCTURE tree (Supplementary Fig. 3). The accuracy obtained with SOURCEFIND1/SOURCEFIND2 is even higher, having a nearly perfect match to the true simulated proportions and sources.
[Supplementary Figure 17: pyramid chart]
In the case of CentralSouthSpain, NNLS shows high levels of misassignment to other Iberian surrogates. The highest misassigned values are to CentralNorthSpain, which is the group genetically most similar to CentralSouthSpain. Additional contributions are inferred for East/South Mediterranean populations (up to ~5%). In contrast, SOURCEFIND estimations are highly accurate, with very minor inferred incorrect contributions related to Italy1. Importantly, there are no mis-inferred contributions from East/South Mediterranean populations when using SOURCEFIND.
The estimation of NorthWestEurope2 ancestry is typically more accurate, with some incorrect assignment to NorthWestEurope1 (max ~10%), which is considerably stronger under NNLS.
Overall, this simulation demonstrates the increased resolution of SOURCEFIND compared to NNLS for resolving ancestral origins among Iberian populations. SOURCEFIND also has reduced mis-specified contributions related to East/South Mediterranean groups.
(ii) 40% Portugal/WestSpain (16/53), 30% Italy1 (7/19), 30% Aymara (6/16)
NNLS analysis results in poor discrimination of Aymara from Quechua2 ancestry, consistent with the high genetic similarity of these two groups and the small size of the Aymara cluster (n=16). We note that when Quechua2 ancestry is included in the simulations instead of Aymara (simulation (iii) below) higher accuracy is obtained, showcasing the increased accuracy when using more surrogate individuals from the admixing group when performing inference. In the case of the SOURCEFIND analysis, both Aymara and Quechua2 ancestries are accurately estimated under both simulation scenarios.
Both NNLS and SOURCEFIND slightly overestimate the Portugal/WestSpain contribution and slightly underestimate the ancestry from Italy1. However, SOURCEFIND inferences are closer to the simulated proportions than those of NNLS. Furthermore, as in the previous simulation, NNLS infers East/South Mediterranean contributions, as well as several other incorrect European contributions, which are not inferred in the SOURCEFIND analyses.
(iii) 40% Quechua2 (15/56), 40% CentralSouthSpain (16/48), 20% WestAfrica3 (22/99)
Estimated contributions from WestAfrica3 and Quechua2 are very accurate under both NNLS and SOURCEFIND, with the latter again showing more accurate estimates overall. We note that NNLS infers a notable spurious contribution from Basque, which suggests that inferred Basque-like contributions in the Americas using this approach should be treated with caution 10. These simulation results suggest that, for NNLS, the presence of different mis-specified signals of ancestry across the Iberian groups may be proportional to the amount of true ancestry from these sources, which could allow the establishment of noise thresholds in NNLS inference. For example, if the highest value of Basque ancestry in an individual with 20% CentralSouthSpain is around 2% for simulations here, and around 4% for an individual with 40% CentralSouthSpain (see simulation set (iii)), we could in theory predict that an individual in the real dataset with 80% CentralSouthSpain-like ancestry may have ~8% Basque ancestry attributable to noise. SOURCEFIND does not show this problem, instead showing only a slight mis-assignment of this Iberian component to the closest group (CentralNorthSpain).
The two Native American components, although closely related, are distinguishable by both approaches, although SOURCEFIND shows greater precision.
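The proportional noise threshold suggested above for NNLS (spurious Basque ancestry scaling with true CentralSouthSpain ancestry) can be illustrated with a short Python sketch. It assumes a line through the origin fitted by least squares to the two approximate values quoted in the text; the function name `noise_threshold` is hypothetical, and this is an illustration of the reasoning rather than a calibrated threshold.

```python
def noise_threshold(true_ancestry, spurious_ancestry, query):
    """Predict spurious ancestry at `query`, assuming strict proportionality.

    true_ancestry / spurious_ancestry: paired percentages from simulations.
    """
    # Least-squares slope for a line constrained to pass through the origin.
    slope = (sum(x * y for x, y in zip(true_ancestry, spurious_ancestry))
             / sum(x * x for x in true_ancestry))
    return slope * query

# Values quoted in the text: ~2% spurious Basque at 20% CentralSouthSpain, ~4% at 40%.
predicted = noise_threshold([20, 40], [2, 4], query=80)
```

With these inputs the slope is 0.1, so the prediction at 80% CentralSouthSpain-like ancestry is about 8%, matching the figure given in the text.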

Simulations to assess the accuracy of per individual estimation of time since admixture and the effect of time since admixture on ancestry estimation
Simulations with a single admixture event
We simulated an additional 1,430 individuals with different proportions of admixture from two sources (CentralSouthSpain and Quechua2) and different times since admixture. Using the procedure described in the previous section, each individual was simulated as descending from an instantaneous admixture event that occurred g generations ago, with a proportion p of ancestry from CentralSouthSpain, and 1-p ancestry from Quechua2. We simulated p = 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95 and g = 5-17 generations, with 10 simulated individuals for each combination of p and g, resulting in a total of 1,430 simulated individuals.
We used 16 CentralSouthSpain and 20 Quechua2 individuals to generate the admixed individuals, using the remaining 32 CentralSouthSpain and 36 Quechua2 individuals to infer ancestry using SOURCEFIND and GLOBETROTTER. SOURCEFIND and GLOBETROTTER were run separately on each simulated individual as described for the real data samples, with the slight modification that GLOBETROTTER was allowed to use all surrogates to describe the admixture (rather than only including surrogates inferred by SOURCEFIND to contribute >1%). In contrast to the simulations above, for these simulations we used the more computationally efficient version of SOURCEFIND (i.e. SOURCEFIND2), described at the start of this Supplementary Note, to infer proportions. The figure below (Supplementary Figure 21) shows that on average, GLOBETROTTER's individual estimated dates accurately reflect the simulated dates (grey bar), and that this accuracy is not affected by variation in the admixture proportions (the figure has been produced with the default parameters of the function boxplot.default in R: the centre line corresponds to the median, the bounds of the box represent the first (Q1) and third (Q3) quartiles, and the whiskers extend approximately to Q1 - 1.5*Inter-Quartile Range and Q3 + 1.5*Inter-Quartile Range. Outlying points are plotted individually).

Supplementary Figure 21. GLOBETROTTER's inferred dates (y-axis) across individuals, for simulations mixing CentralSouthSpain and Quechua2 at the given proportions (legend) and times (x-axis)
Similarly, the figure below shows that SOURCEFIND's accuracy in inferring ancestry proportions in the simulated individuals did not depend on the date of admixture (simulated proportions highlighted with a grey bar).

Supplementary Figure 22. SOURCEFIND's inferred proportion of ancestry related to Iberian (IBR) and Native American (NAM) sources (y-axis) across individuals (circles), for simulations mixing CentralSouthSpain and Quechua2 at the given proportions (x-axis) and times (legend)
We also examined directly whether the 1,430 simulated individuals show a pattern similar to that inferred in the CANDELA data (Fig. 3A), where Native American ancestry increases for more recent admixture events. To do so, we mimicked our real data analysis by first extracting the 1,297 simulated individuals that GLOBETROTTER inferred to have a single date of admixture with one source best-matching a Native group and the other source best-matching a European group. We then binned individuals based on their inferred admixture date, and calculated the average inferred ancestry proportions in each bin. The figure below shows that no pattern is observed in the simulated data (Supplementary Table 4), suggesting that the pattern observed in the CANDELA data is not an artefact of the GLOBETROTTER estimation. We explore this trend relating inferred Native American ancestry to inferred admixture date with additional simulations below.

Supplementary Figure 23. Mean ancestry percentages in the simulated individuals estimated by SOURCEFIND grouped by the number of generations since admixture
Simulations with two sequential admixture events
To further evaluate the trend of increasing Native ancestry at more recent dates of admixture seen in the CANDELA data, we simulated 1,050 additional individuals with two sequential admixture events. As before, we simulated different proportions of admixture from two sources (CentralSouthSpain and Quechua2), and varied the times for the two admixture events. Using the exponential sampling procedure described above, we first simulated individuals stemming from an instantaneous admixture event occurring 2 generations previously, with 55% CentralSouthSpain ancestry and 45% Quechua2 ancestry. We then simulated a second instantaneous admixture event with p ancestry from the population generated in the first admixture event, and 1-p ancestry from Quechua2 occurring g generations ago. We simulated p = 0.86-0.98 (at 0.02 intervals) and g = 5-14 generations, with 15 simulated individuals for each combination of p and g (1,050 simulated individuals in total). Note that, under this simulation procedure, the first admixture event occurred g+2 generations ago, the more recent event occurred g generations ago, and the final expected proportion of ancestry from CentralSouthSpain is 0.55*p. SOURCEFIND and GLOBETROTTER were run separately on each simulated individual as before. As with the previous section, for these simulations we used the more computationally efficient version of SOURCEFIND (i.e. SOURCEFIND2), described at the start of this Supplementary Note, to infer proportions.
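The two-event simulation design described above can be enumerated with a brief Python sketch, which checks the total of 1,050 individuals and the expected final CentralSouthSpain proportion of 0.55*p. The function name `two_pulse_design` and the dictionary keys are hypothetical and are introduced only for illustration.

```python
def two_pulse_design(first_event_spain=0.55, n_per_cell=15):
    """List every (p, g) combination of the two-event simulations."""
    p_values = [round(0.86 + 0.02 * i, 2) for i in range(7)]  # 0.86, 0.88, ..., 0.98
    g_values = list(range(5, 15))                             # 5, 6, ..., 14
    design = []
    for p in p_values:
        for g in g_values:
            design.append({
                "p": p,
                "g_recent": g,       # more recent event, g generations ago
                "g_older": g + 2,    # first event, g + 2 generations ago
                "expected_spain": first_event_spain * p,
                "n_individuals": n_per_cell,
            })
    return design

design = two_pulse_design()
total_individuals = sum(cell["n_individuals"] for cell in design)
```

The expected CentralSouthSpain proportion therefore ranges from 0.55*0.86 = 0.473 to 0.55*0.98 = 0.539 across the design.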
In 923 (~88%) of the 1,050 individuals, GLOBETROTTER concluded only a single date of admixture, which is not surprising given the inherent difficulty in distinguishing between two pulses of admixture separated by only 2 generations that involve the same source groups. The figure below (Supplementary Figure 24) shows results when assuming a single date of admixture, which infers dates that typically are 2 generations above g (simulated date given with the grey bar). Therefore, GLOBETROTTER most often concludes a single date of admixture, with the inferred date primarily reflecting the older event because this is reflected in the sizes of observed Iberian ancestry segments (the figure has been produced with the default parameters of the function boxplot.default in R as explained in the previous section).

Supplementary Figure 24. GLOBETROTTER's inferred dates (y-axis) across individuals, for simulations with two sequential admixture events, at the given proportions (legend) and times (x-axis).
The figure below (Supplementary Figure 25) illustrates that SOURCEFIND accurately estimates the admixture proportions in the simulated individuals (grey bar gives simulated proportion).

Supplementary Figure 25. SOURCEFIND's inferred proportion of ancestry related to Iberian (IBR) and Native American (NAM) sources (y-axis) across individuals (circles), for simulations with two sequential admixture events, at the given proportions (x-axis) and times (legend).
In addition, as above, we extracted the 923 simulated individuals that GLOBETROTTER inferred to have a single admixture event between source groups that best-matched Native and European surrogate groups. We binned these individuals based on their inferred admixture date, and calculated the average inferred ancestry proportions in each bin. While not as striking as that observed in our real data (Fig. 3A of the main text), the figure below shows an analogous trend for decreasing Native American ancestry at increasing g that is significant (p<0.001) under the same simple linear regression model used for analysing this trend in the real data (Supplementary Table 4). This is because individuals here are simulated with different proportions of admixture from the earlier admixture event occurring g+2 generations ago. Individuals with more simulated ancestry from this earlier admixed group have (i) more European ancestry and (ii) inferred dates that may be slightly older by retaining more signal from this older admixture event. Indeed, a simple linear regression of the bias in date estimate (in generations ago) for these 923 individuals on their expected proportion of Spanish ancestry shows a significantly positive association (p<0.007). In contrast, for the 1,297 simulated individuals described in the previous section with only a single simulated admixture date, there is no such significant trend (p=0.33). Overall these simulation results suggest that mixture between unadmixed and admixed Natives over time, such as that we simulated in this section, could lead to the trend we observe in Figure 3A.
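The binning and regression steps used above can be sketched in Python. The binning rule (rounding each inferred date to the nearest generation) and the function names are assumptions made for this illustration; the exact bins and regression model used in the study are those described in Supplementary Table 4. Only the slope and intercept are computed here, not the significance test.

```python
def mean_by_bin(inferred_dates, native_proportions):
    """Average inferred Native ancestry per admixture-date bin.

    Assumption: bins are formed by rounding dates to the nearest generation.
    """
    bins = {}
    for date, proportion in zip(inferred_dates, native_proportions):
        bins.setdefault(round(date), []).append(proportion)
    return {g: sum(v) / len(v) for g, v in sorted(bins.items())}

def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A negative slope when regressing Native American ancestry on inferred date corresponds to the trend discussed above, with less Native ancestry at older inferred dates.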

Assessing the reliability of East/South Mediterranean ancestry estimation
The simulations above do not include East/South Mediterranean (ESM) ancestry. We can therefore use them to assess the amount of spurious ESM ancestry inferred in our analyses. For the 400 individuals described in the first section of this Supplementary Note, where the proportion of simulated Iberian ancestry ranges from 20-40%, SOURCEFIND estimates that none of these have >1% ESM ancestry. Across all simulations (2,880 individuals), only 2 (~0.07%) had >5% ESM ancestry (maximum = 6.2%, with both of these 2 simulated individuals having >90% Iberian ancestry), and 72 (2.5%) had inferred ESM ancestry >2%. In the main text, we note that ~23% of CANDELA individuals are inferred by SOURCEFIND to have >5% ESM ancestry (Fig. 1E). Furthermore, in the CANDELA data, 878 (~14.6%) individuals are inferred to have >10% ESM ancestry, an amount never inferred in any of the simulations performed (even in individuals with 90% simulated Iberian ancestry). The simulation results are therefore consistent with ancestors of these Latin American individuals having substantially greater ESM ancestry than the present Iberian groups sampled.

Supplementary Note 3. Average continental and sub-continental ancestry from SOURCEFIND and ADMIXTURE
-Lobe size. Small to bigger size.
-Helix rolling. The outer rim of the ear that extends from the superior insertion of the ear on the scalp (root) to the termination of the cartilage at the earlobe (less to more pronounced helix rolling).
-Fold of antihelix. Less to more pronounced fold of antihelix.
-Antitragus size. Small to bigger size. The anterosuperior cartilaginous protrusion lying between the incisura and the origin of the antihelix. The anterosuperior margin of the antitragus forms the posterior wall of the incisura.

Supplementary Note 5. Correlation of regression p-values from different approaches to the Central Andes-Mapuche ancestry contrast
Regression analyses for testing the phenotypic effect of the contrast between Central Andes and Mapuche ancestry were performed using three different approaches to define these ancestry components, based on SOURCEFIND, ADMIXTURE, or PCA (Supplementary Figures 8 and 9).
• SOURCEFIND: estimates of the Aymara, Quechua1, Quechua2, and Colla components were added and the sum (taken as the Central Andes component) contrasted to the Mapuche component. Regressions were performed in three ways: (i) including individuals from all countries, (ii) including only Peruvians and Chileans (both the Central Andes and Mapuche components are only present in these two countries, Fig. 1), (iii) including only Chileans (as only this country has both the Central Andes and the Mapuche components at high frequency, Figure 1).
• ADMIXTURE: the unsupervised run at K=7 distinguishes three components in Native Americans (Supplementary Fig. 8). A light-blue colored component reaches 100% frequency in the Central Andean groups while a grey colored component reaches 100% frequency in the Mapuche. The difference in the proportions of these two ancestry components was taken as the Central Andes-Mapuche contrast. Regression analysis was performed including individuals from all countries.
• PCA: PC7 places Central Andean and Mapuche clusters at opposite ends (Supplementary Fig. 9). Individual values on this PC were taken directly as an approximation to the contrast between these two components. However, since this PC shows some confounding with other ancestry differences, the regression included only Chileans (as these individuals have a relatively low frequency of other Native American ancestry components, Fig. 1, Supplementary Fig. 8).

Below are Spearman's rank correlations calculated between the -log p-values obtained in the regressions described above (Sample sizes: all individuals N = 5794, Peruvians and Chileans N = 2594, Chileans N = 1542).
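As an illustration of the SOURCEFIND-based contrast and of the rank correlation between -log p-values described in this note, the following minimal sketch uses only the Python standard library. The function names and all numerical values are hypothetical, and base-10 logarithms are an assumption (the rank correlation does not depend on the base).

```python
import math

# SOURCEFIND components summed to form the Central Andes component.
CENTRAL_ANDES = ("Aymara", "Quechua1", "Quechua2", "Colla")


def central_andes_mapuche_contrast(props):
    """Contrast for one individual: summed Central Andes minus Mapuche."""
    central_andes = sum(props.get(name, 0.0) for name in CENTRAL_ANDES)
    return central_andes - props.get("Mapuche", 0.0)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical ancestry proportions for a single individual.
individual = {"Aymara": 0.10, "Quechua1": 0.05, "Quechua2": 0.05,
              "Colla": 0.0, "Mapuche": 0.30}
contrast = central_andes_mapuche_contrast(individual)

# Hypothetical p-values for the same traits under two approaches.
p_sourcefind = [1e-6, 0.02, 0.5, 0.001]
p_admixture = [1e-4, 0.05, 0.3, 0.01]
rho = spearman([-math.log10(p) for p in p_sourcefind],
               [-math.log10(p) for p in p_admixture])
```

In this toy example the two approaches rank the traits identically, so `rho` is 1 up to floating-point rounding; the ADMIXTURE and PCA contrasts would replace `central_andes_mapuche_contrast` with the difference of two components or the PC7 value, respectively.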

Supplementary Note 6. Robustness of SOURCEFIND ancestry inference to the exclusion of CANDELA individuals used as reference samples
To assess the impact of having included some CANDELA individuals as reference samples in the SOURCEFIND analyses, we repeated our analyses after excluding CANDELA individuals from the reference samples. We also removed individuals that were excluded from the surrogate groups as described in Methods (Definition of homogeneous clusters of reference population individuals), so that this analysis included only 55 surrogate clusters. The loss of one surrogate cluster relative to the initial analyses was due to the Germany surrogate cluster consisting entirely of CANDELA individuals. As described at the start of Supplementary Note 2, for this analysis we used an alternative, more efficient version of SOURCEFIND (i.e. SOURCEFIND2) that used a truncated Poisson prior on the number of contributing surrogates and allowed a maximum of 6 surrogates to contribute at each MCMC iteration.
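For readers unfamiliar with truncated Poisson priors, the sketch below shows one way such a prior over the number of contributing surrogates could be written. It is not taken from SOURCEFIND2: the rate value, the lower bound of 1, and the use of the maximum of 6 as the truncation point are all assumptions made for illustration.

```python
import math


def truncated_poisson_pmf(rate, max_k=6, min_k=1):
    """Poisson probabilities restricted to min_k..max_k and renormalised.

    `rate`, `min_k` and the use of `max_k` as the truncation point are
    illustrative assumptions, not values stated in the text.
    """
    weights = {
        k: math.exp(-rate) * rate ** k / math.factorial(k)
        for k in range(min_k, max_k + 1)
    }
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical rate; prior mass over 1 to 6 contributing surrogates.
prior = truncated_poisson_pmf(rate=3.0)
```

Under such a prior, configurations with more than `max_k` contributing surrogates receive zero prior mass, which keeps each MCMC iteration restricted to a small number of surrogates.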
Maps with the distribution of the newly estimated individual ancestry proportions are shown below. Ancestry estimates matching to European, East/South Mediterranean, Sub-Saharan African, and East Asian groups are largely consistent with the results shown in Figure 2. For Native American ancestry, results are similar across most of the CANDELA sample. However, there is a marked decrease in inferred ancestry related to the AndesPiedmont and Quechua1 surrogate groups. This is probably due to these groups being made up of only one individual after removal of CANDELA samples, thus decreasing the power of ancestry inference from these groups. The inferred ancestry contributions of these groups are, for the most part, substituted by ancestries from related, geographically proximate surrogate groups.