A major factor for the population decline of Native Americans after European contact has been attributed to infectious disease susceptibility. To investigate whether a pre-existing genetic component contributed to this phenomenon, here we analyse 50 exomes of a continuous population from the Northwest Coast of North America, dating from before and after European contact. We model the population collapse after European contact, inferring a 57% reduction in effective population size. We also identify signatures of positive selection on immune-related genes in the ancient but not the modern group, with the strongest signal deriving from the human leucocyte antigen (HLA) gene HLA-DQA1. The modern individuals show a marked frequency decrease in the same alleles, likely due to the environmental change associated with European colonization, whereby negative selection may have acted on the same gene after contact. The evident shift in selection pressures correlates to the regional European-borne epidemics of the 1800s.
The decline of Native American populations after European contact has been linked to several factors including warfare, alterations in social structure and an overwhelming introduction of European-borne pathogens1,2,3. Although the extent of the population decline remains contentious, European-borne epidemics may have disproportionately contributed to the phenomenon4,5. The debate has prompted researchers to explore the possibility of genetic susceptibility, where low-genetic variation in HLA genes and immunologically naïve populations are linked to the exacerbated pathogen-associated mortality rates6,7,8. Assumptions of homogeneity among certain immune genes8, however, are based on surveys of living Native Americans who represent the surviving members of communities affected by European contact and colonization. Thus, they fail to consider immune-related genetic factors that may have existed before contact.
The immunological history of the indigenous people of the Americas is undoubtedly complex. As people entered the Americas and expanded into different regions, ∼15,000–20,000 years before present (BP)9,10, groups encountered environments with varying ecologies and with relatively little gene flow from other continental populations until European contact11. We hypothesize that indigenous people adapted to local pathogens, resulting in long-lasting changes to immune-related loci. Ancient immune adaptations are suspected to have occurred throughout human history as populations spread into varying environments across the globe12. If the indigenous people of America adapted to local pathogens, those adaptations would have proven useful in ancient times but not necessarily after European colonialists altered the environment with their pathogens, some of which may have been novel13,14,15. Existing genetic variation as a result of adaptation before European contact could thus have contributed to the indigenous population decline after European contact.
To investigate the possibility of a pre-existing genetic component, we sequenced 50 exomes of ancient and modern individuals from the Northwest Coast of North America, dating from before and after European contact. We confirm the genetic continuity between the ancient and modern individuals, establishing a single continuous population through time. We show a 57% reduction in effective population size after European contact, inferred from our demographic model. We also detect signatures of positive selection on immune-related genes in the ancient but not the modern individuals. The strongest selection signal in the ancients derives from the human leucocyte antigen (HLA) gene HLA-DQA1, with alleles that are close to fixation. The important immune function of HLA-DQA1 supports an ancient adaptation to the environments of the Americas. The modern individuals show a marked decrease in the frequency of the associated alleles (the most pronounced variant showing a 64% difference). This decrease is likely due to the environmental change associated with European colonization, which resulted in a shift of selection pressures, whereby negative selection may have acted on the same gene after contact. Furthermore, the selection pressure shift could correlate to the European-borne epidemics of the 1800s, suffered in the Northwest Coast region. This is among the first studies to examine a single population through time and exemplifies the power of such studies in uncovering nuanced demographic and adaptive histories.
Samples and sequencing
To investigate possible immune-related genes under selection before European contact, we sequenced the exomes of ancient and modern First Nations individuals of the Prince Rupert Harbour (PRH) region of British Columbia, Canada (Supplementary Fig. 1, Supplementary Note 1). We then performed genomic scans for positive selection and functional characterization of genes exhibiting the strongest signals. Exomes of 25 modern individuals from two Coast Tsimshian communities, Metlakatla and Lax Kw’alaams (henceforth referred to as ‘Tsimshian’), were sequenced to a mean depth of 9.66 × . The 25 ancient individuals from archaeological sites in the PRH region (henceforth referred to as ‘PRH Ancients’; Supplementary Fig. 1) range in age from ∼6,260 to 1,036 cal BP (Supplementary Table 1), with most of the individuals falling between 3,000 and 1,036 cal BP. The ancient exomes were sequenced to a mean depth of 7.97 × (Supplementary Table 2). Contamination estimates (that is, exogenous DNA stemming from modern sources), using the exome-wide data, revealed a mean contamination of 0.94% with a 95% confidence interval of 0.83–1.10% (ref. 16; Supplementary Table 3). All 25 ancient individuals exhibited patterns of C→T and G→A transitions consistent with deamination due to post-mortem DNA damage17,18 (Supplementary Fig. 2). Mitochondrial haplogroups were determined for each ancient individual, all showing haplogroups previously identified in Native Americans19 (Supplementary Table 1).
Genetic relationship between ancient and modern individuals
Before proceeding with selection scans we investigated the genetic relationship among the ancient and modern individuals to confirm continuity between the two groups. For these analyses, all C/T and G/A polymorphic sites were removed to guard against biases resulting from DNA damage. Multidimensional scaling was performed to assess the genetic relationships of our samples to individuals from the 1,000 Genomes Project20, other Native American populations21 and two ancient individuals from America22,23 (Fig. 1b, Supplementary Note 2). The analysis revealed an affinity among the PRH Ancients and the Tsimshian, with the Tsimshian drifting towards Europeans as expected from presumed European admixture24. We next used ADMIXTURE25 to separate our samples and other worldwide populations into clusters. This analysis (at K=5 clusters) suggests that the Tsimshian are a mixture of ancestral components stemming from the PRH Ancients and Europeans (Fig. 1a). Next, the evolutionary relationship among our samples and other worldwide human populations was evaluated via TreeMix26. With a single-migration event, the PRH Ancients exhibit minimal drift and appear ancestral to the Tsimshian with European admixture occurring between the two groups (Fig. 1c). These analyses, combined with local oral histories and evidence from archaeology and mitogenomes19, allowed for the inference that the ancient and modern individuals represent a single population through time, which includes pre- and post-European contact.
We also inferred the population history of the Tsimshian by taking into account the bottleneck that occurred after European contact1,27,28 (Fig. 2, see ‘Methods’ section). Utilizing both ancient and modern exome-wide data, the demographic parameters were inferred utilizing the joint derived site frequency spectrum of potential synonymous sites with respect to human reference genome hg19. The best-fitting model suggests that a bottleneck occurred ∼175 years BP (bootstrap 95% confidence interval: 125–225 years, Table 1) in the ancestors of the modern Tsimshian with an accompanying reduction in effective population size of 57%. The timing of the bottleneck coincides with the documented smallpox epidemics of the 19th Century and historical reports of large-scale population declines29,30. A majority of the European admixture in the population likely occurred after the epidemics24,29.
Scans for positive selection
To safeguard against false-positive signals of positive selection due to the apparent admixture of Tsimshian individuals with Europeans, we performed an admixture correction (see ‘Methods’ section). The populations were scanned for selection signals, with and without correcting for admixture, utilizing the population branch statistic (PBS)31. PBS has proven effective in detecting positively-selected loci among high-altitude populations31,32. Twenty-five Han individuals from Beijing (CHB), part of the 1,000 Genomes Project20, served as the third comparative population. The statistic computes the amount of differentiation at a given locus along a branch leading to a specific population by comparing transformed FST values between each pair of three populations. Figure 3a displays population-specific differentiation for the mean across the exome and for our top candidate gene (HLA-DQA1) discussed below.
We calculated PBS on a per-gene basis, with P values for each gene computed by comparing the observed PBS scores with the distribution calculated under neutral simulations. We report genes with P values below the 0.05 significance level (Table 2). The genes showing the most extreme and significant PBS values in the PRH Ancients represent strong candidates for positive selection, of which the top candidate, HLA-DQA1, is directly involved with immune function (Table 2). Enriched gene ontologies were also identified from the ranked list of genes generated from the PBS scan, which highlight immune function related to antigen presentation (Table 3). To assess whether the selection signals were extreme relative to expectations under neutrality, the PBS scores were compared with the distribution of scores based on neutral simulations using our inferred demographic history (Supplementary Fig. 3). Variants from the top candidate with the most pronounced frequency changes were confirmed via Sanger sequencing in all ancient samples reporting data (Supplementary Fig. 4, Supplementary Note 3).
Relevance of the HLA-DQA1 gene
The most extreme PBS score belonged to the HLA-DQA1 gene, which encodes for the alpha chain of the major histocompatibility complex (MHC), class II, DQ1 isoform. The HLA-DQA1 single nucleotide polymorphism (SNP) with the most pronounced frequency difference between the PRH Ancients (100%) and the Tsimshian (36%) falls in the 5′ untranslated region (Table 4). This region may be indicative of selection acting on the regulation of the gene, as the associated alleles exhibit evidence of chromatin alterations and eQTL hits in a variety of cells—including monocytes-CD14+ and primary T helper 17 cells33 (Supplementary Table 4). The chromosomal region where the gene is located also shows strong differentiation along the branch leading to the PRH Ancients (Fig. 3).
HLA-DQ is one of the three main types of MHC class II molecules, along with DR and DP, and is mainly expressed on antigen presenting cells34. MHC class II molecules are responsible for binding to extracellular pathogen peptides and presenting them to CD4+T helper cells, which activate a targeted adaptive immune response towards the associated microbe35. The molecules are known to be highly polymorphic, mainly due to sequence differences corresponding to the binding domain of the molecule, which can impact binding affinities36. Because of this variety in binding domains, differing MHC class II isoforms can have differing disease outcomes due to the restriction imposed on T-cell activation35. The polymorphic nature of these molecules across different populations, however, would not explain the heightened differentiation in the PRH Ancients with respect to their presumed descendants, the Tsimshian.
Haplotype structure and local ancestry of HLA-DQA1
The top candidate for selection in the ancient population, HLA-DQA1, showed large-allele frequency changes in the UTR5 region of the modern population (Table 4). Although there is a slight reference bias in the ancient samples due to mapping and possibly the design of our capture probes (Supplementary Fig. 5, Supplementary Note 4), the high frequency cannot be attributed to this feature since the derived alleles putatively under selection are for the alternate allele. To assess whether the frequency change was due to European admixture, we examined the haplotype structure among populations. To visualize the haplotypes in the HLA-DQA1 region, we phased the ancient and modern samples using Beagle 4.1 (ref. 37). We took a randomly chosen haplotype from ancient sample PRH 125, and computed the number of pairwise differences to this haplotype for each haplotype in the modern and ancient samples as well as the Great Britain (GBR) samples from the 1,000 Genomes Phase 3 data38. We then ordered the haplotypes based on their number of pairwise differences to this arbitrarily chosen haplotype from sample PRH 125, and grouped them by population. Supplementary Figure 6 shows similar haplotypes between the ancient and modern individuals, while those of the European population are distinct.
We next explored the local ancestry of the HLA-DQA1 gene. We used RFMix39 to infer ancestry along chromosome 6 in the modern Tsimshian population. We utilized the PopPhased program, which corrects the phasing errors, and a window size of 0.2 cM, four generations since the admixture event between the Tsimshian and Europeans, and 100 trees generated per random forest. For the reference panel, we used Phase 3 data from the 1,000 Genomes Project38. We used 25 individuals each from the GBR (European panel), CHB (East Asian panel) and PEL (Native American Panel, Peruvian in Lima). The PEL chosen showed little to no admixture (see ‘Methods’ section). Supplementary Figure 7 indicates that only one haplotype could be attributed to European ancestry, while the remaining 49 are attributed to Native American ancestry.
Simulations of the HLA-DQA1 allele trajectories
To explore whether the allele frequency differences between the two time periods could be explained by long-term balancing selection, drift or changes in selection pressures, we performed a series of simulations based on our demographic model. First, we examined whether long-term balancing selection under heterozygote advantage could explain our data. The parameters inferred from our demographic model were implemented in the forward-time simulator SLiM40. A de novo mutation was introduced 5 million years in the past (assuming a generation time of 25 years) that evolved under heterozygote advantage (per-generation selection coefficient s=0.1, and dominance parameter h=100) until the present. The distribution of the resulting PBS scores can be contrasted with the observed data in Fig. 4a. Because the distribution of the PBS scores under long-term balancing selection is shifted towards small values compared with neutrality (Fig. 4a), the data are inconsistent with long-term balancing selection under heterozygote advantage.
To model evolutionary forces acting on the HLA-DQA1 derived alleles after European contact, we chose the frequency of the allele showing the greatest change of 0.67. We utilized a simulation based approach, described in detail in the ‘Methods’ section, to evaluate models under positive selection, neutrality and negative selection. We also used the same approach to obtain estimates of the correlation between the time of environmental change (t) and the selection coefficient (s) (Supplementary Fig. 8). Figure 4c shows that neither a strict positive selection scheme, nor one involving positive selection followed by a shift to neutrality, could fit our data (none of the simulations reach the observed frequency in the modern population). However, the model with a shift from positive to negative selection was compatible, where 26% of the simulations either reach or surpass the observed frequency.
We also investigated if the observed allele frequency in the ancient population could be better explained by drift rather than selection. Using the same general method, we simulated the initial allele frequencies at the time Native Americans split off from East Asian population by randomly sampling allele frequencies from a backward simulation conditioned on the modern CHB frequency. We then simulated allele frequencies from 60 generations ago—the time during which the ancient population was sampled. The resulting distribution in Supplementary Fig. 9 shows that a neutral scenario is not a good fit for our data. We also see that the empirical distribution of all SNP frequency changes between the ancient and modern individuals show the HLA-DQA1 variants as outliers (Fig. 4b).
Our unique data set has allowed us to examine the demography of a single Native American population through three distinct time frames. We first examined the population from a time span of 5,000 years leading to European contact. Selection scans on the ancient individuals from this period revealed a top candidate for positive selection, HLA-DQA1, giving the inference of an immune-related adaptive event. We next inferred the severity of the population collapse after European contact, which correlates with historical population declines associated with regional smallpox epidemics30, as well as general estimates of Native American population declines based on mitochondrial DNA diversity27,41. During the contact period, previous long-standing positive selection on the HLA-DQA1 gene may also have been significant. The HLA-DQ receptor has been associated with a variety of colonization era infectious disease, including measles42, tuberculosis43,44, and with the adaptive immune response to the vaccinia virus, which is an attenuated form of smallpox45,46. Further studies are needed to investigate if the ancient alleles putatively under positive selection may pose a differential disease outcome with respect to European-borne pathogens, as well as their effect on downstream target genes.
However, when examining the population post-contact and into contemporary times, variants of the HLA-DQA1 gene experience a marked frequency change. This change presents a more complex scenario when taking into account all three time frames. First, scans for positive selection in the modern Tsimshian, with and without correcting for European admixture, revealed no statistically significant selection on immune-related genes (Supplementary Tables 5 and 6). The gene ontology enrichment analyses also did not suggest a correlation with immune function (Supplementary Table 7). Second, demography alone was unable to explain the large frequency change in the HLA-DQA1 alleles between the ancient and modern groups based on simulations (Fig. 4c). European admixture in the modern individuals also did not account for the frequency changes since the haplotypes in this region can be attributed to Native American ancestry (Supplementary Fig. 7). Furthermore, HLA-DQA1 remained a top PBS hit in scans involving both a European admixture correction (Supplementary Table 9) and with an additional scan involving unadmixed Native American individuals from a different modern population (suggesting a regional adaptive event) (Supplementary Table 8; ranked fourth best candidate, with the top three functionally uncharacterized).
We therefore explored alternative explanations for the observed frequency change of the HLA-DQA1 alleles in the time after contact. Since HLA genes have been previously postulated to be under balancing selection in humans47,48, we examined the possibility that long-term balancing selection could explain our data by simulating under a model of heterozygote advantage conditional on our inferred demographic model. We found that this specific type of balancing selection is a poor fit to the data, whereby the HLA-DQA1 gene is still an extreme outlier relative to the simulation results (Fig. 4a). Next, we used a forward simulation based approach to trace the HLA-DQA1 allele trajectories under different selection models after the point of European contact. We found that simulations under our demographic model, which was modified to not include European admixture given our local ancestry results (Supplementary Fig. 7), was insufficient to explain the frequency change in the modern population—with none of the 104 simulations reaching the observed frequency (Fig. 4c). However, on applying a model of negative selection at the time of contact, we found that simulated allele frequencies were compatible with the observed frequencies in the modern population (Fig. 4c). Although we were unable to precisely identify the selection coefficient necessary to drive the allele frequency change (since the likelihood surface is relatively flat, Supplementary Fig. 8), it is likely that relatively strong negative selection occurred. Such strength would be expected under a time frame of less than seven generations and correlates with the high mortality rates associated with the regional smallpox epidemics of the 1800s, which reached upwards of 70% (ref. 30).
The results presented here reveal an evolutionary history that spans thousands of years. The immune-related alleles that exhibit strong signals of positive selection in the ancient Native Americans from the Northwest Coast, likely correlate to an adaptation to pathogens that were present in the ancient environments of the region. Our results also suggest that the indigenous population may have experienced negative selection on the same immune-related genetic component after European contact and the ensuing population collapse. The shift may represent a form of balancing selection due to fluctuating environments49. This inference was only made possible through our examination of a single population through time, revealing nuanced demographic events and the utility of such studies. Furthermore, the evolutionary history detailed here helps to better understand the experiences of Native Americans with disease, in both ancient and colonial periods, by demonstrating a shift in immune-related selection pressures associated with the environmental impact of European contact.
Ethics and community engagement
This project was made possible through the active collaboration of the Metlakatla and Lax Kw’alaams First Nations. The communities are located in the Prince Rupert Harbour (PRH) region of British Columbia. RSM and JSC established a collaborative DNA study with these two communities in 2007 and 2008, respectively, visiting annually to report the most recent DNA results and obtain feedback on the results. The 25 exomes from modern individuals generated in this study came from these two communities. The two communities agreed to allow DNA analysis of ancestral individuals recovered from archaeological sites in the region and currently housed at the Canadian Museum of History. During and after community visits and extensive consultation, a research protocol and informed consent documents—agreed on by the indigenous communities and researchers—was approved by the University of Illinois Institutional Review Board (#10538). All individuals signed an informed consent document. RSM, JSC and JL visited the community annually during the study to report the latest results and continue to visit the First Nations to report on this and related studies.
DNA extraction and library preparation
We prepared DNA extracts from 25 ancient individuals from the Prince Rupert region of British Columbia (Supplementary Table 1) and prepared DNA sequencing libraries in a clean room facility. The 25 modern DNA samples underwent similar procedures in a separate facility designated for modern DNA only (Supplementary Note 5).
Exome capture and Illumina sequencing
For the contemporary samples, a combination of the Illumina TruSeq Exome Enrichment Kit and the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA) were used (Supplementary Table 2), following the manufacturer’s protocol. One library per individual was sequenced (single-end reads) and pooled for a total of four libraries per lane on the Illumina HiSeq 2000 at the High-Throughput Sequencing Division of the W.M. Keck Biotechnology Center at the University of Illinois Urbana-Champaign.
For the ancient samples, only the Illumina TruSeq Exome Enrichment Kit was used. For each ancient individual, four libraries were captured and then pooled for sequencing on one lane. For the capture, the manufacture’s protocol was used with the following modifications: the Qiagen MinElute PCR Clean-up kit was used instead of beads, post-capture amplification involved 12 cycles instead of 10 and the hybridization temperature was decreased to 50 °C.
To estimate contamination across the genome-wide data, we used the ContEst tool16. The tool uses a Bayesian approach to calculate both the posterior and the maximum a posteriori probability of contamination level within a BAM file of an individual. This method has been shown effective in detecting contamination in exomes with low coverage16. HapMap_3.3 global population frequencies for each SNP, mapped to b37, were used for the estimates. All ancient samples demonstrated contamination below 1%, except for PRH Ancient 163. The estimates are shown on Supplementary Table 3.
See Supplementary Note 6 for details on mapping. Reads below a length of 35 were filtered out before mapping to hg19. For analyses requiring genotype calls (for example, TreeMix, ADMIXTURE, MDS and RFMix), SAMtools-1.1 (ref. 50) was utilized with a minimum mapping quality of 30, a minimum base quality of 20, a minimum read depth of 6 and a max read depth of 80. Sites were also filtered for violation of a one-tailed test for Hardy–Weinberg Equilibrium at a P value<10−4 (ref. 51). Due to the low-mean read depth of both the PRH Ancients and the Tsimshian, genotypes were not called directly for the selection scan or the demographic modelling. Instead, the program ANGSD (ref. 52) was used to compute genotype likelihoods using the SAMtools model and estimated allele frequencies directly from these likelihoods. This method was applied to all populations considered in the selection scan. For the demographic model, the derived joint site frequency spectra (SFS), for all populations considered, were also inferred using ANGSD. Each alignment used in the estimation was filtered for a minimum mapping quality of 30, a minimum base quality of 20, trimmed at each end for 5 bp to minimize biases from DNA deamination and a minimum P value threshold of 10−6.
PBS selection scan
To detect regions under positive selection in both the PRH Ancient and the Tsimshian, the PBS31 was utilized. The PBS31 has proven powerful in detecting hypoxia adaptation in high-altitude populations31,32. It uses a set of three populations (call them X, Y and Z), and assumes that they have the rooted relationship ((X,Y),Z). In actuality, the calculation for the PBS does not require a rooted tree, and so their specific rooted relationship does not matter.
An analogous statistic can be calculated for populations Y and Z. In this study, we are concerned with the situation in which X is the PRH Ancients, Y is the modern Tsimshian and Z is the Han Chinese (CHB from the 1,000 Genomes Project). We therefore are interested in computing:
Because the ADMIXTURE and TreeMix analyses (Fig. 1a,c) indicate a likely admixture event between the modern Tsimshian and Europeans, the allele frequencies in the modern Tsimshian were corrected for admixture using the method described by Huerta-Sánchez et al.32. Let and represent the allele frequency at a locus in the Tsimshian population pre and post admixture, respectively. Further, assuming that we use Europeans as a proxy, we let be the allele frequency at the same locus in a reference European population (we used the Great Britain (GBR) population from the 1,000 Genomes Project). Assuming that the proportion of ancestry derived from Europeans at the locus is α, under a model of instantaneous admixture, the allele frequency in the Tsimshian post admixture would be
Rearranging, we can solve for the allele frequency before admixture as
We estimated α at each locus by choosing an α value that minimized the FST between the admixture-corrected Tsimshian and the Han Chinese (CHB) outgroup population. We used the admixture-corrected allele frequencies for the scans for positive selection.
We used ANGSD (ref. 52) to compute allele frequencies for the modern Tsimshian, the PRH Ancient, the Han Chinese (CHB) and the Great Britain (GBR) populations directly from the raw sequencing reads, accounting for the uncertainty in genotype calling. Allele frequencies were based on 25 unrelated individuals (Supplementary Table 10, Supplementary Note 7) from each population. While samples sizes of 10 provide sufficient statistical power for genome-wide FST differentiation53, our ancient sample size was increased to 25 (equating to 50 haploid samples) to offset the statistical power loss due to the varying nature of ancient exome coverage. We required that reads had a map quality of at least 30 and each nucleotide had a quality of at least 20. We also only called allele frequencies at sites in which data from at least five individuals was not completely missing. To additionally guard against post-mortem deamination, we trimmed the first and last five nucleotides of each read in the PRH Ancient samples. The allele frequencies in the modern Tsimshian were subsequently corrected for potential European ancestry (see procedure in the directly preceding paragraph). The total number of loci used (including monomorphic and polymorphic sites) for each scan are as follows: Ancient, Tsimshian, CHB=2,556,963; Ancient, Tsimshian, CHB (GBR corrected)=1,594,924; Ancient, Peruvian, CHB=3,334,664. See Supplementary Note 8 for additional detail on the PBS scan.
PBS selection scan P values
To compute P values for the per-gene PBS scan, we first obtained the distribution of RefSeq transcript lengths, and added 20 kilobases to that length. This procedure was to mimic the per-gene scan in which we computed PBS for a given gene with the inclusion of 10 kilobases upstream and downstream of the gene. On the basis of the inferred demographic model (Fig. 2), we performed 105 random neutral simulations using FastSimCoal2 (66). For each replicate simulate, we drew a sequence length uniformly at random from the distribution of RefSeq transcript lengths (plus 20 kilobases). In each simulation, we sampled 50 haplotypes (25 diploid individuals) from each of the four populations (representing PRH Ancient, Tsimshian, CHB and GBR). We then attempted to correct the allele frequencies in the population representing the modern Tsimshian, using a procedure identical to that described in the ‘Methods’ section.
P values for the per-gene scan in the PRH Ancients were obtained by identifying the proportion of the 105 neutral simulations in which the PBS values for the population representing the PRH Ancients was more extreme. Associated P values for the top two candidate genes are indicated in Table 2. It should be noted that the P values were generated assuming a neutral model. However, the data are from genes (in particular exomes), which are likely not evolving neutrally and many of which are probably under selective constraint. This selection constraint would act to shift the empirical distribution of PBS values to those that are smaller. Therefore, neutral loci would tend to have higher PBS values. Indeed, contrasting the simulated and empirical PBS distributions (Supplementary Fig. 3), we can see that the empirical distribution of PBS is shifted to smaller values. However, the top candidate HLA-DQA1 is a substantial outlier according to the empirical distribution. Therefore, simulations involving purifying selection rather than neutral simulations would likely have made HLA-DQA1 more significant.
Demographic history model
Parameters for the demographic model (Fig. 2) were inferred with FastSimCoal2 (ref. 54). The fixed parameters were implemented from Gravel et al.55 and were as follows: out of Africa bottleneck (N=1,861, T=51kya)55, split between the CHB and GBR (serving as the ghost population) (NGBR=1,032, NCHB=550; T=23kya)55. Admixture between the GBR and Tsimshian (T=100 years, admixture fraction=0.33) were taken from Verdu et al.24. One hundred optimizations were run for the inferred values, taking the best likelihood parameters from each of the 100 sets. The data was simulated with an effective sequence length of 7.4 Mb and per-base per-generation mutation and recombination rates of 2.5 × 10−8. The optimizations utilized joint derived SFS for the CHB, PRH Ancients and Tsimshian. The European population (Great Britain denoted by GBR) served as a ghost population in the model. This SFS contained 7.4 Mb of monomorphic and polymorphic sites based on hg19 potential synonymous sites, where data was reported for each individual. A parametric bootstrapping approach was used to construct the 95% confidence intervals. The inferred parameters and confidence intervals are listed in Table 1.
Long-term balancing simulation under heterozygote advantage
To examine whether long-term balancing selection could better explain our data than positive selection, the parameters inferred from our demographic model were implemented into the forward-time simulator SLiM (ref. 40). SLiM does not permit continuous exponential growth, so the equivalent effective population size of the CHB and GBR populations were computed. That is, the effective size with the same amount of elapsed coalescent time as one under exponential growth. A de novo mutation was introduced 5 million years in the past (assuming a generation time of 25 years) that evolved under heterozygote advantage (per-generation selection coefficient s=0.1, and dominance parameter h=100) until the present. Only simulations for which the selected mutation was not lost were kept. The simulation involved a region that was equal to the length of the HLA-DQA1 transcript +20 kb (10 kb upstream and downstream, as in our PBS analysis). Fifty chromosomes were sampled at random in each of the four populations (representing the PRH Ancients, modern Tsimshian, CHB and GBR populations), and PBS was calculated as in all other analyses. An admixture correction was also applied to the data from these simulations. The distribution of the resulting PBS scores can be seen overlapped with the observed data and neutral simulations in Fig. 4a. Because the distribution of the PBS scores under long-term balancing selection is shifted towards small values compared with neutrality, the data are inconsistent with long-term balancing selection.
Selection shift simulations
To model evolutionary forces acting on the derived allele at rs9272426 after European contact, we utilized a simulation based approach to evaluate models under positive, neutral and negative selection, similar to that described in Nakagome et al.56. We also used the same approach to obtain estimates of the correlation between the time of environmental change (t) and the selection coefficient (s). First, we ran a Wright–Fisher model based backward simulation of the derived allele frequency in CHB under the neutral model to sample the allele frequency (fT:605) at 605 generations ago, which is the estimated time at which ancestral Native Americans split from CHB according to our demographic model, assuming the current frequency was 0.475 (1,000 Genomes CHB frequency) and a constant population size of 8,250 diploid individuals. We computed this constant population size to be the effective size with the same amount of elapsed coalescent time as one under exponential growth assumed in for the CHB in our demographic model. We then started forward simulation in the Tsimshian with the initial frequency of fT:605 by taking into account demographic effects based on our model in which effective population size (Ne) was 13,975 diploids between 605 and 7 generations ago and decreased to 6,006 diploids at 7 generations ago (Fig. 2). Since we detected signatures of selection on the derived allele based on the PBS statistic in our ancient samples at 60 generations ago (fT:60=1.0), we sampled a selection coefficient (soriginal) from U(0.0, 0.1) and only accepted trajectories if the frequency at 60 generations ago was >80%. After the trajectory reached 12 generations ago, when European contact occurred in British Columbia, we assumed three different models with or without a shift in selective pressures on this allele by changing s to 0.0 (neutral) or newly sampling s from U(0.0, −0.3) (negative selection) or by using s=soriginal (positive selection). Then, we calculated the allele frequency at present by binomial sampling with the total chromosomes (50 in our samples) and the current frequency in the trajectory (fpresent).
We estimated a joint posterior distribution of t and s given fT:60 in the ancient samples and the observed frequency in the modern samples (fobs) under the negative selection model. Similar to the first step, we sampled fT:605 from a neutral distribution generated by the backward simulation in CHB and model the trajectory that started from fT:605 and increased the frequency with soriginal sampled from U(0.0, 0.1). The information on fT:60=1.0 was incorporated into the trajectory by rejecting it if fT:60<0.8. At this simulation, we also sampled t from U(0, 30), as well as s from U(0.0, −0.5). We used 0.37 as the observed frequency in our modern Tsimshian sample, with 19 derived and 31 ancestral observed alleles in the modern samples. We estimated a joint posterior distribution of t and s given the observed frequency in our ancient and modern samples by accepting 10,000 samples (Supplementary Fig. 8).
We also investigated if the observed allele frequency in the ancient population could be better explained by drift rather than selection. Using the same general simulation method described above, we simulated the initial allele frequencies at the time Native Americans split off from East Asian populations by randomly sampling allele frequencies from a backward simulation conditioned on the modern CHB frequency. We then simulated allele frequencies at 60 generations ago, the time at which the ancient population was sampled. The resulting distributions in Supplementary Fig. 9 show that a neutral scenario is not a good fit for our data.
Assessment of differences in coverage between populations
We investigated the distributions of coverage in the PRH Ancients, modern Tsimshian and CHB populations across the whole genome, across chromosome 6, and in the HLA-DQA1 region (Supplementary Fig. 10). As expected from the degraded nature of ancient DNA, the PRH ancients exhibit more missing data than both modern populations. However, the level of missing data across the genome and across chromosome 6 is similar. The HLA-DQA1 region shows that the coverage for PRH Ancients is less than the modern Tsimshian, and that the modern Tsimshian is less than CHB. Although we observe a decrease in coverage in the modern Tsimshian and PRH ancients relative to the background level of coverage, the number of observed alleles is always >20 (that is, 10 diploid individuals), which is sufficient to compute accurate values of FST and is over twice as large as the minimal threshold of non-missing individuals (five diploids or 10 alleles) for calling an allele frequency in our ANGSD pipeline. Further, the high frequency variant identified using our PBS scan was confirmed by Sanger sequencing in 18 diploid individuals (36 total alleles) from the PRH Ancients (Supplementary Fig. 4), indicating that it is not sample size that is driving the observed PBS patterns.
We started with the identical filtered data set of called genotypes described in the ‘Methods’ section. TreeMix26 was applied to the data set to generate maximum likelihood trees and admixture graphs from allele frequency data. The Yoruban (YRI) 1,000 Genomes Project population was used to root the tree (with the –root option). We accounted for linkage disequilibrium by grouping M adjacent sites (with the –k option), and we chose M such that a data set with L sites will have approximately independent sites. A total of 1,820 polymorphic loci were used for this analysis. At the end of the analysis (that is, number of migrations) we performed a global rearrangement (with the global option). We considered admixture scenarios with m=0 and m=1 migration events. Each migration scenario was run with 100 replicates, and the replicate with the highest likelihood was chosen to represent the maximum likelihood tree or graph for the given migration scenario.
Supplementary Figure 11 displays the results for the maximum likelihood tree with no admixture (m=0) events. Here, the present-day Tsimshian fall ancestral to modern Central and South American samples (Surui, Karitiana and Mayan), as well as the ancient sample from Montana (Anzick-1) and the PRH ancient samples genotyped in this study. However, ADMIXTURE results from Fig. 1a revealed a large European component within the modern Tsimshian, but not in the PRH Ancients, likely causing the modern Tsimshian to fall intermediate between Native Americans and Europeans on the tree. In addition, the Mayan fall ancestral to the Anzick-1 sample, a result of not masking out recent European admixture in that sample (also see Fig. 1c). Figure 1c (TreeMix graph) and Supplementary Fig. 12(b) (corresponding residuals) display results with a single admixture event (m=1). The extra migration event corrects for the European ancestry observed in the modern Tsimshian and shifts the placement on the Tsimshian as a sister population (that is, forms a clade) with the PRH ancient population.
The ancient data are available from NCBI Sequence Read Archive, accession no PRJNA288803. The data from modern individuals are available via a data access agreement with RSM at the University of Illinois. All other data are available from the authors on reasonable request.
How to cite this article: Lindo, J. et al. A time transect of exomes from a Native American population before and after European contact. Nat. Commun. 7, 13175 doi: 10.1038/ncomms13175 (2016).
This project was made possible through the active collaboration of the Lax Kw’alaams and Metlakatla First Nations. We thank Jun Li for furnishing comparative exome data from contemporary Native American populations. We also thank Alvaro Hernandez and Chris Wright at the University of Illinois Biotechnology Center. The research was funded by the National Science Foundation (#DEB-1557151, #BCS-1413551 & #BCS-1518026) and by the Office of the Vice Chancellor of Research, University of Illinois at Urbana-Champaign, by the Canadian Museum of History, Gatineau, Quebec, Canada and by Pennsylvania State University and University of California, Merced startup funds. Portions of this research were performed with the Advanced CyberInfrastructure computational resources provided by the Institute for CyberScience at Pennsylvania State University.
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/