A time transect of exomes from a Native American population before and after European contact

A major factor for the population decline of Native Americans after European contact has been attributed to infectious disease susceptibility. To investigate whether a pre-existing genetic component contributed to this phenomenon, here we analyse 50 exomes of a continuous population from the Northwest Coast of North America, dating from before and after European contact. We model the population collapse after European contact, inferring a 57% reduction in effective population size. We also identify signatures of positive selection on immune-related genes in the ancient but not the modern group, with the strongest signal deriving from the human leucocyte antigen (HLA) gene HLA-DQA1. The modern individuals show a marked frequency decrease in the same alleles, likely due to the environmental change associated with European colonization, whereby negative selection may have acted on the same gene after contact. The evident shift in selection pressures correlates to the regional European-borne epidemics of the 1800s.

. This bias could also be attributed to differential mapping. However, this bias does not correlate with our top PBS hit, since the SNPs are all called for the alternative allele to the reference.

Supplementary Figure 6 | Haplotype visualization of HLA-DQA1.
Haplotype graph of the HLA-DQA1 gene. We counted the differences between a randomly chosen haplotype from ancient individual PRH 125 and all other haplotypes of PRH Ancients, modern Tsimshian, and the British (GBR) samples from the 1,000 Genomes Phase 3 data 1 . Each column is a haplotype, and the two haplotypes of individual 125 are the first two columns (indicated in red). Each row is a SNP. Black indicates a derived allele (allele that differs from the chimpanzee reference), and white indicates an ancestral allele (allele that is identical to the chimpanzee reference). Note that while this region has been portrayed to be a continuous segment, it is not, and we have simply concatenated the exons in this gene. The colors refer to the different populations considered in this plot, as indicated by the legend.

Supplementary Figure 7 | Local ancestry of HLA-DQA1.
Ancestry components of the HLA-DQA1 UTR5 region in the modern Tsimshian. We used RFMix 2 to examine the local ancestry across chromosome 6 in the modern Tsimshian, with the above figure highlighting the region where the highest SNP frequency changes are observed, with respect to the ancient population. Only one of the 50 haplotypes is predicted to be of European ancestry, suggesting that admixture cannot explain the frequency differences between the two populations.

Supplementary Figure 8 | Joint posterior distribution of selection coefficients and time for negative selection model.
Given the negative selection simulation results, we assumed that the best fit for our data is a model that includes a shift from positive selection in the PRH Ancients to negative selection in the modern Tsimshian, after European contact. We examined the values of time and strength of negative selection consistent with the change in allele frequencies observed between the modern and ancient individuals in HLA-DQA1. Time of contact is estimated to be 300 years ago (12 generations ago) for the Pacific Northwest. Figure 9 | Simulations of HLA-DQA1 allele frequencies under a neutral model. The light grey depicts the distribution of initial allele frequencies of the Native American population at the time of divergence from East Asian populations, using a backwards simulation conditioned on the present-day CHB frequency. The blue distribution depicts the allele frequencies at the sampling point of the ancient population. The neutral model is not a good a fit for our data, given the observed frequencies in the ancient population, which are nearly fixed in the UTR5 region of the HLA-DQA1 gene. Figure 10 | Comparison of missing data across populations. As expected from the degraded nature of ancient DNA, the PRH ancients exhibit more missing data than both modern populations. The level of missing data across the genome and across chromosome 6 is similar. The HLA-DQA1 region shows that the coverage for PRH Ancients is less than that for the modern Tsimshian, and that the modern Tsimshian have less coverage than CHB. Though we observe a decrease in coverage in the modern Tsimshian and PRH ancients relative to the background level of coverage, the number of observed alleles is always greater than 20 (i.e., 10 diploid individuals), which is sufficient to compute accurate values of F ST and is over twice as large as the minimal threshold of non-missing individuals (five diploids or 10 alleles) for calling an allele frequency in our ANGSD pipeline. Further, the high frequency variant identified using our PBS scan was confirmed by Sanger sequencing in 18 diploid individuals (36 total alleles) from the PRH Ancients ( Supplementary Fig. 11), indicating that it is not sample size that is driving the observed PBS patterns. Figure 11 | TreeMix with zero migration events. The modern Tsimshian population falls ancestral to the Native Americans and the PRH Ancients. This placement is due to European admixture into the modern Tsimshian. The modern Tsimshian form a sister group to the PRH Ancients when accounting for European admixture (Fig. 1c) Supplementary Sequence-based sex determination was completed as described in Skoglund et al. 3 . The cal BP ranges were calculated as in 4 and account for the marine reservoir effect on the north coast of British Columbia. Mitochondrial haplogroups designated with * were assigned via restriction fragment length polymorphism. All others haplogroups are based on off target hits that aligned to the mitochondrial genome, which included the control region. Contamination estimates were calculated with ContEst 5 and average read length was calculated withMapDamage2 6 , using a minimum mapping quality score of 30. Protease secreted from the pancreas and has a digestive function in the intestine. 25 Peruvians were selected from phase 3 of the 1,000 Genomes Project, which showed little to no admixture. HLA-DQA1 remains a top hit, indicating that European admixture in the modern Tsimshian is not skewing our results. Gene functions were derived from GeneCards 8  All seven archaeological sites are ancient shell middens that represent hundreds and thousands of years of culture history. In addition to their evidence for habitation, subsistence (food remains and food getting technology), woodworking tools, and the arts, the middens have traditionally served as cemeteries apparently with ritual significance 13

Multi-dimensional scaling (MDS) applied to called genotypes
We intersected the called genotypes for the 25 ancient and 25 modern exomes from this study and from the Mayan, Surui, and Karitiana exomes from Szpiech et al. 28 with called genotypes from whole genome sequences from the 1,000 Genome Project Phase 2 samples 23 , the Saqqaq ancient sample from Rasmussen et al. 29 , and the Anzick-1 ancient sample from Rasmussen et al. 1 . To guard against biases generated by post-mortem deamination, sites where a C/T or G/A polymorphism was observed were removed. Further, sites that were not biallelic were also removed. In addition, only sites for which each population was not completely missing data were retained. A total of 29,333 polymorphic loci were employed for this analysis.
For each site in the filtered dataset, we calculated a distance between individuals and , where if the pair of individuals had different homozygous genotypes, if one of the pair was homozygous and the other was heterozygous, and if both individuals were homozygous for the same genotype. If at least one of the individuals has missing data at site , then . Let be the number of overlapping sites in the filtered dataset, and suppose is the number of sites at which individuals and both have non-missing data. We define the allele sharing distance between individuals and as ∑ We then applied classical multi-dimensional scaling to the matrix defined by this set of pairwise distances.
Plotting the first two components reveals that the first component separates out Africans from East Asians, and the second component separates out Africans and East Asian from Europeans. The modern Central and South American populations (Surui, Karitiana, and Mayan) fall closest to the East Asians, with the ancient Anzick-1 sample from Montana and the PRH Ancient individuals from this study falling near the modern Central and South Americans. In contrast, though the modern Tsimshian fall closest to the modern Central and South American and the other ancient samples from the Americas, they also lie intermediate between the modern Native American and modern European samples.

Assessment of population structure using ADMIXTURE
We started with the identical filtered dataset of called genotypes described in the Methods. We further pruned the dataset by removing sites in strong linkage disequilibrium ( ) using PLINK 2 . The program ADMIXTURE 30 was used to assess global ancestry of the ancient and present-day samples from this study. We computed cluster membership for K=2, through K=5 clusters, as displayed in Figure 1a. A total of 29,333 polymorphic loci were employed for this analysis.
At K=5, the ancient samples separate into their own cluster, depicted in gray (Fig. 1a). This cluster is also the major ancestry component of the present-day samples from this study, and is also a large proportion of other ancient and modern Native Americans (i.e., Anzick-1, Maya, and Surui). It should be noted that this genetic component (gray) decreases as the population sampling locations move south, whereas the ancient Saqqaq sample from Greenland share little, potentially suggesting a different migration wave. Further, the modern Tsimshian samples from this study display a large ancestry component matching Europeans, which is consistent with admixture from European contact with the Americas. This result is also reflected in the MDS analysis (Fig. 1b), in which the modern Tsimshian individuals have ancestry intermediate between other Native American groups and Europeans.

HLA-DQA1 SNP validation and variant location
The HLA-DQA1 SNP showing the highest frequency changes (located at positions chr6: 32605189 and chr6: 32605197) between the PRH Ancients and Tsimshian were confirmed via Sanger sequencing from 18 of the ancient individuals ( Supplementary Fig. 4). The remaining 7 ancient samples either did not amplify in the specified region or the sequence was not readable. The extraction method was the same as described above. Forward and reverse PCR primers were constructed as follows: CCTCACAATTACTCTACAGCTCAG and CTCATGCACTCACCCACAA.

Gene function of genes putatively under positive selection
Gene function descriptions derived from GeneCards 8 . See Supplementary Tables 5, 6, 8, and 9.

Gene ontology enrichment
Gene ontologies were performed using the ranked list of genes from the PBS scan and produced with GOrilla (http://cbl-gorilla.cs.technion.ac.il) 31 . See Supplementary Table 7.

Supplementary Note 4 -Assessment of biases from capture probes
We examined the possible effects of reference bias in our data due to the capture probe design. We chose the two samples with average coverage greater than 20x, samples 302 and 443, and called heterozygous sites with Samtools. Only sites with 20 reads or higher were kept. The distribution of the proportion of reads matching the reference, across SNPs in the exome, were plotted in Supplementary Figure 23. The two ancient samples show a slight bias, both showing a mean proportion of ~0.56 in favor of the reference allele. This bias could be attributed to differential mapping and potentially the design of the capture probes. However, this bias does not correlate with our top PBS hit, since the SNPs are all called for the alternative allele to the reference.

Supplementary Note 5 -DNA extraction and library prep
DNA extraction DNA extractions and PCR amplification setups were completed in an ancient DNA laboratory facility at the University of Illinois Urbana-Champaign. The ancient DNA lab is a positively pressured clean room with hepa-filtered air. The clean room contains an anteroom and air flows from the ancient DNA lab to the anteroom to the hallway. Personnel working in the ancient DNA lab wear disposable hairnets, facemasks, laboratory coveralls and booties. All equipment, reagents and consumables are dedicated for use in the ancient DNA laboratory. The ancient DNA lab is routinely cleaned with bleach and all containers are wiped with Takara DNA Off (Mountain View, CA) before placed in the ancient DNA lab. Personnel are restricted from entering the ancient DNA after being in a modern DNA laboratory. A database containing mitochondrial control region sequence is maintained of all personnel working in the ancient laboratory and of any personnel who may have come into contact with the human remains prior to DNA analysis. Contamination controls were used with every DNA extraction and PCR setup in order to detect any contamination from reagents. A series of negative controls are routinely performed in the ancient DNA lab.
Each tooth was soaked in 6% sodium hypochlorite for 3 minutes, rinsed three times with UVirradiated molecular grade water, and dried in a UV Crosslinker for 20 minutes, so as to remove surface contamination. Approximately 0.20 grams of tooth powder was incubated in 4 ml of demineralization/lysis buffer (0.5 M EDTA, 33.3 mg/ml Proteinase K, 10% N-lauryl sarcosine) for 24 hours at 37ºC. The digested sample was then concentrated to approximately 250 µl using Amicon centrifugal filter units. Following concentration, the digest was run through silica columns using the MinElute Qiagen PCR Purification Kit (Qiagen, Hilden, Germany) and eluted in 60 µl of DNA extract.

DNA screening for mtDNA
In order to test for viable DNA before proceeding with library building and exome sequencing, each ancient individual was amplified for the hypervariable region I of mitochondrial DNA from 2 µl of extract, utilizing the reagents and conditions described in Malhi et al. 15 . Native American maternal haplogroup was also confirmed for each sample via restriction fragment length polymorphism analysis for Native American haplogroups (A, B, C, D, and X) 16 , or via off-target reads from the exome capture and previously performed captures of the mitochondrial genome, as described in Cui et al. 11 (Supplementary Table 1). All 25 ancient samples demonstrated Native American mitochondrial haplogroups.

Library build
Libraries were created in the ancient lab facility using the New England Biolabs Ultra Kit for Illumina (E7370S, Ipswich, MA) following the manufacturer's protocol with the following modifications. DNA fragmentation was not performed. DNA purifications were done using the MinElute Reaction Cleanup Kit (Qiagen, Valencia, CA). Library amplification was done in two steps. The first round of amplification utilized the kit's reagents and protocol with 12 cycles of (10s at 98°C, 30s at 65°C and 30s at 72°C). For the second round, we achieved a sufficient DNA concentration for the exome enrichment (~500ng), without excessive amplification, by creating 4 PCR reactions from the initial amplified product and then pooling them before using a Qiagen MinElute PCR Clean-up kit. For the 2 nd PCR, we created a 50 µl reaction, utilizing 0.2μM of primers P5 (5'-AATGATACGGCGACCACCGA) and P7 (5' CAAGCAGAAGACGGCATACGA) 17 , 5μl from the initial PCR, 25µl of Phusion® High-Fidelity PCR Master Mix with HF Buffer (New England Biolabs, Ipswich, MA), 3% DMSO (New England Biolabs, Ipswich, MA), 0.2mg/ml BSA (New England Biolabs, Ipswich, MA). PCR conditions were as follows: 4m at 98°C, 10 cycles of (10s at 98°C, 30s at 62°C and 30s at 72°C), with a final extension at 72°C for 10m. Library fragment sizes were confirmed via a BioAnalyzer High Sensitivity assay to be above 130bp.
Copenhagen sample sequencing PRH Ancient samples 406 and 507 were extracted, libraries built, and sequenced at the Centre for GeoGenetics in Copenhagen, Denmark, using the protocols described in Rasmussen et al. 18 . The exome enrichment was also conducted at the Copenhagen facility utilizing the Illumina TruSeq Exome Enrichment Kit (Illumina, San Diego, CA), with the protocol modifications described above.

Supplementary Note 6 -Mapping and damage patterns Mapping
Raw data from the Illumina HiSeq 2000 platform was base called with CASAVA 1.8.2. Sequences were de-multiplexed with a requirement for a full match of the six nucleotide indexes that was used for library preparation. Illumina adapter sequences were trimmed using Trimmomatic-0.32 19 with a minimum length of 25 and removing leading and trailing quality or N bases below a quality score of 3. Reads were additional trimmed for 5 bp at each end to minimize transitions due to DNA damage. Trimmed reads were aligned to the human reference genome, Hg19, HS Build37.1, using bwa 20 with seeding disabled and with parameters set according to published recommendations for ancient DNA 21 . SAMtools-1.1 22 was used to sort and remove duplicate reads based on mapping positions.
Our bioinformatics pipeline was confirmed for accuracy by sequencing and processing two samples from the 1,000 Genomes Project 23 with the methods described above. DNA samples from NA18524 and NA18486 were retrieved from the NHGRI sample repository (Coriell Institute, Camden, NJ). The results were plotted on an MDS plot with the HGDP-CEPH Diversity Panel 24 mapped to Hg19, and the two samples clustered with their expected populations (Fig S2).

DNA damage patterns
DNA damage (type I and type II) was assessed by comparing TC/GA and CT/A G transitions, respectively using MapDamage 2.0 6 . A specific pattern of DNA damage has been identified in other ancient DNA studies 25,26 . These studies show a pattern of increased type II DNA damage at the beginning and end of degraded DNA fragments. The MapDamage results show signatures of DNA damage, consistent with use of both the AT overhang library technique and blunt end 27 , which suggests that the ancient sequences originate from ancient DNA templates and not modern contaminants ( Supplementary Fig. 2).

Supplementary Note 7 -Relatedness analysis
Relatedness of the Ancient PRH individuals was assessed using KING 9 , where the kinship coefficient Φ for ranges of >0.354, 0.177-0.354, 0.0884-0.177, 0.0442-0.0884 correspond to duplicate/MZ twin, first-degree, second-degree, and third-degree relationships, respectively. Pairwise comparisons revealed no inferred close relatives, except for PRH 516 and PRH 507, where a possible second-degree relationship was inferred (Supplementary Table 10). However, it should be noted that these two individuals are temporally separated by hundreds of years, and therefore this inferred relationship is likely a false positive.

Supplementary Note 8 -PBS Selection Scans
We performed two scans for positive selection. The first was a per-gene scan, in which we calculated PBS for the PRH Ancient population ( ) using all data at given gene. Gene annotations were derived from RefSeq, utilizing the longest transcript for a given gene. The transcript length was taken as the transcription start to the transcription stop, and included both introns and exons.
between each pair of populations was calculated using all SNPs that fell between the transcription start and stop of the gene, as well as 10 kilobases upstream of the transcription start and 10 kilobases downstream of the transcription stop (similar to how it was performed in Huerta-Sánchez et al. 32 ). We then ranked each gene in the genome with decreasing PBS. The top two candidates are displayed in Table 2.
The second scan was a per-SNP scan, in which we calculated PBS for the PRH Ancient population ( ) and the modern Tsimshian population ( ) at each SNP. That is, between each pair of populations was calculated for a given SNP, and this set of values was used to calculate PBS for that SNP. We then created Manhattan plots, and highlight chromosome 6 in Figure 3.
Results from this scan are highlighted in Supplementary Table 8. Using his different reference sister Native American population still suggests that HLA-DQA1 is a reasonable candidate, as it is ranked fourth in the scanwith the three genes ranked above it devoid of functional characterizationusing Peruvians rather than modern Tsimshian as a reference sister population.