Diversity of HLA Class I and Class II blocks and conserved extended haplotypes in Lacandon Mayans

Here we studied HLA blocks and haplotypes in a group of 218 Lacandon Maya Native Americans using a high-resolution next generation sequencing (NGS) method. We assessed the genetic diversity of HLA class I and class II in this population, and determined the most probable ancestry of Lacandon Maya HLA class I and class II haplotypes. Importantly, this Native American group showed a high degree of both HLA homozygosity and linkage disequilibrium across the HLA region, and also lower class II HLA allelic diversity than most previously reported populations (including other Native American groups). Distinctive alleles present in the Lacandon population include HLA-A*24:14 and HLA-B*40:08. Furthermore, in Lacandons we observed a high frequency of haplotypes containing the allele HLA-DRB1*04:11, which is relatively frequent in comparison with other neighboring indigenous groups. The specific demographic history of the Lacandon population, including inbreeding, as well as pathogen selection, may have elevated the frequencies of a small number of HLA class II alleles and DNA blocks. To assess the possible role of different selective pressures in determining Native American HLA diversity, we evaluated the relationship between genetic diversity at HLA-A, HLA-B and HLA-DRB1 and pathogen richness for a global dataset and for Native American populations alone. In keeping with previous studies of such relationships, we included distance from Africa as a covariate. After correction for multiple comparisons we did not find any significant relationship between pathogen diversity and HLA genetic diversity (as measured by polymorphism information content) in either our global dataset or the Native American subset of the dataset. We found the expected negative relationship between genetic diversity and distance from Africa in the global dataset, but no relationship between HLA genetic diversity and distance from Africa when Native American populations were considered alone.

Diversity of Lacandon HLA alleles. Forensic parameters of genetic diversity were calculated to assess HLA diversity in the Lacandon: polymorphism information content (PIC), power of discrimination (PD), and Hardy-Weinberg equilibrium (HWE) (Table 8). HLA-A, HLA-B and HLA-DRB1 were the most polymorphic loci, with PIC values of 0.8444, 0.8227 and 0.7555, respectively, whereas HLA-DPB1 and HLA-DPA1 were the least diverse HLA loci, with PIC values of 0.3547 and 0.1581, respectively. Observed heterozygosity (OH) was significantly (p < 0.05) lower than expected heterozygosity (EH) for the HLA-B, HLA-C, HLA-DRB1, HLA-DRB3/4/5, HLA-DQA1 and HLA-DQB1 loci. In contrast, the HLA-A locus exhibited a higher OH than EH (p < 0.0001).
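As an illustration of the diversity parameters named above, the following minimal Python sketch computes PIC, expected heterozygosity and observed heterozygosity for a single locus. It assumes the standard Botstein et al. definition of PIC, which the text does not spell out; the function names and the toy frequencies are hypothetical and are not taken from Table 8.

```python
from itertools import combinations


def expected_heterozygosity(freqs):
    """Expected heterozygosity under Hardy-Weinberg proportions: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content (assumed Botstein et al. form):
    1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles at the locus.
    genotypes: list of (allele_1, allele_2) tuples, one per individual."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


# Toy example with two equally frequent alleles (hypothetical data).
toy_freqs = [0.5, 0.5]
toy_genotypes = [("x", "y"), ("x", "x"), ("y", "y"), ("x", "y")]
print(pic(toy_freqs), expected_heterozygosity(toy_freqs),
      observed_heterozygosity(toy_genotypes))
```

A locus where observed heterozygosity falls below the expected value, as reported for most loci here, would show `observed_heterozygosity(...)` smaller than `expected_heterozygosity(...)`; testing whether that difference is significant requires a separate HWE test that this sketch does not implement.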
Genetic similarities with other populations. A PCA plot and a population phylogenetic tree were constructed using 180 populations (including the Lacandon group studied in this work) with HLA-A, HLA-B and HLA-DRB1 data from a worldwide population dataset. Figure 2 and Supplementary Fig. 1 illustrate the results of the Principal Components Analysis (PCA). The Lacandon Mayans (purple star) cluster together with other Mexican Native American and Mexican Admixed populations, including Mayans from Guatemala. In addition, a Neighbor-Joining (NJ) analysis (Fig. 3) revealed that Lacandon Mayans are more closely related to the Mixe, Mixtec and Zapotec Mexican Native American populations, which are also the geographically closest ones to the Lacandon Mayans. It is interesting to note that, using HLA genes as genetic estimators, it is possible to mimic the results obtained with genome-wide data 49 , although not at the same resolution.
Non-overlapping associations between HLA alleles. Plots of the frequencies of all possible HLA-A~B, HLA-B~C and HLA-B~DRB1 allele combinations are shown in Fig. 4, and plots for all HLA class I and class II associations can be found in Supplementary Fig. 3. Visual inspection suggests a high degree of non-overlap between HLA-B and -C in particular. Figure 5 displays the f*adj metric (a parameter used to rank the strength of non-overlapping associations between different pairs of HLA loci) for all possible pairwise combinations of HLA loci with HLA-A (4a), with HLA-B (4b) and with HLA-DRB1 (4c), showing also the distribution of f*adj values obtained when the alleles at the relevant loci are randomized (retaining their population frequencies within the dataset). The heatmap in Fig. 4 (4d) illustrates how many standard deviations above the mean randomized f*adj value the actual Lacandon f*adj value is for each indicated pair of loci. HLA-B and -C exhibit the highest degree of non-overlap by this measure, followed by HLA-A and -B, HLA-DRB1 and HLA-DRB3/4/5, HLA-A and -C, and HLA-DRB3/4/5 and HLA-DQA1. All f*adj values for each pair of HLA loci in the dataset are displayed in Supplementary Table 4.
Assessment of the correlation between pathogen richness and HLA diversity. We calculated PIC values for 122 populations as an estimator of genetic diversity for HLA-A, HLA-B and HLA-DRB1 high resolution data (see Supplementary Table 1). We extracted pathogen and viral richness data from the GIDEON database 57 . We first tested the correlation between genetic diversity for each gene and geographic distance from Africa. Distance was calculated from East Africa to the location of each sample set analyzed. The general outlooks for (a) pathogen richness; (b) viral richness; (c) HLA class I (represented by its most polymorphic gene, HLA-B) genetic diversity; and (d) HLA class II (represented by its most polymorphic gene, HLA-DRB1) genetic diversity are shown in Fig. 6. We ran linear regressions for the PIC values for all three HLA loci vs. the geographic distance from Africa (Supplementary Fig. 4). We found a similar tendency for the three genes analyzed: a general decrease in diversity as distance from Africa increases (HLA-A: r² = 0.5444; HLA-B: r² = 0.2787; HLA-DRB1: r² = 0.3352; all these r² values correspond to regressions run after removing outliers). We then ran general linear models including both distance from Africa and either pathogen richness (Supplementary Table 3) or viral richness (Supplementary Table 4) as predictor variables. No relationships were apparent between HLA-A or HLA-B diversity and either pathogen or viral richness. The 95% confidence interval for the gradient of the relationship between HLA-DRB1 diversity and both pathogen and viral richness suggests a negative relationship (Supplementary Tables 3 and 4), but these relationships do not retain significance following a Bonferroni correction. Furthermore, only the negative relationship between viral richness and HLA-DRB1 diversity in the non-Native American populations retains a 95% confidence interval < 0 in the analysis excluding outliers.
Table 1. Allelic frequencies of HLA class I (HLA-A, -B and -C) in 218 Lacandon Native Americans. A.F.: Allele frequency. HLA-B alleles marked with *** are classified as HLA-Bw4 alleles 132 . HLA-C alleles marked with *** are classified as HLA-C2 alleles 91 .

Discussion
In this work, we used next generation sequencing to carry out high resolution typing of the HLA-A to HLA-DPB1 loci in a group of Lacandon Maya settled in the Lacandon Rainforest in the lowlands of Chiapas State in the southeast of Mexico. We determined the distribution of HLA alleles and CEHs and their possible ancestral origin, and assessed genetic diversity within the classical HLA genes. We also analyzed this Lacandon Maya population together with other populations, both Native American and non-Native American, to assess not only genetic relationships but also general trends in the correlation of genetic diversity with geographic distance from Africa and with both pathogen and viral richness.
The PCA plot (Fig. 2) showed that the Lacandon population is genetically similar to other Native American populations, such as Mayos 28 , Teenek 58 , Seri 25 , Maya 30 , Wayu 33 , and Quechua 34 . When we performed a phylogenetic analysis of the relationship between the Lacandon and other North and South American Native populations, we found that the Lacandon belonged to the same clade as Native North Americans from Oaxaca (Mixe, Mixtec and Zapotecs) 29 and the very divergent Yucpa from Venezuela 35 . However, these analyses were performed using allelic frequencies, not haplotypic data. When it comes to HLA, allelic diversity in Mesoamerican-descent groups tends to group together most of the Native North American populations. Haplotypic diversity can distinguish finer scale relationships among Native American populations, reflecting that although a limited allelic diversity came into the continent when the first human settlers arrived, recombination played an important role in adaptation to new environments and population differentiation. This is exemplified by the fact that four out of the top ten CEH in the Lacandon were not previously reported (accounting for 30.58% of the total CEHs), although all four haplotypes contained alleles for both class I and class II that are commonly present in other Native American populations.
As it has been shown before 49 [...] northern India with an A.F. = 0.0100 63 . Furthermore, we observed a high frequency of haplotypes containing the allele HLA-DRB1*04:11 in Lacandons. This allele was found at a frequency of 0.3690, the highest when compared with other neighboring indigenous groups 29,30 . Other authors 37 have reported class II HLA alleles in the Lacandon (N = 162), and found an even higher frequency of HLA-DRB1*04:11 (our study: 0.3690 vs. ref. 37: 0.5740, p = 0.0002). Only two out of nine HLA class II blocks present in previous reports 37 and in our study were found to be at different frequencies: [...] population 66 . One SNP position in LD with non-DRB3, non-DRB5 haplotypes (i.e. the ones not linked with the HLA-DRB1*04 allelic group, for instance) has been reported to be associated with a positive tuberculosis test in a recent study using genome-wide data paired with fine-mapping of the HLA region 67 . Although the significance is marginal, this finding further supports the involvement of the region in susceptibility to tuberculosis. Also, the HLA-DRB1*04 allelic group alleles, highly prevalent in Lacandon Mayans, have been suggested to have a protective effect against hepatitis B virus infection 68 .
A high frequency of "rare alleles or haplotypes" may be present due to genetic isolation and small population size, both of which increase the effects of genetic drift. Class I (ABF: 0.4847), class II (ABF: 0.1659) and CEH (ABF: 0.5786) haplotypes previously unreported in Native American populations account for an important part of haplotypic diversity in the Lacandon population, and they point to a distinctive Native American root that was overlooked until recent times 49 . For instance, haplotype HLA-A*31:01~B*40:02~C*03:04 (H.F. = 0.1310) was previously reported only in East Asian populations such as Japanese 69,70 , Chinese 71 and Malaysia Peninsular Chinese (data collected by Sulaiman Salsabil, available in 61 ), and in mixed-ancestry populations such as Mexico City 72 and "Hispanics" from USA 73 . Its frequency in Lacandon Maya is the highest ever reported for this haplotype, which would be indicative of a haplotype of Native American most probable ancestry (MPA) that can be traced back to its original East Asian ancestry thousands of years ago. HLA-B*35~C*07 associations, although uncommon, have been previously reported in Native North American populations such as Mixtec and Mixe 29 but also in some Asian populations such as Rakhine from Myanmar (data collected by Thu ZinZin, available in 61 ). Again, Lacandon Maya exhibit the highest frequency (H.F. = 0.1659) of this haplotype ever reported for any population. The reasons why these associations have never been previously found in Native American human groups include biased sampling procedures, but may also reflect genetic drift or pathogen selection having led to the extinction of specific haplotypes in other Native American groups, or the elevation of specific haplotypes in the Lacandon.
It is important to note that a European haplotype is present among the top ten most frequent haplotypes in our sample: A*02:01~B*18:01~C*07:01~DRB1*11:04~DQB1*03:01, with a frequency of 2.40% (Table 6). That haplotype has been reported at frequencies ranging from 0.91% to 7.32% in populations from eastern and Mediterranean regions of Europe, such as Albania, Macedonia, Greece, Bosnia and Herzegovina, Romania and Italy 71,74-78 . The only African haplotype present at least twice in our sample was A*68:02~B*53:01~C*04:01~DRB1*13:03~DQB1*02:02 (HF = 0.44%; Table 6). This class I and class II haplotype can only be found in mixed ancestry populations such as Mexicans from Mexico City 72 and African Americans 73 , but the class I block is present in populations from sub-Saharan Africa such as the Bandiagara from Mali 79 , the Nandi and Luo from Kenya 79,80 and the multicultural Worcester region in South Africa 81 . These two examples of non-Native American haplotypes reflect admixture events consistent with the ancestries brought into the Americas by conquerors during the conquest wars and colonial times 82,83 , even in what is considered to be one of the most isolated Native American human groups 52,53 .
HLA molecules have an important biological role as Killer cell Immunoglobulin-like Receptor (KIR) ligands. KIR2DL1, KIR2DL2/3 and KIR3DL1 bind HLA-C2, -C1 and -Bw4 ligands respectively, resulting in inhibition of natural killer (NK) cell-mediated cytolysis. C2, C1 and Bw4 are all found on HLA class I molecules: C1 and C2 are exclusive to HLA-C, while Bw4 can be found on some HLA-B and some HLA-A molecules. Several physiological functions of NK cells in human immunity and reproduction depend upon diverse interactions between KIRs and their HLA class I ligands 84-87 . In most populations, HLA-Bw4 alleles (i.e. those carrying asparagine, aspartic acid or serine in amino acid residue 77 and isoleucine or threonine in amino acid residue 80) are present in nearly 50% of the haplotypes, which means that around 75% of the individuals of any given population should express a ligand for the KIR3DL1 receptor 88 [...] 60% in Oceania native human groups and some African populations. In our sample, the C2 ligand is present on 0.2294 of the HLA-C alleles (Table 1). This level of HLA-C2 is consistent with other Native American populations in which HLA-C2 alleles are present at AF < 0.3, such as Barí (0.075) and Yucpa (0.214) from Venezuela's tropical humid forests 35,36 , both of which share the same type of ecosystem with Lacandon Maya.
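The step from "nearly 50% of the haplotypes" to "around 75% of the individuals" can be reproduced with a short calculation. The sketch below assumes that the two haplotypes of an individual are drawn independently (Hardy-Weinberg proportions), which is an assumption of this illustration and may not hold in a population with excess homozygosity; the function name is hypothetical.

```python
def carrier_frequency(haplotype_frequency):
    """Expected fraction of individuals carrying at least one copy of a ligand,
    assuming the two haplotypes of each individual are drawn independently."""
    return 1.0 - (1.0 - haplotype_frequency) ** 2


# A ligand present on half of the haplotypes is expected in 3 of 4 individuals.
print(carrier_frequency(0.5))  # 0.75
```

The same function can be applied to any ligand frequency, under the same independence assumption.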
As expected, when PIC values observed in Lacandons (Table 8) are compared against those found in a mixed ancestry population 72 [...]
In what could be called the "class I/class II diversity paradox", the Lacandon population exhibits a relatively low diversity in HLA-DRB1 (globally one of the most variable regions in the human genome) compared to HLA-A and HLA-B. It is possible that the lower diversity of MHC class II genes in Native American populations (and, by extension, in mixed populations with high proportions of Native American ancestry) might result from the frequency increase of some alleles that provide efficient immune protection against highly prevalent extracellular pathogens in specific populations 6,7,15,19,72 .
Previous studies have demonstrated a positive correlation between HLA class I allele diversity and pathogen richness and a negative correlation between HLA class II diversity and pathogen richness, raising the possibility that HLA class I and class II genes follow different evolutionary trajectories in response to pathogen selection 5,6 . However, Sanchez-Mazas et al. 6 found that the significance of both their observed positive correlation between HLA class I allele diversity and pathogen richness and the negative correlation between HLA class II diversity and pathogen richness disappeared when Native American and Taiwanese populations were removed from the dataset. Native American populations have low HLA class II diversity in a high pathogen environment (this in turn seems to have helped to generate the previously observed negative correlation between class II diversity and pathogen richness). Is it possible that low HLA class II diversity can be a form of adaptation to a high pathogen environment? The class II HLA evolutionary mechanisms proposed by Sanchez-Mazas et al. 6 apply particularly to HLA-DQA1 and HLA-DQB1, whereas, as noted previously, the Lacandon also exhibit relatively low diversity in HLA-DRB1. Certain HLA alleles appear to be promiscuous and are capable of binding an exceptionally large set of epitope peptide segments 15 . Since the HLA class II alleles commonly found in Native Americans (except for HLA-DRB1*16:02), and specifically those reported here for Lacandon Mayans, do not fall within the category of "promiscuous alleles" established by Manczinger et al. 15 , we can in principle hypothesize that selection events happening in recent times may have driven this genetic structure. Other non-promiscuous alleles include those of the HLA-DRB1*04 allelic group (accounting altogether for 70.22% of the total HLA-DRB1 diversity observed in Lacandons), with some of them being associated with specific pathogens as discussed above.
It is noteworthy that at least one allele of this allelic group has been implicated in resistance to the development of enteric fever caused by Salmonella enterica 92 and that there is molecular evidence of this pathogen causing at least one outbreak after the conquest in a region not far away from Chiapas (i.e. the state of Oaxaca) 93 . New alleles, with a very specific peptide binding repertoire, might be the most efficient way to achieve resistance to specific pathogens, as high promiscuity may not be able to cope with the rise of novel pathogens 15,94 .
Sanchez-Mazas et al. 6 found a positive relationship between HLA-B genetic diversity and either pathogen score or viral score, once distance from Africa was taken into account. In our analysis, our fitted coefficients were consistent with a positive relationship, but we could not reject the null hypothesis of no relationship (Supplementary Tables 3 and 4). Sanchez-Mazas et al. did not find evidence for a relationship between HLA-DRB1 genetic diversity and either pathogen score or viral score. Within our dataset, the strongest signal of any relationship (albeit one which was not robust to a Bonferroni correction) was for a negative relationship between HLA-DRB1 genetic diversity and viral score once distance from Africa is considered, in non-Native American populations. We found no evidence for a relationship between either distance from Africa or pathogen/viral scores and HLA diversity when the Native American populations were considered alone. This may be due to the small size of the dataset.
A better picture may be drawn in the coming years when ancient DNA analyses are taken into account for studying the immunogenetic diversity of these and other populations before and after specific events that would have [...]
It has been shown using a multilocus model of host-pathogen co-evolution with allele-specific adaptive immunity that if selection from a multi-epitope, strain-structured pathogen is maintaining associations between host recognition loci, alleles at those loci should not only be in linkage disequilibrium, but also exhibit non-overlapping associations 24 . Penman et al. 24 showed in particular that pathogen selection has the potential to maintain non-overlapping associations between HLA loci despite the presence of recombination between those loci. They demonstrated that such long-range non-overlapping associations may be observed between HLA loci such as HLA-B and HLA-DRB1 in a dataset of HLA haplotypes from the Hutterite population of South Dakota. We examined non-overlapping associations within Lacandon HLA haplotypes in an effort to identify similar possible signatures of pathogen selection (Figs. 3, 4 and Supplementary Fig. 3). In our dataset, the highest levels of non-overlap between HLA loci are observed between the physically close HLA-B and -C pair and the similarly physically close HLA-DRB1 and HLA-DRB3/4/5. Such associations could be a result of pathogen selection, but given the likely low level of recombination between these loci we cannot rule out other population genetic processes. For HLA-DRB1 and HLA-DRB3/4/5, epistasis with the HLA-DRA locus, with which the gene products of HLA-DRB1 or HLA-DRB3/4/5 must interact, may contribute to particularly high levels of non-overlap. Recent work has also revealed that haplotypic associations between HLA-B and HLA-C in particular may be driven by the different mechanisms by which the proteins they encode are likely to be able to participate in NK cell education 95 .
There was no evidence for non-overlapping associations spanning class I and class II HLA loci in the Lacandon dataset, which may suggest that the most recent forms of pathogen selection in the Lacandon population have acted on the class I and/or class II loci separately. Furthermore, given the particularly devastating infectious disease events that the Lacandon and other Native American populations have experienced, in which many people died in large epidemics 93,96-99 , it is unlikely that the patterns seen in this group are the result of the long-term pathogen selection simulated in the original Penman et al. model 24 . It may be that the Lacandon population maintained a set of haplotypes as a consequence of millennia of co-evolution with prevalent American pathogens, and the pattern seen today is the consequence of a short-term disruption of that state. It has been recently found 100 that class II genes show a signature of selection when comparing Native Americans from the same region in the northwest coast of North America before and after contact with Europeans and European-borne pathogens during the conquest and colonial period 98,99,101 . Our results further add to the discussion on whether the HLA class II region could have been under similar selective pressure in Native Americans.
In summary, the Lacandon Maya represent one of the most homozygous and least diverse populations regarding HLA class II genes among Native American populations. It is possible that a history of infections, genocides and inbreeding contributed to this limited diversity in the HLA system. The relationship between genetic diversity of the HLA system and both pathogen and viral diversity, as well as its correlation with the geographic distance from Africa, may become clearer if only data from previous time periods are considered. Future work using ancient DNA approaches to study populations before specific historic and prehistoric periods (e.g. the conquest of the Americas or great human migrations), but also identifying pathogens that caused outbreaks and epidemics, may shed light on the actual diversity of the HLA system in human populations before and after epidemic and pandemic events that may have shaped the immunogenetic diversity of our species.

Subjects and Methods
Subjects. A total of 218 first-degree non-related Lacandon individuals [146 ♀ (64.2%) and 82 ♂ (35.8%); average age = 31.7 ± 17.4 years] were studied, for a final number of 436 chromosomes. All participants were inhabitants of small villages located in the Lacandon jungle in the southern State of Chiapas, Mexico, including Lacanjá, Bethel, E. Lacandón, Metzabok, Na Há, San Javier, and Tumbo, all of them belonging to the municipality of Ocosingo (Fig. 1). We confirmed the Lacandon ancestry (parents and grandparents born in the same region) of all included individuals by questionnaire. Collection of blood samples and demographic data was performed according to the requisites of the Helsinki Declaration (2008) and the General Health Law of Mexico, following the protocol approved jointly by the Ethics in Research Committee and the Research Committee from the National Institute for Medical Sciences and Nutrition "Salvador Zubirán" (INCMNSZ). All subjects provided written informed consent for these studies, and they authorized the storage of their DNA samples. When participants were under 18 years at the time of the sample collection, informed consent was obtained from a parent and/or legal guardian.
High resolution HLA typing by next generation sequencing. Genomic DNA was obtained from peripheral blood mononuclear cells using the QIAamp DNA mini kit (Qiagen®, Valencia, CA, USA). All samples were typed for 11 HLA loci utilizing low- and high-resolution methods. Low resolution typing was performed by a Luminex-based detection and typing method including PCR amplification and reverse oligonucleotide hybridization (LABType® SSO Typing Tests, One Lambda Inc., Canoga Park, CA, USA). High-resolution HLA class I and class II typing was performed by a recently developed high throughput sequencing technique involving long range PCR of each HLA locus, shearing of DNA and high throughput NGS typing as previously described 102 . In addition, these samples were typed with a commercial kit, MIA FORA™ NGS FLEX HLA Typing Kit (Immucor, Norcross, GA, USA), that utilizes a similar method which applies enzymatic fragmentation instead of Covaris [...]
[...] possible f*adj scores for each pair of loci in the dataset that would be obtained if the alleles at those loci were associated entirely at random. We then calculated the difference between the f*adj value calculated from the Lacandon dataset for each pair of loci and the mean of the f*adj scores calculated from the randomized data for that pair of loci, then divided that difference by the standard deviation of f*adj calculated from the randomized data. The resulting scores allowed us to rank the HLA pairs in order of how extreme a level of non-overlap they displayed relative to entirely random associations between the same alleles. There are many reasons why associations between HLA loci should not be entirely random, so we do not claim that a departure from randomness necessarily means selection has occurred. However, ranking the pairs of HLA loci by how much they each depart from a state of random association allows for a more meaningful assessment of which pairs of HLA loci are most strongly non-overlapping, since it accounts for the differing allelic diversity at each locus.
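The standardization described above (observed value minus the mean of the randomized values, divided by their standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the definition of f*adj is not given in this section, so the statistic is passed in as a function, and `pair_homozygosity` is a stand-in statistic used only to make the example runnable. The function names, the number of permutations and the toy haplotypes are all hypothetical.

```python
import random
import statistics
from collections import Counter


def randomization_z_score(haplotypes, statistic, n_permutations=1000, seed=0):
    """Standardize an association statistic against random allele pairings.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples.
    statistic: function mapping such a list to a number (e.g. f*adj).
    Shuffling one locus keeps the allele frequencies at both loci unchanged.
    """
    rng = random.Random(seed)
    observed = statistic(haplotypes)
    first = [a for a, _ in haplotypes]
    second = [b for _, b in haplotypes]
    null_values = []
    for _ in range(n_permutations):
        shuffled = second[:]
        rng.shuffle(shuffled)
        null_values.append(statistic(list(zip(first, shuffled))))
    null_sd = statistics.stdev(null_values)
    if null_sd == 0:
        raise ValueError("randomized statistic does not vary; z-score undefined")
    return (observed - statistics.mean(null_values)) / null_sd


def pair_homozygosity(haplotypes):
    """Stand-in statistic (NOT f*adj): sum of squared two-locus haplotype frequencies."""
    n = len(haplotypes)
    return sum((count / n) ** 2 for count in Counter(haplotypes).values())


# Toy data: each allele at locus 1 is always paired with one allele at locus 2.
toy = [("b1", "c1")] * 10 + [("b2", "c2")] * 10
print(randomization_z_score(toy, pair_homozygosity, n_permutations=200))
```

Because each locus pair is compared with its own randomized distribution, the resulting scores are on a common scale even when the loci differ in allelic diversity, which is the rationale given in the text for ranking pairs this way.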
Analyzing genetic relationships between populations. A PCA plot and a population phylogenetic tree were constructed for 180 populations (including the Lacandon group studied in this work) with HLA-A, HLA-B and HLA-DRB1 low-resolution data from a worldwide population dataset (Supplementary Table 1) using the IBM SPSS Statistics 19 software (IBM Corporation, Armonk, NY, USA) for the PCA and POPTREEW 128 to analyze the distribution of 115 human groups with high resolution HLA typing, including our Lacandon sample. We used DA distance 129 and a NJ clustering method to construct a population phylogenetic tree with bootstrapping (1200 replications). References for each population group included in the PCA and phylogenetic analysis are listed in Supplementary Table 1.
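For readers unfamiliar with the DA distance, the sketch below shows how it can be computed between two populations from per-locus allele frequencies, assuming the usual form DA = 1 - (1/L) Σ_loci Σ_alleles sqrt(x·y). It is an illustration only: the tree in this study was built with POPTREEW, the function name is hypothetical, and the frequencies are toy values rather than data from Supplementary Table 1.

```python
from math import sqrt


def nei_da_distance(pop_x, pop_y):
    """DA distance between two populations (assumed standard form).

    pop_x, pop_y: dict mapping locus name -> dict mapping allele -> frequency.
    Only loci present in both populations are used; an allele absent from one
    population contributes zero to the sum.
    """
    loci = pop_x.keys() & pop_y.keys()
    if not loci:
        raise ValueError("the two populations share no loci")
    total = 0.0
    for locus in loci:
        freqs_x, freqs_y = pop_x[locus], pop_y[locus]
        shared_alleles = freqs_x.keys() & freqs_y.keys()
        total += sum(sqrt(freqs_x[a] * freqs_y[a]) for a in shared_alleles)
    return 1.0 - total / len(loci)


# Toy populations with hypothetical allele labels and frequencies.
pop_1 = {"locus_1": {"a1": 0.5, "a2": 0.5}}
pop_2 = {"locus_1": {"a1": 0.5, "a2": 0.5}}
pop_3 = {"locus_1": {"a3": 1.0}}
print(nei_da_distance(pop_1, pop_2), nei_da_distance(pop_1, pop_3))
```

A matrix of such pairwise distances is the input that a Neighbor-Joining procedure clusters into a tree.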
Assessment of the correlation between pathogen richness and HLA diversity. We used PIC as an estimator for genetic diversity for three HLA loci (HLA-A, HLA-B and HLA-DRB1) and conducted an analysis of HLA genetic diversity and pathogen species richness similar to that of Sanchez-Mazas et al. 6 . First, we tested the correlation between genetic diversity for each gene and the geographic distance from Africa. We maintained the five key geographic points suggested by those authors and modeled the distance (in km) from Addis Ababa (as a suggested point of departure) to each of the locations where the 122 populations with available high-resolution data live today. In our dataset the same populations were used for the three genes analyzed. Distance was calculated with the Measure distance function of Google Maps 130 . Information on pathogen and viral richness was extracted from the GIDEON database 57 . In order to relate the level of HLA polymorphism within a population to the pathogen environment of that population, we compiled the number of infectious diseases present in all countries for which we had information on HLA genetic diversity (N = 115). To assess the effect of genetic drift/recent bottlenecks/selection on small, isolated populations, we ran the statistical analyses after excluding Native American populations, and also ran tests on Native American populations separately. The final datasets thus include the worldwide populations' dataset, the Native American populations' dataset and the worldwide non-Native American populations' dataset.
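The distances in this study were measured with Google Maps. As a rough programmatic approximation, a path that starts at Addis Ababa and passes through intermediate waypoints can be measured by summing great-circle (haversine) segments, as in the sketch below. The waypoint and destination coordinates are hypothetical placeholders (the five key geographic points are not listed in this section), the Addis Ababa coordinates are approximate, and the function names are illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def path_distance_km(points):
    """Total length of a path visiting the given points in order."""
    return sum(haversine_km(p, q) for p, q in zip(points, points[1:]))


ADDIS_ABABA = (9.03, 38.74)          # approximate coordinates
HYPOTHETICAL_WAYPOINT = (30.0, 31.0)  # placeholder, not one of the study's points
HYPOTHETICAL_POPULATION = (17.0, -91.5)  # placeholder sampling location

print(path_distance_km([ADDIS_ABABA, HYPOTHETICAL_WAYPOINT, HYPOTHETICAL_POPULATION]))
```

Routing through waypoints makes the path at least as long as the direct great-circle distance, which is the purpose of constraining the route to plausible overland migration corridors.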
We used general linear models to determine how the distance from Africa and either pathogen or viral richness explained the level of genetic diversity (as estimated with PIC) at each of the three loci, for each of the datasets previously listed. We logit-transformed the PIC values before carrying out the analyses in order to improve normality. Coefficients for the relationships between distance, pathogen score or virus score; their 95% confidence intervals and associated p values; and the R² value for each model are shown in Supplementary Tables 3 and 4. We performed the analyses with and without outliers. Outliers were defined as populations for which the logit-transformed PIC value at any of the three loci was further than 2 standard deviations away from the mean logit(PIC) value for that locus.
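The logit transformation and the outlier rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sample standard deviation is used (the text does not say whether the sample or population form was applied), the function names are hypothetical, and the PIC values are toy numbers rather than data from the study.

```python
from math import log
from statistics import mean, stdev


def logit(p):
    """Logit transform of a proportion strictly between 0 and 1."""
    return log(p / (1.0 - p))


def flag_outliers(pic_values, n_sd=2.0):
    """Flag values whose logit lies more than n_sd standard deviations from the mean."""
    transformed = [logit(p) for p in pic_values]
    centre, spread = mean(transformed), stdev(transformed)
    return [abs(t - centre) > n_sd * spread for t in transformed]


def outlier_populations(pic_by_locus, n_sd=2.0):
    """A population is an outlier if it is flagged at any locus.

    pic_by_locus: dict mapping locus -> list of PIC values, with the
    populations in the same order for every locus.
    """
    flags_per_locus = [flag_outliers(values, n_sd) for values in pic_by_locus.values()]
    return [any(flags) for flags in zip(*flags_per_locus)]


# Toy PIC values for ten hypothetical populations at two loci.
toy_pic = {
    "locus_1": [0.80, 0.81, 0.79, 0.80, 0.82, 0.78, 0.80, 0.81, 0.79, 0.20],
    "locus_2": [0.70, 0.72, 0.68, 0.71, 0.69, 0.70, 0.73, 0.67, 0.70, 0.71],
}
print(outlier_populations(toy_pic))
```

In this toy example only the last population, whose PIC at the first locus is far below the rest, is flagged; the models would then be fitted once with all populations and once with the flagged ones removed.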

Data availability
All data from our sample sets, both frequencies and anonymized individual genotypes, can be found at The Allele Frequency Net Database website (www.allelefrequencies.net) 131