Recent genetic connectivity and clinal variation in chimpanzees

Much like humans, chimpanzees occupy diverse habitats and exhibit extensive behavioural variability. However, chimpanzees are recognized as a discontinuous species, with four subspecies separated by historical geographic barriers. Nevertheless, their range-wide degree of genetic connectivity remains poorly resolved, mainly due to sampling limitations. By analyzing a geographically comprehensive sample set amplified at microsatellite markers that inform recent population history, we found that isolation by distance explains most of the range-wide genetic structure of chimpanzees. Furthermore, we did not identify spatial discontinuities corresponding with the recognized subspecies, suggesting that some of the subspecies-delineating geographic barriers were recently permeable to gene flow. Substantial range-wide genetic connectivity is consistent with the hypothesis that behavioural flexibility is a salient driver of chimpanzee responses to changing environmental conditions. Finally, our observation of strong local differentiation associated with recent anthropogenic pressures portends future loss of critical genetic diversity if habitat fragmentation and population isolation continue unabated.

were mixed thoroughly using a vortex, followed by a 5-minute incubation at 4°C then continuous centrifugation for 5 minutes at 2800 rpm. We next added 20 l QIAGEN 102 Proteinase K to each reaction well and then resealed with new adhesive film. We 103 utilized a specialized QIAGEN QIAamp Fast DNA Stool Qiacube HT MB software 104 protocol that was modified with a pretreatment protocol, which included a filtration step 105 that required the use of an S-Block fitted with ports for a vacuum step, a QIAGEN Turbo 106 filter and the addition of 250 l of 100% ethanol to each reaction well during the 107 procedure. These steps were necessary to accommodate the desiccated faecal 108 samples used in this study, improve DNA yield and PCR amplification, and prevented 109 instrument clogging during the extraction process. 110 DNA extracts were PCR amplified at 14 unlinked, polymorphic microsatellite loci and an 111 additional sex-determining locus (amelogenin) using a two-step multiplex process 1,2 with 112 the following modifications. Rather than a second step singleplex PCR, we used 113 multiplexes of subsets of the target loci to improve workflow efficiency. In the first step 114 we used a primer concentration of 0.15 mM as in the two-step multiplex protocol 1   We amplified all loci used in the study in the first step. We then amplified the PCR product (1:100 123 dilution) in smaller targeted groups of four or five loci, segregated by optimal annealing temperature. The combination of 124 primers in the second multiplex step, Group 3a, did not reliably amplify alleles across loci, therefore for the second half of 125 the study we assembled a new combination, Group3b. 126 Genotype reconstruction 127 In a reliability test 4 , we determined that two replicates were required to achieve greater 128 than 99% certainty of correctly identifying homozygotes as follows. For each locus, we  confidence of homozygote assignment ((Allelic dropout rate) X < 0.01. Homozygote 137 assignment confidence was calculated by finding the value of the exponent (x) in which 138 allelic dropout rate (allelic dropout divided by total possible amplifications) was less than 139 0.01 for each locus. 140 have a high mutation rate 5 , that tends toward asymmetry 6,7 and follow several mutation 143 models; stepwise, infinite allele (point mutations) and two phase 3,8,9 . We began by using 144 established alleles and bin ranges put forth by previous studies in Pan troglodytes 145 schweinfurthii 2 , and scored 14 loci across all 4 subspecies using these alleles. We then 146 evaluated each locus for clustering of alleles by sorting raw allele sizes. The measured 147 size of an allele will vary due to instrument precision and variation in measurement  Microsatellite alleles typically mutate following a stepwise model, whereby alleles 159 mutate by an expansion or a contraction of one or more locus-specific repeat units, 160 leading to a consistent pattern (motif). Another, less common mutation model in 161 microsatellites, the infinite allele model, occurs when an allele changes state by a single 162 base pair (bp), also known as an instert-deletion (indel). This leads to a disruption of the 163 motif and can give rise to a new motif within a locus following subsequent stepwise 164 mutation events of the original indel. When identifying allele clusters we observed 165 numerous cases of "alleles" that fell within the gaps between repeat-unit clusters or at 166 the extreme ends of the allele range, likely a result of an indel. If they were recorded in 167 fewer than five individuals or seemed spurious, we tested the putative alleles by 168 amplifying them in a singleplex PCR as confirmation of their size and distinction from 169 other alleles. We retained only those that unambiguously produced the same allele 170 sizes as was recorded in the original genotypes. We noted that the single-bp indel 171 alleles tended to cluster geographically, for which we even identified several short 172 motifs. We also detected the presence of heterozygotes that had both a copy of a 173 common standard allele and a copy of an indel allele, as well as homozygotes with two 174 copies of an indel allele, further supporting their distinct identity. 175 We used the R package 'CongenR' 11 to assign alleles and assemble genotypes. We 176 then analysed our genotypes using Cervus 3.0.7 12 to calculate allele frequencies, 177 determine the minimum number of loci needed to discriminate individuals, assess null 178 alleles, identify recaptures and to perform first-order relatives analyses. Using the 179 resulting allele-frequency data we calculated that a minimum of eight loci were 180 necessary to be confident that two matching genotypes were obtained from the same 181 individual rather than a full sibling (PIDsib < 0.001). For all unique genotypes (sampled 182 only once), we used a minimum of seven loci for inclusion in our analyses. This resulted 183 in a total of 939 unique individuals spanning 48 sampling locations as 7 sites did not 184 yield any usable genotypes (Supplementary Table 1 and was a consistent outlier in all population structure analyses. Additionally, there was 227 also a locus (D5s1457) in the P.t. schweinfurthii population that displayed higher than 228 expected homozygosity, for which the null allele frequency was > 0.10 with a significant 229 deviation from HWE (p > 0.001). We suspected that a combination of another consistent 230 outlier in our downstream analyses (Issa) and population-size effects (differences in Ne, 231 i.e., the Wahlund effect) were driving this significant result. When excluding Issa, the 232 null allele frequency was < 0.10, but deviations from HWE remained significant. When    Table   271 1.5). In particular, the proportion of first order assignments at Mt. Sangbé was over 3 272 standard deviations above the mean in our dataset. This high degree of relatedness, at 273 least partially, explains why this population appears as an outlier in all of our analyses.    Finally, we tested whether or not the effective barriers detected in our species-scale 297 spatially explicit analysis (EEMS) were a result of highly localized differentiation. We re-298 analysed our data excluding three sites associated with the barriers. We found that the 299 barriers were no longer significant, suggesting that they were a result of local 300 differentiation and no major discontinuities were present in the genetic data.

302
Cluster (STRUCTURE) analyses 303 We used STRUCTURE version 2.3.4 18 , a "spatially agnostic" Bayesian cluster analysis, 304 to test for the presence of geographic stratification in our genetic data. This program 305 utilizes allele frequency and linkage equilibrium patterns to assign individual genotypes 306 to a user-defined number of groups known as clusters (K). Then a test is performed to 307 ascertain which K best explain the data. This algorithm is sensitive to the presence of 308 isolation by distance (IBD; spatial auto correlation in the genetic data), a bias the authors noted in the user manual 18 . Spatially agnostic analyses rely on uniform 310 geographic distribution of the genetic data, which is critical in cases in which IBD is 311 present in the data, a violation of which will lead to a considerable risk of a Type I error,

348
Following genotype reconstruction and prior to performing any other analyses it was 349 necessary to identify contaminated or misidentified samples. Although many primate 350 species occur sympatrically within the chimpanzee range, they are not closely related 351 and therefore not expected to share many alleles. Alleles from monkeys will tend to 352 have unusual sizes, falling outside chimpanzee allele bins and ranges. Closely related 353 species, such as humans and gorillas, which also occur sympatrically, however are much more likely to share alleles, making it more difficult to detect them among the 355 genotypes in our dataset. Also, it is considerably more difficult to distinguish faeces 356 among these three species as compared to between apes and monkeys. Since the 357 STRUCTURE clustering algorithm enables us to detect population structure within a 358 species, it is also effective for detecting interspecific structure as well 27 . To test for the 359 presence of contamination from closely-related sympatric species we added 10 known 360 human and 10 known gorilla samples to our dataset for analysis of K = 3. We were able 361 to identify seven gorilla samples in areas where they are known to co-occur with 362 chimpanzees. We also detected four other gorilla samples in West Africa where they 363 are absent, suggesting that they likely originated from monkey species. Fifty genotypes 364 clustered with the 10 known human genotypes, nearly all of which originated from three 365 field sites located in Côte d'Ivoire (Djouroutou, Taï, and Comoé). This suggests 366 localized sampling issues rather than a widespread problem in our sampling approach.

367
Importantly, we did not include monkey reference samples in this analysis, such that the 368 species attributions of the samples clustering with the gorilla and human genotypes are 369 not certain, other than they are not of chimpanzee origin. All these suspected non-370 chimpanzee samples were removed and excluded from all downstream analyses.

371
In order to proceed with the cluster analysis, it was necessary to account for imbalances 372 and spatial heterogeneity in our data by creating random and balanced subsets of the 373 genotypes. We limited our analyses to individuals with a minimum of seven confirmed 374 loci, and to prevent overrepresentation of genetically similar individuals, we limited the 375 maximum number of individuals from any one site to 20. We employed a two-level 376 strategy of subsetting our data, in which we first randomly sampled 20 genotypes from sites with more than 20 individuals. We performed 10 iterations of subsetting at this 378 level. Then we subsequently pooled all of the selected genotypes by subspecies, 379 followed by randomly sampling 70 candidates from each population, and repeated this 380 step for a total of ten randomized iterations. Our data include 46 genotypes from P.t.

Isolation by distance (IBD) and distance estimators 501
To determine whether the current delineation of subspecies is supported by genetic 502 evidence, we first tested for collinearity between genetic and geographic distance, IBD.   To do so we applied a stratified Mantel test, whereby the matrix of genetic distance is 536 permuted within predefined clusters, in our case subspecies population 33 . Note, the 537 result of this test is sensitive to correct group assignment of the data. The presence of IBD was rejected using the four-subspecies model (Mantel's r = 0.638, p = 0.323). Since 539 the STRUCTURE results at the species level consistently supported differentiation 540 between P.t. verus and ETS, we ran a second stratified Mantel test using these two 541 populations. We found that this yielded a trend in agreement with the cluster analysis 542 results (Mantel's r = 0.645, p = 0.081). This test therefore appeared to be influenced by 543 how we defined the groups so we used a partial Mantel test to assess collinearity 544 between genetic and geographic distance, while controlling for binary cluster 545 membership in predefined groups, in our case subspecies. Note, in tests of IBD applied 546 in this fashion, this approach has been shown to have a 20% Type I error rate (though, to be present in chimpanzee genetic data, we conclude the four-subspecies model of 553 population structure is problematic in our dataset in these tests of IBD, suggesting that a 554 number other than four discrete populations characterize these data.

555
Next we tested for hierarchical clustering of the genetic data as defined by the 556 chimpanzee subspecies delineation and presumed barriers to dispersal (K = 4). Since 557 spatial autocorrelation of our genetic data was significant, it was necessary to account 558 for it in the model. Here we again used a partial Mantel test comparing genetic data (x) 559 to a priori defined binary cluster membership (y) and geographic distance (z), to control   (Fig. 2a). We also found that, on 604 average, the genetic distance between two pairs of locations is higher (given the same 605 geographic distance) between subspecies than within subspecies (Fig. 2b). As the 606 partial Mantel test of four discrete subspecies populations does not specifically reveal which, if not all, comparisons within chimpanzees is driving the significance of the test, 608 we fitted separate linear regression functions to all within-and between-subspecies 609 comparisons at the subspecies-group level (Fig. 2c, d). The slope of these functions   Fig. 2a; Supplementary Figure 3b, 4a, b). Therefore, the uniqueness of the     Figure 7b). We also fitted a function for all between-  in which significant barriers (posterior probability >95%) were present (Fig. 3ac). 818 These barriers appeared to be localized (between neighboring sites), and were likely a 819 result of site-level environmental pressures. One such barrier was associated with Mt. An important consideration for this analysis is that it relies on a relativistic approach, 855 thus extreme differences in effective population sizes, e.g., between P.t. verus and ETS, 856 will lead to over-and underestimation of posterior probabilities. In the case of our data, 857 the high m rates in P.t verus cause areas in ETS to appear to have significantly low 858 relative m rates. For this reason, it is necessary to also analyse these populations

871
The loss of the significant barriers when excluding the sites that were associated with 872 them is strong evidence that the discontinuity we observed was driven by localized 873 differentiation. That these sites are genetically distinct from their neighbours, points to 874 local effects and how strongly they influence site-level differentiation, and removing only three outliers from our dataset eliminated all significant signals of discontinuity in these 876 analyses.