Lyme disease is the most prevalent vector-borne disease in North America and continues to spread. The disease was first clinically described in the 1970s in Lyme, Connecticut, but the origins and history of spread of the Lyme disease bacteria, Borrelia burgdorferi sensu stricto (s.s.), are unknown. To explore the evolutionary history of B. burgdorferi in North America, we collected ticks from across the USA and southern Canada from 1984 to 2013 and sequenced the, to our knowledge, largest collection of 146 B. burgdorferi s.s. genomes. Here, we show that B. burgdorferi s.s. has a complex evolutionary history with previously undocumented levels of migration. Diversity is ancient and geographically widespread, well pre-dating the Lyme disease epidemic of the past ~40 years, as well as the Last Glacial Maximum ~20,000 years ago. This means the recent emergence of human Lyme disease probably reflects ecological change—climate change and land use changes over the past century—rather than evolutionary change of the bacterium.
Lyme disease is the most prevalent vector-borne disease in North America and continues to expand across the northern United States and into southern Canada1. Lyme disease was first identified in a bizarre cluster of juvenile arthritis cases in Lyme, Connecticut, in 19762. Quickly, it became clear that cases were not isolated to a single town nor to children and, by 1979, several hundred cases were reported in a cluster circling the Long Island Sound3. Since the 1970s, the disease has rapidly spread from disease foci in the northeast and midwest4, and to a lesser extent, in northern California. Now, more than 30,000 cases of Lyme disease are reported each year and incidence is estimated to be ten times as high5.
The Lyme disease spirochete, Borrelia burgdorferi sensu stricto (s.s.), is maintained in an enzootic transmission cycle, in which the bacteria is transmitted between Ixodes tick vectors and a community of vertebrate reservoir hosts6. Humans are incidental hosts; we are susceptible to infection but do not contribute significantly to transmission or spread. Despite its epidemiological importance, little is known about the evolutionary history of B. burgdorferi in North America. This limits our ability to track and predict the direction of ongoing spread and implement public health interventions.
Pathogen genomes can reveal epidemic histories. Phylogeography provides a powerful framework for inferring pathogen evolutionary dynamics including probable epidemic origins, rates and patterns of spread, and the distribution of highly virulent clades7,8,9,10. However, the evolutionary history of B. burgdorferi has not been explored using high-resolution molecular markers such as whole genomes sampled from a broad spatial area. Sampling and sequencing B. burgdorferi genomes directly from ticks is challenging; bacterial DNA is swamped by tick and environmental DNA11, and culturing the bacteria introduces biases12. Only 15 genomes of B. burgdorferi s.s. existed previously (Supplementary Table 1) and previous phylogeographic study has relied on single locus13,14 or multi-locus sequence typing (MLST) markers15,16,17.
Here, we use a population genomics approach to investigate outstanding questions about the processes shaping B. burgdorferi variation, patterns of gene flow across North America and the timescale of bacterial evolution. (1) How is B. burgdorferi variation generated and maintained? Previous study of 13 B. burgdorferi s. s. genomes found frequent small-scale recombination18,19, but the relative importance of mutation to recombination in shaping population-level diversity is unknown. (2) Where did North American B. burgdorferi originate and how has the bacterium spread? A previous MLST-based study suggests that B. burgdorferi was historically introduced to the midwest from the northeast and the two regions are now isolated15. However, rates of gene flow of the bacteria between these two Lyme endemic regions and the relative role of different vertebrate hosts in bacterial dispersal is unknown. (3) Finally, how old is B. burgdorferi diversity? Museum specimens reveal the bacteria was present on Cape Cod, Massachusetts, in the 1890s20 and Long Island, New York, in the 1940s21, long before clinical recognition of the disease. Phylogeography suggests a deeper evolutionary history—ticks and B. burgdorferi may have survived the Last Glacial Maximum in the southern United States and spread across North America after glacial retreat, ~20,000 years ago13,15. Reconstructing the timescale of diversification will indicate if B. burgdorferi variation holds a signature of recent emergence of human Lyme disease in the past 40 years or if diversity reflects more ancient processes.
To address these three epidemiologically relevant aspects of the evolutionary history of B. burgdorferi in North America, we sampled and sequenced 146 B. burgdorferi genomes directly from tick vectors from across the northeastern and midwestern United States and from southern Canada, including samples from 1984. With this genomic collection and all previously published B. burgdorferi genomes, including samples from Europe and the western USA (California), we reconstructed the evolutionary history of the Lyme disease bacteria in North America.
Field sampling and variation
We collected 146 B. burgdorferi-infected nymphal Ixodes scapularis ticks from across Lyme disease endemic areas in the northeast and midwest United States and southern Canada, spanning 30 years (1984–2013) and representing the widest spatial and temporal genomic collection of B. burgdorferi yet published (Fig. 1; Supplementary Information). We used a hybrid capture method we previously developed11 to enrich for and efficiently sequence B. burgdorferi-derived reads from a mixed DNA template (~73% capture efficiency), allowing us, for the first time, to screen B. burgdorferi directly from ticks. Our genomic collection allowed us to simultaneously study DNA from B. burgdorferi, the tick vector, and a co-vectored parasite, Babesia microti. To extend the geographic range of samples, we added all 15 published B. burgdorferi s.s. genomes (Supplementary Table 1), including two samples from Europe and three from the western USA (California), and used Borrelia finlandensis22 as an outgroup.
We identified single nucleotide polymorphisms (SNPs) along the B. burgdorferi linear chromosome (910,724 base pairs (bp)) and the two best-characterized and most conserved plasmids23, cp26 (26,498 bp) and lp54 (53,657 bp) by mapping short reads to the B31 reference genome24 (variant filters in Methods). We identified 18,740, 959 and 1,034 SNPs distributed along the chromosome, plasmid cp36 and plasmid lp54, respectively (Methods).
Both de novo mutation and recombination contribute to B. burgdorferi variation18. Early studies described B. burgdorferi as ‘clonal’14, while recent studies identified small-scale recombination18,25. However, the contribution of recombination to B. burgdorferi genomic variation has not been examined in a large population sample. We evaluated the contribution of both processes to simultaneously infer recombinant tracts and the underlying maximum likelihood phylogeny of the B. burgdorferi chromosome and plasmids cp26 and lp54 (Supplementary Fig. 1).
Recombination is frequent along the chromosome (763 recombination events) and plasmids cp26 (68 recombination events) and lp54 (53 recombination events) (Supplementary Fig. 1), and it has shaped the B. burgdorferi genome over its evolutionary history. For each branch of the chromosomal and plasmid trees, we determined the ratio of SNPs introduced via recombination to the number introduced via point mutations, ur/m, as well as the total number of recombination events per branch. The chromosome had a mean ur/m of 1.55 (standard deviation = 3.83) and had a mean of 2.63 recombination events per branch. Despite the recombination detected, 93.18% (846,801 of 910,724 bp) of the chromosome lies within a clonal frame, outside predicted recombinant regions (Supplementary Fig. 1a). The relative frequency of recombination to point mutation is lower on the plasmids than on the chromosome: ur/m of 0.70 (plasmid cp26) and 0.259 (plasmid lp54) (Supplementary Fig. 1b,c).
Recombinant hotspots on the plasmids contain several known antigens. A recombination peak exists at the ospC (outer surface protein C) gene on plasmid cp26 (Supplementary Fig. 1b). ospC is a major B. burgdorferi antigen required for transmission from tick to vertebrate host and dissemination within vertebrates26. ospC defines B. burgdorferi serotype: many ospC serotypes (alleles) circulate and vertebrates develop type-specific immunity protecting them from re-infection with the same ospC serotype27. ospC serotypes vary in virulence in humans: ‘disseminating’ serotypes are associated with spread from the skin (the initial site of infection) through the bloodstream, causing more severe disease, while non-disseminating serotypes may remain as localized skin infections28. Recombinant tracts vary in length: some are tightly centred on ospC (632 bp), while others are up to 5,000 bp long, including the neighbouring guaA (GMP synthase) and guaB (dehydrogenase) genes essential for infectivity in vertebrate hosts29. A recombination hotspot on plasmid cp26 also includes resT, a gene involved in recombination (Supplementary Fig. 1b).
Two sharp recombination peaks on plasmid lp54 correspond to two known B. burgdorferi antigens (Supplementary Fig. 1c): dbpA, an adhesin that enables B. burgdorferi dissemination in vertebrate hosts30; and BB_A05, an antigen involved in transmission from ticks to mammals12. Recombinant tracts in dbpA also vary in size from 70 to ~8,000 bp and frequently include the neighbouring dbpB, another adhesin. A second recombination hotspot includes two other antigens that elicit human antibody responses: S1 (BB_A05), highly expressed in feeding nymphal ticks during transmission to mammals31; and S2 (BbuZS7_A03).
To investigate B. burgdorferi population structure, we built a maximum likelihood phylogeny of the 16,370 SNP recombination-free alignment. The chromosomal phylogeny reveals a high degree of B. burgdorferi population structure divided into four major clades, which do not always reflect geography (Fig. 2a). Samples from the northeast and midwest are not monophyletic, in contrast to previous MLST-based analysis suggesting the two regions had a shared past and are now isolated15. Similarly, southern samples are not monophyletic and are distributed across the tree.
To explore the population structure of B. burgdorferi on a broader spatial scale, we built a maximum likelihood phylogeny including all 15 previously published B. burgdorferi s.s. genomes in addition to the 146 samples collected for this study (Supplementary Fig. 2). The three samples from the western USA (California) fall in two distinct clades, indicating that western B. burgdorferi are not genetically isolated from midwestern and northeastern samples (Supplementary Fig. 2). Although B. burgdorferi is maintained in a different ecological cycle in the western USA, where it primarily cycles between Ixodes pacificus ticks and wood rats32, the bacteria do not seem to be genetically differentiated from other North American samples. The two European B. burgdorferi s.s. genomes are closely related to each other and monophyletic (Supplementary Fig. 2, pink tips), nesting in a clade that contains midwestern samples. Previous MLST-based analysis suggested a European origin of all B. burgdorferi16. Although we cannot formally test this hypothesis because of the limited European sampling for this study, the placement of the two European samples within a clade including only midwestern samples suggests a more complex pattern of genetic differentiation than the one hypothesized with MLST-based analysis.
B. burgdorferi chromosomal and plasmid phylogenies are broadly similar, though there is evidence of several historic plasmid exchanges between lineages (Supplementary Fig. 3).
To assess whether clades vary in virulence, we investigated their ospC serotype composition by serotyping samples in silico. ospC serotypes26 cluster imperfectly within the B. burgdorferi tree, reflecting recombination discussed above (Fig. 2a; Supplementary Fig. 2). Virulent, disseminating ospC serotypes A, B, H, I and K occur within each phylogenetic cluster. However, the clades have different virulence profiles. Disseminating strains make up the majority of samples in clade I (72.1%, 95% confidence interval (CI): 56.1–84.2%) and clade II (72.5%, 95% CI: 55.9–84.9%) but only a minority of clade III (27.8%, 95% CI: 14.8–45.4%) and clade IV (7.7%, 95% CI: 1.34–26.6%) (Fig. 2a). Virulent serotypes are more common in the northeast, where 35.8% (95% CI: 28.6–43.6%) serotyped samples are disseminating compared with 15. 8% (7.91–28.4%) disseminating in the midwest.
Discordant tick and pathogen phylogenies
Although the efficiency of our hybrid capture protocol was high (~70% of reads mapped to B. burgdorferi)11, the majority of remaining sequence data corresponded to the tick, I. scapularis, genome33 (Supplementary Fig. 4).
Coverage of three tick mitochondrial genes (16S, cytochrome oxidase II and the control region; Supplementary Table 2) enabled us to examine potential co-evolution of the bacteria and its vector34 (Supplementary Fig. 5), demonstrating the biological relevance of sequence by-catch, unintentionally captured DNA sequence. We constructed a maximum likelihood tree based on 118 concatenated SNPs identified in three tick mitochondrial genes, including 46 samples with no more than 50% missing data (Supplementary Fig. 6). As observed for B. burgdorferi (Fig. 2a), ticks from the northeast and midwest are not monophyletic (Supplementary Fig. 6), suggesting a recent shared history or frequent migration between these regions, consistent with an earlier, single locus analysis13. However, in contrast to this earlier work13, we find that southern ticks are not monophyletic. Tick and bacterial phylogenies are incongruent (Supplementary Fig. 6).
In addition to capturing B. burgdorferi, we captured sequence derived from B. microti35, a co-emerging tick-borne parasite4. To test for evidence of shared evolutionary history of the two pathogens, we constructed a maximum likelihood tree based on the complete genome of the B. microti apicoplast (a small organelle found in most Apicomplexa parasites), 28.7 kbp, and compared it with the B. burgdorferi phylogeny (Supplementary Fig. 7). Despite our limited sample size of nine co-infected ticks, due to the relatively low prevalence of B. microti, we found incongruence between the two parasite phylogenies.
We used ancestral state reconstruction to explore the geographic origins and gene flow patterns of North American B. burgdorferi (Fig. 2b).
The most recent common ancestor of all sampled B. burgdorferi most likely existed in the northeast (posterior probability, 77.2%), as did the most recent common ancestor of each B. burgdorferi clade (clade I: 78.4%; clade II: 89.3%; clade III: 76.7%; and clade IV: 77.6%; Fig. 2b). Our phylogeny reveals a pattern of gene flow more complex than the previously described unidirectional pattern from northeast to midwest13,15. We uncover a signal of bidirectional gene flow between the three regions surveyed (Supplementary Fig. 9). Although migration rates between regions are not statistically different, there is a clear hierarchy in estimated rates of gene flow between regions. Migration between the northeast and midwest is most frequent, followed by migration between the northeast and south, and finally migration between the south and midwest. The most likely ancestral location for interior nodes within clades I, III and IV (Figs. 2b and 3, nodes labelled ‘MW’) existed in the midwest, revealing historic gene flow from the northeast to the midwest. This means that midwestern B. burgdorferi populations are not recently introduced from the northeast and that genetically distinct B. burgdorferi populations from clades I, II and IV were historically endemic both in the northeast and the midwest (Fig. 2b). Although the northeastern USA is currently the region with the highest density of tick vectors36, highest prevalence of infected ticks36 and the greatest epidemiological burden of Lyme disease37, complex patterns of historic gene flow across North America have shaped current patterns of diversity (Fig. 2b).
Estimating the timing of the North American B. burgdorferi diversification requires knowledge of how fast the bacteria evolves. However, the substitution rate of this bacteria has not been measured directly, and available estimates vary widely from 10−9 to 10−6 substitutions per site per year15,38. In bacteria, where fossils are not available to calibrate node ages, serially sampled tips (that is, samples collected at several time points) may enable estimation of substitution and diversification rates along the rest of the tree. This is possible if sufficient substitutions accumulate over the sampling period, within a measurably evolving population39,40. To maximize our sampling period, we included four ticks from 1984, the oldest available B. burgdorferi DNA in our dataset (Fig. 1b) and the oldest existing B. burgdorferi whole genomes besides the B31 reference strain cultured from a tick in 198224. We simultaneously estimated phylogenies, substitution rates and demography with a Bayesian approach implemented in BEAST41.
The observed substitution rate of 7.49 × 10−6 substitutions per site per year (95% highest posterior density (HPD) 3.32 × 10−6–1.34 × 10−5; Supplementary Table 3) is faster than previously estimated for B. burgdorferi15. This probably reflects the bacterium’s ecology. When B. burgdorferi is transmitted from tick to vertebrate host to tick, its population experiences severe bottlenecks, so that only a fraction of the variants transmit to the next host. Genetic drift in this severely bottlenecked bacterial population may increase the observed substitution rate42, although the within-host B. burgdorferi substitution rate is likely to be much slower.
We estimate the most recent common ancestor of the B. burgdorferi analysed in this study existed ~60,000 years ago (95% HPD 20,000–98,000) (Fig. 3; Supplementary Table 3). B. burgdorferi diversity is ancient and long pre-dates not only the Lyme epidemic of the past ~40 years but also the Last Glacial Maximum ~20,000 years ago13. Current Lyme disease foci are in regions once covered by the Pleistocene ice sheets13. Earlier studies suggest that the south was a refugia for many North American animals during the Last Glacial Maximum13,43 and, after glacial retreat, southern migrants colonized the northeast and midwest13,15. The ancient timescale of B. burgdorferi diversification in North America suggests that this species, in addition to ticks, was endemic in North America before the Pleistocene glaciation and its evolutionary history was shaped by ancient geological events.
Our estimates of substitution rate and tree age have substantial uncertainty (Fig. 3; Supplementary Table 3), as we do not have independent data to calibrate divergence time estimates, and the 30-year sampling interval that separate our samples is orders of magnitude shorter than the estimated tree age. However, even if B. burgdorferi substitution rates are slower than estimated in this study (that is, ~10−7–10−8 substitutions per site per year), the age of the most recent common ancestor would still be quite old, dating to several hundred thousand or a million years old. Thus, our finding that B. burgdorferi diversity is ancient and pre-dates the ongoing Lyme epidemic seems to be robust to uncertainty in the substitution rate estimate.
B. burgdorferi population size expanded dramatically ~20,000 years ago and remained relatively stable thereafter (Supplementary Fig. 10a) according to the skyline plot, a piecewise population model that allows effective population size to vary over time44, estimated in BEAST45. This signal may reflect population expansion from refugial populations in the south, after Pleistocene ice sheets retreated, allowing ticks and vertebrate hosts of B. burgdorferi to repopulate wide regions of northern North America13. This is consistent with previous studies reporting historic population size expansions in B. burgdorferi13,14,15. However, this result must be interpreted with caution because the bacterial population is geographically structured both across space and between hosts, violating the coalescent model assumption of a single, panmitic population, which can potentially bias effective population size estimates.
To complement the skyline plot analyses, we used the mismatch distribution—the distribution of pairwise SNP differences between samples—to test for historic changes in B. burgdorferi population size (Methods). The mismatch distribution was unimodal (Supplementary Fig. 10b) and consistent with past demographic or spatial expansion. The peak of the mismatch distribution (τ = 389) corresponds to the timing of historic population expansion, in units of mutational distance. Separate analyses of the mismatch distribution for samples from the northeast, midwest and south reveal a similar pattern of historic population expansion in each region.
Here, we reconstruct the evolutionary history of the Lyme disease bacteria in North America from the widest yet collection of B. burgdorferi genomes sequenced directly from field-collected ticks. By mining sequence by-catch and co-captured DNA in our genomic collection, we explore patterns of potential co-evolution with the tick vector and another co-infecting human parasite, B. microti.
We find that although B. burgdorferi is largely clonal, both recombination of short genomic tracts along the chromosome and plasmids (Supplementary Fig. 1) and shuffling of entire plasmids (Supplementary Fig. 3) shape genomic diversity. We also found that recombination hotspots include the major B. burgdorferi antigens ospC, dbpA and dbpB (Supplementary Fig. 1b,c). These surface-expressed antigens are probably experiencing balancing selection imposed by vertebrate host immune responses.
The complex phylogenetic structure of B. burgdorferi (Fig. 2a) together with the migration rate analyses (Fig. 2b) suggest previously undocumented levels of gene flow across North America. As Ixodes ticks move little, B. burgdorferi spreads through the movement of its vertebrate hosts, small mammals and birds. In contrast to previous findings of strong barriers to gene flow between the northeast and midwest13,15, we find evidence of long-distance migration events between the three major geographic regions sampled (Fig. 2b), probably due to long-distance, bird-mediated dispersal. Local dispersal by small mammals probably also contributes to gene flow on much smaller spatial scales. One limitation to this study is that phylogenies are inferred from consensus bacterial sequences from each individual tick sample. The consensus B. burgdorferi sequence from a tick with a mixed infection46 often represents the majority strain infecting a tick (that is, the strain comprising the largest proportion of infection). However, the consensus sequence may alternatively represent a chimeric sequence, a combination of segments of the multiple co-infecting haplotypes. With short-read sequence data used here, we are unable to reconstruct multiple bacterial haplotypes within a single sample. Longer-read sequencing will enable sequencing multiple bacterial genomes from individual tick vectors in the future.
We find no support for co-evolution of B. burgdorferi and its tick vector (Supplementary Fig. 6). While captured sequence data enabled us to examine phylogenetic structure at only a few tick genes, we do not expect further analysis to reveal a strong pattern of co-emergence of the bacteria and its vector. In contrast to obligate parasites that co-diverge with their hosts, B. burgdorferi cycles between tick vectors and vertebrate hosts, decoupling tick and bacterial evolutionary histories. Our finding of a lack of association between tick and bacterial evolutionary histories suggests that invasion of ticks and bacteria is not coupled and may reflect distinct ecological processes. This is consistent with observations that B. burgdorferi invasion often lags behind tick invasion. This means that epidemiological surveillance should focus on areas with established I. scapularis tick populations that are potential sites for B. burgdorferi introduction.
Finally, we find B. burgdorferi diversity is ancient (~60,000 years old) (Fig. 3). Although there is substantial uncertainty in estimated divergence times, our phylogeny shows a clear signature of ancient diversification of B. burgdorferi that long pre-dates the Lyme epidemic of the past ~40 years. Our reconstruction of the probable geographic origins of this diversity shows that, historically, it was geographically widespread across the northeast and midwest (Fig. 2b). The recent geographic and population size expansion (Supplementary Fig. 10) of B. burgdorferi reflects the spread of diverse B. burgdorferi already present in each region (Fig. 2a,b). This means that the recent emergence of Lyme disease does not reflect the spread of a single epidemic B. burgdorferi lineage across North America or recent diversification of the bacteria, but rather the spread of pre-existing, geographically widespread bacterial diversity.
Our finding of ancient B. burgdorferi diversification suggests that the recent Lyme disease epidemic does not reflect evolutionary processes but rather was driven by the ecological change in North America beginning in the colonial period ~700 years ago. Deforestation and intensive hunting47 during the colonial period followed by population explosion of white-tailed deer47,48 and climate change49,50 in the past century probably enabled dramatic range expansion of Ixodes spp. ticks49. The spread of ticks into environments with high densities of competent vertebrate hosts substantially widened the potential geographic range of B. burgdorferi and of Lyme disease. Ticks and B. burgdorferi continue to spread into southern Canada and across the United States, putting a greater population at risk of Lyme disease.
Although the evolutionary history of B. burgdorferi occurs at a deeper timescale than the recent Lyme disease epidemic, our phylogeographic analysis provides insights into the epidemiology of Lyme disease. The high levels of B. burgdorferi genomic diversity found across North America and continued long-distance gene flow suggests that humans may be exposed to diverse bacterial lineages regardless of where they are infected. B. burgdorferi strains vary substantially in virulence and we find that disseminating strains associated with more severe disease are found in all regions surveyed and can occur on diverse genomic backgrounds (Fig. 2a). The patterns of migration uncovered here suggest continued spread of diverse B. burgdorferi not only from the northeast across the rest of North America but also in several other directions (Fig. 2b), with gene flow occurring not only locally but also on a continental scale (Supplementary Fig. 2). This finding has important epidemiological consequences as it suggests that wide regions with established tick and vertebrate host populations are potential sites of B. burgdorferi invasion and future Lyme disease.
Bacterial sampling and sequencing
We collected 146 B. burgdorferi-infected nymphal I. scapularis ticks from the widest available spatial and temporal range (including ticks sampled from 1984 to 2013) (Fig. 1; Supplementary Information). Tick sampling, DNA extractions and quantitative PCR testing for B. burgdorferi infection followed described protocols11. Genomic libraries were prepared from infected tick samples. B. burgdorferi DNA was captured using a custom hybridization capture array method1. Sequencing of 75-bp paired end reads was conducted on an Illumina HiSeq 2500 at the Yale Center for Genomic Analysis. Short-read sequence data were submitted to the NCBI Short Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/), SRA accession: SRP058536 (Supplementary Information).
We conducted analyses with samples collected for this study and added to them all 15 previously published B. burgdorferi genomes, representing cultured isolates from ticks, humans and a song sparrow (Supplementary Table 1)22,24,51. We included B. finlandensis as an outgroup22.
Read alignment and SNP detection
We focused analysis on SNPs on the B. burgdorferi linear chromosome (910,724 bp) and the two best-characterized and most conserved plasmids, lp54 (53,657 bp) and cp26 (26,498 bp)52,53. This represents 65% of the total B3124,51 reference genome.
Raw sequence reads for each sample were aligned to the B. burgdorferi reference genome strain B3124,51, using BWA mem (v. 0.7.7)54. Duplicate sequence reads were excluded from downstream analysis, using the Picard Suite (v. 1.117) MarkDuplicates (http://picard.sourceforge.net). Only samples with mean coverage >10 × ere retained for analysis. Variants with respect to strain B3124,51 were identified with GATK HaplotypeCaller55 (ploidy set to 1), generating sample-specific gVCF files. We conducted joint genotyping of samples with GATK GenotypeGVCFs. We excluded indels (insertions and deletions) and SNPs with signals of low mapping or genotyping quality with GATK VariantFiltration, using the following filters recommended by GATK56: quality by depth (QD < 2.0), Fisher strand bias (FS > 60.0), mapping quality (MQ < 40.0), mapping quality rank sum test (MQRankSum < −12.5), the Mann–Whitney rank sum test (ReadPosRankSum < −8.0), low genotype call (GQ < 20) and strand odds ratio test (SOR > 4.0), and set filtered genotypes to no call (GATK option --setFilteredGtToNocall).
To create an alignment of consensus sequences from the variant file, we used the GATK tool FastaAlternateReferenceMaker. This resulted in multiple alignments of consensus sequences of the chromosome (910,724 bp), plasmid cp26 (26,498 bp) and plasmid lp54 (53,657 bp).
It is critical to account for recombination when reconstructing bacterial evolutionary histories because horizontal gene transfer can overwhelm the evolutionary signal from vertically inherited mutations and skew inferred phylogenies57. Therefore, we used GUBBINS58 to infer recombinant tracts along the chromosome in addition to plasmids cp26 and lp54. We used RAxML59 to construct maximum likelihood phylogenetic trees from the 16,370 bp recombination-free SNP alignment of all 146 isolates. We used the generalized time-reversible model and a gamma distribution to model site-specific rate variation (the GTR + Γ substitution model) and corrected for using only variant sites in the tree, using the Lewis ascertainment bias correction60. Support for each branch was assessed using 100 bootstrapped samples of the multiple alignment. Recombination hotspots and RAxML phylogenies are visualized in Supplementary Fig. 1 with Phandango (http://jameshadfield.github.io/phandango/).
To infer rates of gene flow between geographic regions, we performed discrete ancestral state reconstruction61,62, coding the region of origin of tick samples as a discrete trait. We fitted a Markov model of character state evolution62 to the underlying BEAST phylogeny with the diversitree package61 in R and found that posterior probability distributions of migration rates between regions overlapped significantly (Supplementary Fig. 8). Therefore, we fitted an equal rates model that considered a single long-distance migration rate between regions in phytools and simulated 1,000 stochastic character maps, or reconstructions of ancestral geographic location, on our phylogeny with the phytools63 function make.simmap (Fig. 2b).
Sequence by-catch and co-capture
Although our hybrid capture approach was efficient and the majority of sequence reads corresponded to B. burgdorferi, ~30% of reads did not map to B. burgdorferi and constitute metagenomic data from our tick samples. We tested whether by-catch—unintentionally captured sequence data—corresponded to the tick vector by mapping sequence reads to the tick, I. scapularis, genome33 with BWA mem (v. 0.7.7)54, as described above. To investigate whether by-catch sequence contained sufficient coverage of I. scapularis genes for phylogenetic inference, we mapped reads to ten tick genes previously used for population genetic study (Supplementary Table 2). Genbank accession numbers of the I. scapularis haplotypes used as references for mapping are in Supplementary Table 2.
Three tick mitochondrial genes (16S, cytochrome oxidase II and the control region) had a mean coverage > 1.5 × (Supplementary Fig. 5, Supplementary Table 2), enabling us to use them in phylogenetic analyses to examine potential co-evolution of the bacteria and its vector34. We called variants on the three mitochondrial genes and used RAxML to infer a maximum likelihood tree, using the methods described above and including all tick samples with <40% missing data.
As we simultaneously co-captured both B. burgdorferi and the co-vectored pathogen B. microti35, we were able to explore patterns of potential co-evolution of the two co-vectored pathogens. Read mapping, variant calling and tree building of the B. microti apicoplast, a small organelle found in most Apicomplexa parasites (28.7 kbp), was conducted as previously described35.
We tested for evidence of co-evolution of B. burgdorferi and both I. scapularis and B. microti with a global test of co-evolution implemented in ParaFit66. The two null hypotheses that B. burgdorferi and its tick vector and that B. burgdorferi and B. microti have independent evolutionary histories were tested by permuting the observed associations of parasite–vector or parasite–parasite 999 times.
Phylogeographic and demographic analysis
We used a coalescent approach implemented in BEAST 241,45 to simultaneously explore the phylogenetic and demographic history of sampled B. burgdorferi. Recombination-free SNP alignments were used to fit four alternative models including either a strict molecular clock or a relaxed (uncorrelated log-normal) molecular clock67 (allowing substitution rates to vary across tree branches) and a constant population size or a Bayesian skyline model of changing population size. We used dates of tick collection to calibrate tips. We used an ascertainment bias correction to account for the invariant sites not included in the SNP alignment. All phylogeographic models used the HKY substitution model, allowing for gamma-distributed rate variation across sites.
For each model, we ran chains of 100 million iterations or until convergence. Chains were thinned by sampling every 1,000 iterations and 10% of each chain was discarded as burn-in. Maximum clade credibility (MCC) trees were generated using TreeAnnotator41,45. Tracer v. 1.6 was used to assess convergence visually and confirm effective sample sizes were greater than 200 for each parameter. To test the influence of prior distributions specified for evolutionary and demographic model parameters, we reran BEAST in triplicate for each model with no input alignment, sampling only from prior distributions and compared results with the posterior parameter estimates of fitted models (Supplementary Fig. 11).
To identify the best-fitting model, we used path sampling68 to estimate the marginal likelihood of each model. We compared alternative models with Bayes factors and identified the relaxed log-normal clock, skyline population size as the best-fitting model for the recombination-free chromosomal SNP alignment (Bayes factors > 80; Supplementary Table 3), used in all above analyses. All four coalescent models estimated highly congruent tree topologies and differed only in placement of deep branches, expected because of the high degree of uncertainty at deeper timescales within the B. burgdorferi tree.
To speed computational time, we performed BEAST analyses on randomly sampled 5,000-bp subsets of the full 16,370-bp SNP alignment. We resampled and reran analyses ten times for the best-fitting model and compared MCC trees to confirm that subsampling the alignment did not distort phylogenetic inference. The MCC tree depicted in Figs. 2b and 3 was inferred from a single 5,000-bp subset of the SNP alignment.
To investigate B. burgdorferi demographic history, we complemented the coalescent-based approach with mismatch distribution analysis, as implemented in DnaSP v569, and tested if our observed data were consistent with a mismatch distribution expected from a sudden population expansion or a spatial expansion70,71. The mismatch distribution and expected mismatch distribution for a stable population was visualized in the pegas package72 in R.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This study was supported by the National Institutes of Health (NIH) Ecology and Evolution of Infectious Disease Program grant R01 GM105246 and by NIH grant R21AI112938. K.S.W. was supported by the NIH Ruth L. Kirschstein National Research Service Award (F31 AI118233-01A1) and the NSF Doctoral Dissertation Improvement Grant (DEB-1401143). G.C. was supported by the Gaylord Donnelley Postdoctoral Environmental Fellowship (the Yale Institute for Biospheric Studies). We thank N. Ogden, J. Brinkerhoff, S. Paskewitz, D. Fish and S. Bent for providing tick samples; P. Flynn and J. Underwood for laboratory work; C. Ben Mamoun and P. Krause for discussions; and R. Bjornson and the Yale High Performance Computing Center for computational support.