# Inferring bacterial recombination rates from large-scale sequencing datasets

## Abstract

We present a robust, computationally efficient method (https://github.com/kussell-lab/mcorr) for inferring the parameters of homologous recombination in bacteria, which can be applied in diverse datasets, from whole-genome sequencing to metagenomic shotgun sequencing data. Using correlation profiles of synonymous substitutions, we determine recombination rates and diversity levels of the shared gene pool that has contributed to a given sample. We validated the recombination parameters using data from laboratory experiments. We determined the recombination parameters for a wide range of bacterial species, and inferred the distribution of shared gene pools for global Helicobacter pylori isolates. Using metagenomics data of the infant gut microbiome, we measured the recombination parameters of multidrug-resistant Escherichia coli ST131. Lastly, we analyzed ancient samples of bacterial DNA from the Copper Age ‘Iceman’ mummy and from 14th century victims of the Black Death, obtaining measurements of bacterial recombination rates and gene pool diversity of earlier eras.

## Main

Recombination plays a fundamental role in the evolution of bacterial genomes1,2,3,4,5, and has been implicated in the acquisition and spread of antibiotic resistance and pathogenic determinants6,7. While a variety of methods have been developed to estimate the basic parameters of bacterial recombination from sequence data8,9,10,11,12,13,14,15, these approaches vary in their ability to employ the massive sequencing datasets that are routinely acquired. Existing methods often rely on phylogenetic reconstruction, that is, the ability to determine the genealogical relationships of the sampled sequences; however, the precise phylogeny is often difficult to determine, such as when using metagenomics data. Several approaches involve Bayesian simulations, which are computationally demanding and expensive, and become intractable as the number of sequences grows. More generally, existing methods require reliable assemblies that span a large portion of the bacterial genome and cannot be applied directly to raw reads.

We present a method that addresses these problems and enables the inference of homologous recombination rates from metagenomics and ancient DNA samples. We demonstrate its reliability using data from laboratory evolution experiments. We establish that the approach reliably infers recombination rates in global samples from a diverse collection of bacterial species. We determine recombination rates using metagenomics data from the infant gut microbiome, as well as from ancient DNA samples of H. pylori from the Copper Age ‘Iceman’ mummy16 and Yersinia pestis infections of the Middle Ages17. Our methodology reveals that, via the process of recombination, a single ancient DNA sample can provide a snapshot of the global gene pool diversity of specific bacterial species existing at different periods in the past.

We model a population of genomes in which mutations and homologous recombination events occur with rates μ and γ per base pair (bp) per generation, respectively. Each recombination event transfers a homologous fragment of size f bp from a donor genome to a recipient genome, where f is a random variable. To infer these parameters from a sample of sequences, we consider each pair of homologous sequences, and determine its substitution profile $$\sigma _i$$, a binary variable with value 0 for identity or 1 for a difference at locus position i. Conditional on observing a difference at a randomly chosen site i, we compute the probability, P, of observing a difference at the site i + l, or, explicitly, $$P\left( l \right) \equiv {\mathrm{Prob}}\left( {\sigma _{i + l} = 1|\sigma _i = 1} \right)$$. The function $$P\left( l \right)$$ measures the degree to which any two loci l bp apart have correlated substitutions, and we refer to it as the ‘correlation profile’. In the absence of recombination, $$P\left( l \right)$$ will be constant, while the presence of recombination causes $$P\left( l \right)$$ to exhibit a monotonic decay as a function of l, with the decay rate becoming more pronounced as γ increases18.
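As a minimal illustration of the definition above (a sketch, not the mcorr implementation), the following Python code builds the substitution profile for one pair of aligned sequences and estimates $$P\left( l \right)$$ as the fraction of differing sites i for which site i + l also differs. The function names and the toy sequences are hypothetical, and for simplicity every site is used rather than only synonymous third-position sites.

```python
def substitution_profile(seq_a, seq_b):
    """Return sigma_i: 1 where the two aligned sequences differ, else 0."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [int(a != b) for a, b in zip(seq_a, seq_b)]


def correlation_profile(sigma, l_max):
    """Estimate P(l) = Prob(sigma_{i+l} = 1 | sigma_i = 1) for l = 1..l_max.

    Returns a dict mapping l to the estimate, or to None when no site i
    with sigma_i = 1 has a partner site i + l inside the sequence.
    """
    profile = {}
    for l in range(1, l_max + 1):
        conditioned = 0  # sites i with sigma_i = 1 and i + l in range
        joint = 0        # of those, sites where sigma_{i+l} = 1 as well
        for i in range(len(sigma) - l):
            if sigma[i] == 1:
                conditioned += 1
                joint += sigma[i + l]
        profile[l] = joint / conditioned if conditioned else None
    return profile


# Toy pair of aligned sequences (illustrative only).
sigma = substitution_profile("AACGTTAGCA", "ATCGATAGGA")
print(sigma)
print(correlation_profile(sigma, 3))
```

In a real sample, the estimate would be averaged over many sequence pairs and genes before any fitting is attempted; a single short pair such as this one is far too noisy to interpret.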

We apply the analysis to coding regions, and specifically to synonymous substitutions at third-position sites of codons, which minimizes the effect of selection on correlation profiles. We previously found that the functional form of $$P\left( l \right)$$ can be analytically calculated in two special cases: (1) for sequences sampled randomly from the entire population of bacterial species, and (2) for a sample of closely related sequences consisting of a single tight clade, serotype, or other well-defined subspecies18. We showed, using whole-genome datasets of closely related strains of a single serotype, separately in E. coli and Streptococcus pneumoniae, that measured correlation profiles were consistent with the predicted functional form. Here we have substantially extended the theory to determine $$P\left( l \right)$$ for samples with any degree of divergence, and we developed a computational tool to compute $$P\left( l \right)$$ using raw reads without assembly. This enabled us to validate and test the approach on a wide variety of samples, yielding a reliable, general method that is broadly applicable in microbial ecology, evolution, and epidemiology.

## Results

### Theory

We use a coalescent-based model with a focal population (the sample) that incorporates DNA fragments by homologous recombination from a large and potentially more diverse bulk population (the pool). The method measures $$P\left( l \right)$$ using synonymous substitutions in a given sample of sequences. By fitting the data to the theoretically predicted form, we determine a set of three combined parameters that characterize the bulk gene pool: the mutational divergence, $$\theta _{\mathrm{pool}} \equiv 2\bar T\mu$$; the recombinational divergence, $$\phi _{\mathrm{pool}} \equiv 2\bar T\gamma$$; and the mean fragment size, $$\bar f$$, where $$\bar T$$ is the mean pairwise coalescence time across all loci in the bulk pool (Supplementary Note 1). $$\theta _{\mathrm{pool}}$$ is the mean number of mutations per locus since divergence of a pair of homologous sites in the pool, and it determines the pool’s pairwise diversity, $$d_{\mathrm{pool}}$$, the probability that two sequences differ at any locus. The ratio $$\phi _{\mathrm{pool}}/\theta _{\mathrm{pool}} = \gamma /\mu$$ yields the relative rate of recombination to mutation. We express the sample’s diversity, $$d_{\mathrm{sample}}$$, as a mixture of the pool diversity brought into the sample by recombination and the clonal diversity, $$d_{\mathrm{clonal}}$$, which accumulated through mutations since the last common ancestor (LCA) of the sample (Supplementary Fig. 1), or

$$d_{\mathrm{sample}} = c\;d_{\mathrm{pool}} + (1 - c)d_{\mathrm{clonal}}$$

where c is the fraction of the sample diversity that results from recombination events. When c is 0, the sample has evolved clonally since its LCA, and the sample’s diversity is a simple function of its age. As c approaches 1, the sample has recombined nearly all of its DNA with the bulk gene pool, and its diversity becomes independent of its age and reflects the pool’s diversity, $$d_{\mathrm{pool}}$$. We refer to c as the ‘recombination coverage’ of the sample, since it indicates the proportion of sites in the genome whose diversity has come from outside the sample. In the Supplementary Note, we derive the relations that allow c and $$d_{\mathrm{clonal}}$$ to be calculated directly for each sample after fitting the correlation profile.
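Rearranging the mixture relation makes the meaning of c explicit. The expression below is only an algebraic restatement of the equation above, valid when $$d_{\mathrm{pool}} \ne d_{\mathrm{clonal}}$$; it is not the estimation procedure itself, which relies on the relations derived in the Supplementary Note.

$$c = \frac{d_{\mathrm{sample}} - d_{\mathrm{clonal}}}{d_{\mathrm{pool}} - d_{\mathrm{clonal}}}$$

As a purely hypothetical numerical illustration (values not taken from any dataset in this study), $$d_{\mathrm{sample}} = 0.005$$, $$d_{\mathrm{pool}} = 0.09$$, and $$d_{\mathrm{clonal}} = 0.001$$ would give $$c = 0.004/0.089 \approx 0.045$$, that is, a recombination coverage of roughly 4.5%.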

Fitting a sample’s correlation profile yields quantitative information on (1) the overall gene pool diversity, $$d_{\mathrm{pool}}$$, accessed by the sample through recombination over its evolutionary history; (2) the relative rate of recombination in the bulk population, $$\phi _{\mathrm{pool}}/\theta _{\mathrm{pool}}$$; (3) the sample’s recombination coverage, c; and (4) the amount of diversity accumulated by clonal evolution within the sample, $$d_{\mathrm{clonal}}$$, which provides information about the sample’s age. Our model of the gene pool is flexible in that different samples may have access to different subsets of a global gene pool, and at various genetic distances. The method infers for each sample the diversity of the gene pool with which it has recombined. The model makes no assumptions on the mechanism(s) of recombination—for example, the distribution of fragment sizes or the nature of recombination barriers. The inferred parameters characterize the set of recombination events that have occurred in the history of each sample.

### Analysis of laboratory transformation experiments

We tested the parameter inference using two laboratory transformation experiments in which the true parameter values are known: H. pylori strain 26695 was exposed to 28–52 cycles of growth and transformation with genomic DNA extracted from strain J99-R3, and 20 clones from the final population were chosen randomly for whole-genome sequencing19; and S. pneumoniae strains TIGR4 and ATCC200669 were used as donor and recipient, respectively, for a single round of transformation, and 84 clones were fully sequenced20. The sample diversity within the evolved clones, measured by synonymous substitutions, was $$d_{\mathrm{sample}} = 5{\times}10^{ - 3}$$ (H. pylori) and $$d_{\mathrm{sample}} = 7{\times}10^{ - 5}$$ (S. pneumoniae). The correlation profiles $$P\left( l \right)$$ calculated from whole-genome sequences of the evolved strains exhibited decaying correlations, indicating the presence of recombination, and the model fits yielded parameter values that could be compared with direct measurements (Fig. 1a, Supplementary Fig. 2a and Supplementary Tables 1 and 2). Remarkably, the form of the correlations measured in the evolved strains was nearly identical to that measured between donor and recipient genomes (Fig. 1b and Supplementary Fig. 2b), indicating that the properties of a shared gene pool can be accurately inferred from a single sample. The pool diversity inferred from the sample of evolved clones, $$d_{\mathrm{pool}} = 0.087$$ (H. pylori), 0.076 (S. pneumoniae), was substantially higher than the diversity of each sample, and closely matched the diversity inferred from comparison of donor and recipient strains, $$d_{\mathrm{pool}} = 0.089$$ (H. pylori), 0.085 (S. pneumoniae). The inferred mean fragment sizes and recombination coverage were within the range of values found in the original studies (Supplementary Tables 1 and 2). Given that the mean number of recombination events per clone was 41.3 (H. pylori)19 and 2.3 (S. pneumoniae)20, these results show that even a small number of recombination events provides informative correlation profiles. When applying the method to a single pair of sequences, or in very small samples, inferred $$\bar f$$ can vary more substantially owing to the small number of recombined fragments since the LCA.

### Analysis of global bacterial strain collections

Next, we analyzed collections of global strains in different bacterial species, starting with a collection of 401 H. pylori isolates that has been shown to consist of 12 geographical clusters by chromosome painting, a method designed for grouping genomes whose phylogeny has been substantially fragmented by recombination21. Grouping all strains within each cluster to form a sample, we inferred values of $$\theta _{\mathrm{pool}}$$ = 0.045–0.086 and $$\phi _{\mathrm{pool}}$$ = 0.19–0.34 across the clusters (Fig. 2a, Supplementary Fig. 3, and Supplementary Table 3). We attempted to re-cluster the isolates as follows. For each pair of isolates, we inferred $$\theta _{\mathrm{pool}}$$ by fitting the pair’s correlation profile, and we used $$\theta _{\mathrm{pool}}$$ as a distance metric for hierarchical clustering of all isolates. Since $$\theta _{\mathrm{pool}}$$ measures the combined diversity of the gene pools accessible by recombination for a pair of isolates, our clustering groups isolates by the similarity of their respective gene pools. The clusters we obtained (Fig. 2b) were nearly identical to the clusters determined in the original study21, indicating that distinct geographical gene pools exist in the H. pylori global population, and their existence can be inferred entirely on the basis of information present in correlation profiles.
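The re-clustering step can be sketched as follows. The text above does not state which linkage criterion was used, so average linkage is an assumption here, as are the function name and the four-isolate $$\theta _{\mathrm{pool}}$$ values, which are invented for illustration. The sketch uses only the Python standard library and is meant to show the idea of treating pairwise $$\theta _{\mathrm{pool}}$$ as a distance, not to reproduce the published analysis.

```python
from itertools import combinations


def average_linkage(names, dist, n_clusters):
    """Agglomerative clustering with average linkage.

    names: list of isolate labels.
    dist: dict mapping frozenset({a, b}) to the pairwise distance
          (here, a theta_pool value inferred for that pair).
    n_clusters: stop merging once this many clusters remain.
    """
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean distance over all cross-cluster pairs of isolates.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical theta_pool values for four isolates (illustrative only).
theta = {
    frozenset(("A", "B")): 0.045,
    frozenset(("C", "D")): 0.050,
    frozenset(("A", "C")): 0.085,
    frozenset(("A", "D")): 0.086,
    frozenset(("B", "C")): 0.084,
    frozenset(("B", "D")): 0.083,
}
groups = average_linkage(["A", "B", "C", "D"], theta, n_clusters=2)
print(sorted(groups))
```

Pairs that draw on the same gene pool yield small $$\theta _{\mathrm{pool}}$$ and merge early, whereas pairs spanning two pools merge last. For hundreds of isolates, an established clustering library would be a more practical choice than this quadratic-per-merge loop.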

Correlation profiles of different species were well fit by the predicted functional form (Fig. 2c and Supplementary Fig. 4), and most species exhibited decaying profiles indicating the presence of recombination, with relative rates of recombination to mutation $$\phi _{\mathrm{pool}}/\theta _{\mathrm{pool}}$$ in the range 1.0–13 (Fig. 2c and Supplementary Table 3). One notable case is the collection of 5,310 Mycobacterium tuberculosis genomes, which has a sample diversity of $$d_{\mathrm{sample}} = 0.0003$$ and exhibits a flat correlation profile, indicating an absence of homologous recombination consistent with previous findings showing that recombination is either extremely rare or absent in this pathogenic species22. We note that the flat correlation profile is not the result of insufficient diversity in the samples, as the Y. pestis strains have similarly low sample diversity and exhibit a decaying correlation profile. To further confirm this, we ranked genes in each dataset by their synonymous diversity, a simple measure that is often used to detect recombined regions. We found that the top 3% of genes separately in M. tuberculosis and Y. pestis had nearly identical ranges of synonymous diversity (~0.001–0.01). In Y. pestis, these genes accounted for nearly all of the measured correlation decay, consistent with the inferred recombination coverage c = 3.3%, while in M. tuberculosis their correlation profiles were flat, supporting an absence of recombination (Supplementary Fig. 5).

### Recombination within a single host

To assess performance under different population structures, we analyzed a collection of 18 H. pylori isolates obtained from a single Chinese patient23, where isolates have been shown to group into two distinct clades. Diversity within clades ($$d_{\mathrm{sample}}$$ = 0.0010, 0.0025) was much lower than between clades ($$d_{\mathrm{sample}}$$ = 0.033). Importantly, between-clade recombination events would involve homologous fragments with a mean genetic distance of 0.033, which is comparable to the global diversity within the East Asia H. pylori cluster (Supplementary Table 3). Using correlation profiles computed within a clade or between clades, we found that the inferred $$\theta _{\mathrm{pool}}$$ values of each clade (0.040 and 0.045) were similar to $$\theta _{\mathrm{pool}}$$ inferred between clades (0.049); within-clade and between-clade $$\phi _{\mathrm{pool}}$$ values were similarly consistent (Supplementary Fig. 6 and Supplementary Table 4). Recombination coverage was low within clades (c = 3.9%, 8.6%), indicating that only a small number of recombination events have occurred since the LCA of each clade, and was high between clades (c = 79%), reflecting the long coalescence times of distinct global isolates. These findings support a model of within-host recombination occurring between two divergent clades. Disregarding the clade structure (that is, computing correlation profiles across all isolates), we inferred $$\theta _{\mathrm{pool}}$$ = 0.049, which was very similar to the value of the East Asia H. pylori cluster ($$\theta _{\mathrm{pool}}$$ = 0.052). Thus, remarkably, when multiple clades exist within a single host, our method enables reliable inference of gene pool parameters even in the absence of specific information about the clade structure itself. We obtained similar results in the analysis of four pairs of H. pylori strains isolated from four Colombian subjects at time points spanning either 3 or 16 years24 (Supplementary Fig. 7 and Supplementary Table 5). Despite the small sample size of two isolates per host, which exhibited pairwise diversity as low as 0.00036, we inferred values of $$\theta _{\mathrm{pool}}$$ = 0.059–0.080, comparable to the inferred gene pool diversity of the global hspEuropeColombia cluster ($$\theta _{\mathrm{pool}}$$ = 0.081), and reflecting the range of gene pool diversity across the New World H. pylori strains (Supplementary Table 3). These results are compelling evidence that correlation profiles measured in local samples can provide information about the global gene pool diversity of a bacterial species.

### Measuring correlation profiles using raw reads

We extended the analysis to use raw reads from next-generation sequencing directly, without requiring or performing genome assembly. We simulated raw reads25 for the 20 evolved strains in the H. pylori transformation experiment. Simulated reads had a sequencing error profile modeling that of the platform, and we artificially increased the error rate to determine its effect on the analysis. We mapped the simulated reads onto the reference genome (H. pylori 26695; Methods), computed the substitution profiles for each pair of overlapping reads, and, using these, obtained the correlation profile of the sample (Fig. 1c,d). In all of the raw read analyses, the reference genome is not used to compute the substitution profiles, and is used only for read alignment. Using raw read correlation profiles, we inferred recombination parameter values similar to those obtained from genome assemblies (Fig. 1e and Supplementary Table 1). The results were relatively insensitive to substantial increases in sequencing error, with $$\theta _{\mathrm{pool}}$$ remaining within 10% of its expected value even for a fivefold increase in error rate (Supplementary Fig. 8). This robustness results from the largely uncorrelated nature of sequencing errors, which artificially inflate the sample diversity but contribute little to the correlation profiles. Using the raw read data reported for the S. pneumoniae transformation experiments20, we inferred values for $$\theta _{\mathrm{pool}}$$ and $$\phi _{\mathrm{pool}}$$ similar to those from the assembly-based analysis (Supplementary Fig. 2e). Errors that occur during read alignment can affect the inference of $$\bar f$$, which is sensitive to the shape of the tail of the correlation profile; therefore, it is more reliable to use assemblies to infer this parameter. Using raw read data obtained from a single colony clone26, a case in which recombination has not affected the sample since its LCA (that is, the single cell from which the colony was initialized), we measured a flat correlation profile as expected (Supplementary Fig. 9a). The small but non-zero value of correlations measured, which corresponds to correlated sequencing errors, was of the same magnitude as the fitting residuals in epidemiological samples (Supplementary Fig. 9b), indicating that correlated errors are negligible relative to the signal of recombination in correlation profiles.

### Metagenomics applications: gut microbiome and ancient DNA samples

We next analyzed gut microbiome samples collected from four infants that had been found to be dominated by E. coli ST131 strains27. After read filtering and mapping (Methods), we obtained high-quality E. coli reads for each sample ($$d_{\mathrm{sample}}$$ = 0.0007–0.0013 across samples). From the raw read correlation profiles, we inferred values of $$\theta _{\mathrm{pool}}$$ = 0.079–0.12 and $$\phi _{\mathrm{pool}}$$ = 0.031–0.076, similar to the ranges inferred from global E. coli ST131 isolates (Fig. 3a and Supplementary Table 6). Samples collected from the same infant at different time points had similar correlation profiles and inferred parameter values (Fig. 3a and Supplementary Fig. 10).

Finally, we performed raw read correlation profile analysis to infer recombination parameters from samples of ancient DNA. The 5,300-year-old Iceman mummy, discovered in the Ötztal Alps, was recently subjected to metagenomic analysis16, in which H. pylori DNA was sequenced after target capture with specific probes. The presence of decaying correlation profiles in this single-host, ancient sample is strongly indicative of it having recombined over its history, similarly to the Chinese and Colombian host samples discussed above. From the measured correlation profiles, we inferred values of $$\theta _{\mathrm{pool}}$$ = 0.039–0.045 and $$\phi _{\mathrm{pool}}$$ = 0.10–0.12 for the ancient gene pools, which were very similar to the values inferred from the Chinese host samples (Fig. 3b and Supplementary Tables 4 and 7). We analyzed ancient Y. pestis samples17 obtained from four victims of the plague outbreak of 1347 to 1351. From the reads of target-captured Y. pestis DNA, we inferred $$\theta _{\mathrm{pool}}$$ = 0.069–0.096 and $$\phi _{\mathrm{pool}}$$ = 0.19–0.29 among the four samples (Fig. 3c and Supplementary Table 8). Compared with the modern strains, both $$\theta _{\mathrm{pool}}$$ and $$\phi _{\mathrm{pool}}$$ of the ancient samples were almost an order of magnitude larger. The method yielded robust results in both technical replicates (comparing different bait pools in H. pylori; Fig. 3b) and biological replicates (comparing different Y. pestis samples; Fig. 3c).

## Discussion

Correlation profile analysis infers the basic parameters of bacterial recombination using any type of available sequence data, from whole genomes to metagenomes, which opens a range of new possibilities for ecological, evolutionary, and population genetic inference. In applications to ancient DNA samples, this tool reveals evolutionary processes in bacterial populations of past eras. Metagenomics data from the infant gut microbiome yield clear signals of recombination, consistent with independent inferences using whole-genome data, enabling the interplay of ecology, evolution, and recombination to be probed using metagenomics community datasets. The ability to cluster samples on the basis of the diversity of their shared gene pools provides another powerful tool for analysis of the effect of recombination on bacterial species.

## Methods

### Generating alignments from whole-genome assemblies or raw reads

#### Generating genome alignments from assemblies

For the collections of bacterial strains with assemblies available (Supplementary Table 9), we downloaded the assemblies from NCBI (https://www.ncbi.nlm.nih.gov/assembly) and re-annotated them using Prokka28. We extracted the core set of genes that are present in ≥95% of all strains using Roary29. For each gene cluster, we aligned the protein sequences using MUSCLE30 and mapped back to obtain the DNA alignment. To remove gaps, we split each alignment into groups of sequences, such that the sequences within the same group had the same gap opening positions. For each group of sequences, we then removed the gaps to obtain the final alignment. For consistency between our analyses of global strains and transformation experiments in H. pylori and S. pneumoniae, in each of these two species we used the same set of core genes determined from global strains to analyze the results of the transformation experiments.
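The gap-removal step can be sketched in Python as follows. This is one plausible reading of the description above, not the authors' code: it interprets "the same gap opening positions" as the full set of gapped columns, and the function name and toy alignment are hypothetical.

```python
from collections import defaultdict


def split_by_gap_pattern(alignment):
    """Group aligned sequences by their gap columns, then strip the gaps.

    alignment: dict mapping sequence name to an aligned string that uses
               '-' for gaps.
    Returns a list of dicts, one per group; within a group all sequences
    shared the same gap columns, so the ungapped sequences stay aligned.
    """
    groups = defaultdict(dict)
    for name, seq in alignment.items():
        gap_columns = tuple(i for i, base in enumerate(seq) if base == "-")
        groups[gap_columns][name] = seq.replace("-", "")
    return list(groups.values())


# Toy alignment: s3 and s4 share a three-column gap (illustrative only).
example = {
    "s1": "ATGAAACCC",
    "s2": "ATGAAGCCC",
    "s3": "ATG---CCC",
    "s4": "ATG---CCT",
}
for group in split_by_gap_pattern(example):
    print(group)
```

Because sequences are only compared within a group, every pairwise comparison is made between gap-free sequences of equal length, at the cost of losing comparisons between groups.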

#### Generating genome alignments from raw reads

For the collections of global strains without assemblies (K. pneumoniae, M. abscessus, M. tuberculosis, and S. aureus), we applied a reference-based approach to generate whole-genome alignments from the raw reads. For each strain, we mapped the raw reads against the reference genomes (Supplementary Table 9) using SMALT (version 0.7.6, https://sourceforge.net/projects/smalt/), and obtained the consensus genome using SAMtools mpileup31. The resulting consensus genomes were combined to generate a whole-genome alignment, from which we extracted the alignments of coding regions. In each alignment, we removed sequences in which gaps constituted ≥2% of the total length. After this filtering, we used those gene alignments in which ≥95% of all strains were present.
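The two filters in the preceding paragraph can be expressed compactly. The sketch below assumes each gene alignment is held as a mapping from strain name to aligned sequence; the function name, argument names, and data layout are hypothetical, and only the two thresholds come from the text.

```python
def filter_gene_alignment(alignment, n_strains, max_gap_frac=0.02,
                          min_presence=0.95):
    """Apply the gap and presence filters to one gene alignment.

    alignment: dict mapping strain name to its aligned sequence ('-' = gap).
    n_strains: total number of strains in the collection.
    Returns the filtered alignment, or None if too few strains remain.
    """
    # Drop sequences in which gaps make up max_gap_frac or more of the length.
    kept = {
        strain: seq
        for strain, seq in alignment.items()
        if len(seq) > 0 and seq.count("-") / len(seq) < max_gap_frac
    }
    # Keep the gene only if enough strains are still represented.
    if len(kept) / n_strains >= min_presence:
        return kept
    return None


# Toy collection of 20 strains; one sequence has exactly 2% gaps.
genes = {f"s{i}": "A" * 100 for i in range(19)}
genes["s19"] = "A" * 98 + "--"
filtered = filter_gene_alignment(genes, n_strains=20)
print(len(filtered) if filtered is not None else None)
```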

We generated read alignments from raw reads of different sources: reads from simulations of next-generation sequencing, reads from metagenomic shotgun sequencing data, and reads from ancient DNA samples (Supplementary Table 9). For paired-end read data, each read was treated as two separate single reads. For each dataset, we first applied filters to obtain high-quality reads as detailed below and then mapped them to the reference genomes to obtain the final alignments.

To simulate reads, we used ART25 with parameters -ss HS25 -l 150 -f 5, which simulated single-end Illumina HiSeq 2500 reads of length 150 bp with 5× coverage. To mimic the increase of sequencing errors, we specified the option -qs from −1 to −7, corresponding to the sequencing errors increasing from 1.25 to 5 times the default setting.
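The stated range of error multipliers is consistent with the Phred scale, on which a quality score Q corresponds to an error probability of 10^(−Q/10). Assuming the shift is applied uniformly to every quality score, a shift of x changes the error rate by a factor of 10^(−x/10). The short Python check below illustrates that arithmetic only; the function name is hypothetical and nothing here invokes ART itself.

```python
def art_error_multiplier(qs_shift):
    """Error-rate multiplier implied by shifting Phred scores by qs_shift.

    A negative shift lowers quality scores and therefore raises the
    error rate; a shift of 0 leaves it unchanged.
    """
    return 10 ** (-qs_shift / 10)


for shift in range(-1, -8, -1):
    print(shift, round(art_error_multiplier(shift), 2))
```

A shift of −1 corresponds to a multiplier of about 1.26 and a shift of −7 to about 5.0, matching the 1.25-fold to fivefold range quoted above.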

For the infant microbiome samples27, we aligned the raw reads against a set of E. coli reference genomes compiled by PanPhlAn32 using Bowtie 2 (ref. 33). We then removed mapped reads with an edit distance greater than 4. For ancient samples16,17, we analyzed the uracil DNA glycosylase-treated and target-captured reads. We used SeqPrep (version 1.2, https://github.com/jstjohn/SeqPrep) to trim adaptors and merge reads. We then removed merged reads shorter than 25 bp. Each sample was analyzed separately rather than by pooling to avoid biases due to differences in library size.

After obtaining the high-quality reads, we aligned them against the reference genomes (Supplementary Table 9) using SMALT, and removed reads with mapping quality < 30. Positions with base quality < 30 were marked and ignored in correlation profile calculations. In Y. pestis ancient samples, however, such stringent filtering removed nearly all reads because of the data quality, and this filter was therefore not applied in that dataset. The final alignments were sorted by SAMtools31, and coding region alignments were extracted from these. Alignments that covered ≥50% of any given gene with read depth ≥ 5 were retained for correlation profile calculation, while all others were discarded.
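The quality and coverage criteria above can be sketched with plain Python data structures. The helper names are hypothetical, qualities are assumed to be already decoded to integer Phred scores, and the coverage rule is read as "at least half of a gene's positions have depth of 5 or more", which is an interpretation of the text rather than a confirmed detail of the pipeline.

```python
def keep_read(mapping_quality, min_mapq=30):
    """Keep a read only if its mapping quality reaches the threshold."""
    return mapping_quality >= min_mapq


def mask_low_quality(bases, quals, min_qual=30, mask="N"):
    """Replace bases whose quality is below min_qual with a mask symbol.

    Masked positions can then be skipped when substitution profiles
    are computed.
    """
    return "".join(b if q >= min_qual else mask for b, q in zip(bases, quals))


def gene_is_covered(depths, min_depth=5, min_fraction=0.5):
    """True if at least min_fraction of positions have depth >= min_depth."""
    if not depths:
        return False
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_fraction


print(keep_read(42), keep_read(12))
print(mask_low_quality("ACGT", [40, 12, 30, 29]))
print(gene_is_covered([5, 6, 0, 0]), gene_is_covered([5, 0, 0, 0]))
```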

### Calculating and fitting correlation profiles

Given a sample of aligned gene sequences, obtained from either assemblies or reads, we compared each pair of sequences within each gene to identify the synonymous substitutions, which yielded the binary variable $$\sigma _i$$ for each sequence pair and gene, where i indexes base pair position within the gene. We calculated $$d_{\mathrm{sample}}$$ and $$P(l)$$ from $$\sigma _i$$ by taking the appropriate average over third-position sites of codons, averaging over all sequence pairs within each alignment, and then averaging over all genes (Supplementary Note). The maximal value of l, denoted $$l_{\mathrm{max}}$$, was set to 250 bp, where $$l_{\mathrm{max}}$$ was substantially shorter than the maximum length of alignments to ensure sufficient sampling of the correlation profile. When using read alignments, however, the short-read length limited the range of l values that could be assessed. To overcome this limitation, we computed the correlation of substitution rates at any pair of sites using all aligned reads at each position, and performed a linear transformation to obtain $$P(l)$$ for large l (Supplementary Note and Supplementary Fig. 11). We performed non-linear fitting of $$P(l)$$ to the functional form derived in Supplementary Note 1, which has three independent parameters. Using the three fitted parameters, we obtained $$\theta _{\mathrm{pool}}$$, $$\phi _{\mathrm{pool}}$$, $$\bar f$$, $$c$$, $$\gamma /\mu$$, $$d_{\mathrm{pool}}$$, and $$d_{\mathrm{clonal}}$$ (Supplementary Note). For comparison with other methods in terms of speed and accuracy, see the Supplementary Note, Supplementary Fig. 12, and Supplementary Table 10.
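To make the averaging order concrete, the sketch below computes $$d_{\mathrm{sample}}$$ from third-position sites. It rests on several assumptions that the text does not confirm: a codon pair is counted only when both codons encode the same amino acid and share their first two bases, the standard codon table is used, and the mean is taken over sites, then over sequence pairs, then over genes. The exact weighting is defined in the Supplementary Note and may differ; all names are hypothetical.

```python
from itertools import combinations, product
from statistics import mean

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def third_site_sigma(seq_a, seq_b):
    """Return sigma at third-codon positions for synonymous codon pairs."""
    sigma = []
    for start in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[start:start + 3], seq_b[start:start + 3]
        aa_a, aa_b = CODON_TABLE.get(codon_a), CODON_TABLE.get(codon_b)
        if aa_a is None or aa_a != aa_b or codon_a[:2] != codon_b[:2]:
            continue  # unknown base, non-synonymous, or change at site 1 or 2
        sigma.append(int(codon_a[2] != codon_b[2]))
    return sigma


def sample_diversity(genes):
    """Average sigma over sites, then sequence pairs, then genes."""
    per_gene = []
    for alignment in genes:
        per_pair = []
        for seq_a, seq_b in combinations(alignment, 2):
            sigma = third_site_sigma(seq_a, seq_b)
            if sigma:
                per_pair.append(mean(sigma))
        if per_pair:
            per_gene.append(mean(per_pair))
    return mean(per_gene) if per_gene else None


# One toy gene with three two-codon sequences (illustrative only).
genes = [["GCTGGA", "GCCGGA", "GCTGGG"]]
print(sample_diversity(genes))
```

The same per-pair $$\sigma$$ lists could be fed into a conditional-probability estimate to obtain $$P(l)$$, with l restricted to multiples of 3 bp because only third-position sites are compared.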

### Statistics

The number of strains and genes used to calculate correlation profiles for each dataset is provided in Supplementary Tables 1–8. Unless otherwise indicated, in all parameter value bar plots, the error bars and circles indicate, respectively, the interquartile range (that is, the range between the 25th and 75th percentile) and the median over 1,000 bootstrapped parameter values (Supplementary Note).
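The summary statistics shown in the bar plots can be illustrated with a generic bootstrap. The sketch below resamples a list of per-gene values with replacement and reports the quartiles of a chosen statistic; in the actual analysis each resample would presumably be refitted to obtain the model parameters (Supplementary Note), so the use of a simple mean, the input values, and the function name are all assumptions made for illustration.

```python
import random
from statistics import mean, quantiles


def bootstrap_summary(values, statistic=mean, n_boot=1000, seed=0):
    """Quartiles of a statistic over bootstrap resamples of the input.

    values: per-gene quantities (hypothetical input for this sketch).
    Returns (q25, median, q75) of the bootstrapped statistic.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    boots = []
    for _ in range(n_boot):
        resample = rng.choices(values, k=len(values))  # with replacement
        boots.append(statistic(resample))
    q25, q50, q75 = quantiles(boots, n=4)
    return q25, q50, q75


# Hypothetical per-gene diversity values (illustrative only).
per_gene = [0.010, 0.012, 0.008, 0.015, 0.011, 0.009, 0.013, 0.010]
low, mid, high = bootstrap_summary(per_gene)
print(low, mid, high)
```

The interval from the first to the third returned value corresponds to the error bar, and the middle value to the circle, in the plotting convention described above.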

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Code availability

Our code is available as open source at https://github.com/kussell-lab/mcorr and as Supplementary Software.

## Data availability

All sequencing datasets used in this study are publicly available as cited in Supplementary Tables.

## References

1. Maynard Smith, J. The population genetics of bacteria. Proc. Biol. Sci. 245, 37–41 (1991).
2. Thomas, C. M. & Nielsen, K. M. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat. Rev. Microbiol. 3, 711–721 (2005).
3. Fraser, C., Hanage, W. P. & Spratt, B. G. Recombination and the nature of bacterial speciation. Science 315, 476–480 (2007).
4. Shapiro, B. J. et al. Population genomics of early events in the ecological differentiation of bacteria. Science 336, 48–51 (2012).
5. Hanage, W. P. Not so simple after all: bacteria, their population genetics, and recombination. Cold Spring Harb. Perspect. Biol. 8, a018069 (2016).
6. Chang, H. H. et al. Origin and proliferation of multiple-drug resistance in bacterial pathogens. Microbiol. Mol. Biol. Rev. 79, 101–116 (2015).
7. Didelot, X., Walker, A. S., Peto, T. E., Crook, D. W. & Wilson, D. J. Within-host evolution of bacterial pathogens. Nat. Rev. Microbiol. 14, 150–162 (2016).
8. Ansari, M. A. & Didelot, X. Inference of the properties of the recombination process from whole bacterial genomes. Genetics 196, 253 (2014).
9. Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015).
10. Didelot, X. & Falush, D. Inference of bacterial microevolution using multilocus sequence data. Genetics 175, 1251–1266 (2007).
11. Didelot, X., Lawson, D., Darling, A. & Falush, D. Inference of homologous recombination in bacteria using whole-genome sequences. Genetics 186, 1435–1449 (2010).
12. Didelot, X. & Wilson, D. J. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput. Biol. 11, e1004041 (2015).
13. Marttinen, P. et al. Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Res. 40, e6 (2012).
14. Mostowy, R. et al. Efficient inference of recent and ancestral recombination within bacterial populations. Mol. Biol. Evol. 34, 1167–1182 (2017).
15. Arnold, B. J. et al. Weak epistasis may drive adaptation in recombining bacteria. Genetics 208, 1247–1260 (2018).
16. Maixner, F. et al. The 5300-year-old Helicobacter pylori genome of the Iceman. Science 351, 162–165 (2016).
17. Bos, K. I. et al. A draft genome of Yersinia pestis from victims of the Black Death. Nature 478, 506–510 (2011).
18. Lin, M. & Kussell, E. Correlated mutations and homologous recombination within bacterial populations. Genetics 205, 891–917 (2017).
19. Bubendorfer, S. et al. Genome-wide analysis of chromosomal import patterns after natural transformation of Helicobacter pylori. Nat. Commun. 7, 11995 (2016).
20. Croucher, N. J., Harris, S. R., Barquist, L., Parkhill, J. & Bentley, S. D. A high-resolution view of genome-wide pneumococcal transformation. PLoS Pathog. 8, e1002745 (2012).
21. Thorell, K. et al. Rapid evolution of distinct Helicobacter pylori subpopulations in the Americas. PLoS Genet. 13, e1006546 (2017).
22. Manson, A. L. et al. Genomic analysis of globally diverse Mycobacterium tuberculosis strains provides insights into the emergence and spread of multidrug resistance. Nat. Genet. 49, 395–402 (2017).
23. Cao, Q. Z. et al. Progressive genomic convergence of two Helicobacter pylori strains during mixed infection of a patient with chronic gastritis. Gut 64, 554–561 (2015).
24. Kennemann, L. et al. Helicobacter pylori genome evolution during human infection. Proc. Natl Acad. Sci. USA 108, 5033–5038 (2011).
25. Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
26. Nakamura, K. et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 39, e90 (2011).
27. Ward, D. V. et al. Metagenomic sequencing with strain-level resolution implicates uropathogenic E. coli in necrotizing enterocolitis and mortality in preterm infants. Cell Rep. 14, 2912–2924 (2016).
28. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
29. Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015).
30. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
31. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
32. Scholz, M. et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat. Methods 13, 435–438 (2016).
33. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

## Acknowledgements

We thank J. Carlton, T. Nozoe, M. Rockman, and A. Skanata for helpful comments on the manuscript.

## Author information

M.L. and E.K. designed the research, developed the mathematical theory, interpreted the results, and wrote the paper. M.L. performed the bioinformatic analyses and wrote the mcorr package code.

Correspondence to Edo Kussell.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Integrated supplementary information

### Supplementary Figure 1 Schematic of a bulk population pool of recombining sequences and the recombination coverage of a focal sample consisting of two sequences.

The pool (upper panel): a set of five genome sequences is shown within a much larger bulk pool of sequences, and the clonal phylogeny, representing the genealogy of the five individuals, is indicated on the left. Recombination events since the last common ancestor (LCA) of the five individuals are indicated by gray arrows, with the transferred fragments shown in the color of the donor genome. The sample (lower panel): a sample consisting of two genomes (4 and 5) is shown. The recombination coverage c, shown schematically in gray, is the fraction of the genome that has recombined over the time t since the LCA of the pair of genomes. The sample diversity, $$d_{{\mathrm{sample}}}$$, is the probability that the pair will differ at any locus. It is decomposed into the recombined fraction c, which exhibits the pool diversity, $$d_{{\mathrm{pool}}}$$, and the clonal fraction 1 – c, which exhibits the clonal diversity, $$d_{{\mathrm{clonal}}}$$, i.e., the diversity that has accumulated by mutations since the LCA of the pair. In general, samples can consist of any number of sequences; a single pair is shown here for simplicity. See the Supplementary Notes for mathematical definitions and further details.
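Read literally, the decomposition in this caption is a weighted sum: the recombined fraction c contributes the pool diversity and the clonal fraction 1 – c contributes the clonal diversity. The Python sketch below illustrates that reading only. The function names and the example numbers are assumptions made for illustration, not values or code from the paper; the formal definitions are in the Supplementary Notes.

```python
def sample_diversity(c, d_pool, d_clonal):
    """Weighted sum suggested by the caption: c * d_pool + (1 - c) * d_clonal."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("recombination coverage c must lie in [0, 1]")
    return c * d_pool + (1.0 - c) * d_clonal


def coverage_from_diversities(d_sample, d_pool, d_clonal):
    """Invert the same weighted sum to recover c (requires d_pool != d_clonal)."""
    if d_pool == d_clonal:
        raise ValueError("c is not identifiable when d_pool equals d_clonal")
    return (d_sample - d_clonal) / (d_pool - d_clonal)


# Illustrative numbers only (not taken from the paper).
d = sample_diversity(c=0.25, d_pool=0.04, d_clonal=0.002)
c_back = coverage_from_diversities(d, d_pool=0.04, d_clonal=0.002)
```

Under this reading, a sample whose diversity equals the pool diversity corresponds to full coverage (c = 1), and one whose diversity equals the clonal diversity corresponds to c = 0.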

### Supplementary Figure 2 Correlation profile analysis of genomic sequences sampled from a transformation experiment in S. pneumoniae.

Panels (a) and (b) show correlation profiles (open circles) calculated from genomic assemblies of the 84 evolved strains of S. pneumoniae1, and from the donor and recipient strains, respectively. Panels (c) and (d) show correlation profiles calculated from raw reads using the two sets of strains in (a) and (b), respectively. Model fits are shown as solid lines. In panel (e), the solid bars show the inferred values of $$\theta _{{\mathrm{pool}}}$$, $$\phi _{{\mathrm{pool}}}$$, $$\bar f$$, and c for the four datasets shown in the same colors as $$P(l)$$; error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 2 for values and sample sizes.
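The error bars in these figures summarize 1,000 bootstrap replicates by their median and interquartile range. The sketch below shows one generic way to compute such a summary in Python. It is an illustration under stated assumptions: the resampling unit and the fitted estimator used in the paper are described in the Methods, which are not reproduced here, so this example resamples a hypothetical list of per-unit values and uses the mean as a stand-in estimator.

```python
import random
import statistics


def bootstrap_summary(values, estimator=statistics.fmean, n_boot=1000, seed=0):
    """Return the median and interquartile range of bootstrap replicates.

    `values` is a hypothetical list of per-unit measurements and `estimator`
    is a stand-in for whatever quantity is re-estimated on each resample.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    replicates = [
        estimator(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_boot)
    ]
    q1, _, q3 = statistics.quantiles(replicates, n=4)  # quartile cut points
    return {"median": statistics.median(replicates), "iqr": (q1, q3)}


# Hypothetical input values, for illustration only.
summary = bootstrap_summary([0.001 * i for i in range(1, 51)])
```

The interquartile range is less sensitive to occasional outlying replicates than a standard deviation, which is one reason it is a common choice for summarizing bootstrap distributions.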

### Supplementary Figure 3 Correlation profile analysis of the 12 major clusters of the H. pylori global collections.

In panel (a), correlation profiles (open circles) and model fits (solid lines) of the H. pylori global collections2 are shown. In panel (b), the solid bars show the inferred values of $$\theta _{{\mathrm{pool}}}$$, $$\phi _{{\mathrm{pool}}}$$, $$\bar f$$, and c for the clusters, using the same colors as $$P(l)$$; error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 3 for values and sample sizes.

### Supplementary Figure 4 Correlation profile fitting and residual plots of global collections of strains in eight bacterial species.

Insets show the histograms of residual values. See Supplementary Table 3 for values and sample sizes.

### Supplementary Figure 5 Comparison of correlation profiles for genes binned by synonymous diversity levels for M. tuberculosis and Y. pestis.

Histograms on the right show the distribution of synonymous diversity levels ($$d_{s}$$) across genes on a logarithmic scale. Panels on the left show the correlation profiles computed using different subsets of genes. The high-diversity subset corresponds to the top 3% of genes ranked by $$d_{s}$$. Correlation profiles in M. tuberculosis are seen to be flat regardless of gene subset, while in Y. pestis the high-diversity genes exhibit a pronounced decay in the correlation profile. See Supplementary Table 3 for values and sample sizes.
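Selecting a high-diversity subset as the top 3% of genes ranked by synonymous diversity can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function name, the gene identifiers, and the handling of ties and rounding are all assumptions.

```python
def split_by_diversity(ds_by_gene, top_fraction=0.03):
    """Split genes into a high-diversity subset and the remainder.

    `ds_by_gene` maps a gene identifier to its synonymous diversity.
    Rounding to the nearest whole gene (at least one) is an assumption;
    the caption does not say how the cutoff is rounded or how ties break.
    """
    ranked = sorted(ds_by_gene, key=ds_by_gene.get, reverse=True)
    n_top = max(1, round(top_fraction * len(ranked)))
    return ranked[:n_top], ranked[n_top:]


# Hypothetical diversity values for 100 genes, for illustration only.
genes = {f"gene{i}": i / 1000 for i in range(100)}
high, rest = split_by_diversity(genes)
```

A correlation profile would then be computed separately for each subset, which is how the panels on the left of the figure are organized.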

### Supplementary Figure 6 Correlation profile analysis of H. pylori strains isolated from a single Chinese subject.

In panel (a), correlation profiles (open circles) and model fits (solid lines) from a single Chinese subject3 are shown. The strains were split into two clades, and the analyses of within-clade and between-clade strains are indicated in different colors. In panel (b), the solid bars show the inferred values of $$\theta _{{\mathrm{pool}}}$$, $$\phi _{{\mathrm{pool}}}$$, $$\bar f$$, and c using the same colors as $$P(l)$$; error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 4 for values and sample sizes.

### Supplementary Figure 7 Correlation profile analysis of four pairs of H. pylori strains from four Colombian subjects.

Four pairs of H. pylori strains were analyzed, with each pair corresponding to two time points of the same subject, spanning either 3 or 16 years4. In panel (a), correlation profiles (open circles) and model fits (solid lines) are shown. In panel (b), the solid bars show the inferred values of $$\theta _{{\mathrm{pool}}}$$, $$\phi _{{\mathrm{pool}}}$$, $$\bar f$$, and c using the same colors as $$P(l)$$; error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 5 for values and sample sizes.

### Supplementary Figure 8 Parameter inference for different sequencing error rates.

Inferred $$\theta _{{\mathrm{pool}}}$$, $$\phi _{{\mathrm{pool}}}$$, c and measured $$d_{{\mathrm{sample}}}$$ from simulated reads with increasing sequencing error rates for the H. pylori transformation experiment data using the evolved strains (see Methods for details). The error bars are the s.d. of 100 independent simulations with the same settings. Error bars are not visible when smaller than the symbol size. Dashed lines indicate the values obtained from simulated reads with the default error rate (relative error rate = 1). See Supplementary Table 1 for sample sizes.

### Supplementary Figure 9 Correlation profile analysis of raw reads sequenced from an S. aureus single clone.

Panel (a) shows the measurement of $$P(l)$$ using raw reads from the single S. aureus USA300 clone5. In panel (b), $$P(l)$$ for the global S. aureus strains (black circles) and the single S. aureus clone (blue circles) are shown for comparison. Model fit is shown as a solid line. The red dashed line shows the maximum absolute value of the fitting residuals. See Supplementary Table 3 for values and sample sizes.

### Supplementary Figure 10 Correlation profile analysis of E. coli ST131 from infant gut microbiome samples.

Correlation profiles of infant gut microbiome samples6 from the same infant at different time points are shown, denoted by different shapes and colors. Model fits are shown as solid lines. See Supplementary Table 6 for values and sample sizes.

### Supplementary Figure 11 Linear relationship between $$Q_s(l)$$ and $$\hat{Q_s}({l})$$.

Analysis was performed as described in the Supplementary Notes using the H. pylori ancient sample C0058. The linear regression line is shown (n = 20), and the squared Pearson correlation coefficient, $$R^2$$, is indicated. See Supplementary Table 7 for values and sample sizes.
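For readers who want to reproduce this kind of check on their own data, the following Python sketch computes an ordinary least-squares line and the squared Pearson correlation coefficient for paired values. The function name and the example data are illustrative assumptions; the actual quantities being compared are defined in the Supplementary Notes.

```python
def linear_fit_r2(x, y):
    """Return (slope, intercept, r_squared) for a least-squares line y = a*x + b."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each have nonzero variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Hypothetical, perfectly linear data with n = 20 points.
xs = list(range(20))
ys = [2 * xi + 1 for xi in xs]
slope, intercept, r2 = linear_fit_r2(xs, ys)
```

For a simple linear regression with one predictor, the squared Pearson correlation equals the coefficient of determination of the fitted line, so a value near 1 indicates that the two quantities are close to proportional up to an offset.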

### Supplementary Figure 12 Running time comparison of correlation profile analysis versus ClonalFrameML.

For each data point, the indicated number of genomic sequences was provided as input to both methods. Sequences were chosen randomly from the global collection of H. pylori strains, and for each comparison both methods were run on the identical set of sequences. Running time for the correlation profile method (blue) includes both the calculation of the correlation profile and the calculation of the best fitting parameters. Running time for ClonalFrameML7 is shown either with (green) or without (red) the phylogeny-building step (RAxML).

## Supplementary information

### Supplementary Text and Figures

Supplementary Figures 1–12, Supplementary Tables 1–10 and Supplementary Note 1

## Rights and permissions

Lin, M., Kussell, E. Inferring bacterial recombination rates from large-scale sequencing datasets. Nat Methods 16, 199–204 (2019). https://doi.org/10.1038/s41592-018-0293-7