Abstract
We present a robust, computationally efficient method (https://github.com/kussell-lab/mcorr) for inferring the parameters of homologous recombination in bacteria, which can be applied to diverse datasets, from whole-genome sequencing to metagenomic shotgun sequencing data. Using correlation profiles of synonymous substitutions, we determine recombination rates and diversity levels of the shared gene pool that has contributed to a given sample. We validated the recombination parameters using data from laboratory experiments. We determined the recombination parameters for a wide range of bacterial species, and inferred the distribution of shared gene pools for global Helicobacter pylori isolates. Using metagenomic data from the infant gut microbiome, we measured the recombination parameters of multidrug-resistant Escherichia coli ST131. Lastly, we analyzed ancient samples of bacterial DNA from the Copper Age ‘Iceman’ mummy and from 14th-century victims of the Black Death, obtaining measurements of bacterial recombination rates and gene pool diversity in earlier eras.
Code availability
Our code is available as open source at https://github.com/kussell-lab/mcorr and as Supplementary Software.
Data availability
All sequencing datasets used in this study are publicly available as cited in Supplementary Tables.
Acknowledgements
We thank J. Carlton, T. Nozoe, M. Rockman, and A. Skanata for helpful comments on the manuscript.
Author information
Contributions
M.L. and E.K. designed the research, developed the mathematical theory, interpreted the results, and wrote the paper. M.L. performed the bioinformatic analyses and wrote the mcorr package code.
Ethics declarations
Competing interests
The authors declare no competing interests.
Integrated supplementary information
Supplementary Figure 1 Schematic of a bulk population pool of recombining sequences and the recombination coverage of a focal sample consisting of two sequences.
The pool (upper panel): a set of five genome sequences is shown within a much larger bulk pool of sequences, and the clonal phylogeny, representing the genealogy of the five individuals, is indicated on the left. Recombination events since the last common ancestor (LCA) of the five individuals are indicated by gray arrows, with the transferred fragments shown in the color of the donor genome. The sample (lower panel): a sample consisting of two genomes (4 and 5) is shown. The recombination coverage c, shown schematically in gray, is the fraction of the genome that has recombined over the time t since the LCA of the pair of genomes. The sample diversity, \(d_{{\mathrm{sample}}}\), is the probability that the pair will differ at any given locus. It is decomposed into the recombined fraction c, which exhibits the pool diversity, \(d_{{\mathrm{pool}}}\), and the clonal fraction 1 − c, which exhibits the clonal diversity, \(d_{{\mathrm{clonal}}}\), i.e., the diversity that has accumulated by mutations since the LCA of the pair. In general, samples can consist of any number of sequences; a single pair is shown here for simplicity. See the Supplementary Notes for mathematical definitions and further details.
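The decomposition described in this caption can be read as a weighted average of the pool and clonal diversities. The sketch below illustrates that reading only; the function names are hypothetical, the numbers in the usage note are made up, and the authors' exact definitions are in the Supplementary Notes.

```python
def sample_diversity(c, d_pool, d_clonal):
    """Weighted-average reading of the caption:
    d_sample = c * d_pool + (1 - c) * d_clonal.

    c        -- recombination coverage, a fraction in [0, 1]
    d_pool   -- diversity of the recombined fraction (pool diversity)
    d_clonal -- diversity of the clonal fraction
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("coverage c must lie between 0 and 1")
    return c * d_pool + (1.0 - c) * d_clonal


def coverage_from_diversities(d_sample, d_pool, d_clonal):
    """Invert the same relation for c (assumes d_pool != d_clonal)."""
    if d_pool == d_clonal:
        raise ValueError("c is not identifiable when d_pool equals d_clonal")
    return (d_sample - d_clonal) / (d_pool - d_clonal)
```

For example, with illustrative values c = 0.25, d_pool = 0.04 and d_clonal = 0, the sample diversity under this reading is 0.01, and inverting it recovers c = 0.25.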
Supplementary Figure 2 Correlation profile analysis of genomic sequences sampled from a transformation experiment in S. pneumoniae.
Panels (a) and (b) show correlation profiles (open circles) calculated from genomic assemblies of the 84 evolved strains of S. pneumoniae (ref. 1), and from the donor and recipient strains, respectively. Panels (c) and (d) show correlation profiles calculated from raw reads using the two sets of strains in (a) and (b), respectively. Model fits are shown as solid lines. In panel (e), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c for the four datasets shown in the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 2 for values and sample sizes.
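The error bars in this and later captions summarize bootstrap replicates by their median and interquartile range. A minimal sketch of that kind of summary follows, assuming a generic list of resampling units and a hypothetical `estimator` callable; what the authors actually resample is specified in the Methods, which are not reproduced here.

```python
import random
import statistics


def bootstrap_summary(units, estimator, n_boot=1000, seed=0):
    """Resample `units` with replacement n_boot times, apply `estimator`
    to each resample, and report the median and interquartile range.

    `units` and `estimator` are placeholders for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(units, k=len(units))
        estimates.append(estimator(resample))
    # quantiles(..., n=4) returns the three quartile cut points
    q1, median, q3 = statistics.quantiles(estimates, n=4, method="inclusive")
    return {"median": median, "iqr": (q1, q3)}
```

A call such as `bootstrap_summary(per_gene_values, statistics.mean)` would return the median and the lower and upper quartiles of 1,000 resampled means, where `per_gene_values` stands for any list of measurements.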
Supplementary Figure 3 Correlation profile analysis of the 12 major clusters of the H. pylori global collections.
In panel (a), correlation profiles (open circles) and model fits (solid lines) of the H. pylori global collections (ref. 2) are shown. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c for the clusters, using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 3 for values and sample sizes.
Supplementary Figure 4 Correlation profile fitting and residual plots of global collections of strains in eight bacterial species.
Insets show the histograms of residual values. See Supplementary Table 3 for values and sample sizes.
Supplementary Figure 5 Comparison of correlation profiles for genes binned by synonymous diversity levels for M. tuberculosis and Y. pestis.
Histograms on the right show the distribution of synonymous diversity levels (\(d_s\)) across genes on a logarithmic scale. Panels on the left show the correlation profiles computed using different subsets of genes. The high-diversity subset corresponds to the top 3% of genes ranked by \(d_s\). Correlation profiles in M. tuberculosis are seen to be flat regardless of gene subset, while in Y. pestis the high-diversity genes exhibit a pronounced decay in the correlation profile. See Supplementary Table 3 for values and sample sizes.
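Selecting a high-diversity subset by rank, as in this caption, can be sketched as follows. The function name, the dictionary input, and the rounding rule (round up, at least one gene) are assumptions made for illustration; the caption does not say how ties or fractional counts were handled.

```python
import math


def split_by_diversity(ds_by_gene, top_fraction=0.03):
    """Rank genes by synonymous diversity and split off the top fraction.

    ds_by_gene   -- dict mapping gene name to its d_s value
    top_fraction -- fraction of genes placed in the high-diversity subset
    Returns (high_diversity_genes, remaining_genes), both ranked high to low.
    """
    ranked = sorted(ds_by_gene, key=ds_by_gene.get, reverse=True)
    if not ranked:
        return [], []
    # Assumed rounding rule: round up and keep at least one gene.
    n_top = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[:n_top], ranked[n_top:]
```

Correlation profiles would then be computed separately for each subset, as the left-hand panels do.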
Supplementary Figure 6 Correlation profile analysis of H. pylori strains isolated from a single Chinese subject.
In panel (a), correlation profiles (open circles) and model fits (solid lines) from a single Chinese subject (ref. 3) are shown. The strains were split into two clades, and the analyses of within-clade and between-clade strains are indicated in different colors. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 4 for values and sample sizes.
Supplementary Figure 7 Correlation profile analysis of four pairs of H. pylori strains from four Colombian subjects.
Four pairs of H. pylori strains were analyzed, with each pair corresponding to two time points of the same subject, spanning either 3 or 16 years (ref. 4). In panel (a), correlation profiles (open circles) and model fits (solid lines) are shown. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 5 for values and sample sizes.
Supplementary Figure 8 Parameter inference for different sequencing error rates.
Inferred \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), c and measured \(d_{{\mathrm{sample}}}\) from simulated reads with increasing sequencing error rates for the H. pylori transformation experiment data using the evolved strains (see Methods for details). The error bars are the s.d. of 100 independent simulations with the same settings. Error bars are not visible when smaller than the symbol size. Dashed lines indicate the values obtained from simulated reads with the default error rate (relative error rate = 1). See Supplementary Table 1 for sample sizes.
Supplementary Figure 9 Correlation profile analysis of raw reads sequenced from an S. aureus single clone.
Panel (a) shows the measurement of \(P(l)\) using raw reads from the single S. aureus USA300 clone (ref. 5). In panel (b), \(P(l)\) for the global S. aureus strains (black circles) and the single S. aureus clone (blue circles) are shown for comparison. Model fit is shown as a solid line. The red dashed line shows the maximum absolute value of the fitting residuals. See Supplementary Table 3 for values and sample sizes.
Supplementary Figure 11 Linear relationship between \(Q_s(l)\) and \(\hat{Q}_s(l)\).
Analysis was performed as described in the Supplementary Notes using the H. pylori ancient sample C0058. The linear regression line is shown (n = 20), and the squared Pearson correlation coefficient, \(R^2\), is indicated. See Supplementary Table 7 for values and sample sizes.
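The regression line and squared Pearson correlation reported in this caption are standard least-squares quantities. A minimal sketch of how they are computed from paired values is given below; the function name is hypothetical, and which quantity plays the role of x and which of y is defined in the Supplementary Notes, not here.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus the
    squared Pearson correlation coefficient R^2 of the paired values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for the fit to be defined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

For a simple linear fit with an intercept, this \(R^2\) equals the fraction of variance in y explained by the regression line.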
Supplementary Figure 12 Running time comparison of correlation profile analysis versus ClonalFrameML.
For each data point, the indicated number of genomic sequences was provided as input to both methods. Sequences were chosen randomly from the global collection of H. pylori strains, and for each comparison both methods were run on the identical set of sequences. Running time for the correlation profile method (blue) includes both the calculation of the correlation profile and the calculation of the best fitting parameters. Running time for ClonalFrameML (ref. 7) is shown either with (green) or without (red) the phylogeny-building step (RAxML).
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–12, Supplementary Tables 1–10 and Supplementary Note 1
About this article
Cite this article
Lin, M., Kussell, E. Inferring bacterial recombination rates from large-scale sequencing datasets. Nat Methods 16, 199–204 (2019). https://doi.org/10.1038/s41592-018-0293-7