Inferring bacterial recombination rates from large-scale sequencing datasets

Lin, Mingzhi; Kussell, Edo

doi:10.1038/s41592-018-0293-7

Article
Published: 21 January 2019

Nature Methods volume 16, pages 199–204 (2019)

Abstract

We present a robust, computationally efficient method (https://github.com/kussell-lab/mcorr) for inferring the parameters of homologous recombination in bacteria, which can be applied in diverse datasets, from whole-genome sequencing to metagenomic shotgun sequencing data. Using correlation profiles of synonymous substitutions, we determine recombination rates and diversity levels of the shared gene pool that has contributed to a given sample. We validated the recombination parameters using data from laboratory experiments. We determined the recombination parameters for a wide range of bacterial species, and inferred the distribution of shared gene pools for global Helicobacter pylori isolates. Using metagenomics data of the infant gut microbiome, we measured the recombination parameters of multidrug-resistant Escherichia coli ST131. Lastly, we analyzed ancient samples of bacterial DNA from the Copper Age ‘Iceman’ mummy and from 14th century victims of the Black Death, obtaining measurements of bacterial recombination rates and gene pool diversity of earlier eras.

**Fig. 1: Measured correlation profiles and inferred recombination parameters from genomic sequences sampled from a transformation experiment in *H. pylori*¹⁹.**

**Fig. 2: Correlation profile analysis of global bacterial strain collections.**

**Fig. 3: Applications to metagenomics and ancient samples using analysis of unassembled raw reads.**

Code availability

Our code is available as open source at https://github.com/kussell-lab/mcorr and as Supplementary Software.

Data availability

All sequencing datasets used in this study are publicly available as cited in Supplementary Tables.

References

1. Maynard Smith, J. The population genetics of bacteria. Proc. Biol. Sci. 245, 37–41 (1991).
2. Thomas, C. M. & Nielsen, K. M. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat. Rev. Microbiol. 3, 711–721 (2005).
3. Fraser, C., Hanage, W. P. & Spratt, B. G. Recombination and the nature of bacterial speciation. Science 315, 476–480 (2007).
4. Shapiro, B. J. et al. Population genomics of early events in the ecological differentiation of bacteria. Science 336, 48–51 (2012).
5. Hanage, W. P. Not so simple after all: bacteria, their population genetics, and recombination. Cold Spring Harb. Perspect. Biol. 8, a018069 (2016).
6. Chang, H. H. et al. Origin and proliferation of multiple-drug resistance in bacterial pathogens. Microbiol. Mol. Biol. Rev. 79, 101–116 (2015).
7. Didelot, X., Walker, A. S., Peto, T. E., Crook, D. W. & Wilson, D. J. Within-host evolution of bacterial pathogens. Nat. Rev. Microbiol. 14, 150–162 (2016).
8. Ansari, M. A. & Didelot, X. Inference of the properties of the recombination process from whole bacterial genomes. Genetics 196, 253 (2014).
9. Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015).
10. Didelot, X. & Falush, D. Inference of bacterial microevolution using multilocus sequence data. Genetics 175, 1251–1266 (2007).
11. Didelot, X., Lawson, D., Darling, A. & Falush, D. Inference of homologous recombination in bacteria using whole-genome sequences. Genetics 186, 1435–1449 (2010).
12. Didelot, X. & Wilson, D. J. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput. Biol. 11, e1004041 (2015).
13. Marttinen, P. et al. Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Res. 40, e6 (2012).
14. Mostowy, R. et al. Efficient inference of recent and ancestral recombination within bacterial populations. Mol. Biol. Evol. 34, 1167–1182 (2017).
15. Arnold, B. J. et al. Weak epistasis may drive adaptation in recombining bacteria. Genetics 208, 1247–1260 (2018).
16. Maixner, F. et al. The 5300-year-old Helicobacter pylori genome of the Iceman. Science 351, 162–165 (2016).
17. Bos, K. I. et al. A draft genome of Yersinia pestis from victims of the Black Death. Nature 478, 506–510 (2011).
18. Lin, M. & Kussell, E. Correlated mutations and homologous recombination within bacterial populations. Genetics 205, 891–917 (2017).
19. Bubendorfer, S. et al. Genome-wide analysis of chromosomal import patterns after natural transformation of Helicobacter pylori. Nat. Commun. 7, 11995 (2016).
20. Croucher, N. J., Harris, S. R., Barquist, L., Parkhill, J. & Bentley, S. D. A high-resolution view of genome-wide pneumococcal transformation. PLoS Pathog. 8, e1002745 (2012).
21. Thorell, K. et al. Rapid evolution of distinct Helicobacter pylori subpopulations in the Americas. PLoS Genet. 13, e1006546 (2017).
22. Manson, A. L. et al. Genomic analysis of globally diverse Mycobacterium tuberculosis strains provides insights into the emergence and spread of multidrug resistance. Nat. Genet. 49, 395–402 (2017).
23. Cao, Q. Z. et al. Progressive genomic convergence of two Helicobacter pylori strains during mixed infection of a patient with chronic gastritis. Gut 64, 554–561 (2015).
24. Kennemann, L. et al. Helicobacter pylori genome evolution during human infection. Proc. Natl Acad. Sci. USA 108, 5033–5038 (2011).
25. Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
26. Nakamura, K. et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 39, e90 (2011).
27. Ward, D. V. et al. Metagenomic sequencing with strain-level resolution implicates uropathogenic E. coli in necrotizing enterocolitis and mortality in preterm infants. Cell Rep. 14, 2912–2924 (2016).
28. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
29. Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015).
30. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
31. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
32. Scholz, M. et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat. Methods 13, 435–438 (2016).
33. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

Acknowledgements

We thank J. Carlton, T. Nozoe, M. Rockman, and A. Skanata for helpful comments on the manuscript.

Author information

Authors and Affiliations

Department of Biology and Center for Genomics and Systems Biology, New York University, New York, NY, USA
Mingzhi Lin & Edo Kussell
Department of Physics, New York University, New York, NY, USA
Edo Kussell

Contributions

M.L. and E.K. designed the research, developed the mathematical theory, interpreted the results, and wrote the paper. M.L. performed the bioinformatic analyses and wrote the mcorr package code.

Corresponding author

Correspondence to Edo Kussell.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Schematic of a bulk population pool of recombining sequences and the recombination coverage of a focal sample consisting of two sequences.

The pool (upper panel): a set of five genome sequences is shown within a much larger bulk pool of sequences, and the clonal phylogeny, representing the genealogy of the five individuals, is indicated on the left. Recombination events since the last common ancestor (LCA) of the five individuals are indicated by gray arrows, with the transferred fragments shown in the color of the donor genome. The sample (lower panel): a sample consisting of two genomes (4 and 5) is shown. The recombination coverage c, shown schematically in gray, is the fraction of the genome that has recombined over the time t since the LCA of the pair of genomes. The sample diversity, \(d_{{\mathrm{sample}}}\), is the probability that the pair will differ at any given locus. It is decomposed into the recombined fraction c, which exhibits the pool diversity, \(d_{{\mathrm{pool}}}\), and the clonal fraction 1 – c, which exhibits the clonal diversity, \(d_{{\mathrm{clonal}}}\), i.e., the diversity that has accumulated by mutations since the LCA of the pair. In general, samples can consist of any number of sequences; a single pair is shown here for simplicity. See the Supplementary Notes for mathematical definitions and further details.
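The decomposition described in this legend can be read as a two-component mixture: the recombined fraction c carries the pool diversity and the clonal fraction 1 – c carries the clonal diversity. The following is a minimal Python sketch of that arithmetic only. The function names and the numerical values are hypothetical and are not taken from the paper or from the mcorr package, whose inference fits the full correlation profile rather than this single relation.

```python
def sample_diversity(c, d_pool, d_clonal):
    """Mixture of pool and clonal diversity weighted by recombination coverage c."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return c * d_pool + (1.0 - c) * d_clonal


def coverage_from_diversities(d_sample, d_pool, d_clonal):
    """Invert the mixture for c; requires d_pool != d_clonal."""
    if d_pool == d_clonal:
        raise ValueError("c is not identifiable when d_pool equals d_clonal")
    return (d_sample - d_clonal) / (d_pool - d_clonal)


# Illustrative values only (assumptions, not results from the paper).
example_d_sample = sample_diversity(c=0.25, d_pool=0.08, d_clonal=0.004)
example_c = coverage_from_diversities(example_d_sample, d_pool=0.08, d_clonal=0.004)
print(example_d_sample, example_c)
```

With these illustrative inputs the recombined quarter of the genome contributes 0.25 × 0.08 and the clonal remainder contributes 0.75 × 0.004, so recombination dominates the sample diversity even at modest coverage.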

Supplementary Figure 2 Correlation profile analysis of genomic sequences sampled from a transformation experiment in S. pneumoniae.

Panels (a) and (b) show correlation profiles (open circles) calculated from genomic assemblies of the 84 evolved strains of S. pneumoniae¹, and from the donor and recipient strains, respectively. Panels (c) and (d) show correlation profiles calculated from raw reads using the two sets of strains in (a) and (b), respectively. Model fits are shown as solid lines. In panel (e), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c for the four datasets shown in the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 2 for values and sample sizes.
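The error bars in panel (e) summarize 1,000 bootstrap results by their median and interquartile range. As a rough illustration of how such a summary can be computed, here is a minimal Python sketch using only the standard library. It assumes a generic bootstrap that resamples a list of values with replacement and applies an arbitrary estimator; what the authors actually resample and re-fit is described in their Methods, which are not reproduced here, so `bootstrap_summary` and the input values are hypothetical.

```python
import random
import statistics


def bootstrap_summary(values, estimator, n_boot=1000, seed=0):
    """Return (median, q1, q3) of `estimator` over bootstrap resamples of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Resample n items with replacement, then re-estimate.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    q1, _, q3 = statistics.quantiles(estimates, n=4)  # quartile cut points
    return statistics.median(estimates), q1, q3


# Hypothetical per-unit measurements (assumption, not data from the paper).
values = [0.011, 0.014, 0.009, 0.020, 0.016, 0.012, 0.018, 0.010, 0.015, 0.013]
median, q1, q3 = bootstrap_summary(values, statistics.mean)
print(f"median={median:.4f}, IQR=[{q1:.4f}, {q3:.4f}]")
```

Reporting the median with the interquartile range, rather than a mean with a standard deviation, keeps the summary robust when the bootstrap distribution of a fitted parameter is skewed.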

Supplementary Figure 3 Correlation profile analysis of the 12 major clusters of the H. pylori global collections.

In panel (a), correlation profiles (open circles) and model fits (solid lines) of the H. pylori global collections² are shown. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c for the clusters, using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 3 for values and sample sizes.

Supplementary Figure 4 Correlation profile fitting and residual plots of global collections of strains in eight bacterial species.

Insets show the histograms of residual values. See Supplementary Table 3 for values and sample sizes.

Supplementary Figure 5 Comparison of correlation profiles for genes binned by synonymous diversity levels for M. tuberculosis and Y. pestis.

Histograms on the right show the distribution of synonymous diversity levels (d_s) across genes on a logarithmic scale. Panels on the left show the correlation profiles computed using different subsets of genes. The high-diversity subset corresponds to the top 3% of genes ranked by d_s. Correlation profiles in M. tuberculosis are seen to be flat regardless of gene subset, while in Y. pestis the high-diversity genes exhibit a pronounced decay in the correlation profile. See Supplementary Table 3 for values and sample sizes.
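Selecting the high-diversity subset amounts to ranking genes by d_s and keeping the top 3%. The sketch below shows one way to do this in Python. The gene identifiers, the d_s values, and the choice to round the subset size to the nearest integer are assumptions made for illustration; the legend does not say how ties or fractional counts were handled.

```python
def top_fraction_by_diversity(gene_ds, fraction=0.03):
    """Return gene IDs in the top `fraction` of genes when ranked by d_s (highest first)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    ranked = sorted(gene_ds, key=gene_ds.get, reverse=True)
    # Rounding to the nearest gene count is an assumption of this sketch.
    n_top = max(1, round(fraction * len(ranked)))
    return ranked[:n_top]


# Hypothetical per-gene synonymous diversity values (not data from the paper).
gene_ds = {f"gene{i}": i / 1000 for i in range(100)}
high_diversity = top_fraction_by_diversity(gene_ds)
print(high_diversity)
```

The correlation profile would then be recomputed on the returned subset and on the remaining genes separately, which is the comparison the left-hand panels show.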

Supplementary Figure 6 Correlation profile analysis of H. pylori strains isolated from a single Chinese subject.

In panel (a), correlation profiles (open circles) and model fits (solid lines) from a single Chinese subject³ are shown. The strains were split into two clades, and the analyses of within-clade and between-clade strains are indicated in different colors. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 4 for values and sample sizes.

Supplementary Figure 7 Correlation profile analysis of four pairs of H. pylori strains from four Colombian subjects.

Four pairs of H. pylori strains were analyzed, with each pair corresponding to two time points of the same subject, spanning either 3 or 16 years⁴. In panel (a), correlation profiles (open circles) and model fits (solid lines) are shown. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 5 for values and sample sizes.

Supplementary Figure 8 Parameter inference for different sequencing error rates.

Inferred \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), c and measured \(d_{{\mathrm{sample}}}\) from simulated reads with increasing sequencing error rates for the H. pylori transformation experiment data using the evolved strains (see Methods for details). The error bars are the s.d. of 100 independent simulations with the same settings. Error bars are not visible when smaller than the symbol size. Dashed lines indicate the values obtained from simulated reads with the default error rate (relative error rate = 1). See Supplementary Table 1 for sample sizes.

Supplementary Figure 9 Correlation profile analysis of raw reads sequenced from an S. aureus single clone.

Panel (a) shows the measurement of \(P(l)\) using raw reads from the single S. aureus USA300 clone⁵. In panel (b), \(P(l)\) for the global S. aureus strains (black circles) and the single S. aureus clone (blue circles) are shown for comparison. Model fit is shown as a solid line. The red dashed line shows the maximum absolute value of the fitting residuals. See Supplementary Table 3 for values and sample sizes.

Supplementary Figure 10 Correlation profile analysis of E. coli ST131 from infant gut microbiome samples.

Correlation profiles of infant gut microbiome samples⁶ from the same infant at different time points are shown, denoted by different shapes and colors. Model fits are shown as solid lines. See Supplementary Table 6 for values and sample sizes.

Supplementary Figure 11 Linear relationship between \(Q_s(l)\) and \(\hat{Q}_s(l)\).

Analysis was performed as described in the Supplementary Notes using the H. pylori ancient sample C0058. The linear regression line is shown (n = 20), and the squared Pearson correlation coefficient, R², is indicated. See Supplementary Table 7 for values and sample sizes.
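The regression line and R² reported in this legend are standard least-squares quantities. For readers who want to reproduce that kind of check on their own data, here is a minimal Python sketch of an ordinary least-squares fit and the squared Pearson correlation coefficient. The function name and the toy points are hypothetical; the legend does not state which quantity is regressed on which, so the x and y roles here are an assumption.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus squared Pearson r."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy points lying exactly on y = 2x (illustrative only).
slope, intercept, r_squared = linear_fit_r2([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(slope, intercept, r_squared)
```

For a simple linear regression with one predictor, the squared Pearson correlation equals the coefficient of determination of the fitted line, which is why a single R² value suffices to describe the fit.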

Supplementary Figure 12 Running time comparison of correlation profile analysis versus ClonalFrameML.

For each data point, the indicated number of genomic sequences was provided as input to both methods. Sequences were chosen randomly from the global collection of H. pylori strains, and for each comparison both methods were run on the identical set of sequences. Running time for the correlation profile method (blue) includes both the calculation of the correlation profile and the calculation of the best fitting parameters. Running time for ClonalFrameML⁷ is shown either with (green) or without (red) the phylogeny-building step (RAxML).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–12, Supplementary Tables 1–10 and Supplementary Note 1

About this article

Cite this article

Lin, M., Kussell, E. Inferring bacterial recombination rates from large-scale sequencing datasets. Nat Methods 16, 199–204 (2019). https://doi.org/10.1038/s41592-018-0293-7

Received: 12 November 2017
Accepted: 30 November 2018
Published: 21 January 2019
Issue Date: February 2019
DOI: https://doi.org/10.1038/s41592-018-0293-7
