
Inferring bacterial recombination rates from large-scale sequencing datasets


We present a robust, computationally efficient method for inferring the parameters of homologous recombination in bacteria, which can be applied to diverse datasets, from whole-genome sequencing to metagenomic shotgun sequencing data. Using correlation profiles of synonymous substitutions, we determine recombination rates and diversity levels of the shared gene pool that has contributed to a given sample. We validated the recombination parameters using data from laboratory experiments. We determined the recombination parameters for a wide range of bacterial species, and inferred the distribution of shared gene pools for global Helicobacter pylori isolates. Using metagenomic data from the infant gut microbiome, we measured the recombination parameters of multidrug-resistant Escherichia coli ST131. Lastly, we analyzed ancient samples of bacterial DNA from the Copper Age ‘Iceman’ mummy and from 14th-century victims of the Black Death, obtaining measurements of bacterial recombination rates and gene pool diversity of earlier eras.
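As a rough illustration of what a correlation profile measures, the sketch below estimates, for each separation l, the probability that two sequences differ at site i + l given that they differ at site i. This definition of P(l), the function name `correlation_profile`, and the toy 0/1 difference vector are all assumptions made for illustration; the restriction to synonymous sites and the averaging over many sequence pairs are not modeled here.

```python
from typing import List, Sequence


def correlation_profile(diffs: Sequence[int], max_lag: int) -> List[float]:
    """Estimate P(l) for l = 1..max_lag from a 0/1 difference vector.

    diffs[i] is 1 if a pair of aligned sequences differs at site i, else 0.
    P(l) is taken here (as an assumption) to be the probability of a
    difference at site i + l given a difference at site i.
    """
    n = len(diffs)
    profile = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag  # number of site pairs separated by `lag`
        # Joint probability that both sites of a pair differ.
        joint = sum(diffs[i] * diffs[i + lag] for i in range(pairs)) / pairs
        # Marginal probability that the first site of a pair differs.
        marginal = sum(diffs[i] for i in range(pairs)) / pairs
        profile.append(joint / marginal if marginal > 0 else float("nan"))
    return profile


# Hypothetical difference vector with a period-4 pattern of substitutions.
diffs = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
profile = correlation_profile(diffs, max_lag=4)
```

A flat profile would indicate that substitutions at different separations are uncorrelated, while a decaying profile indicates correlations that weaken with distance.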


Fig. 1: Measured correlation profiles and inferred recombination parameters from genomic sequences sampled from a transformation experiment in H. pylori¹⁹.
Fig. 2: Correlation profile analysis of global bacterial strain collections.
Fig. 3: Applications to metagenomics and ancient samples using analysis of unassembled raw reads.

Code availability

Our code is available as open source and as Supplementary Software.

Data availability

All sequencing datasets used in this study are publicly available as cited in Supplementary Tables.


References

  1. Maynard Smith, J. The population genetics of bacteria. Proc. Biol. Sci. 245, 37–41 (1991).
  2. Thomas, C. M. & Nielsen, K. M. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat. Rev. Microbiol. 3, 711–721 (2005).
  3. Fraser, C., Hanage, W. P. & Spratt, B. G. Recombination and the nature of bacterial speciation. Science 315, 476–480 (2007).
  4. Shapiro, B. J. et al. Population genomics of early events in the ecological differentiation of bacteria. Science 336, 48–51 (2012).
  5. Hanage, W. P. Not so simple after all: bacteria, their population genetics, and recombination. Cold Spring Harb. Perspect. Biol. 8, a018069 (2016).
  6. Chang, H. H. et al. Origin and proliferation of multiple-drug resistance in bacterial pathogens. Microbiol. Mol. Biol. Rev. 79, 101–116 (2015).
  7. Didelot, X., Walker, A. S., Peto, T. E., Crook, D. W. & Wilson, D. J. Within-host evolution of bacterial pathogens. Nat. Rev. Microbiol. 14, 150–162 (2016).
  8. Ansari, M. A. & Didelot, X. Inference of the properties of the recombination process from whole bacterial genomes. Genetics 196, 253 (2014).
  9. Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015).
  10. Didelot, X. & Falush, D. Inference of bacterial microevolution using multilocus sequence data. Genetics 175, 1251–1266 (2007).
  11. Didelot, X., Lawson, D., Darling, A. & Falush, D. Inference of homologous recombination in bacteria using whole-genome sequences. Genetics 186, 1435–1449 (2010).
  12. Didelot, X. & Wilson, D. J. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput. Biol. 11, e1004041 (2015).
  13. Marttinen, P. et al. Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Res. 40, e6 (2012).
  14. Mostowy, R. et al. Efficient inference of recent and ancestral recombination within bacterial populations. Mol. Biol. Evol. 34, 1167–1182 (2017).
  15. Arnold, B. J. et al. Weak epistasis may drive adaptation in recombining bacteria. Genetics 208, 1247–1260 (2018).
  16. Maixner, F. et al. The 5300-year-old Helicobacter pylori genome of the Iceman. Science 351, 162–165 (2016).
  17. Bos, K. I. et al. A draft genome of Yersinia pestis from victims of the Black Death. Nature 478, 506–510 (2011).
  18. Lin, M. & Kussell, E. Correlated mutations and homologous recombination within bacterial populations. Genetics 205, 891–917 (2017).
  19. Bubendorfer, S. et al. Genome-wide analysis of chromosomal import patterns after natural transformation of Helicobacter pylori. Nat. Commun. 7, 11995 (2016).
  20. Croucher, N. J., Harris, S. R., Barquist, L., Parkhill, J. & Bentley, S. D. A high-resolution view of genome-wide pneumococcal transformation. PLoS Pathog. 8, e1002745 (2012).
  21. Thorell, K. et al. Rapid evolution of distinct Helicobacter pylori subpopulations in the Americas. PLoS Genet. 13, e1006546 (2017).
  22. Manson, A. L. et al. Genomic analysis of globally diverse Mycobacterium tuberculosis strains provides insights into the emergence and spread of multidrug resistance. Nat. Genet. 49, 395–402 (2017).
  23. Cao, Q. Z. et al. Progressive genomic convergence of two Helicobacter pylori strains during mixed infection of a patient with chronic gastritis. Gut 64, 554–561 (2015).
  24. Kennemann, L. et al. Helicobacter pylori genome evolution during human infection. Proc. Natl Acad. Sci. USA 108, 5033–5038 (2011).
  25. Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
  26. Nakamura, K. et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 39, e90 (2011).
  27. Ward, D. V. et al. Metagenomic sequencing with strain-level resolution implicates uropathogenic E. coli in necrotizing enterocolitis and mortality in preterm infants. Cell Rep. 14, 2912–2924 (2016).
  28. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
  29. Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015).
  30. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
  31. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  32. Scholz, M. et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat. Methods 13, 435–438 (2016).
  33. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).


Acknowledgements

We thank J. Carlton, T. Nozoe, M. Rockman, and A. Skanata for helpful comments on the manuscript.

Author information




Contributions

M.L. and E.K. designed the research, developed the mathematical theory, interpreted the results, and wrote the paper. M.L. performed the bioinformatic analyses and wrote the mcorr package code.

Corresponding author

Correspondence to Edo Kussell.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Schematic of a bulk population pool of recombining sequences and the recombination coverage of a focal sample consisting of two sequences.

The pool (upper panel): a set of five genome sequences is shown within a much larger bulk pool of sequences, and the clonal phylogeny, representing the genealogy of the five individuals, is indicated on the left. Recombination events since the last common ancestor (LCA) of the five individuals are indicated by gray arrows, with the transferred fragments shown in the color of the donor genome. The sample (lower panel): a sample consisting of two genomes (4 and 5) is shown. The recombination coverage c, shown schematically in gray, is the fraction of the genome that has recombined over the time t since the LCA of the pair of genomes. The sample diversity, \(d_{{\mathrm{sample}}}\), is the probability that the pair will differ at any locus. It is decomposed into the recombined fraction c, which exhibits the pool diversity, \(d_{{\mathrm{pool}}}\), and the clonal fraction \(1 - c\), which exhibits the clonal diversity, \(d_{{\mathrm{clonal}}}\), i.e., the diversity that has accumulated by mutations since the LCA of the pair. In general, samples can consist of any number of sequences; a single pair is shown here for simplicity. See the Supplementary Notes for mathematical definitions and further details.
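One natural reading of the decomposition described in this caption is a weighted mixture: the sample diversity equals c times the pool diversity plus (1 − c) times the clonal diversity. The sketch below encodes that reading and its inversion for c. The exact definitions are in the Supplementary Notes, so the formula, the function names, and the numerical values here are assumptions made for illustration only.

```python
def sample_diversity(c: float, d_pool: float, d_clonal: float) -> float:
    """Mixture reading of the caption: recombined fraction c carries d_pool,
    clonal fraction (1 - c) carries d_clonal."""
    return c * d_pool + (1.0 - c) * d_clonal


def coverage_from_diversities(d_sample: float, d_pool: float, d_clonal: float) -> float:
    """Invert the mixture to recover the recombination coverage c."""
    if d_pool == d_clonal:
        raise ValueError("c is not identifiable when d_pool equals d_clonal")
    return (d_sample - d_clonal) / (d_pool - d_clonal)


# Hypothetical values, not taken from the paper.
d = sample_diversity(c=0.2, d_pool=0.05, d_clonal=0.001)
c_back = coverage_from_diversities(d, d_pool=0.05, d_clonal=0.001)
```

Under this reading, a sample with low coverage stays close to the clonal diversity, and a sample with coverage near 1 approaches the pool diversity.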

Supplementary Figure 2 Correlation profile analysis of genomic sequences sampled from a transformation experiment in S. pneumoniae.

Panels (a) and (b) show correlation profiles (open circles) calculated from genomic assemblies of the 84 evolved strains of S. pneumoniae¹, and from the donor and recipient strains, respectively. Panels (c) and (d) show correlation profiles calculated from raw reads using the two sets of strains in (a) and (b), respectively. Model fits are shown as solid lines. In panel (e), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c for the four datasets shown in the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 2 for values and sample sizes.
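The median and interquartile range of bootstrap results, as reported in these captions, can be illustrated with a generic resampling sketch. The unit being resampled and the statistic being recomputed are described in the Methods, which are not reproduced here, so the function `bootstrap_summary`, the choice of the mean as the statistic, and the input values are hypothetical.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def bootstrap_summary(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float],
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float]]:
    """Return the median and (Q1, Q3) of a statistic over bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_boot):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(values, k=len(values))
        estimates.append(statistic(resample))
    q1, _, q3 = statistics.quantiles(estimates, n=4)
    return statistics.median(estimates), (q1, q3)


# Hypothetical per-unit estimates; not data from the paper.
values = [0.011, 0.014, 0.009, 0.013, 0.012, 0.010, 0.015, 0.008, 0.012, 0.011]
median, (q1, q3) = bootstrap_summary(values, statistics.mean)
```

The interquartile range (q1, q3) corresponds to what the captions draw as error bars, and the median to the circle.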

Supplementary Figure 3 Correlation profile analysis of the 12 major clusters of the H. pylori global collections.

In panel (a), correlation profiles (open circles) and model fits (solid lines) of the H. pylori global collections² are shown. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c for the clusters, using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 3 for values and sample sizes.

Supplementary Figure 4 Correlation profile fitting and residual plots of global collections of strains in eight bacterial species.

Insets show the histograms of residual values. See Supplementary Table 3 for values and sample sizes.

Supplementary Figure 5 Comparison of correlation profiles for genes binned by synonymous diversity levels for M. tuberculosis and Y. pestis.

Histograms on the right show the distribution of synonymous diversity levels (\(d_s\)) across genes on a logarithmic scale. Panels on the left show the correlation profiles computed using different subsets of genes. The high-diversity subset corresponds to the top 3% of genes ranked by \(d_s\). Correlation profiles in M. tuberculosis are seen to be flat regardless of gene subset, while in Y. pestis the high-diversity genes exhibit a pronounced decay in the correlation profile. See Supplementary Table 3 for values and sample sizes.
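Selecting the high-diversity subset described here amounts to ranking genes by synonymous diversity and keeping the top 3%. The sketch below shows one way to do that; the function `split_by_diversity`, the gene names, and the diversity values are hypothetical, and how ties or rounding were handled in the original analysis is not stated in this caption.

```python
from typing import Dict, List, Tuple


def split_by_diversity(
    ds_by_gene: Dict[str, float], top_fraction: float = 0.03
) -> Tuple[List[str], List[str]]:
    """Rank genes by synonymous diversity and split off the top fraction."""
    ranked = sorted(ds_by_gene, key=ds_by_gene.get, reverse=True)
    # Keep at least one gene in the high-diversity subset.
    n_top = max(1, round(top_fraction * len(ranked)))
    return ranked[:n_top], ranked[n_top:]


# Hypothetical diversity values for 100 genes.
ds_by_gene = {f"gene_{i}": i * 1e-4 for i in range(1, 101)}
high, rest = split_by_diversity(ds_by_gene)
```

Correlation profiles could then be computed separately for `high` and `rest` to compare the two subsets.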

Supplementary Figure 6 Correlation profile analysis of H. pylori strains isolated from a single Chinese subject.

In panel (a), correlation profiles (open circles) and model fits (solid lines) from a single Chinese subject³ are shown. The strains were split into two clades, and the analyses of within-clade and between-clade strains are indicated in different colors. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 4 for values and sample sizes.

Supplementary Figure 7 Correlation profile analysis of four pairs of H. pylori strains from four Colombian subjects.

Four pairs of H. pylori strains were analyzed, with each pair corresponding to two time points of the same subject, spanning either 3 or 16 years⁴. In panel (a), correlation profiles (open circles) and model fits (solid lines) are shown. In panel (b), the solid bars show the inferred values of \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), \(\bar f\), and c using the same colors as \(P(l)\); error bars and a circle indicate, respectively, the interquartile range and median of 1,000 bootstrapping results (Methods). See Supplementary Table 5 for values and sample sizes.

Supplementary Figure 8 Parameter inference for different sequencing error rates.

Inferred \(\theta _{{\mathrm{pool}}}\), \(\phi _{{\mathrm{pool}}}\), c and measured \(d_{{\mathrm{sample}}}\) from simulated reads with increasing sequencing error rates for the H. pylori transformation experiment data using the evolved strains (see Methods for details). The error bars are the s.d. of 100 independent simulations with the same settings. Error bars are not visible when smaller than the symbol size. Dashed lines indicate the values obtained from simulated reads with the default error rate (relative error rate = 1). See Supplementary Table 1 for sample sizes.

Supplementary Figure 9 Correlation profile analysis of raw reads sequenced from an S. aureus single clone.

Panel (a) shows the measurement of \(P(l)\) using raw reads from the single S. aureus USA300 clone⁵. In panel (b), \(P(l)\) for the global S. aureus strains (black circles) and the single S. aureus clone (blue circles) are shown for comparison. Model fit is shown as a solid line. The red dashed line shows the maximum absolute value of the fitting residuals. See Supplementary Table 3 for values and sample sizes.

Supplementary Figure 10 Correlation profile analysis of E. coli ST131 from infant gut microbiome samples.

Correlation profiles of infant gut microbiome samples⁶ from the same infant at different time points are shown, denoted by different shapes and colors. Model fits are shown as solid lines. See Supplementary Table 6 for values and sample sizes.

Supplementary Figure 11 Linear relationship between \(Q_s(l)\) and \(\hat{Q_s}({l})\).

Analysis was performed as described in the Supplementary Notes using the H. pylori ancient sample C0058. The linear regression line is shown (n = 20), and the squared Pearson correlation coefficient, \(R^2\), is indicated. See Supplementary Table 7 for values and sample sizes.
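For readers who want to reproduce this kind of check on their own data, the sketch below fits an ordinary least-squares line and reports the squared Pearson correlation coefficient. The function `linear_fit_r2` and the 20 data points are hypothetical; they do not come from sample C0058.

```python
from typing import Sequence, Tuple


def linear_fit_r2(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares line y = slope * x + intercept, plus squared Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical data: n = 20 points on a line with a small alternating offset.
x = [float(i) for i in range(1, 21)]
y = [2.0 * a + 0.1 + (0.01 if i % 2 == 0 else -0.01) for i, a in enumerate(x, start=1)]
slope, intercept, r_squared = linear_fit_r2(x, y)
```

For simple linear regression with an intercept, the squared Pearson coefficient equals the fraction of variance in y explained by the fitted line.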

Supplementary Figure 12 Running time comparison of correlation profile analysis versus ClonalFrameML.

For each data point, the indicated number of genomic sequences was provided as input to both methods. Sequences were chosen randomly from the global collection of H. pylori strains, and for each comparison both methods were run on the identical set of sequences. Running time for the correlation profile method (blue) includes both the calculation of the correlation profile and the calculation of the best fitting parameters. Running time for ClonalFrameML⁷ is shown either with (green) or without (red) the phylogeny-building step (RAxML).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–12, Supplementary Tables 1–10 and Supplementary Note 1

Reporting Summary


About this article


Cite this article

Lin, M., Kussell, E. Inferring bacterial recombination rates from large-scale sequencing datasets. Nat Methods 16, 199–204 (2019).
