The development of the Multi Locus Sequence Typing (MLST) method to genotype pathogenic bacteria (Maiden et al., 1998) has not only benefited molecular epidemiology, but has also greatly improved our understanding of bacterial evolution (Feil et al., 2001; Feil, 2004; Maiden, 2006). MLST consists of sequencing fragments of multiple housekeeping genes (genes encoding proteins essential for cell metabolism) spread around the chromosome. A clear advantage of this method over gel-based methods such as RFLP is that data are unambiguous and can be easily accessed and compared through online databases. Also, in contrast to many other genotyping methods, the resulting sequence data are readily amenable to population genetic analyses (Maiden et al., 1998).

Tests for recombination are routinely performed in MLST-based studies and it has become clear that homologous recombination rates (HRR) vary widely between different species (for example, Maynard Smith et al., 1993; Feil et al., 2001; Hanage et al., 2006; Narra and Ochman, 2006; Perez-Losada et al., 2006). The underlying causes of this variation, however, are rarely addressed and not well understood. Calculating a measure of recombination rate, rather than simply detecting a significant presence or absence of homologous recombination events, enables an explicit comparison between species. This allows the variation in HRR to be reviewed in the light of phylogeny and ecology. Similar HRR among species having comparable ecologies but belonging to divergent taxonomic groups could indicate that recombination rates have evolved because of adaptive evolution. On the other hand, different HRR among species having comparable ecologies but belonging to divergent taxonomic groups could imply that recombination rates are evolutionarily constrained.

Why bacteria engage in homologous recombination is the subject of intense debate (Redfield, 2001; Narra and Ochman, 2006; Michod et al., 2008). Three main hypotheses have been brought forward to explain the evolutionary benefits of homologous recombination. The DNA repair hypothesis states that foreign DNA serves as a template to repair double-stranded breaks (Bernstein et al., 1981). According to the food hypothesis, incorporation of foreign DNA in the genome is a by-product of the uptake of DNA for metabolism (Redfield, 1993, 2001). Finally, the various hypotheses for the maintenance of sex in eukaryotes, that is, the removal of deleterious mutations and the combination of beneficial mutations, could be equally applied to bacteria (Narra and Ochman, 2006). Elevated HRR in certain groups thus could indicate increased need for DNA repair, increased importance of DNA for metabolism or a role for recombination to increase the efficacy of natural selection.

Many approaches are available to identify homologous recombination events and rates from sequence data. However, different methods vary in their ability to detect recombination (Posada, 2002; Stumpf and McVean, 2003; Didelot and Falush, 2007), making comparisons of datasets from the literature difficult. Here, we reanalyzed MLST data from a wide variety of species using the coalescent-based method implemented in the computer package ClonalFrame (Didelot and Falush, 2007). This method estimates the probability that a nucleotide is changed as the result of recombination relative to point mutation (r/m), which is a direct measure of the relative impact of recombination on sequence diversification (Guttman and Dykhuizen, 1994).

Materials and methods

Measuring the impact of homologous recombination using ClonalFrame

A commonly used evolutionary-based measure for the prominence of recombination in bacteria is the ratio of the rates of occurrence of recombination and mutation, ρ/θ (Milkman and Bridges, 1990). A wide spectrum of methods exists to estimate this ratio, using either microevolutionary techniques (Falush et al., 2001; Feil et al., 2004) or population genetics methodology (Fearnhead et al., 2005; Fraser et al., 2005; Jolley et al., 2005). The ratio ρ/θ is a measure of the frequency at which recombination occurs relative to mutation and therefore has an intuitive interpretation: if for example ρ/θ=2, recombination events occur twice as often as point mutations in the evolution of the population. However, since it ignores length and nucleotide diversity of imported fragments, it contains no information on the actual impact recombination has on evolutionary change.

To measure the relative effect of homologous recombination on the genetic diversification of populations, we decided to use the ratio r/m, or the ratio of rates at which nucleotides become substituted as a result of recombination and mutation (Guttman and Dykhuizen, 1994). For example, if r/m=10, then recombination introduces 10 times more nucleotide substitutions than do point mutations during the evolution of the population. This is compatible with a value of ρ/θ=2 if each recombination event introduces five substitutions on average. r/m can be estimated using eBURST (Feil et al., 2004; Spratt et al., 2004), but this method has the disadvantage of being based only on the differences between close relatives within clonal complexes, and could therefore produce inflated results if the role of recombination has increased in recent times. Here, we calculated the values of r/m using ClonalFrame (Didelot and Falush, 2007) (freely available from
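
The arithmetic linking ρ/θ and r/m in the example above can be written out explicitly. The following is only a minimal sketch and is not part of ClonalFrame or eBURST; the function name and the parameter `subs_per_event` (the average number of substitutions introduced by one recombination event) are illustrative assumptions, and each point mutation is assumed to introduce exactly one substitution.

```python
def r_over_m(rho_over_theta: float, subs_per_event: float) -> float:
    """Relative contribution of recombination versus mutation to substitutions.

    rho_over_theta: recombination events per point mutation event.
    subs_per_event: hypothetical average number of nucleotide substitutions
        introduced by a single recombination event (assumed parameter).
    Each point mutation is assumed to introduce exactly one substitution.
    """
    if rho_over_theta < 0 or subs_per_event < 0:
        raise ValueError("both arguments must be non-negative")
    return rho_over_theta * subs_per_event


# Example from the text: rho/theta = 2 with five substitutions per event.
example = r_over_m(2.0, 5.0)
```

Under these assumptions the example evaluates to r/m=10, matching the value quoted in the text.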

ClonalFrame attempts to reconstruct the clonal genealogy of a sample of strains, as well as the mutation and recombination events that took place on the branches of this genealogy, based on a coalescent model. The coalescent is a population genetics model that tracks the ancestry of present day individuals back in time to their last common ancestor (Kingman, 1982). It approximates the expected genealogy of a sample of individuals within a large population evolving under the Wright–Fisher model (Fisher, 1930; Wright, 1931). Mutation and recombination are assumed to occur at constant rates θ/2 and ρ/2 on the branches of the coalescent tree. When a mutation happens, it affects any nucleotide in the gene fragment with uniform probability and according to the Jukes–Cantor model of substitution (Jukes and Cantor, 1969). When a recombination event happens, it affects a stretch of DNA within which every nucleotide has an equal probability to be substituted (Didelot and Falush, 2007). By not attempting to reconstruct the origin of each recombination event within the population, ClonalFrame provides an accurate and efficient approximation of the computationally demanding coalescent with recombination model (Hudson, 1983). ClonalFrame is capable of estimating a number of evolutionary parameters, including r/m. As it uses Bayesian statistics, a credibility interval can be computed for each parameter, which is a direct reflection of our uncertainty in inferring the parameter from the data.

All datasets analyzed in this study are listed in Table 1 in order of inferred mean r/m value. A brief description of each dataset is given in the Supplementary Information. In the main text, r/m values are referred to as low (<1), intermediate (1–2), high (2–10) or very high (>10). These boundaries are arbitrary but facilitate discussion and roughly correspond to interpretations of recombination rates in the literature. The values in Table 1 should be interpreted only as a general indication of HRR in a species. Results will vary when a different sample of strains is used. Loci vary in their recombination rate (Mau et al., 2006), and so the choice of MLST loci will influence results. Some estimates will be imprecise because of suboptimal sampling from the natural population (see below). The more Sequence Types (unique combinations of MLST alleles) could be used in each analysis, the more statistical power was available to infer the genealogy and other parameters, resulting in tighter estimates of r/m. Finally, it has to be stressed that different populations belonging to the same species might have different HRR.
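
The verbal categories used in the main text can be expressed as a simple lookup. This is a minimal sketch; the function name is illustrative, and because the text does not say where values lying exactly on a boundary belong, the treatment of 1, 2 and 10 below is an assumption.

```python
def classify_r_over_m(value: float) -> str:
    """Map an r/m estimate to the verbal categories used in the main text.

    The text gives the ranges <1, 1-2, 2-10 and >10. Assumption of this
    sketch: exactly 1 counts as "intermediate", exactly 2 and exactly 10
    count as "high".
    """
    if value < 0:
        raise ValueError("r/m cannot be negative")
    if value < 1:
        return "low"
    if value < 2:
        return "intermediate"
    if value <= 10:
        return "high"
    return "very high"
```

Because the boundaries are arbitrary, a category assigned to a point estimate close to a boundary should be read together with its credibility interval.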

Table 1 The ratio of nucleotide changes as the result of recombination relative to point mutation (r/m) for different bacteria and archaea estimated from MLST data using ClonalFrame

ClonalFrame settings

All values of r/m were computed with the scaled mutational rate θ set equal to Watterson's moment estimator (Watterson, 1978). For each dataset, two runs of the ClonalFrame MCMC were performed, each consisting of 200 000 iterations. The first half of the chains was discarded, and the second half was sampled every hundred iterations. The Gelman–Rubin statistic (Gelman and Rubin, 1992) was then computed for r/m in each dataset to assess convergence and mixing properties of the MCMC. For the datasets in which we found a Gelman–Rubin statistic above 1.1, longer runs were performed consisting of 2 000 000 iterations. We then recomputed the Gelman–Rubin statistics and found all of them to be satisfactory (that is, below 1.1). Graphical comparisons of the traces of the likelihood and model parameters demonstrated that the runs were properly converged and mixed. For each dataset, the results of the two ClonalFrame runs were then concatenated, and the reported values of the mean and 95% credibility interval were computed based on the resulting posterior samples. The total computational cost of all ClonalFrame runs combined was approximately 1000 CPU hours.
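
The post-processing steps described above (discarding the first half of each chain, thinning, checking the Gelman–Rubin statistic and pooling the two runs) can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the traces are synthetic random numbers rather than ClonalFrame output, and the standard two-variance form of the Gelman–Rubin statistic is assumed.

```python
import random
import statistics


def thin_second_half(trace, step=100):
    """Discard the first half of an MCMC trace and keep every `step`-th draw."""
    return trace[len(trace) // 2::step]


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [statistics.mean(c) for c in chains]
    # W: mean within-chain variance; B: n times the variance of the chain means.
    w = statistics.mean(statistics.variance(c) for c in chains)
    if w == 0:
        raise ValueError("chains have zero within-chain variance")
    b = n * statistics.variance(chain_means)
    pooled_var = (n - 1) / n * w + b / n
    return (pooled_var / w) ** 0.5


def summarize(chains):
    """Posterior mean and 95% credibility interval from the concatenated chains."""
    pooled = [x for c in chains for x in c]
    cuts = statistics.quantiles(pooled, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return statistics.mean(pooled), cuts[0], cuts[-1]


# Synthetic stand-in for two r/m traces of 200 000 iterations each (not real data).
rng = random.Random(0)
raw = [[rng.lognormvariate(0.0, 0.3) for _ in range(200_000)] for _ in range(2)]
kept = [thin_second_half(t) for t in raw]   # 1000 retained draws per run
r_hat = gelman_rubin(kept)
converged = r_hat < 1.1                     # threshold used in the text
mean, lower, upper = summarize(kept)
```

With 200 000 iterations per run, this scheme retains 1000 samples per chain, so the reported mean and interval for each dataset would rest on 2000 pooled posterior samples.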

Selection of loci

All datasets analyzed here are based on multiple, selectively constrained housekeeping loci. The use of multiple loci buffers against possible variation in HRR across the genome as well as against stochastic variation. Intergenic spacer regions, genes under diversifying selection and genes encoding ribosomal subunits were not included because of potential confounding effects of selection on the detection of HRR. The r/m values surveyed here are taken to be representative of HRR of the majority of selectively constrained protein-encoding loci located on the chromosome (the ‘core genome’).

Selection of strains

Representative sampling of bacterial populations is required to estimate recombination rates that are biologically meaningful. There are two main ways in which non-representative sampling can lead to an underestimation of the actual recombination rate: (1) when multiple distinct populations are lumped together and (2) when certain genotypes are over-represented in a sample (Figure 1). Avoiding these pitfalls requires detailed knowledge of the biology of the species in question.

Figure 1

Sampling clones from a population. Homologous recombination events are depicted by arrows and take place primarily within separate evolutionary lineages (ecotypes). Asterisks represent sampled clones. Sampling scheme 1 is biased because it does not differentiate between distinct ecotypes. Sampling scheme 2 shows correct sampling from a distinct ecotype. Sampling scheme 3 is biased towards an epidemic clone within an ecotype (visualized by the increased width of the lineage).

Distinct populations within a species may emerge because of differential local adaptation and/or genetic drift. These clusters of closely related genotypes within a named species are often termed ecotypes (Cohan, 2002). It is plausible that ecotypes could differ in their HRR because of adaptive evolution or environmental constraints. When a population sample contains different ecotypes inhabiting distinct, spatially separated micro-niches that preclude the close contact necessary for genetic exchange, HRR will be underestimated. Similarly, ecotypes inhabiting identical micro-niches in different locations are less likely to exchange DNA than clones from the same location. Evidence for this process has been found in the soil bacterium Rhizobium leguminosarum, where clonality was less pronounced at a regional scale than it was at a global scale (Souza et al., 1992).

Highly successful clones will become widespread in a population. Maynard Smith et al. (1993) first pointed out that the over-representation of closely related, high frequency (epidemic) clones in a sample will lead to an inflated estimate of clonality of the population as a whole. Oversampling of a single clone in an epidemic population structure will therefore result in an underestimation of HRR. Although pooling of distinct populations will generally result in an underestimation of HRR, it is possible to overestimate HRR of a given ecotype when it is lumped together with ecotypes that have higher HRR (Figure 1).

To avoid potential confounding effects of spatial population structure, local or regional strain collections were analyzed instead of global collections when possible. For studies where strains were found to cluster in multiple, deep-branching clades, only one such clade was analyzed to avoid possible pooling of distinct ecotypes (Cohan, 2002), each with possibly distinct HRR. Only one representative of each Sequence Type was included in the analysis to avoid possible effects of epidemic population structure.
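
Retaining a single representative per Sequence Type amounts to removing duplicate allelic profiles. The following is a minimal sketch of that step; the function name, isolate names and allele numbers are invented for illustration, and keeping the first isolate encountered for each profile is an assumption rather than the documented procedure.

```python
def one_per_sequence_type(isolates):
    """Keep the first isolate seen for each unique MLST allelic profile.

    `isolates` is an iterable of (name, profile) pairs, where `profile` is a
    sequence of allele identifiers, one per locus, in a fixed locus order.
    """
    seen = set()
    representatives = []
    for name, profile in isolates:
        key = tuple(profile)
        if key not in seen:
            seen.add(key)
            representatives.append((name, key))
    return representatives


# Hypothetical isolates typed at seven loci (names and alleles are invented).
isolates = [
    ("iso1", [1, 3, 1, 1, 2, 4, 1]),
    ("iso2", [1, 3, 1, 1, 2, 4, 1]),  # same Sequence Type as iso1
    ("iso3", [2, 3, 1, 5, 2, 4, 1]),
]
kept = one_per_sequence_type(isolates)
```

In this toy example iso2 is dropped because it shares its allelic profile with iso1, so an over-sampled clone contributes only once to the analysis.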

Species classification

Bacterial species were classified according to ecology in the following broad groups: (1) extremophiles, (2) marine and aquatic bacteria, (3) terrestrial bacteria, (4) commensals, that is, species that are part of the normal flora of humans or other animals, (5) obligate pathogens and (6) endosymbionts. Depending on their environment of origin, opportunistic pathogens are classified in group 2, 3 or 4. Within each group, species are divided according to phylum, with the proteobacteria further subdivided into the α-, β-, δ-, ɛ- and γ-divisions.

Results and discussion

Variation in HRR

Great variation in HRR was detected among species (Table 1). The lowest and highest r/m point estimates differ by three orders of magnitude. The upper bound of the 95% credibility interval of the species with the lowest r/m and the lower bound of the 95% credibility interval of the species with the highest r/m are over two orders of magnitude apart. It is obvious that homologous recombination is a powerful force in shaping the genetic diversity of a wide range of bacteria and archaea as its ability to change genomes exceeds that of the process of mutation (that is, r/m>1) in 56% (27/48) of the datasets analyzed. Neisseria and Helicobacter are frequently used as examples of bacteria with very high HRR, but lesser known species Flavobacterium and Pelagibacter were found to be even more recombinogenic.


Extremophiles

Extremophiles have attracted attention from microbial ecologists partly because the isolation of their habitats (such as geothermal vents, seeps, springs and salt lakes) results in potentially strongly structured populations, and therefore offers a special opportunity to study microbial biogeography. The hot spring inhabiting cyanobacterium Mastigocladus laminosus has a low-to-intermediate HRR. Two archaea, the thermoacidophile Sulfolobus and the halophile Halorubrum, have similar, intermediate HRR. Homologous recombination has been detected in the bacteria Thermotoga (Nesbo et al., 2006) and Leptospirillum (Lo et al., 2007) and the archaeon Ferroplasma (Tyson et al., 2004; Eppley et al., 2007), but these findings were based on non-MLST methods and so could not be included here.

Marine and aquatic bacteria (including opportunistic pathogens)

There has been a surge in sequence-based research on marine prokaryotes in recent years (for example, Rusch et al., 2007). However, relatively few research efforts have focused on population-level sequence variation. The oceanic species Pelagibacter ubique has very high HRR. HRR is also very high in the pelagic freshwater cyanobacterium Microcystis but low in the benthic marine cyanobacterium Microcoleus. MLST data on the marine cyanobacterium Nodularia were not reanalyzed as they were based on non-housekeeping loci but indicate high HRR (Hayes et al., 2002). Environmental isolates of marine and estuarine Vibrio parahaemolyticus and V. vulnificus were found to have very high HRR. Disease-related lineages in both species show lowered HRR, which is consistent with epidemic spread of a subset of virulent clones (Chowdhury et al., 2004; Perez-Losada et al., 2006; Bisharat et al., 2007). HRR is high in the γ-proteobacterium Plesiomonas shigelloides, found in freshwater and estuarine environments as well as in the gastrointestinal tracts of a wide variety of animals. It can cause gastrointestinal disease in humans after consumption of seafood or contact with untreated water (Salerno et al., 2007).

Terrestrial bacteria (including opportunistic pathogens)

A number of MLST studies have been carried out for proteobacteria that live in soil, or are associated with plants. The α-proteobacterium Rhizobium gallicum has very low HRR, the β-proteobacterium Ralstonia solanacearum has intermediate HRR and the δ-proteobacterium Myxococcus xanthus has high HRR. The Pseudomonads are ubiquitous γ-proteobacteria in soil environments. HRR was found to be intermediate in both Pseudomonas syringae and P. viridiflava. Data on P. stutzeri (Cladera et al., 2004) and Phl-producing Pseudomonads (Frapolli et al., 2007) are indicative of similar HRR. The nitrogen fixing soil bacterium Klebsiella pneumoniae is an important opportunistic pathogen for which we found a low HRR.

One of the first MLST studies on free-living bacteria investigated the Firmicutes Bacillus cereus, B. thuringiensis and B. weihenstephanensis, occurring sympatrically in soil (Sorokin et al., 2006). In agreement with the original study, the first two species were found to have low HRR with B. weihenstephanensis having higher HRR. HRR was found to be higher in another local B. thuringiensis population isolated from clover (r/m=2.0, credibility interval 1.2–3.1; see Supplementary Information). Firmicutes species for which less well-defined populations were sampled are Listeria monocytogenes and Oenococcus oeni.

Commensals (including opportunistic pathogens)

This group is largely composed of species that inhabit the gastrointestinal tract, the respiratory tract and skin. The gastrointestinal lifestyle of some commensals means that they can also be common in the environment. The β-proteobacterium Neisseria meningitidis is one of the best-known examples of bacteria with high HRR. The related, but never pathogenic commensal N. lactamica also has high HRR. The microaerophilic ɛ-proteobacteria in the genus Campylobacter inhabit the gut and can cause intestinal infection. A C. insulaenigrae population isolated from northern elephant seals displayed high HRR as did Campylobacter jejuni isolated from farm animals and the environment. We classified Helicobacter pylori as an opportunistic pathogen as it inhabits the stomachs of over half the global human population but only occasionally causes disease (Falush et al., 2001). It is one of the best-known examples of bacteria with very high HRR (Suerbaum et al., 1998; Falush et al., 2001).

The γ-proteobacterium Escherichia coli was the first model species in the study of bacterial population structure (Guttman, 1997). It is a ubiquitous commensal in the intestine of mammals and birds, but certain types are also known to persist in the environment (Walk et al., 2007). Although usually harmless, E. coli also encompasses several important pathogenic lineages. Strains belonging to well-defined clade ET-1 are prevalent in freshwater environments (Walk et al., 2007) and have low HRR. The intestinal γ-proteobacterium Salmonella enterica on the other hand was found to have very high HRR. The γ-proteobacterium Haemophilus influenzae is a commensal in the upper respiratory tract of humans (Gilsdorf, 1998); HRR was found to be moderately high. H. parasuis is a commensal and opportunistic pathogen of the respiratory tract of pigs (Olvera et al., 2006). Two divergent lineages, one consisting of mainly non-pathogenic isolates (Table 1) and one consisting of mainly pathogenic isolates (Supplementary Information), were analyzed, with the latter having the higher recombination rate. The γ-proteobacterium Moraxella catarrhalis resides in the upper respiratory tract where it can cause diseases; HRR was found to be very high.

Streptococci are Firmicutes found on the skin, in the intestine and in the upper respiratory tract and can cause a range of infections. HRR is very high in S. pneumoniae and in S. pyogenes. Staphylococcus aureus inhabits skin and nasal mucus and can cause a variety of infections; its HRR was found to be very low. Clostridium difficile can be found in low numbers in the gastrointestinal tract, where it can cause diarrhoea, as well as in the environment; data indicate very low HRR. The gastrointestinal opportunistic pathogens Enterococcus faecalis and E. faecium were found to have low and intermediate HRR, respectively. The gastrointestinal commensal Lactobacillus casei has low HRR. The only Firmicute belonging to the class Mollicutes for which data were available, Mycoplasma hyopneumoniae, has a moderately high HRR. Finally, the intestinal Spirochaetes Brachyspira and Leptospira were found to have very low HRR.

Obligate pathogens

Obligate pathogens are specialized parasites that are primarily associated with disease. However, as the biology of most species is not well-known, it is possible that some species classified as obligate pathogens are actually opportunists from the environment or unrepresentative strains of commensals. The α-proteobacterium Bartonella and the β-proteobacterium Bordetella exhibit very low genetic diversity and have very low HRR. HRR is low as well in the γ-proteobacterium Yersinia pseudotuberculosis. The γ-proteobacterium Legionella pneumophila is an intracellular pathogen of protozoa. The relatively high optimum temperature it prefers permits it to occasionally thrive in spas and cooling towers where it can be transmitted to human airways and cause Legionnaires' disease. A local population was found to exhibit low HRR. Porphyromonas gingivalis, causing periodontal disease in humans, and Flavobacterium psychrophilum, causing disease in salmonid fish, are the only representatives of the phylum Bacteroidetes for which MLST data are available. In sharp contrast with low HRR in Porphyromonas, Flavobacterium was found to have the highest HRR of all species analyzed here. Chlamydia trachomatis is an obligate intracellular pathogen belonging to the phylum Chlamydiae; HRR was found to be low.


Endosymbionts

The Wolbachia α-proteobacteria form a peculiar group of intracellular arthropod and nematode symbionts with genomes profoundly shaped by the loss and inactivation of genes (Tamas et al., 2002). Wolbachia has high HRR. No information is available on other endosymbionts.

Association between HRR and phylogeny

The number of species for which MLST data are available is small, especially given the astounding diversity of bacteria and archaea. It is therefore not possible to statistically test whether HRR is elevated in certain phylogenetic or ecological groups. However, this dataset clearly demonstrates wide variation in HRR among species belonging to the same phylum or division. Examples are the γ-proteobacteria Klebsiella and Vibrio and the Bacteroidetes Porphyromonas and Flavobacterium (Table 1). An even more striking instance is provided by S. enterica, for which we estimated an HRR almost 50 times higher than for E. coli despite the fact that they belong to the same family. Figure 2 shows the r/m point estimates of all phyla (or divisions) for which three or more representatives were available. It is evident that variation within phyla is of the same order as variation among phyla (note the log10-scale).

Figure 2

Range of inferred mean r/m values of the phylogenetic groupings for which three or more representative species were available. Greek letters refer to the Proteobacteria divisions, Firmi=Firmicutes, Cyano=Cyanobacteria. All r/m values (as well as credibility intervals) are listed in Table 1.

Several genera are represented by two different species in this study. HRR is similar for the pairs of Vibrio, Streptococcus, Neisseria, Haemophilus, Campylobacter and Pseudomonas species analyzed (Table 1). HRR of the three Bacillus species are also quite similar. The variation in HRR thus seems to decrease with finer scales of taxonomic resolution, as expected when phylogeny is an important determinant of HRR. The role of phylogeny in determining HRR, however, is obscured by its strong correlation with ecology. For example, both Neisseria species are commensals of the nasopharynx and both Pseudomonas species are plant pathogens. When more data become available the hypothesis that the evolution of HRR is constrained at the genus level could be falsified.

Association between HRR and ecology

As found in earlier reviews (Feil et al., 2001; Hanage et al., 2006; Perez-Losada et al., 2006), bacteria that cause disease vary widely in HRR. This is true for obligate pathogens, commensals and opportunistic pathogens from the environment. This is unsurprising, as great variation exists in pathogenic lifestyles, for example, in host species, host range, virulence, site of infection, mechanisms of immune evasion and host-to-host transmission. When considering truly free-living, non-animal associated species, one particular trend seems to emerge. HRR is high or very high in all marine and aquatic species examined, with the exception of Microcoleus (Table 1). Interestingly, the fish pathogen Flavobacterium psychrophilum also has very high HRR. In contrast, HRR of terrestrial bacteria analyzed is low or intermediate across all phyla/divisions analyzed, with the exception of Myxococcus (Table 1). Data for only three, widely divergent, extremophile species were analyzed; all three had similar HRR. Figure 3 shows all r/m values of the three types of free-living bacteria.

Figure 3

All inferred mean r/m values of terrestrial species (open triangles), marine/aquatic species (black squares) and extremophiles (grey diamonds). Therm=Thermoprotei, Halo=Halobacteria, other abbreviations as in Figure 2. Species included are: B. cereus, B. thuringiensis, B. weihenstephanensis, K. pneumoniae, L. monocytogenes, M. xanthus, O. oeni, P. syringae, P. viridiflava, R. solanacearum, R. gallicum (terrestrial), M. aeruginosa, M. chthonoplastes, P. shigelloides, P. ubique, V. parahaemolyticus, V. vulnificus (marine/aquatic) and Halorubrum, M. laminosus and S. islandicus (extremophile). All r/m values (as well as credibility intervals) are listed in Table 1.


The comparative method is the most general way to approach patterns of evolutionary change (Harvey and Pagel, 1991). Most of the species for which data are currently available are (opportunistic) pathogens or agronomically important bacteria. However, the MLST method is gaining popularity among microbial ecologists, and more data on the population structures of the overwhelming majority of free-living, non-pathogenic bacteria are expected in the near future. The accumulation of sequence-based, population-level studies will enable more systematic testing whether certain ecological variables correlate with a particularly high homologous recombination rate.