Introduction

Plant roots host a wide variety of bacteria, many of which cooperate with the plant and enhance plant nutrition, stress tolerance or health1. Several modes of action are documented in these Plant Growth-Promoting Rhizobacteria (PGPR). Direct effects on plants may involve enhanced availability of nutrients2,3, stimulation of root system development via production of phytohormones and other signals4 or interference with the plant's ethylene synthesis5,6, and/or induced systemic resistance7. Indirect beneficial effects of PGPR on plants entail competition with or antagonism towards phytoparasites8,9.

Despite extensive literature on PGPR modes of action (especially in the Proteobacteria), the molecular features that define a PGPR remain elusive, because the PGPR status is not always well defined. First, PGPR may occupy different microbial habitats, as they range from saprophytic soil bacteria that colonize the rhizosphere to bacteria that can also colonize internal root tissues. This means that PGPR are often not easy to distinguish from saprophytes without plant-beneficial effects (especially plant commensals) on the one hand, and from vertically-inherited endophytes or plant endosymbionts on the other. Second, several bacteria occupy alternative ecological niches and some may function as PGPR at times. For instance, certain tumor-inducing Agrobacterium strains have plant growth stimulation potential on non-susceptible plant hosts10, a property also found in an Escherichia coli gut commensal10. Third, the genes implicated in plant-beneficial functions range from genes directly conferring plant-beneficial properties, such as nif (nitrogen fixation)11 or phl (phloroglucinol synthesis)12, to genes contributing to a variety of cell functions that indirectly or secondarily include plant-beneficial ones, such as pqq (pyrroloquinoline quinone synthesis)13. Fourth, many PGPR strains are not yet recognized as such (as determination of PGPR status requires experimental assessment) and it is very likely that not all plant-beneficial traits and the corresponding genes have been identified yet. Fifth, the assessment of genes encoding plant-beneficial properties is commonly restricted to particular bacterial clades14 if not particular PGPR strains9,12, without a more general analysis of gene distribution across several bacterial clades15.

Despite these limitations, a number of emblematic PGPR model strains have been extensively characterized over the last 20 years, uncovering the molecular basis of at least some of their plant-beneficial effects. These studies have shown that many PGPR strains typically harbor more than one plant-beneficial property8,16 and it could be hypothesized that the accumulation of genes contributing (whether directly or indirectly) to plant-beneficial traits has been selected by the interaction of these bacteria with plants. On this basis, it could even be expected that PGPR might be identified by their particular assortment of genes contributing to plant-beneficial functions. So far, however, a more general description of the occurrence of these genes, including in bacteria not interacting with plants, is still lacking. Such knowledge would bring fundamental insights into the potential associations of phytobeneficial traits in PGPR and can now be obtained from genome comparisons and phylogenetic analyses17,18.

Hence, our objective was to assess the distribution of 23 genes contributing to eight key plant-beneficial functions using genomic and phylogenetic analyses, as well as ancestral state character reconstruction to infer possible gene transfers. These plant-beneficial function contributing genes (hereafter referred to as PBFC genes) were investigated using the genomes of 25 emblematic proteobacterial PGPR (i.e. bacteria colonizing root surface and/or tissues and displaying plant growth-promotion effects). These genomes were also compared with those of 279 other Alpha-, Beta- and Gammaproteobacteria representing various taxonomic groups and ecological status, such as (i) endophytes/symbionts (i.e. asymptomatic, endophytic bacteria possibly in symbiotic interaction with the plant but for which plant-beneficial effects are not documented, as well as root-nodulating, diazotrophic bacteria), (ii) saprophytes (i.e. bacteria from various environments including soil; some of them possibly colonizing roots but without established plant-beneficial effects), (iii) plant pathogens and (iv) animal pathogens.

The 23 genes selected included (i) the nitrogenase-encoding genes nifHDK responsible for nitrogen fixation in proteobacterial PGPR from Azospirillum11, Burkholderia19 and other genera, (ii) the pyrroloquinoline quinone synthesis genes pqqBCDEFG contributing to mineral phosphate solubilization in the PGPR Pseudomonas fluorescens F11320, Erwinia herbicola21 and Enterobacter intermedium22, (iii) the indole-3-pyruvate decarboxylase/phenylpyruvate decarboxylase gene ipdC/ppdC of the indole-3-pyruvate pathway for synthesis of the auxinic phytohormone indole acetic acid (IAA) in Azospirillum brasilense Sp24515, Enterobacter cloacae UW523 and other Enterobacteriaceae PGPR24, (iv) the copper nitrite reductase gene nirK leading to formation of the NO root-branching signal in Azospirillum brasilense Sp24525, (v) the 1-aminocyclopropane-1-carboxylate (ACC) deaminase gene acdS in Pseudomonas putida GR12-226 and various other Pseudomonas PGPR6, which enables degradation of the plant's ethylene precursor, (vi) the acetoin genes budAB and 2,3-butanediol gene budC (induced systemic resistance) in the PGPR Enterobacter sp. 63827 and (vii) genes hcnABC (hydrogen cyanide) and phlACBD (2,4-diacetylphloroglucinol) for synthesis of antimicrobial compounds in P. fluorescens F113, P. protegens CHA0 and many other PGPR pseudomonads12.

Results

Contrasted co-occurrence patterns of PBFC genes in proteobacterial PGPR

In the 25 sequenced PGPR strains, which belonged to the genera Azospirillum, Rhizobium/Agrobacterium (Alphaproteobacteria), Azoarcus, Burkholderia (Betaproteobacteria) and Enterobacter, Klebsiella, Pantoea, Pseudomonas, Serratia (Gammaproteobacteria), the PBFC genes were found in 2 (for gene ppdC) to 20 (pqqBCDE) of the genomes (Table 1). The PGPR strains harbored from 1 (i.e. acdS in ‘Burkholderia cepacia’ 383 and B. phytofirmans PSJN) to 14 of the 23 PBFC genes studied (in P. protegens Pf-5, P. brassicacearum NFM421 and P. fluorescens F113), which gave 7.5 ± 3.1 PBFC genes per strain (Supplementary Fig. S1a). Fisher's exact test (P < 0.05) showed that phlACBD and hcnABC significantly co-occurred in certain PGPR strains, i.e. pseudomonads (Fig. 1). Three other separate groups of co-occurring genes were identified, i.e. budAB and ipdC, the operon nifHDK and the clustered genes pqqBCDE. No other significant co-occurrence of PBFC genes was found.
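The pairwise co-occurrence test can be illustrated with a minimal Python sketch. It assumes a one-sided Fisher's exact test for positive association on a 2x2 table built from two presence/absence vectors; the text does not state whether the tests were one- or two-sided, and the function names and toy data below are hypothetical, not taken from Table 1.

```python
from math import comb


def cooccurrence_table(profile_x, profile_y):
    """Build 2x2 counts from two presence/absence (1/0) vectors.

    Returns (both, x_only, y_only, neither).
    """
    pairs = list(zip(profile_x, profile_y))
    both = sum(1 for x, y in pairs if x and y)
    x_only = sum(1 for x, y in pairs if x and not y)
    y_only = sum(1 for x, y in pairs if not x and y)
    neither = sum(1 for x, y in pairs if not x and not y)
    return both, x_only, y_only, neither


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test (assumption: positive association).

    a: genomes with both genes, b: gene X only,
    c: gene Y only, d: neither gene.
    Returns P(count of 'both' >= a) with the table margins held fixed.
    """
    n = a + b + c + d
    row1 = a + b  # genomes carrying gene X
    col1 = a + c  # genomes carrying gene Y
    denom = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # Hypergeometric probability of k genomes carrying both genes.
        p += comb(row1, k) * comb(n - row1, col1 - k) / denom
    return p


# Toy data (hypothetical): 10 genomes, two genes that always occur together.
gene_x = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
gene_y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p_value = fisher_exact_greater(*cooccurrence_table(gene_x, gene_y))
significant = p_value < 0.05  # an edge would be drawn in the network
```

In a network such as Fig. 1, each gene pair with a P value below the threshold would be connected by an edge.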

Table 1 Distribution of plant-beneficial function contributing (PBFC) genes according to the primary ecological lifestyle documented for the bacteria studied
Figure 1

Co-occurrence network of the PBFC genes for the 25 PGPR genomes.

The genes are depicted with a colored circle according to their encoded function. Each significant co-occurrence (Fisher's exact test; P < 0.05) is represented by an edge (line) linking the corresponding genes. Several PBFC genes found in PGPR (i.e. pqqF, pqqG, budC, nirK, ppdC and acdS) did not display significant co-occurrence with any other gene.

Similar or lower prevalence of PBFC genes in Proteobacteria of other ecological types

The genomes of 279 other sequenced Proteobacteria, corresponding to saprophytes or endophytes/symbionts without established PGPR status as well as pathogens of plants or animals, were also studied. For the 56 endophytes/symbionts, PBFC genes were found in 0 (for phlACBD) to 36 (pqqBCDE) of the genomes (Table 1). Whereas two bacteria did not display any of the 23 PBFC genes, these genes were common in others, with eight strains exhibiting as many as 10 PBFC genes each. Overall, the endophytes/symbionts harbored 6.1 ± 2.6 PBFC genes per strain, but the difference with PGPR was not significant (P = 0.06). Pairwise Fisher's exact tests of the co-occurrence of PBFC genes (P < 0.05) revealed four groups, i.e. hcnABC and pqqBCDE linked by pqqFG genes, as well as nifHDK/acdS and budAB/ipdC further apart (Fig. 2a).

Figure 2

Co-occurrence network of PBFC genes according to primary ecological classification of bacteria.

The genes are depicted with a colored circle according to their encoded function. Each co-occurrence is represented by an edge (line) linking the corresponding genes. Computations were made for (a) endophytes/symbionts, (b) saprophytes, (c) phytoparasites and (d) animal pathogens.

Within the 29 saprophytes, PBFC genes were found in 0 (for phlACBD and ppdC) to 26 (pqqBCDE) of the genomes (Table 1). Although three bacterial strains showed none of the studied genes, one strain (Pantoea sp. At-9b) exhibited as many as 12 genes. Overall, saprophytic strains contained 5.5 ± 1.8 genes per genome. This is significantly lower than in PGPR (P < 0.05) but not different from endophytes/symbionts (P = 0.44). Co-occurrence analysis of PBFC genes in saprophytic bacteria revealed five separate groups, i.e. hcnABC, pqqBCDE, pqqFG, nifHDK and budABC/ipdC (Fig. 2b).

In the 59 phytopathogenic bacteria, PBFC genes were found in 0 (for the 8 genes ppdC, phlACBD and hcnABC) to 35 (pqqCDE) of the genomes (Table 1). Whereas seven phytopathogens (Xylella sp. and Xanthomonas albilineans) did not contain any of the 23 PBFC genes, as many as 8 PBFC genes occurred in Erwinia and Pantoea species. This gave overall 4.3 ± 2.2 PBFC genes per strain, which was lower than in PGPR and endophytes/symbionts (P < 0.05) but not significantly lower than in saprophytes (P = 0.06). Pairwise Fisher's exact tests (P < 0.05) of the co-occurrence of PBFC genes revealed two independent groups, i.e. pqqBCDEFG linked to acdS via pqqG, and budABC/ipdC with nifHDK (Fig. 2c).

Most PBFC genes were not prevalent in the 135 animal pathogens. Except for nirK, present in 109 of them, the other PBFC genes were not often found (ranging from 4 genomes for pqqG, phlD and hcnABC to 44 genomes for acdS) or not found at all (nifHDK, ppdC, phlACB). The number of PBFC genes varied from 0 (in 9 animal pathogens) to 9 (in 7 other animal pathogens), i.e. 1.8 ± 1.2 PBFC genes per strain overall, which was lower than for all other ecological types (all P < 0.05). Pairwise Fisher's exact tests (P < 0.05) revealed a single group comprising three subgroups extensively linked with one another, i.e. budABC/ipdC, pqqBCDEF and hcnABC/pqqG/phlD (Fig. 2d).

Distribution of PBFC genes across all 304 proteobacterial genomes reveals taxonomic specificities

Whereas phlACB were only retrieved in PGPR (in 3 of 25 genomes), the other PBFC genes were recovered in bacteria of different ecological types. Many occurred in PGPR as well as in endophytes/symbionts, saprophytes and phytopathogens, especially pqqCDE (36 of 56, 26 of 29 and 35 of 59 genomes, respectively) and, with a lower prevalence, ipdC (2 of 56, 2 of 29 and 10 of 59 genomes, respectively) and nifHDK (23 of 56, 3 of 29 and 3 of 59 genomes, respectively). In contrast, the hcnABC genes were retrieved in PGPR (3 of 25 genomes), saprophytes (2 of 29 genomes), endophytes/symbionts (9 of 56 genomes) and animal pathogens (4 of 135 genomes), but were absent in plant pathogens.

The distribution of certain PBFC genes according to bacterial ecological type could, at least in part, reflect taxonomic properties. This is indicated by the occurrence of PBFC genes in taxa restricted to a given ecological type (Fig. 3). In particular, ppdC was only retrieved in certain Azospirillum PGPR and Bradyrhizobium in the endophyte/symbiont category. For many PBFC genes, however, their occurrence within a taxon was related to species/strain ecology. This was the case for phlACBD (Pseudomonas PGPR), hcnABC (all Pseudomonas types except phytopathogens) and nifHDK (mainly in PGPR and endophytes/symbionts from various proteobacterial taxa). The relation to ecology, if any, was not as strong for ipdC and budAB (Enterobacteriaceae), acdS (all Burkholderiaceae considered and various Alphaproteobacteria and Gammaproteobacteria), nirK and pqq genes (various Proteobacteria corresponding to several ecological types).

Figure 3

Phylogenetic distribution of genes along Proteobacteria phylogeny.

Internal circles: presence of a gene is indicated by a grey square and absence by a white square. Taxonomically coherent groups with the same gene content were collapsed for the sake of clarity. Biovars are indicated for Rhizobium leguminosarum and pathovars for Pseudomonas syringae.

The comparison of the 304 genomes showed that, unexpectedly, PBFC genes previously described as clustered (even forming operons in many cases) were not necessarily found together in the same genome (Fig. 3). For instance, pqqFG were close to pqqBCDE in Pseudomonas (and a few other genera), whereas pqqBCDE occurred without pqqG (encoding a family-S9 peptidase) and especially pqqF (encoding a family-M16 peptidase) in most other Proteobacteria. Similar observations were made for phlD and phlACB, as well as budAB and budC. Yet, the groups revealed by pairwise Fisher's exact tests (P < 0.01) corresponded mainly to genes involved in the same function (Fig. 4). This analysis showed that hcnABC and phlD linked the other phl genes with the six pqq genes, themselves linked to budABC/ipdC via nirK and to nifHDK. nifHDK were also linked, separately, to ppdC and to acdS.

Figure 4

Co-occurrence network of the PBFC genes for the 304 genomes.

The genes are depicted with a colored circle according to their encoded function. Each co-occurrence is represented by an edge (line) linking the corresponding genes. nirK does not appear in the figure because this gene did not show any significant co-occurrence with any other PBFC gene.

Distribution of PBFC genes is partly related to proteobacterial phylogeny

We assessed whether the distribution of PBFC genes exhibited significant phylogenetic signal, meaning that closely-related species have similar gene content. Fritz and Purvis D index analysis (Table 2) showed that distribution of the PBFC genes was significantly influenced by evolutionary relationships between proteobacterial species, as indicated by D scores significantly less than 1. The genes phlACBD, pqqFG, budABC, ipdC, ppdC and hcnABC showed a strong phylogenetic signal, while acdS, nifHDK and pqqBCDE showed weaker signals.

Table 2 Phylogenetic patterns of gene distribution in selected Proteobacteria. Values were calculated for the 1000 partitions of the species phylogenetic tree

Horizontal gene transfers had significant effects on PBFC gene distribution in Proteobacteria

When the impact of genome plasticity was assessed by computing acquisition and loss events across proteobacterial species, no loss was detected for pqqF, phlACBD, ppdC and nirK (Table 3). In contrast, a few losses were inferred for the other genes, ranging from 1 loss for pqqG, budABC, ipdC and hcnABC to 6 losses for pqqB. In comparison, acquisitions were more numerous, ranging from 1 for ipdC and budAB to 21 for acdS (Table 3).

Table 3 Number of acquisitions and losses for each PBFC gene according to proteobacterial species tree (ancestral character reconstruction)

All 23 genes appeared at least once in a distant ancestor of the species studied (Fig. 5). ipdC, ppdC and phlACBD are clade specific; ipdC appeared in the last common ancestor (LCA) of Pantoea and Erwinia genera, ppdC in the LCA of Azospirillum brasilense and the LCA of Bradyrhizobium strains ORS78 and BTAi1 and phlACBD in the LCA of Pseudomonas fluorescens F113 and Pseudomonas brassicacearum NFM421 and the LCA of Pseudomonas protegens Pf-5. budABC appeared in the LCA of Enterobacteriaceae; budAB are clade specific but budC was acquired at least 7 times in other clades. The pqqBCDEFG genes appeared in the LCA of Pseudomonas; pqqG, pqqF and pqqBCDE were acquired respectively 4, 5 and 15 times by other taxa. At the extreme, nifHDK underwent at least 18 acquisitions and acdS (which appeared in the Burkholderiaceae LCA) 21 acquisitions, in both cases across the three proteobacterial classes considered.

Figure 5

Reconstruction of acquisitions and losses of PBFC genes in relation to evolutionary history of sequenced bacteria.

When different strains of the same species had the same PBFC gene profile, only one representative strain was kept in the Maximum-Likelihood tree to avoid redundant information. Acquisitions are indicated by a blue arrow with a circle and losses by a red arrow with a triangle.

Discussion

In this study, plant-beneficial properties of PGPR were for the first time assessed on a broad scale, by considering (i) a large range of PBFC genes corresponding to various types of plant-beneficial properties, (ii) PGPR strains of contrasted taxonomic status (from the Alpha-, Beta- and Gammaproteobacteria) and (iii) a selection of non-PGPR Proteobacteria with primarily other biotic relations with plants (i.e. endophytes/symbionts and phytopathogens) or other types of ecology (i.e. saprophytes and animal pathogens).

It could have been thought that the PGPR status entailed the presence of a core collection of PBFC genes shared by all PGPR strains, but the current results based on 25 emblematic PGPR strains indicate that none of the 23 key PBFC genes of the study was common to all strains, even though as many as 20 PGPR genomes displayed pqqBCDE. PQQ is a co-factor potentially implicated in several cellular processes (and incidentally contributing also to phosphate solubilization), which may explain its wide occurrence in PGPR13,28. In comparison with Proteobacteria of other lifestyles, PBFC genes restricted to PGPR were not found, except for phlACB, but these genes were present in only 3 Pseudomonas PGPR. However, the number of PBFC genes increased along the continuum animal pathogens (only 1.8 PBFC genes/strain), phytopathogens, saprophytes, endophytes/symbionts, PGPR (as many as 7.5 PBFC genes/strain). The same findings were made when assessing the number of functions expected from these PBFC genes, except that the difference between animal and plant pathogens was not significant (Supplementary Fig. S1b).

Our gene distribution data suggest that PBFC genes might be selected in plant-associated habitats and counter-selected elsewhere, as exemplified by the very low number of these genes in animal pathogens (where only nirK was prevalent). This is in accordance with the expectation that most of the corresponding functions would not be relevant for animal physiology and that plants are not the primary habitat of these bacteria. For instance, nitrogen fixation is counter-selected in pathogenic bacteria29,30. In addition, results suggest that amongst all the plant-associated bacteria, specific lifestyle is also a major factor explaining distribution of PBFC genes, with higher prevalence in plant-beneficial strains. This possibility stems in particular from the comparison of (i) PGPR and endophytes/symbionts versus (ii) phytopathogens, despite the presence of PBFC homologs budABC, ipdC and/or acdS (not necessarily together; Fig. 2c) in many phytopathogens (Table 1). Indeed, many of the plant-beneficial traits found in PGPR could be used by endophytic Proteobacteria documented (or presumed) to be in a mutualistic symbiosis with the plant host. This would be a generalization of previous observations made with nifHDK31 and to a lesser extent acdS32.

Most PBFC genes were identified in bacteria from different ecological types (Table 1), which is an indication that (i) strain information was not always sufficient to determine lifestyle precisely and/or (ii) boundaries between different lifestyles may not be very stringent in Proteobacteria. The first possibility is clear in the case of saprophytes, as this category contains a number of strains originating from bulk or rhizosphere soil but for which the PGPR potential has not been experimentally tested, raising the possibility that some of them could indeed be PGPR. Similarly, certain PGPR can also be endophytic, e.g. Azospirillum sp. B510 and Azoarcus sp. BH7238, but some of the endophytes studied here have not been assessed for their effects on plants and so could not be listed among the PGPR. The second possibility is illustrated by many animal pathogens belonging to Pseudomonas33 or the Enterobacteriaceae34 that can colonize plants asymptomatically, probably because these alternative hosts promote bacterial survival before the bacteria recolonize the next primary animal host35. This could explain why certain animal-associated strains displayed PBFC genes. Furthermore, opportunistic human pathogens such as Pseudomonas aeruginosa PAO1 and PA14 can also infect roots and lead to plant death36.

Mutational inactivation of a particular PBFC gene may reduce (without necessarily abolishing) plant-beneficial effects in PGPR strains1,37 and genetic acquisition of an additional PBFC gene has the potential to enhance PGPR performance8,39. This suggests that possessing multiple PBFC genes should make a strain more efficient at enhancing plant growth. In this context, the analysis of co-occurrence patterns (Fisher's exact tests) can be useful to identify selection of multiple PBFC genes and their potential synergistic effects. However, gene co-occurrence may also arise because species that share a recent evolutionary history also share similar gene contents, a phenomenon known as phylogenetic signal. Indeed, the Fritz and Purvis index clearly pointed to gene associations related to phylogenetic signal, i.e. PBFC genes were more likely to be conserved in closely-related species. This also raises the possibility that the potential to become a PGPR may rely (at least in part) on ancestral features in the corresponding bacterial taxa, which is in line with previous findings on particular PGPR populations40 and more generally on function distributions in Gammaproteobacteria41.

The distribution pattern of PBFC genes amongst Proteobacteria of various lifestyles and the relation to bacterial taxonomy prompted us to assess in more detail the evolutionary history of these genes. Ancestral character reconstructions showed few losses of PBFC genes, even in animal-associated bacteria, but many more gene acquisitions. Indeed, the role of horizontal gene transfer has been substantiated with various types of PGPR42,43 and suggests that cooperative interactions between Proteobacteria and plant roots might have been established separately in various taxa, yielding PGPR strains whose effect(s) on the plant may rely on different and taxon-specific combinations of modes of action. Further genome sequencing efforts targeting close relatives of these PGPR would be needed to confirm this possibility. Despite conservation of PBFC genes across different ecological lifestyles, a differential use/regulation of these genes depending on environmental and host conditions is likely44, as can take place during exaptation45. Indeed, the expression patterns of PBFC genes according to taxonomic and/or lifestyle properties are an important ecological issue, which deserves further research attention. Bacterial adaptation to new niches is mainly dependent on genetic novelty46,47, which may entail gene acquisitions46 or differential regulation48. Many examples of traits conferring environmental adaptation that were further co-opted as virulence factors are documented in human pathogens49. Similar processes are likely to have taken place in PGPR as well50.

In conclusion, the comparison of taxonomically-contrasted proteobacterial PGPR with a wide range of related, non-PGPR bacteria suggested that the emergence of the PGPR status could have paralleled accumulation of PBFC genes in root-adapted bacteria. It is likely that this process took place separately in taxonomically-contrasted Proteobacteria and involved ancient gene acquisitions, which explains why subsequent diversification produced taxonomic subgroups of PGPR strains differing from one another in the range of PBFC genes accumulated.

Methods

Selection of genomes

The genomes used were selected among those available in October 2012. They corresponded to 25 PGPR, 56 endophytes/symbionts (35 endophytes and 21 root-nodulating bacteria), 29 saprophytes (3 from water environments, 6 from bulk soil, 16 from the rhizosphere and 4 from healthy animal samples), 59 plant pathogens and 135 animal pathogens (124 of them infecting humans). Since gene distribution can be influenced by phylogenetic relatedness, also called phylogenetic signal18, genomes were chosen so as to balance the prevalence of the various Alpha-, Beta- and Gammaproteobacterial groups for which PGPR genomes were available, following two principles. First, the primary lifestyle of the selected bacteria had to be documented sufficiently clearly and their genomes had to be fully sequenced (except in a few cases for orders of particular interest). Second, bacterial orders in which PGPR representatives were available were assessed for genome availability of bacteria corresponding to other lifestyles (especially within the same or closely-related families/genera) and, if this was unsuccessful, the phylogenetically-closest order was then targeted.

Homologs retrieval

Homologs of genes contributing to a phytobeneficial function in PGPR were retrieved using a BLAST-based method. A protein to protein search was done using Blastp51 with a subset of genes documented to contribute to a given phytobeneficial function (Supplementary Table S1). As annotations in public databases may contain errors or sometimes fail to accurately predict gene identity, we then did a tblastn51 search on genomic sequences to overcome these limitations. An E-value threshold of 1e-15 was set to filter blast searches.
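The E-value filtering step can be sketched as follows. This is a minimal illustration, assuming the searches were exported in the default 12-column BLAST tabular layout and that the threshold is inclusive; neither detail is stated in the text, and the sequence identifiers in the example are hypothetical.

```python
E_VALUE_THRESHOLD = 1e-15  # threshold stated in the text


def filter_blast_hits(lines, max_evalue=E_VALUE_THRESHOLD):
    """Keep hits whose E-value is <= max_evalue.

    Assumes the default tabular columns:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            kept.append(
                {
                    "query": fields[0],
                    "subject": fields[1],
                    "pident": float(fields[2]),
                    "evalue": evalue,
                }
            )
    return kept


# Hypothetical hits: one strong match and one weak match.
example = [
    "acdS_ref\tgenomeA_0012\t62.5\t330\t120\t2\t1\t330\t5\t334\t3e-120\t410",
    "acdS_ref\tgenomeB_0456\t28.1\t90\t60\t3\t10\t99\t40\t129\t2e-04\t38.5",
]
hits = filter_blast_hits(example)
```

Only the first hit passes the filter; the second is discarded because its E-value is far above 1e-15.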

Protein family assignment

Assignment of homologous proteins to families having the same putative function was done using a combination of significant sequence identity (see above) and protein domain assignment. Protein domain assignment was done using rps-blast52 and the Conserved Domain Database (CDD)53. We separated the NCBI-curated domains (which are considered more accurate) and the domains from external sources into two distinct databases. The NCBI-curated database was preferentially used for protein domain assignments, while the external-source database was used when the NCBI-curated one did not return results. Proteins were considered to be of the same family if they (i) had at least 30% identity over at least 70% of their respective protein sequence length and (ii) shared the same domains with a reference phytobeneficial protein. Phylogenetic profiles (corresponding to a binary vector with a gene's presence and absence respectively indicated as 1 and 0 for each genome) were used to represent the presence/absence of a particular gene in the different organisms for analysis of phylogenetic signal and ancestral state reconstruction.
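The two family-assignment criteria and the resulting phylogenetic profile can be sketched in Python. This is an illustration under stated assumptions: coverage is taken as aligned length divided by full protein length for both proteins, and "shared the same domains" is interpreted as identical domain sets. All function names, genome names and domain labels are hypothetical.

```python
def same_family(identity_pct, aligned_len_candidate, aligned_len_reference,
                candidate_len, reference_len,
                candidate_domains, reference_domains,
                min_identity=30.0, min_coverage=0.70):
    """Apply the identity, coverage and domain criteria from the text."""
    coverage_candidate = aligned_len_candidate / candidate_len
    coverage_reference = aligned_len_reference / reference_len
    identity_ok = identity_pct >= min_identity
    coverage_ok = (coverage_candidate >= min_coverage
                   and coverage_reference >= min_coverage)
    # Assumption: the candidate must carry exactly the reference domain set.
    domains_ok = set(candidate_domains) == set(reference_domains)
    return identity_ok and coverage_ok and domains_ok


def phylogenetic_profile(genomes, genomes_with_gene):
    """Binary vector: 1 if the gene is present in a genome, else 0."""
    return [1 if genome in genomes_with_gene else 0 for genome in genomes]


# Hypothetical example: a candidate aligned over 300 of its 340 residues.
accepted = same_family(
    identity_pct=45.0,
    aligned_len_candidate=300, aligned_len_reference=300,
    candidate_len=340, reference_len=338,
    candidate_domains=["domain_A"], reference_domains=["domain_A"],
)

genomes = ["genome1", "genome2", "genome3", "genome4"]
profile = phylogenetic_profile(genomes, {"genome1", "genome4"})
```

Profiles of this kind, one per gene, are the input assumed by the phylogenetic signal and ancestral state analyses described in the following sections.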

Gene distribution

For statistical analysis of the number of PBFC genes and number of corresponding functions per genome, according to primary bacterial ecology, the Wilcoxon test was used with the R command wilcox.test (P < 0.05).
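For readers working outside R, the comparison can be approximated with the Python sketch below. It is a stand-in, not the R implementation: it implements the two-sided Wilcoxon rank-sum test using the normal approximation with tie and continuity corrections, whereas wilcox.test may compute an exact P value for small samples without ties. The gene counts in the example are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, P value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of t^3 - t over groups of tied values.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0
    diff = u - n1 * n2 / 2
    # Continuity correction of 0.5 towards zero.
    correction = 0.5 if diff > 0 else (-0.5 if diff < 0 else 0.0)
    z = (diff - correction) / sqrt(variance)
    return u, min(1.0, erfc(abs(z) / sqrt(2)))


# Hypothetical PBFC gene counts per genome in two ecological groups.
group_a = [1, 2, 3]
group_b = [4, 5, 6]
u_stat, p_value = rank_sum_test(group_a, group_b)
```

With samples this small the normal approximation is rough; the sketch is meant to show the ranking logic rather than to reproduce the reported P values.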

Proteobacterial phylogenetic tree

The proteobacterial phylogenetic tree was based on 31 housekeeping markers identified, aligned and trimmed with Amphora2, as done previously54. Trees were inferred by ExaML55 with the concatenated alignment, 1000 replicates and the PSR model of rate heterogeneity.

Computation of phylogenetic signal

The phylogenetic signal for each gene was calculated using Fritz and Purvis's D index56 implemented in the R package “caper”. Computation of random and Brownian motion of evolution probabilities was based on 10,000 permutations. Briefly, a given trait (a gene in our case) displays a highly clustered distribution if D < 0, is as clustered as if it evolved under Brownian motion if D = 0, displays random distribution if D = 1 and is overdispersed if D > 1. Comparison of D scores was used to arbitrarily infer the strength of the phylogenetic signal for each gene.
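The scaling behind these reference values can be written out as a short sketch. It assumes the usual definition of D, in which the observed sum of sister-clade differences is rescaled by the mean sums obtained under random shuffling and under simulated Brownian evolution; the permutation and simulation steps themselves are performed by the R package and are not reproduced here. The input numbers are hypothetical.

```python
from statistics import mean


def d_statistic(observed_sum, random_sums, brownian_sums):
    """Rescale an observed sum of sister-clade differences into D.

    observed_sum: value computed from the real presence/absence pattern.
    random_sums: values from permutations that shuffle the trait at random.
    brownian_sums: values from simulations under Brownian evolution.
    D = 1 when the observed sum equals the random expectation and
    D = 0 when it equals the Brownian expectation.
    """
    mean_random = mean(random_sums)
    mean_brownian = mean(brownian_sums)
    return (observed_sum - mean_brownian) / (mean_random - mean_brownian)


# Hypothetical values: the observed sum lies halfway between expectations.
d_value = d_statistic(7.0, random_sums=[9.0, 11.0], brownian_sums=[3.0, 5.0])
```

A value below 1, as in this toy case, would indicate a distribution more clustered on the tree than expected at random.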

Ancestral state character reconstruction

The GLOOME algorithm was used to infer the presence or absence of each gene at each node of a phylogenetic tree based on its distribution in terminal taxa. The phylogenetic tree used was computed as described above but was based on a filtered alignment. When many bacteria of the same species had the same content in genes of interest, only the reference species indicated in the NCBI database was retained. This simplified the reconstruction model by removing redundant information. Reconstructions were made with the Maximum Parsimony method57, which reconstructs ancestral states by minimizing character change events along a phylogenetic tree.
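The principle of parsimony reconstruction on a presence/absence character can be illustrated with the classical Fitch bottom-up pass. This sketch is not the GLOOME implementation, whose cost scheme may differ; it only counts the minimum number of state changes and would need a second, top-down pass to label each change as an acquisition or a loss. The tree and leaf states are hypothetical.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a rooted binary tree.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping each leaf name to 1 (gene present) or 0 (absent).
    Returns (set of equally parsimonious states at this node,
             minimum number of changes within this subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: one change is required on a descending branch.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical four-taxon tree: the gene is present in A and B only.
toy_tree = (("A", "B"), ("C", "D"))
toy_states = {"A": 1, "B": 1, "C": 0, "D": 0}
root_states, n_changes = fitch(toy_tree, toy_states)
```

Here a single change explains the pattern, and the root state is ambiguous: either one acquisition in the ancestor of A and B or one loss in the ancestor of C and D.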