Plants intimately associate with diverse bacteria. Plant-associated bacteria have ostensibly evolved genes that enable them to adapt to plant environments. However, the identities of such genes are mostly unknown, and their functions are poorly characterized. We sequenced 484 genomes of bacterial isolates from roots of Brassicaceae, poplar, and maize. We then compared 3,837 bacterial genomes to identify thousands of plant-associated gene clusters. Genomes of plant-associated bacteria encode more carbohydrate metabolism functions and fewer mobile elements than related non-plant-associated genomes do. We experimentally validated candidates from two sets of plant-associated genes: one involved in plant colonization, and the other serving in microbe–microbe competition between plant-associated bacteria. We also identified 64 plant-associated protein domains that potentially mimic plant domains; some are shared with plant-associated fungi and oomycetes. This work expands the genome-based understanding of plant–microbe interactions and provides potential leads for efficient and sustainable agriculture through microbiome engineering.
The microbiota of plants and animals have coevolved with their hosts for millions of years1,2,3. Through photosynthesis, plants serve as a rich source of carbon for diverse bacterial communities. These include mutualists and commensals, as well as pathogens. Phytopathogens and growth-promoting bacteria have considerable effects on plant growth, health, and productivity4,5,6,7. Except for intensively studied relationships such as root nodulation in legumes8, T-DNA transfer by Agrobacterium9, and type III secretion–mediated pathogenesis10, the molecular mechanisms that govern plant–microbe interactions are not well understood. It is therefore important to identify and characterize the bacterial genes and functions that help microbes thrive in the plant environment. Such knowledge should improve the ability to combat plant diseases and harness beneficial bacterial functions for agriculture, with direct effects on global food security, bioenergy, and carbon sequestration.
Cultivation-independent methods based on profiling of marker genes or shotgun metagenome sequencing have considerably improved the overall understanding of microbial ecology in the plant environment11,12,13,14,15. In parallel, reduced sequencing costs have enabled the genome sequencing of plant-associated bacterial isolates at a large scale16. Importantly, isolates enable functional validation of in silico predictions. Isolate genomes also provide genomic and evolutionary context for individual genes, as well as the potential to access genomes of rare organisms that might be missed by metagenomics because of limited sequencing depth. Although metagenome sequencing has the advantage of capturing the DNA of uncultivated organisms, multiple 16S rRNA gene surveys have reproducibly shown that the most common plant-associated bacteria are derived mainly from four phyla13,17 (Proteobacteria, Actinobacteria, Bacteroidetes, and Firmicutes) that are amenable to cultivation. Thus, bacterial cultivation is not a major limitation in sampling of the abundant members of the plant microbiome16.
Our objective was to characterize the genes that contribute to bacterial adaptation to plants (plant-associated genes) and those genes that specifically aid in bacterial root colonization (root-associated genes). We sequenced the genomes of 484 new bacterial isolates and single bacterial cells from the roots of Brassicaceae, maize, and poplar trees. We combined the newly sequenced genomes with existing genomes to create a dataset of 3,837 high-quality, nonredundant genomes. We then developed a computational approach to identify plant-associated genes and root-associated genes based on comparison of phylogenetically related genomes with knowledge of the origin of isolation. We experimentally validated two sets of plant-associated genes, including a previously unrecognized gene family that functions in plant-associated microbe–microbe competition. In addition, we characterized many plant-associated genes that are shared between bacteria of different phyla, and even between bacteria and plant-associated eukaryotes. This study represents a comprehensive and unbiased effort to identify and characterize candidate genes required at the bacteria–plant interface.
Expanding the plant-associated bacterial reference catalog
To obtain a comprehensive reference set of plant-associated bacterial genomes, we isolated and sequenced 191, 135, and 51 novel bacterial strains from the roots of Brassicaceae (91% from Arabidopsis thaliana), poplar trees (Populus trichocarpa and Populus deltoides), and maize, respectively (Methods, Table 1, Supplementary Tables 1–3). The bacteria were specifically isolated from the interior (endophytic compartment) or surface (rhizoplane) of plant roots, or from soil attached to the root (rhizosphere). In addition, we isolated and sequenced 107 single bacterial cells from surface-sterilized roots of A. thaliana. All genomes were assembled, annotated, and deposited in public databases and in a dedicated website (“URLs,” Supplementary Table 3, Methods).
A broad, high-quality bacterial genome collection
In addition to the newly sequenced genomes noted above, we collected 5,587 bacterial genomes belonging to the four most abundant phyla of plant-associated bacteria13 from public databases (Methods). We manually classified each genome as plant-associated, non-plant-associated (NPA), or soil-derived on the basis of its unambiguous isolation niche (Methods, Supplementary Tables 1 and 2). The plant-associated genomes included organisms isolated from plants or rhizospheres. A subset of the plant-associated bacteria was also annotated as ‘root-associated’ when isolated from the rhizoplane or the root endophytic compartment. Genomes from bacteria isolated from soil were considered as a separate group, as it is unknown whether these strains can actively associate with plants. Finally, the remaining genomes were labeled as NPA genomes; these were isolated from diverse sources, including humans, non-human animals, air, sediments, and aquatic environments.
We carried out stringent quality control to remove low-quality or redundant genomes (Methods). This led to a final dataset of 3,837 high-quality and nonredundant genomes, including 1,160 plant-associated genomes, 523 of which were also root-associated. We grouped these 3,837 genomes into nine monophyletic taxa to allow comparative genomics analysis among phylogenetically related genomes (Fig. 1a, Supplementary Tables 1 and 2, Methods, “URLs”).
To determine whether our genome collection from cultured isolates was representative of plant-associated bacterial communities, we analyzed cultivation-independent 16S rDNA surveys and metagenomes from the plant environments of Arabidopsis11,12, barley18, wheat, and cucumber14 (Methods). The nine taxa analyzed here account for 33–76% (median, 41%; Supplementary Table 4) of the total bacterial communities found in plant-associated environments and therefore represent a substantial portion of the plant microbiota, consistent with previous reports13,16,19.
Increased carbohydrate metabolism and fewer mobile elements in plant-associated genomes
We compared the genomes of bacteria isolated from plant environments with those from bacteria of shared ancestry that were isolated from non-plant environments. We assumed that the two groups should differ in the set of accessory genes that evolved as part of their adaptation to a specific niche. Comparison of the size of plant-associated, soil, and NPA genomes showed that plant-associated and/or soil genomes were significantly larger than NPA genomes (P < 0.05, PhyloGLM and t-tests; Supplementary Fig. 1a, Supplementary Table 5). We observed this trend in six to seven of the nine analyzed taxa (depending on the test), representing all four phyla. Pangenome analyses of a few genera with plant-associated and NPA isolation sites showed that pangenome sizes were similar between plant-associated and NPA genomes (Supplementary Fig. 2).
Next, we examined whether certain gene categories are enriched or depleted in plant-associated genomes versus in their NPA counterparts, using 26 broad functional gene categories (Supplementary Table 6). We used the PhyloGLM test (Fig. 1b) and t-test (Supplementary Fig. 3) to detect enrichment. Two gene categories demonstrated similar phylogeny-independent trends suggestive of an environment-dependent selection process. The “Carbohydrate metabolism and transport” gene category was expanded in the plant-associated organisms of six taxa (Fig. 1b). This was the most expanded category in Alphaproteobacteria, Bacteroidetes, Xanthomonadaceae, and Pseudomonas (Supplementary Fig. 3). In contrast, mobile genetic elements (phages and transposons) were underrepresented in four plant-associated taxa (Fig. 1b and Supplementary Fig. 3). Plant-associated genomes showed increased genome sizes despite a reduction in the number of mobile elements that often serve as vehicles for horizontal gene transfer and genome expansion. A comparison of root-associated bacteria to soil bacteria showed less drastic changes than those seen between plant-associated and NPA groups, as expected for organisms that live in more similar habitats (Fig. 1b and Supplementary Fig. 3).
Identification and validation of plant- and root-associated genes
We sought to identify specific genes enriched in plant- and root-associated genomes compared with NPA and soil-derived genomes, respectively (Supplementary Fig. 4, Methods). First, we clustered the proteins and/or protein domains of each taxon on the basis of homology, using the annotation resources COG20, KEGG Orthology21, and TIGRFAM22, which typically comprise 35–75% of all genes in bacterial genomes23. To capture genes that do not have existing functional annotations, we also used OrthoFinder24 (after benchmarking; Supplementary Fig. 5) to cluster all protein sequences within each taxon into homology-based orthogroups. Finally, we clustered protein domains with Pfam25 (Methods, “URLs”). We used these five protein/domain-clustering approaches in parallel comparative genomics pipelines. Each protein/domain sequence was additionally labeled as originating from either a plant-associated genome or an NPA genome.
Next, we determined whether protein/domain clusters were significantly associated with a plant-associated lifestyle by using five independent statistical approaches: hypergbin, hypergcn (two versions of the hypergeometric test), phyloglmbin, phyloglmcn (two phylogenetic tests based on PhyloGLM26), and Scoary27 (a stringent combined test) (Methods). These analyses were based on either gene presence/absence or gene copy number. We defined a gene as significantly plant-associated if at least one test showed that it belonged to a significant plant-associated gene cluster, and if it originated from a plant-associated genome. We defined significant NPA, root-associated, and soil genes in the same way. Significant gene clusters identified by the different methods had varying degrees of overlap (Supplementary Figs. 6 and 7). In general, we noted a high degree of overlap between plant-associated and root-associated genes and overlap between NPA and soil-associated genes (Supplementary Fig. 8). Overall, plant-associated genes were depleted from NPA genomes from heterogeneous isolation sources (Supplementary Figs. 9 and 10). Principal coordinates analysis with matrices that contained only the plant-associated and NPA genes derived from each method as features increased the separation of plant-associated from NPA genomes along the first two axes (Supplementary Fig. 11). We provide full lists of statistically significant plant-associated, root-associated, soil-associated, and NPA proteins and domains according to the five clustering techniques and five statistical approaches for each taxon in Supplementary Tables 7–15 (also see “URLs”).
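As an illustration of the presence/absence form of the hypergeometric test mentioned above, the following is a minimal sketch rather than the pipeline used in this study (the actual scripts are listed under "URLs"). The function name, argument names, and the toy counts are assumptions introduced for the example; the sketch computes the upper-tail probability of seeing at least the observed number of plant-associated genomes among the genomes that carry a given cluster.

```python
from math import comb


def hypergeom_enrichment_p(total_genomes, pa_genomes, genomes_with_cluster, pa_with_cluster):
    """Upper-tail hypergeometric P value, P(X >= pa_with_cluster).

    total_genomes        -- all genomes in the taxon (plant-associated + NPA)
    pa_genomes           -- plant-associated genomes in the taxon
    genomes_with_cluster -- genomes carrying at least one member of the cluster
    pa_with_cluster      -- plant-associated genomes carrying the cluster
    All names are hypothetical and chosen for this sketch only.
    """
    upper = min(pa_genomes, genomes_with_cluster)
    denominator = comb(total_genomes, genomes_with_cluster)
    # Sum the probability mass for every outcome at least as extreme as observed.
    tail = sum(
        comb(pa_genomes, i) * comb(total_genomes - pa_genomes, genomes_with_cluster - i)
        for i in range(pa_with_cluster, upper + 1)
    )
    return tail / denominator


# Toy example (invented counts): 10 genomes, 5 plant-associated,
# and the cluster occurs in exactly those 5 plant-associated genomes.
p_value = hypergeom_enrichment_p(10, 5, 5, 5)
```

A copy-number variant of the test and the phylogeny-aware tests require additional inputs (per-genome counts and a phylogenetic tree) and are not covered by this sketch.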
To validate our predictions, we assessed the abundance patterns of plant-associated and root-associated genes in natural environments. We retrieved 38 publicly available plant-associated, NPA, root-associated, and soil-associated shotgun metagenomes, including some from plant-associated environments that were not used for isolation of the bacteria analyzed here14,28,29 (Supplementary Table 16a). We mapped reads from these culture-independent metagenomes to plant-associated genes found with all statistical approaches (Methods, Supplementary Figs. 12–16). Plant-associated genes in up to seven taxa were more abundant (P < 0.05, t-test) in plant-associated metagenomes than in NPA metagenomes (Fig. 2a, Supplementary Table 16b). Root-associated, soil-associated, and NPA genes, in contrast, were not necessarily more abundant in their expected environments (Supplementary Table 16b).
In addition, we selected eight genes that were predicted to be plant-associated by multiple approaches (Supplementary Table 17a) for experimental validation via an in planta bacterial fitness assay (Methods). We inoculated the roots of surface-sterilized rice seedlings (n = 9–30 seedlings per experiment) with wild-type Paraburkholderia kururiensis M130 (a rice endophyte30) or a knockout mutant strain for each of the eight genes. We grew the plants for 11 d and then collected and quantified the bacteria that were tightly attached to the roots (Methods, Supplementary Table 17b). Mutations in two genes led to fourfold to sixfold reductions in colonization (false discovery rate (FDR)-corrected Wilcoxon rank sum test, q < 0.1) relative to that by wild-type bacteria (Fig. 2b), without an observed effect on growth rate (Supplementary Fig. 17). These two genes encode an outer-membrane efflux transporter from the nodT family and a Tir chaperone protein (CesT), respectively. It is plausible that the other six genes assayed function in facets of plant association not captured in this experimental context.
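The FDR correction applied to the per-mutant tests can be sketched as follows. The text specifies only that the Wilcoxon rank sum P values were FDR-corrected; the Benjamini-Hochberg procedure shown here is an assumption, as are the function name and the example P values, which are invented and do not correspond to any result in this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q values in the same order as the input P values."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonic q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical P values for four mutants (illustration only).
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
# Mutants with q < 0.1 would be reported as significant under this sketch.
significant = [q < 0.1 for q in example_q]
```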
Functions that require coexpression of, and cooperation between, different proteins are often encoded by operons in bacteria. We therefore tested whether our methods could correctly predict known plant-associated operons. We grouped plant-associated and root-associated genes into putative plant-associated and root-associated operons on the basis of their genomic proximity and orientation (Supplementary Fig. 4, Methods, “URLs”). This analysis yielded some well-known plant-associated functions, such as those of the nodABCSUIJZ and nifHDKENXQ operons (Fig. 2c,d). Nod and Nif proteins are integral for biological nitrogen cycling and mediate root nodulation31 and nitrogen fixation32, respectively. We also identified the biosynthetic gene cluster for the precursor of the plant hormone gibberellin33,34 (Fig. 2e). Other known plant-associated operons identified are related to chemotaxis35, secretion systems such as T3SS36 and T6SS37, and flagellum biosynthesis38,39,40 (Fig. 2f–i).
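The grouping of genes into putative operons by proximity and orientation can be sketched as below. This is a simplified illustration for a single contig, not the procedure described in the Methods: the 200-bp maximum intergenic gap, the minimum of two genes per operon, the rule that a non-plant-associated gene ends a run, and all names (Gene, putative_operons) are assumptions made for the example.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int
    end: int
    strand: str            # "+" or "-"
    plant_associated: bool


def putative_operons(genes, max_gap=200, min_size=2):
    """Group consecutive plant-associated genes on the same strand into runs.

    max_gap and min_size are hypothetical thresholds chosen for illustration.
    """
    operons, current = [], []

    def flush():
        # Keep the current run only if it is long enough to be called an operon.
        if len(current) >= min_size:
            operons.append([g.gene_id for g in current])

    for gene in sorted(genes, key=lambda g: g.start):
        if not gene.plant_associated:
            flush()
            current = []
            continue
        same_strand = bool(current) and gene.strand == current[-1].strand
        close = bool(current) and gene.start - current[-1].end <= max_gap
        if same_strand and close:
            current.append(gene)
        else:
            flush()
            current = [gene]
    flush()
    return operons


# Toy contig with invented coordinates.
toy = [
    Gene("a", 1, 900, "+", True),
    Gene("b", 950, 1800, "+", True),
    Gene("c", 1850, 2500, "-", True),
    Gene("d", 2600, 3300, "-", True),
    Gene("e", 5000, 5600, "-", True),
]
toy_operons = putative_operons(toy)
```

In this toy example, genes a and b form one putative operon and c and d another, whereas e is too far from d to be joined.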
Thus, we identified thousands of plant-associated and root-associated gene clusters by using five different statistical approaches (Supplementary Table 18) and validated them by means of computational and experimental approaches, broadening our understanding of the genetic basis of plant–microbe interactions and providing a valuable resource to drive further experimentation.
Protein domains reproducibly enriched in diverse plant-associated genomes
Plant-associated and root-associated proteins and protein domains conserved across evolutionarily diverse taxa are potentially pivotal to the interaction between bacteria and plants. We identified 767 Pfam domains as significant plant-associated domains in at least three taxa, on the basis of multiple tests (Supplementary Table 19a). Below we elaborate on a few domains that were plant-associated or root-associated in all four phyla. Two of these domains, a DNA-binding domain (pfam00356) and a ligand-binding (pfam13377) domain, are characteristic of the LacI transcription factor (TF) family. These TFs regulate gene expression in response to different sugars41, and their copy numbers were expanded in the genomes of plant-associated and root-associated bacteria in eight of the nine taxa analyzed (Fig. 3a). Examination of the genomic neighbors of lacI-family genes identified strong enrichment for genes involved in carbohydrate metabolism and transport in all of these taxa, consistent with their expected regulation by a LacI-family protein41 (Supplementary Fig. 18). We analyzed the promoter regions of these putative regulatory targets of LacI-family TFs, and identified three AANCGNTT palindromic octamers that were statistically enriched in all but one taxon, and which may serve as the TF-binding site (Supplementary Table 20). These data suggest that accumulation of a large repertoire of LacI-family-controlled regulons is a common strategy across bacterial lineages during adaptation to the plant environment.
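A scan for the AANCGNTT consensus in a promoter sequence can be sketched as follows. This illustrates the motif pattern only and is not the enrichment analysis reported in Supplementary Table 20; the function names and the example sequence are assumptions, and N is taken to mean any of A, C, G, or T. Because the degenerate consensus alone does not guarantee that the two N positions are complementary, a separate reverse-complement check is included.

```python
import re

# Lookahead so that overlapping occurrences are also reported.
OCTAMER = re.compile(r"(?=(AA[ACGT]CG[ACGT]TT))")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_octamers(sequence):
    """Return (position, site) for every AANCGNTT match (0-based positions)."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in OCTAMER.finditer(sequence)]


def is_palindromic(site):
    """True if the site equals its own reverse complement."""
    return site == site.translate(_COMPLEMENT)[::-1]


# Hypothetical promoter fragment (invented for illustration).
hits = find_octamers("GGAATCGATTCCAAGCGCTTGG")
palindromic_hits = [site for _, site in hits if is_palindromic(site)]
```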
Another domain, the metabolic domain aldo-keto reductase (pfam00248), was enriched in the genomes of plant-associated and root-associated bacteria from eight taxa belonging to all four phyla investigated (Fig. 3b). This domain is involved in the metabolic conversion of a broad range of substrates, including sugars and toxic carbonyl compounds42. Thus, bacteria that inhabit plant environments may consume similar substrates. Additional plant-associated and root-associated proteins and domains that were enriched in at least six taxa are described in Supplementary Fig. 19.
We also identified domains that were reproducibly enriched in NPA and/or soil-associated genomes, including many domains of mobile genetic elements (Supplementary Fig. 20).
Putative plant protein mimicry by plant- and root-associated proteins
Convergent evolution and horizontal transfer of protein domains from eukaryotes to bacteria have been suggested for some microbial effector proteins that are secreted into eukaryotic host cells to suppress defense and facilitate microbial proliferation43,44,45. We searched for new candidate effectors or other functional plant-protein mimics. We retrieved a set of significant plant-associated and root-associated Pfam domains that were reproducibly predicted by multiple approaches or in multiple taxa, and we cross-referenced these with protein domains that were also more abundant in plant genomes than in bacterial genomes (Methods). This analysis yielded 64 plant-resembling plant-associated and root-associated domains (PREPARADOs) encoded by 11,916 genes (Supplementary Fig. 21, Supplementary Table 21). The number of PREPARADOs was fourfold higher than the number of domains that overlapped reproducible NPA/soil-associated domains and plant domains (n = 15). The PREPARADOs were relatively abundant in genomes of plant-associated Bacteroidetes and Xanthomonadaceae (>0.5% of all domains on average; Supplementary Fig. 22). Some PREPARADOs were previously described as domains within effector proteins, such as Ankyrin repeats46, regulator of chromosome condensation repeat (RCC1)47, leucine-rich repeat (LRR)48, and pectate lyase49. PREPARADOs from plant genomes were enriched 3–14-fold (P < 10^−5, Fisher’s exact test) as domains predicted to be ‘integrated effector decoys’ when fused to plant intracellular innate immune receptors of the NLR class50,51,52,53 (compared with two random domain sets; Methods, Supplementary Figs. 21 and 23, Supplementary Table 21). We found that 2,201 bacterial proteins that encode 17 of the 64 PREPARADOs shared ≥40% identity across the entire protein sequence with eukaryotic proteins from plants, plant-associated fungi, or plant-associated oomycetes, and therefore are likely to maintain a similar function (Supplementary Fig. 24, Supplementary Tables 21 and 22). The varied phylogenetic distribution among this protein class could have resulted from convergent evolution or from cross-kingdom horizontal gene transfer between phylogenetically distant organisms subjected to the shared selective forces of the plant environment.
Seven PREPARADO-containing protein families were characterized by N-terminal eukaryotic or bacterial signal peptides followed by a PREPARADO dedicated to carbohydrate binding or metabolism (Supplementary Table 21). One of these domains, Jacalin, is a mannose-binding lectin domain that is found in 48 genes in the A. thaliana genome, compared with three genes in the human genome25. Mannose is found on the cell wall of different bacterial and fungal pathogens and could serve as a microbial-associated molecular pattern that is recognized by the plant immune system54,55,56,57,58,59,60,61. We identified a family of ~430-amino-acid-long microbial proteins with a signal peptide followed by a functionally ill-defined endonuclease/exonuclease/phosphatase family domain (pfam03372), and ending with a Jacalin domain (pfam01419). This domain architecture is absent in plants but is found in diverse microorganisms, many of which are phytopathogens, including Gram-negative and Gram-positive bacteria, fungi from the Ascomycota and Basidiomycota phyla, and oomycetes (Fig. 4). We speculate that these microbial lectins may be secreted to outcompete plant immune receptors for mannose-binding on the microbial cell wall, effectively serving as camouflage.
We thus discovered a large set of protein domains that are shared between plants and the microbes that colonize them. In many cases the entire protein is conserved across evolutionarily distant plant-associated microorganisms.
Co-occurrence of plant-associated gene clusters
We identified numerous cases of plant-associated gene clusters (orthogroups) that demonstrate high co-occurrence between genomes (“URLs”). When the plant-associated genes were derived by phylogeny-aware tests (i.e., PhyloGLM and Scoary), they were candidates for intertaxon horizontal gene transfer events. For example, we identified a cluster predicted by Scoary of up to 11 co-occurring genes (mean pairwise Spearman correlation: 0.81) in a flagellum-like locus from sporadically distributed plant-associated or soil-associated genomes across 12 different genera in Burkholderiales (Fig. 5). Two of the annotated flagellar-like proteins, FlgB (COG1815) and FliN (pfam01052), are also encoded by plant-associated genes in Actinobacteria 1 and Alphaproteobacteria taxa. Six of the remaining genes encode hypothetical proteins, all but one of which are specific to Betaproteobacteria, suggestive of a flagellar structure variant that evolved in this class in the plant environment. Flagellum-mediated motility or flagellum-derived secretion systems (for example, T3SS) are important for plant colonization and virulence39,40,62,63 and can be horizontally transferred64.
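The mean pairwise Spearman correlation used to summarize co-occurrence can be sketched as below. This is an illustrative implementation with tie-averaged ranks, not the code used to build the correlation matrices listed under "URLs"; the function names and the toy presence/absence profiles are assumptions. The sketch does not handle a profile that is constant across all genomes, for which the correlation is undefined.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(profiles):
    """profiles maps a gene cluster name to its per-genome presence or copy number."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Toy profiles across six genomes (invented values).
toy_profiles = {
    "og1": [1, 0, 1, 0, 1, 0],
    "og2": [1, 0, 1, 0, 1, 0],
    "og3": [0, 1, 0, 1, 0, 1],
}
toy_mean = mean_pairwise_spearman(toy_profiles)
```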
Novel putative plant- and root-associated gene operons
In addition to successfully capturing several known plant-associated operons (Fig. 2c–i), we also identified putative plant-associated bacterial operons (“URLs”). Two previously uncharacterized plant-associated gene families were conspicuous. These genes are organized in multiple loci in plant-associated genomes, each with up to five tandem gene copies. They encode short, highly divergent, high-copy-number proteins that are predicted to be secreted, as explained below. These two plant-associated protein families never co-occurred in the same genome, and their genomic presence was perfectly correlated with lifestyles of pathogenic or nonpathogenic bacteria of the genus Acidovorax (order Burkholderiales) (Fig. 6a). We named the gene families present in non-pathogens and pathogens Jekyll and Hyde, respectively, after the characters in Robert Louis Stevenson’s classic novel.
The typical Jekyll gene is 97 amino acids long, contains an N-terminal signal peptide, lacks a transmembrane domain, and, in 98.5% of cases, appears in non-pathogenic plant-associated or soil-associated Acidovorax isolates (Fig. 6a, Supplementary Fig. 25d, Supplementary Table 23a). A single genome may encode up to 13 Jekyll gene copies (Fig. 6a) distributed in up to nine loci (Supplementary Table 23a). We recently isolated four Acidovorax strains from the leaves of naturally grown Arabidopsis16. Even these nearly identical isolates carried hypervariable Jekyll loci that were substantially more divergent than neighboring genes and included copy-number variations and various mutations (Fig. 6b, Supplementary Fig. 25, Supplementary Table 24).
The Hyde putative operons, in contrast, are composed of two distinct gene families unrelated to Jekyll. A typical Hyde1 protein has 135 amino acids and an N-terminal transmembrane helix. Hyde1 proteins are also highly variable, as demonstrated by copy-number variation, sequence divergence, and intralocus transposon insertions (Fig. 6a,c, Supplementary Fig. 26a–c, Supplementary Table 23b). Hyde1 was found in 99% of cases in phytopathogenic Acidovorax. These genomes carried up to 15 Hyde1 gene copies distributed in up to ten loci (Fig. 6a, Supplementary Table 23b). In 70% of cases Hyde1 was located directly downstream from a more conserved ~300-amino-acid-long plant-associated protein-coding gene that we named Hyde2 (Fig. 6c,d, Supplementary Table 23d). We identified loci with Hyde2 followed by Hyde1-like genes in different members of the Proteobacteria phylum. These contained a highly variable Hyde1-like protein family that maintained only the short length and a transmembrane helix (Supplementary Fig. 26d). Hyde-carrying organisms included other phytopathogens, such as Pseudomonas syringae, in which the Hyde1-like-Hyde2 locus was again highly variable between closely related strains (Fig. 6d, Supplementary Table 23c). However, the striking Hyde genomic expansion was specific to the phytopathogenic Acidovorax lineage (Supplementary Table 23e). Notably, we observed that Hyde genes often are directly preceded by genes that encode core structural T6SS proteins, such as PAAR, VgrG, and Hcp65, or are fused to PAAR (Fig. 6d, Supplementary Fig. 27a,b, Supplementary Table 23e). We therefore suggest that Hyde1 and/or Hyde2 might constitute a new T6SS effector family.
The high sequence diversity of Jekyll and Hyde1 genes suggests that the two plant-associated protein families encoded by these genes could be involved in molecular arms races with other organisms in the plant environment. As many type VI effectors are used in interbacterial warfare, we tested Acidovorax Hyde1 proteins for antibacterial properties. Expression of two variants of the gene in Escherichia coli led to a 10^5–10^6-fold reduction in cell numbers (Fig. 7a, Supplementary Table 25). We constructed a mutant strain of the phytopathogen Acidovorax citrulli AAC00-1 with deletion of five Hyde1 loci (∆5-Hyde1), encompassing 9 of 11 Hyde1 genes (Supplementary Fig. 28, Supplementary Table 25). Wild-type, ∆5-Hyde1, and T6SS-mutant (∆T6SS) Acidovorax strains were coincubated with an E. coli strain that is susceptible to T6SS killing66 and nine phylogenetically diverse Arabidopsis leaf bacterial isolates16. Survival of wild-type E. coli and six of the leaf isolates after coincubation with wild-type Acidovorax was reduced 10^2–10^6-fold compared with that after coincubation with ∆5-Hyde1 or ∆T6SS Acidovorax (Fig. 7b, Supplementary Fig. 29, Supplementary Table 25). Combined with the genomic association of Hyde loci with T6SS, these results suggest that the T6SS antibacterial phenotype of Acidovorax is mediated by Hyde proteins and that these toxins could be used in competition against other plant-associated organisms. Consistent with a function in microbe–microbe interactions, we did not detect compromised virulence of the ∆5-Hyde1 strain on host plants (watermelon; data not shown). However, clearance of competitors via T6SS can promote the persistence of Acidovorax citrulli on its host67.
There is increasing awareness that plant-associated microbial communities have important roles in host growth and health. An understanding of plant–microbe relationships at the genomic level could enable scientists to use microbes to enhance agricultural productivity. Most studies have focused on specific plant microbiomes, with more emphasis on microbial diversity than on gene function12,14,16,18,68,69,70,71,72,73,74. Here we sequenced nearly 500 root-associated bacterial genomes isolated from different plant hosts. These new genomes were combined in a collection of 3,837 high-quality bacterial genomes for comparative analysis. We developed a systematic approach to identify plant-associated and root-associated genes and putative operons. Our method is accurate as reflected by its ability to capture numerous operons previously shown to have a plant-associated function, the enrichment of plant-associated genes in plant-associated metagenomes, the validation of Hyde1 proteins as likely type VI effectors in Acidovorax directed against other plant-associated bacteria, and the validation of two new genes in P. kururiensis that affect rice root colonization. We note that bacterial genes that are enriched in genomes from the plant environment are also likely to be involved in adaptation to the many other organisms that share the same niche, as we demonstrated for Hyde1.
We used five different statistical approaches to identify genes that were significantly associated with the plant/root environment, each with its advantages and disadvantages. The phylogeny-correcting approaches (phyloglmbin, phyloglmcn, and Scoary) allow accurate identification of genes that are polyphyletic and correlate with an environment independently of ancestral state. On the basis of our metagenome validation, the hypergeometric test predicts more genes that are abundant in plant-associated communities than PhyloGLM does. It also identifies monophyletic plant-associated genes, but it yields more false positives than the phylogenetic tests, because in every plant-associated lineage many lineage-specific genes will be considered plant-associated. Scoary is the most stringent method of all and yielded the fewest predictions (Supplementary Table 18). Future experimental validation should prioritize genes predicted in multiple taxa and/or by multiple approaches (Supplementary Figs. 5 and 6, Supplementary Tables 20 and 26).
We discovered 64 PREPARADOs. Proteins containing 19 of these domains are predicted to be secreted by the Sec or T3SS protein secretion systems (Supplementary Table 21). Notably, plant proteins carrying 35 of these domains belonged to the NLR class of intracellular innate immune receptors (Supplementary Fig. 23, Supplementary Table 21). Thus, these PREPARADO protein domains may serve as molecular mimics. Some may interfere with plant immune functions through disruption of key plant protein interactions75,76. Likewise, the Jacalin-containing proteins we detected in plant-associated bacteria, fungi, and oomycetes may represent a strategy of avoiding immunity triggered by microbial-associated molecular patterns, by binding to extracellular microbial mannose molecules and thereby serving as a molecular invisibility cloak77,78.
Finally, we demonstrated that numerous plant-associated functions are consistent across phylogenetically diverse bacterial taxa, and that some functions are even shared with plant-associated eukaryotes. Some of these traits may facilitate plant colonization by microbes and therefore might prove useful in genome engineering of agricultural inoculants to eventually yield a more efficient and sustainable agriculture.
iTOL Interactive tree (Fig. 1a), https://itol.embl.de/tree/15223230182273621508772620; datasets at the Dangl lab’s dedicated website, http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/index.html (Dataset 1, FNA: nucleotide FASTA files of the 3,837 genomes; Dataset 2, FAA: FASTA files of all proteins used in the analysis; Dataset 3, COG/KEGG Orthology/Pfam/TIGRFAM IMG annotations of all genes used in analysis; Dataset 4, metadata of all genomes; Dataset 5, phylogenetic trees of each of the nine taxa; Dataset 6, pangenome matrices; Dataset 7, pangenome data frames; Dataset 8, OrthoFinder orthogroup FASTA files; Dataset 9, Mafft MSA of all orthogroups; Dataset 10, hidden Markov models of all orthogroups; Dataset 11, plant-associated/NPA and root-associated/soil-associated enrichment tables; Dataset 12, correlation matrices; Dataset 13, predicted operons); DSMZ, https://www.dsmz.de/; ATCC, https://www.atcc.org/; NCBI Biosample, https://www.ncbi.nlm.nih.gov/biosample/; IMG, https://img.jgi.doe.gov/cgi-bin/mer/main.cgi; GOLD, https://gold.jgi.doe.gov/; Phytozome, https://phytozome.jgi.doe.gov/pz/portal.html; BrassicaDB, http://brassicadb.org/brad/; sm R package, http://www.stats.gla.ac.uk/~adrian/sm; vegan R package, https://cran.r-project.org/web/packages/vegan/index.html; ape R package, https://cran.r-project.org/web/packages/ape/ape.pdf; fpc R package, https://cran.r-project.org/web/packages/fpc/index.html; phylolm R package, https://cran.r-project.org/web/packages/phylolm/index.html; scripts used to compute the orthogroups, https://github.com/isaisg/gfobap/tree/master/orthofinder_diamond; scripts used to run the gene enrichment tests, https://github.com/isaisg/gfobap/tree/master/enrichment_tests; scripts used to perform the PCoA, https://github.com/isaisg/gfobap/tree/master/pcoa_visualization_ogs_enriched.
Additional method descriptions appear in Supplementary Note 1.
Bacterial isolation and genome sequencing
The detailed isolation procedure is described in Supplementary Note 1. Bacterial strains from Brassicaceae and poplar were isolated via previously described protocols79,80. Poplar strains were cultured from root tissues collected from Populus deltoides and Populus trichocarpa trees in Tennessee, North Carolina, and Oregon. Root samples were processed as described previously15,80. Briefly, we isolated rhizosphere strains by plating serial dilutions of root wash, whereas for endosphere strains, we pulverized surface-sterilized roots with a sterile mortar and pestle in 10 mL of MgSO4 (10 mM) solution before plating serial dilutions. Strains were isolated on R2A agar media, and the resulting colonies were picked and re-streaked a minimum of three times to ensure isolation. Isolated strains were identified by 16S rDNA PCR followed by Sanger sequencing.
For maize isolates, we selected soils associated with Il14h and Mo17 maize genotypes grown in Lansing, NY, and Urbana, IL. Each maize genotype was grown at each location, and rhizosphere soil samples were collected at week 12 as previously described68. Each rhizosphere soil sample was washed, and the wash was plated onto Pseudomonas isolation agar (BD Diagnostic Systems). The plates were incubated at 30 °C until colonies formed, and DNA was extracted from cells.
For isolation of single cells, A. thaliana accessions Col-0 and Cvi-0 were grown to maturity. Roots were washed in distilled water multiple times. Root surfaces were sterilized with bleach. Surface-sterilized roots were then ground with a sterile mortar and pestle. Individual cells were isolated by flow cytometry followed by DNA amplification with MDA, and 16S rDNA screening as described previously81.
DNA from isolates and single cells was sequenced on next-generation sequencing platforms, mostly using Illumina HiSeq technology (Supplementary Table 3). Sequenced genomic DNA was assembled via different assembly methods (Supplementary Table 3). Genomes were annotated using the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)23 and deposited at the IMG database (“URLs”), ENA, or GenBank for public use.
Data compilation of 3,837 isolate genomes and their isolation-site metadata
We retrieved 5,586 bacterial genomes from the IMG system (“URLs,” Supplementary Table 1). Isolation sites were identified through a manual curation process that included scanning of IMG metadata, DSMZ, ATCC, NCBI Biosample (“URLs”), and the scientific literature. On the basis of its isolation site, each genome was labeled as plant-associated, NPA, or soil-associated. Plant-associated organisms were also labeled as root-associated when isolated from the endophytic compartments or from the rhizoplane. We applied stringent quality control measures to ensure a high-quality and minimally biased set of genomes:
Known isolation site: genomes with missing isolation-site information were filtered out.
High genome quality and completeness: all isolate genomes passed this filter if N50 (the shortest sequence length at 50% of the genome) was more than 50,000 bp. Single amplified genomes passed the quality filter if they had at least 90% of 35 universal single-copy clusters of orthologous groups (COGs)82. In addition, we used CheckM83 to assess isolate genome completeness and contamination. Only genomes that were at least 95% complete and no more than 5% contaminated were used.
High-quality gene annotation: genomes that passed this filter had at least 90% genome sequence coding for genes, with an exception—in the Bartonella genus most genomes have coding base percentages below 90%.
Nonredundancy: we computed whole-genome average nucleotide identity and alignment fraction values for each pair of genomes84. When the alignment fraction exceeded 90% and the whole-genome average nucleotide identity was greater than 99.995% we considered the genome pair redundant. In such cases one genome was randomly selected, and the other genome was marked as redundant and was filtered out.
Consistency in the phylogenetic tree: we filtered out 14 bacterial genomes that showed discrepancy between their given taxonomy and their actual phylogenetic placement in the bacterial tree.
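The numeric thresholds in the quality-control list above can be illustrated with a minimal Python sketch. It is not the pipeline that was actually run: the `GenomeStats` record, its field names, and the `coding_exempt` flag (standing in for the Bartonella exception) are hypothetical, and the completeness, contamination, and ANI values are assumed to have been computed beforehand by the tools named in the text.

```python
from dataclasses import dataclass


def n50(contig_lengths):
    """Length of the contig at which the cumulative length of the
    longest contigs first reaches half of the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


@dataclass
class GenomeStats:
    """Hypothetical per-genome summary; field names are illustrative."""
    contig_lengths: list
    completeness: float      # percent, e.g. as reported by CheckM
    contamination: float     # percent, e.g. as reported by CheckM
    coding_fraction: float   # percent of bases annotated as coding
    has_isolation_site: bool


def passes_isolate_filters(g, coding_exempt=False):
    """Apply the isolate-genome thresholds quoted in the text."""
    return (
        g.has_isolation_site
        and n50(g.contig_lengths) > 50_000
        and g.completeness >= 95.0
        and g.contamination <= 5.0
        and (coding_exempt or g.coding_fraction >= 90.0)
    )


def is_redundant_pair(ani_percent, alignment_fraction_percent):
    """Redundancy rule: alignment fraction > 90% and ANI > 99.995%."""
    return alignment_fraction_percent > 90.0 and ani_percent > 99.995
```

For example, `passes_isolate_filters(GenomeStats([60000, 60000, 10000], 98.0, 1.0, 91.0, True))` evaluates to `True`, whereas the same genome with 6% contamination would be rejected.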
Construction of the bacterial genome tree
To generate a phylogenetic tree of the 3,837 high-quality and nonredundant bacterial genomes, we retrieved 31 universal single-copy genes from each genome with AMPHORA285. For each individual marker gene, we used Muscle with default parameters to construct an alignment. We masked the 31 alignments by using Zorro86 and filtered the low-quality columns of the alignment. Finally, we concatenated the 31 alignments into an overall merged alignment, from which we built an approximately maximum-likelihood phylogenetic tree with the WAG model implemented in FastTree 2.187. Trees of each taxon are provided in Dataset S5 at http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/faa_trees_metadata.html.
Clustering of 3,837 genomes into nine taxa
We divided the dataset into different taxa (taxonomic groups) to allow downstream identification of genes enriched in the plant-associated or root-associated genomes of each taxon compared with the NPA or soil-associated genomes from the same taxon, respectively. To determine the number of taxonomic groups to analyze, we converted the phylogenetic tree into a distance matrix, using the cophenetic function implemented in the R package ape (“URLs”). We then clustered the 3,837 genomes into nine groups using k-medoids clustering as implemented in the PAM (partitioning around medoids) algorithm from the R package fpc (“URLs”). The k-medoids algorithm clusters a dataset of n objects into k a priori–defined clusters. To identify the optimal k value for the dataset, we compared the silhouette coefficients for values of k ranging from 1 to 30. We selected a value of k = 9 because it yielded the maximal average silhouette coefficient (0.66). In addition, at k = 9 the taxa were monophyletic, contained hundreds of genomes, and were relatively balanced between plant-associated and NPA genomes in most taxa (Table 1). The resulting genome clusters generally overlapped with annotated taxonomic units. One exception was in the Actinobacteria phylum. Here our clustering divided the genomes into two taxa that we named, for simplicity, Actinobacteria 1 and Actinobacteria 2. However, our rigorous phylogenetic analysis supports previous suggestions for revisions in the taxonomy of phylum Actinobacteria88.
In addition, the tree showed very divergent bacterial taxa in the Bacteroidetes phylum that could not be separated into monophyletic groups. Specifically, the Sphingobacteriales order (from class Sphingobacteria) and the Cytophagaceae (from class Cytophagia) are paraphyletic. Therefore, we decided to unify all Bacteroidetes into one phylum-level taxon. Analysis of the prevalence of the nine taxa in 16S rDNA and metagenome appears in the Supplementary Information.
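The choice of k by maximal average silhouette coefficient can be sketched as follows. This is an illustration only, written in plain Python rather than with the R packages ape and fpc used in the analysis. It assumes a precomputed, symmetric distance matrix (such as the cophenetic distances) and replaces the PAM swap procedure with a simpler alternating assign-and-update k-medoids heuristic. All function names are hypothetical. Because the silhouette coefficient is not defined for a single cluster, the sketch starts at k = 2.

```python
def _assign(dist, medoids):
    """Label each object with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix.
    (A simplification: PAM proper uses a build-and-swap search.)"""
    n = len(dist)
    # Deterministic start: most central object, then farthest-point picks.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = _assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:          # keep the old medoid for an empty cluster
                updated.append(medoids[c])
                continue
            updated.append(min(members,
                               key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, _assign(dist, medoids)


def mean_silhouette(dist, labels):
    """Average silhouette coefficient; singletons contribute 0."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def best_k(dist, k_values):
    """Return the k with the highest mean silhouette, plus all scores."""
    scores = {}
    for k in k_values:
        _, labels = k_medoids(dist, k)
        scores[k] = mean_silhouette(dist, labels)
    return max(scores, key=scores.get), scores
```

On a toy matrix built from three well-separated groups of points, `best_k(dist, range(2, 6))` selects k = 3, mirroring how k = 9 was selected for the genome tree.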
For each comparison in Supplementary Fig. 2, a random set of ten genomes from each environment (plant-associated and NPA from specific environments) was selected, and the mean and s.d. of the phylogenetic distance in the set were calculated. This step was repeated 50 times to produce two random sets of genomes (plant-associated and NPA) that were comparable and had minimum differences between their mean and s.d. of phylogenetic distances. Genes for pangenome analysis were taken from the orthogroups (see below). Core genome, accessory genome, and unique genes were defined as genes that appeared in all ten genomes, in two to nine genomes, and in only one genome, respectively. For core and accessory genomes, the median copy number in each relevant orthogroup was used.
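The core/accessory/unique definitions above amount to counting, for each orthogroup, the genomes of the ten-genome set in which it occurs. A minimal sketch, assuming a hypothetical dictionary that maps each orthogroup to its per-genome copy numbers (the median copy-number step is not reproduced here):

```python
def classify_orthogroups(copy_numbers, n_genomes=10):
    """Classify orthogroups by the number of genomes that carry them.

    copy_numbers: dict mapping orthogroup ID -> list of per-genome copy
    numbers (hypothetical input layout, one entry per genome in the set).
    Orthogroups absent from every genome in the set are skipped.
    """
    classes = {}
    for orthogroup, counts in copy_numbers.items():
        if len(counts) != n_genomes:
            raise ValueError(f"{orthogroup}: expected {n_genomes} counts")
        present = sum(1 for c in counts if c > 0)
        if present == n_genomes:
            classes[orthogroup] = "core"
        elif present >= 2:
            classes[orthogroup] = "accessory"
        elif present == 1:
            classes[orthogroup] = "unique"
    return classes
```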
Genome size comparison and gene category enrichment analysis
Genome sizes were retrieved from the IMG database (“URLs”) and compared by t-test and PhyloGLM26. Kernel density plots from the R sm package (“URLs”) were used to prepare Supplementary Fig. 1. Protein-coding genes were retrieved and mapped to COG IDs with the program RPS-BLAST at an e-value cutoff of 1e–2 and an alignment length of at least 70% of the consensus sequence length. Each COG ID was mapped to at least one COG category (Supplementary Table 6). For each genome, we counted the number of genes from a given category. A t-test and PhyloGLM test were used to compare the number of genes in the genomes that shared the same taxon and category but different labels (e.g., plant-associated versus NPA).
Benchmarking gene clustering with UCLUST and OrthoFinder
We computed clusters of coding sequences across each of the nine taxa defined above with two algorithms: UCLUST89 (v 7.0) and OrthoFinder24 (v 1.1.4). UCLUST was run with 50% identity and 50% coverage in the target to call the clusters. Command used: usearch7.0.1090_i86linux64 -cluster_fast <input_file> -id 0.5 -maxaccepts 0 -maxrejects 0 -target_cov 0.5 -uc <output_file>. To improve pairwise alignment performance, we used the accelerated protein alignment algorithm implemented in DIAMOND90 (v 0.8.36.98) with the --very-sensitive option in the DIAMOND BLASTP algorithm. After computing the alignments, we ran OrthoFinder with default parameters. See “URLs” for the scripts used to compute the orthogroups.
Supplementary Fig. 5 shows benchmarking of OrthoFinder against UCLUST. To estimate the quality of the clusters output by UCLUST and OrthoFinder, we mapped the proteins from our datasets to the curated set of taxon markers from Phyla-AMPHORA91. Next, we compared the distribution of each of the taxon-specific markers identified by Phyla-AMPHORA across the clusters output by UCLUST and OrthoFinder. To compare the two approaches, we estimated two metrics: the purity and the fragmentation index, explained in Supplementary Fig. 5 and in the Supplementary Information.
Identification of plant-associated, NPA, root-associated, and soil genes/domains
The following description applies to plant-associated, NPA, root-associated, and soil genes. For conciseness, only plant-associated genes are described here. Plant-associated genes were identified via a two-step process that included protein/domain clustering on the basis of amino acid sequence similarity and subsequent identification of the protein/domain clusters significantly enriched in proteins/domains from plant-associated bacteria (Supplementary Fig. 4). Clustering of genes and protein domains involved five independent methods: OrthoFinder24, COG20, KEGG Orthology (KO)21, TIGRFAM22, and Pfam25. OrthoFinder was selected (after the aforementioned benchmarking) as a clustering approach that included all proteins, including those that lack any functional annotation. We first compiled, for each taxon separately, a list of all proteins in the genomes. For COG, KO, TIGRFAM, and Pfam, we used the existing annotations of IMG genes that are based on BLAST alignments to the different protein/domain models23. This process yielded gene/domain clusters. Next, we determined which clusters were significantly enriched with genes derived from plant-associated genomes. These clusters were termed plant-associated clusters. In the statistical analysis, we used only clusters of more than five members. We corrected P values with Benjamini–Hochberg FDR and used q < 0.05 as the significance threshold, unless stated otherwise. The proteins in each cluster were categorized as either plant-associated or NPA, on the basis of the label of the encoding genome. Namely, a plant-associated gene is a gene derived from a plant-associated gene cluster and a plant-associated genome.
The three main approaches were the hypergeometric test (Hyperg), PhyloGLM, and Scoary. Hyperg looks for overall enrichment of gene copies across a group of genomes but ignores the phylogenetic structure of the dataset. PhyloGLM26 takes into account phylogenetic information to eliminate apparent enrichments that can be explained by shared ancestry. The Hyperg and PhyloGLM tests were used in two versions, based on either gene presence/absence data (hypergbin, phyloglmbin) or gene copy-number data (hypergcn, phyloglmcn). We also used a stringent version of Scoary27, a gene presence/absence approach that combines Fisher’s exact test, a phylogenetic test, and a label-permutation test. The first hypergeometric test, hypergcn, used the gene copy-number data, with the cluster being the sample, the total number of plant-associated and NPA genes being the population, and the number of plant-associated genes within the cluster being considered as ‘successes’. The second version, hypergbin, used gene presence/absence data. P values were corrected by Benjamini–Hochberg FDR92 for clusters of COG/KO/TIGRFAM/Pfam. For the abundant OrthoFinder clusters, we used Bonferroni correction with a threshold of P < 0.1, as downstream validation with metagenomes showed fewer false positives with the more significant clusters. The third and fourth statistical approaches used PhyloGLM26, implemented in the phylolm (v 2.5) R package (“URLs”). PhyloGLM combines a Markov process of lifestyle (e.g., plant-associated versus NPA) evolution with a regularized logistic regression. This approach takes advantage of the known phylogeny to specify the residual correlation structure between genomes that share common ancestry, and so it does not need to make the incorrect assumption that observations are independent. Intuitively, PhyloGLM favors genes found in multiple lineages of the same taxon. For each taxon we used the subtree from Fig. 1a to estimate the correlation matrix between observations and used the copy number (in phyloglmcn) or presence/absence pattern (in phyloglmbin) of each gene as the only independent variable. Positive and negative estimates in phyloglmbin/phyloglmcn indicated plant-associated/root-associated and NPA/soil-associated proteins/domains, respectively.
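To make the hypergcn setup concrete, the following sketch computes a one-sided hypergeometric enrichment P value with the cluster as the sample, all plant-associated and NPA genes as the population, and plant-associated genes as successes, and then applies Benjamini–Hochberg correction. It is an illustration in plain Python, not the code linked under “URLs”; function names and the example numbers are hypothetical.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_enrichment_p(pa_in_cluster, cluster_size, pa_total, gene_total):
    """P(X >= pa_in_cluster) when cluster_size genes are drawn without
    replacement from gene_total genes, pa_total of which are
    plant-associated."""
    npa_total = gene_total - pa_total
    log_denominator = _log_comb(gene_total, cluster_size)
    log_terms = []
    for x in range(pa_in_cluster, min(cluster_size, pa_total) + 1):
        if cluster_size - x > npa_total:
            continue  # impossible configuration, probability zero
        log_terms.append(_log_comb(pa_total, x)
                         + _log_comb(npa_total, cluster_size - x)
                         - log_denominator)
    if not log_terms:
        return 0.0
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

A cluster would then be called plant-associated under this test when its q-value is below 0.05; the Bonferroni threshold used for OrthoFinder clusters would instead compare `p * number_of_clusters` against 0.1.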
Finally, the fifth statistical approach was Scoary27, which uses a gene presence/absence dataset. Scoary combines Fisher’s exact test, a phylogeny-aware test, and an empirical label-switching permutation analysis. A gene cluster was considered significant by Scoary only if (1) it had a q-value less than 0.05 for Fisher’s exact test, (2) the ‘worst’ P value from the pairwise comparison algorithm was < 0.05, and (3) the empirical (permutation-based) P value was < 0.05. These are very stringent criteria that yielded relatively few significant predictions. Odds ratios greater than or less than 1 in Scoary indicated plant-associated/root-associated and NPA/soil-associated proteins/domains, respectively.
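The three-part Scoary criterion and the odds-ratio direction can be written as a small decision rule. The sketch below assumes the three values and the odds ratio have already been read from Scoary output; the function name and return labels are hypothetical.

```python
def scoary_call(fisher_q, worst_pairwise_p, empirical_p, odds_ratio, alpha=0.05):
    """Apply the stringent rule: all three values must be below alpha.
    The odds ratio then gives the direction of the association."""
    significant = (fisher_q < alpha
                   and worst_pairwise_p < alpha
                   and empirical_p < alpha)
    if not significant or odds_ratio == 1:
        return "not significant"
    return "PA/RA" if odds_ratio > 1 else "NPA/soil"
```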
See “URLs” for links to the code used for the gene enrichment tests. A description of additional assessment of plant-associated/NPA prediction robustness using validation genome datasets is presented in Supplementary Note 1.
Validation of predicted plant-associated, NPA, root-associated, and soil-associated genes using metagenomes
Metagenome samples (n = 38; Supplementary Table 16) were downloaded from NCBI and GOLD (“URLs”). The reads were translated into proteins, and proteins at least 40 amino acids long were aligned using HMMsearch93 against the different protein references. The protein references included the predicted plant-associated, root-associated, soil-associated, and NPA proteins from OrthoFinder that were found to be significant by the different approaches. The normalization process is explained in Supplementary Figs. 12–16.
Principal coordinates analysis
To visualize the overall contribution of statistically significant enriched/depleted orthogroups to the differentiation of plant-associated and NPA genomes, we used principal coordinates analysis (PCoA) and logistic regression. For each of the nine taxa analyzed, we ran this analysis over a collection of matrices. The first matrix was the full pan-genome matrix, which depicted the distribution of all the orthogroups contained across all the genomes in a given taxon. The subsequent matrices represented subsets of the full pan-genome matrix; each of these matrices depicted the distribution of only the statistically significant orthogroups as called by one of the five different algorithms used to test for the genotype–phenotype association. A full description of this process is presented in Supplementary Note 1.
We used the function cmdscale from the R (v 3.3.1) stats package to run PCoA over all the matrices described above, using the Canberra distance as implemented in the vegdist function from the vegan (v 2.4-2) R package (“URLs”). Then, we took the first two axes output from the PCoA and used them as independent variables to fit a logistic regression over the labels of each genome (plant-associated, NPA). Finally, we computed the Akaike information criterion for each of the different models fitted. Briefly, the Akaike information criterion estimates how much information is lost when a model is applied to represent the true model of a particular dataset. See “URLs” for a link to the scripts used to perform the PCoA.
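Two quantities in this procedure have compact definitions that a short sketch can make explicit: the Canberra distance between two genomes' orthogroup profiles, and the Akaike information criterion of a fitted model. The code below is plain Python, not the R code that was used. The normalization of the Canberra sum by the number of orthogroups that are non-zero in at least one of the two genomes is an assumption and should be checked against the vegdist documentation. The PCoA and the logistic-regression fit are not reproduced.

```python
def canberra(profile_a, profile_b):
    """Canberra distance between two orthogroup profiles.
    Terms where both entries are zero are dropped; the sum is averaged
    over the remaining terms (assumed normalization)."""
    terms = [abs(a - b) / (abs(a) + abs(b))
             for a, b in zip(profile_a, profile_b)
             if abs(a) + abs(b) > 0]
    return sum(terms) / len(terms) if terms else 0.0


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood
```

For a logistic regression on the first two PCoA axes plus an intercept, `n_parameters` would be 3; lower AIC values indicate that less information is lost, so the matrices can be ranked by how well their first two axes separate plant-associated from NPA genomes.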
Validation of plant-associated genes in Paraburkholderia kururiensis M130 affecting rice root colonization
Growth and transformation details of P. kururiensis M130 are described in Supplementary Note 1.
Internal fragments of 200–900 bp from each gene of interest were PCR-amplified with the primers listed in Supplementary Table 17c. Fragments were cloned in the pGem2T easy vector (Promega) and sequenced (GATC Biotech; Germany), then excised with EcoRI restriction enzyme and cloned in the corresponding site in pKNOCK-Km R94. These plasmids were then used as a suicide delivery system to create the knockout mutants and transferred to P. kururiensis M130 by triparental mating. All the mutants were verified by PCR with primers specific to the pKNOCK-Km vector and to the genomic DNA sequences upstream and downstream from the targeted genes.
Rhizosphere colonization experiments with P. kururiensis and mutant derivatives
Seeds of Oryza sativa (BALDO variety) were surface-sterilized and then left to germinate in sterile conditions at 30 °C in the dark for 7 d. Each seedling was then aseptically transferred into a 50-mL Falcon tube containing 35 mL of half-strength Hoagland solution semisolid substrate (0.4% agar). The tubes were then inoculated with 10⁷ colony-forming units (CFU) of a P. kururiensis suspension. Plants were grown for 11 d at 30 °C (16/8-h light/dark cycles). For the determination of the bacterial counts, plants were washed under tap water for 1 min and then cut below the cotyledon to excise the roots. Roots were air-dried for 15 min, weighed, and then transferred to a sterile tube containing 5 mL of PBS. After vortexing, the suspension was serially diluted to 10⁻¹ and 10⁻² in PBS, and aliquots were plated on KB plates containing the appropriate antibiotic (rifampicin 50 µg/mL for the wild type, rifampicin 50 µg/mL and kanamycin 50 µg/mL for the mutants). After 3 d of incubation at 30 °C, we counted CFU. Three replicates for each dilution from ten independent plantlets were used to determine the average CFU values.
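The arithmetic behind converting plate counts into colonization levels can be sketched as below. Two points are assumptions rather than statements from the protocol: that counts were normalized per gram of root (suggested by the roots being weighed), and the plated aliquot volume, which is not given in the text and is therefore left as a required parameter.

```python
def cfu_per_gram(colonies, dilution, plated_volume_ml, root_weight_g,
                 suspension_volume_ml=5.0):
    """Back-calculate CFU per gram of root from a plate count.

    dilution: dilution of the plated suspension (e.g. 1e-2).
    plated_volume_ml: aliquot volume spread on the plate (assumption;
        not stated in the protocol).
    suspension_volume_ml: PBS volume the roots were vortexed in (5 mL).
    """
    cfu_per_ml = colonies / (dilution * plated_volume_ml)
    return cfu_per_ml * suspension_volume_ml / root_weight_g
```

With illustrative numbers, 50 colonies from a 0.1-mL aliquot of the 10⁻² dilution and a 0.2-g root correspond to about 1.25 × 10⁶ CFU per gram.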
Plant-mimicking plant-associated and root-associated proteins
Supplementary Fig. 21 summarizes the algorithm used to find plant-mimicking plant-associated and root-associated proteins. Pfam25 version 30.0 metadata were downloaded. Protein domains that appeared in both Viridiplantae and bacteria and occurred at least two times more frequently in Viridiplantae than in bacteria were considered as plant-like domains (n = 708). In parallel, we scanned the set of significant plant-associated, root-associated, NPA, and soil-associated Pfam protein domains predicted by the five algorithms in the nine taxa. We compiled a list of domains that were significantly plant-associated/root-associated in at least four tests, and significantly NPA/soil-associated in up to two tests (n = 1,779). The overlapping domains between the first two sets were defined as PREPARADOs (n = 64). In parallel, we created two control sets of 500 random plant-like Pfam domains and 500 random plant-associated/root-associated Pfam domains. Enrichment of PREPARADOs integrated into plant NLR proteins in comparison to the domains in the control groups was tested by Fisher’s exact test. To identify domains found in plant disease-resistance proteins, we retrieved all proteins from Phytozome and BrassicaDB (“URLs”) and used hmmscan to search the protein sequences for the presence of NB-ARC (PF00931.20), TIR (PF01582.18), TIR_2 (PF13676.4), or RPW8 (PF05659.9) domains. Bacterial proteins carrying the PREPARADO domains were considered as having full-length identity to fungal, oomycete, or plant proteins on the basis of LAST alignments to all Refseq proteins of plants, fungi, and protozoa. “Full-length” is defined here as an alignment length of at least 90% of the length of both query and reference proteins. The threshold used for considering a high amino acid identity was 40%. An explanation of the prediction of secretion of proteins with PREPARADOs is presented in the Supplementary Information.
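The set logic behind the PREPARADO definition, and the full-length identity rule, can be sketched in a few lines. This is a simplified illustration with hypothetical input layouts (dictionaries of domain frequencies and of per-domain counts of significant tests); it is not the procedure summarized in Supplementary Fig. 21.

```python
def plant_like_domains(viridiplantae_freq, bacteria_freq, fold=2.0):
    """Domains present in both groups and at least `fold` times more
    frequent in Viridiplantae than in bacteria."""
    return {d for d, f in viridiplantae_freq.items()
            if f > 0 and bacteria_freq.get(d, 0) > 0
            and f >= fold * bacteria_freq[d]}


def consistent_pa_domains(pa_test_hits, npa_test_hits, min_pa=4, max_npa=2):
    """Domains significant as PA/RA in at least `min_pa` tests and as
    NPA/soil in at most `max_npa` tests."""
    return {d for d, hits in pa_test_hits.items()
            if hits >= min_pa and npa_test_hits.get(d, 0) <= max_npa}


def preparado_candidates(viridiplantae_freq, bacteria_freq,
                         pa_test_hits, npa_test_hits):
    """Overlap of the plant-like and the consistently PA/RA domain sets."""
    return (plant_like_domains(viridiplantae_freq, bacteria_freq)
            & consistent_pa_domains(pa_test_hits, npa_test_hits))


def is_full_length_high_identity(alignment_len, query_len, reference_len,
                                 identity_percent,
                                 min_coverage=0.9, min_identity=40.0):
    """Alignment spans >= 90% of both proteins with >= 40% identity."""
    return (alignment_len >= min_coverage * query_len
            and alignment_len >= min_coverage * reference_len
            and identity_percent >= min_identity)
```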
Prediction of plant-associated, NPA, root-associated, and soil-associated operons and their annotation as biosynthetic gene clusters
Significant plant-associated, NPA, root-associated, and soil-associated genes of each genome were clustered on the basis of genomic distance: genes sharing the same scaffold and strand that were up to 200 bp apart were clustered into the same predicted operon. We allowed up to one spacer gene, which is a non-significant gene, between each pair of significant genes within an operon. Operons were predicted for the genes in COG and OrthoFinder clusters using all five approaches. Operons were annotated as biosynthetic gene clusters if at least one of the constituent genes was part of a biosynthetic gene cluster from the IMG-ABC database95.
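The distance-based operon rule can be sketched as a single pass over genes sorted by position. The sketch is an interpretation, not the code used for Dataset 13. It assumes that the 200-bp limit applies to every gap between consecutive genes, including gaps next to a spacer gene, and that a predicted operon needs at least two significant genes. The `Gene` record and its field names are hypothetical.

```python
from collections import defaultdict, namedtuple

# Hypothetical gene record; `significant` marks genes called by a test.
Gene = namedtuple("Gene", "gene_id scaffold strand start end significant")


def predict_operons(genes, max_gap=200, max_spacers=1, min_genes=2):
    """Group significant genes on the same scaffold and strand into operons.

    Consecutive genes must be at most `max_gap` bp apart, and at most
    `max_spacers` non-significant genes may separate two significant genes.
    """
    by_location = defaultdict(list)
    for gene in genes:
        by_location[(gene.scaffold, gene.strand)].append(gene)

    operons = []
    for group in by_location.values():
        group.sort(key=lambda g: g.start)
        current, spacers, prev_end = [], 0, None
        for gene in group:
            too_far = prev_end is not None and gene.start - prev_end > max_gap
            too_many_spacers = not gene.significant and spacers >= max_spacers
            if too_far or too_many_spacers:
                # Close the running operon and start afresh.
                if len(current) >= min_genes:
                    operons.append([g.gene_id for g in current])
                current, spacers = [], 0
            if gene.significant:
                current.append(gene)
                spacers = 0
            elif current:
                spacers += 1
            prev_end = gene.end
        if len(current) >= min_genes:
            operons.append([g.gene_id for g in current])
    return operons
```

Only significant genes are reported as operon members; a tolerated spacer gene bridges two significant genes but is not itself listed.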
Jekyll and Hyde analyses
To find all homologs and paralogs of Jekyll and Hyde genes, we used IMG BLAST search with an e-value threshold of 1e–5 against all IMG isolates. We searched Hyde1 homologs of Acidovorax, Hyde1 homologs of Pseudomonas, Hyde2, and Jekyll genes using proteins of genes Aave_1071, A243_06583, Ga0078621_123530, and Ga0102403_10160 as the query sequences, respectively. Multiple sequence alignments were done with MAFFT96. A phylogenetic tree of Acidovorax species was produced with RAxML97, based on concatenation of 35 single-copy genes98.
Hyde1 toxicity assay
To verify the toxicity of Hyde1 and Hyde2 proteins to E. coli, we cloned genes encoding proteins Aave_0990 (Hyde2), Aave_0989 (Hyde1), and Aave_3191 (Hyde1), or GFP as a control, into the inducible pET28b expression vector via the LR reaction. The recombinant vectors were transformed into E. coli C41 competent cells by electroporation after sequencing validation. Five colonies were selected and cultured in LB liquid media supplemented with kanamycin with shaking overnight. The OD600 of the bacterial culture was adjusted to 1.0, and then the culture was successively diluted 10²-, 10⁴-, 10⁶-, and 10⁸-fold. Serial dilutions of the bacterial culture were spotted (5 μL) on LB plates with or without 0.5 mM IPTG to induce gene expression.
Construction of ∆5-Hyde1 strain
Details of the construction of the ∆5-Hyde1 strain are presented in Supplementary Note 1. A. citrulli strain AAC00-1 and its derived mutants were grown on nutrient agar medium supplemented with rifampicin (100 µg/ml). To delete a cluster of five Hyde1 genes (Aave_3191–3195), we carried out a marker-exchange mutagenesis as previously described99. The marker-free mutant was designated as ∆1-Hyde1, and its genotype was confirmed by PCR amplification and sequencing. The marker-exchange mutagenesis procedure was repeated to delete four other Hyde1 loci (Supplementary Fig. 28). The primers used are listed in Supplementary Table 25. The final mutant, with deletion of 9 out of 11 Hyde1 genes (in five loci), was designated as ∆5-Hyde1 and was used for competition assay. The ∆T6SS mutant was provided by Ron Walcott’s lab.
Competition assay of Acidovorax citrulli AAC00-1 against different strains
E. coli BW25113 pSEVA381 was grown aerobically in LB broth (5 g/L NaCl) at 37 °C in the presence of chloramphenicol. Naturally antibiotic-resistant bacterial leaf isolates16 and Acidovorax strains were grown aerobically in NB medium (5 g/L NaCl) at 28 °C in the presence of the appropriate antibiotic. Antibiotic resistance and concentrations used in the competition assay are mentioned in Supplementary Table 25.
Competition assays were conducted similarly to those described elsewhere66,100. Briefly, bacterial overnight cultures were harvested, washed in PBS (pH 7.4) to remove excess antibiotics, and resuspended in fresh NB medium to an optical density of 10. Predator and prey strains were mixed at a 1:1 ratio, and 5 µL of the mixture was spotted onto dry NB agar plates and incubated at 28 °C. As a negative control, the same volume of NB medium was mixed with prey cells instead of the predator strain. After 19 h of coincubation, bacterial spots were excised from the agar, resuspended in 500 µL of NB medium, and then spotted on NB agar containing antibiotic selective for the prey strains. CFUs of recovered prey cells were determined after incubation at 28 °C. All assays were performed in at least three biological replicates.
Life sciences reporting summary
Further information on experimental design is available in the Life Sciences Reporting Summary.
A dedicated website for the Dangl lab: http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/index.html
The dedicated website contains nucleotide and amino acid FASTA files of all datasets used, protein/domain annotations (COG, KO, TIGRFAM, Pfam), metadata, phylogenetic trees, OrthoFinder orthogroups, orthogroup hidden Markov models, full enrichment datasets, correlations between orthogroups, and predicted operons (“URLs”).
Links to different scripts that were used in analysis are included in the “URLs” section. The full genome sequence, gene annotation, and metadata of each genome used can be found at the IMG website (https://img.jgi.doe.gov/). For example, the metadata of taxon ID 2558860101 can be found at https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2558860101.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The work conducted by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231. J.L.D. and S.G.T. were supported by NSF INSPIRE grant IOS-1343020, and J.L.D. was also supported by DOE–USDA Feedstock Award DE-SC001043 and by the Office of Science (BER), US Department of Energy, grant no. DE-SC0014395. S.H.P. was supported by NIH Training Grant T32 GM067553-06 and was a Howard Hughes Medical Institute (HHMI) International Student Research Fellow. D.S.L. was supported by NIH Training Grant T32 GM07092-34. J.L.D. is an Investigator of the HHMI, supported by the HHMI and the Gordon and Betty Moore Foundation (GBMF3030). M.E.F. was supported by NIH Dr. Ruth L. Kirschstein NRSA Fellowship F32-GM112345. D.A.P. and T.-Y.L. were supported by the Genomic Science Program, US Department of Energy, Office of Science, Biological and Environmental Research as part of the Oak Ridge National Laboratory Plant Microbe Interfaces Scientific Focus Area (http://pmi.ornl.gov) and Plant Feedstock Genomics Award DE-SC001043. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. J.A.V. was supported by a SystemsX.ch grant (Micro2X) and a European Research Council (ERC) advanced grant (PhyMo). We thank I. Bertani, C. Bez, R. Bowers, D. Burstein, A. Chun Chen, D. Chiniquy, B. Cole, O. Cohen, A. Copeland, J. Eisen, E. Eloe-Fadrosh, M. Hadjithomas, O. Finkel, H. Schnitzel Meule Fux, N. Ivanova, J. Knelman, R. Malmstrom, R. Perez-Torres, D. Salomon, R. Sorek, T. Mucyn, R. Seshadri, T.K. Reddy, L. Ryan, and H. Sberro Livnat for general help, text editing, and ideas for this work. We thank R. Walcott (University of Georgia, Athens, GA, USA) for providing the Acidovorax citrulli VasD mutant strain.
Supplementary Figures 1–29 and Supplementary Note 1.
All genomes used. Lists of all genomes used from nine taxa (pre-filtration). Cells filled with yellow are Brassicaceae root isolates from the USA, cells filled with green are single cells isolated from Arabidopsis thaliana, cells filled with pink are poplar isolates, cells filled with blue are recently published leaf and root Arabidopsis and soil isolates from Europe, and cells filled with purple are maize root isolates. The “Filtered out?” column is ‘N’ if the genome is retained for usage in analysis after the QA process. “Representative genome taxid” is the taxon ID of another genome (different row in the same tab) representing at least two redundant genomes. Completeness and contamination values were calculated with CheckM. The full genome sequence, gene annotation, and metadata of each genome used can be found at the IMG website, https://img.jgi.doe.gov/. For example, the metadata of taxon ID 2558860101 can be found at https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2558860101.
Statistics of genomes in the taxa used
Sequencing and assembly information of new genomes
Abundance of the nine taxa in 16S marker gene surveys. The relative abundances of taxa composing a specific taxon were taken from the different publications and were added to yield the relative abundance of that taxon. In those cases with biological replicates, e.g. in Lundberg et al. Nature 2012 we used the median value.
Genome size comparison. Genome size comparison between the different isolation sites done by t-test and PhyloGLM. Each cell denotes the group with the largest genomes, if the difference is significant (P < 0.05). N.S. - not significant. PhyloGLM test takes into account the phylogenetic structure of the taxon.
COG-to-COG category mapping
Acinetobacter PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, they were too small or were found in all genomes). To be considered as a significant cluster inpfam/COG/TIGRFAM/KO + hypergbin/hypergcn, we used qvalue< 0.05 (Benjamini Hochberg FDR corrected). To be considered as significant cluster in OrthoFinder + hypergbin/hypergcn, we used Bonferroni-corrected P < 0.1.To be considered as a significant PA/RA cluster in phyloglmcn/phyloglmcn, we used q-value < 0.05 (Benjamini Hochberg FDR corrected) and an estimate > 0 (or estimate < 0 for significant NPA/soil). To be considered as a significant PA/RA cluster in Scoary, we used P < 0.05 for three tests: Fisher exact test (Benjamini Hochberg FDR corrected), worst pairing scenario test, and empirical test and odds ratio or Fisher exact test > 1 (odds ratio < 1 for NPA/soil).
Actinobacteria1 PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, they were too small or were found in all genomes). To be considered as a significant cluster in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn, we used qvalue < 0.05 (Benjamini Hochberg FDR corrected). To be considered as significant cluster in OrthoFinder + hypergbin/hypergcn, we used Bonferroni-corrected P < 0.1. To be considered as a significant PA/RA cluster in phyloglmcn/phyloglmcn, we used q-value < 0.05 (Benjamini Hochberg FDR corrected) and an estimate > 0 (or estimate < 0 for significant NPA/soil). To be considered as a significant PA/RA cluster in Scoary, we used P < 0.05 for three tests: Fisher exact test (Benjamini Hochberg FDR corrected), worst pairing scenario test, and empirical test and odds ratio or Fisher exact test > 1 (odds ratio < 1 for NPA/soil).
Actinobacteria2 PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, or when they were too small or found in all genomes). A cluster was considered significant in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected), and in OrthoFinder + hypergbin/hypergcn at Bonferroni-corrected P < 0.1. A cluster was considered a significant PA/RA cluster in phyloglmbin/phyloglmcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected) with an estimate > 0 (estimate < 0 for significant NPA/soil). A cluster was considered a significant PA/RA cluster in Scoary at P < 0.05 in all three tests (Fisher exact test with Benjamini-Hochberg FDR correction, worst pairing scenario test, and empirical test) with an odds ratio > 1 (odds ratio < 1 for NPA/soil).
Alphaproteobacteria PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, or when they were too small or found in all genomes). A cluster was considered significant in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected), and in OrthoFinder + hypergbin/hypergcn at Bonferroni-corrected P < 0.1. A cluster was considered a significant PA/RA cluster in phyloglmbin/phyloglmcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected) with an estimate > 0 (estimate < 0 for significant NPA/soil). A cluster was considered a significant PA/RA cluster in Scoary at P < 0.05 in all three tests (Fisher exact test with Benjamini-Hochberg FDR correction, worst pairing scenario test, and empirical test) with an odds ratio > 1 (odds ratio < 1 for NPA/soil).
Bacillales PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, or when they were too small or found in all genomes). A cluster was considered significant in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected), and in OrthoFinder + hypergbin/hypergcn at Bonferroni-corrected P < 0.1. A cluster was considered a significant PA/RA cluster in phyloglmbin/phyloglmcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected) with an estimate > 0 (estimate < 0 for significant NPA/soil). A cluster was considered a significant PA/RA cluster in Scoary at P < 0.05 in all three tests (Fisher exact test with Benjamini-Hochberg FDR correction, worst pairing scenario test, and empirical test) with an odds ratio > 1 (odds ratio < 1 for NPA/soil).
Bacteroidetes PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, or when they were too small or found in all genomes). A cluster was considered significant in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected), and in OrthoFinder + hypergbin/hypergcn at Bonferroni-corrected P < 0.1. A cluster was considered a significant PA/RA cluster in phyloglmbin/phyloglmcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected) with an estimate > 0 (estimate < 0 for significant NPA/soil). A cluster was considered a significant PA/RA cluster in Scoary at P < 0.05 in all three tests (Fisher exact test with Benjamini-Hochberg FDR correction, worst pairing scenario test, and empirical test) with an odds ratio > 1 (odds ratio < 1 for NPA/soil).
Burkholderiales PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, or when they were too small or found in all genomes). A cluster was considered significant in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected), and in OrthoFinder + hypergbin/hypergcn at Bonferroni-corrected P < 0.1. A cluster was considered a significant PA/RA cluster in phyloglmbin/phyloglmcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected) with an estimate > 0 (estimate < 0 for significant NPA/soil). A cluster was considered a significant PA/RA cluster in Scoary at P < 0.05 in all three tests (Fisher exact test with Benjamini-Hochberg FDR correction, worst pairing scenario test, and empirical test) with an odds ratio > 1 (odds ratio < 1 for NPA/soil).
Pseudomonas PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, or when they were too small or found in all genomes). A cluster was considered significant in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected), and in OrthoFinder + hypergbin/hypergcn at Bonferroni-corrected P < 0.1. A cluster was considered a significant PA/RA cluster in phyloglmbin/phyloglmcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected) with an estimate > 0 (estimate < 0 for significant NPA/soil). A cluster was considered a significant PA/RA cluster in Scoary at P < 0.05 in all three tests (Fisher exact test with Benjamini-Hochberg FDR correction, worst pairing scenario test, and empirical test) with an odds ratio > 1 (odds ratio < 1 for NPA/soil).
Xanthomonadaceae PA/NPA/RA/soil genes/domains. Phylogenetic diversity is the median pairwise distance between the genomes hosting the genes in the cluster. Values for each test are "Y", "N", or "Untested" (clusters were untested when there was insufficient phylogenetic signal, or when they were too small or found in all genomes). A cluster was considered significant in pfam/COG/TIGRFAM/KO + hypergbin/hypergcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected), and in OrthoFinder + hypergbin/hypergcn at Bonferroni-corrected P < 0.1. A cluster was considered a significant PA/RA cluster in phyloglmbin/phyloglmcn at q-value < 0.05 (Benjamini-Hochberg FDR corrected) with an estimate > 0 (estimate < 0 for significant NPA/soil). A cluster was considered a significant PA/RA cluster in Scoary at P < 0.05 in all three tests (Fisher exact test with Benjamini-Hochberg FDR correction, worst pairing scenario test, and empirical test) with an odds ratio > 1 (odds ratio < 1 for NPA/soil).
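The significance thresholds in the taxon-level captions above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`benjamini_hochberg`, `bonferroni`, `call_phyloglm`) and the "NS" label for non-significant clusters are assumptions made for illustration, and only the cutoffs (q-value < 0.05 and the sign of the estimate) come from the captions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def bonferroni(pvalues):
    """Return Bonferroni-corrected p-values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def call_phyloglm(qvalue, estimate, alpha=0.05):
    """Label one cluster from a phyloglm-style q-value and estimate.

    Hypothetical helper: significant with estimate > 0 is PA/RA,
    significant with estimate < 0 is NPA/soil, everything else is "NS".
    """
    if qvalue >= alpha or estimate == 0:
        return "NS"
    return "PA/RA" if estimate > 0 else "NPA/soil"
```

For example, `call_phyloglm(0.02, 1.3)` yields `"PA/RA"`, while the same q-value with a negative estimate yields `"NPA/soil"`.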
Validation of PA/NPA/RA/soil genes through metagenomes. a. Samples used (n = 38), b. Summary of results based on a two-sided t-test.
Validation of PA genes in Paraburkholderia kururiensis M130. a. Mutant used and statistical test results, b. Raw data: cfu/g root, c. Primers used.
The number of operons predicted by different approaches.
Reproducible PA domains. a. Protein domains that are significantly PA in at least three taxa by at least two tests. NA – test results are not available (untested), NS – non-significant result. b. Fractions for LacI proteins within genomes, c. Fraction of pfam00248 domain within genomes.
DNA motifs predicted to be bound by LacI transcription factors. Predicted promoter sequences are intergenic sequences, at least 25 bp long, located upstream of carbohydrate metabolism and transport genes that are found directly adjacent to LacI genes. The most abundant k-mers of different lengths were detected using wordcount (EMBOSS package). The most abundant motifs found in multiple taxa were compared against their distribution in random intergenic sequences using the Fisher exact test.
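The motif search described in this caption (count k-mers in predicted promoters, then test the top motifs against random intergenic sequences) can be sketched in Python as follows. This is an illustrative stand-in, not the EMBOSS wordcount tool itself: `count_kmers` and `fisher_exact_greater` are hypothetical helper names, and the one-sided (enrichment) form of the Fisher exact test is an assumption, since the caption does not state the sidedness.

```python
from collections import Counter
from math import comb


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = promoters with the motif, b = promoters without,
    c = random intergenic sequences with the motif, d = without.
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / comb(n, col1)
```

Calling `count_kmers(promoters, 8).most_common(10)` would list the ten most frequent 8-mers, each of which could then be tested with `fisher_exact_greater`.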
PREPARADOs. Pfam domains that are both significant PA/RA domains (reproducibly found as such in multiple taxa or by multiple approaches) and more abundant in plants than in bacteria according to Pfam (PREPARADOs). Pfams labeled in yellow are carbohydrate-related and are part of proteins found in eukaryotes and bacteria with full-length sequence similarity, having an N-terminal signal peptide and lacking a transmembrane domain. Cells marked in green are domains predicted to be secreted by Sec or T3SS (more than 50% of the bacterial proteins having the domain are predicted to be secreted by these secretion systems).
Full-length proteins conserved between PA bacterial genes and eukaryotic genes. LAST alignment results of PREPARADO-containing proteins from bacteria (query) against plant, fungal, oomycete, and protist proteins from RefSeq (target). Only alignments with over 40% identity that span at least 90% of both the query and target lengths are shown.
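The alignment filter stated in this caption (over 40% identity, at least 90% coverage of both query and target) can be expressed as a short Python sketch. The `Alignment` record and its field names are hypothetical and do not correspond to the LAST output format; only the two cutoffs come from the caption.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one pairwise protein alignment."""
    query_id: str
    target_id: str
    percent_identity: float   # 0-100
    aligned_query_len: int    # residues of the query covered
    query_len: int            # full query length
    aligned_target_len: int   # residues of the target covered
    target_len: int           # full target length


def is_full_length_hit(aln, min_identity=40.0, min_coverage=0.9):
    """Keep alignments above the identity cutoff that span most of both proteins."""
    query_cov = aln.aligned_query_len / aln.query_len
    target_cov = aln.aligned_target_len / aln.target_len
    return (aln.percent_identity > min_identity
            and query_cov >= min_coverage
            and target_cov >= min_coverage)


def filter_alignments(alignments):
    """Return only the alignments that pass is_full_length_hit."""
    return [aln for aln in alignments if is_full_length_hit(aln)]
```

Requiring coverage of both sequences, rather than the query alone, is what excludes hits where a short bacterial protein matches only one domain of a much longer eukaryotic protein.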
Jekyll and Hyde. Gene homologs of Jekyll and Hyde proteins based on protein homologs in IMG. To find all homologs and paralogs of Jekyll and Hyde genes (a-d), we used an IMG BLAST search with an E-value threshold of 1e-5 against all IMG isolates, some of which were not included in the original comparative analysis; hence, their genes are not part of any cluster. Because Hyde1 proteins are rapidly evolving, they are scattered across multiple OrthoFinder orthogroups. Metadata in a-d were retrieved from the IMG website. a. Jekyll protein homologs of Acidovorax gene Ga0102403_10160, b. Hyde1 protein homologs of Acidovorax protein Aave_1071, c. Hyde1-like protein homologs of Pseudomonas protein A243_06583, d. Hyde2 homologs of Ga0078621_123530, e. Hyde1-like-Hyde2 loci in representative Proteobacteria, one per genus, and their location adjacent to T6SS genes and within genomes that encode T6SS. Hyde2 was found based on a BLAST search against the nr database with Acav_4635 as the query.
Divergence of Jekyll gene operon. An analysis of the Jekyll gene cluster that is presented in Figure 6b. Control genes are shown in Figure S26c. The table summarizes a comparison between multiple sequence alignments of the Jekyll locus (Figure S24b) and the control genes (Figure S24c).
Toxicity of Hyde proteins and recovery of prey cells confronted with Hyde-encoding Acidovorax and different mutants. Includes primers used to make Acidovorax deletion strains, strains used as prey and their antibiotic resistance, raw results for cell toxicity and competition assays.
Significant orthogroups (OrthoFinder clusters) supported by three statistical approaches: either hypergbin, phyloglmbin, and Scoary, or hypergcn, phyloglmcn, and Scoary.
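The "supported by three approaches" rule in this caption amounts to a union of two three-way set intersections. A minimal Python sketch follows; the function name `supported_orthogroups`, the dictionary layout, and the orthogroup IDs in the usage note are assumptions for illustration only.

```python
def supported_orthogroups(calls):
    """Return orthogroups significant by three approaches.

    `calls` is a hypothetical dict mapping a test name to the set of
    orthogroup IDs that test called significant. An orthogroup qualifies
    if it is significant by hypergbin, phyloglmbin, and Scoary, or by
    hypergcn, phyloglmcn, and Scoary.
    """
    binary = calls["hypergbin"] & calls["phyloglmbin"] & calls["scoary"]
    copy_number = calls["hypergcn"] & calls["phyloglmcn"] & calls["scoary"]
    return binary | copy_number
```

With made-up IDs, an orthogroup present in the `hypergbin`, `phyloglmbin`, and `scoary` sets is returned even if it is absent from both copy-number sets, whereas one missing from `scoary` is never returned.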