Genome reduction in an abundant and ubiquitous soil bacterial lineage

Although bacteria within the Verrucomicrobia phylum are pervasive in soils around the world, they are underrepresented in both isolate collections and genomic databases. Here we describe a single verrucomicrobial phylotype within the class Spartobacteria that is not closely related to any previously described taxa. We examined >1000 soils and found this spartobacterial phylotype to be ubiquitous and consistently one of the most abundant soil bacterial phylotypes, particularly in grasslands, where it was typically the most abundant phylotype. We reconstructed a nearly complete genome of this phylotype from a soil metagenome for which we propose the provisional name ‘Candidatus Udaeobacter copiosus’. The Ca. U. copiosus genome is unusually small for soil bacteria, estimated to be only 2.81 Mbp compared to the predicted effective mean genome size of 4.74 Mbp for soil bacteria. Metabolic reconstruction suggests that Ca. U. copiosus is an aerobic chemoorganoheterotroph with numerous amino acid and vitamin auxotrophies. The large population size, relatively small genome and multiple putative auxotrophies characteristic of Ca. U. copiosus suggest that it may be undergoing streamlining selection to minimize cellular architecture, a phenomenon previously thought to be restricted to aquatic bacteria. Although many soil bacteria need relatively large, complex genomes to be successful in soil, Ca. U. copiosus appears to have identified an alternate strategy, sacrificing metabolic versatility for efficiency to become dominant in the soil environment.

as being among the most numerically abundant taxa in soil (2, 3), we know very little about the ecological or genomic attributes that contribute to their success. The phylum Verrucomicrobia is highly diverse and its members possess a broad range of metabolic capabilities. For example, members of the class Methylacidiphilae are nitrogen-fixing acidophiles capable of methane oxidation (8) while Akkermansia muciniphila of the class Verrucomicrobiae is a mucin-degrading resident of the human gut linked to reduced host obesity (9). However, the dominant Verrucomicrobia found in soil typically belong to the class Spartobacteria. For example, while Verrucomicrobia accounted for >50% of all bacterial 16S rRNA gene sequences in tallgrass prairie soils in the United States, >75% of these sequences were assigned to the class Spartobacteria (10). Currently, the class Spartobacteria contains only a single described and sequenced isolate, Chthoniobacter flavus, a slow-growing aerobic chemoorganoheterotroph capable of using common components of plant biomass for growth (11, 12). While Spartobacteria are prevalent in soils, they have also been observed in marine systems ('Spartobacteria baltica', 13) and as nematode symbionts (genus Xiphinematobacter, 14).

Here we report the distribution of a dominant Spartobacteria lineage, compiling data from both amplicon and shotgun metagenomic 16S rRNA gene surveys to quantify its relative abundance across >1000 unique soils. We assembled a near-complete genome of this lineage from a single soil where it was exceptionally abundant. These results provide our first glimpse into the phylogeny, ecology, and potential physiological traits of a dominant soil Verrucomicrobia and suggest that members of this group are efficient at growing and persisting in the low resource conditions common in many soil microenvironments.

Results and Discussion

Distribution of the dominant Verrucomicrobia in soil

A single spartobacterial clade dominates bacterial communities found in a wide range of soil types across the globe. One phylotype from this group of Spartobacteria represented up to 31% of total 16S rRNA gene sequences recovered from prairie soils (10). This phylotype shares 99% 16S rRNA gene sequence identity with a ribosomal clone named 'Da101', first described in 1998 as a particularly abundant 16S rRNA sequence recovered from a grassland soil in the Netherlands (15). To determine if the Da101 phylotype (termed 'Da101' herein) is abundant in other soils, we re-analyzed amplicon 16S rRNA gene sequence data obtained from >1000 soils representing a wide range of soil and site characteristics (Table S1). We found that Da101 was on average ranked within the top two most abundant bacterial phylotypes in each study (Fig. 1). In over 70% of the soils analyzed Da101 was within the top ten most abundant phylotypes. Interestingly, phylotypes belonging to the same family as Da101 (Chthoniobacteraceae) were also found within the top 5 most abundant phylotypes of several studies (Fig. 1).
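The ranking of a phylotype within a sample, as used in the paragraph above, can be illustrated with a minimal Python sketch. The phylotype labels and read counts are hypothetical, and the helper `phylotype_rank` is not part of the original analysis pipeline; it only shows one plausible way to derive a rank and a relative abundance from a per-sample count table.

```python
def phylotype_rank(counts, target):
    """Return (rank, relative_abundance) of `target` within one sample.

    `counts` maps phylotype label -> read count (hypothetical data).
    Rank 1 is the most abundant phylotype; tied phylotypes share the best rank.
    """
    total = sum(counts.values())
    if total == 0 or target not in counts:
        raise ValueError("target phylotype missing or sample is empty")
    target_count = counts[target]
    # Rank = 1 + number of phylotypes that are strictly more abundant.
    rank = 1 + sum(1 for c in counts.values() if c > target_count)
    return rank, target_count / total


# Hypothetical rarefied sample with 1000 reads in total.
sample = {"Da101": 310, "OTU_2": 120, "OTU_3": 95, "OTU_4": 475}
rank, rel_abundance = phylotype_rank(sample, "Da101")
print(rank, rel_abundance)
```

Averaging such ranks over all samples in a study would give a per-study mean rank of the kind reported in Fig. 1, under the assumption that every sample was rarefied to the same depth.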

As some 16S rRNA gene PCR primer sets can misestimate the relative abundance of Verrucomicrobia (3, 16), we investigated whether the apparent numerical dominance of Da101 in the amplicon datasets was a product of PCR primer biases. To do so, we quantified the abundance of Da101 16S rRNA genes within 75 previously published soil shotgun metagenomes (17, 18). The relative abundance of Da101 in amplicon data was reasonably well correlated with the relative abundance of Da101 determined from shotgun metagenomic data (P<0.0001, ρ=0.50). Confirming the amplicon-based results (Fig. 1), we found that Da101 was among the most abundant phylotypes observed in the soil bacterial communities characterized via shotgun metagenomic sequencing (Fig. S1). Thus, we conclude that the numerical dominance of Da101 in soils is not simply a product of primer biases.
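The comparison of amplicon-based and shotgun-based abundances is reported as a rank correlation (ρ). As a minimal sketch, assuming ρ denotes Spearman's coefficient, the calculation can be written in plain Python as the Pearson correlation of average ranks. The abundance values below are invented for illustration and are not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical relative abundances of one phylotype in five soils.
amplicon = [0.02, 0.10, 0.05, 0.31, 0.08]
shotgun = [0.01, 0.07, 0.06, 0.20, 0.03]
rho = spearman_rho(amplicon, shotgun)
print(round(rho, 3))
```

A rank-based statistic is a natural choice here because relative abundances from two sequencing approaches need not be linearly related, only monotonically related. The sketch does not compute a P value.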

Despite Da101 being one of the most abundant phylotypes found in soil, its proportional abundance can vary significantly across soil types (Fig. 1 and S1). We used metadata associated with each soil sample to determine which of the measured soil and site characteristics best predicted the relative abundance of Da101. We found that Da101 was significantly more abundant in grassland soils than in forest soils (P<0.0001, two-tailed t-test, Fig. S2); on average, Da101 is six times more abundant in grassland soils. These findings suggest that the soils in which Da101 excels do not overlap with those forest soils dominated by non-symbiotic Bradyrhizobium taxa (6). Across the grassland soils included in our meta-analysis, the relative abundance of Da101 was positively correlated with both soil microbial biomass (P<0.0001, ρ=0.57, Fig. S3) and aboveground plant biomass (P<0.0001, ρ=0.47, Fig. S3). Together, these results indicate that Da101 prefers soils receiving elevated amounts of labile carbon inputs. We did not identify any consistent correlations between the abundance of Da101 and other prokaryotic or eukaryotic taxa, suggesting that Da101 is unlikely to be a part of an obligate pathogenic or symbiotic relationship.

Diversity of soil Verrucomicrobia

We determined the phylogenetic placement of Da101 and other soil Verrucomicrobia by assembling near full-length 16S rRNA gene sequences from six distinct grassland soils collected from multiple continents (Fig. 2

completeness for each of the 378 taxa using the same method as for Ca. U. copiosus and found the mean estimated genome size of these taxa to be 5.28 ± 2.15 Mbp (mean ± SD), which is nearly identical to metagenomic-based estimates of mean genome size for soil microbes (24). Strikingly, the 2.81 Mbp genome of Ca. U. copiosus is ~50% smaller than the mean genome size of these 378 taxa and only 13% of these genomes were smaller than the genome of Ca. U. copiosus.
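The size comparison above can be made concrete with a short Python sketch. The text available here does not spell out how genome size was derived from completeness, so the helper `estimated_genome_size` assumes the common convention of dividing assembled length by the estimated completeness fraction; that formula and the example inputs to it are assumptions for illustration only. The second helper reproduces the relative difference between the two sizes quoted in the text (2.81 Mbp and 5.28 Mbp).

```python
def estimated_genome_size(assembled_mbp, completeness):
    """Scale an assembly length by its completeness fraction (0 < c <= 1).

    Assumed convention, not confirmed by the source:
    estimated size = assembled length / completeness.
    """
    if not 0 < completeness <= 1:
        raise ValueError("completeness must be in (0, 1]")
    return assembled_mbp / completeness


def fraction_smaller(size_mbp, reference_mbp):
    """Fraction by which `size_mbp` falls below `reference_mbp`."""
    return (reference_mbp - size_mbp) / reference_mbp


# Hypothetical assembly: 3.0 Mbp recovered at 75% completeness.
print(estimated_genome_size(3.0, 0.75))

# Values quoted in the text: 2.81 Mbp versus a mean of 5.28 Mbp.
reduction = fraction_smaller(2.81, 5.28)
print(f"{reduction:.1%}")
```

With the quoted values the reduction comes to roughly 47%, in line with the approximate "~50%" given in the text.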

Although soil bacteria with larger genomes tend to be more common in soil, Ca. U. copiosus is a notable exception to this pattern. We linked the genome size of each of the matched IMG bacterial genomes with the average abundance of their corresponding amplicon sequence from Leff et al. (2015) and found that genome size is positively correlated with average relative abundance (P<0.001, ρ=0.37, Fig. 3). That is, bacteria with large genomes tend to comprise a significantly larger proportion of soil bacterial communities. On average, the genomes of soil prokaryotes are larger than those inhabiting aquatic ecosystems (25) or the human gut (26). These relatively large genomes are thought to provide soil-dwelling bacteria with a more diverse genetic inventory to enhance survival in conditions where resources are diverse, but sparse (27, 28). However, Ca. U. copiosus has a conspicuously reduced genome given its numerical abundance (Fig. 3). This suggests that Ca. U. copiosus occupies a niche space that does not require expansive functional diversity and points to an alternative route to success for soil bacteria. These results also suggest that abundant, uncultivated soil bacteria may have smaller genomes than the cultivated taxa that represent the vast majority of available genomic data. A similar pattern has been observed in aquatic systems where uncultivated taxa often have smaller genomes than cultivated taxa (29). Because the majority of available genomic information is derived from cultivated bacterial taxa, the lack of genomic information from bacteria with reduced genomes likely stems from challenges associated with culturing taxa with reduced genomes (25

Genes encoding the biosynthesis of all branched-chain (isoleucine, leucine and valine) and aromatic (tryptophan, tyrosine and phenylalanine) amino acids were conspicuously absent from the Ca. U. copiosus genome. Additionally, the biosynthetic pathways for arginine and histidine were also absent. These eight amino acids are among the most energetically expensive to make (Fig. S4, 31), suggesting that their acquisition from the environment offers Ca. U. copiosus an energetic savings relative to taxa that synthesize them de novo. In contrast, Ca. U. copiosus does have the complete suite of genes for the biosynthesis of those amino acids that are energetically less expensive to make (including alanine, aspartate, and glutamate, Fig. S4).

The abundance and cosmopolitan distribution of Ca. U. copiosus (Fig. 1), together with its small genome size relative to other soil microbes (Fig. 3), suggest that it is undergoing streamlining selection to minimize genome size. The genome-streamlining hypothesis proposes that, in large bacterial populations, reduced genome complexity is a trait under natural selection, especially in environments where nutrients are sparse and can periodically limit growth (25). All contemporary free-living organisms with streamlined genomes inhabit aquatic environments (25, 35). However, compared to these aquatic environments, soils are more heterogeneous (36), have higher overall microbial diversity (37), and slower carbon turnover (38). Therefore, the functional complexity required by soil microbes to succeed within a given niche is likely large relative to that required by aquatic microbes. This means that the effects of genome streamlining are likely to be most evident (i.e., result in smaller genomes) in aquatic environments and that we might expect genome reduction to be relatively uncommon across soil taxa. This expectation is consistent with the fact that, on average, the genomes of aquatic microbes are smaller than their terrestrial counterparts (25). However, the small genome and numerous putative auxotrophies of Ca. U. copiosus show that genome streamlining is not unique to aquatic organisms and that genome streamlining may also confer a selective growth advantage in the soil environment.

Genome streamlining in Ca. U. copiosus has resulted in reduced catabolic and biosynthetic capacity, and thus a loss of metabolic versatility. The absence of multiple costly amino acid and vitamin biosynthetic pathways from the Ca. U. copiosus genome implies that these compounds can be acquired from the soil environment. Several studies have shown that free amino acids are present in soil (39, 40), although oligopeptides are reported to be more abundant in grasslands and may be assimilated with kinetics similar to free amino acids (41). The enrichment of proteases and amino acid and peptide importers in the Ca. U. copiosus genome suggests that it is well equipped to assimilate this fraction of soil organic matter. Dispensing with the capacity to synthesize costly amino acids and vitamins likely provides Ca. U. copiosus a growth advantage in resource-limiting conditions when competition for labile carbon is high. Alternatively, many of the putative amino acid auxotrophies described here are involved in synergistic growth (42) and may be supplied by other microbes as common community goods (43). Based on the few spartobacterial isolates that have been cultivated (11), culture-independent studies (10, 44), and the genomic data presented here, we speculate that Ca. U. copiosus is a small, oligotrophic soil bacterium that reduces its requirement for soil organic carbon by acquiring costly amino acids and vitamins from the environment.

Conclusions

Whereas successful soil microbes are predicted to have large genomes (27, 28, Fig. 3

Colorado BioFrontiers Institute Next-Gen Sequencing Core Facility. Sequences were processed as described previously (17). In brief, we used a combination of QIIME (47) and UPARSE (48) to quality-filter, remove singletons, and merge paired reads. Sequences were assembled into phylotypes at the 97% identity level using UCLUST (49). Taxonomy was assigned using the Greengenes 13_8 database (50) and the Ribosomal Database Project classifier (4) and each dataset was rarefied independently (Table S1).

As PCR primer biases can misestimate the relative abundances of Verrucomicrobia (3, 16), we also estimated the abundances of the Da101 phylotype directly from shotgun metagenomic data. We used Metaxa2 with default settings (51) to extract bacterial 16S rRNA gene sequences from shotgun metagenomic data compiled from previous analyses of 75 different soils (data from 17, 18). Extracted 16S rRNA gene fragments were matched to Greengenes full-length sequences at 99% ID using the usearch7 command usearch_global. The matched Greengenes sequences were then clustered and assigned taxonomy as described above.
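The rarefaction step mentioned above (each dataset rarefied independently) amounts to subsampling every sample to a common read depth without replacement. The following Python sketch shows the idea on one sample. The function name `rarefy`, the depth, the seed, and the counts are all assumptions for illustration; the study's actual depths are listed in Table S1 and its rarefaction was performed within the cited tools.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from one sample.

    `counts` maps phylotype label -> read count. Returns a new dict
    holding the subsampled counts (phylotypes drawn zero times are omitted).
    """
    # Expand the count table into one entry per read.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


# Hypothetical sample with 100 reads, rarefied to 40 reads.
sample = {"Da101": 50, "OTU_2": 30, "OTU_3": 20}
rarefied = rarefy(sample, 40, seed=1)
print(rarefied, sum(rarefied.values()))
```

Expanding counts into a read pool is simple but memory-hungry for deep samples; for real datasets a multivariate hypergeometric draw would be the more economical equivalent.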

Describing the phylogenetic diversity of soil Verrucomicrobia

We reconstructed near full-length 16S rRNA gene sequences to construct a phylogeny of soil Verrucomicrobia from six soil samples (see Table S2) that were selected to represent geographically distinct grasslands with a range of verrucomicrobial abundances. We extracted DNA from each of these soils as described previously (17) and used the 27f/1392r primer pair (52) to amplify near full-length 16S rRNA genes as described in (19). The amplicons were sheared using the Covaris M220 (Covaris, Woburn, MA) and the 16S rRNA gene libraries were prepared using TruSeq DNA LT library preparation kits (Illumina, San Diego, CA). Samples were pooled and sequenced on an Illumina MiSeq (2x300bp) at the University of Colorado Next Generation Sequencing Facility.

After quality filtering of sequences, near full-length SSU sequences were reconstructed using EMIRGE (19). After 40 iterations, sequences were merged into phylotypes with ≥97% similarity. Reconstructed sequences were trimmed to 1200 bp and all sequences were further clustered at 95% identity due to gaps in some assemblies. Full-length 16S rRNA sequences from named verrucomicrobial isolates were aligned along with the reconstructed sequences using PyNAST (53). A UPGMA tree was constructed using the R packages seqinr and phangorn and visualized with GraPhlAn (R 3.2.2, version 0.9.7).
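The tree in this study was built with R packages; the sketch below is not that code but a plain-Python illustration of the UPGMA procedure itself, to show what the method does with a pairwise distance matrix. The labels and distances are invented, and the function `upgma` is a hypothetical helper. Distances could, for example, be taken as one minus the pairwise identity of aligned sequences.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster `names` by UPGMA and return the topology as nested tuples.

    `dist` maps frozenset({a, b}) -> distance for every pair of names.
    Branch lengths are omitted to keep the sketch short.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average,
        # which equals the mean over all leaf pairs (the UPGMA rule).
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical pairwise distances among four sequences.
pairs = {
    ("A", "B"): 0.02, ("A", "C"): 0.10, ("B", "C"): 0.12,
    ("A", "D"): 0.30, ("B", "D"): 0.32, ("C", "D"): 0.28,
}
dist = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(["A", "B", "C", "D"], dist)
print(tree)
```

In this toy example A and B join first, C joins that pair next, and D attaches last, which mirrors how closely related phylotypes group before more divergent lineages are added.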

Assembly and annotation of a genome from the dominant soil verrucomicrobial phylotype

We assembled the genome of 'Candidatus Udaeobacter copiosus' from a metagenome of a U.S. prairie soil sample (NTP21, Hayden, IA) estimated to have particularly high abundances of bacteria within the Da101 clade (10). Fragmented DNA extracted from this soil was prepared for sequencing using WaferGen

Verrucomicrobia, the genome was selectively re-assembled using Velvet with a k-mer size of 59, and expected k-mer coverage of 11.5 (range 7.5 to 15.5).
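Velvet works with k-mer coverage rather than per-base coverage, and its manual relates the two as Ck = C × (L − k + 1) / L, where C is nucleotide coverage, L is read length and k is the k-mer size. The Python sketch below applies that relation to the k-mer size and expected k-mer coverage quoted above. The read length is not stated in the text available here, so the value of 100 bp is purely a placeholder assumption, and the resulting nucleotide coverage should not be read as a result of the study.

```python
def kmer_coverage(nucleotide_cov, read_len, k):
    """Velvet's relation: Ck = C * (L - k + 1) / L."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return nucleotide_cov * (read_len - k + 1) / read_len


def nucleotide_coverage(kmer_cov, read_len, k):
    """Inverse of the relation above: C = Ck * L / (L - k + 1)."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return kmer_cov * read_len / (read_len - k + 1)


K = 59                  # k-mer size reported in the text
EXPECTED_KMER_COV = 11.5  # expected k-mer coverage reported in the text
READ_LEN = 100          # ASSUMPTION: placeholder read length, not from the source

implied = nucleotide_coverage(EXPECTED_KMER_COV, READ_LEN, K)
print(round(implied, 2))
```

The sketch mainly illustrates why a large k relative to read length makes k-mer coverage much lower than per-base coverage: only L − k + 1 of the k-mers starting within a read fit entirely inside it.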