A genomic catalog of Earth’s microbiomes

The reconstruction of bacterial and archaeal genomes from shotgun metagenomes has enabled insights into the ecology and evolution of environmental and host-associated microbiomes. Here we applied this approach to >10,000 metagenomes collected from diverse habitats covering all of Earth’s continents and oceans, including metagenomes from human and animal hosts, engineered environments, and natural and agricultural soils, to capture extant microbial, metabolic and functional potential. This comprehensive catalog includes 52,515 metagenome-assembled genomes representing 12,556 novel candidate species-level operational taxonomic units spanning 135 phyla. The catalog expands the known phylogenetic diversity of bacteria and archaea by 44% and is broadly available for streamlined comparative analyses, interactive exploration, metabolic modeling and bulk download. We demonstrate the utility of this collection for understanding secondary-metabolite biosynthetic potential and for resolving thousands of new host linkages to uncultivated viruses. This resource underscores the value of genome-centric approaches for revealing genomic properties of uncultivated microorganisms that affect ecosystem processes.

Horizontal transfer of BGCs in nature also likely plays a factor in the fragmentation and lower counts of BGCs in the catalogue. BGCs are well known to be horizontally transferred between bacterial and fungal species, and it is common to see G+C content and codon bias within BGCs that differ dramatically from the host genome [4], as well as integrase or viral remnants or non-coding sequence flanking BGC sequences. Indicators like this are commonly used to interpret BGC boundaries, and some species have seemingly adapted their genome structure to accommodate uptake and regulation of BGCs, resulting in genome regions classically termed "pathogenicity islands" or genomic islands [5]. BGCs are also regularly found on plasmids, another common horizontal transfer vector. Therefore, we expect that many BGCcontaining contigs may have been lost or mis-assigned in the assembly and binning process of constructing the GEM catalogue, resulting in an undercount and/or increased fragmentation of BGCs, with frequent loss of pathogenicity islands and loss of plasmids.

Genome-scale metabolic models
All metabolic models were built and reconstructed using the "Build Metabolic Model" App in KBase: https://narrative.kbase.us/#catalog/apps/fba_tools/build_metabolic_model/release. GEMs and reference genomes were annotated with RAST [6] in KBase, as the ModelSEED pipeline uses RAST functional roles to map genes to biochemical reactions [7,8] . The metabolic models were used to assess pathway presence (as defined by KEGG [9]) as detected by a complete flux pathway within the defined environments. A pathway was determined present or not detected by computing the number of gene-associated functional reactions (GAFRs) in each pathway across all models. GAFRs are defined as reactions in a model that are involved in pathways that offer uninterrupted mass-balanced routes from nutrients to biomass and byproducts. Thus, GAFRs exclude reactions that are part of fragmentary, incomplete, and likely nonfunctional pathways.
The GEMs were expected to have some gaps due to incomplete genome reconstruction, and gaps will occur due to errors and omissions in functional annotations. To address these issues, all GEMs were subjected to a gap filling operation [10] that ensured that every highquality GEM was capable of producing biomass from a least one carbon source. Out of the 3,732 high-quality GEMs, we analyzed metabolic models for a subset of 3,270, excluding MAGs with biome labeled as "other" and biomes with low MAG counts (<40 MAGs). Out of the subset of 3,270, 15 did not successfully complete the gap filling operation, resulting in 3,255 GEMs with metabolic models.
A threshold-based approach was used to define each pathway as being either present or not detected in each GEM and reference genome analyzed. The individual thresholds were assessed by calculating the difference between average and standard deviation of GAFRs for each individual pathway. Pathways above the calculated threshold (Table S14) are considered "present" for a given model/organism. Only pathways with five or more GAFRs were considered in this study to account for smaller linear pathway definitions by KEGG. Based on this analysis, the fraction of GEMs in each environment that were determined to possess active pathways are shown in Figure S7. Three scenarios were detected: (1) pathways effectively present in all genomes in the environment (light color cells); (2) pathways not detected in any genomes (dark color cells); and (3) pathways present in some genomes but not others. This corresponds with pathways that are likely essential, pathways that likely contribute little to fitness, and pathways that may contribute to potential cometabolism and trophic dependency within the microbial community. Differences can also be seen in patterns of pathway presence between environments, although similar environments do cluster together (e.g., human and mammal). To validate the high-quality GEM metabolic models, pathway presence profiles were computed for reference genomes associated with humans and the built environment, as these two environments have >100 GEMs with associated reference genomes ( Figure S8). The resulting profiles were nearly identical for all pathways. Pearson correlation coefficients were calculated for each GEM and corresponding reference genome across 55 metabolic pathways, with an average value >0.98. When the GEM and reference genomes were randomly paired and a Pearson correlation was calculated, the average correlation dropped to ~0.82, indicating that the high correlation previously obtained reflects the similarity of the GEM and reference genome. All data and calculations used in these analyses are available in the Table S14.

Custom species and gene trees for mcrABG-encoding Archaea
The 15 archaeal GEMs which encoded for the mcrABG operon were added to a representative set of 652 publicly available archaeal genomes downloaded from the Integrated Microbial Genomes and Microbiomes (IMG/M) online database [11] (database accessed May 2019) and the GTDB [12] . A species tree was built from 56 universal marker proteins from the COG database [13] which were identified with hmmsearch v3.1b2 [14] using a specific hidden Markov model for each of the markers. For every protein, alignments were prepared with MAFFT v 7.294b [15] and subsequently trimmed with BMGE using BLOSUM30 [16]. Genomes with less than 20 marker proteins and genomes that had more than five duplicated marker proteins were removed from the alignment. Aligned marker proteins were concatenated to a supermatrix which comprised 17,240 informative sites and a phylogenetic tree was built with IQ-TREE v1.6.12 [17] using model LG4X+F with the ultrafast bootstrap option [18]. The phylogenetic tree was visualized with iTol v5 [19].
McrA (MCR_alpha), McrB (MCR_beta), and McrG (MCR_gamma) were identified in the same set of genomes used to build the species tree of the Archaea using models from the PfamA database v29.0 [20]. Significant hits for McrABG were extracted from genomes which encoded all three genes. Sequences were aligned with MAFFT v7.294b [15], trimmed with trimal v1.4 [21] to remove positions with more than 90% of gaps (-gt 0.1). A phylogenetic tree was built from a concatenated alignment of McrABG (1,199 informative positions) with IQ-tree v1.6.12 using model LG+F+R5 chosen based on the model test feature [22]. The phylogenetic tree was visualized with iTol v5 [19].

Benchmarking host-prediction methods
We used MAGs from the GEM catalogue to predict hosts for IMG/VR viruses using a combination of CRISPR targeting and prophages (see Methods in main text for details). Each MAG was taxonomically annotated by the GTDB, enabling us to link each virus to a host at a variety of taxonomic ranks (phylum, class, order, family, genus, and species). The predicted host taxonomy for viruses was then validated using two separate approaches.
We first compared the host taxonomy of viruses determined by genome sequences matches (i.e. prophages) versus predictions based on spacer matches. For each method, and at each taxonomic rank, the predicted host was determined based on the most commonly observed taxon. These predicted hosts were then compared between methods. For viruses with hosts predicted by both methods, we observed agreement at the following ranks: phylum (91.9%), class (91.8%), order (88.5%), family (82.4%), genus (73.9%), species (53.2%). We found that viruses matching MAG contigs 'end-to-end' often had discordant predictions with CRISPR spacers. We reasoned that these contigs of entirely viral origin may be incorrectly assigned to a MAG during the binning process, likely because of differences in GC content and codon usage between viral and host genome, and/or because of differences in genome copy number (i.e. coverage) for viruses actively replicating. When only considering MAG contigs >1.5x the length of matched viruses, the agreement with CRISPR spacers was increased: phylum (96.9%), class (96.9%), order (94.9%), family (88.6%), genus (79.3%), species (56.3%). Finally, we found that agreement between methods further increased when only considering "confident" predictions by each method (>10 virus-host connections with >90% agreement within-method): phylum (99.2%), class (99.2%), order (99.0%), family (98.7%), genus (98.0%), species (98.6%).
Second, we evaluated the "purity" of predicted hosts for each virus at different taxonomic ranks. This was performed for each prediction method. Purity was defined at each taxonomic rank by (1) identifying the most common host taxon for a virus, and (2)   MAGs from the GEM catalogue were clustered with reference genomes into OTUs at 95% ANI. Each point represents one OTU containing at least one GEM and one isolate reference genome. In both panels, the x-axis indicates the average observed genome length of isolates genomes. In the left panel, the y-axis indicates the average observed genome length of MAGs, while this is normalized for estimated genome completeness on the right. The average genome sizes of MAGs and isolate genomes from the same species is highly correlated with a slope close to one, indicating no systematic loss or gain of gene content. Figure S3. MAGs increase the mappability of metagenomes. Sequencing reads from 3,170 metagenomes were mapped to a database containing 52,515 MAGs from the GEM catalogue and 151,730 isolate genomes from NCBI RefSeq. The barplot shows the percent of high-quality reads mapping to genomes from each database with >95% identity. The number of metagenomes is indicated in parenthesis.       and previously published archaeal genomes which encode for the mcrABG operon (highlighted with a blue star). Clades that predominantly contain genomes that encode for the mcrABG operon are highlighted in blue and also labeled with a blue star. The tree is rooted at the DPANN supergroup. Scale bar indicates substitutions per site. B) Unrooted maximum likelihood phylogenetic tree (IQ-tree, LG4X+F) inferred from a concatenated alignment of McrABG. GEMs are highlighted in orange. Clades that predominantly contain genomes that encode for the mcrABG operon are highlighted in blue. The archaeal genome highlighted in red (a member of TACK supergroup in species tree) most likely acquired its mcrABG operon from euryarchaeotal GEMs (Hadesarchaeota in the species tree) through horizontal gene transfer. Branches with low bootstrap support (<70%) are indicated by filled red circles, branches with medium support (between 70-90%) are indicated by filled orange circles, branches without filled circles are highly supported (>90%). Scale bar indicates substitutions per site. Figure S11. Schematic comparing the organization of putative T4SS in novel Coxiella spp. GEM (e.g., bin ID: 3300021338_21) with isolate C. burnetii icm-dot CDS. Genes are shown as arrows and colored by predicted function: blue, T4SS components; gray, conserved hypothetical proteins; red, IS111 element transposase, black, response regulator. Majority of the Coxiella spp. operon is located on a single scaffold (Scaffold ID: Ga0213839_1000116). Dotted lines connect this scaffold to two disparate genomic scaffolds (Ga0213839_1000281 and Ga0213839_1000243) containing putative orthologs of the remainder of the T4SS components. Two copies of DotA are detected in the GEM. Additional GEMs include 3300021341_23 and 3300021343_16.  Microviridae phylogeny with associated host information. For each clade of 3 or more virus sequences associated with the same host group, the main host is indicated next to the clade along with the number of sequences linking this Microviridae clade to this host group, first in the reference sequences, then in the GEM dataset. Clades are colored according to the origin of the host information, and new host groups identified only from the GEM catalogue are highlighted in bold. All nodes with < 50% support values are displayed as multifurcation, and nodes with > 80% support are indicated with a black dot. B) Subfamily-level representation of the Inoviridae diversity as a genome network. The original network was previously generated in [2]. Briefly, subfamilies are represented by circle nodes and connected to protein clusters, represented as square nodes. Subfamilies are grouped by family, which are indicated on the network along with the major host associated with this family. Subfamilies for which host information could be obtained only from the GEM catalogue are highlighted with a bold outline and linked to the specific GEM-derived host prediction when these host groups had not been associated with inoviruses yet (Fibrobacteres).