Uneven distribution of cobamide biosynthesis and dependence in bacteria predicted by comparative genomics

The vitamin B12 family of cofactors known as cobamides is essential for a variety of microbial metabolisms. We used comparative genomics of 11,000 bacterial species to analyze the extent and distribution of cobamide production and use across bacteria. We find that 86% of bacteria in this data set have at least one of 15 cobamide-dependent enzyme families, but only 37% are predicted to synthesize cobamides de novo. The distribution of cobamide biosynthesis and use varies at the phylum level. While 57% of Actinobacteria are predicted to biosynthesize cobamides, only 0.6% of Bacteroidetes have the complete pathway, yet 96% of species in this phylum have cobamide-dependent enzymes. The form of cobamide produced by the bacteria could be predicted for 58% of cobamide-producing species, based on the presence of signature lower ligand biosynthesis and attachment genes. Our predictions also revealed that 17% of bacteria have partial biosynthetic pathways, yet have the potential to salvage cobamide precursors. Bacteria with a partial cobamide biosynthesis pathway include those in a newly defined, experimentally verified category of bacteria lacking the first step in the biosynthesis pathway. These predictions highlight the importance of cobamide and cobamide precursor salvaging as examples of nutritional dependencies in bacteria.

full-length BzaC protein led to the conclusion that these short proteins were degraded and nonfunctional versions of BzaC and not an additional family. As such, we were then able to validate the cutoffs of the original model.
To download the protein sequences for each genome in our data set, the JGI/IMGer genome entries were matched to GenBank and RefSeq entries (accessed Aug 2, 2017) (Benson et al., 2013; O'Leary et al., 2016). Entries were matched by BioSample accession, by species name and strain designation, or by IMG taxon ID. Most genomes in the JGI/IMGer data set had protein files available in these databases (10,591 out of 11,436).
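The matching step can be illustrated with a minimal sketch. The field names (`biosample`, `species`, `strain`, `img_taxon_id`) and the function `match_genome` are hypothetical, and trying the three criteria in the order they are listed is an assumption; the text does not state how the matching was implemented.

```python
def match_genome(img_entry, records):
    """Return the first database record that matches an IMG genome entry.

    Assumption: the criteria are tried in the order listed in the text
    (BioSample accession, then species name plus strain, then IMG taxon ID).
    Both arguments use hypothetical dictionary fields.
    """
    key_sets = [("biosample",), ("species", "strain"), ("img_taxon_id",)]
    for keys in key_sets:
        # Skip a criterion if the IMG entry lacks any of its fields.
        if not all(img_entry.get(k) for k in keys):
            continue
        for rec in records:
            if all(rec.get(k) == img_entry[k] for k in keys):
                return rec
    return None  # no protein file could be matched for this genome
```

Genomes for which the function returns `None` would correspond to the entries without an available protein file.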

BLASTP search of putative tetrapyrrole precursor auxotrophs
We queried the genomes of the 201 predicted tetrapyrrole precursor auxotrophs using BLASTP on IMGer with the default settings, returning only hits with an E-value less than 1e-5. For the Alphaproteobacterial ALA synthase HemA, we used the protein from Rhodobacter sphaeroides (GenPept C49845) with a cutoff of an E-value less than 1e-130. We also used Clostridium saccharobutylicum DSM 13864 HemA, HemL, HemB, HemC, and HemD as queries (GenBank: AGX44136.1, AGX44131.1, AGX44132.1, AGX44134.1, AGX44133.4, respectively). For hemA, any hit with an E-value lower than 1e-5 was considered hemA, and all hits observed were annotated as hemA. All hemL hits with an E-value less than 1e-100 were considered hemL. To be an ALA auxotroph, the genome must be missing both hemA and hemL. We did not observe any hits to hemB with an E-value lower than 1e-5 in any genome predicted to be missing the gene from the annotation-based search. For hemC, we required the E-value to be less than 1e-29. There were two genomes that had hits below this cutoff. For hemD, we required hits to have an E-value less than 1e-40.
Since the C. saccharobutylicum HemD is a fusion protein with both the UroIII synthase and UroIII methyltransferase domains, we additionally searched for the Bacillus subtilis HemD, which only has the UroIII synthase activity (UniProtKB P21248.2). We did not observe any hits to the B. subtilis hemD with an E-value lower than 1e-5 in any genome predicted to be missing the gene from the annotation-based search. Genes with high-scoring matches to the C. saccharobutylicum hemD were inspected in the IMG browser to determine whether the methyltransferase and UroIII synthase domains were both present.
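The per-gene cutoffs described in this section can be summarized in a short sketch. It is illustrative only: the names `CUTOFFS`, `genes_found`, and `is_ala_auxotroph` are hypothetical, the input is assumed to be the lowest E-value observed for each query in a genome, and the R. sphaeroides ALA synthase query (cutoff 1e-130) and the manual domain inspection for hemD are left out for brevity.

```python
# E-value cutoffs quoted in the text for the C. saccharobutylicum queries.
# A hit counts only if its E-value is below the cutoff for that gene.
CUTOFFS = {
    "hemA": 1e-5,
    "hemL": 1e-100,
    "hemB": 1e-5,
    "hemC": 1e-29,
    "hemD": 1e-40,
}

def genes_found(best_evalues):
    """Return the set of genes whose best hit passes its cutoff.

    best_evalues maps a gene name to the lowest E-value among its hits
    in one genome; genes with no hit are simply absent from the mapping.
    """
    return {
        gene
        for gene, cutoff in CUTOFFS.items()
        if gene in best_evalues and best_evalues[gene] < cutoff
    }

def is_ala_auxotroph(best_evalues):
    """A genome remains a predicted ALA auxotroph only if both hemA and hemL are missing."""
    found = genes_found(best_evalues)
    return "hemA" not in found and "hemL" not in found
```

For example, a genome whose best hemL hit has an E-value of 1e-50 and which has no hemA hit would still be classed as an ALA auxotroph, because 1e-50 does not pass the 1e-100 cutoff.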

Growth conditions
Clostridium scindens ATCC 35704 was grown at 37°C under 80% N2, 20% CO2 in an anaerobic defined mineral salts medium with the following composition (g/L): NaCl, 1; MgCl2·6H2O, 0.5; KH2PO4, 0.2; NH4Cl, 0.3; KCl, 0.3; CaCl2·2H2O, 0.015. In addition, 2.29 g of N-Tris(hydroxymethyl)methyl-2-aminoethanesulfonic acid (TES, free acid), 2 ml of a trace element solution (He et al., 2007), 1 ml of a Na2SeO3-Na2WO4 solution (Widdel and Bak, 1992), 10 mg of resazurin, and 40 mg each of the amino acids arginine, cysteine, glycine, histidine, isoleucine, leucine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine were added per liter (Lovitt et al., 1987). After the medium was boiled and cooled under N2, the gas was switched to an 80% N2, 20% CO2 mix, and the reductants Na2S·9H2O and L-cysteine were added to final concentrations of 0.2 mM each. Next, 2.52 g NaHCO3 (30 mM final concentration) was added to the medium, and the pH was adjusted to 7.0. The medium was dispensed under 80% N2, 20% CO2 in 10 ml aliquots in 25 ml Balch tubes, or for large volumes, 1 L in 2 L Pyrex bottles. Tubes and bottles were sealed with butyl stoppers and aluminum crimp seals, autoclaved for 30 min, and cooled to room temperature. Glucose was subsequently added to a final concentration of 25 mM, and Wolin vitamin solution (Wolin et al., 1963)
Clostridium sporogenes ATCC 15579 was grown in the same medium and conditions as C. scindens with the following changes: cysteine, serine, and threonine were omitted, and 1 ml of a vitamin solution containing 500 mg/L nicotinic acid, 50 mg/L thiamine HCl, 5 mg/L biotin, and 5 mg/L p-aminobenzoic acid was added per liter of medium (Lovitt et al., 1987).
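The 30 mM bicarbonate concentration quoted for the C. scindens medium can be checked with a short calculation. This sketch assumes that the 2.52 g of NaHCO3 is added per liter of medium and uses a molar mass of 84.007 g/mol for NaHCO3, a value not given in the text; the function name is hypothetical.

```python
def grams_per_liter_to_mM(grams_per_liter, molar_mass_g_per_mol):
    """Convert a mass concentration (g/L) to a molar concentration (mM)."""
    return grams_per_liter / molar_mass_g_per_mol * 1000.0

# Assumed molar mass of NaHCO3 (g/mol); not stated in the text.
NAHCO3_MOLAR_MASS = 84.007

# 2.52 g/L divided by 84.007 g/mol is close to 0.030 mol/L, i.e. about 30 mM.
bicarbonate_mM = grams_per_liter_to_mM(2.52, NAHCO3_MOLAR_MASS)
```

The same conversion applies to any other component of the medium for which a molar mass is supplied.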
Treponema primitia ZAS-2 was grown at room temperature in anaerobic 4YACo medium with a headspace of 80% H2, 20% CO2 as previously described (Graber and Breznak, 2004), with the following changes. For cobalamin-supplemented cultures, the final concentration of cyanocobalamin was reduced from 4.42 µM to 37 nM. To test growth with no addition or with ALA addition, cobalamin-supplemented cultures were serially passaged three times in cobalamin-free medium or in cobalamin-free medium containing 1 mM ALA before being used as inocula for growth experiments. Growth was monitored spectrophotometrically (OD650). All growth experiments were performed in triplicate.
Clostridium phytofermentans ISDg (ATCC 700394) was grown anaerobically at 25°C under an N2 atmosphere in GS-2CB medium (Warnick et al., 2002).

names had been changed and could not be easily matched to the prior download, as the tool output did not include the unique identifier until recently. Sheet 4 is the "genomes vs functions" download for the 55 single copy genes.
Supplementary Table 2: Genomes in the filtered data set. Sheet 1 contains the metadata, Sheet 2 is the "genomes vs function" and "genomes vs genes" downloads for the cobamide-dependent enzyme families, cobamide-independent alternatives, and cobamide biosynthesis genes. Sheet 3 is the "genomes vs functions" download for the 55 single copy genes.
Supplementary Table 3: Completeness analysis by single copy genes for all genomes. For each genome, it gives the number of unique single copy genes (maximum 55), the average number of single copy genes, and lists of the annotations that are either missing or present in multiple copies.
Supplementary Table 4: Annotations used for cobamide-dependent enzyme families and alternatives and query genes used for BLASTP-based search for enzyme families without annotations.
Supplementary Table 5: Results for cobamide-dependent and -independent enzyme families, and cobamide biosynthesis genes and categories for the filtered data set by genome.
Supplementary Table 6: Strains tested for experimental production of corrinoids in Table 1 and the strain used for genome analysis, the cobamide biosynthesis phenotype, and reference for the corrinoid production observation.
Supplementary Table 7: Sheet 1 contains the definitions of the cobamide biosynthesis pathway sections used in classifying genomes. Sheet 2 contains the definitions of cobamide biosynthesis categories.
Supplementary Table 8: Sheet 1 summarizes the results of the hmmsearch for bzaABCDEF. Sheets 2 and 3 show the tabular output from hmmsearch above the trusted cutoff for each HMM for bzaABDEF and bzaC, which was a domain model, for genomes from GenBank and RefSeq, respectively.
Supplementary Table 9: Genomes with identified tandem CobT homologs based on sequential gene identifiers, and analysis of gene lengths. TRUE in column 2 indicates that the tandem annotations were consistent with full-length CobT. The JGI/IMG unique identifier and amino acid length of each gene are listed.
Supplementary Table 10: List of tetrapyrrole precursor salvagers (TPS) checked for missing genes by BLASTP. Genomes in red were excluded as TPS because missing genes completing the tetrapyrrole precursor biosynthesis section were found. The columns for each tetrapyrrole precursor biosynthesis step are populated with results from the search on IMG using annotations. Columns with the BLASTP hits show the bit score and E-value for the highest scoring hit in the genome if found. Results are summarized in the column "Additional genes found by BLAST." If the genome was consistent with TPS, then the specific type is listed.

Supplementary Figures
Supplementary Figure 1: Number of unique single copy genes in the bacterial genomes before and after filtering by unique single copy genes and selecting a single genome for each species.