Genome-scale metabolic reconstructions of multiple Salmonella strains reveal serovar-specific metabolic traits

Salmonella strains are traditionally classified into serovars based on their surface antigens. While increasing availability of whole-genome sequences has allowed for more detailed subtyping of strains, links between genotype, serovar, and host remain elusive. Here we reconstruct genome-scale metabolic models for 410 Salmonella strains spanning 64 serovars. Model-predicted growth capabilities in over 530 different environments demonstrate that: (1) the Salmonella accessory metabolic network includes alternative carbon metabolism, and cell wall biosynthesis; (2) metabolic capabilities correspond to each strain’s serovar and isolation host; (3) growth predictions agree with 83.1% of experimental outcomes for 12 strains (690 out of 858); (4) 27 strains are auxotrophic for at least one compound, including l-tryptophan, niacin, l-histidine, l-cysteine, and p-aminobenzoate; and (5) the catabolic pathways that are important for fitness in the gastrointestinal environment are lost amongst extraintestinal serovars. Our results reveal growth differences that may reflect adaptation to particular colonization sites.

D-tagatose is a natural sweetener that has been shown to be incompletely absorbed in the small intestine and subsequently fermented in the colon in both human, murine and other non-human primates 2 . It is also a product of human gut microbiota derived degradation of the mucus layer 3 . The tagatose transport and utilization operon consists of 9 ORFs. The disruption of 3 genes in the tagatose operon has been shown to be important for intestinal infection of cattle, chicken and pig but not intravenous infection of mice 4 . It was shown that knocking out one of those three genes results in abrogated growth of a Salmonella strain on M9 minimal medium + 1% D-tagatose 5 . Taken together these findings suggest that the microenvironment of extra-intestinal pathogens does not contain D-tagatose and that these pathogens are unaffected by the loss of the capability to uptake and utilize this nutrient.

Supplementary Note 3: Analysis of the growth predicted profiles that did not agree with experimental evidence
The first GEM for serovar Dublin strain CT_02021853 falsely simulated growth on proline and tricarballylate. The second GEM for serovar Agona strain SL483 falsely simulated an incapability to grow on glycolate and D-Tagatose and the third GEM for serovar Schwarzengrund strain CVM19633 falsely simulated growth on serine. We found that the false no growth phenotypes were due to the absence of homologous sequences for the PTS tagatose transporters (STM325-STM3256) as well as a D-tagatose 1-phosphate kinase (STM3254) and glycolate oxidase (STM1620). Interestingly, an STM3254 deletion mutant was shown to be incapable of growth on M9 minimal medium + 1% tagatose suggesting that strain SL483 may have an alternative pathway for the utilization of tagatose 5 . Conflicting evidence was found for the successful utilization of glycolate across Salmonella strain. Thus far, there have been no molecular studies detailing out the glycolate utilization biochemical pathway in Salmonella . We notice that the annotated pathway in the Salmonella GEM mirrors the pathway that has been characterized in E. coli which means that it was probably added as a result of observed sequence homology of the genes involved in the pathway. Glycolate uptake and utilization is mediated by a multi-protein complex in E-coli (including glcDEF ). However only one gene (STM1620) was found to correspond to such a function in Salmonella . Additionally, while it is annotated in the model as a glycolate oxidase, its genome annotation was that of a putative lactate oxidase. We noticed that STM1620 is only found in 46% of the genomes included in this data-set. We also found conflicting reports of the capability of several Salmonella strains (including str. LT2) to successful utilize glycolate 6-8 . Intriguingly, it was shown that STM1620 highly contributes to fitness of Salmonella strains across different hosts 9 . It is probable that the exact function of STM1620 was not correctly annotated in pan-STM.1.v2 , therefore we decided to assign to it a confidence score of 1 (as per the established protocol for building genome-scale reconstructions 10 ). We also modified Figure 4.b (now 4.A) by excluding glycolate utilization as a differentiating metabolic capability because it could mislead the reader. The validation of predictions with experimental data precisely serves the function of highlighting such knowledge gaps.

Supplementary Note 4: A total of 531 nutrient environments highlight 242 conditionally essential genes:
To link genes in the reconstructed Salmonella pan reactome to the conferred catabolic capabilities, we searched for gene essentiality in all simulated nutrient environments. Here, we defined conditionally essential genes (CEG) as those genes found to be essential in one of the nutrient conditions but not in aerobic glucose+M9 minimal medium. Of the 531 nutrient environments 242 were anaerobic and 289 were aerobic. We identified a total of 242 predicted CEGs of which 195 and 217 were essential in at least one aerobic and at least one anaerobic nutrient condition respectively (with some genes being essential in both). Conditionally essential genes comprise genes involved in the uptake and catabolism of a nutrient source but also genes that participate in central metabolism. For instance, the four genes of the fumarate reductase complex ( frdABCD ) were conditionally essential in 82.2% of the anaerobic nutrient conditions and several genes involved in glycolysis (including tpiA , gapA , pgk, eno and ppc ) were classified as CEGs in various aerobic and/or anaerobic nutrient conditions (Fig 6A). Fumarate reductase is the terminal enzyme for anaerobic respiration and several catabolic pathways connect to central metabolism by feeding into glycolysis. Interestingly, phosphoenolpyruvate carboxylase (PPC) was essential in 85.1% of the simulated anaerobic conditions. PPC synthesizes oxaloacetate (a four carbon sugar) by adding bicarbonate to phosphoenolpyruvate (a three carbon sugar). It was shown that PPC Typhimurium mutants fail to grow on pyruvate or its precursors and seems to serve the only anaplerotic function by replenishing the supply of oxaloacetate which is drained away by the biosynthesis of amino acids from the tricarboxylic acid cycle 11 . Triose phosphate isomerase (TPI) was predicted to be conditionally essential in 30.4% of the simulated aerobic and anaerobic growth environments. Interestingly, TPI Typhimurium mutants were found to have decreased fitness in a mouse model of typhoid fever 12 . We further identified 78 and 72 CEGs that were essential in only one aerobic and anaerobic nutrient condition respectively. Xylose isomerase ( xylA ) is one such example that is categorized as CEG when the main carbon source is xylose. We refer to CEGs that are found to be essential in less than 5 nutrient conditions as sCEGs and those to be essential in more than 5 nutrient conditions as mCEGs. While sCEGs participate in nutrient catabolic pathways, mCEGs are located deeper in the network.

Supplementary Note 5: Heap's law and the comparison of pan genome curves:
We constructed pan genome curves for four data-sets described in methods 1.D. We subsequently fit each of the curves with Heap's law using the "curve_fit" function from the Scipy toolkit 13 .
where N is the number of gene families, ∅ Is the average number of gene families per genome, and k , are the parameters to be fit. If is smaller than 1, the pan genome is γ γ said to be "closed" (meaning that after a sufficient number of genomes, no new additional gene families will be encountered). Based on fitting Heap's curve to the Salmonella pan genome curve (built from all 410 genomic sequences), we obtained a γ of 0.512 ( Fig. S3 ). Previous efforts in the field concluded that Salmonella has a "closed" pan genome. However, when we built a pan genome curve from a subset of 41 genomes, we obtained a of 0.627. In order to identify the sensitivity of the Heap's law γ parameters, we fitted Heap's law to pan genome curves built using 10, 20, 30 and up to 410 genomes. The factor obtained by fitting Heaps' law to the sampled curves had a γ large standard deviation and decreased with the number of genomic sequences included in the data set ( Fig. S8 ) making the method of fitting Heap's law to the pan genome curve an unreliable predictor of the pan genome expansion (or the expected number of novel gene families encountered at a future genomic addition). The phylogenetic distance was computed from the alignment of the concatenation of the 7 housekeeping genes of Salmonella (aroC, dnaN, hemD, hisD, purE, sucA and thrA). The accessory genome of a pair of genome contains the gene families that are not shared between the two strains. We calculated the Pearson correlation between the two measurements of distance using the python command scipy.stats.pearsonr 14 .

Supplementary Figure 3: Salmonella full pan genome: A) The Salmonella
pan and core genome curves are plotted after the strains of a serovar are clustered together. Random sampling of the clustered strains is performed to obtain an average core and pan value after each genomic addition. This is the full version of Fig 1.A. Each shade of blue under the pan genome curve represents a new serovar. In other words, all strains of a serovar are represented by the same shade of blue. B) The Salmonella average pan and core genome curves are plotted after randomly sampling all 410 genomic sequences a thousand times. Heap's law was fitted to all of the 1000 sampled pan genome curve and the average and standard deviation of gamma is reported. This is a more traditional representation of pan and core genome curves and comes as an extension to the subplot in Fig 1.B for the Salmonella pan and core genome curves. The core genome is represented by a dark grey color and the accessory genome is represented in light mauve. The number of protein families shared across all genomes is plotted against the number of genomes added, i.e. the core genome curve (in light blue). The union of the set of gene families at each addition is plotted against the number of genomes added, i.e. the pan genome curve (in purple).