Mapping the bacterial metabolic niche space

The rise in the availability of bacterial genomes defines a need for synthesis: abstracting from individual taxa, to see larger patterns of bacterial lifestyles across systems. A key concept for such synthesis in ecology is the niche, the set of capabilities that enables a population’s persistence and defines its impact on the environment. The set of possible niches forms the niche space, a conceptual space delineating ways in which persistence in a system is possible. Here we use manifold learning to map the space of metabolic networks representing thousands of bacterial genera. The results suggest a metabolic niche space comprising a collection of discrete clusters and branching manifolds, which constitute strategies spanning life in different habitats and hosts. We further demonstrate that communities from similar ecosystem types map to characteristic regions of this functional coordinate system, permitting coarse-graining of microbiomes in terms of ecological niches that may be filled.


Desirable properties of an ecological coordinate system
An effective ecological coordinate system for high-dimensional trait data would ideally preserve important local and global features of the dataset [2]. We demonstrate by way of example that diffusion-based methods learn both local and global structures in the trait space, while popular methods usually emphasize only one of these desirable properties.
We consider three examples of types of functional 'soft properties' [4] identified by the diffusion map and supported by enrichment analysis [5] (Supplementary Tables 1-7): an example of capabilities that uniquely distinguish a group from all others (variable 1; carbon fixation by photosynthetic Cyanobacteria), an example of conserved differences between major taxonomic classes (variable 2; soilborne Actinobacteria vs. host-associated Gammaproteobacteria), and an example of major differences among close relatives (variable 31; differences among species in the Enterobacterales).
Figure 1: Two-dimensional embedding of diffusion variables, computed using the 'PHATE' algorithm [2]. Points are individual genomes colored by their assigned entries in a specific diffusion variable. Dark shades of red and blue correspond to small (i.e., the most negative) and large (positive) variable entries; white points are near zero. Axes mark (0, 0) in the coordinate system. A) Diffusion variable 1, identifying photosynthetic capabilities in Cyanobacteria (dark red points). B) Diffusion variable 2, which identifies differences between soil-associated Actinobacteria (dark blue) and host-associated Gammaproteobacteria (dark red). C) Diffusion variable 31, which finds a functional split among close relatives in the Enterobacterales.
Supplementary Figure 1 shows the two-dimensional embedding of diffusion variables using the 'potential of heat-diffusion for affinity-based transition embedding' (PHATE) method. In contrast, other methods discard important fine-grained details encoded in higher dimensions. This is well-understood for linear methods like PCA that focus on explaining global variances [2] (Supplementary Figs. 2A-2C). In practice, this means that PCA is able to find some global contrasts in the trait dataset, like the major differences in metabolic capabilities between Actinobacteria and Gammaproteobacteria (Supplementary Fig. 2B). However, the method is unable to identify finer structures near nonlinear submanifolds. As a consequence, details like intra-class differences in metabolic traits are obscured (e.g., Supplementary Fig. 2C). This observation is recapitulated by a mapping between samples in the Earth Microbiome Project (EMP) [6] and PCA axes (see Fig. 4; description in Methods), which provides only a rough characterization of ecosystem types (Supplementary Fig. 3).
[Figure caption, beginning missing] ... [6]. Columns correspond to different principal component axis extrema for the first 50 axes. Darker tiles indicate that a larger fraction of community censuses contained taxa that mapped to those extrema. Blue and red arrows along the horizontal axis denote positive and negative variable extrema respectively. Columns and rows are ordered based on the procedure described in Methods.
The t-SNE [1] method is capable of identifying localized [3] clusters in the data (e.g., differences in the Cyanobacteria; Supplementary Fig. 4A) but cannot reliably capture global features of the dataset. This is because t-SNE minimizes an objective function that ignores large dissimilarities between data points, causing distances in t-SNE space to be mostly meaningless [2]. This leads to issues particularly in the analysis of data that are well-described by continuous or branching trajectories (e.g., cell differentiation data), as t-SNE shatters these trajectories, leading to the false impression of data clusters [2].
[Figure caption, beginning missing] ... Fig. 1 for a description). A) Diffusion variable 1, B) diffusion variable 2, and C) diffusion variable 31.
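As a point of reference for the diffusion variables discussed in this section, the following is a minimal sketch of a textbook Gaussian-kernel diffusion map in Python with NumPy. It is not necessarily the construction used in this work: the function name `diffusion_map`, the median-distance bandwidth heuristic, and the dense kernel are illustrative assumptions.

```python
import numpy as np

def diffusion_map(X, n_components=3, epsilon=None):
    """Return the leading non-trivial diffusion eigenvalues and variables.

    X is an (n_samples, n_traits) array. This is a generic illustration,
    not the pipeline used in the paper.
    """
    # Pairwise squared Euclidean distances between rows of X.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Kernel bandwidth: median squared distance (a heuristic assumption).
    if epsilon is None:
        epsilon = np.median(d2[d2 > 0])

    # Gaussian affinities and node degrees.
    K = np.exp(-d2 / epsilon)
    deg = K.sum(axis=1)

    # Symmetric normalization S = D^{-1/2} K D^{-1/2}; S shares its
    # eigenvalues with the random-walk matrix P = D^{-1} K.
    S = K / np.sqrt(np.outer(deg, deg))
    evals, evecs = np.linalg.eigh(S)

    # Sort eigenpairs from largest to smallest eigenvalue.
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Convert to right eigenvectors of P, then drop the trivial
    # constant eigenvector associated with eigenvalue 1.
    psi = evecs / np.sqrt(deg)[:, None]
    return evals[1:n_components + 1], psi[:, 1:n_components + 1]
```

In this sketch, each column of the returned array plays the role of one diffusion variable: points with strongly negative or strongly positive entries sit at opposite ends of that variable.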

Robustness of diffusion mapping to data perturbations
A notable concern is the sensitivity of dimensionality reduction methods to noise [2]. This problem is particularly apparent in the application of multidimensional scaling to our metabolic trait dataset. The embedding shows a strong sensitivity to outliers (Supplementary Figs. 5A-5C), leading to a tight cluster of genomes with very little visible structure. Within this cluster, important features are generally difficult to resolve.
In contrast, a majority of local structures in the diffusion map are robust to data perturbations. To demonstrate this point, we computed the degree to which replacing a specified proportion of real trait sets with random metabolic configurations altered the neighborhoods of intact data points in diffusion space (see Supplementary Fig. 6A). Similarities of the 10-nearest neighbors between pairs of matching points in the intact and permuted diffusion spaces were calculated as 1 − D_J, where D_J is the binary Jaccard distance [7]. Supplementary Fig. 6B shows the results of 100 iterations of the procedure outlined in Fig. 5A, which indicate that local structures are mostly preserved for the intact data (similarity > 0.5), even when permuting 10% of the trait data points. Thus, geometric structures in the diffusion map have the benefit of being robust to potential noise introduced by sequencing errors, genome and annotation incompleteness, or the metabolic reconstruction process.
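The neighborhood comparison described above can be sketched as follows. This is a minimal illustration in plain Python, assuming that row i of both embeddings refers to the same intact data point and that Euclidean distance defines the neighborhoods. The function names `knn_indices` and `neighborhood_similarity` are hypothetical and not taken from the original analysis.

```python
from math import dist

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return set(others[:k])

def neighborhood_similarity(intact, perturbed, k=10):
    """Per-point similarity 1 - D_J between k-NN sets in two embeddings.

    Both arguments are sequences of coordinate tuples of equal length;
    matching rows are assumed to describe the same data point.
    """
    sims = []
    for i in range(len(intact)):
        a = knn_indices(intact, i, k)
        b = knn_indices(perturbed, i, k)
        # Jaccard similarity |A ∩ B| / |A ∪ B| equals 1 - D_J.
        sims.append(len(a & b) / len(a | b))
    return sims
```

A value of 1 means that a point keeps exactly the same k neighbors after perturbation, and a value of 0 means that none of its neighbors are shared.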
Even using a limited number of examples, several of the limitations of common methods for trait space reconstruction become apparent. Included are tradeoffs between the preservation of local and global data structures, prior assumptions about the linearity or geometry of the underlying data, and sensitivity to noise.
Table 1: Top 5 over-represented metabolites in the metabolic networks of taxa that receive the most negative entries on variable 1. The Enrich. Score and FDR-Adj. P columns show the normalized 'Enrichment score' and FDR-adjusted [8] P-value from the enrichment analysis [5]. The Synthesized column indicates whether the network is predicted to produce the metabolite, based on its in-degree [9].