Abstract
Surveys of microbial biodiversity such as the Earth Microbiome Project (EMP) and the Human Microbiome Project (HMP) have revealed robust ecological patterns across different environments. A major goal in ecology is to leverage these patterns to identify the ecological processes shaping microbial ecosystems. One promising approach is to use minimal models that can relate mechanistic assumptions at the microbe scale to community-level patterns. Here, we demonstrate the utility of this approach by showing that the Microbial Consumer Resource Model (MiCRM) – a minimal model for microbial communities with resource competition, metabolic cross-feeding and stochastic colonization – can qualitatively reproduce patterns found in survey data including compositional gradients, dissimilarity/overlap correlations, richness/harshness correlations, and nestedness of community composition. By using the MiCRM to generate synthetic data with different environmental and taxonomical structure, we show that large-scale patterns in the EMP can be reproduced by considering the energetic cost of surviving in harsh environments, and that HMP patterns may reflect the importance of environmental filtering in shaping competition. We also show that recently discovered dissimilarity-overlap correlations in the HMP likely arise from communities that share similar environments rather than reflecting universal dynamics. We identify ecologically meaningful changes in parameters that alter or destroy each one of these patterns, suggesting new mechanistic hypotheses for further investigation. These findings highlight the promise of minimal models for microbial ecology.
Introduction
Over the past decade, next-generation sequencing has highlighted the incredible diversity of the microbial ecosystems that fill every corner of our planet. Microbial communities are highly complex and occur in environments ranging from soils to the human body. Large-scale surveys of microbial biodiversity, such as the Earth Microbiome Project (EMP), the Human Microbiome Project (HMP) and the European Metagenomics of the Human Intestinal Tract project (MetaHIT), have revealed a number of robust and reproducible patterns in community composition and function^{1,2,3}. A major challenge for contemporary microbial ecology is to understand and identify the ecological origins of these patterns. This problem is especially difficult because it involves what in the ecology literature has been called the “problem of pattern and scale”^{4}: explaining ecological patterns requires connecting processes that occur at very different scales of spatial, temporal, and taxonomical organization.
One potential approach for overcoming the problem of scale is to use mathematical models and large simulations to investigate how mechanistic assumptions about environmental and taxonomical structure at the microscopic scale affect the kind of ecological patterns observed at larger scales. A major obstacle in realizing this goal is that any mathematical model that seeks to explain modern microbial sequencing data must deal with the enormous complexity of microbial communities: the numbers of species and consumable molecules in a community can easily reach into the hundreds or thousands^{1}. Thus, by necessity any mechanistic model of community assembly will have an extraordinary number of free parameters, presenting a major obstacle for understanding microbial dynamics^{5}.
One potential strategy for overcoming this difficulty is to exploit the observation that complex systems often have generic behaviors that can be described by sampling parameters from an appropriately chosen random distribution^{6,7}. The most famous example of this is in nuclear physics where the intractably complicated quantum dynamics of the uranium nucleus were successfully modeled using random matrices^{6}. Recently, we have adapted these ideas to the microbial setting by formulating a minimal model for microbial population dynamics we term the Microbial Consumer Resource Model (MiCRM)^{8,9,10} (see Fig. 1).
The MiCRM builds on the classic framework for resource competition developed by MacArthur and Levins^{11}. As in all consumer resource models, species in the MiCRM are defined by their preferences for resources (Fig. 1d). Species with similar preferences naturally compete with each other, giving rise to competitive exclusion and niche partitioning. Crucially, the MiCRM incorporates two additional pieces of biological knowledge that are specific to microbial communities. First, the MiCRM explicitly includes cross-feeding and syntrophy – the consumption of metabolic byproducts of one species by another species^{8,12,13,14}. This is incorporated into the MiCRM through a stoichiometric metabolic matrix that parameterizes the metabolic transformations of consumed metabolites into secreted byproducts (Fig. 1a,e). Second, the MiCRM incorporates stochasticity in dispersal and colonization^{15,16,17,18}. Due to proximity effects, it is known that new environments are almost always colonized by only a subset of all the species capable of existing in that environment^{19}. The MiCRM incorporates stochastic dispersal by seeding new environments through random sampling of a larger regional species pool (Fig. 1b).
Taxonomic and metabolic assumptions are incorporated into the MiCRM through the choice of consumer preferences and metabolic matrices (see Methods for detailed discussion and implementation details). In the most minimal version of the MiCRM, species have no taxonomic structure (i.e. consumer preferences are uncorrelated across species and resources) and metabolism is completely random (i.e. the metabolic matrix has no structure beyond that required by energy and mass conservation). Large-scale surveys such as EMP and HMP often sample communities from very different environmental conditions. For this reason, it is important to be able to incorporate environmental structure and heterogeneity into our models. This is done by choosing which externally supplied resources are present in an environment (Fig. 1c).
Importantly, the MiCRM also allows for the incorporation of additional metabolic and taxonomic structure, allowing us to ask how taxonomy and metabolism shape community structure and function. This is implemented in the MiCRM by dividing resources into general resource classes (e.g. sugars, carboxylic acids, lipids, amino acids, etc.) and then using a tiered secretion model where metabolic byproducts are preferentially secreted into certain resource classes (Methods). This allows us to incorporate metabolic structure missing in the minimal MiCRM, such as the fact that the fermentation of sugars preferentially results in the secretion of carboxylic acids.
Taxonomic structure can also be easily incorporated into the MiCRM by introducing correlations in the preferences of species that come from the same family. For example, it is well known that bacteria from the Enterobacteriaceae family have a strong preference for fermenting sugars. The MiCRM incorporates such preferences by assigning species to families, with each family preferentially consuming resources from certain resource classes. Importantly, we can control the amount of metabolic and taxonomic structure in the community by modulating just two parameters that control the correlation structures of the consumer preference and metabolic matrices (see Methods and Ref. ^{10}). This addresses the major modeling bottleneck discussed above: how to choose parameters for diverse ecosystems.
In this paper, we use the MiCRM to test simple hypotheses about the mechanistic origins of patterns observed in EMP, HMP and MetaHIT, as well as in recent studies of marine microbial communities^{20,21}. We find that the MiCRM can qualitatively reproduce observed phenomena with minimal fitting or fine-tuning. We illustrate the utility of the model by identifying ecological mechanisms necessary for reproducing observed patterns as well as identifying ecological processes that can destroy these patterns. This allows us to use the MiCRM to generate new ecological hypotheses linking microscopic processes to large-scale patterns.
All simulation data and analysis scripts are available at https://github.com/EmergentBehaviorsinBiology/microbiomepatterns. The model itself is implemented in the freely available Python module CommunitySimulator^{10} https://github.com/EmergentBehaviorsinBiology/communitysimulator. Since the number of simulations required for comparisons with survey data is necessarily large, our numerical work relies heavily on a novel algorithm implemented in the Community Simulator, which takes advantage of a recently discovered duality between consumer resource models and constrained optimization to quickly and accurately simulate hundreds of communities^{22,23}.
Results
Patterns in the Earth Microbiome Project can be explained by energetic costs associated with harsh environments
The Earth Microbiome Project is a systematic attempt to characterize global microbial diversity and function. It consists of over 20,000 samples in 17 environments located on all 7 continents^{1}. Recently, a meta-analysis of these data was carried out and several robust patterns were identified. Chief among these was an interesting anticorrelation between richness and environmental harshness, reproduced in Fig. 2. Samples near neutral pH or at moderate temperatures (~15°C) showed much higher levels of richness than samples from more extreme conditions. Peak richness dropped by a factor of 2 for pHs less than 5 or greater than 9, and temperatures less than 5°C or greater than 20°C.
The EMP samples also showed a strongly nested structure: less diverse communities tended to be subsets of the more diverse communities. This is most clearly visible by creating a presence/absence matrix that indicates whether a taxon is present in a sample. Each column in the matrix corresponds to a different sample and each row to a different taxon. When the rows are sorted by taxon prevalence and the columns by richness, as in Fig. 3, one can visually verify that the taxa composing the low-diversity communities are also present in most of the higher-diversity communities.
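The sorting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for Fig. 3: the function names are hypothetical, and `mean_subset_fraction` is a simple ad hoc summary (the average fraction of a poorer sample's taxa that also occur in a richer sample), not an established nestedness metric.

```python
def sort_presence_absence(matrix):
    """Sort rows (taxa) by prevalence and columns (samples) by richness.

    `matrix` is a list of rows of 0/1 entries; both sorts are descending.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    prevalence = [sum(row) for row in matrix]
    richness = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    row_order = sorted(range(n_rows), key=lambda r: -prevalence[r])
    col_order = sorted(range(n_cols), key=lambda c: -richness[c])
    return [[matrix[r][c] for c in col_order] for r in row_order]


def mean_subset_fraction(matrix):
    """Average, over sample pairs, of the share of the poorer sample's taxa
    that are also present in the richer sample (1.0 = perfectly nested)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    samples = [{r for r in range(n_rows) if matrix[r][c]} for c in range(n_cols)]
    fractions = []
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            poor, rich = sorted((samples[a], samples[b]), key=len)
            if poor:  # skip empty samples, which carry no information
                fractions.append(len(poor & rich) / len(poor))
    return sum(fractions) / len(fractions) if fractions else float("nan")


if __name__ == "__main__":
    toy = [[0, 0, 1], [1, 1, 1], [0, 1, 1]]  # rows = taxa, columns = samples
    for row in sort_presence_absence(toy):
        print(row)
    print(mean_subset_fraction(toy))
```

In a perfectly nested data set the sorted matrix has a filled upper-left triangle, which is the visual signature referred to in the text.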
One possible cause of both these patterns is that microbes require more energy intake to survive in harsher environments^{24}. For example, powering chaperones to prevent protein denaturation and running ion pumps to maintain pH homeostasis both require significant amounts of ATP. We hypothesized that varying energy demands could explain the patterns observed in the EMP since they would directly alter the severity of environmental filtering.
In the MiCRM, the energetic costs of reproduction are encoded in the model parameter m_{i}, which is the minimal per-capita resource consumption required for net population growth of species i (see Methods for full model equations). The m_{i} are sampled from a Gaussian distribution with mean 1 and standard deviation 0.01. To vary the harshness of an environment, we added an environment-specific random number m_{env} to the m_{i} of all species that colonized a given environment. A large m_{env} corresponds to harsh environments with increased energetic demands whereas small or negative m_{env} corresponds to energetically favorable environments. To mimic the variability in environmental harshness in the EMP, for each community we randomly drew m_{env} uniformly between −0.5 and 9.5.
To selectively test the effect of an energy demand gradient on both alpha (within-sample) and beta (between-sample) diversity, we stochastically colonized 300 simulated communities of 150 species each from a regional pool of 180 species with a chemistry of 90 metabolites. We supplied each community with a constant flux of the same resource type. As discussed above, each of the 300 simulated communities was also assigned a random m_{env} to mimic the effects of environmental harshness on growth rates. The results from this simulation are shown in Fig. 2 and in the bottom left panel of Fig. 3. The same simulation correctly captures both the richness/harshness correlation and the nestedness of the EMP data, suggesting that these large-scale patterns may have a simple origin.
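As a rough illustration of this setup, the sketch below draws the maintenance costs and colonist sets for the ensemble using only the numbers quoted in the text (pool of 180 species, 150 colonists, 300 communities, m_{i} Gaussian with mean 1 and standard deviation 0.01, m_{env} uniform on [−0.5, 9.5]). It only prepares parameters and does not integrate the MiCRM dynamics; the function name, the returned data layout, and the assumption that m_{i} is drawn once per species in the regional pool are illustrative choices, not taken from the Community Simulator package.

```python
import random


def build_harshness_ensemble(n_communities=300, pool_size=180, n_colonists=150,
                             m_mean=1.0, m_sd=0.01, m_env_range=(-0.5, 9.5),
                             seed=0):
    """Draw colonists and effective maintenance costs for each community."""
    rng = random.Random(seed)
    # Assumed: one baseline cost m_i per species, shared across communities.
    pool_m = [rng.gauss(m_mean, m_sd) for _ in range(pool_size)]
    communities = []
    for _ in range(n_communities):
        # Stochastic colonization: a random subset of the regional pool.
        colonists = rng.sample(range(pool_size), n_colonists)
        # Environment-specific harshness added to every colonist's cost.
        m_env = rng.uniform(*m_env_range)
        costs = {i: pool_m[i] + m_env for i in colonists}
        communities.append({"m_env": m_env, "costs": costs})
    return communities


if __name__ == "__main__":
    ensemble = build_harshness_ensemble()
    harshest = max(ensemble, key=lambda c: c["m_env"])
    print(len(ensemble), round(harshest["m_env"], 2))
```

Each entry could then be passed to a consumer resource simulation, with richness measured as the number of species that persist at steady state.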
Given the way we have modeled the harshness variations, the link with diversity is not very surprising, because a sufficiently high maintenance cost can make it impossible for a species to survive on a given resource supply, regardless of the surrounding community structure. This pattern is thus guaranteed to occur in any simulation sharing this basic structure. The shape of the richness/harshness relationship does depend on modeling choices, however. We found that diversity loss happens more quickly when the incoming energy is divided among all possible resource types before being supplied to the system, as shown in Fig. 2. In this case it can happen that no single species is able to harvest a sufficient number of distinct nutrient sources to meet its maintenance cost, and the whole community goes extinct. In the original simulations, by contrast, the surviving species at high harshness levels satisfy most of their energy requirements by directly consuming the externally supplied resource, with the metabolic byproducts supplying sufficient niche differentiation for multiple species to coexist.
To explore the ecological origins of the nested pattern in more depth, we ran additional simulations in a different regime of community assembly. Instead of modulating diversity with varying levels of selection pressure, we tried varying the degree of dispersal limitation. In the new scenario, each community faced identical environmental conditions, but the initial number of species from the regional species pool allowed to colonize the community was randomly chosen, from 1 to the maximum possible value of 180. In these new simulations, shown in the bottom right panel of Fig. 3, the nestedness vanishes. The reason for this is that in many environments, only a few species colonize the community, resulting in many metabolic niches being unoccupied. We also ran simulations where both m_{env} and the initial number of species varied from site to site, and obtained an intermediate degree of nestedness, as shown in Supplementary Fig. S1. Collectively, these simulations suggest that nestedness in cross-sectional data may be a sign of selection-dominated community assembly.
We emphasize that these simulations themselves do not confirm the hypothesis that energy gradients are the driver of the observed patterns. Temperature and pH affect microbes in many other ways that are not included in our minimal model. But our simulations do show that accounting for increased energetic costs associated with harsh environments can reproduce the large-scale patterns observed in the EMP even in the absence of any metabolic or taxonomic structure. Additionally, one ecological factor that seems crucial for reproducing these patterns is dispersal. The nestedness seen in the EMP requires that ecological dynamics are dominated by selection rather than stochastic colonization due to dispersal limitations.
Metabolic structure and species abundance curves
In order to reproduce more complex ecological patterns such as those observed in the HMP, we incorporated additional metabolic and taxonomic structure into our model^{8,10,23}, as illustrated in Fig. 4. The basic idea is to recognize the fact that metabolites often belong to distinct groups with different metabolic properties (e.g. lipids, sugars, amino acids, etc.). In most of our simulations, we introduce T = 6 groups labeled A to F representing these metabolic classes, with F a special “waste” class that mimics commonly produced metabolic byproducts (i.e. carboxylic acids for fermentative and respirofermentative bacteria). To incorporate this structure in our metabolic matrix we introduce a three-tiered secretion model where: a fraction f_{s} of the byproduct flux from metabolism of a given resource is partitioned among resources of the same class, a fraction f_{w} of the flux is secreted as “waste” resources (class F), and the rest of the flux is nonspecifically partitioned among all the other classes.
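The class-level bookkeeping of this three-tiered secretion rule can be sketched as follows. This is only an illustration under stated assumptions: the remaining flux is split evenly among the other classes, a resource that already belongs to the waste class sends f_{s} + f_{w} back to that class, and the values f_{s} = f_{w} = 0.4 in the usage example are placeholders rather than values from the paper. The partition of flux among individual resources within a class (see Methods) is not modeled here, and the function name is hypothetical.

```python
def class_secretion_fractions(source_class, classes, waste_class, f_s, f_w):
    """Return the fraction of byproduct flux sent to each resource class."""
    if f_s < 0 or f_w < 0 or f_s + f_w > 1:
        raise ValueError("f_s and f_w must be non-negative and sum to at most 1")
    fractions = {c: 0.0 for c in classes}
    fractions[source_class] += f_s   # tier 1: same class as consumed resource
    fractions[waste_class] += f_w    # tier 2: waste class
    # Tier 3: whatever is left goes to all remaining classes.
    others = [c for c in classes if c not in (source_class, waste_class)]
    rest = 1.0 - f_s - f_w
    if not others and rest > 0:
        raise ValueError("no remaining classes to receive the leftover flux")
    for c in others:
        fractions[c] += rest / len(others)  # assumed: even split
    return fractions


if __name__ == "__main__":
    classes = list("ABCDEF")  # T = 6 classes, F is the waste class
    print(class_secretion_fractions("A", classes, "F", f_s=0.4, f_w=0.4))
```

Setting f_{s} and f_{w} so that the leftover share equals the uniform share per class would remove the class structure at this level, which is one way to think about tuning the amount of metabolic structure.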
Different taxonomic families often have distinct resource preferences. For example, it is well known that the bacteria from the taxonomic family Enterobacteriaceae to which E. coli belongs preferentially consume sugars. To reflect such taxonomical preferences in our model, microbial species are grouped into “families,” with each family specializing in a different resource class. Specialist families allocate a fraction q of their consumption capacity to their favored resource class. In all the simulations shown here, q = 0.9 meaning that specialist families derive 90% of their resources from their preferred resource class. In addition to these specialists, we know that certain microbial families behave as generalists with no strong metabolic preferences across resource types. To model this, we introduce a generalist family whose preferences are uniformly sampled across all resource types.
One commonly employed analysis tool for understanding community structure is the species abundance curve. A species abundance curve is made by plotting the number of species present in a sample on the y-axis and the number of individuals or population size on the x-axis. In many ecosystems, it is known that species abundance curves are well fit by a Fisher log series^{25,26}. Unlike Gaussian distributions or other normal-distribution-derived variants such as truncated Gaussians, the Fisher log series has a long tail, reflecting the preponderance of rare species in these ecosystems. As shown in Fig. 5, the Fisher log series also gives a good fit to species abundance data on ocean microbial communities from the Tara Oceans dataset^{20}. Simulation data generated using the MiCRM with taxonomic structure also result in long-tailed species abundance distributions that are well fit by a Fisher log series. However, these tails disappear in simulations of the MiCRM lacking metabolic and taxonomic structure. In this case, the species abundance curves were better described using a truncated Gaussian, consistent with theoretical predictions^{22,27}. These simulations show that the long-tailed species abundance curves seen in most ecosystems are compatible with an equilibrium niche model, provided a sufficient level of taxonomic structure, and do not necessarily require neutrality^{28} or chaotic dynamics^{29}.
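For readers unfamiliar with the Fisher log series, the short sketch below evaluates its standard form, in which the expected number of species represented by n individuals is αx^{n}/n with 0 < x < 1, and checks numerically that the counts sum to −α ln(1 − x). The parameter values are arbitrary illustrations, not fits to the Tara Oceans or simulated data, and the function names are hypothetical.

```python
import math


def fisher_log_series(alpha, x, n_max):
    """Expected number of species with n = 1..n_max individuals."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return [alpha * x ** n / n for n in range(1, n_max + 1)]


def expected_total_species(alpha, x):
    """Closed-form sum of the series over all n >= 1."""
    return -alpha * math.log(1.0 - x)


if __name__ == "__main__":
    counts = fisher_log_series(alpha=10.0, x=0.9, n_max=2000)
    # Rare species dominate: the first few abundance classes hold the most species.
    print([round(c, 2) for c in counts[:5]])
    print(round(sum(counts), 4), round(expected_total_species(10.0, 0.9), 4))
```

The slow 1/n decay, modulated by x^{n} with x close to 1, is what produces the long tail that distinguishes this distribution from a truncated Gaussian.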
Patterns in the HMP can be explained by environmental filtering and competition
The Human Microbiome Project is a large-scale survey of the microbial communities that reside in and on the human body^{2}. The HMP was supplemented by the smaller MetaHIT project, which focused on sequencing fecal metagenomes from multiple individuals. Initial analysis of the HMP and MetaHIT results on the human microbiome revealed three major patterns, displayed in the top half of Figs. 6, 7, and 8. First, for a given body site different individuals had very different community compositions (see Fig. 6). Even at the phylum level, the relative abundances of dominant taxa varied dramatically from sample to sample^{2}. But samples from different body sites still typically differed more than samples from the same body site, leading to the second pattern, shown in Fig. 7: clustering of microbial communities by body site across individuals^{2,3}. Finally, the gradients in the relative abundances of the dominant taxa in a given body site across individuals were also visible in dimensional reductions of more fine-grained (genus-level) community composition, producing the third pattern shown in Fig. 8.
One factor associated with the compositional gradient is the host’s typical diet^{30,31}. Different kinds of externally supplied nutrients, such as fibers and proteins, are thought to encourage growth of different microbial taxa. For this reason, we hypothesized that the patterns in the HMP may arise from heterogeneity in the resources available in different environments. It is clear that reproducing such patterns requires assuming some minimal level of taxonomic and metabolic structure. For this reason, in our simulations we divided resources into six resource classes and species into six families, with each family specializing in one resource class, as illustrated in Fig. 4 and described above.
We first constructed metabolically structured simple environments where there were only two externally supplied resources. In particular, each of the three “body sites” was supplied with a unique pair of resources from distinct resource classes (i.e. body site 1 was supplied with a resource from class A and a resource from class B, body site 2 with a resource from class C and a resource from class D, and body site 3 with a resource from class E and a resource from class F). We modeled variability in the availability of resources across individuals at a fixed body site by changing the ratio of the two supplied resources while holding the total supplied energy fixed (see Methods). We also created metabolically structured complex environments where each body site was supplied with 50 external resources from each of its two resource classes while holding the total supplied energy fixed (i.e. body site 1 was supplied with all 50 resources from class A and all 50 resources from class B, body site 2 with all 50 resources from class C and all 50 resources from class D, and body site 3 with all 50 resources from class E and all 50 resources from class F).
To mimic the scale of the actual microbiome data, we generated a regional pool of 5,000 species (approximately the number of OTUs identified in the HMP^{2}), and stochastically colonized 300 samples per body site with 2,500 species each. Figure 6 shows the resulting patterns for simple and complex environments. For simple environments, our simulations mimic the broad range of compositions found in the data, including gradients in the dominant families present at each of the body sites. In contrast, for complex environments we see that the relative abundance of different families stays almost constant across individuals for each body site. This suggests that the patterns found in Fig. 6 may reflect the combined effects of environmental filtering and competition between species in the presence of a few dominant externally supplied resources.
We used the data from simulations on metabolically structured simple environments to perform a PCoA across body sites as in the MetaHIT data. As can be seen in Fig. 7, these simulations recapitulated the pattern seen in real microbial communities. We found that this clustering by body site depended strongly on the fact that different body sites had metabolically distinct resources. For example, if we instead considered metabolically unstructured complex environments where each body site was supplied with 100 randomly chosen distinct resources regardless of resource class, the clusters were no longer fully separable on a two-dimensional PCoA (rightmost graph in Fig. 7). This suggests that the clustering of human microbiomes according to body sites likely reflects the fact that these body sites have metabolically distinct environments that result in different patterns of byproduct secretion.
We also investigated the ability of our model to reproduce the U-shaped curves observed in PCoA of communities at a single body site (see Fig. 8). We found that we could reproduce this pattern using the same simulations used in Fig. 7 to understand metabolically structured simple environments. With the level of dispersal limitation used in these simulations, the U shape primarily results from the stochastic presence or absence of the most abundant species. If we reduce dispersal limitation, however, by initializing each community with nearly all of the possible species, the U shape directly reflects the low-dimensional variability of resource supply in the simple environments. It remains unclear which, if either, of these two explanations corresponds to the pattern in the gut microbiome, whose significance is a matter of ongoing controversy^{30,31,32}.
Dissimilarity-overlap patterns reflect shared environments, not universal dynamics
Another pattern obtained from a more recent analysis of the HMP data is an anticorrelation between overlap and dissimilarity of pairs of communities from a given body site (see Fig. 9 and Ref. ^{33} for details). Due to both stochastic colonization and variable environments, there are usually many species in one sample that are not present in the other. Different pairs of samples overlap to different degrees, and this overlap can be measured in terms of the ratio of the combined population of the shared species to the total population of the two samples. If one focuses on the subset of species that are shared, one can also compare the relative abundance distributions of the two samples within this shared pool, as illustrated in Fig. 9, using standard measures of dissimilarity such as the Jensen-Shannon divergence (see Ref. ^{33} for detailed discussion). These two quantities are not intrinsically related, as can be seen by evaluating them over a randomly generated table of abundances^{33}.
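The two quantities described above can be computed for a pair of samples as in the following sketch. It is one possible reading of the verbal definitions, not the exact procedure of Ref. ^{33}: overlap is taken as the average, over the two samples, of the relative abundance held by shared species, and dissimilarity as the base-2 Jensen-Shannon divergence between the shared-species abundances after renormalization (some analyses use its square root instead). The function names and the dictionary input format are illustrative assumptions.

```python
import math


def _relative(abundances):
    """Convert raw counts to relative abundances, dropping absent species."""
    total = sum(abundances.values())
    return {k: v / total for k, v in abundances.items() if v > 0}


def _kl(u, v):
    """Kullback-Leibler divergence in bits; terms with u_i = 0 contribute 0."""
    return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)


def overlap_and_dissimilarity(sample_a, sample_b):
    """Return (overlap, Jensen-Shannon divergence) for two samples.

    Each sample maps species name -> abundance. If no species are shared,
    overlap is 0 and the dissimilarity is undefined (NaN).
    """
    x, y = _relative(sample_a), _relative(sample_b)
    shared = sorted(set(x) & set(y))
    if not shared:
        return 0.0, float("nan")
    overlap = 0.5 * sum(x[s] + y[s] for s in shared)
    # Renormalize within the shared pool before comparing compositions.
    sx, sy = sum(x[s] for s in shared), sum(y[s] for s in shared)
    p = [x[s] / sx for s in shared]
    q = [y[s] / sy for s in shared]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    return overlap, jsd


if __name__ == "__main__":
    a = {"sp1": 30, "sp2": 10, "sp3": 5}
    b = {"sp1": 10, "sp2": 30, "sp4": 20}
    print(overlap_and_dissimilarity(a, b))
```

Applying this function to every pair of samples at a body site and plotting dissimilarity against overlap gives the kind of scatter shown in Fig. 9.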
This analysis was initially proposed as a way of distinguishing between “universal” and “host-specific” microbial dynamics. It was argued that if the dynamics of human-associated microbial communities were universal, different individuals could be modeled with the same dynamic parameters and this would be reflected by a negative correlation between dissimilarity and overlap across cross-sectional samples. In contrast, for host-specific dynamics each individual would have their own kinetic parameters and the dissimilarity and overlap would be uncorrelated. This interpretation has been disputed, however, with numerical simulations of Lotka-Volterra-type models showing that a negative correlation can result from host-specific dynamics in the presence of stochasticity, sampling errors, or environmental gradients^{34}.
We reran the analysis of Ref. ^{33} on our simulated HMP data discussed above and found that the dissimilarity and overlap were negatively correlated at a single body site, just as in the real gut microbiome data (see Fig. 9, top right). However, this correlation was absent if we analyzed pairs of samples from distinct body sites, indicating that this signature likely arises due to the fact that all the communities at a given body site exist in a similar external environment (see Supplementary Fig. S3). Importantly, our simulations show that the negative dissimilarity-overlap correlation observed in Ref. ^{33} can be found even in the absence of universal dynamics, since environments with different amounts of externally supplied resources generically give rise to communities with different ecological dynamics. Instead, our simulations suggest that the negative correlation between overlap and dissimilarity found in the HMP may reflect the fact that communities at a given body site experience similar but not identical environments.
As a further check, we approximated the population dynamics near the steady state using a generalized Lotka-Volterra model (see Methods). This allowed us to explicitly calculate the effective carrying capacities and interaction coefficients for each community. The bottom row of Fig. 9 shows the carrying capacity for two pairs of communities: one pair where the two communities in the pair have a high overlap and another where the communities in the pair have a low overlap. The carrying capacities of species in the high-overlap communities are extremely similar whereas the carrying capacities of the low-overlap communities are very different from each other. This provides strong evidence for the important role played by environmental filtering in producing the dissimilarity-overlap pattern observed in the HMP data.
“Modular assembly” of microbial communities
Our analysis of our synthetic HMP data also shows a new pattern: the family-level composition of each community along the nutrient gradient is approximately a linear combination of the compositions of the two extreme communities on the gradient (Fig. 6d). Quantitatively, if \({N}_{i}^{1}\) are the population sizes for each family i in the community supplied with a flux κ_{1} of resource type 1 alone, and \({N}_{i}^{2}\) are the population sizes in the community supplied with flux κ_{2} of resource type 2 alone, then the community supplied with flux (1 − α)κ_{1} of resource type 1 and flux ακ_{2} of resource type 2 has population sizes approximately equal to \((1-\alpha ){N}_{i}^{1}+\alpha {N}_{i}^{2}\). This “additivity” of communities from different environments can also be seen at the species level, by plotting the actual population sizes \({N}_{i}^{{\rm{mix}}}\) versus the weighted average prediction, as shown in Fig. 10.
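The additivity test can be written out explicitly. The sketch below forms the weighted-average prediction \((1-\alpha ){N}_{i}^{1}+\alpha {N}_{i}^{2}\) and scores it against the observed mixed-resource populations with a coefficient of determination. The choice of R^{2} = 1 − SS_res/SS_tot computed on raw population sizes is an assumption (the text does not specify how R^{2} was calculated), and the numbers in the usage example are invented solely to exercise the functions.

```python
def weighted_average_prediction(n1, n2, alpha):
    """Predict mixed-community populations from the two single-resource ones."""
    if len(n1) != len(n2):
        raise ValueError("population vectors must have the same length")
    return [(1 - alpha) * a + alpha * b for a, b in zip(n1, n2)]


def r_squared(observed, predicted):
    """Coefficient of determination of `predicted` as a model for `observed`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    n1 = [10.0, 0.0, 4.0]      # hypothetical populations on resource 1 alone
    n2 = [0.0, 8.0, 4.0]       # hypothetical populations on resource 2 alone
    n_mix = [7.0, 2.5, 4.0]    # hypothetical populations on the mixture
    prediction = weighted_average_prediction(n1, n2, alpha=0.25)
    print(prediction, round(r_squared(n_mix, prediction), 3))
```

An R^{2} close to 1 indicates that the mixed community behaves as a superposition of the two single-resource communities, while values near 0 indicate that the weighted average has little predictive power.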
Since this analysis is performed at the species level, meaningful taxonomic groupings are no longer necessary, so we also checked for additivity in simulations where communities lack any metabolic or taxonomic structure (i.e. with unstructured metabolic and consumer preference matrices). Figure 10 shows that, in this case, the population sizes in the mixed-resource communities are not well predicted by the weighted average of single-resource communities, with an R^{2} of 0.21. This suggests that metabolic and taxonomic structure are necessary to see this additive pattern.
This pattern is difficult to test for in field survey data, since there are many additional factors besides diet that vary from sample to sample, many of which may themselves be correlated with diet. Studying this kind of effect requires controlled experiments where the variable of interest can be systematically varied. One recent experiment colonized seawater communities on small beads composed of different kinds of carbohydrates, which served as the sole externally provided carbon source for the community^{21}. This simplified scenario reflects the conditions of our minimal model more closely, where only one or two nutrient types were externally supplied. The authors of this study compared the weighted average of population sizes from communities grown on two different carbon sources, such as agarose and alginate, and the corresponding population sizes from other communities grown on a mixture of the two. They found the same “additivity” effect we observe in our metabolically structured simulations, with R^{2} values of 0.84 and 0.74 for two different carbon source pairs. They termed this property “modular assembly” of microbial communities. Our simulations show that modular assembly may be a generic property of complex microbial communities grown in the presence of multiple metabolically distinct resources.
Discussion
We have shown that the Microbial Consumer Resource Model introduced previously^{8} to describe laboratory experiments in synthetic minimal environments can also reproduce a wide range of experimentally observed patterns in survey data such as the HMP and EMP, including harshness/richness correlations, nestedness of community composition, compositional gradients, and dissimilarity/overlap correlations. The MiCRM provides a systematic way of exploring the effect of stochastic colonization, resource competition, and metabolic cross-feeding on large-scale observables. By randomly sampling parameters from well-defined probability distributions, we combine a sufficient level of mechanistic detail to make the parameters physically meaningful, while keeping the number of parameters small enough for systematic investigation of the factors that control different patterns.
Our numerical results complement recent theoretical works suggesting that complex ecosystems may still be well described by random ecosystems^{35,36}, suggesting the essential ecology of diverse ecosystems may be amenable to analysis using techniques from statistical mechanics and random matrix theory. For these reasons, the MiCRM is well-suited to serve as a minimal model for understanding microbial ecology.
Our analysis also suggests several hypotheses relating mechanism to large scale patterns in both the EMP and HMP. We have shown that it is possible to reproduce the richness/harshness correlation and the nestedness of the EMP data by assuming that harsh environments pose an additional energetic cost to organisms. This is true even when communities are grown in otherwise identical environments and lack any taxonomic and metabolic structure. This complements earlier work showing that energy availability is a key driver of community function and structure^{9}. Our simulations on the HMP suggest that environmental gradients and resource availability result in significant environmental filtering and naturally explain the clustering of microbial communities by body site.
We have also identified ecological parameters that can break the observed patterns, allowing us to generate hypotheses about the underlying ecological processes: (a) breaking the nestedness pattern (Fig. 3) with dispersal limitation allowed us to connect nestedness to a selection-dominated regime in the EMP; (b) the loss of compositional gradients in complex environments (Fig. 6) led us to hypothesize that a small number of dominant resource types may drive inter-subject variability in the HMP; (c) the degradation of compositional clustering by body site (Fig. 7) in metabolically unstructured environments highlighted the importance of metabolically relevant differences between resource environments in the HMP; (d) breaking the additivity of communities grown on mixed resource supplies (Fig. 10) allowed us to connect this pattern to taxonomic and metabolic structure in the microbial species.
Great care has to be taken when interpreting large-scale patterns. For example, the negative correlation between dissimilarity and overlap observed in HMP data in Ref. ^{33} may indicate that body sites across individuals have similar environments, rather than supporting the much stronger claim of universal dynamics in the human microbiome. Our work also suggests that many large-scale patterns may occur generically across different environmental settings. For example, the additivity observed in our synthetic HMP data is also observed in ocean communities grown on synthetic carbon beads^{21}, suggesting modular assembly may be a generic property of communities grown in environments with metabolically distinct resources.
The analysis presented here shows that it is possible to qualitatively reproduce patterns seen in large-scale surveys such as the EMP and HMP using a simple minimal model. An interesting area of future research is to move beyond qualitative comparisons and ask how minimal models and large-scale simulations can be quantitatively compared to large-scale genomic surveys. This problem is especially challenging given the large number of parameters, environments, and experimental designs that must be explained. One potential avenue for doing this is to use statistical methods such as Approximate Bayesian Computation (ABC)^{37}. In ABC, the need to exactly calculate complicated likelihood functions is replaced with the calculation of summary statistics and numerical simulations. In this way, it may be possible to quantitatively relate mechanistic details at the level of microbes to community-level patterns observed in large-scale surveys.
Methods
All synthetic data were generated using the Microbial Consumer Resource Model previously described^{9,10} and summarized below. We found the fixed points of the dynamics for each community using the Python package Community Simulator^{10}: https://github.com/EmergentBehaviorsinBiology/communitysimulator. Principal Coordinate Analysis was performed on the simulated HMP data using the Python package scikit-bio (http://scikitbio.org/). The pairwise distance matrix was generated using standard scipy commands with the Jensen-Shannon distance metric. Dissimilarity-overlap analysis was performed on the simulation data following the procedure described in Ref. ^{33}. All simulation data and analysis scripts are available at https://github.com/EmergentBehaviorsinBiology/microbiomepatterns.
MiCRM dynamical equations
We consider the dynamics of the population densities N_{i} of S microbial species and the abundances R_{α} of M resource types in a well-mixed system, governed by the following set of ordinary differential equations:
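One form of these equations that is consistent with the parameters defined in this section, assuming the linear resource uptake used for the MiCRM in Refs. ^{9,10}, is sketched below. Here m_{i} denotes the maintenance cost of species i and κ_{α} the external supply rate of resource α; the exact typesetting is an assumption.

```latex
\begin{aligned}
\frac{dN_i}{dt} &= g_i N_i \left[ \sum_{\alpha} (1 - l_\alpha)\, w_\alpha\, c_{i\alpha} R_\alpha - m_i \right], \\
\frac{dR_\alpha}{dt} &= \kappa_\alpha - \tau_R^{-1} R_\alpha - \sum_{j} c_{j\alpha} N_j R_\alpha
  + \sum_{j,\beta} D_{\alpha\beta}\, \frac{w_\beta}{w_\alpha}\, l_\beta\, c_{j\beta} N_j R_\beta .
\end{aligned}
```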
For this study, the conversion factors g_{i} from energy uptake to population growth were all set to 1, as were the resource qualities w_{α} and the resource dilution rate \({\tau }_{R}^{-1}\). The leakage fractions l_{α} govern how much of each consumed resource is released into the environment as metabolic byproducts, and were set to 0.8 for all α. See Table 1 for a list of all parameters and units.
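As a rough illustration of how these parameter choices enter the dynamics, the following sketch evaluates the right-hand side of a linear-uptake consumer resource model with uniform g, w and l. The functional form written in the docstring, the function name `micrm_rhs` and its argument names are assumptions made for this example; the Community Simulator package should be consulted for the actual implementation.

```python
import numpy as np

def micrm_rhs(N, R, c, D, m, kappa, l=0.8, g=1.0, w=1.0, tau_R=1.0):
    """Return (dN/dt, dR/dt) for an assumed linear-uptake MiCRM.

    Assumed form, with uniform g, w and l (so that w_b / w_a = 1):
      dN_i/dt = g N_i [ sum_a (1 - l) w c_ia R_a - m_i ]
      dR_a/dt = kappa_a - R_a / tau_R - sum_j c_ja N_j R_a
                + sum_{j,b} D_ab l c_jb N_j R_b
    """
    uptake = c * R                      # (S, M): per-capita uptake of each resource
    dN = g * N * ((1.0 - l) * w * uptake.sum(axis=1) - m)
    consumed = N @ uptake               # (M,): total consumption flux per resource
    secreted = D @ (l * consumed)       # leaked flux, routed by the columns of D
    dR = kappa - R / tau_R - consumed + secreted
    return dN, dR
```

Because each column of D sums to 1, the leaked flux is fully redistributed among resources, so only the fraction 1 − l of the consumed flux leaves the resource pool.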
Random sampling of consumer preference matrix and metabolic matrix
As noted in the Introduction, modeling highly diverse communities such as microbiomes requires a large number of free parameters. For example, the simulations with 5,000 species performed here required choosing over a million parameter values. In order to explore the typical phenomena produced by our model, we sampled the parameters randomly. Under the sampling scheme described in this section, the model is fully defined by a choice of just twelve parameters, listed in Table 2.
We choose consumer preferences c_{iα} as follows. We assume that each specialist family has a preference for one resource class A (where A = 1…F) with 0 ≤ F ≤ T, and we denote the consumer coefficients for this family by \({c}_{i\alpha }^{A}\). We also consider generalists that have no preferences, with consumer coefficients \({c}_{i\alpha }^{{\rm{gen}}}\). The \({c}_{i\alpha }^{A}\) can be drawn from one of three probability distributions: (i) a Normal/Gaussian distribution, (ii) a Gamma distribution (which ensures positivity of the coefficients), and (iii) a Bernoulli distribution with binary preference levels.
The key parameters for constructing all three distributions are the mean μ_{c} and the variance \({\sigma }_{c}^{2}\) of the sum ∑_{α}c_{iα} over a row of the matrix.
In the current study, we focus on binary preference levels (option iii). In this model, there are two possible values for each c_{iα}: a low level \(\frac{{c}_{0}}{M}\) and a high level \(\frac{{c}_{0}}{M}+{c}_{1}\). For a given choice of μ_{c}, the parameters c_{0} and c_{1} together determine the variance \({\sigma }_{c}^{2}\). The elements of \({c}_{i\alpha }^{A}\) are given by
where X_{iα} is a binary random variable that equals 1 with probability
for the specialist families, and
for the generalists.
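A minimal sketch of the binary sampling scheme (option iii) is given below. It assumes one particular form for the Bernoulli probabilities: a baseline p_0 = μ_{c}/(M c_{1}), raised to p_0 [1 + q (M − M_A)/M_A] for resources in a family's preferred class of size M_A and lowered to p_0 (1 − q) elsewhere, with generalists using p_0 throughout. These expressions, the function name and the argument names are illustrative assumptions, chosen so that the expected row sum equals μ_{c} when c_{0} = 0.

```python
import numpy as np

def sample_binary_preferences(family_sizes, n_generalists, class_sizes,
                              mu_c=10.0, c0=0.0, c1=1.0, q=0.9, seed=0):
    """Sample c[i, a] = c0 / M + c1 * X[i, a] with Bernoulli X (assumed probabilities)."""
    rng = np.random.default_rng(seed)
    M = sum(class_sizes)
    p0 = mu_c / (M * c1)                          # baseline probability of a high level
    class_of = np.repeat(np.arange(len(class_sizes)), class_sizes)
    blocks = []
    for A, n_species in enumerate(family_sizes):
        M_A = class_sizes[A]
        # Assumed specialist probabilities: enriched inside class A, depleted outside.
        p = np.where(class_of == A, p0 * (1 + q * (M - M_A) / M_A), p0 * (1 - q))
        blocks.append(rng.random((n_species, M)) < p)
    blocks.append(rng.random((n_generalists, M)) < p0)   # generalists: no bias
    X = np.vstack(blocks)
    return c0 / M + c1 * X
```

For example, `sample_binary_preferences([800] * 6, 200, [50] * 6)` returns a 5,000 × 300 matrix whose rows sum to about 10 on average; setting q = 0 removes the family structure.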
We choose the metabolic matrix D_{αβ} according to the three-tiered secretion model depicted in Fig. 4. The first tier is a preferred class of ‘waste’ products, such as carboxylic acids for fermentative and respiro-fermentative bacteria, with M_{w} members. The second tier contains byproducts of the same class as the input resource (when the input resource is not in the preferred byproduct class). For example, this could be attributed to the partial oxidation of sugars into sugar alcohols, or the antiporter behavior of various amino acid transporters. The third tier includes everything else. We encode this structure in D_{αβ} by sampling each column β of the matrix from a Dirichlet distribution with concentration parameters d_{αβ} that depend on the byproduct tier, so that on average a fraction f_{w} of the secreted flux goes to the first tier, while a fraction f_{s} goes to the second tier, and the rest goes to the third. The Dirichlet distribution has the property that each sampled vector sums to 1, making it a natural way of randomly allocating a fixed total quantity (such as the total secretion flux from a given input). To write the expressions for these parameters explicitly, we let A(α) represent the class containing resource α, and let w represent the ‘waste’ class. We also introduce a parameter s that controls the sparsity of the reaction network, ranging from a dense network with all-to-all connections when s → 0, to maximal sparsity with each input resource having just one randomly chosen output resource as s → 1. With this notation, we have
The final two lines handle the case when the ‘waste’ type is being consumed. In this case, the fraction allocated to the waste type is the sum of the fractions allocated to ‘same’ and ‘waste’.
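The three-tiered Dirichlet sampling can be sketched as follows. The sketch assumes that each tier's mean fraction (f_{w}, f_{s}, or the remainder) is divided evenly among the tier's members and scaled by 1/s to give the concentration parameters; this specific choice, along with the function and argument names, is an assumption for illustration rather than a transcription of the expressions used in the study.

```python
import numpy as np

def sample_metabolic_matrix(class_sizes, waste_class=0, f_w=0.45, f_s=0.45,
                            s=0.3, seed=0):
    """Sample D[a, b] column by column from tier-dependent Dirichlet distributions."""
    rng = np.random.default_rng(seed)
    class_of = np.repeat(np.arange(len(class_sizes)), class_sizes)
    M = class_of.size
    is_waste = class_of == waste_class
    D = np.zeros((M, M))
    for beta in range(M):                          # beta labels the consumed resource
        same = class_of == class_of[beta]
        if is_waste[beta]:
            # Waste consumed: 'same' and 'waste' tiers coincide, fractions add.
            tiers = [(is_waste, f_w + f_s), (~is_waste, 1 - f_w - f_s)]
        else:
            tiers = [(is_waste, f_w), (same, f_s),
                     (~is_waste & ~same, 1 - f_w - f_s)]
        conc = np.zeros(M)
        for mask, frac in tiers:
            if mask.any():                         # skip tiers with no members
                conc[mask] = frac / (s * mask.sum())
        D[:, beta] = rng.dirichlet(conc)           # each column sums to 1
    return D
```

With this parameterization, small s gives large concentration parameters and hence nearly uniform allocation within each tier, while larger s concentrates each column on a few outputs.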
Solving for uninvadable equilibrium
We computed the uninvadable equilibrium state of Equation (2) using a novel algorithm inspired by expectation-maximization methods in machine learning. The algorithm is described in detail in Ref. ^{10}, and implemented computationally in the Community Simulator package.
The raw results of the computation have nonzero abundances for all species, due to technical limits on numerical precision in the solver. In all simulations, the abundance distribution was clearly bimodal, with well-separated peaks on a log scale for the surviving vs. extinct species. For purposes of determining species richness, we set the abundance of all species in the “extinct” group to zero. Histograms of the raw results are plotted in the accompanying Jupyter notebook, where the choice of threshold for removing extinct species can be directly verified.
The large simulations with 300 resources and 5,000 species pushed the limits of our implementation of the algorithm, and occasionally failed to converge. Before performing further analysis, we directly verified that a true solution had been found by calculating the per-capita growth rate \(d\,{\rm{ln}}\,{N}_{i}/dt\) for all surviving species. A histogram of the maximum value of \(| d\,{\rm{ln}}\,{N}_{i}/dt| \) for each community (on a log scale) shows that most simulations are around 10^{−7}, with the upper tail reaching to around 10^{−5}. In the least stable simulation, with S = 4,900 and two externally supplied resources, the failed runs form a second cluster around \(| d\,{\rm{ln}}\,{N}_{i}/dt| =1{0}^{-3}\). To eliminate such runs, we set a threshold for all simulations, discarding samples with \(| d\,{\rm{ln}}\,{N}_{i}/dt| \ge 1{0}^{-5}\). For the least stable scenario, 29 of the 900 samples exceeded the threshold, and all other scenarios had between 0 and 11 such samples.
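The two post-processing steps described above (zeroing the extinct mode and discarding unconverged runs) can be sketched as follows. The widest-gap rule used here to separate the two modes is a stand-in heuristic, not the thresholding used in the accompanying notebook, and both function names are hypothetical.

```python
import numpy as np

def survivors_from_bimodal(abundances, n_floor=1e-300):
    """Zero the 'extinct' mode by cutting at the widest gap in log10 abundance."""
    logN = np.log10(np.maximum(abundances, n_floor))
    ordered = np.sort(logN)
    k = np.argmax(np.diff(ordered))               # index of the widest gap
    cutoff = 0.5 * (ordered[k] + ordered[k + 1])
    return np.where(logN > cutoff, abundances, 0.0)

def passes_convergence(percapita_growth, survivors, tol=1e-5):
    """True if max |d ln N_i / dt| over surviving species is below tol."""
    alive = survivors > 0
    return bool(np.max(np.abs(percapita_growth[alive])) < tol)
```

A community would be kept for further analysis only if `passes_convergence` returns True; species richness is then the number of nonzero entries returned by `survivors_from_bimodal`.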
Synthetic data for global biodiversity patterns
Synthetic data for Figs. 2 and 3 were generated using a regional species pool of size S_{tot} = 180 and M = 90 potential resources. The elements of the 180 × 90 consumer preference matrix c_{iα} were sampled from a binary distribution as described above, with c_{0} = 0, c_{1} = 1 and μ_{c} = 10, using only one resource class (T = 1) and one consumer family (F = 1). The 90 × 90 metabolic matrix D_{αβ} was sampled from a Dirichlet distribution as described above, with s = 0.05. The m_{i} were sampled from a Gaussian distribution with mean 1 and standard deviation 0.01. A random number drawn from a uniform distribution between −0.5 and 9.5 was added to all the m_{i} in each sample.
The rest of the parameters differed among the three scenarios we simulated, and were chosen as follows:
Simple environment (same as “selection-limited” in Fig. 3): Each sample was stochastically colonized with S = 150 out of the 180 possible species, and supplied with a single external resource, with κ_{1} = 200 and κ_{α} = 0 for all α ≠ 1.
Complex environment: Each sample was stochastically colonized with S = 150 out of the 180 possible species, and supplied with all external resources, with κ_{α} = 200/M ≈ 2.2 for all α.
Dispersal limited: Each sample was stochastically colonized with a randomly chosen number of species, uniformly distributed between S = 1 and S = S_{tot} = 180, and supplied with a single external resource, with κ_{1} = 200 and κ_{α} = 0 for all α ≠ 1.
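The three scenarios differ only in how many species colonize each sample and how the supply is distributed, which the sketch below makes explicit. It assumes that the baseline m_{i} are drawn once for the regional pool and that the harshness offset is drawn once per sample; the scenario labels and function name are hypothetical.

```python
import numpy as np

S_TOT, M = 180, 90
rng = np.random.default_rng(0)
base_m = rng.normal(1.0, 0.01, size=S_TOT)    # assumption: drawn once for the pool

def sample_community(scenario, rng=rng):
    """Return (colonizing species indices, their m_i, supply vector kappa)."""
    if scenario == "dispersal_limited":
        n_colonists = int(rng.integers(1, S_TOT + 1))   # uniform on 1..180
    else:
        n_colonists = 150
    present = rng.choice(S_TOT, size=n_colonists, replace=False)
    harshness = rng.uniform(-0.5, 9.5)        # one offset shared by the whole sample
    m = base_m[present] + harshness
    kappa = np.zeros(M)
    if scenario == "complex":
        kappa[:] = 200.0 / M                  # all resources supplied equally
    else:
        kappa[0] = 200.0                      # single externally supplied resource
    return present, m, kappa
```

In every scenario the total supplied flux is 200, so differences in richness between samples come from the harshness offset, the colonization draw, and the distribution of the supply.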
Synthetic data for human microbiome patterns
To generate synthetic data for Figs. 6–9, we assumed a regional species pool of size S_{tot} = 5,000, with M = 300 possible resource types. Resources were grouped into T = 6 classes of 50 resource types each, labeled A through F. Microbial species were grouped into 6 specialist “families” of 800 species, with each family specializing in one resource class as described above. The remaining 200 species were designated as generalists, with no bias towards any one resource class. The consumption parameters were set to c_{0} = 0, c_{1} = 1 and μ_{c} = 10 as for the previous set of simulations. The metabolic matrix sparsity was set to s = 0.3, to reflect the actual sparsity of the E. coli metabolic network^{38}, and the secretions were allocated with f_{s} = f_{w} = 0.45. The m_{i} were sampled from a Gaussian distribution with mean 1 and standard deviation 0.01. Each community was supplied with the same total incoming energy flux κ = ∑_{α}κ_{α} = 1,000. For each scenario, we simulated 900 independent communities, evenly partitioned among three “body sites” with different environmental characteristics.
The rest of the parameters were varied to construct eight different scenarios. For S = 2,500 (strong dispersal limitation) and S = 4,900 (weak dispersal limitation), we made the following four combinations of species properties and environmental conditions:
Metabolically distinct simple environments: In each of the three simulated body sites, one resource was chosen from each of two resource classes (A + B, C + D, and E + F). The relative flux levels for these two resources were chosen for each of the 300 communities in the site by randomly sampling a number a from a uniform distribution over the interval [0, 1], and then letting (1 − a)κ be the flux of the first resource and aκ the flux of the second. Taxonomic structure was incorporated by setting a high strength of specialization q = 0.9.
Metabolically distinct complex environments: In each of the three simulated body sites, all resources were supplied from two resource classes (A + B, C + D, and E + F), with flux levels from all other classes still set to zero. The relative flux levels for 100 resource types in each site were randomly sampled for each community, by independently sampling 100 numbers \({\widetilde{\kappa }}_{\alpha }\) from a uniform distribution over the interval [0, 1], and then setting \({\kappa }_{\alpha }=\kappa \frac{{\widetilde{\kappa }}_{\alpha }}{{\sum }_{\beta }{\widetilde{\kappa }}_{\beta }}\). Taxonomic structure was incorporated by setting a high strength of specialization q = 0.9.
No taxonomic structure: Same as “metabolically distinct simple environments” above, except that taxonomic structure was removed by setting q = 0.
Metabolically overlapping complex environments: Same as “metabolically distinct complex environments” except that the 300 resources were randomly partitioned into three sets of 100, and each body site was supplied with resources from a different set.
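The two supply schemes described in these scenarios can be written compactly as below. The function names and the integer indexing of resources are assumptions made for the example; only the sampling rules themselves come from the text.

```python
import numpy as np

M, KAPPA_TOT = 300, 1000.0

def simple_supply(res_first, res_second, rng):
    """Two supplied resources with fluxes (1 - a) * kappa and a * kappa, a ~ U[0, 1]."""
    a = rng.uniform(0.0, 1.0)
    kappa = np.zeros(M)
    kappa[res_first] = (1.0 - a) * KAPPA_TOT
    kappa[res_second] = a * KAPPA_TOT
    return kappa

def complex_supply(resource_ids, rng):
    """Independent U[0, 1] weights on the supplied resources, normalized to kappa."""
    raw = rng.uniform(0.0, 1.0, size=len(resource_ids))
    kappa = np.zeros(M)
    kappa[resource_ids] = KAPPA_TOT * raw / raw.sum()
    return kappa
```

For the metabolically overlapping case, the three supplied sets could be obtained with `rng.permutation(M).reshape(3, 100)` and passed one row at a time to `complex_supply`.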
Synthetic data for marine microbiome patterns
The abundance distributions in Fig. 5 were generated directly from the “Simple environments” and “No taxonomic structure” simulations described above for the human microbiome patterns.
The tests of modular community assembly in Fig. 10 were performed using the same setup as “Simple environments” in the human microbiome simulations, but with just two “body sites” (A + B and C + D), and three values of a (0, 1 and 0.5). The unstructured metabolism control was performed by setting T = F = 1, assigning all 300 resources to the same resource class before sampling the metabolic and consumer preference matrices.
Relative abundance distributions and Fisher log series
To create Fig. 5, we first downloaded the 16S OTU table from the Tara Oceans companion website (http://oceanmicrobiome.embl.de/companion.html)^{20}. We performed 300 independent rarefactions to a constant read depth of 10,000. For each possible number of reads, from 1 through the maximum observed, we plotted the number of species assigned that number of reads (“population size”), averaged over all rarefactions.
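The rarefaction and averaging procedure can be sketched for a single vector of OTU read counts as follows. How the OTU table was pooled or split before rarefaction is not specified here, so the single-vector input and the function names are assumptions of this example.

```python
import numpy as np

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from an integer count vector."""
    return rng.multivariate_hypergeometric(counts, depth)

def mean_abundance_histogram(counts, depth=10_000, n_rarefactions=300, seed=0):
    """Mean number of OTUs holding exactly n reads, for n = 1..depth."""
    rng = np.random.default_rng(seed)
    hist = np.zeros(depth + 1)
    for _ in range(n_rarefactions):
        sub = rarefy(counts, depth, rng)
        hist += np.bincount(sub, minlength=depth + 1)
    return hist[1:] / n_rarefactions    # drop n = 0 (OTUs absent after rarefaction)
```

Entry n − 1 of the returned array is the quantity plotted against population size n.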
In many ecological settings, it has been observed that the number s(n) of species with n individuals in a sample of N total individuals closely follows the Fisher log series^{25,26}:
where the parameters x and α are determined from N and the total number of observed species S through the following equations:
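In the standard parameterization of Fisher's log series^{25}, which is assumed here, the distribution and the two conditions that fix x and α can be written as:

```latex
s(n) = \alpha\, \frac{x^{n}}{n}, \qquad
S = \alpha \ln\!\left(1 + \frac{N}{\alpha}\right), \qquad
x = \frac{N}{N + \alpha}.
```

With these choices, summing s(n) over all n ≥ 1 gives S, and summing n s(n) gives N.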
In the first panel of Fig. 5 we plot Eq. (8) using N = 10,000 (the read depth we manually imposed for the rarefaction) and S equal to the number of OTUs with more than 5 reads assigned to them in the original dataset.
For the simulation data in Fig. 5, we had the further advantage of having access to multiple independent trials under statistically similar conditions. Instead of averaging over multiple rarefactions generated from the same underlying dataset, we averaged over single samples of N = 1,000 individuals taken from each of the 900 parallel communities. We plotted the Fisher log series with N = 1,000 and S equal to the number of species with nonzero abundance.
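Given N and S, the log-series curve can be computed numerically. The sketch below assumes the standard relations S = α ln(1 + N/α), x = N/(N + α) and s(n) = α x^n / n, and solves for α by bisection; the function names are illustrative.

```python
import math
import numpy as np

def fisher_alpha(N, S, n_iter=200):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection (needs 0 < S < N)."""
    def excess(alpha):
        return alpha * math.log1p(N / alpha) - S
    lo, hi = 1e-12, 1.0
    while excess(hi) < 0:               # expand the bracket until it contains the root
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fisher_log_series(N, S, n_max):
    """Expected number of species with n individuals, for n = 1..n_max."""
    alpha = fisher_alpha(N, S)
    x = N / (N + alpha)
    n = np.arange(1, n_max + 1)
    return alpha * x**n / n
```

For instance, `fisher_log_series(1000, S, 1000)` gives the curve for a sample of 1,000 individuals once S has been counted from the simulated community.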
We also plotted the distributions obtained when the underlying relative proportions of S species are generated by a simple null model, in which the invasion fitness (low-density growth rate) of each species is sampled from a Gaussian distribution. Species with negative values go extinct, and those with positive values end up with population sizes proportional to the invasion fitness. The resulting relative abundances of species in an infinitely large community follow a truncated Gaussian distribution. This distribution is determined by a single parameter, up to an overall scale that is irrelevant for the purposes of the current analysis. The parameter is inferred from the simulation results by matching the fraction of species initially present in the community that survive to equilibrium. In the figure, the green curves come from first sampling S underlying species abundances from this distribution, then sampling N = 1,000 individuals from the resulting population, and averaging the results over 10,000 independent iterations.
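A minimal sketch of this null model is shown below. It assumes unit-variance Gaussian fitness with the mean set by the inverse normal CDF of the surviving fraction, and it draws the full initial pool before truncating at zero; both choices, and the function name, are assumptions of this example rather than details taken from the analysis scripts.

```python
import numpy as np
from statistics import NormalDist

def null_model_histogram(S_initial, surviving_fraction, n_individuals=1000,
                         n_iter=1000, seed=0):
    """Mean number of species with n sampled individuals under the Gaussian null model."""
    rng = np.random.default_rng(seed)
    # Mean fitness (in units of sigma) such that P(fitness > 0) = surviving_fraction.
    mu = NormalDist().inv_cdf(surviving_fraction)
    hist = np.zeros(n_individuals + 1)
    for _ in range(n_iter):
        fitness = rng.normal(mu, 1.0, size=S_initial)
        abund = np.clip(fitness, 0.0, None)       # negative fitness -> extinct
        # Assumes at least one survivor, which holds for large pools.
        reads = rng.multinomial(n_individuals, abund / abund.sum())
        hist += np.bincount(reads, minlength=n_individuals + 1)
    return hist[1:] / n_iter
```

The returned array can be overlaid on the simulated abundance distribution and the Fisher log series for comparison.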
Computation of overlap and dissimilarity
Here we summarize the definitions of the dissimilarity \(D(\{{N}_{i}^{\mu }\},\{{N}_{j}^{\nu }\})\) and overlap \(O(\{{N}_{i}^{\mu }\},\{{N}_{j}^{\nu }\})\) between two sets of population size measurements, as given in Ref. ^{33}. Here, μ = 1, 2, …, C is an index labeling the sample from which the measurement was taken (as is ν). In order to define these two quantities, we must first introduce some notation concerning the shared species. We let \({{\mathcal{S}}}^{\dagger }\) represent the set of species that are present in both communities, and denote the total number of species in this set by S^{†}. We also define two types of normalized abundances:
where the second quantity is normalized only by the set of species that is shared with the other community in the pair. We also define the average composition over the shared species:
Using these definitions, we can finally write
The first equation is simply the square root of the Jensen-Shannon divergence between the relative abundances of the overlapping species, and the second measures the relative abundance of the species in the overlapping set, averaged over the two communities.
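The verbal definitions above translate into a few lines of code. The sketch below assumes natural logarithms in the Jensen-Shannon divergence and treats any strictly positive abundance as "present"; the function name is hypothetical.

```python
import numpy as np

def dissimilarity_overlap(N_mu, N_nu):
    """Return (D, O) for two abundance vectors defined over the same species list."""
    x = N_mu / N_mu.sum()                 # relative abundances in each community
    y = N_nu / N_nu.sum()
    shared = (x > 0) & (y > 0)
    if not shared.any():
        return np.nan, 0.0                # dissimilarity undefined without shared species
    overlap = 0.5 * (x[shared].sum() + y[shared].sum())
    xh = x[shared] / x[shared].sum()      # renormalized over shared species only
    yh = y[shared] / y[shared].sum()
    m = 0.5 * (xh + yh)                   # average composition over shared species
    kl = lambda p, q: np.sum(p * np.log(p / q))
    jsd = 0.5 * (kl(xh, m) + kl(yh, m))
    return float(np.sqrt(max(jsd, 0.0))), float(overlap)
```

Applying this function to every pair of communities yields the point cloud from which the dissimilarity-overlap curve is estimated.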
Computation of effective Lotka-Volterra parameters
The distinction between “host-specific” vs. “universal” population dynamics is most clearly defined in terms of a closed set of equations for the dynamics of the population sizes, with environmental factors treated implicitly^{33}. We can transform the MiCRM into a model of this form by examining the regime where the resource dynamics are “fast” compared to the timescale for changes in population sizes. We can then simplify the form of the resulting model by performing a Taylor expansion of the growth rate around the equilibrium population sizes \({\bar{N}}_{i}\), resulting in generalized Lotka-Volterra equations parameterized by a set of carrying capacities and interaction coefficients.
We start by writing the full dynamical equation (2) in a more compact form:
with
We now invoke the “fast resource equilibration” assumption to set dR_{α}/dt = 0, and solve for the resource concentrations \({\bar{R}}_{\alpha }\) as functions of the set of population sizes {N_{j}}. Inserting this result back into the equations for the dynamics of N_{i}, we have:
To obtain the local Lotka-Volterra coefficients, we perform a Taylor expansion of the term in brackets around \({N}_{j}={\bar{N}}_{j}\), up to first order in the distance from equilibrium:
where
We can compute the derivatives of \(\bar{R}\) through implicit differentiation to obtain
with
Thus we find
and conclude
where
Since all the parameters and equilibrium abundances are known in the simulations, this set of equations allows us to compute for each community μ (μ = 1, 2, …, 300) the bare growth rates \({r}_{i}^{\mu }\), the interactions \({\alpha }_{ij}^{\mu }\) and the carrying capacities \({K}_{i}^{\mu }\). The scatter plot in the bottom right panel of Fig. 9 shows the normalized RMS variability in the carrying capacities for each pair of samples μ and ν, computed as:
where S^{†} is the number of surviving species shared between the two communities, and \({{\mathcal{S}}}^{\dagger }\) is the set of indices of the shared species.
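The core of this calculation, the first-order response of the growth rates to changes in population sizes under fast resource equilibration, can be sketched numerically. The code assumes a linear-uptake MiCRM with uniform leakage l and g = w = 1, for which the steady-state resource equations are linear in R; it returns the matrix of derivatives of the per-capita growth rates, from which Lotka-Volterra coefficients follow after a choice of normalization. Function names and the specific model form are assumptions of this example.

```python
import numpy as np

def equilibrium_resources(N, c, D, kappa, l=0.8, tau_R=1.0):
    """Solve dR/dt = 0 at fixed N; returns (R_bar, A) with A @ R_bar = kappa."""
    u = N @ c                                     # u_a = sum_j c_ja N_j
    A = np.diag(1.0 / tau_R + u) - D * (l * u)    # A_ab = delta_ab (1/tau + u_a) - D_ab l u_b
    return np.linalg.solve(A, kappa), A

def effective_interactions(N, c, D, kappa, l=0.8, g=1.0, tau_R=1.0):
    """J[i, j] = d(per-capita growth rate of i) / dN_j at quasi-steady resources."""
    R, A = equilibrium_resources(N, c, D, kappa, l, tau_R)
    cR = (c * R).T                                # (M, S): entry [a, j] = c_ja R_a
    B = cR - l * D @ cR                           # (dA/dN_j) @ R, one column per species j
    dR_dN = -np.linalg.solve(A, B)                # implicit differentiation of A @ R = kappa
    return g * (1.0 - l) * c @ dR_dN
```

Evaluating `effective_interactions` at the equilibrium abundances of each community gives the local linear-response matrix; comparing these matrices across communities indicates how strongly the effective parameters depend on the environment.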
References
Thompson, L. R. et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551, 457 (2017).
Huttenhower, C. et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207 (2012).
Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59 (2010).
Levin, S. A. The problem of pattern and scale in ecology: the Robert H. MacArthur award lecture. Ecology 73, 1943–1967 (1992).
Hart, S. F. et al. Uncovering and resolving challenges of quantitative modeling in a simplified community of interacting cells. PLoS biology 17, e3000135 (2019).
Wigner, E. P. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics (ser. 2) 62, 548 (1955).
May, R. Will a Large Complex System be Stable? Nature 238, 413 (1972).
Goldford, J. E. et al. Emergent simplicity in microbial community assembly. Science 361, 469 (2018).
Marsland, R. III. et al. Available energy fluxes drive a transition in the diversity, stability, and functional structure of microbial communities. PLOS Computational Biology 15, e1006793 (2019).
Marsland, R. III, Cui, W., Goldford, J. & Mehta, P. The Community Simulator: A Python package for microbial ecology. arXiv:1904.09367 (2019).
MacArthur, R. Species Packing and Competitive Equilibrium for Many Species. Theoretical Population Biology 1, 1 (1970).
Harcombe, W. R. et al. Metabolic resource allocation in individual microbes determines ecosystem interactions and spatial dynamics. Cell Reports 7, 1104 (2014).
Zomorrodi, A. R. & Segrè, D. Synthetic ecology of microbes: Mathematical models and applications. J. Mol. Biol. 428, 837 (2016).
Pacheco, A. R., Moel, M. & Segrè, D. Costless metabolic secretions as drivers of interspecies interactions in microbial ecosystems. Nature Communications 10, 103 (2019).
Leibold, M. A. et al. The metacommunity concept: a framework for multiscale community ecology. Ecology Letters 7, 601 (2004).
Vellend, M. The Theory of Ecological Communities (MPB-57), vol. 75 (Princeton University Press, 2016).
HilleRisLambers, J., Adler, P., Harpole, W., Levine, J. & Mayfield, M. Rethinking community assembly through the lens of coexistence theory. Annual Review of Ecology, Evolution, and Systematics 43 (2012).
Dini-Andreote, F. & Raaijmakers, J. M. Embracing community ecology in plant microbiome research. Trends in Plant Science 23, 467–469 (2018).
Shurin, J. B. Dispersal limitation, invasion resistance, and the structure of pond zooplankton communities. Ecology 81, 3074 (2000).
Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
Enke, T. N. et al. Modular assembly of polysaccharide-degrading microbial communities in the ocean. Current Biology 29, 1528 (2019).
Mehta, P., Cui, W., Wang, C.-H. & Marsland, R. III. Constrained optimization as ecological dynamics with applications to random quadratic programming in high dimensions. Physical Review E 99, 052111 (2018).
Marsland, R. III, Cui, W. & Mehta, P. The minimum environmental perturbation principle: A new perspective on niche theory arXiv:1901.09673 (2019).
Hoehler, T. M. & Jørgensen, B. B. Microbial life under extreme energy limitation. Nature Reviews Microbiology 11, 83 (2013).
Fisher, R. A., Corbet, A. S. & Williams, C. B. The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology 42–58 (1943).
Magurran, A. E. Species abundance distributions: pattern or process? Functional Ecology 19, 177–181 (2005).
Advani, M., Bunin, G. & Mehta, P. Statistical physics of community ecology: a cavity solution to MacArthur’s consumer resource model. Journal of Statistical Mechanics 2018, 033406 (2018).
Hubbell, S. P. The Unified Neutral Theory of Biodiversity and Biogeography (MPB-32) (Princeton University Press, 2001).
Pearce, M. T., Agarwala, A. & Fisher, D. S. Stabilization of extensive fine-scale diversity by spatiotemporal chaos. bioRxiv 736215 (2019).
Costea, P. I. et al. Enterotypes in the landscape of gut microbial community composition. Nature Microbiology 3, 8 (2018).
Gorvitovskaia, A., Holmes, S. P. & Huse, S. M. Interpreting prevotella and bacteroides as biomarkers of diet and lifestyle. Microbiome 4, 15 (2016).
Arumugam, M. et al. Enterotypes of the human gut microbiome. Nature 473, 174 (2011).
Bashan, A. et al. Universality of human microbial dynamics. Nature 534, 259 (2016).
Kalyuzhny, M. & Shnerb, N. M. Dissimilarity-overlap analysis of community dynamics: Opportunities and pitfalls. Methods in Ecology and Evolution 8, 1764 (2017).
Barbier, M., Arnoldi, J.-F., Bunin, G. & Loreau, M. Generic assembly patterns in complex ecological communities. Proceedings of the National Academy of Sciences 115, 2156–2161 (2018).
Cui, W., Marsland, R. III. & Mehta, P. Diverse communities behave like typical random ecosystems arXiv:1904.02610 (2019).
Csilléry, K., Blum, M. G., Gaggiotti, O. E. & François, O. Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution 25, 410–418 (2010).
Wagner, A. & Fell, D. A. The small world inside large metabolic networks. Proceedings of the Royal Society of London B: Biological Sciences 268, 1803 (2001).
Acknowledgements
This work was supported by NIH NIGMS grant 1R35GM119461 and Simons Investigator in the Mathematical Modeling of Living Systems (MMLS) award to PM. Computational work was performed on the Shared Computing Cluster which is administered by Boston University Research Computing Services.
Author information
Contributions
R.M., W.C. and P.M. devised the model. R.M. performed the simulations and wrote the manuscript. W.C. and P.M. critically revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Marsland, R., Cui, W. & Mehta, P. A minimal model for microbial biodiversity can reproduce experimentally observed ecological patterns. Sci Rep 10, 3308 (2020). https://doi.org/10.1038/s41598020601302