Introduction

Over the past decade, next-generation sequencing has highlighted the incredible diversity of the microbial ecosystems that fill every corner of our planet. Microbial communities are highly complex and occur in environments ranging from soils to the human body. Large-scale surveys of microbial biodiversity, such as the Earth Microbiome Project (EMP), the Human Microbiome Project (HMP) and the European Metagenomics of the Human Intestinal Tract project (MetaHIT), have revealed a number of robust and reproducible patterns in community composition and function1,2,3. A major challenge for contemporary microbial ecology is to understand and identify the ecological origins of these patterns. This problem is especially difficult because it involves what in the ecology literature has been called the “problem of pattern and scale”4: explaining ecological patterns requires connecting processes that occur at very different scales of spatial, temporal, and taxonomical organization.

One potential approach for overcoming the problem of scale is to use mathematical models and large simulations to investigate how mechanistic assumptions about environmental and taxonomical structure at the microscopic scale affect the kind of ecological patterns observed at larger scales. A major obstacle in realizing this goal is that any mathematical model that seeks to explain modern microbial sequencing data must deal with the enormous complexity of microbial communities: the numbers of species and consumable molecules in a community can easily reach into the hundreds or thousands1. Thus, by necessity any mechanistic model of community assembly will have an extraordinary number of free parameters, presenting a major obstacle for understanding microbial dynamics5.

One potential strategy for overcoming this difficulty is to exploit the observation that complex systems often have generic behaviors that can be described by sampling parameters from an appropriately chosen random distribution6,7. The most famous example of this is in nuclear physics where the intractably complicated quantum dynamics of the uranium nucleus were successfully modeled using random matrices6. Recently, we have adapted these ideas to the microbial setting by formulating a minimal model for microbial population dynamics we term the Microbial Consumer Resource Model (MiCRM)8,9,10 (see Fig. 1).

Figure 1

A minimal model for investigating microbial biodiversity. (a) The Microbial Consumer Resource Model extends the classic Consumer Resource Model of MacArthur and Levins11 by incorporating the generic exchange of secondary metabolites observed in microbial communities, as described in the Methods. Each consumed resource type α (stars, squares, circles) with abundance Rα is taken up by species i at a rate ciαRα, and transformed into other resource types through metabolic reactions inside each cell with normalized stoichiometry matrix Dαβ. A fraction l of the resulting chemical flux returns to the environment, where it can be consumed by other microbes, while the rest is retained and used for growth. (b) Communities are initialized by randomly sampling subsets of species from a given regional pool, simulating the effect of stochastic colonization. The importance of dispersal limitation for community assembly can be tuned by adjusting the number of species in each subset. (c) Each community is supplied with a constant influx of specified resource types, and all resources are diluted at a fixed rate. We assume that each community is well-mixed, so that its state is fully defined by the set of resource abundances Rα and microbial population sizes Ni. (d) Heat map of randomly sampled matrix of consumer preferences ciα with S = 200 species and M = 100 resource types. (e) Heat map of randomly sampled metabolic matrix Dαβ, which encodes the allowed metabolic transformations and their relative rates, shown here with M = 100 resource types.

The MiCRM builds on the classic framework for resource competition developed by MacArthur and Levins11. As in all consumer resource models, species in the MiCRM are defined by their preferences for resources (Fig. 1d). Species with similar preferences naturally compete with each other, giving rise to competitive exclusion and niche partitioning. Crucially, the MiCRM incorporates two additional pieces of biological knowledge that are specific to microbial communities. First, the MiCRM explicitly includes cross-feeding and syntrophy, i.e. the consumption of metabolic byproducts of one species by another species8,12,13,14. This is incorporated into the MiCRM through a stoichiometric metabolic matrix that parameterizes the metabolic transformations of consumed metabolites into secreted byproducts (Fig. 1a,e). Second, the MiCRM incorporates stochasticity in dispersal and colonization15,16,17,18. Due to proximity effects, it is known that new environments are almost always colonized by only a subset of all the species capable of existing in that environment19. The MiCRM incorporates stochastic dispersal by seeding new environments through random sampling of a larger regional species pool (Fig. 1b).
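The verbal description above can be turned into a small numerical sketch. The full model equations are given in the Methods and are not reproduced in this section, so the form used below (linear uptake ciαRα, a single leakage fraction l shared by all resources, unit resource quality and unit conversion efficiency, a dilution time tau, and a plain Euler integrator) is an assumption made for illustration; it is not the Community Simulator implementation. All function and variable names are hypothetical.

```python
def micrm_step(N, R, c, D, m, kappa, l=0.5, tau=1.0, dt=0.02):
    """Advance populations N and resource abundances R by one Euler step."""
    S, M = len(N), len(R)
    # Per-capita uptake of resource a by species i: c[i][a] * R[a]
    uptake = [[c[i][a] * R[a] for a in range(M)] for i in range(S)]
    # Growth: retained fraction (1 - l) of the uptake, minus maintenance m[i]
    new_N = [max(N[i] + dt * N[i] * ((1.0 - l) * sum(uptake[i]) - m[i]), 0.0)
             for i in range(S)]
    new_R = []
    for a in range(M):
        supply = kappa[a] - R[a] / tau  # constant influx and dilution
        consumed = sum(N[i] * uptake[i][a] for i in range(S))
        # Leaked fraction l of each consumed resource b, routed to a by D[a][b]
        secreted = l * sum(N[i] * D[a][b] * uptake[i][b]
                           for i in range(S) for b in range(M))
        new_R.append(max(R[a] + dt * (supply - consumed + secreted), 0.0))
    return new_N, new_R


def simulate(N, R, c, D, m, kappa, steps=10000, **kwargs):
    """Iterate micrm_step; a crude stand-in for a proper ODE solver."""
    for _ in range(steps):
        N, R = micrm_step(N, R, c, D, m, kappa, **kwargs)
    return N, R


# Toy chain: species 0 eats the supplied resource 0 and leaks resource 1,
# species 1 eats resource 1 and leaks resource 2, species 2 eats resource 2.
c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
D = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # columns sum to 1
N_final, R_final = simulate([1.0] * 3, [1.0] * 3, c, D,
                            m=[1.0] * 3, kappa=[10.0, 0.0, 0.0])
```

In this toy chain the third species cannot meet its maintenance cost from the byproducts that reach it, so it is excluded while the first two persist. This is a small-scale analogue of the environmental filtering discussed later in the text.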

Taxonomic and metabolic assumptions are incorporated into the MiCRM through the choice of consumer preferences and metabolic matrices (see Methods for detailed discussion and implementation details). In the most minimal version of the MiCRM, species have no taxonomic structure (i.e. consumer preferences are uncorrelated across species and resources) and metabolism is completely random (i.e. the metabolic matrix has no structure beyond that required by energy and mass conservation). Large-scale surveys such as EMP and HMP often sample communities from very different environmental conditions. For this reason, it is important to be able to incorporate environmental structure and heterogeneity into our models. This is done by choosing which externally supplied resources are present in an environment (Fig. 1c).
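As a rough illustration of this most minimal version, the sketch below samples an unstructured preference matrix and an unstructured metabolic matrix using only the Python standard library. The Bernoulli probability p and the flat Dirichlet draw for each column are assumptions chosen for simplicity; the distributions actually used are specified in the Methods and in the Community Simulator package. The only constraint enforced here is the normalization of each column of D, which stands in for the conservation requirement.

```python
import random


def sample_preferences(S, M, p, rng):
    """S x M binary matrix: species i consumes resource a with probability p."""
    return [[1.0 if rng.random() < p else 0.0 for _ in range(M)]
            for _ in range(S)]


def sample_metabolic_matrix(M, rng):
    """M x M matrix with D[a][b] = share of the byproducts of b released as a.

    Each column is a uniform draw from the simplex (Dirichlet with unit
    concentration), so it sums to 1 and carries no other structure.
    """
    columns = []
    for _ in range(M):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(M)]
        total = sum(g)
        columns.append([x / total for x in g])
    # Transpose so that the first index is the secreted resource.
    return [[columns[b][a] for b in range(M)] for a in range(M)]


rng = random.Random(0)  # fixed seed so the example is reproducible
c = sample_preferences(S=200, M=100, p=0.1, rng=rng)  # p is hypothetical
D = sample_metabolic_matrix(M=100, rng=rng)
```

The sizes S = 200 and M = 100 match the heat maps of Fig. 1d,e; everything else is a placeholder.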

Importantly, the MiCRM also allows for the incorporation of additional metabolic and taxonomic structure, allowing us to ask how taxonomy and metabolism shape community structure and function. This is implemented in the MiCRM by dividing resources into general resource classes (e.g. sugars, carboxylic acids, lipids, amino acids, etc.) and then using a tiered secretion model where metabolic byproducts are preferentially secreted into certain resource classes (Methods). This allows us to incorporate metabolic structure missing in the minimal MiCRM, such as the fact that the fermentation of sugars preferentially results in the secretion of carboxylic acids.

Taxonomic structure can also be easily incorporated into the MiCRM by introducing correlations among the resource preferences of species from the same family. For example, it is well known that bacteria from the family Enterobacteriaceae have a strong preference for fermenting sugars. The MiCRM incorporates such preferences by assigning species to families, with each family preferentially consuming resources from certain resource classes. Importantly, we can control the amount of metabolic and taxonomic structure in the community by modulating just two parameters that control the correlation structures of the consumer preference and metabolic matrices (see Methods and ref. 10). This addresses the major modeling bottleneck discussed above, namely how to choose parameters for diverse ecosystems.

In this paper, we use the MiCRM to test simple hypotheses about the mechanistic origins of patterns observed in EMP, HMP and MetaHIT, as well as in recent studies of marine microbial communities20,21. We find that the MiCRM can qualitatively reproduce observed phenomena with minimal fitting or fine-tuning. We illustrate the utility of the model by identifying ecological mechanisms necessary for reproducing observed patterns as well as identifying ecological processes that can destroy these patterns. This allows us to use the MiCRM to generate new ecological hypotheses linking microscopic processes to large-scale patterns.

All simulation data and analysis scripts are available at https://github.com/Emergent-Behaviors-in-Biology/microbiome-patterns. The model itself is implemented in the freely available Python module Community-Simulator10 https://github.com/Emergent-Behaviors-in-Biology/community-simulator. Since the number of simulations required for comparisons with survey data is necessarily large, our numerical work relies heavily on a novel algorithm implemented in the Community Simulator, which takes advantage of a recently discovered duality between consumer resource models and constrained optimization to quickly and accurately simulate hundreds of communities22,23.

Results

Patterns in the Earth Microbiome Project can be explained by energetic costs associated with harsh environments

The Earth Microbiome Project is a systematic attempt to characterize global microbial diversity and function. It consists of over 20,000 samples in 17 environments located on all 7 continents1. Recently, a meta-analysis of these data was carried out and several robust patterns were identified. Chief among these was an anti-correlation between richness and environmental harshness, reproduced in Fig. 2. Samples near neutral pH or at moderate temperatures (~15°C) showed much higher levels of richness than samples from more extreme conditions. Peak richness dropped by a factor of 2 for pH values less than 5 or greater than 9, and temperatures less than 5°C or greater than 20°C.

Figure 2

Relationship between diversity and environmental harshness is modulated by environmental complexity. Left: Gray dots are the number of distinguishable strains observed in each sample of the EMP, plotted vs. pH and temperature. Black dots represent the 99th percentile of all communities at a given pH or temperature. Colored lines are fits of a Laplacian and a Gaussian distribution to the 99th percentile points. Reproduced from Figure 2 of the initial open-access report on the results of the EMP1. Right: The number of species surviving to steady state in simulated communities, plotted vs. environmental harshness. Harsher environments at extreme pH or temperature were simulated by increasing the total amount of resource consumption mi required for growth (by the same amount for all species). Blue squares are simulation results when all the energy was supplied via a single resource type, while orange circles are simulations where the incoming energy was evenly divided over all 90 possible resource types. See main text and Methods for simulation details.

The EMP samples also showed a strongly nested structure: less diverse communities tended to be subsets of the more diverse communities. This is most clearly visible by creating a presence/absence matrix that indicates whether a taxon is present in a sample. Each column in the matrix corresponds to a different sample and each row to a different taxon. When the rows are sorted by taxon prevalence and the columns by richness, as in Fig. 3, one can visually verify that the taxa composing the low-diversity communities are also present in most of the higher-diversity communities.
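The sorting procedure just described is easy to sketch. The example below reorders a presence/absence matrix (rows are taxa, columns are samples) by prevalence and richness, and also computes a deliberately simple nestedness score: the average fraction of a poorer sample's taxa that also occur in each richer sample. This score is an illustrative assumption and is not necessarily the statistic used in the EMP analysis; the function names are hypothetical.

```python
def sort_presence_matrix(P):
    """Sort rows by decreasing prevalence and columns by decreasing richness."""
    n_taxa, n_samples = len(P), len(P[0])
    prevalence = [sum(row) for row in P]
    richness = [sum(P[t][s] for t in range(n_taxa)) for s in range(n_samples)]
    rows = sorted(range(n_taxa), key=lambda t: -prevalence[t])
    cols = sorted(range(n_samples), key=lambda s: -richness[s])
    return [[P[t][s] for s in cols] for t in rows]


def mean_subset_fraction(P):
    """Mean share of a poorer sample's taxa found in a richer sample.

    Returns 1.0 for a perfectly nested matrix and 0.0 when poorer and
    richer samples never share a taxon.
    """
    n_taxa, n_samples = len(P), len(P[0])
    members = [{t for t in range(n_taxa) if P[t][s]} for s in range(n_samples)]
    fractions = []
    for s in range(n_samples):
        for u in range(n_samples):
            if 0 < len(members[s]) < len(members[u]):
                shared = len(members[s] & members[u])
                fractions.append(shared / len(members[s]))
    return sum(fractions) / len(fractions) if fractions else float("nan")


scrambled = [[0, 0, 1], [1, 0, 1], [1, 1, 1]]
ordered = sort_presence_matrix(scrambled)
nested_score = mean_subset_fraction(scrambled)
disjoint_score = mean_subset_fraction([[1, 0], [0, 1], [0, 1]])
```

After sorting, a nested matrix shows its filled entries packed into one corner, which is the visual signature referred to in the text.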

Figure 3

Nestedness of community composition indicates selection-dominated community assembly. Top: Presence (colored) or absence (white) of each microbial phylum in a representative set of 2,000 samples from the EMP. Reproduced from Figure 3 of the EMP report1. Different colors represent different biomes. Bottom: Presence (black) or absence (white) of species in simulated communities. Two different regimes of community assembly were simulated. The first is the selection-dominated scenario of Fig. 2, where variability in diversity is produced by variations in environmental harshness, and all samples are initialized with the vast majority (150/180) of the species in the regional pool. The second is a dispersal-dominated scenario, where environmental conditions are identical for all samples, but each sample is initialized with a different number of species, varying from 1 to 180. See main text and Methods for simulation details.

One possible cause of both these patterns is that microbes require more energy intake to survive in harsher environments24. For example, powering chaperones to prevent protein denaturation and running ion pumps to maintain pH homeostasis both require significant amounts of ATP. We hypothesized that varying energy demands could explain the patterns observed in the EMP since they would directly alter the severity of environmental filtering.

In the MiCRM, the energetic costs of reproduction are encoded in the model parameter mi, which is the minimal per-capita resource consumption required for net population growth of species i (see Methods for full model equations). The mi are sampled from a Gaussian distribution with mean 1 and standard deviation 0.01. To vary the harshness of an environment, we added an environment-specific random number menv to the mi of all species that colonized a given environment. A large menv corresponds to harsh environments with increased energetic demands whereas small or negative menv corresponds to energetically favorable environments. To mimic the variability in environmental harshness in the EMP, for each community we randomly drew menv uniformly between −0.5 and 9.5.
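A minimal sketch of this sampling scheme, using only the Python standard library, is shown below. The Gaussian and uniform parameters are the ones quoted in the text; the pool size of 180 species and the 300 communities are taken from the simulation set-up of this section, and the helper names are hypothetical.

```python
import random


def sample_maintenance_costs(n_species, rng, mean=1.0, sd=0.01):
    """Baseline costs m_i, one per species in the regional pool."""
    return [rng.gauss(mean, sd) for _ in range(n_species)]


def apply_harshness(m, m_env):
    """Shift every species' cost by the same environment-specific amount."""
    return [m_i + m_env for m_i in m]


rng = random.Random(0)  # fixed seed for reproducibility
m = sample_maintenance_costs(180, rng)
# One harshness value per community, uniform on [-0.5, 9.5]
m_env = [rng.uniform(-0.5, 9.5) for _ in range(300)]
costs = [apply_harshness(m, e) for e in m_env]
```

Because the shift is identical for all species within a community, differences between species are preserved while the overall energetic bar for survival moves up or down.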

To selectively test the effect of an energy demand gradient on both alpha (within-sample) and beta (between-sample) diversity, we stochastically colonized 300 simulated communities of 150 species each from a regional pool of 180 species with a chemistry of 90 metabolites. We supplied each community with a constant flux of the same resource type. As discussed above, each of the 300 simulated communities was also assigned a random menv to mimic the effects of environmental harshness on growth rates. The results from this simulation are shown in Fig. 2 and in the bottom left panel of Fig. 3. The same simulation correctly captures both the richness/harshness correlation and the nestedness of the EMP data, suggesting that these large scale patterns may have a simple origin.

Given the way we have modeled the harshness variations, the link with diversity is not very surprising, because a sufficiently high maintenance cost can make it impossible for a species to survive on a given resource supply, regardless of the surrounding community structure. This pattern is thus guaranteed to occur in any simulation sharing this basic structure. The shape of the richness/harshness relationship does depend on modeling choices, however. We found that diversity loss happens more quickly when the incoming energy is divided among all possible resource types before being supplied to the system, as shown in Fig. 2. In this case it can happen that no single species is able to harvest a sufficient number of distinct nutrient sources to meet its maintenance cost, and the whole community goes extinct. In the original simulations, by contrast, the surviving species at high harshness levels satisfy most of their energy requirements by directly consuming the externally supplied resource, with the metabolic byproducts supplying sufficient niche differentiation for multiple species to coexist.

To explore the ecological origins of the nested pattern in more depth, we ran additional simulations in a different regime of community assembly. Instead of modulating diversity with varying levels of selection pressure, we tried varying the degree of dispersal limitation. In the new scenario, each community faced identical environmental conditions, but the initial number of species from the regional species pool allowed to colonize the community was randomly chosen, from 1 to the maximum possible value of 180. In these new simulations, shown in the bottom right panel of Fig. 3, the nestedness vanishes. The reason for this is that in many environments, only a few species colonize the community resulting in many metabolic niches being unoccupied. We also ran simulations where both menv and the initial number of species varied from site to site, and obtained an intermediate degree of nestedness, as shown in Supplementary Fig. S1. Collectively, these simulations suggest that nestedness in cross-sectional data may be a sign of selection-dominated community assembly.
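The two colonization regimes differ only in how many species are drawn from the regional pool. The sketch below, which uses only the standard library, makes that difference explicit; the pool size, the 150 colonists of the selection-dominated regime and the 300 communities follow the text, while the function name and seed are hypothetical.

```python
import random


def colonize(pool_size, n_colonists, rng):
    """Indices of the species seeded into one community (no repeats)."""
    return sorted(rng.sample(range(pool_size), n_colonists))


rng = random.Random(1)
POOL = 180

# Selection-dominated regime: nearly the whole pool arrives everywhere,
# and diversity is then set by environmental harshness.
selection_dominated = [colonize(POOL, 150, rng) for _ in range(300)]

# Dispersal-dominated regime: identical environments, but the number of
# colonists is itself random, anywhere from 1 to the full pool.
dispersal_dominated = [colonize(POOL, rng.randint(1, POOL), rng)
                       for _ in range(300)]
```

In the second regime a poorly colonized community is a random subset of the pool rather than the subset of the hardiest species, which is why it need not be nested inside richer communities.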

We emphasize that these simulations themselves do not confirm the hypothesis that energy gradients are the driver of the observed patterns. Temperature and pH affect microbes in many other ways that are not included in our minimal model. But our simulations do show that accounting for increased energetic costs associated with harsh environments can reproduce the large-scale patterns observed in the EMP even in the absence of any metabolic or taxonomic structure. Additionally, one ecological factor that seems crucial for reproducing these patterns is dispersal. The nestedness seen in the EMP requires that ecological dynamics are dominated by selection rather than stochastic colonization due to dispersal limitations.

Metabolic structure and species abundance curves

In order to reproduce more complex ecological patterns, such as those observed in the HMP, we incorporated additional metabolic and taxonomic structure into our model8,10,23, as illustrated in Fig. 4. The basic idea is to recognize the fact that metabolites often belong to distinct groups with different metabolic properties (e.g. lipids, sugars, amino acids, etc.). In most of our simulations, we introduce T = 6 groups labeled A through F representing these metabolic classes, with F a special “waste” class that mimics commonly produced metabolic byproducts (i.e. carboxylic acids for fermentative and respiro-fermentative bacteria). To incorporate this structure in our metabolic matrix we introduce a three-tiered secretion model where a fraction fs of the byproduct flux from metabolism of a given resource is partitioned among resources of the same class, a fraction fw of the flux is secreted as “waste” resources (class F), and the rest of the flux is nonspecifically partitioned among all the other classes.
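To make the three tiers concrete, the sketch below builds the expected (noise-free) metabolic matrix implied by this description. It is an illustration under stated assumptions rather than the procedure used in the paper: the actual matrices are sampled stochastically (see Methods), the even split within each tier is a simplification, and the treatment of byproducts of waste-class resources (a fraction fs + fw stays in the waste class) is an assumption. The function name is hypothetical, and the default fs = fw = 0.45 follows the value quoted in the caption of Fig. 4.

```python
def tiered_metabolic_matrix(M, T, fs=0.45, fw=0.45):
    """Expected D[a][b]: share of byproducts of resource b secreted as a.

    Resources are split into T equal classes; the last class is "waste".
    Requires M divisible by T and T >= 3.
    """
    size = M // T
    waste = T - 1
    cls = [a // size for a in range(M)]
    rest = 1.0 - fs - fw
    D = [[0.0] * M for _ in range(M)]
    for b in range(M):
        for a in range(M):
            if cls[b] == waste:
                # Assumption: waste byproducts mostly stay in the waste class
                if cls[a] == waste:
                    D[a][b] = (fs + fw) / size
                else:
                    D[a][b] = rest / (M - size)
            elif cls[a] == cls[b]:
                D[a][b] = fs / size                # tier 1: same class
            elif cls[a] == waste:
                D[a][b] = fw / size                # tier 2: waste class
            else:
                D[a][b] = rest / (M - 2 * size)    # tier 3: everything else
    return D


D = tiered_metabolic_matrix(M=12, T=6)
```

Each column of the resulting matrix sums to one, so all leaked flux is accounted for, and the block structure resembles the banded pattern expected for a class-structured metabolism.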

Figure 4

Incorporating metabolic and taxonomic structure. (a) Three-tiered secretion model used for simulating human and marine microbiomes. M = 300 resource types are grouped into T = 6 classes of equal size, labeled A through F. These groups represent different kinds of metabolites, e.g. lipids, sugars, amino acids, etc. Group F is the “waste” class, containing common byproducts generated by many metabolic pathways, e.g., carboxylic acids. A fraction fs of the byproduct flux from metabolism of a given resource is partitioned among resources of the same class. A fraction fw of the flux is partitioned among “waste” resources (class F). The rest of the flux is nonspecifically partitioned among all the other classes. In all simulations shown here, fs = fw = 0.45. (b) Heatmap of a metabolic matrix Dαβ encoding the three-tiered secretion model. (c) Taxonomic structure used for human and marine microbiome simulations. Microbial species are grouped into “families,” with each family specializing in a different resource class. Specialist families allocate a fraction q of their consumption capacity to their favored resource class. In all the simulations shown here, q = 0.9. There is also a generalist family whose preferences are uniformly sampled across all resource types.

Different taxonomic families often have distinct resource preferences. For example, it is well known that bacteria from the taxonomic family Enterobacteriaceae, to which E. coli belongs, preferentially consume sugars. To reflect such taxonomical preferences in our model, microbial species are grouped into “families,” with each family specializing in a different resource class. Specialist families allocate a fraction q of their consumption capacity to their favored resource class. In all the simulations shown here, q = 0.9, meaning that specialist families derive 90% of their resources from their preferred resource class. In addition to these specialists, we know that certain microbial families behave as generalists with no strong metabolic preferences across resource types. To model this, we introduce a generalist family whose preferences are uniformly sampled across all resource types.
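One simple way to realize such family-level preferences is sketched below. Each species distributes a fixed number of discrete units of consumption capacity; a specialist places each unit in its favored class with probability q and elsewhere otherwise, while a generalist places units uniformly over all resources. This unit-by-unit scheme, the function name and the parameter n_units are assumptions made for illustration; the sampling scheme actually used is described in the Methods.

```python
import random


def sample_species_preferences(family, M, T, n_units, q, rng):
    """Capacity units per resource for one species.

    family: index 0..T-1 of the favored class, or None for a generalist.
    """
    size = M // T
    prefs = [0] * M
    for _ in range(n_units):
        if family is None:
            a = rng.randrange(M)                     # generalist: uniform
        elif rng.random() < q:
            a = family * size + rng.randrange(size)  # favored class
        else:
            a = rng.randrange(M - size)              # any other class
            if a >= family * size:
                a += size                            # skip the favored block
        prefs[a] += 1
    return prefs


rng = random.Random(2)
specialist = sample_species_preferences(1, M=60, T=6, n_units=10000,
                                        q=0.9, rng=rng)
generalist = sample_species_preferences(None, M=60, T=6, n_units=10000,
                                        q=0.9, rng=rng)
favored_share = sum(specialist[10:20]) / sum(specialist)
```

With many units the favored share converges to q, while with few units individual species within a family differ noticeably, which gives within-family variation at no extra cost in parameters.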

One commonly employed analysis tool for understanding community structure is the species abundance curve. A species abundance curve is made by plotting the number of species present in a sample on the y-axis and the number of individuals or population size on the x-axis. In many ecosystems, it is known that species abundance curves are well fit by a Fisher log series25,26. Unlike Gaussian distributions and their variants, such as truncated Gaussians, the Fisher log series has a long tail, reflecting the preponderance of rare species in these ecosystems. As shown in Fig. 5, the Fisher log series also gives a good fit to species abundance data on ocean microbial communities from the Tara Oceans dataset20. Simulation data generated using the MiCRM with taxonomic structure also result in long-tailed species abundance distributions that are well fit by a Fisher log series. However, these tails disappear in simulations of the MiCRM lacking metabolic and taxonomic structure. In this case, the species abundance curves were better described using a truncated Gaussian, consistent with theoretical predictions22,27. These simulations show that the long-tailed species abundance curves seen in most ecosystems are compatible with an equilibrium niche model, provided a sufficient level of taxonomic structure, and do not necessarily require neutrality28 or chaotic dynamics29.
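For readers unfamiliar with the Fisher log series, the sketch below shows its standard form: the expected number of species represented by n individuals is αxⁿ/n, where x = N/(N + α) and the diversity parameter α solves S = α ln(1 + N/α) for a sample with S species and N individuals. The bisection solver is a generic choice for illustration and is not necessarily the fitting procedure used for Fig. 5 (see Methods); the function names and example numbers are hypothetical.

```python
import math


def fit_fisher_alpha(S, N, iterations=200):
    """Solve S = alpha * ln(1 + N / alpha) by bisection (needs 0 < S < N)."""
    def excess(alpha):
        return alpha * math.log(1.0 + N / alpha) - S

    lo, hi = 1e-9, 1.0
    while excess(hi) < 0.0:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fisher_expected_counts(alpha, N, n_max):
    """Expected number of species with n = 1..n_max individuals."""
    x = N / (N + alpha)
    return [alpha * x ** n / n for n in range(1, n_max + 1)]


alpha = fit_fisher_alpha(S=100, N=1000)
counts = fisher_expected_counts(alpha, N=1000, n_max=2000)
```

The counts decay roughly as 1/n for small n, so singletons and other rare species dominate, which is the long tail referred to in the text.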

Figure 5

Metabolic and taxonomic structure give rise to Fisher log series. Left: Tag sequence count distribution for a representative sea surface sample from the Tara Oceans Project. Data was subsampled 300 times at a depth of 10,000 reads (out of 129,135 in the original sample), and species with 5 reads or fewer in the raw data were treated as extinct for the purpose of computing the Fisher log series parameters (see Methods). Right: Abundance distributions for simulated communities. 1,000 individuals were sampled from each of 900 simulated communities, with environments and colonization as described for the “Simple Environments” panel of Fig. 6 below. Each point is an average over all 900 communities of the number of species with a given number of individuals. All simulations were performed with the metabolic structure described in Fig. 4 above. The left-hand panel also incorporated taxonomic structure, with different families specializing in different resource classes, with specialization level q = 0.9. The right-hand panel did not have taxonomic structure (q = 0), and consumption preferences for all species were sampled from the same Bernoulli distribution. Green curve (“Truncated Gaussian”) comes from assuming that species’ invasion fitnesses are sampled from a Gaussian distribution, and that population sizes for surviving species are proportional to the invasion fitness, while species with negative invasion fitness go extinct. See Methods for details.

Patterns in the HMP can be explained by environmental filtering and competition

The Human Microbiome Project is a large-scale survey of the microbial communities that reside in and on the human body2. The HMP was supplemented by the smaller MetaHIT project, which focused on sequencing fecal metagenomes from multiple individuals. Initial analysis of the HMP and MetaHIT results on the human microbiome revealed three major patterns, displayed in the top half of Figs. 6, 7, and 8. First, for a given body site, different individuals had very different community compositions (see Fig. 6). Even at the phylum level, the relative abundances of dominant taxa varied dramatically from sample to sample2. But samples from different body sites still typically differed more than samples from the same body site, leading to the second pattern, shown in Fig. 7, of clustering of microbial communities by body site across individuals2,3. Finally, the gradients in the relative abundances of the dominant taxa in a given body site across individuals were also visible in dimensional reductions of more fine-grained (genus-level) community composition, producing the third pattern shown in Fig. 8.

Figure 6

Low-dimensional nutrient supply variation reproduces patterns in human microbiome survey data. Top: Each column represents one sample from the Human Microbiome Project (HMP). Colored segments represent relative abundances of different phyla in each community. Reproduced from Figure 2 of the initial open-access report on the results of the HMP2. Bottom: Each column represents one of 900 simulated samples, each stochastically colonized with 2,500 species from a regional pool of 5,000 species, comprising seven metabolically distinct families. Colored segments represent relative abundances of the seven families defined in Fig. 4. Each of the three “body sites” was supplied with resources from a different pair of resource classes, with total nutrient supply fixed. In the first set of simulations (left), one resource from each class was supplied, and the ratio of the two supply rates was randomly varied from sample to sample. In the second set (right), all resources from each class were supplied, with randomly chosen supply rates for each sample, normalized to keep the total supply fixed. The brown family present in all three environments specializes in the typical byproducts (e.g., carboxylic acids) generated from all the other resource classes. Within each body site, samples are sorted by relative abundance of this family. See main text and Methods for simulation details.

Figure 7

Correlations between inter-site nutrient variation and metabolic structure affect distinguishability of body sites. Left: Principal coordinate analysis (PCoA) of MetaHIT OTU-level community compositions, using the Jensen-Shannon distance metric. Data points are colored by the body site from which the sample was taken. Reproduced with permission from Figure 1 of Ref. 30. Right: Jensen-Shannon PCoA of species-level compositions of the simulated communities. In the first set of simulations (left), the nutrients supplied to different body sites come from different resource classes. This is the same set of simulations used for the left-hand panel of Fig. 6, but similar results are obtained if the simulations of the other panel are used instead, or if consumption preferences are uniformly random with no taxonomic structure (See Supplementary Fig. S2). In the second set of simulations (right), each environment is supplied with a randomly chosen set of resource types, with each site being supplied with about one third of the 300 possible resources. See main text and Methods for simulation details.

Figure 8

Pattern in ordination of compositions from a single body site admits multiple explanations. Left: Jensen-Shannon PCoA of MetaHIT stool samples, showing a characteristic ‘U’ shape that has been observed in many independent studies. Colors indicate three hypothesized enterotypes, which we do not discuss here. Reproduced with permission from Figure 3 of Ref. 30. Right: Jensen-Shannon PCoA of simulated samples from Body Site 1 under two different levels of dispersal limitation. In the first (top), each community was initialized with 2,500 randomly chosen species out of the regional pool of 5,000. The communities display a continuous gradient in the population size of the most abundant species (over all samples) along the ‘U’ shape from one end to the other. In the second (bottom), each community started with 4,900 species. These communities display a continuous gradient of environmental conditions along the ‘U’ shape from one end to the other.

One factor associated with the compositional gradient is the host’s typical diet30,31. Different kinds of externally supplied nutrients, such as fibers and proteins, are thought to encourage growth of different microbial taxa. For this reason, we hypothesized that the patterns in the HMP may arise from heterogeneity in the resources available in different environments. It is clear that reproducing such patterns requires assuming some minimal level of taxonomic and metabolic structure. For this reason, in our simulations we divided resources into six resource classes and species into six families, with each family specializing in one resource class, as illustrated in Fig. 4 and described above.

We first constructed metabolically structured simple environments where there were only two externally supplied resources. In particular, each of the three “body sites” was supplied with a unique pair of resources from distinct resource classes (i.e. body site 1 was supplied with a resource from class A and a resource from class B, body site 2 with a resource from class C and a resource from class D, and body site 3 with a resource from class E and a resource from class F). We modeled variability in the availability of resources across individuals at a fixed body site by changing the ratio of the two supplied resources while holding the total supplied energy fixed (see Methods). We also created metabolically structured complex environments where each body site was supplied with 50 external resources from each of the two resource classes while holding the total supplied energy fixed (i.e. body site 1 was supplied with all 50 resources from class A and all 50 resources from class B, body site 2 with all 50 resources from class C and all 50 resources from class D, and body site 3 with all 50 resources from class E and all 50 resources from class F).
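The two kinds of environment can be expressed as supply vectors with a fixed total. The sketch below is a minimal illustration using only the standard library; the uniform distribution of the ratio, the uniform random weights in the complex case, the total supply of 1000 and the function names are all assumptions, since the exact sampling rules are given in the Methods. Resources are indexed so that class A occupies indices 0 to 49 and class B indices 50 to 99.

```python
import random


def simple_supply(M, idx_a, idx_b, total, rng):
    """Two supplied resources with a random ratio and a fixed total."""
    frac = rng.random()
    kappa = [0.0] * M
    kappa[idx_a] = frac * total
    kappa[idx_b] = (1.0 - frac) * total
    return kappa


def complex_supply(M, indices, total, rng):
    """Many supplied resources with random rates, rescaled to a fixed total."""
    weights = [rng.random() for _ in indices]
    norm = sum(weights)
    kappa = [0.0] * M
    for i, w in zip(indices, weights):
        kappa[i] = total * w / norm
    return kappa


rng = random.Random(3)
M, TOTAL = 300, 1000.0  # TOTAL is a placeholder value
# Body site 1: one resource from class A and one from class B
site1_simple = [simple_supply(M, 0, 50, TOTAL, rng) for _ in range(300)]
# Body site 1: all 50 resources from class A and all 50 from class B
site1_complex = [complex_supply(M, range(0, 100), TOTAL, rng)
                 for _ in range(300)]
```

In the simple case a single number, the ratio, distinguishes one sample from another, so environmental variation is effectively one-dimensional; in the complex case the variation is spread over 100 supply rates.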

To mimic the scale of the actual microbiome data, we generated a regional pool of 5,000 species (approximately the number of OTUs identified in the HMP2), and stochastically colonized 300 samples per body site with 2,500 species each. Figure 6 shows the resulting patterns for simple and complex environments. For simple environments, our simulations mimic the broad range of compositions found in the data, including gradients in the dominant families present at each of the body sites. In contrast, for complex environments we see that the relative abundance of different families stays almost constant across individuals for each body site. This suggests that the patterns found in Fig. 6 may reflect the combined effects of environmental filtering and competition between species in the presence of a few dominant externally supplied resources.

We used the data from simulations on metabolically structured simple environments to perform a PCoA across body sites as in the MetaHIT data. As can be seen in Fig. 7, these simulations recapitulated the pattern seen in real microbial communities. We found that this clustering by body site depended strongly on the fact that different body sites had metabolically distinct resources. For example, if we instead considered metabolically unstructured complex environments where each body site was supplied with 100 randomly chosen distinct resources regardless of resource class, the clusters were no longer fully separable on a two-dimensional PCoA (rightmost graph in Fig. 7). This suggests that the clustering of human microbiomes according to body site likely reflects the fact that these body sites have metabolically distinct environments that result in different patterns of byproduct secretion.

We also investigated the ability of our model to reproduce the U-shaped curves observed in PCoA of communities at a single body site (see Fig. 8). We found that we could reproduce this pattern using the same simulations used in Fig. 7 to understand metabolically structured simple environments. With the level of dispersal limitation used in these simulations, the U shape primarily results from the stochastic presence or absence of the most abundant species. If we reduce dispersal limitation, however, by initializing each community with nearly all of the possible species, the U shape directly reflects the low-dimensional variability of resource supply in the simple environments. It remains unclear which, if any, of these two explanations corresponds to the pattern in the gut microbiome, whose significance is a matter of ongoing controversy30,31,32.

Dissimilarity-overlap patterns reflect shared environments not universal dynamics

Another pattern obtained from a more recent analysis of the HMP data is an anti-correlation between overlap and dissimilarity of pairs of communities from a given body site (see Fig. 9 and Ref. 33 for details). Due to both stochastic colonization and variable environments, there are usually many species in one sample that are not present in the other. Different pairs of samples overlap to different degrees, and this overlap can be measured in terms of the ratio of the combined population of the shared species to the total population of the two samples. If one focuses on the subset of species that are shared, one can also compare the relative abundance distributions of the two samples within this shared pool, as illustrated in Fig. 9, using standard measures of dissimilarity such as the Jensen-Shannon divergence (see Ref. 33 for detailed discussion). These two quantities are not intrinsically related, as can be seen by evaluating them over a randomly generated table of abundances33.

Figure 9

Host-specific dynamics are compatible with dissimilarity-overlap correlation. Top left: The composition of pairs of samples can be compared in two independent ways: “overlap” measures the fraction of each sample comprised by species common to both, and “dissimilarity” measures how different the relative abundance profiles are within this shared pool. The four pairs shown here have increasing overlap and decreasing dissimilarity from left to right, corresponding to the four points indicated in the scatter plot. Dissimilarity and overlap are plotted for 17,955 pairs of stool samples from the HMP, analyzed at the genus level. Solid line is a Lowess smoothing of the data, and red points correspond to the sample pairs illustrated in the first panel. Reproduced with permission from33. Top right: Dissimilarity and overlap for 10,000 pairs of simulated samples from the metabolically distinct simple environments of Fig. 6, with one resource supplied from class A and one from class B. Solid line is a Lowess smoothing of the data. Blue and green points correspond to two representative pairs of communities selected for further analysis in the bottom panel. Bottom: For each sample, the population dynamics near the steady state was approximated with a generalized Lotka-Volterra model. Effective carrying capacities and interaction coefficients were computed from the mechanistic model parameters together with the population sizes and resource abundances, as described in the Methods. We have plotted the carrying capacity of each species for two representative pairs of communities with low (left) and high (center) overlap. These pairs are indicated in the scatter plots by blue and green points, respectively. We also show the normalized root-mean-square variability in carrying capacity for all 10,000 sample pairs (right).

This analysis was initially proposed as a way of distinguishing between “universal” and “host-specific” microbial dynamics. It was argued that if the dynamics of human-associated microbial communities were universal, different individuals could be modeled with the same dynamic parameters and this would be reflected by a negative correlation between dissimilarity and overlap across cross-sectional samples. In contrast, for host-specific dynamics each individual would have their own kinetic parameters and the dissimilarity and overlap would be uncorrelated. This interpretation has been disputed, however, with numerical simulations of Lotka-Volterra type models showing that a negative correlation can result from host-specific dynamics in the presence of stochasticity, sampling errors, or environmental gradients34.

We re-ran the analysis of33 on our simulated HMP data discussed above and found that the dissimilarity and overlap were negatively correlated at a single body site just as in the real gut microbiome data (see Fig. 9, top right). However, this correlation was absent if we analyzed pairs of samples from distinct body sites, indicating that this signature likely arises because all the communities at a given body site exist in a similar external environment (see Supplementary Fig. S3). Importantly, our simulations show that the negative dissimilarity-overlap correlation observed33 can be found even in the absence of universal dynamics since environments with different amounts of externally supplied resources generically give rise to communities with different ecological dynamics. Instead, our simulations suggest that the negative correlation between overlap and dissimilarity found in the HMP may reflect the fact that communities at a given body site experience similar but not identical environments.

As a further check, we approximated the population dynamics near the steady state using a generalized Lotka-Volterra model (see Methods). This allowed us to explicitly calculate the effective carrying capacities and interaction coefficients for each community. The bottom row of Fig. 9 shows the carrying capacity for two pairs of communities: one pair where the two communities in the pair have a high overlap and another where the communities in the pair have a low overlap. The carrying capacities of species in the high-overlap communities are extremely similar whereas the carrying capacities of the low overlap communities are very different from each other. This provides strong evidence for the important role played by environmental filtering in producing the dissimilarity-overlap pattern observed in the HMP data.

“Modular assembly” of microbial communities

Our analysis of our synthetic HMP data also shows a new pattern: the family-level composition of each community along the nutrient gradient is approximately a linear combination of the compositions of the two extreme communities on the gradient (Fig. 6d). Quantitatively, if \({N}_{i}^{1}\) are the population sizes for each family i in the community supplied with a flux κ1 of resource type 1 alone, and \({N}_{i}^{2}\) are the population sizes in the community supplied with flux κ2 of resource type 2 alone, then the community supplied with flux (1 − α)κ1 of resource type 1 and flux ακ2 of resource type 2 has population sizes approximately equal to \((1-\alpha ){N}_{i}^{1}+\alpha {N}_{i}^{2}\). This “additivity” of communities from different environments can also be seen at the species level, by plotting the actual population sizes \({N}_{i}^{{\rm{mix}}}\) versus the weighted average prediction, as shown in Fig. 10.
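The additivity test described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the toy abundances are invented, and the R2 convention shown (the prediction is scored directly, without refitting) is an assumption, since the exact scoring procedure is not specified here.

```python
import numpy as np

def additive_prediction(n1, n2, alpha):
    """Weighted average (1 - alpha) * N^1 + alpha * N^2 of two single-resource communities."""
    n1 = np.asarray(n1, dtype=float)
    n2 = np.asarray(n2, dtype=float)
    return (1.0 - alpha) * n1 + alpha * n2

def r2_score(observed, predicted):
    """Coefficient of determination of a fixed prediction (assumed convention, no refitting)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy abundances (invented for illustration only).
n_pure_1 = np.array([10.0, 0.0, 4.0, 1.0])
n_pure_2 = np.array([0.0, 8.0, 2.0, 1.0])
n_mix_pred = additive_prediction(n_pure_1, n_pure_2, alpha=0.5)
```

Given measured mixed-resource abundances `n_mix`, calling `r2_score(n_mix, n_mix_pred)` would give a score analogous to the one quoted in each panel of Fig. 10.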

Figure 10

Modularity of community assembly. Left: In the experiments reported in21, synthetic beads composed of different kinds of polysaccharides, including agarose, alginate and carrageenan, were incubated with coastal seawater and colonized by the marine bacteria resident in the seawater sample. 16S rRNA amplicon profiling was performed for communities grown on beads composed of a single kind of polysaccharide, as well as mixtures of two kinds of polysaccharides. Relative abundances of amplicon sequence variants for two different mixtures (Agarose/Alginate and Agarose/Carrageenan) are plotted versus a weighted average of the relative abundances on the pure beads. Solid lines are fits to a linear mixture model, with R2 of 0.84 and 0.74, respectively. Right: Abundance of each species in simulated communities supplied with mixtures of two resource types, plotted against the average of the abundances for communities supplied with just one of the resource types, with the total energy supply held constant. For the first two panels, all other parameters are the same as for the human microbiome simulations of Fig. 6, except that each sample is initialized with all 5,000 species from the regional pool. Titles indicate the class labels of the two supplied resources for each scenario, and species are colored by metabolic family following Fig. 4. In the third panel, simulations were run with the same number of resources and species, but with all resources assigned to the same resource class, eliminating all metabolic and taxonomic structure. Solid lines are predictions of the additive model where the abundance in the mixture equals the average of the abundances in the single-resource condition. The R2 score of this model is also shown in each panel.

Since this analysis is performed at the species level, meaningful taxonomic groupings are no longer necessary, so we also checked for additivity in simulations where communities lack any metabolic or taxonomic structure (i.e., with unstructured metabolic and consumer preference matrices). Figure 10 (third simulation panel) shows that in this unstructured case the population sizes in the mixed-resource communities are not well predicted by the weighted average of single-resource communities, with an R2 of 0.21. This suggests that metabolic and taxonomic structure are necessary to see this additive pattern.

This pattern is difficult to test for in field survey data, since there are many additional factors besides diet that vary from sample to sample, many of which may themselves be correlated with diet. Studying this kind of effect requires controlled experiments where the variable of interest can be systematically varied. One recent experiment colonized seawater communities on small beads composed of different kinds of carbohydrates, which served as the sole externally provided carbon source for the community21. This simplified scenario reflects the conditions of our minimal model more closely, where only one or two nutrient types were externally supplied. The authors of this study compared the weighted average of population sizes from communities grown on two different carbon sources, such as agarose and alginate, and the corresponding population sizes from other communities grown on a mixture of the two. They found the same “additivity” effect we observe in our metabolically structured simulations, with R2 values of 0.84 and 0.74 for two different carbon source pairs. They termed this property “modular assembly” of microbial communities. Our simulations show that modular assembly may be a generic property of complex microbial communities grown in the presence of multiple metabolically distinct resources.

Discussion

We have shown that the Microbial Consumer Resource Model introduced previously8 to describe laboratory experiments in synthetic minimal environments can also reproduce a wide range of experimentally observed patterns in survey data such as the HMP and EMP including harshness/richness correlations, nestedness of community composition, compositional gradients, and dissimilarity/overlap correlations. The MiCRM provides a systematic way of exploring the effect of stochastic colonization, resource competition, and metabolic crossfeeding on large-scale observables. By randomly sampling parameters from well-defined probability distributions, we combine a sufficient level of mechanistic detail to make the parameters physically meaningful, while keeping the number of parameters small enough for systematic investigation of the factors that control different patterns.

Our numerical results complement recent theoretical works suggesting that complex ecosystems may still be well described by random ecosystems35,36, suggesting the essential ecology of diverse ecosystems may be amenable to analysis using techniques from statistical mechanics and random matrix theory. For these reasons, the MiCRM is well-suited to serve as a minimal model for understanding microbial ecology.

Our analysis also suggests several hypotheses relating mechanism to large scale patterns in both the EMP and HMP. We have shown that it is possible to reproduce the richness/harshness correlation and the nestedness of the EMP data by assuming that harsh environments pose an additional energetic cost to organisms. This is true even when communities are grown in otherwise identical environments and lack any taxonomic and metabolic structure. This complements earlier work showing that energy availability is a key driver of community function and structure9. Our simulations on the HMP suggest that environmental gradients and resource availability result in significant environmental filtering and naturally explain the clustering of microbial communities by body site.

We have also identified ecological parameters that can break the observed patterns, allowing us to generate hypotheses about the underlying ecological processes: (a) breaking the nestedness pattern (Fig. 3) with dispersal limitation allowed us to connect nestedness to a selection-dominated regime in the EMP; (b) the loss of compositional gradients in complex environments (Fig. 6) led us to hypothesize that a small number of dominant resource types may drive inter-subject variability in the HMP; (c) the degradation of compositional clustering by body site (Fig. 7) in metabolically unstructured environments highlighted the importance of metabolically relevant differences between resource environments in the HMP; (d) breaking the additivity of communities grown on mixed resource supplies (Fig. 10) allowed us to connect this pattern to taxonomic and metabolic structure in the microbial species.

Great care has to be taken when interpreting large-scale patterns. For example, the negative correlation between dissimilarity and overlap observed in HMP data in Ref. 33 may be indicative of the fact that body-sites across individuals have similar environments rather than a much stronger claim of universal dynamics in the human microbiome. Our work also suggests that many large scale patterns may occur generically across different environmental settings. For example, the additivity observed in our synthetic HMP data is also observed in ocean communities grown on synthetic carbon beads21, suggesting modular assembly may be a generic property of communities grown in environments with metabolically distinct resources.

The analysis presented here shows that it is possible to qualitatively reproduce patterns seen in large-scale surveys such as the EMP and HMP using a simple minimal model. An interesting area of future research is to move beyond qualitative comparisons and ask how minimal models and large scale simulations can be quantitatively compared to large-scale genomic surveys. This problem is especially challenging given the large number of parameters, environments, and experimental designs that must be explained. One potential avenue for doing this is to use statistical methods such as Approximate Bayesian Computation (ABC)37. In ABC, the need to exactly calculate complicated likelihood functions is replaced with the calculation of summary statistics and numerical simulations. In this way, it may be possible to quantitatively relate mechanistic details at the level of microbes to community level patterns observed in large-scale surveys.

Methods

All synthetic data was generated using the Microbial Consumer Resource Model previously described9,10 and summarized below. We found the fixed points of the dynamics for each community using the Python package Community Simulator10: https://github.com/Emergent-Behaviors-in-Biology/community-simulator. Principal Coordinate Analysis was performed on the simulated HMP data using the Python package scikit-bio http://scikit-bio.org/. The pairwise distance matrix was generated using standard scipy commands with the Jensen-Shannon distance metric. Dissimilarity-overlap analysis was performed on the simulation data following the procedure described in Ref. 33. All simulation data and analysis scripts are available at https://github.com/Emergent-Behaviors-in-Biology/microbiome-patterns.
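As a rough illustration of the ordination step, the sketch below builds a Jensen-Shannon distance matrix with scipy and then performs a classical PCoA by eigendecomposition of the double-centered squared distances. The hand-written `pcoa` helper is a stand-in written for this example (it is not the scikit-bio routine), and the abundance table is randomly generated rather than taken from the simulations.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pcoa(distance_matrix, n_axes=2):
    """Classical PCoA: eigendecomposition of the double-centered squared distance matrix."""
    d = np.asarray(distance_matrix, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]      # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Clip tiny negative eigenvalues caused by round-off before taking the root.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Placeholder data: 6 samples x 5 species, rows normalized to relative abundances.
rng = np.random.default_rng(0)
abundances = rng.random((6, 5))
abundances /= abundances.sum(axis=1, keepdims=True)

js_distances = squareform(pdist(abundances, metric="jensenshannon"))
coords = pcoa(js_distances)   # one row of principal coordinates per sample
```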

MiCRM dynamical equations

We consider the dynamics of the population densities Ni of S microbial species and the abundances Rα of M resource types in a well-mixed system, governed by the following set of ordinary differential equations:

$$\frac{d{N}_{i}}{dt}={g}_{i}{N}_{i}\left[{\sum }_{\alpha }{w}_{\alpha }(1-{l}_{\alpha }){c}_{i\alpha }{R}_{\alpha }-{m}_{i}\right]$$
(1)
$$\begin{array}{lll}\frac{d{R}_{\alpha }}{dt}= & {\kappa }_{\alpha }-{\tau }_{R}^{-1}{R}_{\alpha }-{\sum }_{i}{c}_{i\alpha }{N}_{i}{R}_{\alpha } & +\ {\sum }_{i,\beta }{D}_{\alpha \beta }\frac{{w}_{\beta }}{{w}_{\alpha }}{l}_{\beta }{c}_{i\beta }{N}_{i}{R}_{\beta }.\end{array}$$
(2)

For this study, the conversion factors gi from energy uptake to population growth were all set to 1, as were the resource qualities wα and the resource dilution rate \({\tau }_{R}^{-1}\). The leakage fractions lα govern how much of each consumed resource is released into the environment as metabolic byproducts, and were set to 0.8 for all α. See Table 1 for a list of all parameters and units.
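For readers who prefer code to index notation, Eqs. (1) and (2) can be transcribed as follows. This is a minimal sketch, not the Community Simulator implementation: the function name `micrm_rhs` and its argument names are hypothetical, and the default values simply mirror the settings quoted above (g = w = τR = 1, l = 0.8).

```python
import numpy as np

def micrm_rhs(N, R, c, D, kappa, m, g=1.0, w=1.0, leak=0.8, tau_R=1.0):
    """Right-hand sides of Eqs. (1) and (2).

    N: (S,) populations, R: (M,) resources, c: (S, M) consumer preferences,
    D: (M, M) metabolic matrix with D[alpha, beta], kappa: (M,) supply, m: (S,) costs.
    """
    N = np.asarray(N, dtype=float)
    R = np.asarray(R, dtype=float)
    w = np.broadcast_to(np.asarray(w, dtype=float), R.shape)
    leak = np.broadcast_to(np.asarray(leak, dtype=float), R.shape)

    uptake = c * R                                   # c_{i alpha} R_alpha, shape (S, M)
    dN = g * N * (uptake @ (w * (1.0 - leak)) - m)   # Eq. (1)

    consumed = N @ uptake                            # sum_i c_{i alpha} N_i R_alpha
    secreted = (D @ (w * leak * consumed)) / w       # byproducts routed through D
    dR = kappa - R / tau_R - consumed + secreted     # Eq. (2)
    return dN, dR

# Toy check with no consumers: resources simply relax toward supply minus dilution.
c_toy = np.ones((2, 3))
D_toy = np.full((3, 3), 1.0 / 3.0)
kappa_toy = np.array([5.0, 0.0, 0.0])
m_toy = np.ones(2)
dN0, dR0 = micrm_rhs(np.zeros(2), np.array([1.0, 2.0, 3.0]), c_toy, D_toy, kappa_toy, m_toy)
```

A function of this form could be handed to a generic ODE integrator, although the results reported here were obtained with the dedicated equilibrium solver rather than by time integration.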

Table 1 Definitions and units for dynamical variables and mechanistic parameters.

Random sampling of consumer preference matrix and metabolic matrix

As noted in the Introduction, modeling highly diverse communities such as microbiomes requires a large number of free parameters. For example, the simulations with 5,000 species performed here required choosing over a million parameter values. In order to explore the typical phenomena produced by our model, we sampled the parameters randomly. Under the sampling scheme described in this section, the model is fully defined by a choice of just twelve parameters, listed in Table 2.

Table 2 Definitions of global parameters used for constructing random ecosystems.

We choose consumer preferences ciα as follows. We assume that each specialist family has a preference for one resource class A (where A = 1…F) with 0 ≤ F ≤ T, and we denote the consumer coefficients for this family by \({c}_{i\alpha }^{A}\). We also consider generalists that have no preferences, with consumer coefficients \({c}_{i\alpha }^{{\rm{gen}}}\). The \({c}_{i\alpha }^{A}\) can be drawn from one of three probability distributions: (i) a Normal/Gaussian distribution, (ii) a Gamma distribution (which ensures positivity of the coefficients), and (iii) a Bernoulli distribution with binary preference levels.

The key parameters for constructing all three distributions are the mean μc and the variance \({\sigma }_{c}^{2}\) of the sum ∑αciα over a row of the matrix.

In the current study, we focus on binary preference levels (option iii). In this model, there are two possible values for each ciα: a low level \(\frac{{c}_{0}}{M}\) and a high level \(\frac{{c}_{0}}{M}+{c}_{1}\). For a given choice of μc, the parameters c0 and c1 together determine the variance \({\sigma }_{c}^{2}\). The elements of \({c}_{i\alpha }^{A}\) are given by

$${c}_{i\alpha }^{A}=\frac{{c}_{0}}{M}+{c}_{1}{X}_{i\alpha },$$
(3)

where Xiα is a binary random variable that equals 1 with probability

$${p}_{i\alpha }^{A}=\left\{\begin{array}{ll}\frac{{\mu }_{c}}{M{c}_{1}}\left[1+\frac{M-{M}_{A}}{{M}_{A}}q\right], & \,{\rm{if}}\,\ \alpha \in A\\ \frac{{\mu }_{c}}{M{c}_{1}}(1-q), & \,{\rm{otherwise}}\,\end{array}\right.$$
(4)

for the specialist families, and

$${p}_{i\alpha }^{{\rm{gen}}}=\frac{{\mu }_{c}}{M{c}_{1}}$$
(5)

for the generalists.
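The binary sampling scheme of Eqs. (3)–(5) can be sketched as below. The helper names and the integer class labels are illustrative assumptions; the parameter values in the example (six classes of 50 resources, μc = 10, c0 = 0, c1 = 1, q = 0.9) match those quoted for the human microbiome simulations.

```python
import numpy as np

def preference_probabilities(resource_class, family_class, mu_c, c1, q):
    """Probability that X_{i alpha} = 1, Eq. (4) for specialists or Eq. (5) for generalists.

    resource_class: integer class label of each resource (length M).
    family_class: class preferred by the family, or None for a generalist.
    """
    resource_class = np.asarray(resource_class)
    M = resource_class.size
    base = mu_c / (M * c1)
    if family_class is None:
        return np.full(M, base)
    in_class = resource_class == family_class
    M_A = in_class.sum()
    return np.where(in_class, base * (1.0 + (M - M_A) / M_A * q), base * (1.0 - q))

def sample_consumer_matrix(resource_class, family_of_species,
                           mu_c=10.0, c0=0.0, c1=1.0, q=0.9, rng=None):
    """Draw c_{i alpha} = c0 / M + c1 * X_{i alpha}, Eq. (3), one row per species."""
    rng = np.random.default_rng() if rng is None else rng
    M = len(resource_class)
    rows = []
    for family in family_of_species:
        p = preference_probabilities(resource_class, family, mu_c, c1, q)
        X = rng.random(M) < p
        rows.append(c0 / M + c1 * X)
    return np.array(rows)

resource_class = np.repeat(np.arange(6), 50)          # 6 classes x 50 resources
p_specialist = preference_probabilities(resource_class, 0, mu_c=10.0, c1=1.0, q=0.9)
c = sample_consumer_matrix(resource_class, [0, 1, None], rng=np.random.default_rng(1))
```

With c0 = 0, the probabilities are constructed so that the expected row sum c1 ∑α piα equals μc for specialists and generalists alike, which is easy to confirm from `p_specialist`.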

We choose the metabolic matrix Dαβ according to the three-tiered secretion model depicted in Fig. 4. The first tier is a preferred class of ‘waste’ products, such as carboxylic acids for fermentative and respiro-fermentative bacteria, with Mw members. The second tier contains byproducts of the same class as the input resource (when the input resource is not in the preferred byproduct class). For example, this could be attributed to the partial oxidation of sugars into sugar alcohols, or the antiporter behavior of various amino acid transporters. The third tier includes everything else. We encode this structure in Dαβ by sampling each column β of the matrix from a Dirichlet distribution with concentration parameters dαβ that depend on the byproduct tier, so that on average a fraction fw of the secreted flux goes to the first tier, while a fraction fs goes to the second tier, and the rest goes to the third. The Dirichlet distribution has the property that each sampled vector sums to 1, making it a natural way of randomly allocating a fixed total quantity (such as the total secretion flux from a given input). To write the expressions for these parameters explicitly, we let A(α) represent the class containing resource α, and let w represent the ‘waste’ class. We also introduce a parameter s that controls the sparsity of the reaction network, ranging from a dense network with all-to-all connection when s → 0, to maximal sparsity with each input resource having just one randomly chosen output resource as s → 1. With this notation, we have

$${D}_{\alpha \beta }={\rm{Dir}}{({d}_{1\beta },{d}_{2\beta },{d}_{3\beta },\ldots ,{d}_{M\beta })}_{\alpha }$$
(6)
$${d}_{\alpha \beta }=\left\{\begin{array}{ll}\frac{{f}_{w}}{s{M}_{w}}, & \,{\rm{if}}\,\ A(\beta )\ne w\,{\rm{and}}\,A(\alpha )=w\\ \frac{{f}_{s}}{s{M}_{A(\beta )}}, & \,{\rm{if}}\,\ A(\beta )\ne w\,{\rm{and}}\,A(\alpha )=A(\beta )\\ \frac{1-{f}_{s}-{f}_{w}}{s(M-{M}_{A(\beta )}-{M}_{w})}, & \,{\rm{if}}\,\ A(\beta ),A(\alpha )\ne w\,{\rm{and}}\,A(\alpha )\ne A(\beta )\\ \frac{{f}_{w}+{f}_{s}}{s{M}_{w}}, & \,{\rm{if}}\,\ A(\beta )=w\,{\rm{and}}\,A(\alpha )=w\\ \frac{1-{f}_{w}-{f}_{s}}{s(M-{M}_{w})}, & \,{\rm{if}}\,\ A(\beta )=w\,{\rm{and}}\,A(\alpha )\ne w.\end{array}\right.$$
(7)

The final two lines handle the case when the ‘waste’ type is being consumed. In this case, the fraction allocated to the waste type is the sum of the fractions allocated to ‘same’ and ‘waste’.
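A possible transcription of Eqs. (6) and (7) is given below. It is a sketch under stated assumptions: the function names are hypothetical, the small example uses three classes of four resources with the last class arbitrarily designated as ‘waste’, and the code assumes at least one class besides the input class and the waste class (otherwise the third-tier denominator vanishes).

```python
import numpy as np

def dirichlet_concentrations(resource_class, waste_class, f_w, f_s, s):
    """Concentration parameters d_{alpha beta} of Eq. (7); column beta is the consumed resource."""
    resource_class = np.asarray(resource_class)
    M = resource_class.size
    is_waste = resource_class == waste_class
    M_w = is_waste.sum()
    d = np.empty((M, M))
    for beta in range(M):
        same = resource_class == resource_class[beta]
        if not is_waste[beta]:
            M_A = same.sum()
            d[:, beta] = (1.0 - f_s - f_w) / (s * (M - M_A - M_w))  # third tier
            d[same, beta] = f_s / (s * M_A)                          # same class as input
            d[is_waste, beta] = f_w / (s * M_w)                      # waste class
        else:
            d[:, beta] = (1.0 - f_w - f_s) / (s * (M - M_w))         # non-waste outputs
            d[is_waste, beta] = (f_w + f_s) / (s * M_w)              # waste to waste
    return d

def sample_metabolic_matrix(d, rng=None):
    """Sample each column of D from a Dirichlet distribution, Eq. (6)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.column_stack([rng.dirichlet(d[:, beta]) for beta in range(d.shape[1])])

resource_class = np.repeat([0, 1, 2], 4)     # illustrative: 3 classes x 4 resources
d = dirichlet_concentrations(resource_class, waste_class=2, f_w=0.45, f_s=0.45, s=0.3)
D = sample_metabolic_matrix(d, rng=np.random.default_rng(2))
```

Each column of `d` sums to 1/s, so the mean of the sampled column allocates fractions fw, fs and 1 − fw − fs to the three tiers, while smaller s yields larger concentration parameters and therefore denser, more even columns.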

Solving for uninvadable equilibrium

We computed the uninvadable equilibrium state of Equations (1) and (2) using a novel algorithm inspired by expectation-maximization methods in machine learning. The algorithm is described in detail in Ref. 10, and implemented computationally in the Community Simulator package.

The raw results of the computation have nonzero abundances for all species, due to technical limits on numerical precision in the solver. In all simulations, the abundance distribution was clearly bimodal, with well-separated peaks on a log scale for the surviving vs. extinct species. For purposes of determining species richness, we set the abundance of all species in the “extinct” group to zero. Histograms of the raw results are plotted in the accompanying Jupyter notebook, where the choice of threshold for removing extinct species can be directly verified.

The large simulations with 300 resources and 5,000 species pushed the limits of our implementation of the algorithm, and occasionally failed to converge. Before performing further analysis, we directly verified that a true solution had been found by calculating the per-capita growth rate \(d\ {\rm{ln}}\,\ {N}_{i}/dt\) for all surviving species. A histogram of the maximum value of \(| d\ {\rm{ln}}\,\ {N}_{i}/dt| \) for each community (on a log scale) shows that most simulations are around \(1{0}^{-7}\), with the upper tail reaching to around \(1{0}^{-5}\). In the least stable scenario, with S = 4,900 and two externally supplied resources, the failed runs form a second cluster around \(| d\ {\rm{ln}}\,\ {N}_{i}/dt| =1{0}^{-3}\). To eliminate such runs, we set a threshold for all simulations, discarding samples with \(| d\ {\rm{ln}}\,\ {N}_{i}/dt| \ \ge \ 1{0}^{-5}\). For the least stable scenario, 29 of the 900 samples exceeded the threshold, and all other scenarios had between 0 and 11 such samples.
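The convergence check described above amounts to evaluating the bracketed term of Eq. (1) for the surviving species and comparing its largest magnitude to the threshold. A minimal sketch follows; the function names are hypothetical, survivors are assumed to be the species with nonzero abundance after the extinct group has been zeroed out, and the default parameters mirror the values g = w = 1 and l = 0.8 used in this study.

```python
import numpy as np

def per_capita_growth(R, c, m, g=1.0, w=1.0, leak=0.8):
    """d ln N_i / dt = g_i [ sum_alpha w_alpha (1 - l_alpha) c_{i alpha} R_alpha - m_i ]."""
    R = np.asarray(R, dtype=float)
    energy_per_unit = np.broadcast_to(np.asarray(w * (1.0 - leak), dtype=float), R.shape)
    return g * ((np.asarray(c) * R) @ energy_per_unit - m)

def is_converged(N, R, c, m, tol=1e-5, **kwargs):
    """Keep a sample only if max |d ln N_i / dt| over surviving species is below tol."""
    survivors = np.asarray(N) > 0
    if not survivors.any():
        return True
    rates = per_capita_growth(R, np.asarray(c)[survivors], np.asarray(m)[survivors], **kwargs)
    return bool(np.max(np.abs(rates)) < tol)

# Toy example: two species, each eating one resource held at R = 5.
c_toy = np.eye(2)
R_toy = np.array([5.0, 5.0])
ok = is_converged([1.0, 1.0], R_toy, c_toy, np.array([1.0, 1.0]))
bad = is_converged([1.0, 1.0], R_toy, c_toy, np.array([1.0, 0.9]))
```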

Synthetic data for global biodiversity patterns

Synthetic data for Figs. 2 and 3 was generated using a regional species pool of size Stot = 180 and M = 90 potential resources. The elements of the 180 × 90 consumer preference matrix ciα were sampled from a binary distribution as described above, with c0 = 0, c1 = 1 and μc = 10, using only one resource class (T = 1) and one consumer family (F = 1). The 90 × 90 metabolic matrix Dαβ was sampled from a Dirichlet distribution as described above, with s = 0.05. The mi were sampled from a Gaussian distribution with mean 1 and standard deviation 0.01. For each sample, a single random number drawn from a uniform distribution between −0.5 and 9.5 was added to all the mi in that sample.

The rest of the parameters differed among the three scenarios we simulated, and were chosen as follows:

  • Simple environment (same as “selection-limited” in Fig. 3): Each sample was stochastically colonized with S = 150 out of the 180 possible species, and supplied with a single external resource, with κ1 = 200 and κα = 0 for all α ≠ 1.

  • Complex environment: Each sample was stochastically colonized with S = 150 out of the 180 possible species, and supplied with all external resources, with κα = 200/M ≈ 2.2 for all α.

  • Dispersal limited: Each sample was stochastically colonized with a randomly chosen number of species, uniformly distributed between S = 1 and S = Stot = 180, and supplied with a single external resource, with κ1 = 200 and κα = 0 for all α ≠ 1.

Synthetic data for human microbiome patterns

To generate synthetic data for Figs. 6–9, we assumed a regional species pool of size Stot = 5,000, with M = 300 possible resource types. Resources were grouped into T = 6 classes of 50 resource types each, labeled A through F. Microbial species were grouped into 6 specialist “families” of 800 species, with each family specializing in one resource class as described above. The remaining 200 species were designated as generalists, with no bias towards any one resource class. The consumption parameters were set to c0 = 0, c1 = 1 and μc = 10 as for the previous set of simulations. The metabolic matrix sparsity was set to s = 0.3, to reflect the actual sparsity of the E. coli metabolic network38, and the secretions were allocated with fs = fw = 0.45. The mi were sampled from a Gaussian distribution with mean 1 and standard deviation 0.01. Each community was supplied with the same total incoming energy flux κ = ∑ακα = 1,000. For each scenario, we simulated 900 independent communities, evenly partitioned among three “body sites” with different environmental characteristics.

The rest of the parameters were varied to construct eight different scenarios. For S = 2,500 (strong dispersal limitation) and S = 4,900 (weak dispersal limitation), we made the following four combinations of species properties and environmental conditions:

  • Metabolically distinct simple environments: In each of the three simulated body sites, one resource was chosen from each of two resource classes (A + B, C + D, and E + F). The relative flux levels for these two resources were chosen for each of the 300 communities in the site by randomly sampling a number a from a uniform distribution over the interval [0, 1], and then letting (1 − a)κ be the flux of the first resource, with aκ from the second. Taxonomic structure was incorporated by setting a high strength of specialization q = 0.9.

  • Metabolically distinct complex environments: In each of the three simulated body sites, all resources were supplied from two resource classes (A + B, C + D, and E + F), with flux levels from all other classes still set to zero. The relative flux levels for 100 resource types in each site were randomly sampled for each community, by independently sampling 100 numbers \({\widetilde{\kappa }}_{\alpha }\) from a uniform distribution over the interval [0, 1], and then setting \({\kappa }_{\alpha }=\kappa \frac{{\widetilde{\kappa }}_{\alpha }}{{\sum }_{\beta }{\widetilde{\kappa }}_{\beta }}\). Taxonomic structure was incorporated by setting a high strength of specialization q = 0.9.

  • No taxonomic structure: Same as “metabolically distinct simple environments” above, except that taxonomic structure was removed by setting q = 0.

  • Metabolically overlapping complex environments: Same as “metabolically distinct complex environments” except that the 300 resources were randomly partitioned into three sets of 100, and each body site was supplied with resources from a different set.
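The two flux-sampling recipes in the list above (a two-resource gradient for simple environments, and normalized uniform draws for complex environments) can be sketched as follows. The function names and the specific resource indices are illustrative assumptions; only the sampling rules themselves come from the descriptions above.

```python
import numpy as np

def simple_environment_flux(M, resource_pair, kappa_total=1000.0, rng=None):
    """Two supplied resources with fluxes (1 - a) * kappa and a * kappa, a ~ Uniform[0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(0.0, 1.0)
    kappa = np.zeros(M)
    first, second = resource_pair
    kappa[first] = (1.0 - a) * kappa_total
    kappa[second] = a * kappa_total
    return kappa

def complex_environment_flux(M, supplied, kappa_total=1000.0, rng=None):
    """Independent Uniform[0, 1] draws on the supplied resources, rescaled to sum to kappa."""
    rng = np.random.default_rng() if rng is None else rng
    supplied = np.asarray(supplied)
    raw = rng.uniform(0.0, 1.0, size=supplied.size)
    kappa = np.zeros(M)
    kappa[supplied] = kappa_total * raw / raw.sum()
    return kappa

rng = np.random.default_rng(3)
# Illustrative indices: resource 0 stands in for class A, resource 50 for class B.
kappa_simple = simple_environment_flux(300, (0, 50), rng=rng)
# Illustrative: the first 100 resources stand in for classes A and B together.
kappa_complex = complex_environment_flux(300, np.arange(100), rng=rng)
```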

Synthetic data for marine microbiome patterns

The abundance distributions in Fig. 5 were generated directly from the “Simple environments” and “No taxonomic structure” simulations described above for the human microbiome patterns.

The tests of modular community assembly in Fig. 10 were performed using the same setup as “Simple environments” in the human microbiome simulations, but with just two “body sites” (A + B and C + D), and three values of a (0, 1 and 0.5). The unstructured metabolism control was performed by setting T = F = 1, assigning all 300 resources to the same resource class before sampling the metabolic and consumer preference matrices.

Relative abundance distributions and Fisher log series

To create Fig. 5, we first downloaded the 16S OTU table from the Tara Oceans companion website (http://ocean-microbiome.embl.de/companion.html)20. We performed 300 independent rarefactions to a constant read depth of 10,000. For each possible number of reads, from 1 through the maximum observed, we plotted the number of species assigned that number of reads (“population size”), averaged over all rarefactions.

In many ecological settings, it has been observed that the number s(n) of species with n individuals in a sample of N total individuals closely follows the Fisher log series25,26:

$$s(n)=\frac{\alpha }{n}{x}^{n}$$
(8)

where the parameters x and α are determined from N and the total number of observed species S through the following equations:

$$S=-\alpha {\rm{ln}}\,(1-x)$$
(9)
$$N=\frac{\alpha x}{1-x}.$$
(10)

In the first panel of Fig. 5 we plot Eq. (8) using N = 10,000 (the read depth we manually imposed for the rarefaction) and S equal to the number of OTUs with more than 5 reads assigned to them in the original dataset.
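Equations (9) and (10) have no closed-form solution for x and α, but they reduce to a single monotonic equation in x: eliminating α gives S/N = −(1 − x) ln(1 − x)/x, whose right-hand side decreases from 1 to 0 as x goes from 0 to 1. The sketch below solves it by bisection; the function names are hypothetical, and the value S = 500 in the example is invented for illustration rather than taken from the Tara Oceans data.

```python
import math

def fit_fisher_log_series(S, N, tol=1e-12):
    """Solve Eqs. (9)-(10) for (alpha, x) given S observed species and N individuals."""
    if not 0 < S < N:
        raise ValueError("need 0 < S < N")
    target = S / N

    def ratio(x):
        # S / N as a function of x after eliminating alpha = N (1 - x) / x.
        return -(1.0 - x) * math.log1p(-x) / x

    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:   # ratio decreases with x, so the root lies to the right
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    alpha = N * (1.0 - x) / x
    return alpha, x

def fisher_log_series(n, alpha, x):
    """Expected number of species with exactly n individuals, Eq. (8)."""
    return alpha * x ** n / n

alpha_fit, x_fit = fit_fisher_log_series(S=500, N=10_000)
singletons = fisher_log_series(1, alpha_fit, x_fit)
```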

For the simulation data in Fig. 5, we had the further advantage of having access to multiple independent trials under statistically similar conditions. Instead of averaging over multiple rarefactions generated from the same underlying dataset, we averaged over single samples of N = 1,000 individuals taken from each of the 900 parallel communities. We plotted the Fisher log series with N = 1,000 and S equal to the number of species with nonzero abundance.

We also plotted the distributions obtained when the underlying relative proportions of S species are generated by a simple null model, in which the invasion fitness (low-density growth rate) of each species is sampled from a Gaussian distribution. Species with negative values go extinct, and those with positive values end up with population sizes proportional to the invasion fitness. The resulting relative abundances of species in an infinitely large community follow a truncated Gaussian distribution. This distribution is determined by a single parameter, up to an overall scale that is irrelevant for the purposes of the current analysis. The parameter is inferred from the simulation results by matching the fraction of species initially present in the community that survive to equilibrium. In the figure, the green curves come from first sampling S underlying species abundances from this distribution, then sampling N = 1,000 individuals from the resulting population, and averaging the results over 10,000 independent iterations.

Computation of overlap and dissimilarity

Here we summarize the definitions of the dissimilarity \(D(\{{N}_{i}^{\mu }\},\{{N}_{j}^{\nu }\})\) and overlap \(O(\{{N}_{i}^{\mu }\},\{{N}_{j}^{\nu }\})\) between two sets of population size measurements, as given in Ref. 33. Here, μ = 1, 2, …, C is an index labeling the sample from which the measurement was taken (as is ν). In order to define these two quantities, we must first introduce some notation concerning the shared species. We let \({{\bf{S}}}^{\dagger }\) represent the set of species that are present in both communities. We also define two types of normalized abundances:

$${\tilde{N}}_{i}=\frac{{N}_{i}}{{\sum }_{j}{N}_{j}}$$
(11)
$${\widehat{N}}_{i}=\frac{{N}_{i}}{{\sum }_{j\in {{\bf{S}}}^{\dagger }}{N}_{j}}$$
(12)

where the second quantity is normalized only by the set of species that is shared with the other community in the pair. We also define the average composition over the shared species:

$${m}_{i}=\frac{1}{2}({\widehat{N}}_{i}^{\mu }+{\widehat{N}}_{i}^{\nu }).$$
(13)

Using these definitions, we can finally write

$$D(\{{N}_{i}^{\mu }\},\{{N}_{j}^{\nu }\})=\sqrt{\frac{1}{2}{\sum }_{i\in {{\bf{S}}}^{\dagger }}\left({\widehat{N}}_{i}^{\mu }\ {\rm{ln}}\,\frac{{\widehat{N}}_{i}^{\mu }}{{m}_{i}}+{\widehat{N}}_{i}^{\nu }\ {\rm{ln}}\,\frac{{\widehat{N}}_{i}^{\nu }}{{m}_{i}}\right)}$$
(14)
$$O(\{{N}_{i}^{\mu }\},\{{N}_{j}^{\nu }\})=\frac{1}{2}{\sum }_{i\in {{\bf{S}}}^{\dagger }}({\tilde{N}}_{i}^{\mu }+{\tilde{N}}_{i}^{\nu }).$$
(15)

The first equation is simply the square root of the Jensen-Shannon divergence between the relative abundances of the overlapping species, and the second measures the relative abundance of the species in the overlapping set, averaged over the two communities.
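A minimal sketch of Eqs. (11)–(15) is given below, assuming each sample is stored as a dictionary mapping a species label to its population size. The function name and the convention of returning NaN for the dissimilarity when no species are shared are choices made for this illustration.

```python
import math


def overlap_and_dissimilarity(n_mu, n_nu):
    """Return (O, D) for two samples given as {species: population size} dicts."""
    # Shared set: species with nonzero abundance in both samples.
    shared = [i for i in n_mu if n_mu[i] > 0 and n_nu.get(i, 0) > 0]
    if not shared:
        return 0.0, float("nan")  # D is undefined without shared species

    total_mu, total_nu = sum(n_mu.values()), sum(n_nu.values())
    shared_mu = sum(n_mu[i] for i in shared)
    shared_nu = sum(n_nu[i] for i in shared)

    # Eq. (15): relative abundance of the shared set, averaged over both samples.
    overlap = 0.5 * (shared_mu / total_mu + shared_nu / total_nu)

    # Eq. (14): square root of the Jensen-Shannon divergence on the shared set.
    jsd = 0.0
    for i in shared:
        p = n_mu[i] / shared_mu  # Eq. (12) for sample mu
        q = n_nu[i] / shared_nu  # Eq. (12) for sample nu
        m = 0.5 * (p + q)        # Eq. (13)
        jsd += 0.5 * (p * math.log(p / m) + q * math.log(q / m))
    return overlap, math.sqrt(max(jsd, 0.0))


if __name__ == "__main__":
    a = {"sp1": 30, "sp2": 10, "sp3": 60}
    b = {"sp1": 10, "sp2": 30, "sp4": 60}
    print(overlap_and_dissimilarity(a, b))
```

With natural logarithms, D ranges from 0 for identical shared compositions up to \(\sqrt{{\rm{ln}}\,2}\).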

Computation of effective Lotka-Volterra parameters

The distinction between “host-specific” vs. “universal” population dynamics is most clearly defined in terms of a closed set of equations for the dynamics of the population sizes, with environmental factors treated implicitly33. We can transform the MiCRM into a model of this form by examining the regime where the resource dynamics are “fast” compared to the timescale for changes in population sizes. We can then simplify the form of the resulting model by performing a Taylor expansion of the growth rate around the equilibrium population sizes \({\bar{N}}_{i}\), resulting in generalized Lotka-Volterra equations parameterized by a set of carrying capacities and interaction coefficients.

We start by writing the full dynamical equation (2) in a more compact form:

$$\frac{d{N}_{i}}{dt}={g}_{i}{N}_{i}\left[{\sum }_{\alpha }{\widetilde{w}}_{\alpha }{c}_{i\alpha }{R}_{\alpha }-{m}_{i}\right]$$
(16)
$$\frac{d{R}_{\alpha }}{dt}={\kappa }_{\alpha }-{\tau }^{-1}{R}_{\alpha }-{\sum }_{j\beta }{Q}_{\alpha \beta }{N}_{j}{c}_{j\beta }{R}_{\beta }$$
(17)

with

$${\widetilde{w}}_{\alpha }={w}_{\alpha }(1-{l}_{\alpha })$$
(18)
$${Q}_{\alpha \beta }={\delta }_{\alpha \beta }-{D}_{\alpha \beta }\frac{{w}_{\beta }}{{w}_{\alpha }}{l}_{\beta }.$$
(19)

We now invoke the “fast resource equilibration” assumption to set \(d{R}_{\alpha }/dt=0\), and solve for the resource concentrations \({\bar{R}}_{\alpha }\) as functions of the set of population sizes \(\{{N}_{j}\}\). Inserting this result back into the equations for the dynamics of \({N}_{i}\), we have:

$$\frac{d{N}_{i}}{dt}={g}_{i}{N}_{i}\left[{\sum }_{\alpha }{\widetilde{w}}_{\alpha }{c}_{i\alpha }{\bar{R}}_{\alpha }(\{{N}_{j}\})-{m}_{i}\right].$$
(20)
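For fixed population sizes, the condition \(d{R}_{\alpha }/dt=0\) in Eq. (17) is linear in the resource concentrations, so \({\bar{R}}_{\alpha }(\{{N}_{j}\})\) can be obtained from a single linear solve. The sketch below assumes NumPy arrays with the shapes stated in the docstring; the function names and array layout are illustrative and not taken from the authors' code.

```python
import numpy as np


def steady_state_resources(N, c, Q, kappa, tau):
    """Solve Eq. (17) with dR/dt = 0 for the resource concentrations.

    N: (S,) population sizes; c: (S, M) consumption matrix c[i, alpha];
    Q: (M, M) matrix of Eq. (19); kappa: (M,) supply rates; tau: dilution timescale.
    """
    uptake = N @ c  # (M,) total uptake coefficient sum_j N_j c[j, beta]
    # kappa_a = R_a / tau + sum_b Q[a, b] * uptake[b] * R_b  ->  linear system in R
    lhs = np.eye(len(kappa)) / tau + Q * uptake[np.newaxis, :]
    return np.linalg.solve(lhs, kappa)


def per_capita_growth(N, c, Q, kappa, tau, g, w_tilde, m):
    """Term g_i * [sum_a w_tilde_a c[i, a] R_bar_a - m_i] appearing in Eq. (20)."""
    R_bar = steady_state_resources(N, c, Q, kappa, tau)
    return g * (c @ (w_tilde * R_bar) - m)


if __name__ == "__main__":
    N = np.array([1.0, 2.0])
    c = np.eye(2)
    print(steady_state_resources(N, c, np.eye(2), np.ones(2), 1.0))
```

In the no-leakage limit, where Q is the identity matrix, each resource decouples and \({\bar{R}}_{\alpha }={\kappa }_{\alpha }/({\tau }^{-1}+{\sum }_{j}{N}_{j}{c}_{j\alpha })\).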

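To obtain the local Lotka-Volterra coefficients, we perform a Taylor expansion of the term in brackets around \({N}_{j}={\bar{N}}_{j}\), up to first order in the distance from equilibrium:

$$\frac{d{N}_{i}}{dt}={g}_{i}{N}_{i}\left[{\sum }_{\alpha }{\widetilde{w}}_{\alpha }{c}_{i\alpha }{\bar{R}}_{\alpha }(\{{\bar{N}}_{j}\})-{m}_{i}+{\sum }_{\alpha j}{\widetilde{w}}_{\alpha }{c}_{i\alpha }\frac{\partial {\bar{R}}_{\alpha }}{\partial {N}_{j}}({N}_{j}-{\bar{N}}_{j})+{\mathcal{O}}({({N}_{j}-{\bar{N}}_{j})}^{2})\right]$$
(21)
$$=\ \frac{{r}_{i}}{{K}_{i}}{N}_{i}\left[{K}_{i}-{N}_{i}-{\sum }_{j\ne i}{\alpha }_{ij}{N}_{j}+{\mathcal{O}}({({N}_{j}-{\bar{N}}_{j})}^{2})\right]$$
(22)

where

$${\alpha }_{ij}=\frac{{\sum }_{\alpha }{\widetilde{w}}_{\alpha }{c}_{i\alpha }\frac{\partial {\bar{R}}_{\alpha }}{\partial {N}_{j}}}{{\sum }_{\beta }{\widetilde{w}}_{\beta }{c}_{i\beta }\frac{\partial {\bar{R}}_{\beta }}{\partial {N}_{i}}}$$
(23)
$${K}_{i}={\sum }_{j}{\alpha }_{ij}{\bar{N}}_{j}+\frac{{m}_{i}-{\sum }_{\alpha }{\widetilde{w}}_{\alpha }{c}_{i\alpha }{\bar{R}}_{\alpha }(\{{\bar{N}}_{j}\})}{{\sum }_{\beta }{\widetilde{w}}_{\beta }{c}_{i\beta }\frac{\partial {\bar{R}}_{\beta }}{\partial {N}_{i}}}$$
(24)
$${r}_{i}=-\,{K}_{i}{g}_{i}{\sum }_{\alpha }{\widetilde{w}}_{\alpha }{c}_{i\alpha }\frac{\partial {\bar{R}}_{\alpha }}{\partial {N}_{i}}$$
(25)

and all derivatives are evaluated at the equilibrium population sizes. With these definitions \({\alpha }_{ii}=1\), and matching the terms of Eq. (22) to those of Eq. (21) reproduces the expansion term by term. For every surviving species (\({\bar{N}}_{i} > 0\)), the zeroth-order term \({\sum }_{\alpha }{\widetilde{w}}_{\alpha }{c}_{i\alpha }{\bar{R}}_{\alpha }(\{{\bar{N}}_{j}\})-{m}_{i}\) vanishes because \(d{N}_{i}/dt=0\) at equilibrium, so that Eq. (24) reduces to \({K}_{i}={\sum }_{j}{\alpha }_{ij}{\bar{N}}_{j}\).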

We can compute the derivatives of \({\bar{R}}_{\alpha }\) by implicit differentiation of the steady-state condition \(d{R}_{\alpha }/dt=0\) in Eq. (17) to obtain

$${\sum }_{\beta }{A}_{\alpha \beta }\frac{\partial {\bar{R}}_{\beta }}{\partial {N}_{j}}=-\,{\sum }_{\beta }{Q}_{\alpha \beta }{c}_{j\beta }{\bar{R}}_{\beta }$$
(26)

with

$${A}_{\alpha \beta }={\tau }^{-1}{\delta }_{\alpha \beta }+{Q}_{\alpha \beta }{\sum }_{i}{\bar{N}}_{i}{c}_{i\beta }.$$
(27)

Thus we find

$$\frac{\partial {\bar{R}}_{\alpha }}{\partial {N}_{j}}=-\,{\sum }_{\beta \gamma }{({A}^{-1})}_{\alpha \beta }{Q}_{\beta \gamma }{c}_{j\gamma }{\bar{R}}_{\gamma }$$
(28)

and conclude

$${\alpha }_{ij}=\frac{{\sum }_{\alpha \beta }{c}_{i\alpha }{W}_{\alpha \beta }{c}_{j\beta }}{{\sum }_{\gamma \delta }{c}_{i\gamma }{W}_{\gamma \delta }{c}_{i\delta }}$$
(29)

where

$${W}_{\alpha \beta }={\sum }_{\gamma }{\widetilde{w}}_{\alpha }{({A}^{-1})}_{\alpha \gamma }{Q}_{\gamma \beta }{\bar{R}}_{\beta }.$$
(30)
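Equations (27), (29) and (30) translate directly into a few lines of linear algebra. The following is a sketch under the assumption that the equilibrium abundances and resource concentrations are available as NumPy arrays; the function name and argument layout are hypothetical.

```python
import numpy as np


def interaction_matrix(N_bar, R_bar, c, Q, w_tilde, tau):
    """Effective Lotka-Volterra coefficients alpha_ij from Eqs. (27), (29), (30).

    N_bar: (S,) equilibrium populations; R_bar: (M,) equilibrium resources;
    c: (S, M) consumption matrix; Q: (M, M) matrix of Eq. (19);
    w_tilde: (M,) values of Eq. (18); tau: dilution timescale.
    """
    M = len(R_bar)
    uptake = N_bar @ c                                   # sum_i N_bar_i c[i, beta]
    A = np.eye(M) / tau + Q * uptake[np.newaxis, :]      # Eq. (27)
    A_inv_Q = np.linalg.solve(A, Q)                      # A^{-1} Q without forming A^{-1}
    W = w_tilde[:, np.newaxis] * A_inv_Q * R_bar[np.newaxis, :]  # Eq. (30)
    cWc = c @ W @ c.T                                    # sum_{ab} c[i, a] W[a, b] c[j, b]
    return cWc / np.diag(cWc)[:, np.newaxis]             # Eq. (29); diagonal equals 1


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, M = 3, 4
    c = rng.uniform(0.1, 1.0, (S, M))
    alpha = interaction_matrix(rng.uniform(0.5, 2.0, S), rng.uniform(0.5, 2.0, M),
                               c, np.eye(M), np.ones(M), 1.0)
    print(alpha)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical-stability choice for this sketch. The diagonal entries equal one by construction, which provides a simple check.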

Since all the parameters and equilibrium abundances are known in the simulations, this set of equations allows us to compute for each community μ (μ = 1, 2, …, 300) the bare growth rates \({r}_{i}^{\mu }\), the interactions \({\alpha }_{ij}^{\mu }\) and the carrying capacities \({K}_{i}^{\mu }\). The scatter plot in the bottom right panel of Fig. 9 shows the normalized RMS variability in the carrying capacities for each pair of samples μ and ν, computed as:

$${{\rm{variability}}}_{\mu \nu }=\frac{\sqrt{\frac{1}{{S}^{\dagger }}{\sum }_{i\in {{\bf{S}}}^{\dagger }}{({K}_{i}^{\mu }-{K}_{i}^{\nu })}^{2}}}{\frac{1}{2{S}^{\dagger }}{\sum }_{j\in {{\bf{S}}}^{\dagger }}({K}_{j}^{\mu }+{K}_{j}^{\nu })}$$
(31)

where \({S}^{\dagger }\) is the number of surviving species shared between the two communities, and \({{\bf{S}}}^{\dagger }\) is the set of indices of the shared species.
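Equation (31) can be evaluated as in the sketch below, which assumes the carrying capacities of the surviving species in each community are stored as dictionaries keyed by species index. The function name and data layout are illustrative.

```python
import math


def carrying_capacity_variability(K_mu, K_nu):
    """Normalized RMS difference in carrying capacities over shared species, Eq. (31).

    K_mu, K_nu: {species index: carrying capacity} for the surviving species
    of communities mu and nu.
    """
    shared = sorted(set(K_mu) & set(K_nu))
    if not shared:
        raise ValueError("the two communities share no surviving species")
    n_shared = len(shared)
    rms = math.sqrt(sum((K_mu[i] - K_nu[i]) ** 2 for i in shared) / n_shared)
    mean = sum(K_mu[i] + K_nu[i] for i in shared) / (2 * n_shared)
    return rms / mean


if __name__ == "__main__":
    print(carrying_capacity_variability({0: 1.0, 1: 3.0}, {0: 3.0, 1: 1.0, 2: 5.0}))
```

Species present in only one of the two communities are ignored, in line with the restriction of both sums to the shared set.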