Introduction

The main goal of community ecology is to understand the underlying forces that determine species numbers and identities, and their relative abundances, across spatio-temporal scales. Many recent efforts are aimed at developing robust methods to estimate and understand the relative importance of different community assembly processes [1]. Ecologists generally accept that community assembly must necessarily be driven by both dispersal-assembly and niche-assembly mechanisms [2,3,4]. Dispersal-assembly theories emphasize (1) stochastic colonization (driven largely by species pools) and (2) local extinctions (driven by random events affecting small population sizes) [5, 6], while niche-assembly highlights (1) the importance of species differences and (2) their interactions in predicting community composition [7,8,9]. Niche-assembly mechanisms can affect either extinction or colonization. From this perspective, local extinctions generally result from negative interactions. They are often caused by competitive exclusion of sub-dominant species disappearing from the system due to a worse overall performance [10, 11], but strong environmental filtering (cf. species sorting [12]) can also prevent colonization and establishment of poorly adapted species.

In contrast, positive interactions (facilitation) decrease the probability of local extinctions, enhancing persistence and biodiversity [13,14,15]. Assessing the link between trait variation, the environment, and biotic interactions (either positive or negative) is not trivial [1, 16,17,18]. Further, there is an increasing recognition of the need for methods to make robust predictions that help understand responses of species assemblages to environmental shifts [19, 20].

In studying community assembly of macroorganisms, spatio-temporal scales are commonly huge, making long-term ecological research challenging. Microbes, in contrast, thrive in highly diverse communities and are characterized by short time scales. This makes microbial communities ideal study systems, providing robust insights into assembly processes [21, 22]. Species and individuals within assemblages respond to the environment according to their functional characteristics. In particular, microbes with stress tolerance traits are selected in specific environmental conditions [23]. Genomic traits allow the linkage of bacterial functions to their environmental preferences at different scales, from local habitats to biogeographical regions [24]. This approach could be especially helpful across habitats with strong environmental constraints, where a variety of functional adaptations can emerge along a gradient.

Moreover, there is an increasing recognition that variations in the distribution of functional traits along environmental gradients can be quantified [18, 25, 26]. Whenever niche-assembly mechanisms are overridden by strong random processes, the local community composition is expected to be akin to a random sample from the regional pool [27].

When this happens, local trait distributions should not differ significantly from the regional trait distribution. Conversely, differences between local and regional trait distributions are a fingerprint of the existence of specific drivers of community assembly. The classical framework by Webb [28] considers clustering in trait values as a signature of environmental filtering, while Mayfield and Levine [10] remarked that such clustering can result also from competitive exclusion.

Here, we use shifts in the distribution of functional traits along an environmental gradient to assess the ecological impact of a changing environment on community trait structure. Our methods are inspired by the data randomization/resampling null model approach [29]. Our trait-based approach consistently finds thresholds in an environmental variable along a gradient with the aim of interpreting meaningful changes in the type of drivers responsible for community assembly. We call this index-based method RTCC—Randomized Trait Community Clustering.

Materials and methods

Details related to sampling sites, molecular methods, bioinformatics processing, and additional statistics are provided in SI Materials and Methods, while null models of community assembly are fully described in this section. See also supplementary text and Fig. S1 for a summarized workflow of the methods carried out in this study. In addition, source code for the RTCC Method and sample data files are available on GitHub. Briefly, samples were collected from the Monegros Desert area (NE Spain, 41°42′N, 0°20′W), which harbors one of the largest numbers of inland saline lakes in Europe [30]. A total of 148 samples were taken from 14 different shallow lagoons (Fig. S4). We explored the spatio-temporal variation of a regional set of these local aquatic communities driven by a wide salinity gradient (0.1–40% of dissolved salts) (Fig. S5A). Temporal wind speed data were obtained from the state meteorological agency, which reports data from an automatic station in the area (Bujaraloz) (Fig. S5B). For DNA analyses, water samples were pre-filtered in situ through a 50 μm-pore-size net to retain large zooplankton and algae, and 100–500 mL were subsequently filtered on 5 μm and then 0.2 μm pore-size polycarbonate membranes (47 mm diameter).

The membranes were enzymatically digested and phenol–chloroform extracted [30]. To obtain community profiles, we performed an NGS step targeting the V4 region of the bacterial 16S rRNA gene using a high-speed multiplexed Illumina MiSeq platform. Raw sequences were processed using the UPARSE pipeline [31]. After the processing steps, we kept only those samples with library sizes larger than 10,000 reads. The final pool consisted of 136 samples with 9993 operational taxonomic units (OTUs). To associate OTUs with genome content, we used a 16S rRNA database [32] linked to the IMG genomic database [33]. OTUs were matched to 16S rRNA gene records available in the PATRIC genomic database (as of January 2016) [32]. The average percentage of abundance matched per sample was ca. 62%. We tested the representativeness of the subset matching genomes in our approach (see Figs. S1 and S2), i.e., we explored whether the ordination pattern of the whole dataset was consistent with the ordination of the subset. In addition, a Mantel test (based on Spearman correlation) was used to support this decision. See also the RTCC Method protocol in the SI document. Functional predictions based on representative genomes allow the genomic and metabolic potential of 16S rRNA sequences to be inferred [34, 35]. For this we downloaded available genomic traits of potential interest from IMG [33]: genome size, gene count, % GC, % coding base, % CDS, % RNA, rRNA count, % transporter proteins, % signal peptide, and % transmembrane proteins. We selected 10 traits that summarize genomic structural/functional variability, some of them known to respond to ecological adaptations.

Null models of community assembly

In order to quantify the importance of environmental filtering in the assembly of ecological communities, we developed a measure of the degree of clustering of a set of samples (or sites). The pool of species was formed by all the species observed in the complete set of samples taken from the lagoons. First we calculated a measure of average dissimilarity in the trait distribution of a sample as the pair-wise mean trait difference between every species pair in the sample, the functional mean pairwise distance (MPD) [28]

$$\Delta x=\frac{2}{n(n-1)}\sum_{i=1}^{n}\sum_{j=i+1}^{n} |x_{i}-x_{j}| \tag{1}$$

where x_i is the trait value of species i in the sample (with n reported species). Next we tested whether the observed measure ∆x differed significantly from the expectation of the same measure across realizations of consistent null models with the same number of species as in the empirical sample. We assumed that any species from the pool can take part in a null-model synthetic community. We built the null-model dissimilarity distribution over replicates to obtain a p-value, i.e., the quantile defined by the empirical difference within the distribution of simulated differences across realizations. We tested whether the empirical sample dissimilarity value was significantly large or small, indicating over-dispersion or clustering, respectively. If the observed difference of our empirical sample fell within the lowest 5% of the simulated distribution of trait differences, we considered that sample as showing significant clustering (Fig. 1a, left). We repeated this test for every sample in our data set, which yielded the p-value distributions for each target trait (Fig. 1a, right). Let N be the number of samples in the data set, and N_c the number of samples that showed significant clustering when tested against the null model. We defined a clustering index,

$$h = \frac{N_{\mathrm{c}}}{N}, \tag{2}$$

as the proportion of samples showing significant clustering over all the samples considered. Note that the clustering index is a global measure for the whole set of samples. Error bars for the clustering index were calculated by averaging this quantity over 100 repetitions of the distribution of p-values according to the null model (Fig. 1b).
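The per-sample test and the clustering index of Eqs. (1) and (2) can be illustrated with a short Python sketch. It is not the published RTCC code: the function names, the number of replicates, the strict inequalities used for ties, and the use of drawing without replacement are assumptions made for illustration, and only the random assembly null model is shown.

```python
import random
from itertools import combinations


def mean_pairwise_distance(traits):
    """Eq. (1): mean absolute trait difference over all species pairs."""
    n = len(traits)
    if n < 2:
        raise ValueError("need at least two species in a sample")
    total = sum(abs(a - b) for a, b in combinations(traits, 2))
    return 2.0 * total / (n * (n - 1))


def null_p_value(sample_traits, pool_traits, n_replicates=999, rng=None):
    """Quantile of the observed MPD within a null distribution.

    Synthetic communities have the same number of species as the sample and
    are drawn uniformly, without replacement, from the pool (an assumption).
    """
    rng = rng or random.Random()
    observed = mean_pairwise_distance(sample_traits)
    n = len(sample_traits)
    null = [
        mean_pairwise_distance(rng.sample(pool_traits, n))
        for _ in range(n_replicates)
    ]
    # Fraction of null replicates that are less dispersed than the observation.
    return sum(d < observed for d in null) / n_replicates


def clustering_index(p_values, alpha=0.05):
    """Eq. (2): fraction of samples showing significant clustering."""
    return sum(p < alpha for p in p_values) / len(p_values)
```

A low p-value means that few synthetic communities are as tightly packed in trait space as the observed one, so `clustering_index` simply counts how often that happens across samples.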

Fig. 1

Methodological framework for the Randomized Trait Community Clustering (RTCC) method. a For each sample from different water bodies over a given region we test the hypothesis that observed average dissimilarity (see Eq. (1)) is compatible with a given null model. These non-parametric randomizations assign a p-value for each sample. For all samples in the dataset we represent the p-value distributions obtained by the null model for a given functional trait. Significantly low p-values are related to clustering (p < 0.05), and significantly high p-values to over-dispersion. b For the whole sample set, which we define here as a metacommunity, we calculate the fraction of samples that yield significant clustering, i.e., its clustering index (see Eq. (2)). We average several realizations of the null model in order to get average clustering index and error bars. c Samples are sequentially removed along decreasing values of the environmental variable to define nested metacommunities; for each of them we calculate the clustering index (left). The significance of this curve is assessed by building an ensemble of curves corresponding to removing samples in random orderings, which are then used to define the shaded area (right)

The null models developed differ in the definition of the species pools potentially available for each of the samples, that is, in the way species were drawn from the species pool:

  i. Random assembly: For all samples, all species are potentially available. They are chosen at random from the pool with equal probability.

  ii. Abundance-based assembly: For all samples, all species are potentially available, but each is chosen with a probability proportional to its abundance in the entire species pool.

  iii. Environmental range-based assembly: For a given sample, only those species whose ranges for the environmental factor encompass the value observed in that sample are eligible to be drawn and form the corresponding randomly generated sample. Potentially eligible species are drawn with uniform probability. Then, in this constrained way, we build a set of random synthetic samples, each characterized by a given value of the environmental factor and the same number of species as observed in each corresponding observed sample. For each species, the range of the environmental factor is defined by the minimum and maximum values of the variable across the samples in which the species is observed.

  iv. Environmental range and abundance-based assembly: As (iii), but, additionally, species whose ranges for the environmental factor contain the value of the empirical sample are randomly drawn proportionally to their abundances in the species pool.
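The four species-pool definitions can be sketched as a single sampling routine. The data layout (a list of `(env_value, species_set)` pairs and a dictionary of pool abundances), the function names, and the choice to draw species without replacement are hypothetical and serve only to illustrate how the models differ.

```python
import random


def species_ranges(samples):
    """(min, max) of the environmental variable over samples where each species occurs.

    `samples` is a list of (env_value, set_of_species) pairs (assumed layout).
    """
    ranges = {}
    for env, species in samples:
        for sp in species:
            lo, hi = ranges.get(sp, (env, env))
            ranges[sp] = (min(lo, env), max(hi, env))
    return ranges


def weighted_draw_without_replacement(candidates, weights, n, rng):
    """Draw n distinct items, each step proportional to the remaining weights."""
    candidates, weights = list(candidates), list(weights)
    chosen = []
    for _ in range(n):
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(idx))
        weights.pop(idx)
    return chosen


def draw_null_sample(model, n, abundance, ranges, env_value, rng):
    """Synthetic community of n species under null model 'i', 'ii', 'iii' or 'iv'."""
    if model not in ("i", "ii", "iii", "iv"):
        raise ValueError(f"unknown null model: {model!r}")
    eligible = sorted(abundance)
    if model in ("iii", "iv"):
        # Keep only species whose observed range contains the sample's value.
        eligible = [
            sp for sp in eligible
            if ranges[sp][0] <= env_value <= ranges[sp][1]
        ]
    if model in ("i", "iii"):
        return rng.sample(eligible, n)  # uniform probability
    weights = [abundance[sp] for sp in eligible]  # proportional to abundance
    return weighted_draw_without_replacement(eligible, weights, n, rng)
```

Because a species' range is built from the samples in which it was observed, the observed community is always a subset of the eligible species for its own sample, so models (iii) and (iv) can always draw as many species as were observed.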

To discern whether significant clustering (or the lack of it) was influenced by the environmental gradient, we iteratively excluded from the set the sample with the highest value of the environmental variable of interest, also excluding from the pool the species that appeared only in that sample, and recalculated the clustering index with the remaining samples (Fig. 1c, left). Additionally, we tested the significance of clustering patterns according to the four null models by applying the same sequential removal of samples, but first randomizing their salinity order (see Supplementary Materials for further details). These randomizations provided a 95% confidence interval for the clustering index (Fig. 1c, right), which allows one to decide whether the observed clustering pattern as the environmental variable decreases is related to the environmental factor.
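The sequential removal procedure and its randomized counterpart can be sketched as follows. This is an illustrative outline rather than the authors' implementation: `index_fn` is a hypothetical callback that returns the clustering index for a list of samples and is assumed to rebuild the species pool from that list, and the number of random orderings and the nearest-rank percentile rule are arbitrary choices.

```python
import random


def removal_curve(samples, order, index_fn, min_samples=1):
    """Clustering index of the full set, then after each removal in `order`."""
    remaining = set(range(len(samples)))
    curve = [index_fn([samples[i] for i in sorted(remaining)])]
    for i in order:
        remaining.discard(i)
        if len(remaining) < min_samples:
            break
        curve.append(index_fn([samples[j] for j in sorted(remaining)]))
    return curve


def ordered_removal(samples, env_of, index_fn):
    """Remove samples one by one in decreasing order of the environmental variable."""
    order = sorted(
        range(len(samples)), key=lambda i: env_of(samples[i]), reverse=True
    )
    return removal_curve(samples, order, index_fn)


def random_removal_band(samples, index_fn, n_orders=200, rng=None):
    """Per-step 95% band of the clustering index under random removal orders."""
    rng = rng or random.Random()
    curves = []
    for _ in range(n_orders):
        order = list(range(len(samples)))
        rng.shuffle(order)
        curves.append(removal_curve(samples, order, index_fn))
    band = []
    for step_values in zip(*curves):
        vals = sorted(step_values)
        lo = vals[int(0.025 * (len(vals) - 1))]
        hi = vals[int(0.975 * (len(vals) - 1) + 0.5)]
        band.append((lo, hi))
    return band
```

An ordered curve that leaves the band returned by `random_removal_band` would indicate that the change in clustering depends on the environmental ordering rather than on the mere loss of samples.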

Statistics

Significant breakpoints of observed patterns along environmental gradients, either from ordination analyses or clustering indices, were estimated by means of the maximum F statistic derived from sequential Chow-tests (sctest function from strucchange package) [36] (see Fig. S3).
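The idea behind a maximum-F breakpoint search can be illustrated with a minimal Python sketch. It is not the `strucchange` implementation: it assumes a simple linear fit of the response on the gradient, a hypothetical `trim` parameter for the minimum segment size, and it returns only the location and value of the largest Chow F statistic, without the p-value that `sctest` reports.

```python
def _rss_linear(x, y):
    """Residual sum of squares of an ordinary least-squares line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def sup_f_breakpoint(x, y, trim=3):
    """Split index k (left segment = first k points) with the largest Chow F.

    `x` must be sorted along the gradient. Each segment gets its own
    intercept and slope, hence 2 restrictions and n - 4 residual degrees
    of freedom in the F statistic.
    """
    n = len(x)
    if trim < 2 or n < 2 * trim or n <= 4:
        raise ValueError("too few points for the requested trimming")
    rss_pooled = _rss_linear(x, y)
    best_k, best_f = None, float("-inf")
    for k in range(trim, n - trim + 1):
        rss_split = _rss_linear(x[:k], y[:k]) + _rss_linear(x[k:], y[k:])
        if rss_split == 0:
            continue  # perfect fit on both sides: F is undefined
        f_stat = ((rss_pooled - rss_split) / 2) / (rss_split / (n - 4))
        if f_stat > best_f:
            best_k, best_f = k, f_stat
    return best_k, best_f
```

Applied to a clustering index curve, `x` would hold the environmental values and `y` the index at each removal step; the significance of the maximum F still has to be assessed separately, because its null distribution is not a standard F distribution.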

Results

We developed a detailed theoretical and methodological framework for the RTCC method (see Fig. 1, the methods section for more details, and Fig. S1 for a summarized workflow of the methodology). To test the potential of this trait-based approach for gaining new insights about different community assembly mechanisms, we explored microbial communities along an environmental (salinity) gradient in a set of shallow saline ponds. We first evaluated 10 genomic traits that were available in public databases and that summarized genomic structural/functional variability, some of them known to respond to ecological adaptations (Fig. 2 and see Supplementary Material, section "Matching genomes"). Among them, the percentage of DNA dedicated to signal peptide synthesis exhibited the strongest clustering signal. Therefore, this trait, characterizing each species within a sample, was primarily used to further conduct the RTCC analyses. The pairwise differences for this trait were significantly smaller than those expected in synthetic communities randomly assembled from the pool according to the first null model (see the “Methods” section).

Fig. 2

Clustering distribution patterns for different functional traits. The p-value distributions obtained by the random assembly null model for 10 functional traits are shown. Colored stripes show the ranges of significant over-dispersion (p ≥ 0.95) and significant clustering (p ≤ 0.05). Signal peptide exhibits a strong signal of clustering

To assess the role of environmental factors in sorting species from empirical communities along the gradient, we analyzed the relationship between clustering patterns and two environmental variables with a battery of assembly null models (see the “Methods” section). Salinity and wind velocity should have different impacts on community assembly and, therefore, result in distinct patterns of clustering index values. We used wind as a control variable because it should represent a stochastic continuous perturbation, while salinity would provide a deterministic influence. We represented clustering index values by sequentially removing samples in decreasing order of the environmental variable (see Fig. 3a, b). We then tested the significance of the environmental variable by comparing the declining patterns with replicates of the same methodology based on random orderings (95% confidence intervals are shown as shaded areas). The rightmost point of the curves corresponded to the clustering index for all samples in the data set. The random assembly null model yielded a clustering index value h ≈ 0.9, which means that around 90% of samples were not compatible with random assemblages, that is, showed an average trait dissimilarity significantly different from null model expectations. In other words, the vast majority of empirical samples exhibited a high degree of clustering in signal peptide compared to purely random assemblages. This level of clustering declined when synthetic communities were built by selecting species with a probability proportional to relative abundances, and when ranges of salinity were taken into account in the assembly of simulated communities. As null model complexity increased, the fraction of samples compatible with the null hypothesis increased (see Fig. 3b). For the most constrained model, which takes into account both species abundances and salinity ranges, at most ~30% of samples were not compatible with the null hypothesis underlying this model (see Fig. 3, lower curve). All null models showed a plateau in clustering index as the environmental variable decreased, but, at some point, a sharp decline occurred. The curves for salinity lay outside the confidence interval expected for randomized orderings, hence the null model was not able to explain the variation of clustering as salinity decreased. This result is consistent with a deterministic role of the salinity environmental gradient in community assembly. We repeated the same methodology using wind speed as a control environmental variable. Wind velocity showed a different pattern from salinity, that is, white curves lying within the shaded area in both tests (Fig. 3a). Since clustering values for ordered sample removal lie within null model confidence intervals for randomized orderings, wind speed is not a deterministic driver of community assembly, but, as expected, plays a stochastic role (Fig. 3a). Although the OTU-based ordination approach revealed a weak positive correlation signal for wind, this marginal correlation is well below the strong correlation observed for salinity (compare D and B panels in Fig. S6).

Fig. 3

Null models for community assembly. a Randomized signal peptide community clustering in salinity and maximum wind speed according to null models (i) and (ii), random assembly and abundance-based assembly, respectively. The clustering index for the signal peptide trait is plotted versus the maximum value of the environmental variable among all samples still present after every removal step. Samples are sequentially removed one-by-one in decreasing order of the variable. Note that wind velocities have been normalized so that their maximum values correspond to the maximum salinity observed. The shaded gray areas correspond to confidence intervals for the different null models when the random removal of samples is repeated multiple times. b Same as a for null models (iii) and (iv), environmental range-based assembly, and environmental range and abundance-based assembly, respectively, which are meaningful only for salinity. c Clustering index curves are plotted again along with their thresholds, all of them lying within the rectangle (shaded in gray) yielded by the OTU-based approach, whose limits correspond to the significant breakpoints of the observed pattern along the salinity gradient (Fig. S6)

Chow tests were used to determine a threshold value in salinity over which environmental constraints emerge as a leading driver. We found salinity thresholds ranging from 3.2% for the random assembly null model to 4.8% for the ranges and abundances null model. These values lay within the range of salinity that presented significant breakpoints in the rate of community change in the ordination analysis (Fig. 3c, see also Fig. S6). We observed that traits other than signal peptide could be used to determine thresholds along the gradient (for example, the percentage of GC pairs in sequences; see Fig. S7). However, this was not true for every trait (for example, the percentage of DNA associated with transporter proteins, see also Fig. S7). Finally, signal peptide average values across genomes per sample (relative to the metacommunity average) were plotted along the salinity gradient. We observed that average trait values decreased as salinity increased (Fig. S8).

Discussion

Competitive exclusion, which increases local extinctions of sub-adapted species, and strict environmental filtering, which prevents effective colonization by non-tolerant species along an environmental gradient, lie at two extremes of a continuum. Accurately disentangling these two main drivers is challenging. We developed a method to evaluate the effect of an environmental gradient on community trait structure, and provide a conceptual framework (Fig. 4) to interpret consistent discrete environmental thresholds that help separate an assembly regime driven by a rich combination of several biotic and abiotic factors from a regime characterized by strong environmental constraints. The observed threshold unveils the switch point above which a structuring force starts dominating along the gradient (see the highest values of the clustering index along the plateau). Conversely, the sharp decline of the clustering index below the threshold value can be regarded as a signature of the rapid decay of the predominant role of a given environmental variable in driving local community assembly (Fig. 4). Thus, the lower the environmental pressure, the lower the proportion of samples exhibiting significant clustering (i.e., the lower the clustering index), due to the increased dispersion of trait values within local samples.

Fig. 4

Conceptual framework. A conceptual model for the interpretation of results based on the Randomized Trait Community Clustering (RTCC) method. We indicate the type of dominant deterministic forces acting along a gradient of a meaningful environmental variable. The picture shows the pattern observed along the salinity gradient by using null model (iv). The shaded gray area corresponds to confidence intervals for the same null model but randomizing sequential sample removal. A significant correlation of the trait clustering pattern with an environmental gradient exists if clustering index values for a given trait (under ordered sequential removal) lie outside the shaded area, that is, if the null model fails to explain the variation of clustering along the gradient. The vertical line is the threshold salinity value separating the two community assembly regimes

Trait-based approaches have been widely used by ecologists to identify general community-level patterns of macro-organisms based on phenotypic characters [25, 37,38,39]. Some authors have made considerable efforts to introduce this framework in the study of microbial communities [22, 40,41,42], among others. Burke et al. [43] suggest that gene function, rather than the RNA taxonomy approach, holds the key to properly studying bacterial community assembly. Community composition may vary through several processes, while community functionality may remain stable due to functional redundancy of taxa [44, 45]. If stochastic forces are not too strong, selected species traits are expected to be differently favored and cluster around certain optimal values along a structuring environmental gradient. The distribution of trait values across species along a given gradient will be the result of a balance between environmental filtering and biotic interactions. Some traits can be unambiguously related to either stress-tolerance or competitive abilities. In this case, they can be used to distinguish between environmental filtering and competitive exclusion [17, 23], even though these two processes produce the same trait clustering patterns [10].

Our findings indicate that the relative amount of genomic DNA coding for signal peptide shows a significant clustering pattern along the salinity gradient. Signal peptides are short peptides (5–30 amino acids) present at the N-terminus of newly synthesized proteins that control the final destination of these proteins, that is, whether they end up anchored to the cell membrane or are extracellularly secreted [46]. The diversity of these signaling systems decreases along salinity gradients in favor of the signal peptide-dependent Tat system, which dominates under high salinity conditions [47,48,49]. The observed decrease in average signal peptide values per sample along the salinity gradient (see Fig. S8) suggests that this trait could be involved in salinity stress tolerance, although confirming this requires further experimental research. Likewise, as salinity increases, we also observed higher similarity in GC content across locally co-occurring species (see Fig. S7). Several studies have suggested that hypersaline inhabitants are characterized by high GC content, although there are some exceptions to this general pattern [50]. Moreover, it has been suggested that the stress effect of extreme environments is reflected in the nucleotide composition [40]. Other functional traits are not critically influenced by this gradient. In other words, some traits with relevant ecological roles may be related to other functional strategies not linked to the main environmental gradient, such as, for instance, the rRNA operon copy number, known to reflect bacterial growth rates [51], which are related to competitive ability [41].

Salinity has been described as one of the major environmental drivers of microbial community composition at a global scale [52,53,54]. Important ecological changes have been described along salinity gradients from multipond solar salterns [55], with decreasing biodiversity as salinity increases [56], with an accompanying loss of metabolic processes [57]. Here, the specific trait pattern reveals the ecological functional adaptation of bacterial communities to the salinity gradient. Structural changes shown under a taxonomic view followed a similar pattern along the gradient (Fig. S6B). An increase in salinity beyond a threshold close to 5% triggers both a rapid change in community composition, and a decrease in mean relative signal peptide values (Fig. S8, upper panel). Maximum trait clustering was also reached at the same threshold. Above 5% salinity, community trait structure tends to stabilize, which means that there is functional redundancy in taxa while community composition still changes, but showing a slower turnover.

Between ca. 10% and 40% salinity range, community composition remained more stable. This shows that the RTCC method provided information not only about functional adaptation of communities—complementing results from their taxonomic turnover—but also about the accurate position of a salinity threshold (5%) over which environmental constraints strongly shape community assembly. Since traits are selected based on the degree of clustering signal rather than on previous trait knowledge, our method opens new possibilities to objectively determine both whether or not and where a functional adaptation occurs along an environmental gradient.

It is reasonable to think that, in principle, if an environmental variable changes very smoothly, community composition, and, consequently, community trait structure, should also change smoothly. However, the opposite seems to be true in microbial communities along a salinity gradient. The RTCC method accurately finds the position of a salinity threshold.

Ecological theory predicts that the end points of community assembly can be multiple [58]. The possibility that several stable species assemblages can exist under the same environmental conditions opens the door to observing abrupt transitions in community composition as a single environmental variable slowly changes. The study of the origins and underlying causes of these break points in community composition along environmental gradients deserves further empirical and theoretical work.

To conclude, we provide a new conceptual approach with the RTCC method that could be useful in guiding hypothesis-driven studies coping with the high complexity of biological systems. Although we used a data set from microbial communities, we highlight the potential of this method to be applied more generally to quantitative phenotypic or genotypic trait data measured for macro-organismal communities as well. A combination of theoretical modeling through trait-based analyses allows us to go beyond description and deepen our mechanistic understanding of community assembly. Quantifying how community assembly is shaped by the environment is critical to predicting the ecological impact of environmental changes, and our approach provides a powerful tool to advance this analysis.