Ecological indicator values reveal missing predictors of species distributions

The questions of how much abiotic environment contributes to explain species distributions, and which abiotic factors are the most influential, are key when projecting species realized niches in space and time. Here, we show that answers to these questions can be obtained by using species’ ecological indicator values (EIVs). By calculating community averages of plant EIVs (397 plant species and 3988 vegetation plots), we found that substituting mapped environmental predictors with site EIVs led to a doubling of explained variation (22.5% to 44%). EIVs representing light and soil showed the highest model improvement, while EIVs representing temperature did not explain additional variance, suggesting that current temperature maps are already fairly accurate. Therefore, although temperature is frequently reported as having a dominant effect on species distributions over other factors, our results suggest that this might primarily result from limitations in our capacity to map other key environmental factors, such as light and soil properties, over large areas.

Species distribution models 1-3 (SDMs, also termed habitat suitability models or ecological niche models) establish a statistical relationship between environmental variables (predictors) and observed species occurrences to estimate the realized environmental niche (sensu Hutchinson) and predict the realized or potential distribution of species [4][5][6] . Over the last two decades, the popularity of these models grew spectacularly 7 while technical developments allowed predictions at increasingly fine spatial and temporal resolutions [8][9][10][11][12][13] . However, despite accumulated knowledge on the factors influencing niches and distributions, most SDMs still lack inclusion of several ecophysiologically important factors 9,13,14 . One reason for this limitation is that, although we know which variables are theoretically important (e.g. various expressions of heat, light, water and nutrients for plants), we still don't know well what potential importance each variable has in explaining species distributions (e.g. climate versus substrate predictors for plants) 3 , and how well the commonly used environmental predictors truly reflect the species requirements 14 . Precise answers to such questions are still lacking, largely because we still have a very limited theoretical or experimental knowledge of species' fundamental niches and how it relates to realized niches in the wild 8,15,16 , but also because we have limited tools to assess the relative importance of all potential environmental factors on the realized and potential distribution of species at various scales.
In contrast to SDMs that use environmental variables directly to predict species or community distribution, species' ecological indicator values [17][18][19] (EIVs) represent semi-quantitative estimates of environmental conditions suited for individual species based on comprehensive and usually long-lasting compilations of expert knowledge and field or experimental measurements. EIVs allow one to estimate environmental conditions in a given biotope based on the occurrence and abundance of its constituent plant species, by averaging their individual indicator values [20][21][22][23][24] . One must therefore clearly distinguish EIVs defined for a species, representing central estimates of univariate niche dimensions, from those obtained for a site based on the aggregated values of the constituent species in the target community. These site ecological indicator values (site EIVs) thus represent integrated signals of species-environment relationships at the level of communities, and as such provide robust information on the long-term environmental conditions characterizing the site 25,26 . As a result, these are commonly used in paleoclimatology to reconstruct past climate 27 , in studies focusing on shorter time periods of (anthropogenic) climate warming 28,29 and in habitat monitoring studies for conservation purposes 30 . Yet, few studies so far have attempted -or even discussed -the possibility of using EIVs to assess the relative importance of environmental factors to predict species distributions, i.e. to combine SDMs and EIVs 23,24,31,32 .
Here, we develop the idea that EIVs calculated at observation sites can shed light on which variables could be the most important drivers of species distributions 24 , and which ones should be further included in SDMs to capture more of the species' environmental determinism. We show, using plants as a case study, how a modelling framework incorporating site EIVs (Fig. 1) can be used to evaluate the predictive power and suitability of those environmental variables most commonly used in SDMs (i.e. mainly climatic or topo-climatic predictors 14 ). We hypothesize that plant EIVs represent more proximal 33 descriptors of plant species distributions, allowing to estimate how much additional environmental variance can be potentially explained by SDMs, and ultimately approximate the maximum level of environmental determinism that can be quantified in SDMs at a given spatial resolution. We test and illustrate this idea using a dataset of 397 plant species from forests (3076 sites, r = 10 m) and open grasslands (912 sites, 8 × 8 m) occupying a complex landscape across an entire mountain region (700 km 2 ), spanning wide environmental gradients (375 to 3210 m) at high spatial resolution (25 m).

Results and Discussion
Substitution of each individual environmental predictor by its corresponding site EIV improved average model performance for all metrics (Wilcox.test, p < 0.001; Fig. 2) compared to the reference model (M_ref; see Tables 1, S1 and S2) and for all variables except annual mean temperature. The strongest single variable effect was observed when the annual sum of solar radiation was replaced by the site EIV for light (M_L; +9-12%) and the model including all site EIVs always performed best (M_EIV; +17-22%, Fig. 2). This increase in average model performance was not the result of a few species highly benefiting from the addition of site EIVs, but was driven by a general improvement of individual SDMs for the large majority of species (Fig. 2). For instance, adding the site EIVs for soil (N, R or N + R), light (L) and moisture (M) improved the predictions for up to 93%, 86% and 81% of the species respectively (Fig. 2, Fig. S1). Furthermore, these improvements could not be caused by the addition of more variables (except for M_Soil; but see Fig. S1), as all models included the same number of predictors (or less in case of M_K), providing strong evidence that EIVs contain important additional information not accounted for by traditional environmental predictors.
Results also showed that, apart from temperature, correlations were usually low between commonly used environmental variables and site EIVs ( Table 2). Some less commonly used environmental variables, such as cloud cover or site water balance (precipitation minus potential evapotranspiration), were more strongly correlated with site EIVs. Yet, selecting these as predictors (M_cor) did not improve overall model performance (Fig. 2), but instead shifted variables' importance (Fig. 3). The success of modelling the distribution of site EIVs in geographic space as a function of easily available environmental predictors (Table S3) strongly varied among the site EIVs, with very good predictions for site EIVs for temperature (T; cor = 0.94-0.96) and very poor predictions for site EIVs for continentality (K; cor = 0.41-0.53; Table S3). This strongly suggests that site EIVs represent more than simple combinations of easily available environmental predictors typically used in SDMs, and highlights their distinct and informative content. While site EIVs represent an integrated signal of the long-term local environmental conditions 17,19 (depending on the ecosystem, from years to centuries), the commonly used environmental GIS layers represent interpolated or modelled trends not accounting sufficiently for local modifications of  Table 1.
Apart from improving the average model performance, the replacement of environmental variables by site EIVs also significantly shifted the variable importance in all the models except those with the site EIV for temperature (M_T; Fig. 3). While temperature was by far the dominating variable in the reference model with topo-climatic predictors (M_ref) as well as in most SDM studies 14 , the inclusion of site EIVs led to a more even distribution of variable importance values (Fig. 3). The dominating importance of temperature might therefore be, at least partly, an artefact of the quality of the environmental variables used. The fact that the site EIV for temperature (T) (i) was highly correlated with annual mean temperature, (ii) was the only site EIV that could be modelled with good accuracy across the landscape using common environmental variables, and (iii) did not improve the predictive power of the models, suggests that the available high-resolution spatial information on temperature was already accurate, and little additional improvement was possible compared to the other environmental variables. In contrast adding the site EIVs for nutrients, soil pH, available water or light led to a general  Table 1 for details). The data shown is for 397 species evaluated on an independent 'external' data set of 1193 vegetation plots. The different shades of grey from light to dark represent GLM, GAM, RF, MAXENT and SRE models. The boxes represent the median and the 25/75 percentile and the whiskers are 2 SD. www.nature.com/scientificreports www.nature.com/scientificreports/ increase in model performance across all species (improved AUC for 93% of species). As getting information on soil and nutrients is time and money consuming 38 , most studies tend to use more easily available proxies, such as bedrock or soil types or coarse characteristics [39][40][41] . These maps are often based on coarser resolution vector maps    Table 1 for details). The different colors represent different categories of predictors: grey for topographic, red for temperature, blue for water, yellow for light and brown for soil. The bars outlined in black represent environmental predictors and the bars outlined in red represent site EIVs. The boxes represent the median and the 25/75 percentile and the whiskers are 2 SD.
www.nature.com/scientificreports www.nature.com/scientificreports/ and are therefore not able to reflect local variation in nutrient availability, soil depth or water holding capacity that would be needed to model at fine resolution (such as 25 m here). At fine resolution, the effects of soil structure and composition, and the resulting pH, water and nutrient contents, seem key to understand local community assembly 9,31,37,40,42,43 . Available water for plants is often expected to be reflected by seasonal precipitations in SDMs, which might not be very informative in landscapes with rugged topography, as the interpolated precipitation usually only show rough variation along altitude or latitude gradients 9,44 . Yet, soil moisture and water available for plants is known to be highly variable at locations with similar precipitation values but different topographic configurations and soil types 20,37 . These local differences in soil moisture were reflected in the corresponding site EIV (M), but did not seem to be captured by the interpolated precipitation maps, even in combination with topographic predictors. More complex modelling of soil moisture based on soil and topography would be needed to predict these variations 20,45 . Most site EIVs led to a similar increase in model performance across species and habitat types. Yet, the site EIV for light (L) had a much stronger effect in forested areas than in open grasslands (Appendix 1), highlighting the importance of information about available light in highly structured vegetation types with several layers of foliage such as forests or shrub lands 24 , with studies showing a strong improvement of model performance by adding canopy cover as a proxy for shading 24 .
The deviance explained by the SDMs using site EIVs as predictors (M_EIV; 44%) was almost doubled compared to our reference model (M_ref; 22.5%). However, even with the use of site EIVs capturing the local conditions at a high spatial resolution, a considerable proportion of unexplained variation remained in the models (D 2 ranging from 5 to 80%). Assuming that our site EIVs are near-optimal environmental predictors for plants, this would lead to the conclusion that environment alone will never be sufficient to fully explain the distribution of all plant species at high spatial resolutions. Our results thus suggest that, at least for some species, large proportions of environmental variance can remain uncaptured by correlative models, and that other factors need to be accounted for, such as dispersal, population dynamics, biotic interactions and other community assembly processes 46,47 . Furthermore, there will likely always remain some stochasticity (e.g. in local extinction and colonization events) that cannot be accounted for by environmental conditions or by other deterministic factors. Additionally, here we only presented data on single species models, but it can be assumed that when modelling whole plant communities the unexplained deviance and stochasticity cumulate and become even higher 48,49 .

Conclusion
Understanding and being able to predict species-environment relationships at different spatial and temporal scales is key to our biological understanding of the past, present and future functioning of ecosystems 48 . Here, we presented a new way of considering ecological indicator values (EIVs) in species distribution models (SDMs) as reasonable proxies of an optimal set of environmental predictor variables, allowing to unveil the level of environmental determinism that can theoretically be expected, based on expert knowledge, in correlative niche-based SDMs for plant species across a whole mountain region spanning wide environmental gradients at high resolution.
Our study revealed four key findings. (1) Significantly more environmental variance (up to 50% increase) could be explained in models based on site EIVs for the large majority of species, shedding light on which are the most effective environmental predictors and stressing the importance to develop several new environmental maps or improve existing ones, especially for soil nutrients and soil moisture but also for climatic parameters other than temperature, across whole regions. (2) Site EIVs for light and soil provided the greatest improvements, while site EIVs for temperature provided the smallest improvement. (3) Temperature was already largely accounted for in current models, and few additional improvements can be expected for this variable 12 . And finally, (4) even with all important EIVs incorporated, models kept variable parts of unexplained variance, depending on the species, which could correspond to a realistic estimate of the maximum predictable environmental determinism one might achieve in such models.
These findings pave the way toward using the most recent remote sensing technologies and advanced field measurements for developing spatially-explicit environmental predictors 14,50,51 , ultimately enabling plant SDMs to make a critical step forward. The best spatial resolution to use for this, however, still needs further investigations, as it likely depends on the modelled species. Another important research perspective is to evaluate whether the trend toward obtaining increasingly explanatory and predictive models at increased spatial resolution might come to a halt if more residual stochasticity is found as one moves toward very high spatial resolutions 12,52 .

Material and Methods
Case study area and vegetation data. We tested this new SDM approach based on EIVs in the Alps of the Vaud canton in Switzerland (46°10′ to 46°30′N; 6°60′ to 7°10′E), covering an area of ca. 700 km 2 . Elevation ranges from 375 to 3210 m and the vegetation has been and is still influenced by human activity with a mix of pastures and forests from the bottom of the valley up to the treeline at ca. 2100 m.
Our dataset contained 3988 vegetation plots with both forested areas 53 (3076 plots; including clearings and very lose stands of trees at the treeline) and open grasslands 54 (912 plots). The vegetation plots covered an area of 314 m 2 (r = 10 m) in the forests and of 64 m 2 (8 × 8 m) in the open grasslands. Although, the two subsets had different plot sizes and sampling strategies (grid based for forest, random stratified for grasslands), we pooled all available data together to cover a maximum of plant species and habitats across the study area (for details on sampling see Hartmann et al. 53 and Randin et al. 54 ). To ensure that this pooling did not affect our tests, all analyses were also conducted separately for the two datasets (see Appendix S1).
In total, 1096 plant species were recorded but only 397 (35%) had more than 50 presence records throughout all plots, and were selected for modelling (see Appendix S2 for species list). This study thus concerned mainly the "most frequent" species with a regional prevalence higher than 0.012 (1.2%). (2019) 9:3061 | https://doi.org/10.1038/s41598-019-39133-1 www.nature.com/scientificreports www.nature.com/scientificreports/ environmental data. Based on 30 years (1961Based on 30 years ( -1990 of meteorological data from national weather stations, fourteen climatic variables (Table S5) were calculated according to the method described in Zimmermann & Kienast 44 and Zimmerman et al. 55 . A digital elevation model (DEM) was used to spatially interpolate the climatic data and to create topographic variables (Table S5). Additionally, we used a soil pH map based on Buri et al. 43 . All environmental data had a spatial resolution of 25 m, a resolution similar to the areas used to calculate the ecological indicator values (EIV; see below). This allowed a straightforward comparison of different sets of predictors composed from environmental variables and EIVs.
For our species distribution models (SDMs) we selected the following six environmental variables: annual mean temperature, annual sum of precipitation, annual sum of radiation, topographic position, slope and soil pH. These variables were selected because: (1) they are expected to be of high ecophysiological significance 6,9,56 as they cover the main factors necessary for plant life 57 (i.e., temperature, water, light, soil); (2) are not highly correlated (<0.5 except for annual temperature and annual sum of precipitation, Table S6); and (3) have already been used in numerous studies dealing with large numbers of different species and taxonomic groups 14,28,40,58,59 .
To analyze the correlation of the ecological indicator values and the bio-climatic predictors, we used an extended set of 17 widely available environmental variables (Table S5) 19 . Landolt's EIVs are largely similar to the more renown ones proposed by Ellenberg et al. 18 , but specifically adapted to the flora and environmental conditions of Switzerland. There are a large number of EIVs (for details see Landolt et al. 19 ), but in this study we focused on six of the most commonly used, namely: T for temperature;, M for soil moisture/water availability;, L for light;, K for continentality;, R for soil pH; and N for soil nutrients.
Based on the species list of each vegetation plot, we calculated the mean Landolt EIV per plot/site (site EIV) according to Landolt et al. 19 . We decided to calculate the site EIV as averaged EIV across all species present at a given site not weighted by abundance (see Diekmann 25 for a discussion of weighted versus unweighted averaged values) because (1) abundance data was not available for all sites and (2) our models (SDMs) only use presence/ absence information. To prevent any circularity in the conclusion, the focal species (being modelled) was always excluded in the calculation of site EIV, resulting in slightly different predictor values for each species and plot 24 .

Testing the explanatory power of indicator values in SDMs.
We used the set of our six classical environmental variables (annual mean temperature, annual sum of precipitation, annual sum of radiation, topographic position, slope and soil pH) as a "reference model" (M_ref; Table 1). To test the alternative predictive power of site EIVs, we used a framework with eight different sets of predictors for our SDMs (Fig. 1), replacing classical environmental variables by site EIVs (M_T, M_Soil, M_R, M_N, M_F, M_K, M_L, M_EIV; Table 1). Additionally, based on our correlation analysis we choose the set of the six bio-climatic predictors most strongly correlated with the six ecological indicator values (M_cor; Table 1). This set of predictors should minimize the difference between environmental variables and site EIV-based predictions by maximizing cross-correlations and should therefore represent a potentially optimal set of bio-climatic predictors.
All SDMs were run with the R package biomod2 60,61 in R 3.1.2 62 using generalized linear models (GLM), generalized additive models (GAM), surface range envelope (SRE), random forest (RF) and maximum entropy (MAXENT 63 ) as modelling technique. We selected these techniques as they are the most commonly used in the literature and therefore allow representative results. We decided not to use an ensemble modelling approach 64 as we were mostly interested in potential differences among model inputs rather than in obtaining the most accurate predictions.
To evaluate the models, we split the data into 70% as calibration data and 30% as evaluation data, the latter subset being never used in any step of the model building, therefore allowing independent "external" evaluation. Additionally, the calibration data set was split again randomly 10 times into 70%/30% to add internal evaluation (by repeated cross-validation). In both internal and external evaluation, we calculated the area under the curve 65 (AUC), true skill statistic 66 (TSS), Cohens kappa 67 (KAPPA), the proportion of correct predictions (Accuracy) and (for GLM and GAM models only) the explained deviance 6 (D 2 ). We also compared the variable importance, estimated by permutation tests 60 , among all the different sets of predictors using a pairwise-wilcoxon-test with a holm correction for p-values. All the evaluation metrics presented in this study were calculated on the 30% of independent "external" evaluation data.
Testing whether EIV could not simply be more complex combinations of the environmental variables. Additional to the field recorded site EIVs, we also used several techniques (GLM, GAM, GBM and RF) to model site EIVs in space based on the same bio-climatic predictors used for the SDMs (Table S5). This allowed us to assess if site EIVs can be modelled from the same environmental variables used in SDMs (using perhaps some more complex relationships), and thus do not bring much additional explanatory power than these environmental variables themselves, or if they provide new information that cannot be accounted for, even by more complex combinations of the environmental variables. To assess the model accuracy of the predicted site EIVs, we calculated the Pearson correlation and root square mean error of the predicted and field recorded site EIVs based on a 25-fold cross-validation (70/30% split).