## Background & Summary

A major challenge in ecology is to understand the processes underlying community assembly and biodiversity patterns across space1,2. Over the three last decades, trait-based research, by taking up this challenge, has drawn increasing interest3, in particular with the aim of predicting biodiversity response to environment. In community ecology, it has been equated to the ‘Holy Grail’ that would allow ecologists to approach the potential processes underlying metacommunity patterns4,5,6,7. In macroecology, it is common to study biodiversity variation through its taxonomic and functional facets along gradients of environmental drivers8,9,10. In biodiversity-ecosystem functioning research, trait-based diversity measures complement taxonomic ones to predict ecosystem functions11 offering early-warning signs of ecosystem perturbation12.

The topic of Trait-Environment Relationships (TER) has been extensively studied across the globe and across the tree of life. However, each study deals with a specific system, taxonomic group, and geographic region and uses different methods to assess the relationship between species traits and the environment. As a consequence, we do not know how generalizable apparent relationships are, nor how they vary across ecosystems, realms, and taxonomic groups. In addition, while there is an emerging synthesis about the role of traits for terrestrial plant communities13,14, we know much less about other groups and ecosystem types.

To address these gaps, we introduce the CESTES database - a global database for metaCommunity Ecology: Species, Traits, Environment and Space. This database assembles 80 datasets from studies that analysed empirical multivariate trait-environment relationships between 1996 (the first multivariate study of TER15) and 2018. All considered datasets include four data matrices (Fig. 1): (i) community data (species abundances or presences/absences across multiple sites), (ii) species traits (sensu lato), (iii) environmental variables across sites, and (iv) spatial coordinates. The database is global in extent and covers different taxonomic groups, ecosystem types, levels of human disturbance, and spatial scales (Fig. 2).

Several global trait databases already exist or are emerging, such as the Open Traits working group16, the Freshwater Information Platform and its Taxa and Autecology Database for Freshwater Organisms17, the PREDICTS database for Projecting Responses of Ecological Diversity In Changing Terrestrial Systems18,19, and the TRY20 plant trait database for Quantifying and scaling global plant trait diversity. In comparison to these initiatives, the CESTES database has several unique features. Specifically, it maintains the original matching between the community, environmental, and spatial data that go along with the trait information. Keeping this original matching of the data ensures homogeneity in the data structure and allows for targeted analyses of TER. We include all taxonomic groups for which the appropriate matrices are available including groups poorly represented in most trait compilations (e.g., invertebrates and bats). The trait information is particularly diverse, ranging from life-history and morphological to trophic traits, dispersal abilities and tolerances, and covering various ecological mechanisms. CESTES only includes data where georeferenced coordinates, or relative coordinates of the sampling sites (hereafter: spatial coordinates) and environmental variables are available to enable spatial and scaling community analyses. We prioritized studies with abundance or biomass data (as opposed to presence/absence) to facilitate the calculation of a broad range of biodiversity metrics and the study of different facets of biodiversity. The data available in CESTES are open access without restriction, except via citation of this paper (and any original paper that plays a particularly important role in the analyses). Importantly, the CESTES database is meant to be a live database21: it will be maintained in the future and new datasets will be added as they become available.

The CESTES database aims to significantly contribute to research in biogeography, macroecology (including in complement with phylogenies), community and metacommunity ecology, and biodiversity-ecosystem functioning. On the one hand, the quality of its content and structure will allow meta-analyses and syntheses (e.g., the role of taxonomic and functional diversity in spatial patterns of communities). On the other hand, specific datasets will enable the exploration of new questions on a given group, realm, or type of ecosystem.

## Methods

### Data compilation

#### Database scoping

The rationale for developing the CESTES database is generally for the study of TER in relation to metacommunity ecology and/or macroecological questions. As such, we focussed on datasets that were appropriate within the metacommunity or macroecology context (i.e. species assemblages distributed across space) and that focussed on traits to understand biodiversity patterns and responses. This prerequisite led us to identify multivariate trait-based studies as the most relevant and rich source of datasets that could fulfil these two requirements.

Given the complexity that still pertains to trait typology13, we did not restrict ourselves to any specific definition of traits and integrated all possible species characteristics if they were used as “traits” in the original study. We thus included ecophysiological, functional, life-history and biological traits, as well as response and effects traits. CESTES users can select traits according to their study needs.

We identified eligible datasets based on two strategies: 1. Literature search, aiming to initiate the database construction along a structured workflow, 2. Networking, aiming to extend the database and open the sharing possibilities, if the datasets fulfilled the CESTES requirements.

The main condition for dataset eligibility was that the TER was the focus of the study and data use. This ensured that: 1. the trait and the taxonomic information were collected from similar biogeographic areas (minimizing mismatches between the geographic origins of trait and taxonomic data), 2. the sampled sites were associated with contextual environmental information that was relevant to the community and traits under study.

#### Literature search

We searched for multivariate trait-based studies published between 1996 and 2018 via a systematic literature search on the Clarivate Analytics Web of Science Core database. Following Leibold & Chase2, we focussed on studies that included (in any of their contents) the following terms (including spelling variations): “RLQ”15 and “fourth-corner”22,23 because both of them are the predominant methods of multivariate trait-based analyses in ecology24. The “RLQ” refers to a co-inertia analysis that summarizes the overall link between the three matrices of species abundances/presences-absences (L), species traits (Q) and environment (R). The “fourth-corner” refers to a permutation analysis of these three matrices that tests individual trait-environment relationships. The use of RLQ and fourth-corner analyses on the datasets ensures that all of them: 1. are multivariate and include both several species, several traits, and several sites (potentially including spatial information) to align with a metacommunity-like structure, 2. have a comparable structure and can be used in comparative analyses and syntheses.

The search query was:

ALL = (“fourth-cornerORfourth cornerORfourthcornerORRLQ”)

This search resulted in 368 papers.

Note that the “fourth corner” term more generally and commonly refers to the widely studied question of the links between trait and environment variations22. Most studies that look at TER, regardless of the method of analysis they use, would often acknowledge the historical background of their question by referring in their paper to the “fourth corner problem”. Consequently, by including the “fourth corner” search term, we identified eligible multivariate datasets that were not necessarily analysed by fourth corner analysis/RLQ, but also by e.g. trait-based generalized linear/additive models25,26. However, although this literature search strategy was well suited for identifying sources of multivariate datasets, it could appear as too specific. In order to relax the constraints due to this specificity, we complemented the data search by a networking strategy (see Networking section).

#### Scanning strategy

Among the 368 studies resulting from the literature search, we scanned through the Introduction and Methods sections. We selected the studies that used at least the three matrices of species abundances, or presences/absences across multiples sites (“comm”), corresponding environment information across sites (“envir”), and species trait information (“traits”). At first, we prioritized datasets that had spatial coordinates of the sampling sites (“coord”) because the spatial aspect is crucial for metacommunity research2. Spatial coordinates, or the relative locations, could sometimes be reconstructed from the maps presented in the publications. Review and opinion papers, medical and simulation studies were not considered. Following this filter, we identified a subset of 105 eligible datasets.

#### Networking

The network strategy took place in parallel to the data search and relied on both formal and informal communications and exchanges with colleagues through conferences, workshops, group meetings, emails, etc. This allowed us to identify new data providers, or new datasets that we had not found via the earlier literature search. From this networking, we identified an additional set of 34 potentially eligible datasets.

#### Dataset collection and request

From the total of 139 eligible datasets, 7.2% of the datasets were available on the online supplementary materials of the publication. These were downloaded and formatted for CESTES’ purposes.

When the datasets were not directly available, we sent a data request via email. In order to launch the CESTES database in a reasonable amount of time, we had to set time limits for the request phase, namely between January and August 2018. As a result, in total 96 authors were contacted, of whom 58% shared their data. In terms of datasets, more than 50% of the eligible datasets were shared and complete (Fig. 3). We also received ‘spontaneous’ datasets that were not part of our initial request, but fulfilled CESTES’ requirements and were thus included in the database. Out of the final complete 80 datasets, 55 were obtained via the literature search, and 25 were obtained from the networking strategy.

Because we received 10 valuable datasets that had no spatial coordinates, we decided to open the ceste subsection of the CESTES database and populate it with these specific datasets. Some of them could be upgraded to CESTES database when the authors are able to provide the coordinates.

### Data processing

#### Dataset checking, cleaning and formatting

We downloaded and received datasets in various formats (.doc, .pdf, .csv, .RData, .txt, .shp, etc.). Following Broman & Woo27, we harmonized and gathered them in Excel files, one file per dataset. This was the most convenient storage format for creating multiple sheets (community, traits, environment, coordinates), handling heterogeneous types of information, and building metadata specific to each dataset. This storage solution also facilitated visual checking and cleaning of the data records.

CESTES provides both the processed and the unprocessed (i.e. original) datasets. The processed datasets include “comm”, “traits” and “envir”, i.e. with no empty sites, no “ghost” species (i.e. species that are recorded in none of the sites of the study area), and no NAs (Not Available information) in the matrices. NA removal was based on a compromise in the relative frequency of NAs in the rows and columns of each table; when too many sites compared to the sample size (e.g. >50% of the sites) had NAs for one single variable, this variable was removed, whereas when there were some sites (e.g. <30% of the sites) showing NAs for more than one variable, we removed those sites instead of removing the variables. Since CESTES is primarily designed for trait-based analyses, we removed a trait when it included too many NAs across species (i.e. when the trait value was NA for more than 50% of the species in the community). Similarly, we removed species for which no, or too incomplete trait information was available (i.e. when keeping the species would have implied to lose several traits). This was the case for 29 datasets out of the 80. The number of species removed varied from 1 to 209 species (mean = 27, median = 10, sd = 45) that represented from 1 to 72% of the initial species pool (mean = sd = 17%). (Note that this high maximum value is due to only one single dataset where trait data were exceptionally limiting and implied to remove an important number of species without trait information).

When this overall cleaning procedure implied removing any of the species, traits, or environmental variables, we kept the information of the original unprocessed tables within the Excel file in separate sheets. We named these sheets “commfull”, “traitsfull” and “envirfull”, respectively. Thus, the user can either directly use the processed sheets (“comm”, “traits” and “envir”), or the original ones and apply any other filtering strategies. In doing so, we make sure that CESTES is flexible depending on the users’ goals and needs.

Cleaning steps that altered the original dataset (other than formatting) are reported in the “Notes” sheet so that the user can trace back what has been done over the data processing.

When the data included several temporal horizons (sampling years, or seasons treated as different replicates in the original publication), we split them into different datasets for each time horizon to facilitate further analyses. This explains why several datasets can correspond to one single study area (see Online-only Table 1 attached to this manuscript, and the Data Records section).

All the entries from the four data sheets - “comm”, “trait”, “envir” and “coord” - were listed and described in a “DataKey” sheet to describe the tables’ content (Fig. 4). This required a thorough examination of the original papers to extract the relevant information for every dataset. In several cases, we required additional exchanges with the data owners for clarifications. Any empty cell in the “DataKey” sheet reflects a lack of information. Importantly, this sheet should not substitute for reading of the original paper and we strongly recommend the users to thoroughly examine each paper before using the data (see Online-only Table 2).

## Data Records

### Storage and formats used

We stored the CESTES database via three different storage systems and two types of formats to provide the users with several alternatives in accessing and using the data.

Storage

Formats

A zipped folder called “CESTES.zip” includes two alternative formats for the CESTES database:

• a “xCESTES” folder that includes 80 Excel files (one file per dataset), each named according to the following structure: “AuthorPublicationYear.xlsx”.

• a “rCESTES” folder that includes the CESTES core processed database (comm, traits, envir, coord matrices) as an R list object “CESTES.RData” plus two R scripts, and two metadata tables for data processing and exploration (see Usage Notes section).

The “CESTES.zip” folder also includes:

• an extended metadata table, “CESTES_metadata.xlsx”, that provides the general metadata information of all the datasets (i.e., combining the information from the Online-only Tables 12 of this Data Descriptor)

• a tutorial document, “HOW_TO_SHARE_MY_DATA_FOR_CESTES.pdf”, that explains how to share data for integrating future datasets in the database (see Supplementary File 1).

The original, unprocessed files as they were provided by the data owners (thus possibly in different formats, various structures, with possible mistakes, without metadata, etc.), are available by request to the corresponding author, AJ.

We would also encourage any potential data contributors to contact AJ with possible data (cestes@idiv.de). The specific guidelines on which types of data are eligible to integration in CESTES, and on how to structure the data, and metadata are provided in the Supplementary File 1, as well as at: https://icestes.github.io/sharedata. If the dataset fulfils CESTES’ requirements and is provided in the right format with the appropriate metadata information, it will be included in the database. Each time the database is updated through the iDiv Biodiversity portal (https://idata.idiv.de/), a new DOI will be generated for the whole updated database, ensuring the new contributors are acknowledged and become part of the ‘CESTES consortium’. This will allow storing the data on a permanent platform and prevent them from sinking into oblivion30.

### Structure of the database and the datasets

The 80 files currently in CESTES are structured into at least 8 sheets, depending on the original information and specificities of each dataset (Fig. 1).

The first four sheets include the processed core-data themselves:

1. 1.

comm”: matrix of species abundances (68) or presences/absences (12), with species in columns and sites in rows (species are sometimes OTUs in some groups such as phytoplankton, or genus in some groups such as macroinvertebrates, or morphospecies where relevant).

2. 2.

traits”: matrix of species trait information, i.e. any trait, be it functional, biological, life-history traits, either quantitative or categorical, functional group, etc., with traits in columns and species in rows.

3. 3.

envir”: matrix of environmental variables in the broad sense of environment, i.e. any type of biotic and abiotic conditions or habitat characteristics relevant to the community of interest according to the original publication, with variables in columns and sites in rows.

4. 4.

coord”: matrix of spatial coordinates, with X, the longitude and Y, the latitude as columns (in the Geographical Coordinate System as used in the original study) and sites in rows.

In every dataset, a “DataKey” sheet provides a description of all the entries of the four matrices (Fig. 4). Specific comments and information about any alteration applied to the dataset can be found in the “Notes” sheet, e.g. the species, or variables that were removed due to missing information, how the trait values were averaged across species when several measurements were available, how the original dataset was split into several datasets when there were several sampling periods, etc. The contact person for each dataset is also specified at the top of the “Notes” sheet of the dataset.

When the cleaning procedure implied changing the original datasets (see Data processing section above), we kept the information of the unaltered tables within the Excel file in separate sheets: “commfull”, “traitsfull”, “coordfull”, and/or “envirfull”.

The “splist” sheet includes the full list of taxa and the “sitelist” sheet, the list of sites. Both can provide additional information about the species (e.g. taxonomic classification) and the sites (e.g. regional information) when specified by the authors. Note that the species (site) names might not appear in the “splist” (“sitelist”) of all the datasets; this is because some authors preferred to provide their data in a redacted form, for instance, by censoring the species or the site names. As this does not hamper most of the analyses in community ecology, these datasets were integrated in the database.

Finally, when trait information was semi-quantitative and already fuzzy coded, we added a sheet “blo” to specify the Blocks information that is needed for weighting procedures in some trait analyses31.

### Description of the database

The CESTES database includes 80 datasets that cover different areas of the globe, ecosystem types, taxonomic groups, and spatial extents (Fig. 1). An overview of these datasets is presented in the Online-only Table 1.

### ceste, the non-spatial ancillary to CESTES

We provide access to 10 additional datasets that were not completely suitable for the CESTES database, due to the absence of spatial information or insufficient metadata but that were potentially valuable for their three other data matrices (see Online-only Table 3 attached to this manuscript). They follow the same structure as CESTES, except that they do not present the “coord” sheet and sometimes include only partial metadata. Some of the ceste datasets are likely to be enhanced in the near future and upgraded to the CESTES database as soon as they are made complete. ceste is stored in a zipped folder named “ceste.zip” that includes a series of 11 Excel files (10 data files + 1 metadata file) and can be found at the following links:

### CESTES, a live database

The current CESTES database is the starting point of a broader data-sharing project that aims to continue integrating new data as they become available, and as new contributors join the consortium by sharing their data.

In order to maintain the CESTES database in the future, we set up three measures to facilitate the data exchange and communication about the database:

1. 1.

a project website that advertises the database project and fosters data sharing: https://icestes.github.io/,

2. 2.

a tutorial to guide people on how to share their data (Supp. Mat. 1; https://icestes.github.io/sharedata),

3. 3.

a designated email address where people can send their data and ask questions about the CESTES project (cestes@idiv.de).

The data will be checked, curated, and integrated in the database through the iDiv Biodiversity Portal. This will update the database and generate a new DOI for the whole updated database, ensuring the new contributors are acknowledged.

### Citation of the individual datasets and of the database

Each CESTES and ceste dataset (CESTES10,15,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94 and ceste95,96,97,98,99,100,101,102) is associated with reference(s) that should be cited in addition to the CESTES database only if a single or few specific datasets are used separately from the database. For instance, if one uses only Villéger’s datasets (Villeger2012a, b, c, d, and e), one would have to cite Villéger’s original study10 (to acknowledge the study antecedence), and the CESTES database28 (because it is through the CESTES database that the structured data and metadata were made available). The list of citations for each dataset is provided in Online-only Table 2. See also CESTES29 for updated versions of the live database and follow the last news about the database via https://icestes.github.io/posts/.

## Technical Validation

The technical validity of the CESTES database relies on five qualities pertaining to the datasets, and the overall database: the datasets (1) have individually been subject to peer-review process, (2) have reliable sampling properties, (3) have been thoroughly checked and cleaned, are ready-to-use for analyses and accompanied with metadata information; and the database (4) has a wide taxonomic and geographical coverage, and (5) will keep on extending in the future.

### Peer-reviewed data and TER relevance

All the datasets included in CESTES had already been the subject of publication(s) in peer reviewed scientific journals, or PhD theses (see Online-only Table 2). Therefore, each of the dataset has already received technical validation through both analysis and evaluation. In addition, since the focus of those studies was the species trait-environment relationships, the choice of the traits and environmental variables has already been the result of scientific reflection by the authors about the potential relevance of these variables with respect to the ecological context and the scale of study.

### Reliable sample properties

The datasets include an average of 71 sites, 72 species, 12 environmental variables, and 14 traits (Fig. 5 & Online-only Table 1). In the particular context of fourth-corner analysis, Dray & Legendre23 showed that datasets with fewer than 30 species need to have substantially more than 30 sites in order for this multivariate method to perform well and detect existing TER. If we refer to the thresholds their simulation study found, we can say that 75% of the CESTES datasets can support multivariate analyses of a very good to good statistical power, i.e. have a Type II error risk of less than 10% (Fig. 6). The remaining 25% fall not far from the 30% limit, meaning that the risk of failing to detect significant TER although these exist is 30%. For these datasets, the users might need to be cautious if they intend to apply fourth-corner analysis and might need to consider other methods.

### Thorough data quality control and harmonized metadata information

Each dataset has been the subject of thorough manual checking for matching between site, species and traits number and identity across the four data matrices, the absence of empty sites, “ghost” species, NAs, and the consistency between the data received and the original publication. All dataset alterations that we applied are reported in the “Notes” sheet of the dataset file. Therefore, we offer a clean database while keeping full transparency on the steps taken to this end. A large effort was also invested in reporting precise and harmonized metadata information for every dataset (Fig. 4) so that CESTES users can easily have a full overview of the type of data.

### Wide taxonomic and geographical coverage

The geographical coverage of the database is global (Fig. 1) although as with most global compilations, there are clearly some regions of the world that are underrepresented, including Africa, Asia, India, Russia, which is typical103. To evaluate whether we might have missed potential datasets e.g., in Russia, we ran an additional literature search across a set of international journals specialised in Russian ecology (namely, Russian Journal of Ecology, Russian Journal of Marine Biology and Russian Journal of Biological Invasions). We used as search terms the simple association of “traits” AND “environment”. We found very few trait-based studies, only five studies that were relevant to the CESTES database, and only one that had spatial coordinates and agreed to share their data. In addition, we found a dearth of studies from North America. One reason for this was that a related database project focussing on plants primarily in the Americas has recently emerged104. Authors of plant datasets from these regions were less likely to respond to data request (only 17 out of 35 responded). This might also explain why, although our criteria covered all realms and taxonomic groups, we have a relatively low representation of plant data (i.e., less than 30% of our database, Fig. 2b) despite their large representation in other databases.

Importantly, the CESTES database is upgradeable. Anyone interested to share data of this type and contribute is welcome to do so by contacting the corresponding author (AJ, cestes@idiv.de). We will maintain and add to the CESTES database in the future to enrich it with more taxa, ecosystem types, and locations in the world. On the longer term, three possible and very valuable extensions could include (1) individual-based traits measurements, since intraspecific variation in traits can be highly valuable to study environmental tracking7,105,106, (2) temporal measurements of both traits, environment, and species abundances in order to study the temporal variability of the TER107, (3) phylogenies corresponding to the recorded groups that would help draw inferences from community108 or macroecological patterns8. This would allow extending CESTES’ potential for synthesis work aiming to bridge metacommunity ecology, macroecology, and biodiversity-ecosystem functioning research.

## Usage Notes

In complement to the Excel version of CESTES, the database has also been stored as an.RData object to facilitate its further use for analyses in R109. This object is a list of lists. Each element of the first order list refers to one dataset, which itself is a list of four matrices; $comm,$envir, $traits and$coord.

First, this allows additional data processing. We set up R code routines (“CESTES_DataPrep.R”) that perform a thorough checking of the matrices, especially the match between the matrices’ dimensions, and the absence of empty sites, “ghost” species, and NAs.

The trait, environment, species, and coordinates variables were of mixed types (binary, categorical, ordinal, quantitative). To make the datasets properly readable and analysable by R, we made sure the numerical variables were treated as such by the program. We also re-coded the binary variables into 0/1 (numeric), the character and nominal variables into factors (this option can be turned off in the function), and, when relevant (i.e. when made explicit in the original publication), the numeric integer variables into ordinal variables (ordered factors).

Our R code routines generate data reports and send them to the working directory in the form of .txt files. These give the user different information on the “comm”, “traits”, “envir” and “coord” components of each dataset: list of variables and their types (factor, numeric, etc.), number of sites, species and traits, minimum and maximum value of the community data (that allows checking e.g., whether data are abundances or presences/absences).

The R code also applies some data transformation (e.g. scaling, Hellinger transformation, variable orthonormalization), generates spatial predictors (e.g. Moran Eigenvector Maps) and calculates some usual trait diversity metrics (e.g. Community Weighted Means, functional dispersion).

For all these processing, transformation and calculations, we used several R packages, available on CRAN, mainly readxl, doBy, plyr, dplyr, and gdata for the data processing, vegan, ade4, stats for the data transformation, ape and adespatial for the spatial processing, and picante and FD for the calculation of functional indices109.

All the R functions coded and used for the data preparation are provided in an R script “CESTES_DataPrep.R”. A fully processed and “ready-to-use” version of the CESTES database is stored as an .RData object called “CESTES.RData”.

Finally, further data plotting and metadata exploration are made possible via the R script “CESTES_Plots.R”, and the two metadata .csv files (“ListDat.csv”, “Metadat.csv”).

All these files (namely the database, the two R scripts, and the two metadata files) are stored in a zipped folder called “rCESTES.zip” in the “CESTES” folder at the following links:

The flexibility of the iDiv Biodiversity Portal storage will allow us to keep updating, extending and sustaining the CESTES database and the R scripts in the future.