A global database for metacommunity ecology, integrating species, traits, environment and space

The use of functional information in the form of species traits plays an important role in explaining biodiversity patterns and responses to environmental changes. Although relationships between species composition, their traits, and the environment have been extensively studied on a case-by-case basis, results are variable, and it remains unclear how generalizable these relationships are across ecosystems, taxa and spatial scales. To address this gap, we collated 80 datasets from trait-based studies into a global database for metaCommunity Ecology: Species, Traits, Environment and Space; “CESTES”. Each dataset includes four matrices: species community abundances or presences/absences across multiple sites, species trait information, environmental variables and spatial coordinates of the sampling sites. The CESTES database is a live database: it will be maintained and expanded in the future as new datasets become available. By its harmonized structure, and the diversity of ecosystem types, taxonomic groups, and spatial scales it covers, the CESTES database provides an important opportunity for synthetic trait-based research in community ecology.

www.nature.com/scientificdata

We identified eligible datasets based on two strategies: 1. Literature search, aiming to initiate the database construction along a structured workflow, 2. Networking, aiming to extend the database and open the sharing possibilities, if the datasets fulfilled the CESTES requirements.
Scanning strategy. We scanned the Introduction and Methods sections of the 368 studies resulting from the literature search. We selected the studies that used at least the three matrices of species abundances, or presences/absences across multiple sites ("comm"), corresponding environment information across sites ("envir"), and species trait information ("traits"). At first, we prioritized datasets that had spatial coordinates of the sampling sites ("coord") because the spatial aspect is crucial for metacommunity research 2. Spatial coordinates, or the relative locations, could sometimes be reconstructed from the maps presented in the publications. Review and opinion papers, medical and simulation studies were not considered. Following this filter, we identified a subset of 105 eligible datasets.
Networking. The network strategy took place in parallel to the data search and relied on both formal and informal communications and exchanges with colleagues through conferences, workshops, group meetings, emails, etc. This allowed us to identify new data providers, or new datasets that we had not found via the earlier literature search. From this networking, we identified an additional set of 34 potentially eligible datasets.
Dataset collection and request. Of the 139 eligible datasets, 7.2% were available in the online supplementary materials of the publication. These were downloaded and formatted for CESTES' purposes. When the datasets were not directly available, we sent a data request via email. In order to launch the CESTES database in a reasonable amount of time, we had to set time limits for the request phase, namely between January and August 2018. As a result, 96 authors were contacted in total, of whom 58% shared their data. In terms of datasets, more than 50% of the eligible datasets were shared and complete (Fig. 3). We also received 'spontaneous' datasets that were not part of our initial request, but fulfilled CESTES' requirements and were thus included in the database. Out of the final 80 complete datasets, 55 were obtained via the literature search, and 25 were obtained from the networking strategy.
Because we received 10 valuable datasets that had no spatial coordinates, we decided to open the ceste subsection of the CESTES database and populate it with these specific datasets. Some of them could be upgraded to the CESTES database once the authors are able to provide the coordinates.
Data processing. Dataset checking, cleaning and formatting. We downloaded and received datasets in various formats (.doc, .pdf, .csv, .RData, .txt, .shp, etc.). Following Broman & Woo 27, we harmonized and gathered them in Excel files, one file per dataset. This was the most convenient storage format for creating multiple sheets (community, traits, environment, coordinates), handling heterogeneous types of information, and building metadata specific to each dataset. This storage solution also facilitated visual checking and cleaning of the data records.
(2020) 7:6 | https://doi.org/10.1038/s41597-019-0344-7
CESTES provides both the processed and the unprocessed (i.e. original) datasets. The processed datasets include "comm", "traits" and "envir" with no empty sites, no "ghost" species (i.e. species that are recorded in none of the sites of the study area), and no NAs (Not Available information) in the matrices. NA removal was based on a compromise between the relative frequency of NAs in the rows and in the columns of each table: when a large share of the sites (e.g. >50% of the sites) had NAs for one single variable, this variable was removed, whereas when a smaller share of the sites (e.g. <30% of the sites) had NAs for more than one variable, we removed those sites instead of removing the variables. Since CESTES is primarily designed for trait-based analyses, we removed a trait when it included too many NAs across species (i.e. when the trait value was NA for more than 50% of the species in the community). Similarly, we removed species for which trait information was absent or too incomplete (i.e. when keeping the species would have meant losing several traits). This was the case for 29 datasets out of the 80. The number of species removed varied from 1 to 209 species (mean = 27, median = 10, sd = 45), which represented from 1 to 72% of the initial species pool (mean = sd = 17%). (Note that this high maximum value is due to a single dataset where trait data were exceptionally limiting and required removing a large number of species without trait information.)
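The NA-handling rule described above can be illustrated with a minimal sketch. It is written in Python for illustration and is not the authors' code; the function name `drop_na`, the table layout (row name mapped to a dictionary of column values) and the single 50% cut-off are assumptions, and the original procedure was a case-by-case compromise rather than a fixed algorithm.

```python
from typing import Optional

# Hypothetical layout: row name -> {column name -> value, or None for NA}
Table = dict[str, dict[str, Optional[float]]]


def drop_na(table: Table, max_col_na: float = 0.5) -> Table:
    """Drop columns whose NA share exceeds max_col_na, then drop rows still holding NAs."""
    rows = list(table)
    cols = list(next(iter(table.values()))) if rows else []
    # Step 1: a variable (column) that is NA in too many rows is removed.
    keep_cols = [
        c for c in cols
        if sum(table[r][c] is None for r in rows) / len(rows) <= max_col_na
    ]
    reduced = {r: {c: table[r][c] for c in keep_cols} for r in rows}
    # Step 2: the remaining rows (sites or species) with any NA are removed.
    return {r: vals for r, vals in reduced.items() if None not in vals.values()}


# Toy "envir" table: "depth" is NA in 3 of 4 sites, "pH" in 1 of 4 sites.
envir_raw = {
    "s1": {"pH": 7.1, "depth": None},
    "s2": {"pH": 6.8, "depth": None},
    "s3": {"pH": None, "depth": 2.0},
    "s4": {"pH": 7.4, "depth": None},
}
envir_clean = drop_na(envir_raw)
```

In this toy case the "depth" variable is dropped first, and site "s3" is then dropped because its pH is missing. The same function could be applied to a species-by-traits table to mimic the 50% rule for traits.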
When this overall cleaning procedure implied removing any of the species, traits, or environmental variables, we kept the information of the original unprocessed tables within the Excel file in separate sheets. We named these sheets "commfull", "traitsfull" and "envirfull", respectively. Thus, the user can either directly use the processed sheets ("comm", "traits" and "envir"), or the original ones and apply any other filtering strategies. In doing so, we make sure that CESTES is flexible depending on the users' goals and needs.
Cleaning steps that altered the original dataset (other than formatting) are reported in the "Notes" sheet so that the user can trace what was done during data processing.
When the data included several temporal horizons (sampling years, or seasons treated as different replicates in the original publication), we split them into different datasets for each time horizon to facilitate further analyses. This explains why several datasets can correspond to one single study area (see Online-only Table 1 attached to this manuscript, and the Data Records section).
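As a rough illustration of this splitting step, the sketch below groups records by sampling period and returns one subset per period. It is a Python sketch under stated assumptions: the record layout, the function `split_by_period` and the naming scheme (base name, underscore, period label) are hypothetical and do not reflect how the CESTES files are actually named.

```python
from collections import defaultdict


def split_by_period(site_records, base_name):
    """Group site records by sampling period; one output dataset per period."""
    groups = defaultdict(list)
    for record in site_records:
        groups[record["period"]].append(record)
    # Hypothetical naming scheme: base name + "_" + period label.
    return {f"{base_name}_{period}": rows for period, rows in sorted(groups.items())}


records = [
    {"site": "s1", "period": "2010", "abundance": 4},
    {"site": "s1", "period": "2011", "abundance": 7},
    {"site": "s2", "period": "2010", "abundance": 0},
]
split = split_by_period(records, "Example")
```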
Metadata preparation. All the entries from the four data sheets ("comm", "traits", "envir" and "coord") were listed and described in a "DataKey" sheet to describe the tables' content (Fig. 4). This required a thorough examination of the original papers to extract the relevant information for every dataset. In several cases, we needed additional exchanges with the data owners for clarification. Any empty cell in the "DataKey" sheet reflects a lack of information. Importantly, this sheet should not substitute for reading the original paper, and we strongly recommend that users thoroughly examine each paper before using the data (see Online-only Table 2).

Data records
Storage and formats used. We stored the CESTES database via three different storage systems and two types of formats to provide the users with several alternatives in accessing and using the data.
Storage formats. A zipped folder called "CESTES.zip" includes two alternative formats for the CESTES database:
• a "xCESTES" folder that includes 80 Excel files (one file per dataset), each named according to the following structure: "AuthorPublicationYear.xlsx".
• a "rCESTES" folder that includes the CESTES core processed database (comm, traits, envir, coord matrices) as an R list object "CESTES.RData" plus two R scripts, and two metadata tables for data processing and exploration (see Usage Notes section).
Incomplete data mainly refer to the datasets that had no spatial coordinates (ceste), included unsolved issues, or provided insufficient metadata information. ("Agreed but did not share" refers to authors who replied positively to the first request but then never sent their data despite reminders because, e.g., they did not find time to prepare the data.)
The "CESTES.zip" folder also includes:
• an extended metadata table, "CESTES_metadata.xlsx", that provides the general metadata information of all the datasets (i.e., combining the information from the Online-only Tables 1-2 of this Data Descriptor)
• a tutorial document, "HOW_TO_SHARE_MY_DATA_FOR_CESTES.pdf", that explains how to share data for integrating future datasets in the database (see Supplementary File 1).
The original, unprocessed files as they were provided by the data owners (thus possibly in different formats, various structures, with possible mistakes, without metadata, etc.), are available by request to the corresponding author, AJ.
We would also encourage any potential data contributors to contact AJ with possible data (cestes@idiv.de). The specific guidelines on which types of data are eligible for integration in CESTES, and on how to structure the data and metadata, are provided in Supplementary File 1, as well as at: https://icestes.github.io/sharedata. If the dataset fulfils CESTES' requirements and is provided in the right format with the appropriate metadata information, it will be included in the database. Each time the database is updated through the iDiv Biodiversity portal (https://idata.idiv.de/), a new DOI will be generated for the whole updated database, ensuring the new contributors are acknowledged and become part of the 'CESTES consortium'. This will allow storing the data on a permanent platform and prevent them from sinking into oblivion 30.
Structure of the database and the datasets. The 80 files currently in CESTES are structured into at least 8 sheets, depending on the original information and specificities of each dataset (Fig. 1).
The first four sheets include the processed core-data themselves:
1. "comm": matrix of species abundances (68) or presences/absences (12), with species in columns and sites in rows (species are sometimes OTUs in some groups such as phytoplankton, or genus in some groups such as macroinvertebrates, or morphospecies where relevant).
2. "traits": matrix of species trait information, i.e. any trait, be it functional, biological, life-history traits, either quantitative or categorical, functional group, etc., with traits in columns and species in rows.
3. "envir": matrix of environmental variables in the broad sense of environment, i.e. any type of biotic and abiotic conditions or habitat characteristics relevant to the community of interest according to the original publication, with variables in columns and sites in rows.
4. "coord": matrix of spatial coordinates, with X, the longitude and Y, the latitude as columns (in the Geographical Coordinate System as used in the original study) and sites in rows.
In every dataset, a "DataKey" sheet provides a description of all the entries of the four matrices (Fig. 4). Specific comments and information about any alteration applied to the dataset can be found in the "Notes" sheet, e.g. the species or variables that were removed due to missing information, how the trait values were averaged across species when several measurements were available, how the original dataset was split into several datasets when there were several sampling periods, etc. The contact person for each dataset is also specified at the top of the "Notes" sheet of the dataset.
Fig. 4 "DataKey" structure and example of metadata information in CESTES datasets. A description is given when the variable full name is not self-explanatory or when potentially relevant information was available. Possible empty cells are due to lack of information that could not be recovered from the original publication nor from the data owners.
When the cleaning procedure implied changing the original datasets (see Data processing section above), we kept the information of the unaltered tables within the Excel file in separate sheets: "commfull", "traitsfull", "coordfull", and/or "envirfull".
The "splist" sheet includes the full list of taxa and the "sitelist" sheet, the list of sites. Both can provide additional information about the species (e.g. taxonomic classification) and the sites (e.g. regional information) when specified by the authors. Note that the species (site) names might not appear in the "splist" ("sitelist") of all the datasets; this is because some authors preferred to provide their data in a redacted form, for instance, by censoring the species or the site names. As this does not hamper most of the analyses in community ecology, these datasets were integrated in the database.
Finally, when trait information was semi-quantitative and already fuzzy coded, we added a sheet "blo" to specify the Blocks information that is needed for weighting procedures in some trait analyses 31 .
Description of the database. The CESTES database includes 80 datasets that cover different areas of the globe, ecosystem types, taxonomic groups, and spatial extents (Fig. 1). An overview of these datasets is presented in the Online-only Table 1.
ceste, the non-spatial ancillary to CESTES. We provide access to 10 additional datasets that were not completely suitable for the CESTES database, due to the absence of spatial information or insufficient metadata, but that were potentially valuable for their three other data matrices (see Online-only Table 3 attached to this manuscript). They follow the same structure as CESTES, except that they do not include the "coord" sheet and sometimes include only partial metadata. Some of the ceste datasets are likely to be enhanced in the near future and upgraded to the CESTES database as soon as they are made complete. ceste is stored in a zipped folder named "ceste.zip" that includes a series of 11 Excel files (10 data files + 1 metadata file) and can be found at the following links:

CESTES, a live database. The current CESTES database is the starting point of a broader data-sharing project that aims to continue integrating new data as they become available, and as new contributors join the consortium by sharing their data.
In order to maintain the CESTES database in the future, we set up three measures to facilitate the data exchange and communication about the database:
1. a project website that advertises the database project and fosters data sharing: https://icestes.github.io/,
2. a tutorial to guide people on how to share their data (Supp. Mat. 1; https://icestes.github.io/sharedata),
3. a designated email address where people can send their data and ask questions about the CESTES project (cestes@idiv.de).
The data will be checked, curated, and integrated in the database through the iDiv Biodiversity Portal. This will update the database and generate a new DOI for the whole updated database, ensuring the new contributors are acknowledged.

Citation of the individual datasets and of the database.
Each CESTES and ceste dataset (CESTES 10,15, and ceste 95-102) is associated with reference(s) that should be cited in addition to the CESTES database only if a single or few specific datasets are used separately from the database. For instance, if one uses only Villéger's datasets (Villeger2012a, b, c, d, and e), one would have to cite Villéger's original study 10 (to acknowledge the study antecedence), and the CESTES database 28 (because it is through the CESTES database that the structured data and metadata were made available). The list of citations for each dataset is provided in Online-only Table 2. See also CESTES 29 for updated versions of the live database and follow the latest news about the database via https://icestes.github.io/posts/.

Technical Validation
The technical validity of the CESTES database relies on five qualities pertaining to the datasets and the overall database: the datasets (1) have individually been subject to a peer-review process, (2) have reliable sampling properties, (3) have been thoroughly checked and cleaned, are ready to use for analyses and are accompanied by metadata information; and the database (4) has a wide taxonomic and geographical coverage, and (5) will continue to be extended in the future.
Peer-reviewed data and TER relevance. All the datasets included in CESTES had already been the subject of publication(s) in peer-reviewed scientific journals, or PhD theses (see Online-only Table 2). Therefore, each of the datasets has already received technical validation through both analysis and evaluation. In addition, since the focus of those studies was the species trait-environment relationships (TER), the choice of the traits and environmental variables has already been the result of scientific reflection by the authors about the potential relevance of these variables with respect to the ecological context and the scale of study.
[…] Table 1). In the particular context of fourth-corner analysis, Dray & Legendre 23 showed that datasets with fewer than 30 species need to have substantially more than 30 sites in order for this multivariate method to perform well and detect existing TER. If we refer to the thresholds their simulation study found, we can say that 75% of the CESTES datasets can support multivariate analyses of very good to good statistical power, i.e. have a Type II error risk of less than 10% (Fig. 6). The remaining 25% fall close to the 30% limit, meaning that the risk of failing to detect significant TER when these exist is close to 30%. For these datasets, the users might need to be cautious if they intend to apply fourth-corner analysis and might need to consider other methods.
Thorough data quality control and harmonized metadata information. Each dataset has been the subject of thorough manual checking for matching between site, species and traits number and identity across the four data matrices, the absence of empty sites, "ghost" species and NAs, and the consistency between the data received and the original publication. All dataset alterations that we applied are reported in the "Notes" sheet of the dataset file.
Therefore, we offer a clean database while keeping full transparency on the steps taken to this end. A large effort was also invested in reporting precise and harmonized metadata information for every dataset (Fig. 4) so that CESTES users can easily have a full overview of the type of data.
Wide taxonomic and geographical coverage. The geographical coverage of the database is global (Fig. 1) although, as with most global compilations, some regions of the world are clearly underrepresented, including Africa, Asia, India and Russia, which is typical 103. To evaluate whether we might have missed potential datasets, e.g. in Russia, we ran an additional literature search across a set of international journals specialised in Russian ecology (namely, Russian Journal of Ecology, Russian Journal of Marine Biology and Russian Journal of Biological Invasions). We used as search terms the simple association of "traits" AND "environment". We found very few trait-based studies: only five studies were relevant to the CESTES database, and only one had spatial coordinates and authors who agreed to share their data. In addition, we found a dearth of studies from North America. One reason for this was that a related database project focussing on plants primarily in the Americas has recently emerged 104. Authors of plant datasets from these regions were less likely to respond to data requests (only 17 out of 35 responded). This might also explain why, although our criteria covered all realms and taxonomic groups, we have a relatively low representation of plant data (i.e., less than 30% of our database, Fig. 2b) despite their large representation in other databases.
Upgradeable and extendable database. Importantly, the CESTES database is upgradeable. Anyone interested in sharing data of this type and contributing is welcome to do so by contacting the corresponding author (AJ, cestes@idiv.de). We will maintain and add to the CESTES database in the future to enrich it with more taxa, ecosystem types, and locations in the world. In the longer term, three possible and very valuable extensions could include (1) individual-based trait measurements, since intraspecific variation in traits can be highly valuable to study environmental tracking 7,105,106, (2) temporal measurements of traits, environment, and species abundances in order to study the temporal variability of the TER 107, (3) phylogenies corresponding to the recorded groups that would help draw inferences from community 108 […] allow extending CESTES' potential for synthesis work aiming to bridge metacommunity ecology, macroecology, and biodiversity-ecosystem functioning research.

Usage Notes
As a complement to the Excel version of CESTES, the database has also been stored as an .RData object to facilitate its further use for analyses in R 109. This object is a list of lists. Each element of the first-order list refers to one dataset, which itself is a list of four matrices: $comm, $envir, $traits and $coord.
First, this allows additional data processing. We set up R code routines ("CESTES_DataPrep.R") that perform a thorough checking of the matrices, especially the match between the matrices' dimensions, and the absence of empty sites, "ghost" species, and NAs.
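The kind of consistency check described here can be sketched as follows. This is an illustrative Python sketch, not the content of "CESTES_DataPrep.R"; the function `check_dataset`, the nested-dictionary layout (sites or species as keys) and the message strings are assumptions, and the NA check is omitted for brevity.

```python
def check_dataset(ds):
    """Return a list of problems found in a dataset with comm, traits, envir and coord tables."""
    problems = []
    comm = ds["comm"]  # hypothetical layout: site -> {species -> abundance}
    sites = set(comm)
    species = {sp for row in comm.values() for sp in row}
    # The site and species sets must match across the four tables.
    if set(ds["envir"]) != sites:
        problems.append("envir and comm list different sites")
    if set(ds["coord"]) != sites:
        problems.append("coord and comm list different sites")
    if set(ds["traits"]) != species:
        problems.append("traits and comm list different species")
    # Empty sites: no individual or presence recorded at the site.
    for site, row in sorted(comm.items()):
        if sum(row.values()) == 0:
            problems.append(f"empty site: {site}")
    # "Ghost" species: recorded in none of the sites.
    for sp in sorted(species):
        if all(row.get(sp, 0) == 0 for row in comm.values()):
            problems.append(f"ghost species: {sp}")
    return problems


toy = {
    "comm": {"s1": {"a": 2, "b": 0}, "s2": {"a": 0, "b": 0}},
    "traits": {"a": {"size": 1.2}, "b": {"size": 0.4}},
    "envir": {"s1": {"pH": 7.0}, "s2": {"pH": 6.5}},
    "coord": {"s1": (10.0, 50.0)},  # "s2" is deliberately missing
}
issues = check_dataset(toy)
```

An empty list would indicate that the toy dataset passes all three checks; here the missing coordinates, the empty site and the ghost species are each reported.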
The trait, environment, species, and coordinates variables were of mixed types (binary, categorical, ordinal, quantitative). To make the datasets properly readable and analysable by R, we made sure the numerical variables were treated as such by the program. We also re-coded the binary variables into 0/1 (numeric), the character and nominal variables into factors (this option can be turned off in the function), and, when relevant (i.e. when made explicit in the original publication), the numeric integer variables into ordinal variables (ordered factors).
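The re-coding logic can be approximated with a short sketch. It is written in Python purely for illustration (the authors' routines are in R and use R factors); the function `recode_column`, the set of labels treated as binary, and the returned type names are assumptions, and ordinal variables are not handled because they require information from the original publication.

```python
# Hypothetical labels accepted as binary; True/False map through the keys 1 and 0.
BINARY = {"yes": 1, "no": 0, "1": 1, "0": 0, 1: 1, 0: 0}


def recode_column(values):
    """Return (kind, recoded values) for one column of raw cell values."""
    if all(v in BINARY for v in values):
        return "binary", [BINARY[v] for v in values]
    try:
        # Numbers stored as text are converted so they are treated as numeric.
        return "numeric", [float(v) for v in values]
    except (TypeError, ValueError):
        # Anything else is kept as a categorical variable (a factor in R terms).
        return "factor", [str(v) for v in values]


binary_col = recode_column(["yes", "no", "yes"])
numeric_col = recode_column(["3.5", "2"])
factor_col = recode_column(["forest", "3"])
```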
Our R code routines generate data reports and send them to the working directory in the form of .txt files. These give the user different information on the "comm", "traits", "envir" and "coord" components of each dataset: list of variables and their types (factor, numeric, etc.), number of sites, species and traits, minimum and maximum value of the community data (that allows checking e.g., whether data are abundances or presences/absences).
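A minimal sketch of such a report for the community table is given below, in Python rather than R. The function `community_summary` and the dictionary keys are hypothetical; the only idea taken from the text is that the minimum and maximum of the community data indicate whether values are abundances or presences/absences.

```python
def community_summary(comm):
    """Summarise a site-by-species table given as site -> {species -> value}."""
    values = [v for row in comm.values() for v in row.values()]
    n_species = len({sp for row in comm.values() for sp in row})
    # Only 0s and 1s suggests presence/absence data; anything else suggests abundances.
    data_type = "presence/absence" if set(values) <= {0, 1} else "abundance"
    return {
        "n_sites": len(comm),
        "n_species": n_species,
        "min": min(values),
        "max": max(values),
        "data_type": data_type,
    }


report = community_summary({"s1": {"a": 12, "b": 0}, "s2": {"a": 3, "b": 5}})
```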
The R code also applies some data transformation (e.g. scaling, Hellinger transformation, variable orthonormalization), generates spatial predictors (e.g. Moran Eigenvector Maps) and calculates some usual trait diversity metrics (e.g. Community Weighted Means, functional dispersion).
The curves represent the thresholds of Type II error rates observed by Dray & Legendre 23: red = 30%, orange = 10%, yellow = 5%, grey = 0%. The datasets that fall below these thresholds are theoretically exposed to respectively 30%, 10%, 5% or 0% chance to fail to detect significant TERs with fourth-corner analysis although these exist. The figure shows that the majority of the CESTES datasets fall in a medium (70%) to very good (>95%) power zone (Power = 100% − Type II error).
For all these processing, transformation and calculation steps, we used several R packages, available on CRAN, mainly readxl, doBy, plyr, dplyr, and gdata for the data processing, vegan, ade4, stats for the data transformation, ape and adespatial for the spatial processing, and picante and FD for the calculation of functional indices 109.
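Two of the operations named above have simple closed forms, sketched here in Python for illustration only (the authors' scripts rely on the R packages listed above). The Hellinger transformation replaces each abundance y_ij by sqrt(y_ij / y_i+), where y_i+ is the total abundance at site i, and the Community Weighted Mean of a trait at a site is the abundance-weighted average of the species' trait values. The function names and the dictionary layout are assumptions, and both functions assume no empty sites, as is the case in the processed datasets.

```python
import math


def hellinger(comm):
    """Hellinger transformation: square root of relative abundance within each site."""
    out = {}
    for site, row in comm.items():
        total = sum(row.values())  # assumes no empty site (total > 0)
        out[site] = {sp: math.sqrt(n / total) for sp, n in row.items()}
    return out


def community_weighted_mean(comm, trait):
    """Abundance-weighted mean of one quantitative trait at each site."""
    out = {}
    for site, row in comm.items():
        total = sum(row.values())
        out[site] = sum(n * trait[sp] for sp, n in row.items()) / total
    return out


toy_comm = {"s1": {"a": 1, "b": 3}, "s2": {"a": 4, "b": 0}}
body_size = {"a": 10.0, "b": 2.0}  # hypothetical trait values
hel = hellinger(toy_comm)
cwm = community_weighted_mean(toy_comm, body_size)
```

In the toy example, site "s1" has a Community Weighted Mean of (1 × 10 + 3 × 2) / 4 = 4, and site "s2", which holds only species "a", has a value of 10.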
All the R functions coded and used for the data preparation are provided in an R script "CESTES_DataPrep.R". A fully processed and "ready-to-use" version of the CESTES database is stored as an .RData object called "CESTES.RData".
All these files (namely the database, the two R scripts, and the two metadata files) are stored in a zipped folder called "rCESTES.zip" in the "CESTES" folder at the following links:
The flexibility of the iDiv Biodiversity Portal storage will allow us to keep updating, extending and sustaining the CESTES database and the R scripts in the future.

Code availability
In addition to the Excel storage of the datasets, the CESTES database has also been stored as an .RData object to facilitate its further use for analyses in R 109. It comes with R scripts that allow further checking, processing, transforming and exploring the database content (for more details, see the Usage Notes section). We provide all this information in a folder called "rCESTES.zip" within the "CESTES" folder at the following links: