Chronicles of nature calendar, a long-term and large-scale multitaxon database on phenology

We present an extensive, large-scale, long-term and multitaxon database on phenological and climatic variation, involving 506,186 observation dates acquired in 471 localities in the Russian Federation, Ukraine, Uzbekistan, Belarus and Kyrgyzstan. The data cover the period 1890–2018, with 96% of the data being from 1960 onwards. The database is rich in plants, birds and climatic events, but also includes insects, amphibians, reptiles and fungi. The database includes multiple events per species, such as the onset days of leaf unfolding and leaf fall for plants, and the days for first spring and last autumn occurrences for birds. The data were acquired using standardized methods by permanent staff of national parks and nature reserves (87% of the data) and members of a phenological observation network (13% of the data). The database is valuable for exploring how species respond in their phenology to climate change. Large-scale analyses of spatial variation in phenological response can help to better predict the consequences of species and community responses to climate change.


Background & Summary
Phenological dynamics have been recognised as one of the most reliable bio-indicators of species responses to ongoing warming conditions 1 . Together with other adaptive mechanisms (e.g. changes in the spatial distribution and physiological adaptations), phenological change is a key mechanism by which plants and animals adapt to a changing world 2,3 . Many studies have documented that in the northern hemisphere, spring events have become earlier whereas autumn events are occurring later than before, mostly due to rising temperatures [4][5][6] . Despite this broadly shared response, there are systematic differences in phenological responses to climate change among individual species [7][8][9] , different taxonomic groups and trophic levels [10][11][12] . Further, while some studies have reported that different species are likely to have evolved distinct phenological responses to environmental cues 13,14 , others suggest that many species are synchronised because phenotypic plasticity in phenological response to climate may maintain local adaptation 15,16 .
Comprehensive understanding of phenological responses to climate change requires community-wide data that are both long-term and spatially extensive 11,17,18 . Such data are still not common and, with few exceptions 11,17,18 , the assessments of broad-scale taxonomic and geographic variations in phenological changes have generally involved meta-analyses 5,19 , or analyses of large observational databases that either represent mid-latitude systems 4,5,20 or are characterized by low species richness 13 . Therefore, the spatial variation in phenological dynamics of species communities at large scale is still not well known 13,17 . Yet, this information is essential for understanding how species and communities respond to climate change 16 . A further common problem with many previously published data sets is publication bias. Few scientific journals are keen to publish papers reporting no detectable signal in species response to climate change, which can result in strongly biased conclusions in meta-analyses (but see 12,13 ). Assembling monitoring data that have been consistently collected over a long time and a large spatial extent addresses these problems directly 12 .
We present a large-scale and long-term dataset that can be used to examine community-level spatial variation in phenological dynamics and its climatic drivers. The database consists of 506,186 observation dates collected in 471 localities in the Russian Federation, Ukraine, Uzbekistan, Belarus and Kyrgyzstan (Fig. 1) over a 129-year period (from 1890 to 2018). During this period, researchers intensively conducted regular observations to record dates at which a predefined list of phenological and climatic events (Fig. 2) occurred. Although 96% of the observations were acquired from 1960 onwards, a few time series are very long. Events measured for plants include e.g. the onset days of leaf unfolding, first flowering time, and leaf fall; for birds they include e.g. days for first spring and last autumn occurrences; for insects, amphibians, reptiles and fungi they include e.g. day of first occurrence in the spring. The plant data were acquired in fixed plots, and the bird data along established routes. Climatic events were recorded as calendar dates when those events took place. Of all phenological dates, 87% were collected by research personnel of nature protected areas and national parks, who followed a systematic protocol. Thus, sampling effort remained nearly constant over time. The remaining 13% of the observations came from a well-established network of volunteer phenological observers, who followed a similar systematic protocol.
The recording scheme implemented at nature reserves offers unique opportunities for addressing community-level change across replicate local communities 21 . These data have been systematically collected not as independent monitoring efforts, but using a shared and carefully standardized protocol adapted for each local community. Thus, variability in observation effort is of much less concern than in most other distributed cross-taxon phenological monitoring schemes. To enable analyses of higher-level taxonomical groups, we have included taxonomic classifications for the species in the database.

A full list of authors and their affiliations appears at the end of the paper.

Fig. 1 Spatial and taxonomic distribution of data. The size of each circle shows the total number of phenological observations, and the coloured sectors the proportions of observations belonging to each taxonomical group. The number of distinct localities in the database is 471, but in the figure data from nearby locations have been pooled into 63 locations which are situated at least 100 km apart.

Fig. 2 Illustration of the structure of the data for phenological events with highest coverage. Each row corresponds to a type of phenological event. For each event, shown are the total number of records (N), the number of locations from which the records originate (L), the number of species that the data involve (S), and the mean number of species per location (S/L). The two heat maps show the temporal coverage of data in terms of years included (reflecting data availability), and in terms of the phenological dates (reflecting the timing of the included events). Further shown is a variance partitioning, with the colours corresponding to the fixed effects of latitude, longitude and their interaction (red), the random effect of the site (blue), the random effect of the taxon or climatic parameter (green), and the residual (grey). The event types are ordered within each taxonomic group according to the total amount of data.

Methods
Data acquisition. The data were collected by two research programs: the Chronicles of Nature (Letopisi Prirody) monitoring program, and a volunteer network of phenological observers (Fenologicheskii Klub). The Chronicles of Nature monitoring program 22 is based on the network of strictly protected areas (zapovedniks) and national parks. The program gradually evolved during the early 1900s 23 and was formally established in 1940 with the aim of streamlining scientific work in protected areas with standardized methodology among the organizations. The program involves the permanent personnel of each participating organization. The results of the monitoring program are published annually as Chronicle of Nature books. One printed copy of the books was kept in the office of the participating organization and another copy was sent to the Governmental Environmental Conservation Service (or a corresponding entity depending on the specific point in time).
In the Chronicles of Nature monitoring program, bird phenological events are extracted from route-based observations conducted regularly by ornithologists or professional rangers of the protected areas. Plant phenological events are reported either by botanists who visit permanent monitoring plots or transects, or by rangers who conduct regular walk-throughs within the strictly protected area or national park. The insect phenological data are extracted from standardized trapping data collected by entomologists on permanent plots or transects. The amphibian and reptile data are extracted from standardized trapping data collected by herpetologists. The fungal phenological data are collected by mycologists on permanent plots or transects. The weather event data are collected following a list of pre-defined events (e.g. first day of snowfall) by dedicated personnel or sourced from observations made at a local meteorological station. The types of data collected by each organization depend on the taxonomic expertise of its scientific personnel. For more details on how the data were collected, see 22,[24][25][26][27][28] .
The network of phenological volunteer observers was established by the Russian Geographic Society in 1848 with questionnaires sent out to selected contacts among the scientific community, including teachers and the general public 29 . The participants of the volunteer observation network make observations throughout the year to collect data on a pre-defined limited set of phenological events related to plants, animals, and weather. The species included in the pre-defined lists were selected so that they could be identified reliably without specific taxonomical training.
Data digitalization and unification. The compilation of the data into a common database was initiated in the context of the project "Linking environmental change to biodiversity change: long-term and large-scale data on European boreal forest biodiversity" (EBFB), funded for 2011-2015 by the Academy of Finland, and continued with the help of other funding to OO since 2016. We organized a series of project meetings that were essential for data acquisition, digitalization and unification. These meetings were organized in Ekaterinburg (Russia) by the [...] The compilation of the data into a common database was conducted by the database coordinators (EM and CL) in Helsinki (Finland). Those participants who already held the data in digital format submitted them in the original format, and those who had the data only in paper format digitized them using Excel-based templates developed in the project meetings. Submitted data were processed by the database coordinators according to the following steps:
1. The data were formatted so that each observation (the phenological date of a particular event in a particular locality and year) formed one row in the data table (e.g. un-pivoting tables that involved several years as the columns). The phenological event names were split into event type (e.g. "first occurrence") and species name.
2. The event type names (provided originally typically in Russian) were translated into English and the species names (usually provided in Russian) were identified to scientific names, using dictionaries that were partly developed and verified in the project meetings. All scientific names were periodically verified by mapping them to the Global Biodiversity Information Facility (GBIF) backbone taxonomy 30 .
3. We associated each data record with the following set of information fields: (1) project name, i.e. the source organization, (2) dataset name, (3) locality name, (4) unique taxon identifier, (5) scientific taxon name, and (6) event type.
4. We imported the data records into the main database (maintained as an EarthCape database at https://ecn.ecdb.io). During the import, the taxonomic names, locality names, and dataset names were matched against already existing records.
5. The database was published in Zenodo 31 .
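The reshaping described in step 1 can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the procedure the database coordinators actually ran. The column names (`locality`, `event`), the colon used to separate event type from species name, and the day.month date strings are all assumptions made for the example.

```python
import csv
import io

# Hypothetical wide-format input: one row per locality and event,
# one column per year (empty cells are missing observations).
WIDE_CSV = """locality,event,1990,1991,1992
Site A,first occurrence: Cuculus canorus,12.05,09.05,
Site A,onset of blooming: Betula pendula,03.05,,28.04
"""

def unpivot(wide_text):
    """Return one record per (locality, event type, species, year)."""
    records = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        # Split the combined event name into event type and species name
        # (the colon separator is an assumption of this sketch).
        event_type, species = (p.strip() for p in row["event"].split(":", 1))
        for column, value in row.items():
            if column in ("locality", "event") or not value:
                continue  # skip identifier columns and missing observations
            records.append({
                "locality": row["locality"],
                "event_type": event_type,
                "species": species,
                "year": int(column),
                "date": value,
            })
    return records

long_records = unpivot(WIDE_CSV)
```

Each element of `long_records` then corresponds to one observation, which is the row layout that the later steps (translation, taxon matching, import) operate on.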

Updates and limitations.
There are at least 150 National Parks and Nature Reserves that collect Chronicles of Nature Book data (in Armenia, Azerbaijan, Belarus, Georgia, Kazakhstan, Kyrgyzstan, Moldova, Russian Federation, Tajikistan, Turkmenistan, Ukraine and Uzbekistan). Out of these, the current database covers data from 62 organizations, with the highest coverage in European Russia (Fig. 2). The collection of new data continues in most parks. Thus, the database is not complete, and we aim to support the database with updates, depending on the interest of new partners to join, as well as resources and funding. The technical validation procedures described below will also be applied to any new information included in the database. The resulting new versions of the database will be released through the Zenodo repository to ensure the long-term availability of the database.
The Chronicles of Nature programme involves several kinds of systematically collected data beyond phenology data: e.g. trapping data on small mammals, count data on birds, and yield data on berries and mushrooms 22 . We aim to publish these data as separate data papers.

Data Records
The database is organized in six datasets: (1) a classification of taxa included, (2) a list of phenological events included, (3) a list of climatic events included, (4) information on the study site, (5) the phenology data, and (6) an information sources table for phenology data 31 . All tables are in csv format (columns separated by commas), and their fields are described in Tables 1-6. The tables can be linked to each other using the unique study site names and the unique identifiers for species and climatic events.
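As an illustration of how the tables can be linked, the sketch below joins a phenology table to a site table and a taxon table using Python's standard library. The miniature tables and every field name in them (`site_name`, `taxon_id`, `day_of_year`, and so on) are hypothetical; the actual field names are those defined in Tables 1-6.

```python
import csv
import io

# Hypothetical miniature versions of three of the six tables.
SITES_CSV = """site_name,latitude,longitude
Site A,56.8,60.6
"""
TAXA_CSV = """taxon_id,scientific_name,class
T1,Cuculus canorus,Aves
"""
PHENOLOGY_CSV = """site_name,taxon_id,event_type,year,day_of_year
Site A,T1,first occurrence,1990,132
Site A,T1,first occurrence,1991,129
"""

def read_table(text, key):
    """Index a CSV table by its unique key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

def link_records(phenology_text, sites, taxa):
    """Attach site and taxon fields to every phenology record."""
    linked = []
    for row in csv.DictReader(io.StringIO(phenology_text)):
        merged = dict(row)
        merged.update(sites[row["site_name"]])  # join on unique site name
        merged.update(taxa[row["taxon_id"]])    # join on unique taxon identifier
        linked.append(merged)
    return linked

sites = read_table(SITES_CSV, "site_name")
taxa = read_table(TAXA_CSV, "taxon_id")
linked = link_records(PHENOLOGY_CSV, sites, taxa)
```

With the real files, the in-memory strings would be replaced by the downloaded csv files, and the key columns by the corresponding fields from Tables 1-6.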

Technical Validation
We asked the contributors to carefully check the validity of the phenological dates prior to submission. While uploading the submitted data to the database, we did manual validation checks to pinpoint data records that were suspicious (e.g. summer events recorded in winter), and sent the suspicious data records back to the contributors for correction or validation. However, given the extensive size of the database, it is likely that the database contains a number of erroneous records. Thus, we performed a series of checks to identify spurious data points and to examine the strength of the biological signal in the data.
First, we fitted for each (site - climatic/species name - event type) triplet a von Mises distribution, i.e. the circular normal distribution, where the circularity was used to connect the last day of the previous year to the first day of the next year. We identified as potentially spurious those records that were beyond the 0.9999367 central confidence interval of the fitted distribution (i.e. points located at least four standard deviations away from the mean, assuming a Gaussian distribution). This filtering revealed 322 severe outliers that were returned to the data owners for validation. If the data owner could neither verify nor correct the exceptional date, we marked this data record as suspicious.
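The idea behind this first check can be sketched in Python as follows. This is a simplified stand-in, not the authors' procedure: instead of a maximum-likelihood von Mises fit it uses the circular mean and circular standard deviation of the dates, and it assumes a 365-day year. The function names are hypothetical. The helper `central_mass` only confirms the link between four Gaussian standard deviations and the 0.9999367 interval quoted above.

```python
import math

DAYS_PER_YEAR = 365.0  # assumption: leap days are ignored in this sketch

def central_mass(n_sd):
    """Probability mass within +/- n_sd standard deviations of a Gaussian mean."""
    return math.erf(n_sd / math.sqrt(2.0))

def circular_mean_sd(days):
    """Circular mean and circular standard deviation, both in days."""
    angles = [2.0 * math.pi * d / DAYS_PER_YEAR for d in days]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    r = min(math.hypot(c, s), 1.0)            # mean resultant length
    mean_angle = math.atan2(s, c) % (2.0 * math.pi)
    sd_angle = math.sqrt(-2.0 * math.log(r))  # circular standard deviation
    scale = DAYS_PER_YEAR / (2.0 * math.pi)
    return mean_angle * scale, sd_angle * scale

def circular_distance(day_a, day_b):
    """Shortest distance in days between two dates on the yearly circle."""
    diff = abs(day_a - day_b) % DAYS_PER_YEAR
    return min(diff, DAYS_PER_YEAR - diff)

def flag_outliers(days, n_sd=4.0):
    """Dates lying more than n_sd circular standard deviations from the mean."""
    mean_day, sd_day = circular_mean_sd(days)
    return [d for d in days if circular_distance(d, mean_day) > n_sd * sd_day]

# Example: 40 dates clustered around the turn of the year, plus one in midsummer.
example_days = [(360 + i % 10) % 365 for i in range(40)] + [180]
suspicious = flag_outliers(example_days)
```

Working on the circle matters for events such as the first snowfall or ice formation, whose dates can fall on either side of 1 January; a linear mean of day numbers 364 and 2 would wrongly point to midsummer.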
Second, for each (site - climatic/species name - event type) triplet we fitted (i) a single von Mises distribution and (ii) a mixture of two von Mises distributions, and compared the fits of the models (i) and (ii) using the Bayesian Information Criterion (BIC). We identified the data as potentially spurious if the mixture model fitted the data better, with a BIC difference of 5 or greater, and if the distance between the estimated means of the distributions in the mixture was greater than 30 days. For 214 such cases, we performed a manual examination of the data. This revealed e.g. the use of identical event names with different actual meaning (e.g. first arrival of the Willow Tit Parus montanus, recorded in spring and autumn seasons, and thus related to spring and autumn migration). Next, we repeated exactly the same filtering procedure, but for (climatic/species name - event type) pairs, to ensure that similarly named event types had consistent meaning across all sites.
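The decision rule of this second check can be written down compactly. The sketch below assumes that the two models have already been fitted elsewhere and that their maximized log-likelihoods and the two mixture means are available; the function names, the parameter counts (2 for a single von Mises, 5 for a two-component mixture) and the use of a circular distance between the means are assumptions of this illustration rather than details stated in the text.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def looks_bimodal(loglik_single, loglik_mixture, n_obs,
                  mean_day_1, mean_day_2,
                  min_bic_gain=5.0, min_separation_days=30.0,
                  days_per_year=365.0):
    """Apply the two criteria: BIC gain of the mixture and separation of its means."""
    # A single von Mises has 2 free parameters (mean, concentration); a
    # two-component mixture has 5 (two means, two concentrations, one weight).
    bic_gain = bic(loglik_single, 2, n_obs) - bic(loglik_mixture, 5, n_obs)
    # Assumption: the distance between the means is measured on the yearly circle.
    diff = abs(mean_day_1 - mean_day_2) % days_per_year
    separation = min(diff, days_per_year - diff)
    return bic_gain >= min_bic_gain and separation > min_separation_days
```

Requiring both conditions keeps the filter from flagging series in which a mixture fits marginally better only because it splits one broad peak into two nearby components.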

Major sources of variation in the data.
To quantify the main sources of variation and thus to illustrate the types of ecological signals present in the data, we performed a variance partitioning analysis separately for each group of species and phenological events. As predictors, we used species and the location, the latter of which we further explained by the linear effects of latitude, longitude, and their interaction. These analyses were performed using the LinearModelFit and Variance functions in Mathematica 11.1 (Wolfram Research 2018). As an example, let us consider the event type with the highest amount of data, which is the onset of blooming for plants. These data consist of 76,527 phenological dates, originating from 317 sites and representing 845 taxa (Fig. 2). We first computed for each site an average day over the species and years, resulting in 317 site-specific dates. These dates describe when plants on average (over years and plant species) have their onset of blooming at each location. While the collection of species included in the study varies from site to site, we still consider these dates meaningful proxies for the overall phenology of the onset of plant blooming. The amount of variation explained by the site-level averages was 33% of the original variance. Out of the variation explained by the site, 54% was further explained by the linear effects of latitude, longitude, and their interaction. We then partitioned the remaining variation (after the effect of site was accounted for) to components that could be attributed to the species (53% of the original variance) and to the residual (14% of the original variance). This analysis provided rather strong support for an ecological signal being present in the data, as 86% of the variation among the 76,527 data points could be attributed to the main effects of the location and species, and as ca. half of the variation among the locations could be attributed to a simple geographic trend. We note that the residual variation in this analysis should not be interpreted as erroneous noise, as it contains e.g. variation over time, and thus reflects e.g. the impact of climate change on phenology.
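The sequential logic of this partitioning (site averages first, then species within the remaining variation) can be sketched in Python. This is a simplified illustration on toy data and not a reproduction of the Mathematica analysis: it uses plain group means and population variances, omits the regression of site averages on latitude and longitude, and all function names are hypothetical.

```python
from collections import defaultdict

def group_means(values, groups):
    """Mean of `values` within each group label."""
    totals = defaultdict(lambda: [0.0, 0])
    for value, group in zip(values, groups):
        totals[group][0] += value
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

def variance(values):
    """Population variance (divides by n), so that the components add up."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def partition_variance(days, sites, species):
    """Fractions of variance attributed to site, species, and residual."""
    site_mean = group_means(days, sites)
    site_fit = [site_mean[s] for s in sites]
    after_site = [d - f for d, f in zip(days, site_fit)]
    species_mean = group_means(after_site, species)
    species_fit = [species_mean[sp] for sp in species]
    residual = [r - f for r, f in zip(after_site, species_fit)]
    total = variance(days)
    return {
        "site": variance(site_fit) / total,
        "species": variance(species_fit) / total,
        "residual": variance(residual) / total,
    }

# Toy data: day of year of an event at three sites for two species.
toy_days = [100, 110, 120, 130, 105, 115]
toy_sites = ["A", "A", "B", "B", "C", "C"]
toy_species = ["x", "y", "x", "y", "x", "y"]
fractions = partition_variance(toy_days, toy_sites, toy_species)
```

Because each step removes group means before the next one is computed, the three fractions sum to one, which mirrors how the 33%, 53% and 14% reported above add up to the original variance.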
We repeated the above-described analysis for all groups of phenological events for which there were at least 1000 data records, as well as climatic events related to temperature, snow, and ice. The results are illustrated in Fig. 2. The amount of explained variance is generally relatively high in all cases, suggesting that much of the variation in the data is explained by location and species.