A compilation of North American tree provenance trials and relevant historical climate data for seven species

Tree provenance trials consist of a variety of seed sources (or provenances) planted at several test sites across the range of a species. The resulting plantations are typically measured periodically to investigate provenance performance in relation to abiotic conditions, particularly climate. These trials are expensive and time consuming to establish, but are an important resource for seed transfer systems, which aim to match planting sites with well-adapted (climatically suitable) seed sources. Provenance trial measurements may be underutilized because the data are scattered across publications, conference proceedings, and university theses. Here we document an effort to collect available provenance trial measurements and associated climate data for seven eastern North American tree species (Pinus strobus, Pinus banksiana, Picea glauca, Picea mariana, Quercus rubra, Larix laricina, Betula alleghaniensis). The resulting datasets included a total of 773 provenances and 62 test sites, with 65 historical climate variables appended to each location. We hope this data will support forest managers in making seed transfer decisions, particularly in an era of rapid climate change.


Background & Summary
Natural populations of tree species often vary in traits such as cold hardiness, growth, and bud phenology across their geographic range, due to local adaptation 1 . Much research has been done to quantify the variation in these traits across a species' range for the purpose of elucidating population-level preferences in environmental conditions. Generally, tree provenance studies involve collecting representative seed of natural populations from a variety of locations (also referred to as provenances) across the range of a species and subsequently planting them at one or more test sites with statistically sound experimental designs. The resulting plantations are monitored through time by measuring performance-related variables such as tree height and survival. Better adapted populations are expected to grow faster and exhibit higher survival rates. Although a common assumption is that the local provenance is best, this is not always the case. For example, Lu et al. showed that white spruce provenances from southern Ontario, Canada, performed better (in terms of height and survival) than local provenances at several northwestern Ontario test sites 1 . There are several potential explanations for the success of non-local provenances, including recent climate change and the migration history of the species 1 .
An important application of provenance data is in the development of seed transfer guidelines 2 . In the past, transfer guidelines have typically aimed to limit seed movements in order to ensure the use of local seed sources at reforestation sites. However, with an evolving climate, directional seed movements (usually poleward or upslope) may be needed to maintain a reasonable match between historical climate at the seed source and future climate at the planting site. Furthermore, optimizing future growth gains obtained by planting non-local provenances must be balanced with possible near-term mortality due to climate/weather-related factors such as winter injury. Since provenance trials gather information on provenance performance across a range of climatic conditions, they are well-suited to provide valuable insights into optimal seed transfer movements under climate change. Unfortunately, tree provenance trials are expensive to establish, maintain, and measure. Further, the long lifespan of trees means that it can take decades to obtain information about mature trees from provenance trials. For these reasons, consolidating existing provenance data into a publicly available repository is an important undertaking.
No current resource provides compiled data from eastern North American tree provenance trials in a single product with associated climate data. Instead, these data are scattered across a variety of sources, including journal articles, conference proceedings, university theses, and government reports. Here we present a data product that consolidates provenance data and associated climate estimates for seven major eastern North American tree species. This effort provides timely data to researchers and forest managers given the recent increased focus on seed transfer under climate change. Prospects for rapid climate change over the coming decades 3 and management strategies such as assisted migration increase the relevance of historical data 4 . Climate and elevation data are included along with the provenance measurements, in order to, for example, facilitate the calculation of safe climate transfer distances 5 for each tree species in the database. This article details the data acquisition and data processing methods used to create the databases and describes the structure of the databases and related quality assurance methods. We hope that the databases and information provided in this article will facilitate the use of existing tree provenance data for ongoing seed transfer research and decision making.

Methods
Study design & database considerations. We identified seven major tree species that have significant economic and/or ecological value in our geographic area of interest (northeastern United States and eastern Canada): Picea glauca (white spruce), Picea mariana (black spruce), Pinus banksiana (jack pine), Larix laricina (tamarack), Pinus strobus (white pine), Quercus rubra (red oak), and Betula alleghaniensis (yellow birch)). White spruce, black spruce, and jack pine are important commercial conifer species harvested in large volumes across the northern (boreal) portion of our area of interest 6 . Red oak, yellow birch, and white pine are high value hardwood species 7,8 that have more southern geographic ranges, but have potential for future northward seed transfer. We included data for seed sources from across North America where available, although most originated in the northeastern United States and eastern Canada.
We gathered information for three types of measurement (or response) variables, including tree height, survival, and phenology. Tree height is widely used to indicate provenance growth potential in seed transfer research, but tree survival and phenology data are also important to indicate that a potential seed source can survive and adapt to the conditions at a certain location. Phenology data is of particular interest due to changes in seasonality associated with climate change and the lack of information on how species and populations will respond to these changes 9 . However, this data is not widely reported in tree provenance trials because it is time sensitive and challenging to measure, with multiple visits required to accurately record the phenological stage of the buds 10 .
In order to identify potential data sources and create a functional database that included both provenance and climate data, we completed six major steps (Fig. 1).

Systemic review, metadata collection, & selection of sources.
We conducted an extensive literature review of sources (publications, conference proceedings, government reports, and university theses) relating to tree provenance trials, for each of the seven tree species of interest. We used keywords such as "provenance trials", "height growth", "tree survival", "common garden study", and both the common and scientific names of the tree species when searching for sources using resources such as Google Scholar and Reforestation, Nurseries, & Genetic Resources (RNGR). We reviewed a total of 130 sources, recording metadata such as the number of test sites, the year an individual trial was established, the number of provenances included in each trial, provenance mean values of the variables measured in individual trials, tree age at measurement, etc. We also recorded how trees in each trial were measured for each response variable and whether the trial results were reported in-text. If the location information and trial data were reported and it was likely that all trees in the trials were measured, we digitized the information related to the study in a series of tables. Online-only Table 1 provides a summary of sources selected for inclusion in our data product. Data processing. All the records in the database were converted to consistent units, such as decimal degrees for latitude and longitude, meters for tree height, and days since January 1 for phenology data. All measurements included in the database represent the mean value for the whole test site, including any replications. For some studies, we assumed this to be the case because no information was provided in the original study about methods used to sample the trees to take measurements. In some cases, there were multiple test sites at one location, representing different environments 11 . In this case, the test sites were entered separately, labeled as "Site 1", "Site 2", etc. We also included the test sites separately for different species or studies.
Climate data associated with provenances and test sites. Estimates for some 65 climate variables (see Online-only Table 2 for more details) were obtained by interrogating existing spatial climate models at each seed source and test site location 12 . Growing season-related variables were calculated using a Bessel interpolation in conjunction with monthly climate variables 13 . The climate estimates included averages for the 1961-1990 period, which provide a reasonable approximation of historical long-term climate values at seed source locations. In addition, climate estimates were generated at each test site for each year over the period spanning plantation establishment and measurement. Specifically, we used a Python program to calculate mean, standard deviation, minimum, and maximum climate values over this period.
Once we generated climate estimates at each test site and seed source location, we used a Python script to append the information (using a unique seed source identification code as the primary key) to the tables containing the provenance trial data. If elevation information had not been reported in the original document, we appended elevation estimates at seed source and test site locations using a digital elevation model (DEM) 14 .
www.nature.com/scientificdata www.nature.com/scientificdata/ Data Records Data format & storage location. The databases are provided in a series of comma-delimited files (available in both.txt and.csv format). These tables are organized in a series of folders (see Fig. 2). The first database (henceforth the trial results database) contains information on seed sources from 22 provenance studies in eastern North America, including latitude, longitude, elevation, performance measurements (tree height, survival, and phenology data), and 65 long-term climate variables (i.e., averages for the 1961-1990 period). The second database (henceforth the test sites database) contains information on test sites from the same 22 studies, including latitude, longitude, elevation, and 65 climate variables summarized for the period spanning plantation initiation and measurement. All data were recorded in standardized format and units (decimal degrees, meters, °C, etc.). The databases are accessible in an online repository at https://doi.org/10.5063/K072NQ 15 .  www.nature.com/scientificdata www.nature.com/scientificdata/ Here, we have provided two versions of the trial results database. The first version of the database (trial_ results_database) contains tables with consistent columns to facilitate data analysis in environments such as R. If a certain test site did not have a specific measurement, the column was included but contained no data values. This database contains mean height and survival for all trial measurement ages in single columns (i.e., "Mean Height (m)", "Survival (%)"). To achieve this, we entered multiple records for each seed source. The second version of the database (trial_results_database_primary_key) contains tables with a single row for each seed source to facilitate data analysis with SQL in environments such as Microsoft Access or PostgreSQL. In this case, we included individual columns for each measurement taken at each test site (i.e., "Mean Height Age 10 (m)"). This format means that the seed source identification code can be used as the primary key and can be used to join tables from the same study.
Data structure & citation. The trial results database is organized by species and contains a table for each test site, identified by both the name of the test site location and the data source number. This database also contains information about the data sources, such as the accuracy of the latitude and longitude of the seed sources (whether it was reported in-text or estimated from the location name) and whether or not phenology data was available. In total, this database contains 75 tables. Table 1 reports an overview of the information contained in  www.nature.com/scientificdata www.nature.com/scientificdata/ this database. Detailed information on the data contained in each table can be found in the database, in the table " Table Overview".
The test sites database is organized by test site and includes tables that contain mean, standard deviation, minimum, and maximum climate values for the period between plantation establishment and measurement. If there were multiple measurement years, additional tables were provided for the period between trial establishment and the later measurement years. For example, if a test site was established in 1975 and measured in 1979 and 1980, we included two tables, one for 1975-1979 and one for 1979-1980. In order to avoid tables in the database having the same name, the tables are also named with the unique code of the test site, such as 1988-1990 (pjt281). The purpose of this is to avoid errors when querying the database. The test site database also contains information about the data sources, as well as the test sites (such as whether the elevation data comes from the data source or the DEM). Missing data values were denoted using −9999. In total, this database contains 103 tables. Figure 2 shows a schematic of the folder structure of both databases.
Both databases are organized such that the source of the data is noted alongside the test site name. The full citations of the data sources are found in both databases, in the table named "Guide to Data Sources". For users of the data, the original source should be cited along with the database.
applications. The databases presented here will support seed transfer analyses. The provenance trial data, in combination with the associated climate variables, can be used to calculate critical seed transfer distances -i.e., the distance that a seed source can be moved to improve long term productivity but also minimize the risk of near term mortality 16 . The information provided in the databases may contribute to efforts to develop more robust transfer functions (a function to model climate-performance relationships) by including more test sites and/or seed sources 17 . Furthermore, the datasets can also be used for applications such as mapping trends and finding correlations between climate variables and trial results.

technical Validation
The first method we used for data quality assurance was to examine the data for unreasonable values. For example, tree height mean data for each test site was sorted in ascending and descending order to examine the data for any extreme values. Suspect values were then compared with the source data to check for data entry errors. For quality assurance of the climate data, we mapped the values for mean annual temperature for each provenance and test site in ArcMap (Fig. 3); similar efforts were undertaken for a selection of the other climate variables as www.nature.com/scientificdata www.nature.com/scientificdata/ well (Figs. 4, 5). Accuracy of the mean, standard deviation, minimum, and maximum calculations was ensured by comparing the results with hand calculations for a selection of data values to ensure that the Python program was working properly.