Unbalanced historical phenotypic data from seed regeneration of a barley ex situ collection

The scarce knowledge on phenotypic characterization restricts the usage of genetic diversity of plant genetic resources in research and breeding. We describe original and ready-to-use processed data for approximately 60% of ~22,000 barley accessions hosted at the Federal ex situ Genebank for Agricultural and Horticultural Plant Species. The dataset gathers records for three traits with agronomic relevance: flowering time, plant height and thousand grain weight. This information was collected for seven decades for winter and spring barley during the seed regeneration routine. The curated data represent a source for research on genetics and genomics of adaptive and yield related traits in cereals due to the importance of barley as model organism. This data could be used to predict the performance of non-phenotyped individuals in other collections through genomic prediction. Moreover, the dataset empowers the utilization of phenotypic diversity of genetic resources for crop improvement.

Design Type(s) data integration objective • metadata search and retrieval objective

Measurement Type(s) Phenotypic_Measurement
Technology Type(s) digital curation
Factor Type(s) temporal_instant • geographic location • season

Background & Summary
Cereals are staple food and a valuable source of nutrients around the world 1 . Among them, barley (Hordeum vulgare sp.) is the fourth most produced crop 2 . The main end-uses of barley are brewing, feed, and food production 3 . In terms of crop adaptation barley can be classified into two distinct gene pools: winter and spring type [4][5][6] . While winter type barley needs vernalization for flowering stimulation, spring type barley does not require it 3 . Barley has a diploid genome and its 7 chromosomes represent the base genome of all Triticeae species. For this and many more reasons barley has become a model organism in cereal genetics and genomics 7 . In addition, the availability of a high quality reference sequence of the barley genome, well established protocols for genome editing and elaborated approaches for genomic selection will greatly benefit barley breeding in the future [7][8][9][10][11] . Establishing germplasm collections has involved assemblage and preservation of the existing allelic diversity and their utilization 12,13 . In the case of barley, more than seven decades of major efforts have resulted in about half a million ex situ accessions worldwide [13][14][15] . Germplasm collections are an outstanding resource of genetic diversity for research and plant improvement. For instance, genebank collections represent a rich source of unexplored trait variation which is absent in public and private breeding programs. This variation could potentially boost selection gain in plant breeding to increase both yield potential and sustainability and to facilitate adaptation to global change 16,17 . However, leveraging genetic resources of public germplasm collections is still a challenge due to the lack of phenotypic information and the high investments required for the systematic characterization of plant material 9,18,19 . Recently, a method for the exploitation of germplasm based on genomics was proposed 19 . 
In this context, genebanks are encouraged to maximize the reuse of both phenotypic and genotypic data by the implementation of the FAIR principles referring to: Findability, Accessibility, Interoperability, and Reusability 20 . For example, historical phenotypic records for traits with agronomical relevance have been accumulated during the seed regeneration process at genebanks but are not publicly available or the access to them is limited 16,[19][20][21][22][23] .
This study presents original and ready-to-use processed phenotypic data with the aim of leveraging the use of historical information collected during seed regeneration. The data comprise historical records for the traits flowering time (FT), plant height (PH), and thousand grain weight (TGW) accumulated over seven decades, together with the outlier status of all data points and the Best Linear Unbiased Estimations (BLUEs) of winter and spring barley accessions for these traits. This historical information belongs to the barley collection of the Federal ex situ Genebank for Agricultural and Horticultural Plant Species hosted at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) in Gatersleben (Germany). With a total of ~22,000 accessions, the IPK Genebank manages the sixth largest collection worldwide, which covers a broad range of phenotypic variation 14,15,24,25 . This data publication complements a previous research publication 25 which focuses on the valorization of genetic resources by developing, validating and employing a curated data set from seed regeneration. Moreover, part of these BLUEs was recently used to show the potential of genome-wide association for FT in genebank materials of spring barley 26 .

Plant material
The barley collection at the IPK amounts to ~22,000 accessions. These accessions were assembled by means of worldwide collecting expeditions, seed exchange with other institutes, and donations. Accession-related information is documented in the genebank information system of the IPK (GBIS) 22 . This study includes FT, PH, and TGW data recorded during seed regeneration for approximately 60% of the barley accessions.

Seed regeneration produced an unbalanced historical data source
Seed regeneration aims to supply seed for (i) safeguarding the stored genetic diversity when sample size and seed viability drop beneath a pre-established threshold, (ii) conserving new genotypes within the genebank, (iii) research, and (iv) fulfilling external demands for germplasm 27 . The seed regeneration routine in the genebank generated non-orthogonal phenotypic data 23,28,29 across traits and years, e.g., only 12 accessions were evaluated for TGW in 1984 while a record number of 4,789 accessions were characterized in 1970 for PH. Additionally, in 1% of cases accessions were multiplied more than once in a year. One reason for this was, for instance, the need to check whether the plant material required vernalization. Moreover, the introduction of cold storage in 1976 abruptly decreased the frequency of data generation during seed regeneration, because storage time switched from ~3 to >20 years 27 . Furthermore, the use of the collection, or parts of it, in research projects had a positive impact on the amount of data collected per year. For example, the protein screening of cereal genetic resources carried out in 1970 brought the largest number of regenerated accessions in a single year (Fig. 1). The data of the present study are based on seed regenerations during the 1946-2015 period. Seed regeneration for barley has been conducted in Gatersleben since 1946 in different seasons according to the growth habit of accessions. In more detail, winter accessions were planted between September and December while spring accessions were sown from February until April.

Traits assessed on seed germplasm regeneration
Each accession was multiplied using plots of at least 3 m², and the traits FT, PH, and TGW were assessed during seed regeneration. FT is the number of days until 50% of the plants reached flowering. For winter barley, FT is expressed in days after the 1st of January of each year. For spring barley, FT is expressed in days after the sowing date. PH was assessed in cm from the soil surface to the top of the spike, including awns. TGW was determined after seed harvest and expressed in g on a ~12.5% grain moisture basis. Seeds were harvested at maturity stage and were temporarily stored at room temperature. Before the 2005/2006 season the standard protocol for TGW assessment at the genebank was based on the average weight of three samples, each containing 100 grains, which was then extrapolated to 1000 grains. From the 2005/2006 season onwards TGW has been determined by using an automatic Marvin digital seed analyzer and considering a seed sample with up to 100 grains. The data management at the genebank was manual until 2011. In this sense, the information was first recorded in field books, then transferred to

Methods for data processing

Statistical model
No formal field experimental design was used during seed regeneration, and in only 1% of cases was an accession evaluated more than once in a year. For this reason, an unreplicated completely randomized experimental design was assumed for each regeneration cycle during data processing. According to the assumed design, the experimental unit corresponded to a plot. Phenotypic data of each barley type were analyzed separately based on the following mixed model:

$$y = \mu + \mathrm{Genotypes} + \mathrm{Years} + \mathrm{Error} \qquad (1)$$

where y is the phenotypic record of an accession in a given year, μ is the population mean and "Genotypes" are the genetic effects of accessions, which were assumed as fixed factors, while years and error were treated as random. Variances of errors were modelled as specific for each year. In a first step, Equation (1) was used for outlier detection. Later, the BLUEs of accessions were computed by re-fitting the model in Equation (1) to an enhanced historical dataset from which the data points detected as outliers during the first step had been discarded.
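The outlier-flagging step of this two-step procedure (a rescaled median absolute deviation of residuals combined with a Bonferroni-Holm test, as outlined under Technical Validation) can be illustrated with a minimal Python sketch. It is not the published R/ASReml-R script: the function name `flag_outliers`, the use of a standard normal reference distribution for the p-values, the rescaling constant 1.4826, and the example residuals are assumptions made here for illustration, and the residuals are taken as given rather than obtained from a fitted mixed model.

```python
import math
import statistics


def flag_outliers(residuals, alpha=0.05):
    """Return one boolean per residual; True marks a flagged outlier."""
    med = statistics.median(residuals)
    # Rescaled MAD: 1.4826 * MAD is a robust estimate of the standard
    # deviation when the residuals are approximately normal (assumption).
    mad = statistics.median([abs(r - med) for r in residuals])
    scale = 1.4826 * mad
    if scale == 0:
        raise ValueError("MAD is zero; residuals cannot be standardized")
    # Two-sided p-value of each robustly standardized residual under N(0, 1).
    pvals = [math.erfc(abs(r) / scale / math.sqrt(2)) for r in residuals]
    # Bonferroni-Holm step-down: test the smallest p-value against alpha / m,
    # the next against alpha / (m - 1), and stop at the first non-rejection.
    m = len(pvals)
    order = sorted(range(m), key=pvals.__getitem__)
    flags = [False] * m
    for rank, idx in enumerate(order):
        if pvals[idx] * (m - rank) < alpha:
            flags[idx] = True
        else:
            break
    return flags


# Made-up residuals: nine ordinary values and one gross deviation.
example = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.2, -0.2, 0.8, -0.8, 15.0]
flags = flag_outliers(example)
```

In the workflow described above, the flagged records would be dropped before the model is re-fitted to obtain the BLUEs.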

Code availability
Mixed model equations were solved using the Restricted Maximum Likelihood (REML) algorithm as implemented in ASReml-R 30 . All described statistical approaches were performed in the R environment (Version 2.15.3) 31 . Scripts used for outlier detection and estimating BLUEs are included together with the dataset in the public repository described below (Data Citation 1). To use the code, download the datasets, save them in a working directory, and set that working directory in the scripts. The scripts run for a single trait and one growth habit at a time. For instance, the example scripts run for flowering time (FT) of spring barley. In this case, the resulting files are labeled "Data.corrected.FT.txt" and "BLUEs.FT.txt" for outlier detection and BLUE estimation, respectively. In this regard, this study involves 12 outputs that were compiled in four files which are described below.

Data records
The data compiled for this study are publicly available in the Plant Genomics and Phenomics Research Data Repository (PGP) (http://edal-pgp.ipk-gatersleben.de/) 32 and can be accessed here as (Data Citation 1). The dataset is formatted using the ISA-Tab format 33 to guarantee a uniform and easily readable semantic description. It contains the original data as well as the processed data. While the investigation file describes the general project information, the two study files ("s_Spring_Barley.txt" and "s_Winter_Barley.txt") provide information about the investigated accessions. They contain information such as: (i) accession identifiers, e.g., the accession ID, a unique and stable database-generated code at the genebank, and the accession number, which is typically used by researchers but is not stable over time, (ii) sowing_date corresponding to day.month.year, (iii) harvest_year, (iv) country as geographic place of collection reported by donors or collectors, and (v) the comment column, which shows two groups of accessions whose countries are mentioned in the manuscript as Germany and Soviet Union. The assay files of the present study were separated into the historical phenotypic data ("a_Historical.Data_Spring.txt" and "a_Historical.Data_Winter.txt"), which were provided from the IPK genebank information system and were first screened for outliers. Then, outliers were excluded to produce the enhanced assay files ("a_Enhanced_Historical.Data_Spring.txt" and "a_Enhanced_Historical.Data_Winter.txt"). These files accommodate records for up to 2,967 winter and 9,898 spring accessions (Table 1). Each accession was phenotyped in 1 to 22 years (Fig. 2), and in each year between 12 and 4,789 accessions, across traits, were evaluated (Fig. 1). The heritability for all traits was high and it increased further by up to 17% when applying an outlier correction 25 (Table 2). 
The Pearson's correlation coefficient (r) estimated on the enhanced data for pairs of years with at least 50 overlapping accessions ranged from 0.60 to 0.72 (Table 3). The precision in computing the BLUEs amounted to 0.89 for TGW and 0.85 for both FT and PH 25 . Moreover, the maximum coefficient of variation of the year on the enhanced data set was 0.22 (Table 4). Ninety percent of these genetic resources were collected or originated from 30 geographic places. Ethiopia, with 32.1% of accessions, was the predominant origin for spring barley, followed by Germany with 7.2%. Interestingly, although 12.4% of winter barley accessions were collected or originated from the Soviet Union, there was no clear predominant place of collection for this type of barley, which was reflected by a more uniform frequency distribution of accessions across collection places (Table 5). Furthermore, the dataset contains an additional folder with the BLUEs of accessions, included in the files "BLUEs_Spring.txt" and "BLUEs_Winter.txt" (Fig. 3), which were estimated based on the enhanced historical data files. The corresponding study files are labeled as "s_Spring_Barley.txt" and "s_Winter_Barley.txt".
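As a minimal illustration of how such tab-delimited study files might be read, the following Python sketch parses a small in-memory table and converts the day.month.year sowing dates. The column header `accession_id`, the accession codes, the four-digit year, and the two sample rows are hypothetical; the actual headers and values should be taken from the files in the repository.

```python
import csv
import io
from datetime import datetime


def read_study_table(handle):
    """Read a tab-delimited table into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def parse_sowing_date(text):
    """Parse a day.month.year string; a four-digit year is assumed here."""
    return datetime.strptime(text.strip(), "%d.%m.%Y").date()


# Made-up two-row sample standing in for a study file (hypothetical headers).
sample = io.StringIO(
    "accession_id\tsowing_date\tharvest_year\tcountry\n"
    "ACC_0001\t15.03.1970\t1970\tEthiopia\n"
    "ACC_0002\t02.10.1969\t1970\tGermany\n"
)
rows = read_study_table(sample)
sowing_dates = [parse_sowing_date(row["sowing_date"]) for row in rows]
```

For a file on disk, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where the encoding is likewise an assumption.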

Technical Validation
Validation involves outlier detection, bias assessment for first- and second-degree statistics, and validation of the BLUEs of accessions. Methods, results and discussion of this strategy were described in a previous research publication 25 . Here we briefly describe the validation methods.

Enhancing the quality of the historical data set by implementing an outlier detection approach
Outliers may jeopardize the quality of the data, negatively affecting statistical estimates 34,35 . The presence of outliers in the historical dataset (Data Citation 1) is plausible because the data were assembled over seven decades under fluctuating conditions of data and seed regeneration management, as well as contrasting weather conditions across years, among others. Both the assessment and the management of outliers in unbalanced historical datasets are challenging. We used an outlier inspection approach that combines the rescaled median absolute deviation of standardized residuals with a Bonferroni-Holm test to flag data points as outliers 35 . A data point was declared an outlier by the implemented test according to a predefined significance threshold of p-value < 0.05. We removed the outliers from the historical data set to obtain an enhanced historical dataset (Data Citation 1). Considering genotypes and years as random

Table 2. Estimates on historical data and enhanced historical data sets for variance components of genotypes (σ²_G), years (σ²_y), and errors (σ²_e); number of environments (E) and heritability (h²) for flowering time (FT), plant height (PH), and thousand grain weight (TGW) of up to 2,967 winter and up to 9,898 spring barley accessions evaluated in up to 69 years of seed regeneration 25

Studying the potential bias in estimating first- and second-degree statistics for different missing data scenarios
On average, seed regeneration activities before 1976 were carried out every 3 years for each accession. This was mainly because seed storage was formerly performed at room temperature 27 . However, this condition led to the evaluation of blocks of accessions corresponding to the year in which they entered the genebank, which often reflects specific collection hotspots. Therefore, the missing value structure of the collected phenotypic data potentially deviates from a random pattern. Since estimates of first- and second-degree statistics are potentially biased by the missing data structure, a resampling study was performed considering three missing data scenarios. Firstly, a balanced dataset was derived from the enhanced historical dataset of spring barley. This balanced set included phenotypic records for FT and PH available for the years 1948, 1951, 1954, 1957, 1961, and 1970 for 400 spring accessions. These accessions were collected in 10 geographic places: Turkey (99), Greece (91), Germany (56), United States of America (49), Bulgaria (36), Sweden (18), Japan (14), Albania (13), Austria (12), and countries of the former Soviet Union (12). Later, the balanced dataset was sampled based on three missing data scenarios as follows: in Scenario 1, phenotypic records were randomly sampled from three out of six test years for each accession, which amounted to 1,200 phenotypic data points in total. In Scenario 2, the 400 accessions were randomly grouped into 10 clusters and the phenotypic data for each group were randomly subsampled from 3 years, gathering 1,200 phenotypic data points in total. In Scenario 3, the 10 places of collection were considered as groups of accessions and phenotypic data from 3 years were randomly subsampled for each group, resulting in 1,200 phenotypic data points. Each scenario was sampled 100 times. Biases in estimating variances of genotypes and errors were calculated as $(\hat{d} - d)/d$, where $\hat{d}$ stands for the parameter estimated in each sampling run and $d$ corresponds to the parameter estimated from the balanced dataset.
Moreover, we performed a linear regression of the BLUEs computed for each of 100 resampling runs on the BLUEs from the balanced data set. In this respect, the intercept, the slope, and the coefficient of determination of the linear regression model were considered to measure bias.
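The sampling scenarios and bias measures described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used in the study: all function names are hypothetical, the accession labels and the grouping into 10 clusters of 40 are made up, and the bias is computed as the relative deviation of a resampled estimate from the balanced-data estimate.

```python
import random

YEARS = [1948, 1951, 1954, 1957, 1961, 1970]


def sample_per_accession(accessions, years, k, rng):
    """Scenario 1: draw k years independently for every accession."""
    return {acc: sorted(rng.sample(years, k)) for acc in accessions}


def sample_per_group(groups, years, k, rng):
    """Scenarios 2 and 3: draw k years once per group; members share them."""
    kept = {}
    for members in groups.values():
        chosen = sorted(rng.sample(years, k))
        for acc in members:
            kept[acc] = chosen
    return kept


def relative_bias(estimate, reference):
    """Deviation of a resampled estimate relative to the balanced-data value."""
    return (estimate - reference) / reference


def linear_fit(x, y):
    """Ordinary least squares of y on x: intercept, slope, and R squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)


rng = random.Random(1)
accessions = [f"acc{i}" for i in range(400)]  # made-up labels
scenario1 = sample_per_accession(accessions, YEARS, 3, rng)
# Made-up grouping: 10 clusters of 40 accessions each.
groups = {g: accessions[g * 40:(g + 1) * 40] for g in range(10)}
scenario2 = sample_per_group(groups, YEARS, 3, rng)
n_points = sum(len(years) for years in scenario1.values())
```

With an intercept near 0, a slope near 1, and a coefficient of determination near 1, the resampled BLUEs would agree closely with those from the balanced data set.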

Resampling procedure for assessing the precision in computing BLUEs of accessions
Precise estimates of trait performance are pivotal for decision makers in research and breeding. Thus, we performed a resampling procedure 36,37 to assess the precision in estimating BLUEs. The enhanced data set of spring barley was randomly split into two equally sized subsets. Only accessions for which phenotypic data were available in both subsets were considered in each of the 100 resampling runs. Therefore, across 100 runs, 3,691, 3,474, and 3,066 accessions were included on average for FT, PH and TGW, respectively. We fitted the model specified in Equation (1) to estimate the BLUEs of accessions in both subsets. Subsequently, the precision of estimation was computed as the correlation of BLUEs of accessions between subsets.
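The logic of this split-half procedure can be sketched in Python under a strong simplification: per-accession means are used as a stand-in for the BLUEs, so the mixed model of Equation (1) is not fitted here. The function names, the record layout of (accession, value) pairs, and the example data are assumptions for illustration only.

```python
import random
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_precision(records, rng):
    """Split records at random into two halves and correlate accession means.

    Accession means replace the BLUEs of the original procedure (assumption).
    Returns the correlation and the number of accessions present in both halves.
    """
    shuffled = list(records)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    means = []
    for subset in (shuffled[:half], shuffled[half:]):
        by_accession = defaultdict(list)
        for accession, value in subset:
            by_accession[accession].append(value)
        means.append({a: sum(v) / len(v) for a, v in by_accession.items()})
    # Keep only accessions with data in both subsets, as in the text above.
    shared = sorted(set(means[0]) & set(means[1]))
    r = pearson([means[0][a] for a in shared], [means[1][a] for a in shared])
    return r, len(shared)


# Made-up noise-free data: 20 accessions with four identical records each.
records = [(f"acc{i}", float(i)) for i in range(20) for _ in range(4)]
r, n_shared = split_half_precision(records, random.Random(7))
```

Repeating the split many times and averaging the correlation would mirror the 100 resampling runs described above.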

Usage Notes
Maximizing the use of genetic resources will benefit current and future efforts to breed new cultivars that are required to address needs in food security, climate resilience, and sustainability 16,38,39 . However, restricted resources limit the systematic phenotyping of germplasm collections 9,18,19 . The strategy described here is based on data that were routinely collected by curators during seed multiplication cycles and is embedded in the scripts used for outlier detection and BLUEs computation. The scripts run for a single trait and one growth habit at a time. This strategy could be adapted to other genebanks for the validation of their own data in order to increase the amount of data for well characterized accessions at no extra cost. The value of the data will be further leveraged by genotypic information which will become publicly available soon for the IPK barley collection. In the future, both phenotypic and genotypic information will facilitate the implementation of genomic prediction, which is expected to further boost the utilization of genetic resources for research and breeding 19,[40][41][42] . By providing the investigated data in the ISA-Tab format and publishing them via DOI, all research data and the presented results are available in a FAIR way 20 and can be easily re-used.