Background & Summary

Cereals are staple food and a valuable source of nutrients around the world1. Among them, barley (Hordeum vulgare sp.) is the fourth most produced crop2. The main end-uses of barley are brewing, feed, and food production3. In terms of crop adaptation barley can be classified into two distinct gene pools: winter and spring type46. While winter type barley needs vernalization for flowering stimulation, spring type barley does not require it3. Barley has a diploid genome and its 7 chromosomes represent the base genome of all Triticeae species. For this and many more reasons barley has become a model organism in cereal genetics and genomics7. In addition, the availability of a high quality reference sequence of the barley genome, well established protocols for genome editing and elaborated approaches for genomic selection will greatly benefit barley breeding in the future711.

Establishing germplasm collections has involved assemblage and preservation of the existing allelic diversity and their utilization12,13. In the case of barley, more than seven decades of major efforts have resulted in about half a million ex situ accessions worldwide1315. Germplasm collections are an outstanding resource of genetic diversity for research and plant improvement. For instance, genebank collections represent a rich source of unexplored trait variation which is absent in public and private breeding programs. This variation could potentially boost selection gain in plant breeding to increase both yield potential and sustainability and to facilitate adaptation to global change16,17. However, leveraging genetic resources of public germplasm collections is still a challenge due to the lack of phenotypic information and the high investments required for the systematic characterization of plant material9,18,19. Recently, a method for the exploitation of germplasm based on genomics was proposed19. In this context, genebanks are encouraged to maximize the reuse of both phenotypic and genotypic data by the implementation of the FAIR principles referring to: Findability, Accessibility, Interoperability, and Reusability20. For example, historical phenotypic records for traits with agronomical relevance have been accumulated during the seed regeneration process at genebanks but are not publicly available or the access to them is limited16,1923.

This study presents original and ready-to-use processed phenotypic data with the aim of leveraging the use of historical information collected during seed regeneration. The data correspond to historical records on traits flowering time (FT), plant height (PH), and thousand grain weight (TGW) accumulated for seven decades plus the outlier status of all data points and the Best Linear Unbiased Estimations (BLUEs) for winter and spring barley accessions pertaining to these traits. This historical information belongs to the barley collection of the Federal ex situ Genebank for Agricultural and Horticultural Plant Species hosted at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) in Gatersleben (Germany). Conserving and managing a total of ~22,000 accessions, the IPK Genebank manages the sixth largest collection worldwide, which covers a broad range of phenotypic variation14,15,24,25. This data publication complements a previous research publication25 which focuses on the valorization of genetic resources by developing, validating and employing a curated data set from seed regeneration. Moreover, part of these BLUEs was recently used to show the potetial of genome wide association for FT in genebank materials of spring barley26.

Methods

Plant material

The barley collection at the IPK amounts to ~22,000 accessions. These accessions were assembled by means of worldwide collecting expeditions, seed exchange with other institutes, and donations. Accession-related information is being documented in the genebank information system of the IPK (GBIS)22. This study includes FT, PH, and TGW data recorded during seed regeneration for approximatelly 60% of the barley accessions.

Seed regeneration produced an unbalanced historical data source

Seed regeneration is aimed to supply seed requirements for (i) safeguarding the stored genetic diversity when sample size and seed viability drop beneath a pre-stablished treshold, (ii) conserving new genotypes within the genebank, (iii) research, and (iv) fulfilling external demands of germplasm27. The seed regeneration routine in the genebank generated non-orthogonal phenotypic data23,28,29 across traits and years, e.g., only 12 accessions were evaluated for TGW in 1984 while a record number of 4,789 accessions were characterized in 1970 for PH. Additionally, there were 1% of cases when accessions were multiplied more than once in a year. One of the reasons for this was, for instance, the need to check whether the plant material required vernalization or not. Moreover, the introduction of cold storage in the year 1976 abruptly decreased the periodicity of data generation during seed regeneration, because storage time switched from ~3 to >20 years27 . Furthermore, the use of the collection, or parts of it, in research projects had a positive impact in the amount of data collected per year. For example, the protein screening of cereal genetic resources carried in 1970 brought the largest number of regenerated accessions in a single year (Fig. 1). The data of the present study is based on seed regenerations during the 1946–2015 period. Seed regeneration for barley was conducted in Gatersleben since 1946 in different seasons according to the growth habit of accessions. In more detail, winter accessions were planted between September and December while spring accessions were sown from February until April.

Figure 1
figure 1

Number of accessions tested for flowering time (FT), plant height (PH), and thousand grain weight (TGW) for the time period from 1946 until 2015 for spring (left) and winter barley (right).

Traits assessed on seed germplasm regeneration

Each accession was multiplied using plots of at least 3 m2 and traits FT, PH, and TGW were assessed during seed regeneration. FT stands for the number of days when 50% of the plants reached flowering. For winter barley, FT is expressed in days after the 1st of January of each year. For spring barley, FT was expressed in days after the sowing date. PH was assessed in cm from the soil surface to the top of spike including awns. TGW was determined after seed harvest and expressed in g on a ~12.5% grain moisture basis. Seeds were harvested at maturity stage and were temporary stored at room temperature. Before the 2005/2006 season the standard protocol for TGW assessment at the genebank was based on the average weight of three samples, each containing 100 grains, which was then extrapolated to 1000 grains. From the 2005/2006 season onwards TGW has been determined by using an automatic Marvin digital seed analyzer and considering a seed sample with up to 100 grains. The data management at the genebank was manual until 2011. In this sense, the information was first recorded in field books, then transferred to card files and lately digitized for data storage and computational analysis. From 2011 onwards Personal Digital Assistants (PDAs) were used.

Methods for data processing

Statistical model

No formal field experimental design was used during seed regeneration while the dataset contains only 1% of cases when accessions were evaluated more than once in a year. For this reason, an unreplicated completely randomized experimental design was assumed for each regeneration cycle during data processing. According to the assumed design, the experimental unit corresponded to a plot. Phenotypic data of each barley type were analyzed separately based on the following mixed model:

(1) T r a i t µ + G e n o t y p e s + Y e a r s + E r r o r ,

where μ is the population mean and “Genotypes” were the genetic effects of accessions, which were assumed as fixed factors, while years and error were treated as random. Variances of errors were modelled as specific for each year. In a first step, Equation (1) was used for outlier detection. Later, the BLUEs of accessions were computed by re-fitting the model in Equation (1) but using and enhanced historical dataset in which data points detected as outliers during the first step were discarded.

Code availability

Mixed model equations were solved using the Restricted Maximum Likelihood (REML) algorithm as implemented in ASReml-R30. All described statistical approaches were performed in R environment (Version 2.15.3)31. Scripts used for outlier detection and estimating BLUEs are included together with the dataset in the public repository described below (Data Citation 1). The use of the code requires the download of the datasets, save them in a working directory and set the working directory in the scripts. The scripts run for a single trait according to one growth habit. For instance, the example scripts run for flowering time (FT) for spring barley. In this case, the resulting files are labeled as “Data.corrected.FT.txt” or “BLUEs.FT.txt” for outlier detection and estimating BLUEs, respectively. In this regard, this study involves 12 outputs that were compiled in four files which are described below.

Data records

The data compiled for this study is publicily available in the Plant Genomics and Phenomics Research Data Repository (PGP) (http://edal-pgp.ipk-gatersleben.de/)32 and can be accessed here as (Data Citation 1). The dataset is formated using the ISA-Tab format33 to guarantee a uniform and easy-readable semantical description. It contains the original data as well as the processed data. While the investigation file describes the general project information, the two study files (“s_Spring_Barley.txt” and “s_Winter_Barley.txt”) provide information about the investigated accessions. They contain information such as: (i) accession identifiers, e.g., the accession ID as an unique and stable database generated code at the genebank and accession number wich is typically used for researchers but is not stable over the time, (ii) sowing_date corresponding to day.month.year, (iii) harvest_year, (iv) country as geographic place of collection reported by donors or collectors, and (v) the comment column which shows two groups of accessions whose countries are mentioned in the manuscript as Germany and Soviet Union. In this regard, the group Germany includes accessions from Germany and [Former] East Germany. The group Soviet Union stands for accessions from [Former] Union of Soviet Socialist Republics, Armenia, Azerbaijan, Belarus, Georgia, Estonia, Kyrgystan, Latvia, Lithuania, Moldova, Russia, Tajikistan, Turkmenistan, Ukraine and Uzbekistan. Furthermore, some modifications were done with respect to the original data, e.g. the harvest year 1946 contained only 2 records for PH in winter type barley, which caused serious convergency problems during the fitting of mixed models. For this reason, these two datapoints were removed from the PH records of winter barley.

The assay files of the present study were separated in the historical phenotypic data (“a_Historical.Data_Spring.txt” and “a_Historical.Data_Winter.txt”), which was provided from the IPK genebank information system and was first screened for outliers. Then, outliers were excluded to produce the enhanced assay files (“a_Enhanced_Historical.Data_Spring.txt” and “a_Enhanced_Historical.Data_Winter.txt”). These files accomodated records for up to 2,967 and 9,898 winter and spring accessions, respectively (Table 1). Each accession was phenotyped from 1 to 22 years (Fig. 2) and in each year a range from 12 to 4,789 accessions, across traits, were evaluated (Fig. 1). The heritability for all traits was high and it increased further by up to 17% when applying an outlier correction25 (Table 2). The Pearson’s correlation coefficient (r) estimated on the enhanced data for pairs of years with at least 50 overlapping accessions ranged from 0.60 to 0.72 (Table 3). The precision in computing the BLUEs amounted to 0.89 for TGW and 0.85 for both FT and PH, respectively25. Moreover, the maximum coefficient of variation of the year on the enhanced data set was 0.22 (Table 4). Ninety percent of these genetic resources were collected or originated from 30 geographic places. Ethiopia with 32.1% of accessions was a predominant origin for spring barley followed by 7.2% from Germany. Interestingly, although 12.4% of winter barley accessions were collected or originated from the Soviet Union, there was not a clear predominant place of collection for this type of barley which was reflected by a more uniform frequency distribution of accessions according to collection places (Table 5). Furthermore, the dataset contains an additional folder with the BLUEs of accessions included in the files “BLUEs_Spring.txt” and “BLUEs_Winter.txt” (Fig. 3), that were estimated based on the enhanced historical data files. The corresponding study files are labeled as “s_Spring_Barley.txt” and “s_Winter_Barley.txt”.

Table 1 Number of phenotypic records and accessions tested for flowering time (FT), plant height (PH), and thousand grain weight (TGW) for winter and spring barley assessed for the time period from 1946 until 2015.
Figure 2: Number of years in which accessions were tested for flowering time (FT), plant height (PH), and thousand grain weight (TGW).
figure 2

(a) Spring, and (b) winter barley growth habits.

Table 2 Estimates on historical data and enhanced historical data sets for variance components of genotypes ( σ ˆ G 2 ), years ( σ ˆ y 2 ), and errors ( σ ˆ e 2 ); number of environments (E) and heritability (h2) for flowering time (FT), plant height (PH), and thousand grain weight (TGW) of up to 2,967 winter and up to 9,898 spring barley accessions evaluated in up to 69 years of seed regeneration25.
Table 3 Pearson’s correlation coefficient (r) estimated between pairs of years with at least 50 overlapping accessions for the time period from 1946 until 2015 for spring and winter barley.
Table 4 Mean ± standard deviation (SD) and range for the coefficient of variation of the residuals calculated for each year for flowering time, plant height, and thousand grain weight for spring and winter barley on the enhanced dataset.
Table 5 Distribution of spring and winter barley by geographic place of collection reported by donors or collectors (origin).
Figure 3: Matrix plot of dispersion and relationships among Best Linear Unbiased Estimators (BLUEs) of accessions for flowering time (FT), plant height (PH), and thousand grain weight (TGW).
figure 3

Winter and spring barley are represented in blue and yellow colors, respectively. The diagonal shows the dispersion for each trait as violin plots whereas the upper and lower triangle stands for the relationships among traits. The Pearson’s correlation coefficient (r) is presented in the lower right corner of every plot for each trait combination. Correlations that significantly differ from zero (p-value < 0.0001) are indicated with * while those being not significant are indicated with n.s.

Technical Validation

Validation involves outlier detection, bias assesment for first and second degree statistics and validation of BLUEs of accessions. Methods, results and discussion of this strategy were described in a previous research publication25. However, here we make a brief description of validation methods.

Enhancing the quality of the historical data set by implementing an outlier detection approach

Outliers may jeopardize the quality of the data negatively affecting statistical estimates34,35. The presence of outliers in the historical dataset (Data Citation 1) is plausible because the data was assembled for seven decades under fluctuating conditions of data and seed regeneration management, as well as contrasting weather conditions across years, among others. Both, the assessment and management of outliers in unbalanced historical datasets are challenging. We used an outlier inspection approach by combining re-scaled median absolute deviation of standardized residuals with a Bonferroni-Holm test to flag data points as outliers35. A data-point was declared as outlier by the implemented test according to a predefined significance threshold of p-value < 0.05. We removed the outliers from the historical data set to obtain an enhanced historical dataset (Data Citation 1). Considering genotypes and years as random effects, Equation (1) was re-fitted in order to check the impact of outlier exclusion on variance components and heritability. Heritability was computed as follows: h 2 = σ ˆ G 2 σ ˆ G 2 + σ ˆ e 2 Y ¯ , where σ ˆ G 2 denotes the estimator of the genetic variance, σ ˆ e 2 corresponds to the average variance estimated for the errors, and Y ¯ stands for the average number of years when genotypes have been tested. Assuming random genotype and fixed year effects on Equation (1), the coefficient of variation of the year was computed as:= σ ˆ e 2 Y E , where σ ˆ e 2 corresponds to the year-specific error variance and YE refers to the year effect.

Studying the potential bias in estimating first- and second-degree statistics for different missing data scenarios

On average, seed regeneration activities before 1976 were carried out every 3 years for each accession. This was mainly because seed storage was formerly performed at room temperature27. However, this condition led to evaluate blocks of accessions corresponding to the year when they entered the genebank, which is often reflecting specific collection hotspots. Therefore, the missing value structure of the phenotypic data collected is potentially deviated from the random scenario. Since estimating first and second degree statistics is potentially biased by the missing data structure, a resampling study was performed considering three missing data scenarios. Firstly, a balanced dataset was derived from the enhanced historical dataset of spring barley. This balanced set included phenotypic records for FT and PH available for the years 1948, 1951, 1954, 1957, 1961, and 1970 for 400 spring accessions. These accessions were collected in 10 geographic places: Turkey (99), Greece (91), Germany (56), United States of America (49), Bulgaria (36), Sweden (18), Japan (14), Albania (13), Austria (12), and countries of the former Soviet Union (12). Later, the balanced dataset was sampled based on three missing data scenarios as follows: in Scenario 1, phenotypic records were randomly sampled from three out of six test years for each accession, which amounted to 1,200 phenotypic data points in total. In Scenario 2, the 400 accessions were randomly grouped into 10 clusters and the phenotypic data for each group was randomly subsampled from 3 years gathering 1,200 phenotypic data points in total. In Scenario 3 the 10 places of collection were considered as groups of accessions and phenotypic data from 3 years was randomly subsampled for each group resulting in 1,200 phenotypic data points. Each scenario was sampled 100 times.

Biases in estimating variances of genotypes and errors were calculated as d ˆ d d , where d ˆ stands for the estimated parameters in each sampling run and d corresponds to the parameter estimated from the balanced dataset. Moreover, we performed a linear regression of the BLUEs computed for each of 100 resampling runs on the BLUEs from the balanced data set. In this respect, the intercept, the slope, and the coefficient of determination of the linear regression model were considered to measure bias.

Resampling procedure for assessing the precision in computing BLUEs of accessions

Precise estimates of trait performance are pivotal for decisions makers on research and breeding. Thus, we performed a resampling procedure36,37 to assess the precision in estimating BLUEs. The enhanced data set of spring barley was randomly split into two equally sized subsets. Only accessions for which phenotypic data was available in both subsets were considered in each of the 100 resampling runs. Therefore, across 100 runs 3,691, 3,474, and 3,066 accessions were included on average for FT, PH and TGW, respectively. We fitted the model specified in Equation (1) to estimate the BLUEs of accessions in both subsets. Subsequently, precision of estimation was computed as the correlation of BLUEs of accessions between subsets.

Usage Notes

Maximizing the use of genetic resources will benefit current and future efforts to breed new cultivars that are required to address needs in food security, climate resilience, and sustainability16,38,39. However, restricted resources limit the systematic phenotyping of germplasm collections9,18,19. The strategy described here is based on data that was routinely collected by curators during seed multiplication cycles and is embedded in the scripts used for outlier detection and BLUEs computation. The scripts run for a single trait according to one growth habit. This strategy could be adapted to other genebanks for the validation of their own data in order to increase the amount of data for well characterized accessions at no extra cost. The value of the data will be further leveraged by genotypic information which will become publicly available soon for the IPK barley collection. In the future, both, phenotypic and genotypic information will facilitate the implementation of genomic prediction which is expected to further boost the utilization of genetic resources for research and breeding19,4042. By providing the investigated data using the ISA-Tab format and publishing them via DOI, all research data and the presented results are available in a FAIR-way20 and can be easily re-used.

Additional information

How to cite this article: Gonzalez, M. Y. et al. Unbalanced historical phenotypic data from seed regeneration of a barley ex situ collection. Sci. Data. 5:180278 doi: 10.1038/sdata.2018.278 (2018).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.