Efficient curation of genebanks using next generation sequencing reveals substantial duplication of germplasm accessions

Genebanks are valuable resources for crop improvement through the acquisition, ex-situ conservation and sharing of unique germplasm among plant breeders and geneticists. With over seven million existing accessions and increasing storage demands and costs, genebanks need efficient characterization and curation to make them more accessible and usable and to reduce operating costs, so that the crop improvement community can most effectively leverage this vast resource of untapped novel genetic diversity. However, the sharing and inconsistent documentation of germplasm often results in unintentionally duplicated collections with poor characterization and many identical accessions that can be hard or impossible to identify without passport information and unmatched accession identifiers. Here we demonstrate the use of genotypic information from these accessions using a cost-effective next generation sequencing platform to find and remove duplications. We identify and characterize over 50% duplicated accessions both within and across genebank collections of Aegilops tauschii, an important wild relative of wheat and source of genetic diversity for wheat improvement. We present a pipeline to identify and remove identical accessions within and among genebanks and curate globally unique accessions. We also show how this approach can also be applied to future collection efforts to avoid the accumulation of identical material. When coordinated across global genebanks, this approach will ultimately allow for cost effective and efficient management of germplasm and better stewarding of these valuable resources.


Results
Sequencing and SNP genotyping. We genotyped a large number of Ae. tauschii accessions from WGRC ( Fig. 1), CIMMYT and PAU genebanks. Accessions were grouped into two separate sets; Set 1 consisting of 568 accessions from WGRC and 187 accessions from PAU genebanks, and Set 2 consisting of 388 accessions from CIMMYT genebank. GBS produced ~2 billion raw reads 100 bp in length for Set 1, and DArTSeq generated ~1 billion 77 bp long reads for Set 2. Median count for Set 1 samples was ~1.8 million (~1.5 billion total) and ~2 million (~855 million total) for Set 2. Both sets had an overlapping distribution of barcoded read counts ( Supplementary Fig. S1A), however, except few outliers Set 1 had dense distribution around the median. Using the reads from Set 1 samples, discovery step in TASSEL4 GBS pipeline found a total of ~8.5 million unique 64 bp long tags. On an average, each sample contributed about 281 K unique tags. Set 1 and Set 2 samples had median tag counts of ~306 K and ~214 K per sample, respectively ( Supplementary Fig. S1B). These tags were aligned internally across accessions to find putative SNP sites, which resulted in 91,545 SNPs. Proportion of heterozygosity Scientific RepoRts | (2019) 9:650 | https://doi.org/10.1038/s41598-018-37269-0 per sample ranged from 0-15%, and missing data per SNP ranged from 0.6-78.9% ( Supplementary Fig. S1C,D). Population-level SNP filtering with less than 50% missing data, retained 29,555 SNPs that were used for cluster analysis. For percent identity by state (pIBS), 20,844 pairwise comparisons were performed on average between any two accessions.
Clustering and identifying identical accessions. Two different analyses were performed to identify identical accessions; a cluster analysis and allele matching. Cluster analysis ( Supplementary Fig. S2) provides a quick method to cluster accessions based on genetic distances, however it cannot find identical accessions per se.
For curating genebanks, cluster analysis should be used as a first step to group phenotypically cryptic accessions outside of the species under study and identify other outliers. From the cluster analysis, we observed the strong population structure between lineage 1 and lineage 2 that is known and previously reported in Ae. tauschii 29 . As expected, we could assign all accessions into two large clusters, and identified three outliers which were removed from subsequent analysis ( Supplementary Fig. S2). Accession TA3429 was found to be an outlier in STRUCTURE analysis. Two other accessions, one each from PAU and CIMMYT, clustered with TA3429 to form an outlier group. Corroborated by allele matching analysis, these outliers did not match with any other accession, supporting evidence that they have been either misidentified as Aegilops tauschii or could be the hybrids between two lineages. Contrary to cluster analysis, allele matching provides an absolute percent IBS coefficient that can be used to identify identical accessions. Based on allele matching, different accessions had pairwise identity ranging from 37.5-99.9% ( Supplementary Fig. S3). Each genebank resulted in a bimodal distribution of pIBS because of the strong population structure within Ae. tauschii. The higher pIBS peak represents the percent identity within subpopulations, and lower pIBS peak represents between subpopulations. With genotyping error, it is not possible to expect a 100% allelic identity for accession that should be considered the same. For this study, we implemented 99% allelic identity threshold for declaring accessions identical. This was initially based on expected sequencing error rates and confirmed with biological sample replicates. Minimum and maximum number of duplicated accessions were found in WGRC (25.88%) and PAU (54.01%), respectively, with CIMMYT having 43.04% duplicated accessions (Fig. 2) Fig. S4). After grouping the accessions across all genebanks, only 564 unique accessions were found, representing over 50% duplicated accessions across the combined collections.
Error rate and efficiency. To compute the error rate of the GBS method, 76 accessions from the WGRC were resequenced and used as biological replicates. Of these 76 accessions, 11 had pIBS <99% with their respective original samples. Three accessions had pIBS greater than 95%, two ranged from 93-95%, and the remaining raged from 44-89%. Using the equation 2, the overall error rate was computed to be 3.13%, which is higher than our 1% threshold. To investigate further, multiple seeds from these 11 accessions were planted, however, only eight accessions produced at least one plant. GBS was performed on these eight accessions as described below.
Four out of eight accessions produced only a single plant. These were resequenced and compared with their previously sequenced respective samples (original sample and biological replicate). As initially expected, all four resequenced samples matched with >99% pIBS with either the original sample or the respective biological replicates. Two of these accessions matched with their original sample and other two matched with their biological replicates. These results point to the possibility of sample contamination or mixup that resulted in bad GBS data in one of the two initial GBS runs. Another possibility is that the original seed source was heterogeneous. Seed or sample mixture during the genotyping process of large number of samples is possible, however, we attempted to test the latter conjecture. Remaining four out of eight accessions (TA1581, TA1589, TA1714 and TA2468) produced multiple plants that allowed us to test our hypothesis that the original seed source was heterogeneous. The final GBS was performed on each plant individually and compared with their respective original samples and biological replicates. TA1581 and TA1589 matched nicely with their original sample and all other replicates within this GBS run, but not the previous biological replicate. This points to the possibility that the sample mixup might have happened during the sequencing of previous biological replicates for these two accessions. In contrast, resequenced samples for TA2468 matched with >99% identity with the previous biological replicate and all other samples within this run, but failed to match with the original GBS. This again points to the possibility that the sample contamination or mixup might have happened during the original GBS.
For the final TA1714, a different pattern was observed. Two of the four resequenced samples matched with >99% identity with the original GBS, and the other two matched with the biological replicate. This supports our hypothesis and presents an evidence that the genebank seed source might be heterogeneous that results in lower pIBS. This is further evident in independent gliadin profiling discussed below. After removing these anomalous coefficients, the accuracy improved, and the error rate was reduced to only 0.48%, which is below our 1%  threshold. This supports that the GBS is a robust tool for finding identical accessions in genebanks with very little error due to sequencing and biological sampling within homogenous accessions.
Gliadin profiling. To independently validate the GBS results, gliadin profiling was run on eight independent groups from the cluster analysis in two separate runs. Gliadin proteins were selected for independent confirmation because of their ease of extraction and polymorphic profiling pattern. The first run included ten samples from four different groups ( Supplementary Fig. S5). Per the manufacturer's manual, bands lower than 10 kD were excluded as these are system bands that are produced by the small molecules interacting with lithium dodecyl sulfate (LDS) micelles in gel-staining solution and do not carry useful information. We observed matching banding pattern for the identical samples within the groups. For the second run ( Supplementary Fig. S6), samples were included from four other different groups. As expected, the samples within all groups have similar banding pattern with the following notes. Sample TA2457 (Supplementary Fig. S6 -Lane 7) has the similar banding pattern as other samples from Grp15 (Lanes 5 and 6) but has a smeared profile that might be due to higher amount of extracted protein. Sample TA1579 ( Supplementary Fig. S6, Lane 2) is the only accession from Grp187 and had very different banding pattern as compared to any other lane in this gel. Overall, matching banding pattern for the accessions within a group provides an independent evidence that the accession grouping based on GBS results are accurate.
Detecting accession heterogeneity. TA1714 was hypothesized to be a heterogeneous and TA2457 a homogeneous accession based on the initial GBS grouping. To detect and confirm the heterogeneity in the source seed, these two accessions were subjected to a final GBS run. For TA1714 and TA2457, 12 and 15 seeds of each accession were sampled, respectively, and subjected to both GBS and gliadin profiling. Each seed was split into two halves; one half was crushed for protein extraction and the remaining half with intact embryo was germinated for tissue collection for GBS.
As expected TA1714 showed heterogeneity in both the GBS and gliadin profiling by forming two sub-groups (Supplementary Fig. S7; red and blue box). Samples within each sub-group matched with >99% identity but had lower identity across sub-groups. Gliadin profiling was corroborated with GBS grouping of these samples. Contrary to TA1714, TA2457 did not show different banding pattern among individual plants from this accession ( Supplementary Fig. S8), which supports that TA2457 is homogeneous. Both gliadin profiling results match with the corresponding GBS sub-groups. This independent confirmation with gliadin profiling also supports that GBS can also be implemented to detect sample heterogeneity in the genebank accessions.
Imputing missing passport information. STRUCTURE analysis resulted in posterior probabilities ranging from 0.001-1. Higher posterior probability indicated higher likelihood that the accession belongs to a certain geographical group. Because these geographical groups are not completely isolated, we treated these groups as admixed populations, hence we used the posterior probability of 0.6 or more to assign an accession in a group. Using this analysis, we could assign 24 out of 26 accessions with missing geographical information into one of the geographic clusters. Two remaining accessions could not be assigned to any specific group because of lower probabilities (Supplementary Table S3).

Discussion
Genotyping platform and accuracy. Selecting a genotyping platform is important when a large number of samples are of interest. We genotyped 1143 Ae. tauschii samples using two sequence-based methods. Sequencebased methods, such as GBS, are inexpensive and robust for genotyping a diverse range of uncharacterized species with complex genomes 12 as compared to other methods such as SNP-chip. As no prior SNP information is required for sequence-based methods, they also control for ascertainment bias because the SNP discovery and genotyping is performed on the same samples. Here, we could use newly generated GBS data for Set 1 and previously generated DArTSeq for Set 2, to find duplicated accessions and efficiently curate the genebanks. Even though GBS only assays less than 1% of the genome, it resulted in an average of 20,844 pairwise SNP comparisons for allele matching. Therefore, GBS profiling with an error rate of only 0.48% makes it is a robust tool for this type of germplasm characterization.
Our GBS results were corroborated by gliadin profiling. GBS generates genome-wide biallelic markers, whereas gliadin protein profiling samples multiple alleles from a few loci. This can independently validate GBS results. However, we found that GBS results were robust and quantitative, whereas gliadin profiling was inconsistent and is potentially affected by the experimental conditions, such as the amount of protein extracted. Additionally, gliadin profiling is low-throughput and time consuming as compared to GBS, which further underscores that GBS is a preferred and more powerful tool for such studies.
Collaborating with other genebanks. The ability to combine existing genotypic datasets and germplasm sharing is of great interest for genebank collaborations. As a starting point, this strategy was used on a diploid progenitor of wheat to identify unique accessions within and among genebanks. Here a coordinated effort between WGRC, CIMMYT and PAU could compare 1143 Ae. tauschii accessions across the genebanks and identify both identical and unique accessions across all the genebanks. Collaboration and sharing of germplasm across genebanks has historically produced further duplication as accessions are often re-numbered with local identifiers and the associated original records and identifiers are lost over time. Therefore, the ultimate benefit of this strategy will be realized when this approach is implemented globally in collaboration across all genebanks. The sequencing technology has quickly reached a point to enable globally coordinated effort among all genebanks to genetically curate these collections and find unique accessions in them. These globally unique accessions should then be prioritized and likely shared with other genebanks for additional backup of those irreplaceable accessions. Defining globally unique accessions. We have correlated many accessions with lost or incorrect accession identifiers through genotyping these collections. Most misclassifications happen during sharing of germplasm resources between collections 30 , which leads to significant duplication and incorrect information. Historically, germplasm was frequently shared, however, the associated metadata often was lost or misidentified, resulting in inaccurate classification and the new identifiers assigned lead to duplications in and across collections. Re-collecting at the same locations and sharing germplasm among genebanks also results in duplications within and among genebanks. We found 26-54% redundant accessions within, and a total of over 50% redundant accessions among the WGRC, CIMMYT and PAU genebanks. As a starting point, we only performed this analysis for Ae. tauschii, but this strategy can be extended to other species with different ploidy levels stored in various genebanks. Genebanks worldwide are reported to hold over a million Triticeae accessions 31 . However, if our observations from this study hold true for other species, including the Triticeae tribe, we are vastly overestimating the number of unique germplasm accessions stored in the genebanks. Applying genetic curation across genebanks around the world should be made a coordinated priority. Once unique accessions are identified across all collections, a globally unique ID could be generated and duplicate accessions within and between collections noted. With global curation, genebanks can better coordinate and curate collections efficiently. Currently, 482 genebanks use the GENESYS database (www.genesys-pgr.org) with over 3.6 million accessions which could provide a platform for establishing global curation. Such curation could also help other research endeavors, such as recently funded CGIAR Genebank Platform 2017-2022, whose main goal is to make available 750,000 accessions of crops and trees to the research community for crop improvement.
Curating passport information and metadata. Often, vital metadata associated with shared germplasm, such as geographical or species information, is missing or incorrect. Species classification is a real challenge when dealing with cryptic species. A combination of existing genomic tools and statistical analyses can be used to infer those missing pieces. We used one such combination, GBS and cluster analysis and identified outliers ( Supplementary Fig. S2). Although it is very difficult to accurately assign an accession to a geographical region at district level resolution, genotypic similarities and ancestry relationships can be used to group them together with other accession that have the metadata available. We used such methods to assign 24 out of 26 accessions to a potential geographical region of origin. Meyer (2015) noted that researchers tend to use germplasm with complete passport information and other associated metadata 32 , which provides an incentive to collect and curate the accessions, and infer the missing information.
Future direction for germplasm collection. The role of wild germplasm in crop improvement and the need to collect and preserve as much wild diversity as possible is evident. However, a specific protocol is necessary to avoid the accumulation of redundant accessions and keep only unique ones. One such approach is presented here (Fig. 4). Briefly, when a new accession is collected or received, multiple seeds should be planted for tissue collection, and tissue should be collected in bulk from all plants, which was not ensured in this current study. We only sampled single seed from each accession, and it is possible that we missed within sample heterogeneity. Genotyping should be done on the bulked tissue from several seedlings. However, because Ae. tauschii is a highly self-pollinated species, it is very rare to find within accession heterogeneity unless due to seed mixture. Nevertheless, if possible, multiple independent samples should be sequenced for each accession. High level of heterozygous SNP calls, and mismatches within an accession, should point to the possibility of heterogeneous seed source that can be purified using single seed descent method. Bulked genotype data should be used for comparison to an existing genotyping database to find if the new accession is unique or identical to an existing collection. If unique, a new ID should be assigned, otherwise, the accession should be grouped together with the existing group of accessions. One such case study is explained below.
Case study for collecting new accessions. About 92% of the WGRC accessions were collected in 1950s and 60 s by various explorers and obtained through sharing among various genebanks. To fill the gaps in the collection sites and to preserve more genetic diversity, a recent collection expedition was conducted in June 2012 by WGRC researchers. During this expedition, a total of 44 accessions of Ae. tauschii were collected with passport information (blue dots; Fig. 1). Based on our analysis, 36 collected accessions (~82%) were unique in that they did not match with any other accession, either the newly collected or the already existing accessions. One newly collected accession had pIBS >99% with three already existing WGRC accessions that were collected decades ago. Seven accessions had pIBS >99% with at least one another new accession that were from the same geographic areas. Even though we collected 44 new accessions, but effectively only 36 (~82%) of them were unique. These findings support implementation of a protocol for efficiently curating the genebanks in place, which is based on genotypic data.

Conclusion
There are significant costs associated with running a genebank, beginning with acquiring an accession to storing and maintaining the germplasm. Because genebanks have limited funding and resources, identifying the duplicate accessions would result in a savings on both. Cost effective genotyping methods, such as GBS, can be applied for identifying duplicate accessions, and infer missing geographical and species information. Our results indicate that we are overestimating the diversity stored in the genebanks. Ultimately, identifying unique accessions within and across the genebanks will facilitate the better use of wild germplasm, make sharing more efficient, help breeders work with genetically diverse unique individuals and make better use of the untapped genetic diversity.  Table S1). The germplasm consisted of the accessions collected from natural habitat ( Fig. 1; WGRC collection distribution) and accessions received from other genebanks.

Methods
DNA extraction and genotyping. Two approaches for DNA extraction and the GBS libraries preparation were implemented for WGRC and PAU accessions (hereafter referred to as Set 1), and CIMMYT accessions (hereafter referred to as Set 2). For Set 1, approximately two inches of young leaf tissues from single two to three weeks old seedlings were collected in 96 well plates. Genomic DNA was extracted using Qiagen BioSprint 96 DNA Plant Kit (QIAGEN, Hilden, Germany) and quantified with Quant-iT ™ PicoGreen ® dsDNA Assay Kit (ThermoFisher Scientific, Waltham, MA, USA). At least one random well per plate was left blank with known position for quality control and library integrity. GBS libraries were prepared following the protocol from   33 . Briefly, the libraries were prepared in 95-plex using 384A adapter set. For complexity reduction, DNA for each sample was digested using two enzymes -rare cutter PstI (CTGCAG), to which the uniquely barcoded adaptors were ligated, and frequent cutter MspI (CCGG), to which the common reverse adapter was ligated. All samples from a single plate were pooled and amplified using polymerase chain reaction (PCR). Detailed protocol can be found on Wheat Genetics and Germplasm Improvement website (www.wheatgenetics.org). Libraries were sequenced on ten lanes in total on Illumina HiSeq2000 (Illumina, San Diego, CA, USA) platform at University of Missouri (UMC; Columbia, Missouri) and McGill Univesity-Génome Quebec Innovation Centre (Montreal, Canada) facility. To compute the error rate for the GBS, 76 WGRC accessions were randomly chosen, and were sequenced as biological replications (different seedlings) using the abovementioned protocol. For Set 2, Ae. tauschii accessions were planted in greenhouse in plots. Leaves from single seedling plants were taken and DNA was extracted using modified CTAB (cetyltrimethylammonium bromide) method 34 and quantified using NanoDrop spectrophotometer V2.1.0 (ThermoFisher Scientific, Waltham, MA, USA). Genotyping was performed at DArT, Canberra, Australia using DArTseq 35 methodology that has been used in recent years at CIMMYT [35][36][37] . DArTseq is a combination of diversity array technology (DArT) 38,39 complexity reduction and next-generation sequencing (NGS) methods. Two optimized enzyme sets, PstI-HpaII and PstI-HhaI, were used for complexity reduction. Samples were sequenced twice using two different 4 bp cutters on one end of the RE fragments (HpaII and HhaI) on a total of nine lanes. SNP discovery. Single nucleotide polymorphisms (SNPs) were discovered and typed with TASSEL-GBS 40 framework using TASSEL 4 GBS pipeline using an in-house written Java plugin and a modified Java pipeline without reference genome. Detailed description is provided in Poland, et al. 41 . In brief, 64 bp long valid tags (containing restriction cut site and a barcode) were extracted from each sample, and then similar tags (up to 3 bp differences) were internally aligned to find SNPs. To test putative tag pairs for as allelic SNP calls, Fisher exact test was performed on all aligned tag pairs with one to three nucleotide differences. Tag pairs that failed the test at P ≤ 0.001 were considered biallelic and converted to SNP calls 41  and can be considered alternate tags for SNP alleles at the same locus. Due to the differences in library preparation for Set 1 and Set 2, the tag discovery step was performed using only Set 1 accessions, and then the discovered tags were used as reference to produce SNPs for both sets.

Statistical analyses, allele matching and error computation. Data analyses and genotype curation
were performed in R statistical language 42 using a combination of custom R-scripts and available packages-ape 43 , data.table 44 , ggplot2 45 , and venneuler 46 . In addition to hierarchical clustering, an identity matrix was computed by pairwise comparison of accessions across all SNP sites. Hierarchical clustering group individuals based on the relative genetic distance between individuals, whereas, pairwise allele matching provides an absolute percent identity by state (IBS) coefficient between all individuals. Although, clustering can provide an independent support for allele matching, it is hard to interpret clustering to identify identical accessions. However, clustering can provide a quick method to identify obvious outliers and misclassified accessions. For clustering, population-level SNP filtering was performed to retain the SNPs with ≤50% missing data. In contrast, for pairwise comparison, only those SNP sites without missing data and homozygous in both individuals were used for comparison. A stringent threshold of 99% identity was used to consider two accessions the same to account for a 1% sequencing and alignment error rate. Accessions with ≥99% identity were considered identical within and/or across genebanks. Percent Identity by State (pIBS) was computed using the following equation 1: where, pIBS ij is the percent Identity by State for a given pair of accessions i and j, allele ix and allele jx are the alleles at x th SNP of accessions i and j, respectively, =sign represents an exact successful match (identity by state) between two alleles, and n is the total number of SNP sites in a pairwise comparison. The same equation was used to compute pIBS for an accession with its biological rep for error rate computation. In that case i and j represents the original accession and its biological replicate, respectively. Accessions with pIBS ≥99% (0.99) were grouped together in an arbitrary group number. Group size was computed as number of accessions in a group.
An error rate was computed using biological replicates for 76 accessions. Single to multiple seeds were grown for each accession, DNA was extracted, and sequencing performed as explained above. The error rate was computed using the following equation 2: where, n is the number of accessions with biological replicates, pIBS ij is the percent IBS for ith accession with its jth replicate, and m is the number of replicates for a given accession.
Gliadin profiling. To complement our GBS identity results, we extracted and profiled gliadin proteins from eight independent groups of identical accessions (Supplementary Table S2) that were found to be the same with GBS. A single seed per accession was crushed in pestle and mortar to fine flour and mixed with 70% ethyl alcohol and stored at −4 °C for 24 hours. Following the protein extraction, samples were prepared using Bio-Rad Experion Pro260 kit (Bio-Rad, Hercules, California) following manufacturer's instructions, and loaded on to an Experion Pro260 chip. The chips were read using Bio-Rad Experion automated electrophoresis system (Bio-Rad, Hercules, California). Virtual gel images were analyzed to compare accessions for identical protein banding patterns. For later comparison of protein profiling and GBS for two samples, multiple seeds were subjected to both procedures, where half of the seed was used for protein extraction and the other half with intact embryo was used for germination and tissue collection for DNA extraction.

Imputing passport information.
To facilitate the reduction of missing data and better curation of genebanks, we used genomic data and STRUCTURE 47 software to impute the missing passport information for 26 WGRC accessions. Filtering was performed to remove SNPs with missing data greater than 30% and minor allele frequency less than 0.05, which resulted in 11,823 SNPs. For imputation, all the accessions with available passport information were used as learning samples and the remaining with missing to be imputed. The STRUCTURE parameters were set as follows: 10,000 burn-in iterations followed by 10,000 MCMC iterations, POPDATA = 1, USEPOPINFO = 1, GENSBACK = 1, LOCIPOP = 1, and all other parameters left at default settings. This resulted in posterior probabilities for each accession belonging to a specific geographical group with certain probability.