The barley pan-genome reveals the hidden legacy of mutation breeding

Abstract

Genetic diversity is key to crop improvement. Owing to pervasive genomic structural variation, a single reference genome assembly cannot capture the full complement of sequence diversity of a crop species (known as the ‘pan-genome’1). Multiple high-quality sequence assemblies are an indispensable component of a pan-genome infrastructure. Barley (Hordeum vulgare L.) is an important cereal crop with a long history of cultivation that is adapted to a wide range of agro-climatic conditions2. Here we report the construction of chromosome-scale sequence assemblies for the genotypes of 20 varieties of barley—comprising landraces, cultivars and a wild barley—that were selected as representatives of global barley diversity. We catalogued genomic presence/absence variants and explored the use of structural variants for quantitative genetic analysis through whole-genome shotgun sequencing of 300 gene bank accessions. We discovered abundant large inversion polymorphisms and analysed in detail two inversions that are frequently found in current elite barley germplasm; one is probably the product of mutation breeding and the other is tightly linked to a locus that is involved in the expansion of geographical range. This first-generation barley pan-genome makes previously hidden genetic variation accessible to genetic studies and breeding.

Main

A staple food of ancient civilizations, today barley is used mainly for animal feed and malting. Barley is more adaptable to harsh environmental conditions than its close relative wheat, and maintains an important role in human nutrition in harsh climatic regions that include the Ethiopian and Tibetan highlands2. As in other crops, genomics has been a major driver of progress in barley genetics and breeding in the past decade3. The first draft reference genome for barley4, and its subsequent revisions5,6, have formed the basis for gene isolation7, compiling a single-nucleotide polymorphism (SNP) variation atlas for wild and domesticated germplasm8, and activating plant genetic resources9. At the same time, reduced-representation surveys of structural variation10 and map-based cloning11 have implicated variation in gene content and copy number in the control of agronomic traits. The concept of the pan-genome refers to a species-wide catalogue of genic presence/absence variation (PAV)12, or more generally, structural variation that affects (potentially non-coding) sequences of 50 or more base pairs (bp) in size. Although several methods of pan-genomic analysis that use short-read sequence data in the context of a single reference genome have been devised13, large and complex genomes require multiple high-quality sequence assemblies to capture and contextualize sequences that are absent in—or highly diverged from—a single reference genotype14. Progress in sequencing and genome mapping technologies has only recently made possible the fast and cost-effective assembly of tens of genotypes of large-genome plant species, such as barley (haploid genome size of 5 Gb)15.

Twenty barley reference genomes

The starting point for pan-genomics in barley was the comprehensive survey of species-wide diversity on the basis of the genome-wide genotyping of more than 22,000 barley accessions, mainly from the German national gene bank9. To achieve a good representation of major barley gene pools, we selected accessions that were located in the branches of the first six principal components from the previously published principal component analysis (PCA)9 (Fig. 1a, Extended Data Fig. 1), reflecting the key determinants of population structure: geographical origin, row type and annual growth habit. In addition to these gene pool representatives, our panel included the reference cultivar Morex5, two current or former elite malting varieties (RGT Planet and Hockett), two founder lines of Chinese barley breeding (ZDM01467 and ZDM02064), Golden Promise and Igri (two genotypes with high transformation efficiency16,17), Barke (a successful German variety and the parent of several mutant and mapping populations18,19) and one wild barley (H. vulgare subsp. spontaneum (K. Koch) Thell.) genotype from Israel (B1K-04-12, a desert ecotype collected at Ein Prat)20.

Fig. 1: Chromosome-scale sequences of 20 representative barley genotypes reveal large structural variants.

a, We selected 20 barley genotypes to represent the genetic diversity space, as revealed by PCA of genotyping-by-sequencing data of 19,778 domesticated varieties of barley9. Principal component (PC)3 and PC4 are shown. The proportion of variance explained by the principal components is indicated in the axis labels. Further principal components are shown in Extended Data Fig. 1a. b, Alignment of the pseudomolecules of chromosome 2H of the Morex and Barke cultivars. The inset zooms in on a 10-Mb inversion that is frequently found in germplasm from northern Europe. Co-linearity plots for all assemblies and chromosomes are shown in Extended Data Fig. 3a.

We constructed chromosome-scale sequence assemblies for 20 accessions (Extended Data Table 1). In brief, paired-end and mate-pair Illumina short reads were assembled into scaffolds of megabase (Mb)-scale contiguity (Extended Data Table 1). Scaffold assembly was done with Minia21 and SOAPDenovo22 following the TRITEX method6 (n = 16), DeNovoMagic from NRGene (n = 3) or W2rap23 (n = 1). We used 10X Genomics Chromium linked-reads and chromosome conformation capture (Hi-C) data to arrange scaffolds into chromosomal pseudomolecules using the TRITEX pipeline6 (Extended Data Table 1). A comparison of the short-read assembly of the Morex cultivar to a long-read assembly of this genotype generated from PacBio long reads showed high co-linearity at chromosomal scale, good concordance in gene space representation and similar power to detect PAV (Extended Data Fig. 2), indicating that short-read assemblies are amenable to pan-genomic analyses in barley. Although the assemblies of the 20 diverse accessions differed in contiguity and the extent of gap sequence in the intergenic space, they had a similar representation of reference gene models (Morex V2) and were highly co-linear to each other at the whole-chromosome scale (Fig. 1b, Extended Data Fig. 3). A similar proportion (about 80%) of the assembled sequence of each genotype was composed of transposable elements, with an average of 113,200 intact full-length long-terminal repeat retro-elements (LTRs) per assembly (Supplementary Table 1). However, we found pronounced differences in the number of shared intact full-length LTR locations: only 17 to 25% of full-length LTR locations present in the wild barley B1K-04-12 were shared at 98% sequence identity and 98% alignment coverage with any domesticated genotype (Extended Data Fig. 4). 
By contrast, more closely related domesticated genotypes shared between 53% and 67% of their full-length LTRs, consistent with previous reports of rapid sequence turn-over in the non-coding space in large-genome plant species24,25.

De novo gene annotation using Illumina RNA sequencing and PacBio Iso-Seq data (Supplementary Table 2) was performed for three genotypes: Morex (which has previously been reported6), Barke and the Ethiopian landrace HOR 10350 (Extended Data Fig. 5). Gene models defined on the basis of these three assemblies were consolidated and projected onto the remaining 17 assemblies (Extended Data Fig. 5). Between 35,859 and 40,044 gene models were annotated by projection in each assembly (Extended Data Table 1) with an average of 37,515 (s.d. = 896). The number of gene models was about 20% higher in the projections than in de novo annotations (Extended Data Fig. 5e), which indicates that some of the models lack transcript support: possible explanations for the discrepancy are highly tissue-specific expression or pseudogenization. The clustering of orthologous gene models yielded 40,176 orthologous groups. Of these, 21,992 occurred as a single copy in all 20 assemblies; 3,236 occurred in multiple copies in at least one of the 20 assemblies; 13,188 were absent from at least one assembly; and 1,760 were present in only one assembly. On average, 14.7% of gene models annotated in each assembly occurred in tandem arrays that comprised two or more adjacent copies. These results point to abundant genic copy-number variation between barley genotypes. Future transcriptomic studies will ascertain the effect of structural variants on gene expression.
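The four orthologous-group classes reported above add up to the total of 40,176, so they can be read as mutually exclusive. The following Python sketch shows one way such a classification could be computed from per-assembly copy numbers. The function name, the class labels and the reading of "multiple copies in at least one assembly" as "present in all assemblies, multi-copy in at least one" are assumptions made for illustration, not the authors' implementation.

```python
# Counts as reported in the text; the class labels are hypothetical.
REPORTED = {
    "single_copy_core": 21_992,   # single copy in all 20 assemblies
    "multi_copy_core": 3_236,     # multiple copies in at least one assembly
    "absent_in_some": 13_188,     # absent from at least one assembly
    "private": 1_760,             # present in only one assembly
}
TOTAL_GROUPS = sum(REPORTED.values())  # equals the 40,176 groups reported


def classify_orthogroup(copies, n_assemblies=20):
    """Assign an orthologous group to one of four mutually exclusive classes.

    copies maps assembly name -> gene copy number (0 means absent).
    The precedence of the classes is an assumption of this sketch.
    """
    present = [c for c in copies.values() if c > 0]
    if not present:
        raise ValueError("group has no members in any assembly")
    if len(present) == 1:
        return "private"
    if len(present) < n_assemblies:
        return "absent_in_some"
    if all(c == 1 for c in present):
        return "single_copy_core"
    return "multi_copy_core"
```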

Pan-genome as a tool for genetics and breeding

High-quality genome assemblies are a resource for ascertaining and providing context to structural variants, which can then be genotyped in a wider set of germplasm using low-coverage or reduced-representation sequence data. We used two complementary approaches to detect structural variation: assembly comparison and clustering of single-copy sequences to derive markers that can be scored in short-read data. We used the Assemblytics26 software to discover PAV by pair-wise comparison of 19 chromosome-scale assemblies to the Morex reference. We identified 1,586,262 PAVs, ranging in size from 50 to 999,568 bp, and observed an enrichment for low-frequency variants (Extended Data Fig. 6a, b). PAV density was higher in distal, gene-rich regions (Extended Data Fig. 6c), which are characterized by higher nucleotide diversity and recombination rates8. A total of 5,446 out of 5,602 deletions longer than 5 kilobases (kb) found in Barke relative to Morex were mapped genetically in the 90 recombinant inbred lines of the Morex × Barke population19 with highly concordant positions (Spearman correlation = 0.99) (Extended Data Fig. 6d), which provides support for the accuracy of the detected polymorphisms. At least one member of 18,562 (46%) groups of orthologous genes overlapped with structural variants discovered in the 20 sequence assemblies. As observed in other plant species27, resistance-gene homologues containing NB-ARC and protein kinase domains were frequently found among PAV genes (Supplementary Table 3).

Structural variants cover non-genic regions composed of repetitive sequence, making it hard to establish orthologous relationships or the presence of specific alleles from short-read data only. To derive quantitative estimates of the extent of pan-genomic variation and as a tool for genetic analysis such as association scans, we focused on single-copy regions extracted from each of the 20 assemblies and clustered into a non-redundant set of sequences (hereafter referred to as the ‘single-copy pan-genome’) (Extended Data Fig. 7a). The average cumulative size of single-copy sequence in each accession was 478 Mb (that is, 9.5% of the assembly genome). The total size of non-redundant single-copy sequence was 638.6 Mb, represented by 1,472,508 clusters with an N50 of 1,087 bp (Extended Data Fig. 7b). The single-copy sequence shared among all 20 genotypes amounted to 402.5 Mb, whereas 235.9 Mb were variable (that is, absent or present in higher copy number in at least one assembly) (Fig. 2a). On average, each of the 20 genotypes contained 2.9 Mb of single-copy sequence not present in any other assembly. As observed for transposable element divergence, the wild barley B1K-04-12 had the highest amount of unique single-copy sequence (Extended Data Table 1).

Fig. 2: Single-copy pan-genome and use of PAVs in association mapping.

a, Cumulative size of single-copy regions in genome assemblies of 20 barley genotypes. The genotypes were ordered according to the size of their unique single-copy sequence. b, Genome-wide association scan for lemma adherence on the basis of PAV markers. The black and red dots in the Manhattan plot denote single-copy sequences that are present and absent in Morex, respectively. c, The most highly associated PAV marker was a 16.7-kb region that is deleted in the naked accession HOR 7552 and that contains the NUD gene11. d, Allelic status of the NUD deletion in 196 domesticated varieties of barley. Normalized single-copy k-mer counts within the 16.7-kb region are shown for hulled (n = 160 genotypes) and naked varieties (n = 36 genotypes).

To test the suitability of the single-copy pan-genome for genetic analysis in a wider diversity panel without high-quality genome sequences, we collected whole-genome shotgun data (threefold coverage) for 200 domesticated and 100 wild varieties of barley (Supplementary Table 4). The abundance of 160,716 single-copy clusters that overlap structural variants was estimated by counting cluster-constituent k-mers (k = 31) in sequence reads of the diversity panel. In addition, we analysed genotyping-by-sequencing data of 19,778 gene bank accessions of domesticated barley9 using the same approach. Abundance estimates based on k-mers (hereafter referred to as ‘pan-genome markers’) showed that loci detected as single-copy sequence in one genome assembly can vary in copy number from zero to many in diverse germplasm (Extended Data Fig. 7c). A PCA of pan-genome markers genotyped in whole-genome shotgun and genotyping-by-sequencing data highlighted the same drivers of global population structure as SNPs (Extended Data Fig. 7d–g). In genome-wide association scans for morphological traits, pan-genome markers revealed—with a good signal-to-noise ratio—peaks that are consistent with previous reports9 (Fig. 2b, Extended Data Fig. 8). Notably, the pan-genome marker that was most highly associated with lemma adherence covered the NUDUM (NUD) gene11 (Fig. 2c). All varieties of naked barley—in which lemmas can be easily separated from grains—are thought to trace back to a single mutational event, deleting the entire NUD sequence11. Another putative knockout allele of NUD (nud1.g) that contains a likely disruptive SNP variant was recently found in Tibetan barley28. All 36 naked accessions in our panel contained the known deletion (Fig. 2d), indicating that broader sampling of barley diversity—with a particular focus on centres of (morphological) diversity—is needed to discover novel rare alleles by genomic analyses.

Compared to reference-free approaches for k-mer-based genome-wide association scans such as AgRenSeq29, trait-associated pan-genome markers are assigned with high precision to genomic positions, and aligning sequence assemblies in their vicinity provides immediate information about differences between haplotypes (Fig. 2c). Furthermore, the reduction of marker number by implicit clustering of k-mers into single-copy loci allows the use of standard mixed linear models30,31 to correct for genomic relatedness.

A map of polymorphic inversions

Chromosome-scale sequence assemblies can reveal large-scale rearrangements that are challenging to detect with other methods. Large inversions (more than 5 Mb in size) were prominent in the genome alignments of our 20 assemblies (Fig. 1b, Extended Data Fig. 3a, c). Previous reports on segregating inversions in barley are anecdotal and have focused on induced mutants32,33. To discover inversions in a broader set of germplasm, we mined patterns of contact frequencies in Hi-C data of a diversity panel mapped to a single reference genome34. Among 69 barley genotypes (67 domesticated and 2 wild accessions) (Supplementary Table 5), Hi-C-based inversion scans revealed a total of 42 events that ranged from 4 to 141 Mb in size (mean size of 23.9 Mb) (Extended Data Fig. 9a). Most of these events occurred in the low-recombining pericentromeric regions of the barley chromosomes and segregated at low frequency: 25 events were observed only once (Extended Data Fig. 9b, c, Supplementary Table 6). We focus here on two notable examples: a frequent event on chromosome 2H and an inversion in the distal region of the long arm of chromosome 7H.

The inversion in chromosome 7H detected in the RGT Planet cultivar was the largest event that segregated in our panel (141 Mb) (Fig. 3a). In a biparental mapping population derived from a cross between RGT Planet and the non-carrier cultivar Hindmarsh (Fig. 3b), this event repressed recombination in an interval that spanned 49 cM in the genetic map of the Morex × Barke population19, which is isogenic for absence of the inversion (Fig. 3c, Supplementary Table 7). We also observed a moderately distorted segregation (57% allele frequency, χ2 = 4.88, P < 0.05) in favour of the Hindmarsh allele in this interval. Recombination frequencies were increased in the flanking regions of the inversion in the RGT Planet × Hindmarsh population relative to Morex × Barke, which suggests a compensatory mechanism to maintain an average number of one-to-two crossovers per chromosome in the presence of large tracts of suppressed recombination35.

Fig. 3: Identification and characterization of a large inversion on chromosome 7H.

a, Alignment of the 7H pseudomolecules of the Morex and RGT Planet cultivars. b, Alignment of physical and genetic positions mapped in the RGT Planet × Hindmarsh (R × H) (left) and Morex × Barke (M × B) (right) populations. Red shading marks the inverted region. c, We converted genetic distances to recombination rates in the R × H (left) and M × B (right) populations. A single marker per recombination block is shown. d, We designed a PCR marker (Supplementary Figs. 1, 2a) to screen for the presence of the inversion in gene bank accessions that represent the Valticky and Diamant cultivars.

By focusing on the inversion breakpoints in the RGT Planet sequence assembly, we designed a diagnostic PCR assay (Supplementary Fig. 2a, b, d) to rapidly genotype the presence of the inversion in 1,406 accessions (Supplementary Table 8). The inverted haplotype occurred at low frequency (1.3%) in the whole panel, but was found in many lines in the RGT Planet pedigree (Supplementary Fig. 3)—including commercially successful barley cultivars of past decades, such as Triumph, Quench and Sebastian. The earliest cultivar that carried the inversion was Diamant. As one of the donors of the semi-dwarf growth habit, Diamant was a highly influential founder line of modern barley breeding and traces back to a mutant induced by gamma irradiation of the Czech cultivar Valticky36. We genotyped several gene bank accessions and germplasm samples of both Valticky and Diamant. Notably, none of the Valticky samples carried the inversion, whereas it segregated in the Diamant samples (Fig. 3d). This strongly suggests that mutation breeding in the 1960s has given rise to a cryptic large inversion, which—unbeknownst to breeders—segregates in elite varieties of barley. Quantitative trait loci mapping for yield-related traits in the RGT Planet × Hindmarsh population did not show signals on chromosome 7H (Supplementary Fig. 2e, Supplementary Table 9), consistent with selective neutrality of the inversion.

The second inversion we focused on spanned 10 Mb in the interstitial region of chromosome 2H (Fig. 1b) and was present in 26 out of 69 Hi-C samples (Supplementary Table 8). Local PCA and haplotype analysis in our panel of 200 domesticated and 100 wild varieties of barley indicated a single origin of the inverted haplotype (Fig. 4a, b, Supplementary Fig. 2c). The inversion occurred only among domesticated barley of Western geographical origin9, indicating that it arose or has risen to high frequency after domestication. The inverted region contains 46 high-confidence genes in the Morex cultivar. The closest gene to the inversion breakpoint—at 449 kb distance from the distal breakpoint in the non-carrier Morex—was HvCENTRORADIALIS (HvCEN)37 (Fig. 4c). Although induced mutants of HvCEN flower very early, natural variation in HvCEN has previously been implicated in environmental adaptation to northern European climates37. All of the inversion carriers we analysed had HvCEN haplotype III, which is associated with later flowering in spring barley varieties from northern Europe37,38. Further research is required to determine whether the inversion close to HvCEN has direct functional consequences (for instance, by modulating HvCEN expression) or whether it hitchhiked along with a tightly linked causal variant.

Fig. 4: Analysis of a frequent inversion on chromosome 2H.

a, A PCA showing the localization of inversion carriers in the diversity space of global domesticated barley. The correspondence of PCA coordinates to correlates of population structure is shown in Extended Data Fig. 1. Red dots denote carriers of the inverted haplotype (n = 87) in a panel of 200 domesticated varieties of barley. b, PCA for a diversity panel comprising 200 domesticated (red and green dots) and 100 wild varieties of barley (blue dots). SNP markers detected in whole-genome shotgun data and located in the inverted regions were used. c, Schematic of the inverted region. The HvCEN gene is closest to the breakpoint that is distal in Morex (distance of 449 kb) and proximal in Barke (distance of 433 kb) assemblies. A total of 46 and 44 high-confidence (HC) genes were annotated in the Morex and Barke assemblies, respectively. The yellow arrows (not drawn to scale) mark the positions of PCR primers to probe for the presence of the inversion (Supplementary Fig. 2c).

Discussion

The digital representation of the pan-genome can expand the repertoire of natural or induced sequence variation that is accessible to genetic analyses and breeding. Our comparison of 20 chromosome-scale sequence assemblies has revealed pervasive variation in genes and non-coding regions. Focusing on single-copy sequences, we translated this variation into scorable markers that are amenable to population genetic analysis and association scans. A notable finding was the prevalence of large (more than 5 Mb in size) inversion polymorphisms in current elite germplasm. It is likely that the suppression of genetic recombination in inversion heterozygotes has manifested itself in hard-to-explain patterns of long-range linkage and segregation distortion between elite lines in breeding programmes. Our map of inversion polymorphisms will provide breeders with a point of reference to avoid—or interpret correctly—crosses between carriers and non-carriers. We found abundant structural variation in 20 representative barley genotypes, but individual events occurred at low frequency (Extended Data Figs. 6, 9). This observation, combined with the slow saturation of the single-copy pan-genome (Fig. 2a), motivates the genomic analysis of more genotypes to expand the barley pan-genome. The next phase of barley pan-genomics will focus on an augmented panel of domesticated and wild germplasm, working towards the long-term goal of high-quality genome sequences of all barley plant genetic resources as part of a biodigital resource centre39,40.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.

Library preparation, sequencing data generation and genome assembly of 20 diverse varieties of barley

High-molecular-weight DNA was extracted from one-week-old seedlings of 20 diverse barley accessions given in Supplementary Table 10, using a previously described large-scale DNA extraction protocol41. For the NRGene DeNovoMAGIC3.0 assemblies, 450-bp paired-end (PE450) libraries of Morex, Barke, HOR 10350 and B1K-04-12 were prepared at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben. The 450-bp paired-end libraries for other accessions, 800-bp paired-end libraries and mate-pair libraries of three sizes were prepared and sequenced at the University of Illinois Roy J. Carver Biotechnology Center. The 10X Genomics Chromium libraries were prepared at the University of Saskatchewan Wheat Molecular Breeding Laboratory and sequenced by Genome Quebec or prepared and sequenced at the Roy J. Carver Biotechnology Center, using the manufacturers’ recommendations. Published tethered chromosome conformation data for Morex, Barke, HOR 10350 and B1K-04-12 (ref. 42) were used for scaffolding the respective genomes. For the other accessions, in situ Hi-C libraries were prepared using a previously described method43. Sequencing data generated from each of the libraries are given in Supplementary Table 10. NRGene DeNovoMAGIC3.0 scaffold assemblies were provided for Barke, HOR 10350 and B1K-04-12. The 10X Chromium, population sequencing (POPSEQ) and Hi-C data were then used to prepare chromosome-scale assemblies using the TRITEX pipeline6 (commit: 7041ff2). For the other assemblies, the TRITEX pipeline was also used for contig assembly and scaffolding with mate-pair and 10X data (Extended Data Table 1). High-confidence gene models annotated on the Morex V2 reference6 and full-length cDNA sequences44 were aligned to the assemblies to assess gene-space completeness with the parameters of ≥90% query coverage and ≥97% (≥90% for full-length cDNA) identity.

Tissue collection and RNA extraction

Plant material for the collection of tissues for RNA sequencing (RNA-seq) and Iso-Seq was grown in the greenhouse at IPK Gatersleben with day–night temperatures of 21 °C–18 °C. Embryonic tissue, leaves, roots, internode, inflorescence (5 mm) and developing seeds (5 and 15 days after pollination) were collected as previously described4, snap-frozen in liquid nitrogen and stored at −80 °C until RNA extractions were performed. RNA was extracted from the collected tissues using a Trizol extraction protocol4 and purified using Qiagen RNeasy miniprep columns as per the manufacturer’s instructions. RNA quality was checked on Agilent RNA HS screen tape and RNA with RIN value greater than 8 was used for RNA-seq and Iso-Seq library construction.

RNA-seq library preparation and data generation

RNA-seq libraries were prepared from purified RNA using the TruSeq RNA sample preparation kit (Illumina) as per the manufacturer’s recommendation at IPK Gatersleben. Libraries were pooled at equimolar concentrations, quantified by qPCR and paired-end-sequenced on an Illumina HiSeq 2500 for 200 cycles. The data generated for each tissue are given in Supplementary Table 2.

Iso-Seq data generation and analysis

Two libraries for each embryonic tissue RNA and pooled RNA from seven tissues (described in ‘Tissue collection and RNA extraction’) were prepared for Barke and HOR 10350 using the PacBio Iso-Seq protocol. In brief, double-stranded cDNA was synthesized using the SMARTer PCR cDNA synthesis kit (Clontech; cat. no. 634925). Two fractions of cDNA with different size profiles were prepared by using differing ratios of DNA to Ampure XP beads (Beckman Coulter, cat. no. A63882). Equimolar concentrations of each fraction were pooled, and a minimum of one microgram of double-stranded cDNA was used for Iso-Seq library construction as per the PacBio library construction protocol. Two additional libraries from pooled RNA tissues were prepared using cDNA prepared with the TeloPrime v.1.0 kit (Lexogen) following the manufacturer’s instructions. Libraries were quantified and sequenced on a PacBio Sequel device at IPK Gatersleben. Data were analysed using SMRTLink v.5.0 Isoseq v.1.0 pipeline or Isoseq3 pipeline (https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda). The steps involved in Iso-Seq data analysis were the generation of circular consensus sequences, and then the classification of circular consensus sequence reads into full-length non-chimeric reads and non-full length reads on the basis of the presence of primer sequences and polyA sequences. Full-length non-chimeric reads were then clustered on the basis of sequence similarity to yield high- and low-quality isoforms. The data generated and method of library preparation are given in Supplementary Table 2.

Gene projections and repeat annotation

Gene models for Morex, Barke and HOR 10350 were predicted using transcriptome data (Supplementary Table 2) and protein homology evidence, and derived by a previously described annotation pipeline5. High-confidence gene models from these accessions were aligned to pseudo-chromosomes of each accession separately using blat45. For each genomic region identified by blat, additional alignments were performed by exonerate46 in its genomic neighbourhood ranging between 20 kb upstream and 20 kb downstream of the match position. A series of quality criteria was applied to select high-confidence gene models in each accession. The functional annotation for genes of 20 accessions was carried out using the AHRD pipeline v.3.3.3 (https://github.com/groupschoof/AHRD). Orthologous gene groups between the twenty accessions were predicted using OrthoFinder47 v.2.3.1 with default parameters.

Repeat annotation

To obtain a consistent transposon annotation across all lines for transposons and tandem repeats, the same methods were applied to all 20 barley lines. Transposons were detected and classified by homology search against the REdat_9.7_Poaceae section of the PGSB transposon library48. The program vmatch (http://www.vmatch.de, version 2.3.0) was used for that purpose as a fast and efficient matching tool that is well-suited for such large and highly repetitive genomes. Vmatch was run with the following parameters: identity ≥ 70%, minimal hit length 75 bp, seed length 12 bp (exact command line: -d -p -l 75 -identity 70 -seedlength 12 -exdrop 5). To remove overlapping annotations, the vmatch output was filtered for redundant hits via a priority-based approach. Higher scoring matches were assigned first and lower scoring hits at overlapping positions were either shortened or removed if they were contained to ≥90% in the overlap or <50 bp of rest length remained. The resulting transposon annotations are overlap-free, but disrupted elements from nested insertions have not been defragmented into one element. Still-intact full-length LTR retrotransposons were identified with LTRharvest49, a program that scans the genome for LTR retrotransposon specific structural hallmarks, such as long terminal repeats, RNA cognate primer binding sites and target site duplications. LTRharvest (included in genometools 1.5.9) was run with the following parameter settings: ‘-overlaps best -seed 30 -minlenltr 100 -maxlenltr 2000 -mindistltr 3000 -maxdistltr 25000 -similar 85 -mintsd 4 -maxtsd 20 -motif tgca -motifmis 1 -vic 60 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3’. All candidates were annotated for PfamA domains using hmmer3 (http://hmmer.org, version 3.1b2) and filtered to remove false positives. The inner domain order served as a criterion for the LTR-retrotransposon superfamily classification into either Gypsy or Copia. 
Where domain information was insufficient, the superfamily of an element was left undetermined.
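The priority-based removal of overlapping vmatch hits described above can be illustrated with a short Python sketch. It is a minimal interpretation under stated assumptions, not the authors' code: hits are half-open (start, end, score) tuples on one sequence, a trimmed hit keeps only its longest uncovered piece, and the helper names subtract and resolve_overlaps are hypothetical. The thresholds (90% containment, 50 bp rest length) are those given in the text.

```python
def subtract(interval, blockers):
    """Return the parts of a half-open interval not covered by blockers."""
    start, end = interval
    pieces, cur = [], start
    for b_start, b_end in sorted(blockers):
        if b_end <= cur or b_start >= end:
            continue  # blocker does not touch the remaining part
        if b_start > cur:
            pieces.append((cur, b_start))
        cur = max(cur, b_end)
    if cur < end:
        pieces.append((cur, end))
    return pieces


def resolve_overlaps(hits, max_contained=0.9, min_rest=50):
    """Assign higher-scoring hits first; shorten or drop lower-scoring ones.

    Assumption of this sketch: when an accepted hit splits a lower-scoring
    hit, only the longest remaining piece is kept.
    """
    accepted = []
    for start, end, score in sorted(hits, key=lambda h: -h[2]):
        pieces = subtract((start, end), [(s, e) for s, e, _ in accepted])
        length = end - start
        free = sum(e - s for s, e in pieces)
        if (length - free) / length >= max_contained:
            continue  # contained to >= 90% in higher-scoring hits
        s, e = max(pieces, key=lambda p: p[1] - p[0])
        if e - s < min_rest:
            continue  # less than 50 bp of rest length remains
        accepted.append((s, e, score))
    return sorted(accepted)
```

For instance, a hit spanning 900 to 1,200 that overlaps a higher-scoring hit spanning 0 to 1,000 would be shortened to 1,000 to 1,200 rather than removed.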

Most of the transposons insert at random locations leading to novel and usually unique sequence stretches at both borders around the inserted element and the neighbouring original sequence. The de novo detected full-length LTR set provides defined element borders, a prerequisite for the exact positioning of transposable element junctions. We used 100-bp single transposable element junctions with 50 bp outside and 50 bp inside the element from both sides of the element and merged them to 200 bp joined junctions per element. Junctions from the reverse strand were reverse-complemented. The 200-bp joined junctions from all 20 lines were clustered with vmatch dbcluster (http://www.vmatch.de, version 2.3.0) at 98% identity and 98% mutual length coverage (command-line parameters: 98 98 -e 2 -l 98 -d). About 97% of the clusters belonged to the 1:1 type with a maximum of 1 member per line and were used for the downstream analyses. Using the above-described 200-bp joined junctions instead of full sequences reduces the amount of data for clustering to 2%, from about 10 kb to 200 bp per full-length LTR element, thus allowing a sequence clustering of 2.2 million elements in the first place. By including sequence information outside of the element, the repetitiveness of high-copy transposable element families is removed and at the same time the syntenic context is provided even for elements located on chrUn (that is, not assigned to chromosomal pseudomolecules).
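As an illustration of the junction construction described above, the following Python sketch builds a 200-bp joined junction from 50 bp outside and 50 bp inside each border of an element. The 0-based half-open coordinates, the function names and the decision to return None for elements too close to a sequence end are assumptions of this sketch, and the clustering with vmatch dbcluster is not reproduced.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def joined_junction(chrom_seq, start, end, strand="+", flank=50):
    """Build the joined junction of an element occupying [start, end).

    Each single junction spans `flank` bp outside and `flank` bp inside the
    element; the left and right junctions are concatenated (2 * 2 * flank bp).
    Junctions of reverse-strand elements are reverse-complemented.
    """
    if start < flank or end + flank > len(chrom_seq) or end - start < 2 * flank:
        return None  # not enough sequence to build both junctions
    left = chrom_seq[start - flank:start + flank]
    right = chrom_seq[end - flank:end + flank]
    joined = left + right
    return revcomp(joined) if strand == "-" else joined
```

Because half of every junction lies outside the element, two copies of the same high-copy family inserted at different sites yield different junction sequences, which is what makes 1:1 clustering across lines feasible.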

PAV detection and validation

Because deletions are detected with higher sensitivity than insertions, a paired genome alignment strategy was used: each assembly was aligned reciprocally to the Morex reference genome, treating Morex once as the query and once as the reference, using Minimap2 (v.2.17)50. From these two alignments, insertions and deletions were called using Assemblytics (v.1.2.1)26. Only deletions were then selected from both alignments and converted into PAVs with regard to Morex. In addition, a hard filter was used to discard PAVs containing more than 5% gaps (Ns) as well as nested PAVs. We used a previously described method51 to map deletions longer than 5 kb in Barke relative to Morex using whole-genome shotgun data for 90 Morex × Barke recombinant inbred lines19. Mosdepth (v.0.2.9)52 was used to determine read depth in genomic intervals.
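The hard filter can be sketched in Python as follows. This is an illustration under stated assumptions, not the published pipeline: the record layout (dictionaries with chrom, start, end and seq keys), the order of the two filters, and the interpretation that a nested PAV is one lying entirely within another retained PAV (only the inner one is dropped) are all hypothetical.

```python
def gap_fraction(seq):
    """Fraction of N bases in a sequence (empty sequences count as all gap)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def filter_pavs(pavs, max_gap=0.05):
    """Discard PAVs with more than `max_gap` Ns, then drop PAVs nested
    within another PAV on the same chromosome (assumed interpretation)."""
    passed = [p for p in pavs if gap_fraction(p["seq"]) <= max_gap]
    # Sort so that a containing interval always precedes what it contains.
    passed.sort(key=lambda p: (p["chrom"], p["start"], -p["end"]))
    kept, cur_chrom, max_end = [], None, -1
    for p in passed:
        if p["chrom"] != cur_chrom:
            cur_chrom, max_end = p["chrom"], -1
        if p["end"] <= max_end:
            continue  # nested inside an earlier, larger PAV
        kept.append(p)
        max_end = p["end"]
    return kept


pavs = [
    {"chrom": "chr1H", "start": 100, "end": 1000, "seq": "A" * 900},
    {"chrom": "chr1H", "start": 200, "end": 300, "seq": "A" * 100},
    {"chrom": "chr1H", "start": 2000, "end": 2100, "seq": "N" * 10 + "A" * 90},
    {"chrom": "chr2H", "start": 100, "end": 200, "seq": "ACGT" * 25},
]
kept = filter_pavs(pavs)
print([(p["chrom"], p["start"], p["end"]) for p in kept])
```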

k-mer-based genome-wide association

PAVs overlapping with single-copy regions were identified using BEDTools (v.2.28.0)53. k-mer sequences were retrieved with a step size of 2 bp from single-copy regions residing within PAVs. The abundances of the extracted k-mer sequences were counted in sequence reads using BBDuk (BBMap_37.93) (https://sourceforge.net/projects/bbmap/). k-mer counts were obtained for whole-genome shotgun data of 300 diverse varieties of barley generated in the present study and for previously published genotyping-by-sequencing data9. k-mer counts were imported into R (v.3.5.1)54 and normalized for differences in read depth between samples. The normalized k-mer counts were then used for genome-wide association scans using GAPIT330 and for PCA using standard R functions.
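The two steps that are easiest to misread here, stepping through a region in 2-bp increments and normalizing counts for read depth, can be sketched in Python. The k-mer length is not stated in this section, so k = 31 below is only a placeholder, and scaling to counts per million reads is an assumed normalization scheme rather than the one used in the study (which was carried out in R).

```python
def kmers_with_step(seq, k=31, step=2):
    """Extract k-mers starting at every `step`-th position of `seq`.
    k = 31 is a placeholder; the text does not state the k-mer length."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def normalize_counts(counts, total_reads, scale=1_000_000):
    """Scale raw k-mer counts to counts per `scale` reads for each sample
    (assumed normalization scheme)."""
    return {
        sample: {kmer: c * scale / total_reads[sample]
                 for kmer, c in kmer_counts.items()}
        for sample, kmer_counts in counts.items()
    }


kmers = kmers_with_step("ACGTACGTAC", k=4, step=2)
raw = {"s1": {"ACGT": 10}, "s2": {"ACGT": 10}}
depth = {"s1": 1_000_000, "s2": 2_000_000}
norm = normalize_counts(raw, depth)
print(kmers, norm)
```

After normalization, the same raw count in a sample sequenced twice as deeply yields half the normalized value.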

Construction of single-copy pan-genome

To identify single-copy regions in each genome, genomic regions covered by 31-mers occurring more than once were masked using BBDuk (BBMap_37.93)55. Based on this masking, single-copy regions in each assembly were obtained in .bed format and the corresponding sequences were subsequently retrieved using BEDTools (v.2.28.0)53. Single-copy sequences from all the assemblies were combined to perform an all-against-all blast search. The blast results were filtered (>90% identity and minimum 80% alignment length) and then clustered using the igraph package56. A representative from each cluster (the largest contained sequence) was selected and used for estimating pan-genome size. Clusters shared by all 20 accessions are referred to as the core genome, and clusters with sequences originating from 1 to 19 genotypes are considered as the variable genome.
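The clustering and core/variable classification can be sketched as a connected-components problem. The example below substitutes a small union-find structure in plain Python for the igraph package used in the study. It also assumes that the 80% alignment-length threshold is taken relative to the query length and that pan-genome size is the summed length of cluster representatives; neither detail is specified above, and all names are hypothetical.

```python
def cluster_sequences(lengths, hits, min_identity=90.0, min_cov=0.8):
    """lengths: {seq_id: length}; hits: (query, subject, identity, aln_len).
    Sequences linked by a passing hit end up in the same cluster."""
    parent = {s: s for s in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for query, subject, identity, aln_len in hits:
        if query == subject:
            continue
        # Assumption: coverage is measured against the query length.
        if identity > min_identity and aln_len >= min_cov * lengths[query]:
            parent[find(query)] = find(subject)

    clusters = {}
    for s in lengths:
        clusters.setdefault(find(s), []).append(s)
    return list(clusters.values())


def summarize(clusters, lengths, genotype_of, n_genotypes):
    """Return (pan-genome size, number of core, number of variable clusters)."""
    pan_size = sum(max(lengths[s] for s in c) for c in clusters)
    n_core = sum(len({genotype_of[s] for s in c}) == n_genotypes
                 for c in clusters)
    return pan_size, n_core, len(clusters) - n_core


lengths = {"a1": 1000, "b1": 900, "c1": 950, "a2": 500, "b2": 480}
genotype_of = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "b2": "B"}
hits = [("a1", "b1", 95.0, 850), ("b1", "c1", 96.0, 800),
        ("a2", "b2", 99.0, 450), ("a1", "a2", 85.0, 900)]
clusters = cluster_sequences(lengths, hits)
summary = summarize(clusters, lengths, genotype_of, n_genotypes=3)
print(sorted(sorted(c) for c in clusters), summary)
```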

Hi-C library preparation, sequencing and inversion calling

In situ Hi-C libraries were prepared from one-week-old seedlings of the barley IPK core50 collection9 (Supplementary Table 5) based on a previously described protocol43. Sequencing, Hi-C raw data processing and inversion calling were performed as previously described34 using the MorexV2 reference genome sequence assembly6. The breakpoint regions were identified by pairwise genome alignment using Minimap2 (v.2.17)50 and PipMaker (http://pipmaker.bx.psu.edu/cgi-bin/pipmaker?basic)57.

Resequencing, SNP calling and PCA

Raw reads (Supplementary Table 4) were trimmed with cutadapt (v.1.15) and aligned to the MorexV2 genome assembly6 using Minimap2 (v.2.17)50. The alignments were sorted using Novosort (v.3.06.05) (http://www.novocraft.com). BCFtools (v.1.8)58 was used to call SNPs and short insertions and deletions (indels). The resulting VCF file was converted into Genomic Data Structure (GDS) format using the SeqArray package59 in R to obtain a SNP matrix. Finally, hard filtering was applied to remove SNPs with more than 10% missing data and heterozygosity. Previously generated genotyping-by-sequencing data9 were aligned to the MorexV2 reference, and SNPs were identified using a previously described variant calling pipeline9. PCAs were performed using the snpgdsPCA() function of the SNPRelate package60.
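The hard filter on the SNP matrix can be illustrated with a short Python sketch. The text does not spell out how the 10% threshold applies to heterozygosity, so the sketch assumes that a SNP is removed if either its missing-data rate or its heterozygous-call rate (among non-missing calls) exceeds 10%. The genotype coding (0 and 2 for homozygous, 1 for heterozygous, None for missing) is likewise an assumption; the study itself used R.

```python
def keep_snp(genotypes, max_missing=0.10, max_het=0.10):
    """genotypes: per-sample calls coded 0/2 (homozygous), 1 (heterozygous)
    or None (missing). Returns True if the SNP passes both assumed filters."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    # Heterozygosity is computed among called genotypes only (assumption).
    het_rate = sum(g == 1 for g in called) / len(called) if called else 1.0
    return missing_rate <= max_missing and het_rate <= max_het


snps = {
    "snp_ok": [0] * 18 + [2] * 2,
    "snp_missing": [0] * 17 + [None] * 3,   # 15% missing
    "snp_het": [0] * 17 + [1] * 3,          # 15% heterozygous
    "snp_borderline": [0] * 18 + [1, None],  # 5% missing, about 5% het
}
passed = [name for name, calls in snps.items() if keep_snp(calls)]
print(passed)
```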

RGT Planet × Hindmarsh mapping population

A cross was made between RGT Planet (maternal plant) and Hindmarsh (pollen donor). In total, 38 F2 plants from the direct cross and 233 individual heads from F3 seeds were progressed to the F6 generation by the single-seed descent method. The F6 recombinant inbred lines (RILs) (224 in total) were used for construction of a genetic linkage map. Genomic DNA was extracted from the leaves of a single plant per RIL using the cetyl-trimethyl-ammonium bromide method. DNA quality was assessed on 1% agarose gels and quantified using a NanoDrop spectrophotometer (Thermo Scientific NanoDrop Products). DNA was diluted to 50 ng μl−1 and placed in a 96-well plate for PCR. Genotyping-by-sequencing was performed using the DArT-seq platform (DArT PL) according to the manufacturer’s protocol (https://www.diversityarrays.com/). In brief, 100 μl of 50 ng μl−1 DNA was sent to DArT PL, and genotyping-by-sequencing was performed using complexity reduction followed by sequencing on a HiSeq Illumina platform as previously described61 (Supplementary Table 9). Sequences flanking polymorphisms detected by DArT-seq were aligned against the MorexV2 genome assembly to determine their physical positions (Supplementary Table 7).

Field experiments and phenotypic data

Field experiments were conducted at six sites: Gibson, Western Australia (WA, −33.612176, 121.798438); Williams, Western Australia (−33.577668, 116.734934); Wongan Hills, Western Australia (−30.848953, 116.756461); Merredin, Western Australia (−31.487009, 118.229668); South Perth, Western Australia (−31.991186, 115.887944); and Shepparton, Victoria (−36.487551, 145.388470). The distance between South Perth and Shepparton is over 3,300 km. The Merredin site is located inland and receives little rainfall, whereas the Gibson site receives a high amount of rainfall; the other sites are intermediate. The experimental design for field trial sites was as previously described62. In brief, all regional field trials (partially replicated design) were planted in a randomized complete block design using plots of 1 m by 5 m laid out in a row–column format, and the middle 3 m was harvested for grain yield. Field trials in South Perth and Shepparton were conducted using double rows with a 40-cm distance within and between rows, owing to space constraints. Seven control varieties were used for spatial adjustment of the experimental data. Measurements were taken at each plot of each field experiment in the study to determine flowering time (days to Zadoks stage 49 (ZS49)), plant height and grain yield. In brief, heading date was recorded as the number of days from sowing to 50% awn emergence above the flag leaf (ZS49), as a proxy for flowering time. Plant height was determined by estimating the average height from the base to the tip of the head of all plants in each plot. Grain yield (kg ha−1) was determined by destructively harvesting all plant material from each plot to separate the grain, and then determining grain mass. Grain yield data of the field experiments, as well as plant height and heading data, were analysed using linear mixed models in ASReml-R (https://www.vsni.co.uk/software/asreml-r/) to determine best linear unbiased predictions or best linear unbiased estimations for each trait for further analysis. Local best practices for fertilization and disease control were adopted for each trial site.

Quantitative trait loci (QTL) mapping

MapQTL6 software was used for the QTL analysis63. The genotypic data, phenotypic data and genetic map were formatted and imported into MapQTL6. Interval mapping was conducted for each trait, and markers with a logarithm of odds (LOD) value above 3.0 were selected as cofactors. Multiple QTL model mapping was then performed to recalculate the QTL. If the markers with the highest LOD value were inconsistent with the cofactor markers, the new markers were selected as cofactors and the QTL were recalculated. The QTL results and charts were exported from the software.

Long-read sequence assembly of the Morex cultivar

PacBio libraries were constructed using SMRTbell Template Prep Kit 1.0 and size-selected (20–80 kb) on a Sage BluePippin instrument. Sequencing was performed on a Sequel II device at the HudsonAlpha Institute using V.1.0 chemistry and 10-h movie time. Data were generated from a total of five SMRT cells, yielding 604 Gb of raw sequence reads. A total of 520.72 Gb of this set (104.15×) was used for assembly (Supplementary Tables 11, 12). Previously published Illumina short-read data (ERR3183748 and ERR3183749; ref. 6; Supplementary Table 12) were used for polishing and error correction. Before use, Illumina fragment reads were screened for PhiX contamination. Reads composed of >95% simple sequences were removed. Illumina reads shorter than 50 bp after trimming for adaptor and quality (q < 20) were removed. The final read set consists of 605,178,701 reads, representing a total of 43.17× of high-quality Illumina bases. The initial assembly was generated by assembling 32,743,478 PacBio reads (104.15× sequence coverage) using MECAT (v.1.1)64 and subsequently polished using Arrow65. This produced an initial assembly of 1,577 scaffolds (1,577 contigs), with a contig N50 of 10.4 Mb, 987 scaffolds larger than 100 kb and a total genome size of 4,139.8 Mb (Supplementary Table 13).
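Two of the summary figures in this paragraph follow from simple arithmetic, which the Python sketch below makes explicit. The N50 function uses the standard definition (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total); the toy contig lengths are invented for illustration and are not from the assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least half of
    the total assembled bases. Returns 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def implied_genome_size(bases_gb, coverage):
    """Genome size (Gb) implied by a sequencing yield and its fold coverage."""
    return bases_gb / coverage


toy_n50 = n50([10, 8, 6, 4, 2])             # invented contig lengths
size_gb = implied_genome_size(520.72, 104.15)  # figures from the text
print(toy_n50, round(size_gb, 2))
```

The 520.72 Gb of reads at 104.15× coverage imply a genome size of about 5.0 Gb, in line with the haploid genome size of barley given in the introduction.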

A first round of breaking chimeric scaffolds was done using the POPSEQ genetic map19 to identify contigs bearing markers from distant genomic regions. A total of 17 misjoins were identified and resolved. Homozygous SNPs and indels were corrected in the release consensus sequence using a subset of about 30× of the Illumina reads described above in this section. Reads were aligned using BWA-MEM66. Homozygous SNPs and indels were discovered with GATK’s UnifiedGenotyper tool67. A total of 59 homozygous SNPs and 15,759 homozygous indels were corrected. After these correction steps, the assembly contained 4,139.7 Mb of sequence, consisting of 1,594 contigs with a contig N50 of 10.2 Mb. A second round of chimaera breaking was done by inspecting Hi-C contact matrices as described in the TRITEX pipeline6. Published Hi-C data of the Morex cultivar were used5. Corrected contigs were arranged into pseudomolecules using TRITEX.

Comparison of PacBio continuous long read (CLR) and TRITEX assemblies of the Morex cultivar

Full-length cDNA sequences44 were aligned to the assemblies to assess gene space completeness. Only alignments with query coverage ≥90% and identity ≥90% were considered. Whole-genome alignments were done with Minimap2. Structural variant calling with Assemblytics (v.1.2.1)26 (Morex TRITEX versus Morex CLR; Morex CLR versus Barke) and extraction of single-copy regions were done as described in ‘PAV detection and validation’.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability

All raw sequence data collected in this study and sequence assemblies have been deposited at the European Nucleotide Archive (ENA). Accession codes for raw data and assemblies are listed in Supplementary Tables: Supplementary Table 14 (assemblies), Supplementary Table 10 (assembly raw data), Supplementary Table 4 (whole-genome shotgun sequencing), Supplementary Table 5 (Hi-C) and Supplementary Table 9 (DArT-seq). Assemblies, annotations and analysis results were deposited under a DOI in the PGP repository68 using the e!DAL submission system69 and are accessible under the URL https://doi.org/10.5447/ipk/2020/24. Assemblies and gene annotations can also be downloaded from https://barley-pangenome.ipk-gatersleben.de. The Barley Pedigree Catalogue is available at http://genbank.vurv.cz/barley/pedigree/.

Code availability

Source code is released in a public Bitbucket repository, at https://bitbucket.org/ipk_dg_public/barley_pangenome/.

References

1. Bayer, P. E., Golicz, A. A., Scheben, A., Batley, J. & Edwards, D. Plant pan-genomes are the new reference. Nat. Plants 6, 914–920 (2020).
2. Dawson, I. K. et al. Barley: a translational model for adaptation to climate change. New Phytol. 206, 913–931 (2015).
3. Stein, N. & Muehlbauer, G. J. The Barley Genome (Springer, 2018).
4. International Barley Genome Sequencing Consortium. A physical, genetic and functional sequence assembly of the barley genome. Nature 491, 711–716 (2012).
5. Mascher, M. et al. A chromosome conformation capture ordered sequence of the barley genome. Nature 544, 427–433 (2017).
6. Monat, C. et al. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol. 20, 284 (2019).
7. Mascher, M. et al. Mapping-by-sequencing accelerates forward genetics in barley. Genome Biol. 15, R78 (2014).
8. Russell, J. et al. Exome sequencing of geographically diverse barley landraces and wild relatives gives insights into environmental adaptation. Nat. Genet. 48, 1024–1030 (2016).
9. Milner, S. G. et al. Genebank genomics highlights the diversity of a global barley collection. Nat. Genet. 51, 319–326 (2019).
10. Muñoz-Amatriaín, M. et al. Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome. Genome Biol. 14, R58 (2013).
11. Taketa, S. et al. Barley grain with adhering hulls is controlled by an ERF family transcription factor gene regulating a lipid biosynthesis pathway. Proc. Natl Acad. Sci. USA 105, 4062–4067 (2008).
12. Tettelin, H. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl Acad. Sci. USA 102, 13950–13955 (2005).
13. Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat. Rev. Genet. 21, 171–189 (2019).
14. Danilevicz, M. F., Tay Fernandez, C. G., Marsh, J. I., Bayer, P. E. & Edwards, D. Plant pangenomics: approaches, applications and advancements. Curr. Opin. Plant Biol. 54, 18–25 (2020).
15. Monat, C., Schreiber, M., Stein, N. & Mascher, M. Prospects of pan-genomics in barley. Theor. Appl. Genet. 132, 785–796 (2019).
16. Coronado, M.-J., Hensel, G., Broeders, S., Otto, I. & Kumlehn, J. Immature pollen-derived doubled haploid formation in barley cv. Golden Promise as a tool for transgene recombination. Acta Physiol. Plant. 27, 591–599 (2005).
17. Schreiber, M. et al. A genome assembly of the barley ‘transformation reference’ cultivar Golden Promise. G3 10, 1823–1827 (2020).
18. Gottwald, S., Bauer, P., Komatsuda, T., Lundqvist, U. & Stein, N. TILLING in the two-rowed barley cultivar ‘Barke’ reveals preferred sites of functional diversity in the gene HvHox1. BMC Res. Notes 2, 258 (2009).
19. Mascher, M. et al. Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ). Plant J. 76, 718–727 (2013).
20. Hübner, S. et al. Strong correlation of wild barley (Hordeum spontaneum) population structure with temperature and precipitation variation. Mol. Ecol. 18, 1523–1536 (2009).
21. Chikhi, R., Limasset, A. & Medvedev, P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32, i201–i208 (2016).
22. Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).
23. Clavijo, B. J. et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 27, 885–896 (2017).
24. Anderson, S. N. et al. Transposable elements contribute to dynamic genome content in maize. Plant J. 100, 1052–1065 (2019).
25. Brunner, S., Fengler, K., Morgante, M., Tingey, S. & Rafalski, A. Evolution of DNA sequence nonhomologies among maize inbreds. Plant Cell 17, 343–360 (2005).
26. Nattestad, M. & Schatz, M. C. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics 32, 3021–3023 (2016).
27. Gordon, S. P. et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8, 2184 (2017).
28. Yu, S. et al. A single nucleotide polymorphism of Nud converts the caryopsis type of barley (Hordeum vulgare L.). Plant Mol. Biol. Report. 34, 242–248 (2016).
29. Arora, S. et al. Resistance gene cloning from a wild crop relative by sequence capture and association genetics. Nat. Biotechnol. 37, 139–143 (2019).
30. Lipka, A. E. et al. GAPIT: genome association and prediction integrated tool. Bioinformatics 28, 2397–2399 (2012).
31. Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006).
32. Ekberg, I. Cytogenetic studies of three paracentric inversions in barley. Hereditas 76, 1–30 (1974).
33. Ramage, R. & Suneson, C. Translocation-gene linkages on barley chromosome 7. Crop Sci. 1, 319–320 (1961).
34. Himmelbach, A. et al. Discovery of multi-megabase polymorphic inversions by chromosome conformation capture sequencing in large-genome plant species. Plant J. 96, 1309–1316 (2018).
35. Ederveen, A., Lai, Y., van Driel, M. A., Gerats, T. & Peters, J. L. Modulating crossover positioning by introducing large structural changes in chromosomes. BMC Genomics 16, 89 (2015).
36. Bouma, J. & Ohnoutka, Z. Importance and Application of the Mutant ‘Diamant’ in Spring Barley Breeding (IAEA, 1991).
37. Comadran, J. et al. Natural variation in a homolog of Antirrhinum CENTRORADIALIS contributed to spring growth habit and environmental adaptation in cultivated barley. Nat. Genet. 44, 1388–1392 (2012).
38. Bustos-Korts, D. et al. Exome sequences and multi-environment field trials elucidate the genetic basis of adaptation in barley. Plant J. 99, 1172–1191 (2019).
39. Mascher, M. et al. Genebank genomics bridges the gap between the conservation of crop diversity and plant breeding. Nat. Genet. 51, 1076–1081 (2019).
40. Khan, A. W. et al. Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends Plant Sci. 25, 148–158 (2020).
41. Dvorak, J., McGuire, P. E. & Cassidy, B. Apparent sources of the A genomes of wheats inferred from polymorphism in abundance and restriction fragment length of repeated nucleotide sequences. Genome 30, 680–689 (1988).
42. Himmelbach, A., Walde, I., Mascher, M. & Stein, N. Tethered chromosome conformation capture sequencing in Triticeae: a valuable tool for genome assembly. Bio Protoc. 8, e2955 (2018).
43. Padmarasu, S., Himmelbach, A., Mascher, M. & Stein, N. in Plant Long Non-Coding RNAs (eds Chekanova, J. & Wang, H.-L.) 441–472 (Springer, 2019).
44. Matsumoto, T. et al. Comprehensive sequence analysis of 24,783 barley full-length cDNAs derived from 12 clone libraries. Plant Physiol. 156, 20–28 (2011).
45. Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
46. Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
47. Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015).
48. Spannagl, M. et al. PGSB PlantsDB: updates to the database framework for comparative plant genome research. Nucleic Acids Res. 44, D1141–D1147 (2016).
49. Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
50. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
51. Gutierrez-Gonzalez, J. J., Mascher, M., Poland, J. & Muehlbauer, G. J. Dense genotyping-by-sequencing linkage maps of two synthetic W7984×Opata reference populations provide insights into wheat structural diversity. Sci. Rep. 9, 1793 (2019).
52. Pedersen, B. S. & Quinlan, A. R. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34, 867–868 (2018).
53. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
54. R Core Team. R: A Language and Environment for Statistical Computing http://www.R-project.org (R Foundation for Statistical Computing, 2013).
55. Bushnell, B. BBMap: A Fast, Accurate, Splice-aware Aligner (Lawrence Berkeley National Laboratory, 2014).
56. Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Syst. 1695, 1–9 (2006).
57. Schwartz, S. et al. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 10, 577–586 (2000).
58. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
59. Zheng, X. & Gogarten, S. SeqArray: big data management of genome-wide sequence variants. R package version 1.10.6 https://github.com/zhengxwen/SeqArray (accessed January 2017).
60. Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012).
61. Akbari, M. et al. Diversity arrays technology (DArT) for high-throughput profiling of the hexaploid wheat genome. Theor. Appl. Genet. 113, 1409–1420 (2006).
62. Hill, C. B. et al. Hybridisation-based target enrichment of phenology genes to dissect the genetic basis of yield and adaptation in barley. Plant Biotechnol. J. 17, 932–944 (2019).
63. Van Ooijen, J. MapQTL 5, Software for the Mapping of Quantitative Trait Loci in Experimental Populations (Kyazma, 2004).
64. Xiao, C. L. et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods 14, 1072–1074 (2017).
65. Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
66. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
67. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
68. Arend, D. et al. PGP repository: a plant phenomics and genomics data publication infrastructure. Database (Oxford) 2016, baw033 (2016).
69. Arend, D. et al. e!DAL—a framework to store, share and publish research data. BMC Bioinformatics 15, 214 (2014).

Acknowledgements

We thank M. Knauft, I. Walde and S. König for technical assistance; D. Schüler for sequence data management; J. Bauernfeind, T. Münch and H. Miehe for IT administration; D. Arend for help with data submission; M. Bayer for advice on transcriptome analysis; and M. Herz for pedigree information. This research was supported by grants from the German Federal Ministry of Education and Research to N.S., M.M., U.S., M.S. and K.F.X.M. (SHAPE, FKZ 031B0190), to U.S. and K.F.X.M. (de.NBI, FKZ 031A536) and to N.S. (COBRA, FKZ 031A323A); the Australian Grain Research and Development Cooperation (9176507) to C.L., K.C., P.L. and P.W.; JST CREST Japan (no. JPMJCR16O4 to K.M. and T.H.); JST Mirai Program Japan (no. 18076896 to K.S.); the National Key R&D Program of China (2018YFD1000701 and 2018YFD1000700) to D.X. and J.Z.; by funding from the China Agriculture Research System (CARS-05) and the Agricultural Science and Technology Innovation Program to C.W. and G.G. Support for 10X sequencing was provided by a research grant from Genome Canada and Genome Prairie to C.P. and J.E.; and by the Natural Science Foundation of China (31620103912) and the National Key R&D Program of China (2018YFD1000706) to G.Z. We acknowledge support from the European Research Council (ERC Shuffle, project identifier: 66918) to R.W.

Author information

Contributions

N.S. and M.M. designed the study. N.S. coordinated experiments and sequencing. M.M. supervised sequence assembly. M. Spannagl and K.F.X.M. supervised annotation. U.S. supervised data management and submission. S.P., A.H., J.E., D.X., L.B.B. and J.G. performed sequencing experiments. M.J., C.M., Y.G., C.P., J.J. and J.S. performed sequence assembly. M.J. performed structural variation and genome-wide association scan analysis. A.F. submitted sequence data. G.H., T.L., H.G., V.S.B., N.K. and D.L. annotated and analysed genes and transposable elements. S.P., M.J., X.-Q.Z., T.T.A., G. Zhou, C.T., C.H., P.W., M.M. and C.L. analysed polymorphic inversions. H.B., J.G., J.S., J.Z., C.W., G.G., G. Zhang, K.M., T.H., K.S., K.J.C., P.L., C.J.P., C.L., M. Schreiber, R.W. and N.S. contributed sequence data. M.J., S.P., C.L. and M.M. wrote the paper with input from all co-authors.

Corresponding authors

Correspondence to Chengdao Li or Martin Mascher or Nils Stein.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Victor Albert, Scott Jackson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Pan-genome selection in the global barley diversity space.

PCA with genotyping-by-sequencing data of 19,778 varieties of domesticated barley sampled from the gene bank of the IPK9. The first six principal components are shown. Samples are coloured to highlight the pan-genome selection (first row), or according to geographic origin (second row), row type (third row) or annual growth habit (fourth row). The proportion of variance explained by the principal components is indicated in the axis labels of the first row. The map was created with the R package mapdata54.

Extended Data Fig. 2 Comparison between long-read and short-read assemblies of the Morex cultivar.

a, Co-linearity between Morex V2 (short-read) assembly and the Morex PacBio CLR assembly at the pseudomolecule level. b, Summary statistics of the Morex PacBio CLR assembly and Morex V2 assembly. c, Alignment of NUDUM locus (16 kb) between Morex PacBio CLR and Morex V2. d, Structural variants between Morex V2 and Morex PacBio CLR assemblies as detected and classified by Assemblytics. e, PAVs between Barke and the Morex V2 and Morex CLR assemblies.

Extended Data Fig. 3 Assessment of contiguity and completeness in 20 genome assemblies.

a, Whole-genome alignments of assemblies of 19 diverse barley accessions to the Morex V2 reference assembly. b, Alignment summary of full-length coding sequences (32,878) from the MorexV2 annotation and full-length cDNAs (28,622) in each assembly. Alignments with less than 90% query coverage or less than 97% identity (less than 90% identity for full-length cDNAs) were discarded. c, Whole-genome alignments showing examples of large chromosomal inversions identified using Hi-C data.

Extended Data Fig. 4 Pairwise shared syntenic full-length LTR locations.

The wild variety B1K-04-12 is set apart as an outgroup, as it shares only 19–26% of its still-intact full-length LTR positions with the other landraces and cultivars. The highest similarity is found between the Barke and RGT Planet cultivars (67% shared full-length LTRs).

Extended Data Fig. 5 Gene projection and transposable element annotation.

a, Schematic of the gene projection workflow. TE, transposable element. b, Pipeline for annotation and removing transposable elements. c, Steps to identify tandemly arrayed gene (TAG) clusters in each assembly. d, Summary of gene projections and transposable element annotation in 20 accessions. e, Comparison between de novo annotations and gene projections for three genotypes. Reported counts refer to non-transposable-element genes.

Extended Data Fig. 6 Summary of PAVs detected in pan-genome assemblies.

a, Size distribution of PAVs. b, Number of PAVs between 20 genome assemblies. c, Distribution of PAVs along the barley genome. d, Co-linearity between physical position of PAVs detected between the Morex and Barke cultivars, and mapped genetically in the POPSEQ population.

Extended Data Fig. 7 Analysis of the single-copy pan-genome.

a, Pipeline used to select single-copy k-mers in PAVs as markers for genome-wide association scan analysis. b, Summary of single-copy sequence in 20 genome assemblies and results of their clustering. c, Copy number of single-copy sequences in a diversity panel comprising 200 domesticated and 100 wild accessions. Frequency ranges from blue (low) to red (high). d–g, Comparison of PCA on the basis of PAV and SNP variants in whole-genome shotgun data of 200 diverse accessions (d, e) and 19,778 varieties of domesticated barley9 (f, g). Top panels show PCA results from 160,716 PAVs; bottom panels show PCA results from 779,503 genotyping-by-sequencing SNPs. The accessions are coloured according to geographical origin and row type (using the colour code defined in Extended Data Fig. 1).

Extended Data Fig. 8 PAV-based genome-wide association scans using whole-genome shotgun and genotyping-by-sequencing data.

a, Manhattan plots of PAV-based genome-wide association scans for morphological traits, including adherence of grain hull, row type, length of rachilla hairs and awn roughness, using whole-genome shotgun data from 200 diverse varieties of domesticated barley. b, PAV-based genome-wide association scan results for these traits using genotyping-by-sequencing data from 1,000 diverse varieties of domesticated barley collected from the gene bank of the IPK9. The 200 varieties of barley used for whole-genome shotgun sequencing are a subset of the 1,000 genotyping-by-sequencing genotypes.

Extended Data Fig. 9 Characterization of large inversions in barley.

a, Inversion size distribution. b, Recombination in inverted regions. Recombination rate was determined in the Morex × Barke RIL population19 (n = 90 genotypes). c, Number of inversions present as singletons or shared between two or more accessions on each chromosome.

Extended Data Table 1 Summary statistics of 20 pan-genome assemblies and annotation

Supplementary information

Supplementary Figure

Supplementary Figure 1 | PCR-based genotyping of the 7H inversion. This is the original gel image from which the blue sections were cropped and used for Fig. 3. Morex and RGT Planet were used as controls. All Valticky lines carry the Morex allele of the 7H inversion. Two Diamant lines (HOR 14972, HOR 4092) carry the RGT Planet allele, and one Diamant line (HOR 2073) carries the Morex allele. In another two Diamant lines, neither the RGT Planet allele nor the Morex allele was amplified. One cropped section in Fig. 3 does not contain a molecular weight marker, but the original image shows that all bands correspond to the correct fragment sizes.

Reporting Summary

Supplementary Figure 2 | In-depth analysis of two inversions on 2H and 7H. (a) Schematic illustration showing the precise positions of the breakpoints of the frequent 7H inversion between Morex and RGT Planet. (b) PCR assay for genotyping the 7H inversion. The locations of the three PCR primers are shown in (a) with yellow marks (not drawn to scale). (c) PCR assay for genotyping the 2H inversion. Primer locations are shown in Fig. 4c. (d) Hi-C contact probability matrix of RGT Planet computed for chromosome 7H. The intensity of pixels represents the normalized Hi-C links counted between 1 Mb windows on chromosome 7H. The frequent 7H inversion appears as a pattern of higher-than-expected interaction frequency against the Morex V2 reference genome, marked by blue lines. (e) QTL results for grain yield, plant height and different growth stages from multiple sites in the RGT Planet × Hindmarsh population.

Supplementary Figure 3 | PCR-based genotyping of the 7H inversion in the pedigree of RGT Planet. Yellow denotes carriers of the RGT Planet allele. Cultivars shown in blue are non-carriers. Cultivars shown in red have unknown status, as no fragment was amplified. Cultivars shown in white boxes were not assayed because seeds or DNA were unavailable. Pedigree data were retrieved from the Barley Pedigree Catalogue (http://genbank.vurv.cz/barley/pedigree/).

Supplementary Table

Supplementary Table 1. Summary statistics of repetitive elements in twenty barley genomes.

Supplementary Table 2. Summary of RNA-seq data used for gene annotation.

Supplementary Table 3. Pfam domains most frequently observed in PAV genes.

Supplementary Table 4. Summary of whole-genome shotgun (WGS) sequencing for 200 domesticated and 100 wild accessions.

Supplementary Table 5. Summary of Hi-C data used in this study.

Supplementary Table 6. Inversions detected by Hi-C.

Supplementary Table 7. Genetic map of the RGT Planet and Hindmarsh (RxH) population (sheet 1) and the Morex and Barke (MxB) population (sheet 2).

Supplementary Table 8. Screening of the 7H inversion in a large collection of modern varieties and breeding lines (sheet 1), screening of the 7H inversion in the lines in the pedigree of RGT Planet (sheet 2), and PCR validation of the 2H inversion identified by PCA (Fig. 4b) (sheet 3).

Supplementary Table 9. Accession IDs of DArTseq reads of RxH recombinants.

Supplementary Table 10. Summary of raw sequencing data generated for pan-genome assemblies.

Supplementary Table 11. PacBio library statistics for the libraries included in the Morex genome assembly and their respective assembled sequence coverage levels.

Supplementary Table 12. Genomic libraries included in the Morex CLR assembly and their respective assembled sequence coverage levels in the final release.

Supplementary Table 13. Summary statistics of the initial output of the Quiver-polished MECAT assembly. The table shows total contigs and total assembled base pairs for each set of scaffolds greater than the size listed in the left-hand column.

Supplementary Table 14. Accession IDs for 20 pan-genome assemblies and the Morex PacBio CLR assembly.

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Jayakodi, M., Padmarasu, S., Haberer, G. et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588, 284–289 (2020). https://doi.org/10.1038/s41586-020-2947-8
