The barley pan-genome reveals the hidden legacy of mutation breeding

Jayakodi, Murukarthick; Padmarasu, Sudharsan; Haberer, Georg; Bonthala, Venkata Suresh; Gundlach, Heidrun; Monat, Cécile; Lux, Thomas; Kamal, Nadia; Lang, Daniel; Himmelbach, Axel; Ens, Jennifer; Zhang, Xiao-Qi; Angessa, Tefera T.; Zhou, Gaofeng; Tan, Cong; Hill, Camilla; Wang, Penghao; Schreiber, Miriam; Boston, Lori B.; Plott, Christopher; Jenkins, Jerry; Guo, Yu; Fiebig, Anne; Budak, Hikmet; Xu, Dongdong; Zhang, Jing; Wang, Chunchao; Grimwood, Jane; Schmutz, Jeremy; Guo, Ganggang; Zhang, Guoping; Mochida, Keiichi; Hirayama, Takashi; Sato, Kazuhiro; Chalmers, Kenneth J.; Langridge, Peter; Waugh, Robbie; Pozniak, Curtis J.; Scholz, Uwe; Mayer, Klaus F. X.; Spannagl, Manuel; Li, Chengdao; Mascher, Martin; Stein, Nils

doi:10.1038/s41586-020-2947-8

Article
Open access
Published: 25 November 2020

Nature volume 588, pages 284–289 (2020)

Abstract

Genetic diversity is key to crop improvement. Owing to pervasive genomic structural variation, a single reference genome assembly cannot capture the full complement of sequence diversity of a crop species (known as the ‘pan-genome’¹). Multiple high-quality sequence assemblies are an indispensable component of a pan-genome infrastructure. Barley (Hordeum vulgare L.) is an important cereal crop with a long history of cultivation that is adapted to a wide range of agro-climatic conditions². Here we report the construction of chromosome-scale sequence assemblies for the genotypes of 20 varieties of barley—comprising landraces, cultivars and a wild barley—that were selected as representatives of global barley diversity. We catalogued genomic presence/absence variants and explored the use of structural variants for quantitative genetic analysis through whole-genome shotgun sequencing of 300 gene bank accessions. We discovered abundant large inversion polymorphisms and analysed in detail two inversions that are frequently found in current elite barley germplasm; one is probably the product of mutation breeding and the other is tightly linked to a locus that is involved in the expansion of geographical range. This first-generation barley pan-genome makes previously hidden genetic variation accessible to genetic studies and breeding.

Main

A staple food of ancient civilizations, today barley is used mainly for animal feed and malting. Barley is more adaptable to harsh environmental conditions than its close relative wheat, and maintains an important role in human nutrition in harsh climatic regions that include the Ethiopian and Tibetan highlands². As in other crops, genomics has been a major driver of progress in barley genetics and breeding in the past decade³. The first draft reference genome for barley⁴, and its subsequent revisions^5,6, have formed the basis for gene isolation⁷, compiling a single-nucleotide polymorphism (SNP) variation atlas for wild and domesticated germplasm⁸, and activating plant genetic resources⁹. At the same time, reduced-representation surveys of structural variation¹⁰ and map-based cloning¹¹ have implicated variation in gene content and copy number in the control of agronomic traits. The concept of the pan-genome refers to a species-wide catalogue of genic presence/absence variation (PAV)¹², or more generally, structural variation that affects (potentially non-coding) sequences of 50 or more base pairs (bp) in size. Although several methods of pan-genomic analysis that use short-read sequence data in the context of a single reference genome have been devised¹³, large and complex genomes require multiple high-quality sequence assemblies to capture and contextualize sequences that are absent in—or highly diverged from—a single reference genotype¹⁴. Progress in sequencing and genome mapping technologies has only recently made possible the fast and cost-effective assembly of tens of genotypes of large-genome plant species, such as barley (haploid genome size of 5 Gb)¹⁵.

Twenty barley reference genomes

The starting point for pan-genomics in barley was the comprehensive survey of species-wide diversity on the basis of the genome-wide genotyping of more than 22,000 barley accessions, mainly from the German national gene bank⁹. To achieve a good representation of major barley gene pools, we selected accessions that were located in the branches of the first six principal components from the previously published principal component analysis (PCA)⁹ (Fig. 1a, Extended Data Fig. 1), reflecting the key determinants of population structure: geographical origin, row type and annual growth habit. In addition to these gene pool representatives, our panel included the reference cultivar Morex⁵, two current or former elite malting varieties (RGT Planet and Hockett), two founder lines of Chinese barley breeding (ZDM01467 and ZDM02064), Golden Promise and Igri (two genotypes with high transformation efficiency^16,17), Barke (a successful German variety and the parent of several mutant and mapping populations^18,19) and one wild barley (H. vulgare subsp. spontaneum (K. Koch) Thell.) genotype from Israel (B1K-04-12, a desert ecotype collected at Ein Prat)²⁰.

**Fig. 1: Chromosome-scale sequences of 20 representative barley genotypes reveal large structural variants.**

We constructed chromosome-scale sequence assemblies for 20 accessions (Extended Data Table 1). In brief, paired-end and mate-pair Illumina short reads were assembled into scaffolds of megabase (Mb)-scale contiguity (Extended Data Table 1). Scaffold assembly was done with Minia²¹ and SOAPDenovo²² following the TRITEX method⁶ (n = 16), DeNovoMagic from NRGene (n = 3) or W2rap²³ (n = 1). We used 10X Genomics Chromium linked-reads and chromosome conformation capture (Hi-C) data to arrange scaffolds into chromosomal pseudomolecules using the TRITEX pipeline⁶ (Extended Data Table 1). A comparison of the short-read assembly of the Morex cultivar to a long-read assembly of this genotype generated from PacBio long reads showed high co-linearity at chromosomal scale, good concordance in gene space representation and similar power to detect PAV (Extended Data Fig. 2), indicating that short-read assemblies are amenable to pan-genomic analyses in barley. Although the assemblies of the 20 diverse accessions differed in contiguity and the extent of gap sequence in the intergenic space, they had a similar representation of reference gene models (Morex V2) and were highly co-linear to each other at the whole-chromosome scale (Fig. 1b, Extended Data Fig. 3). A similar proportion (about 80%) of the assembled sequence of each genotype was composed of transposable elements, with an average of 113,200 intact full-length long-terminal repeat retro-elements (LTRs) per assembly (Supplementary Table 1). However, we found pronounced differences in the number of shared intact full-length LTR locations: only 17 to 25% of full-length LTR locations present in the wild barley B1K-04-12 were shared at 98% sequence identity and 98% alignment coverage with any domesticated genotype (Extended Data Fig. 4). By contrast, more closely related domesticated genotypes shared between 53% and 67% of their full-length LTRs, consistent with previous reports of rapid sequence turn-over in the non-coding space in large-genome plant species^24,25.

De novo gene annotation using Illumina RNA sequencing and PacBio Iso-Seq data (Supplementary Table 2) was performed for three genotypes: Morex (which has previously been reported⁶), Barke and the Ethiopian landrace HOR 10350 (Extended Data Fig. 5). Gene models defined on the basis of these three assemblies were consolidated and projected onto the remaining 17 assemblies (Extended Data Fig. 5). Between 35,859 and 40,044 gene models were annotated by projection in each assembly (Extended Data Table 1) with an average of 37,515 (s.d. = 896). The number of gene models was about 20% higher in the projections than in de novo annotations (Extended Data Fig. 5e), which indicates that some of the models lack transcript support: possible explanations for the discrepancy are highly tissue-specific expression or pseudogenization. The clustering of orthologous gene models yielded 40,176 orthologous groups. Of these, 21,992 occurred as a single copy in all 20 assemblies; 3,236 occurred in multiple copies in at least one of the 20 assemblies; 13,188 were absent from at least one assembly; and 1,760 were present in only one assembly. On average, 14.7% of gene models annotated in each assembly occurred in tandem arrays that comprised two or more adjacent copies. These results point to abundant genic copy-number variation between barley genotypes. Future transcriptomic studies will ascertain the effect of structural variants on gene expression.
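
The four-way partition of orthologous groups described above can be illustrated with a short Python sketch. This is one possible reading of the categories, not the authors' code: the function name, the category labels and the input layout (a mapping from assembly name to copy number) are assumptions made for illustration. The categories are treated as mutually exclusive because the four reported counts sum to the 40,176 groups.

```python
def classify_orthogroup(copies, n_assemblies):
    """Assign an orthologous group to one of four illustrative categories.

    copies: dict mapping assembly name -> number of gene copies
            (assemblies without a member may be omitted or set to 0).
    n_assemblies: total number of assemblies compared (20 in the study).
    """
    present = [c for c in copies.values() if c > 0]
    if not present:
        raise ValueError("an orthologous group needs at least one member")
    if len(present) == 1:
        return "single_assembly"      # present in only one assembly
    if len(present) < n_assemblies:
        return "variable"             # absent from at least one assembly
    if all(c == 1 for c in present):
        return "single_copy_core"     # one copy in every assembly
    return "multi_copy_core"          # in every assembly, >1 copy somewhere


# Counts reported in the text; the check confirms they partition all groups.
reported = {
    "single_copy_core": 21_992,
    "multi_copy_core": 3_236,
    "variable": 13_188,
    "single_assembly": 1_760,
}
total_groups = sum(reported.values())
```

Here `total_groups` evaluates to 40,176, matching the number of orthologous groups given in the text.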

Pan-genome as a tool for genetics and breeding

High-quality genome assemblies are a resource for ascertaining and providing context to structural variants, which can then be genotyped in a wider set of germplasm using low-coverage or reduced-representation sequence data. We used two complementary approaches to detect structural variation: assembly comparison and clustering of single-copy sequences to derive markers that can be scored in short-read data. We used the Assemblytics²⁶ software to discover PAV by pair-wise comparison of 19 chromosome-scale assemblies to the Morex reference. We identified 1,586,262 PAVs, ranging in size from 50 to 999,568 bp, and observed an enrichment for low-frequency variants (Extended Data Fig. 6a, b). PAV density was higher in distal, gene-rich regions (Extended Data Fig. 6c), which are characterized by higher nucleotide diversity and recombination rates⁸. A total of 5,446 out of 5,602 deletions longer than 5 kilobases (kb) found in Barke relative to Morex were mapped genetically in the 90 recombinant inbred lines of the Morex × Barke population¹⁹ with highly concordant positions (Spearman correlation = 0.99) (Extended Data Fig. 6d), which provides support for the accuracy of the detected polymorphisms. At least one member of 18,562 (46%) groups of orthologous genes overlapped with structural variants discovered in the 20 sequence assemblies. As observed in other plant species²⁷, resistance-gene homologues containing NB-ARC and protein kinase domains were frequently found among PAV genes (Supplementary Table 3).
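
The concordance between physical and genetic positions of mapped deletions is summarized above by a Spearman correlation. As a reminder of what that statistic measures, the following self-contained Python sketch computes it from scratch (rank both variables, then take the Pearson correlation of the ranks). The function names and the toy positions are hypothetical and are not data from the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical physical (Mb) and genetic (cM) positions of five variants.
physical_mb = [10, 20, 30, 40, 50]
genetic_cm = [1.0, 2.5, 2.0, 4.0, 6.0]
rho = spearman(physical_mb, genetic_cm)
```

Because only ranks enter the calculation, the statistic rewards a consistent marker order rather than a linear relationship between megabases and centimorgans, which suits genomes with strongly non-uniform recombination rates.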

Structural variants cover non-genic regions composed of repetitive sequence, making it hard to establish orthologous relationships or the presence of specific alleles from short-read data only. To derive quantitative estimates of the extent of pan-genomic variation and as a tool for genetic analysis such as association scans, we focused on single-copy regions extracted from each of the 20 assemblies and clustered into a non-redundant set of sequences (hereafter referred to as the ‘single-copy pan-genome’) (Extended Data Fig. 7a). The average cumulative size of single-copy sequence in each accession was 478 Mb (that is, 9.5% of the assembly genome). The total size of non-redundant single-copy sequence was 638.6 Mb, represented by 1,472,508 clusters with an N50 of 1,087 bp (Extended Data Fig. 7b). The single-copy sequence shared among all 20 genotypes amounted to 402.5 Mb, whereas 235.9 Mb were variable (that is, absent or present in higher copy number in at least one assembly) (Fig. 2a). On average, each of the 20 genotypes contained 2.9 Mb of single-copy sequence not present in any other assembly. As observed for transposable element divergence, the wild barley B1K-04-12 had the highest amount of unique single-copy sequence (Extended Data Table 1).
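
The summary statistics quoted for the single-copy pan-genome (total size, shared versus variable sequence, and N50) can be derived from a table of clusters. The Python sketch below shows one way to do this under simplifying assumptions: each cluster is reduced to the length of its representative and the set of genotypes contributing to it, and "variable" follows the Methods definition (clusters found in fewer than all genotypes). Names and toy values are hypothetical.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list")


def summarize_pan_genome(clusters, n_genotypes):
    """Summarize clusters given as (representative_length_bp, genotype_set)."""
    core = sum(length for length, genotypes in clusters
               if len(genotypes) == n_genotypes)
    variable = sum(length for length, genotypes in clusters
                   if len(genotypes) < n_genotypes)
    return {
        "total_bp": core + variable,
        "core_bp": core,
        "variable_bp": variable,
        "n50_bp": n50([length for length, _ in clusters]),
    }


# Toy example with three genotypes (hypothetical values).
toy_clusters = [
    (400, {"A", "B", "C"}),
    (300, {"A", "B"}),
    (200, {"C"}),
    (100, {"A", "B", "C"}),
]
summary = summarize_pan_genome(toy_clusters, n_genotypes=3)
```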

**Fig. 2: Single-copy pan-genome and use of PAVs in association mapping.**

To test the suitability of the single-copy pan-genome for genetic analysis in a wider diversity panel without high-quality genome sequences, we collected whole-genome shotgun data (threefold coverage) for 200 domesticated and 100 wild varieties of barley (Supplementary Table 4). The abundance of 160,716 single-copy clusters that overlap structural variants was estimated by counting cluster-constituent k-mers (k = 31) in sequence reads of the diversity panel. In addition, we analysed genotyping-by-sequencing data of 19,778 gene bank accessions of domesticated barley⁹ using the same approach. Abundance estimates based on k-mers (hereafter referred to as ‘pan-genome markers’) showed that loci detected as single-copy sequence in one genome assembly can vary in copy number from zero to many in diverse germplasm (Extended Data Fig. 7c). A PCA of pan-genome markers genotyped in whole-genome shotgun and genotyping-by-sequencing data highlighted the same drivers of global population structure as SNPs (Extended Data Fig. 7d–g). In genome-wide association scans for morphological traits, pan-genome markers revealed—with a good signal-to-noise ratio—peaks that are consistent with previous reports⁹ (Fig. 2b, Extended Data Fig. 8). Notably, the pan-genome marker that was most highly associated with lemma adherence covered the NUDUM (NUD) gene¹¹ (Fig. 2c). All varieties of naked barley—in which lemmas can be easily separated from grains—are thought to trace back to a single mutational event, deleting the entire NUD sequence¹¹. Another putative knockout allele of NUD (nud1.g) that contains a likely disruptive SNP variant was recently found in Tibetan barley²⁸. All 36 naked accessions in our panel contained the known deletion (Fig. 2d), indicating that broader sampling of barley diversity—with a particular focus on centres of (morphological) diversity—is needed to discover novel rare alleles by genomic analyses.

Compared to reference-free approaches for k-mer-based genome-wide association scans such as AgRenSeq²⁹, trait-associated pan-genome markers are assigned with high precision to genomic positions, and aligning sequence assemblies in their vicinity provides immediate information about differences between haplotypes (Fig. 2c). Furthermore, the reduction of marker number by implicit clustering of k-mers into single-copy loci allows the use of standard mixed linear models^30,31 to correct for genomic relatedness.

A map of polymorphic inversions

Chromosome-scale sequence assemblies can reveal large-scale rearrangements that are challenging to detect with other methods. Large inversions (more than 5 Mb in size) were prominent in the genome alignments of our 20 assemblies (Fig. 1b, Extended Data Fig. 3a, c). Previous reports on segregating inversions in barley are anecdotal and have focused on induced mutants^32,33. To discover inversions in a broader set of germplasm, we mined patterns of contact frequencies in Hi-C data of a diversity panel mapped to a single reference genome³⁴. Among 69 barley genotypes (67 domesticated and 2 wild accessions) (Supplementary Table 5), Hi-C-based inversion scans revealed a total of 42 events that ranged from 4 to 141 Mb in size (mean size of 23.9 Mb) (Extended Data Fig. 9a). Most of these events occurred in the low-recombining pericentromeric regions of the barley chromosomes and segregated at low frequency: 25 events were observed only once (Extended Data Fig. 9b, c, Supplementary Table 6). We focus here on two notable examples: a frequent event on chromosome 2H and an inversion in the distal region of the long arm of chromosome 7H.

The inversion in chromosome 7H detected in the RGT Planet cultivar was the largest event that segregated in our panel (141 Mb) (Fig. 3a). In a biparental mapping population derived from a cross between RGT Planet and the non-carrier cultivar Hindmarsh (Fig. 3b), this event repressed recombination in an interval that spanned 49 cM in the genetic map of the Morex × Barke population¹⁹, which is isogenic for absence of the inversion (Fig. 3c, Supplementary Table 7). We also observed a moderately distorted segregation (57% allele frequency, χ² = 4.88, P < 0.05) in favour of the Hindmarsh allele in this interval. Recombination frequencies were increased in the flanking regions of the inversion in the RGT Planet × Hindmarsh population relative to Morex × Barke, which suggests a compensatory mechanism to maintain an average number of one-to-two crossovers per chromosome in the presence of large tracts of suppressed recombination³⁵.
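
Segregation distortion of the kind reported above is commonly assessed with a chi-square goodness-of-fit test against the expected 1:1 ratio. The Python sketch below assumes that this is the test in question (one degree of freedom) and uses hypothetical counts, because the text does not state how many lines were scored. For one degree of freedom the P value has the closed form erfc(√(χ²/2)), so no statistics library is needed.

```python
import math


def chi_square_1to1(count_a, count_b):
    """Chi-square test of two allele counts against a 1:1 expectation.

    Returns (chi2, p_value) for one degree of freedom.
    """
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical example: 57 lines carry one parental allele, 43 the other.
chi2, p_value = chi_square_1to1(57, 43)
```

With a fixed allele frequency f, the statistic grows linearly with the number of scored lines n, since χ² = n(2f − 1)². A 57% frequency is therefore not significant in 100 lines but can be in a larger population.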

**Fig. 3: Identification and characterization of a large inversion on chromosome 7H.**

By focusing on the inversion breakpoints in the RGT Planet sequence assembly, we designed a diagnostic PCR assay (Supplementary Fig. 2a, b, d) to rapidly genotype the presence of the inversion in 1,406 accessions (Supplementary Table 8). The inverted haplotype occurred at low frequency (1.3%) in the whole panel, but was found in many lines in the RGT Planet pedigree (Supplementary Fig. 3)—including commercially successful barley cultivars of past decades, such as Triumph, Quench and Sebastian. The earliest cultivar that carried the inversion was Diamant. As one of the donors of the semi-dwarf growth habit, Diamant was a highly influential founder line of modern barley breeding and traces back to a mutant induced by gamma irradiation of the Czech cultivar Valticky³⁶. We genotyped several gene bank accessions and germplasm samples of both Valticky and Diamant. Notably, none of the Valticky samples carried the inversion, whereas it segregated in the Diamant samples (Fig. 3d). Quantitative trait loci mapping for yield-related traits in the RGT Planet × Hindmarsh population did not show signals on chromosome 7H (Supplementary Fig. 2e, Supplementary Table 9), consistent with selective neutrality of the inversion. This strongly suggests that mutation breeding in the 1960s has given rise to a cryptic large inversion, which—unbeknownst to breeders—segregates in elite varieties of barley.

The second inversion we focused on spanned 10 Mb in the interstitial region of chromosome 2H (Fig. 1b) and was present in 26 out of 69 Hi-C samples (Supplementary Table 8). Local PCA and haplotype analysis in our panel of 200 domesticated and 100 wild varieties of barley indicated a single origin of the inverted haplotype (Fig. 4a, b, Supplementary Fig. 2c). The inversion occurred only among domesticated barley of Western geographical origin⁹, indicating that it arose or has risen to high frequency after domestication. The inverted region contains 46 high-confidence genes in the Morex cultivar. The closest gene to the inversion breakpoint—at 448 kb distance from the distal breakpoint in the non-carrier Morex—was HvCENTRORADIALIS (HvCEN)³⁷ (Fig. 4c). Although induced mutants of HvCEN flower very early, natural variation in HvCEN has previously been implicated in environmental adaptation to northern European climates³⁷. All of the inversion carriers we analysed had HvCEN haplotype III, which is associated with later flowering in spring barley varieties from northern Europe^37,38. Further research is required to determine whether the inversion close to HvCEN has direct functional consequences (for instance, by modulating HvCEN expression) or whether it hitchhiked along with a tightly linked causal variant.

**Fig. 4: Analysis of a frequent inversion on chromosome 2H.**

Discussion

The digital representation of the pan-genome can expand the repertoire of natural or induced sequence variation that is accessible to genetic analyses and breeding. Our comparison of 20 chromosome-scale sequence assemblies has revealed pervasive variation in genes and non-coding regions. Focusing on single-copy sequences, we translated this variation into scorable markers that are amenable to population genetic analysis and association scans. A notable finding was the prevalence of large (more than 5 Mb in size) inversion polymorphisms in current elite germplasm. It is likely that the suppression of genetic recombination in inversion heterozygotes has manifested itself in hard-to-explain patterns of long-range linkage and segregation distortion between elite lines in breeding programmes. Our map of inversion polymorphisms will provide breeders with a point of reference to avoid, or interpret correctly, crosses between carriers and non-carriers. We found abundant structural variation in 20 representative barley genotypes, but individual events occurred at low frequency (Extended Data Figs. 6, 9). This observation, combined with the slow saturation of the single-copy pan-genome (Fig. 2a), motivates the genomic analysis of more genotypes to expand the barley pan-genome. The next phase of barley pan-genomics will focus on an augmented panel of domesticated and wild germplasm, working towards the long-term goal of high-quality genome sequences of all barley plant genetic resources as part of a biodigital resource centre^39,40.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.

Library preparation, sequencing data generation and genome assembly of 20 diverse varieties of barley

High-molecular-weight DNA was extracted from one-week-old seedlings of 20 diverse barley accessions given in Supplementary Table 10, using a previously described large-scale DNA extraction protocol⁴¹. For the NRGene DeNovoMAGIC3.0 assemblies, 450-bp paired-end (PE450) libraries of Morex, Barke, HOR 10350 and B1K-04-12 were prepared at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben. The 450-bp paired-end libraries for other accessions, 800-bp paired-end libraries and mate-pair libraries of three sizes were prepared and sequenced at the University of Illinois Roy J. Carver Biotechnology Center. The 10X Genomics Chromium libraries were prepared at the University of Saskatchewan Wheat Molecular Breeding Laboratory and sequenced by Genome Quebec or prepared and sequenced at the Roy J. Carver Biotechnology Center, using the manufacturers’ recommendations. Published tethered chromosome conformation data for Morex, Barke, HOR 10350 and B1K-04-12 (ref. ⁴²) were used for scaffolding the respective genomes. For the other accessions, in situ Hi-C libraries were prepared using a previously described method⁴³. Sequencing data generated from each of the libraries are given in Supplementary Table 10. NRGene DeNovoMAGIC3.0 scaffold assemblies were provided for Barke, HOR 10350 and B1K-04-12. The 10X Chromium, population sequencing (POPSEQ) and Hi-C data were then used to prepare chromosome-scale assemblies using the TRITEX pipeline⁶ (commit: 7041ff2). For the other assemblies, the TRITEX pipeline was also used for contig assembly and scaffolding with mate-pair and 10X data (Extended Data Table 1). High-confidence gene models annotated on the Morex V2 reference⁶ and full-length cDNA sequences⁴⁴ were aligned to the assemblies to assess gene-space completeness with the parameters of ≥90% query coverage and ≥97% (≥90% for full-length cDNA) identity.

Tissue collection and RNA extraction

Plant material for the collection of tissues for RNA sequencing (RNA-seq) and Iso-Seq was grown in the greenhouse at IPK Gatersleben with day–night temperatures of 21 °C–18 °C. Embryonic tissue, leaves, roots, internode, inflorescence (5 mm) and developing seeds (5 and 15 days after pollination) were collected as previously described⁴, snap-frozen in liquid nitrogen and stored at −80 °C until RNA extractions were performed. RNA was extracted from the collected tissues using a Trizol extraction protocol⁴ and purified using Qiagen RNeasy miniprep columns as per the manufacturer’s instructions. RNA quality was checked on Agilent RNA HS screen tape and RNA with RIN value greater than 8 was used for RNA-seq and Iso-Seq library construction.

RNA-seq library preparation and data generation

RNA-seq libraries were prepared from purified RNA using the TruSeq RNA sample preparation kit (Illumina) as per the manufacturer’s recommendation at IPK Gatersleben. Libraries were pooled at equimolar concentrations, quantified by qPCR and paired-end-sequenced on an Illumina HiSeq 2500 for 200 cycles. The data generated for each tissue are given in Supplementary Table 2.

Iso-Seq data generation and analysis

Two libraries for each embryonic tissue RNA and pooled RNA from seven tissues (described in ‘Tissue collection and RNA extraction’) were prepared for Barke and HOR 10350 using the PacBio Iso-Seq protocol. In brief, double-stranded cDNA was synthesized using the SMARTer PCR cDNA synthesis kit (Clontech; cat. no. 634925). Two fractions of cDNA with different size profiles were prepared by using differing ratios of DNA to Ampure XP beads (Beckman Coulter, cat. no. A63882). Equimolar concentrations of each fraction were pooled, and a minimum of one microgram of double-stranded cDNA was used for Iso-Seq library construction as per the PacBio library construction protocol. Two additional libraries from pooled RNA tissues were prepared using cDNA prepared with the TeloPrime v.1.0 kit (Lexogen) following the manufacturer’s instructions. Libraries were quantified and sequenced on a PacBio Sequel device at IPK Gatersleben. Data were analysed using the SMRTLink v.5.0 Isoseq v.1.0 pipeline or the Isoseq3 pipeline (https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda). The steps involved in Iso-Seq data analysis were the generation of circular consensus sequences, and then the classification of circular consensus sequence reads into full-length non-chimeric reads and non-full-length reads on the basis of the presence of primer sequences and polyA sequences. Full-length non-chimeric reads were then clustered on the basis of sequence similarity to yield high- and low-quality isoforms. The data generated and method of library preparation are given in Supplementary Table 2.

Gene projections

Gene models for Morex, Barke and HOR 10350 were predicted using transcriptome data (Supplementary Table 2) and protein homology evidence, and derived by a previously described annotation pipeline⁵. High-confidence gene models from these accessions were aligned to pseudo-chromosomes of each accession separately using blat⁴⁵. For each genomic region identified by blat, additional alignments were performed by exonerate⁴⁶ in its genomic neighbourhood ranging between 20 kb upstream and 20 kb downstream of the match position. A series of quality criteria was applied to select high-confidence gene models in each accession. The functional annotation for genes of 20 accessions was carried out using the AHRD pipeline v.3.3.3 (https://github.com/groupschoof/AHRD). Orthologous gene groups between the twenty accessions were predicted using OrthoFinder⁴⁷ v.2.3.1 with default parameters.

Repeat annotation

To obtain a consistent annotation of transposons and tandem repeats, the same methods were applied to all 20 barley lines. Transposons were detected and classified by homology search against the REdat_9.7_Poaceae section of the PGSB transposon library⁴⁸. The program vmatch (http://www.vmatch.de, version 2.3.0) was used for that purpose as a fast and efficient matching tool that is well-suited for such large and highly repetitive genomes. Vmatch was run with the following parameters: identity ≥ 70%, minimal hit length 75 bp, seed length 12 bp (exact command line: -d -p -l 75 -identity 70 -seedlength 12 -exdrop 5). To remove overlapping annotations, the vmatch output was filtered for redundant hits via a priority-based approach. Higher scoring matches were assigned first and lower scoring hits at overlapping positions were either shortened or removed if they were contained to ≥90% in the overlap or <50 bp of rest length remained. The resulting transposon annotations are overlap-free, but disrupted elements from nested insertions have not been defragmented into one element. Still-intact full-length LTR retrotransposons were identified with LTRharvest⁴⁹, a program that scans the genome for LTR retrotransposon specific structural hallmarks, such as long terminal repeats, RNA cognate primer binding sites and target site duplications. LTRharvest (included in genometools 1.5.9) was run with the following parameter settings: ‘-overlaps best -seed 30 -minlenltr 100 -maxlenltr 2000 -mindistltr 3000 -maxdistltr 25000 -similar 85 -mintsd 4 -maxtsd 20 -motif tgca -motifmis 1 -vic 60 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3’. All candidates were annotated for PfamA domains using hmmer3 (http://hmmer.org, version 3.1b2) and filtered to remove false positives. The inner domain order served as a criterion for the LTR-retrotransposon superfamily classification into either Gypsy or Copia. In the cases of insufficient domain information, the elements were assigned as still undetermined.
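
The priority-based removal of overlapping hits can be sketched as follows. This Python snippet is an interpretation of the rule stated above (assign higher-scoring hits first; drop a lower-scoring hit if at least 90% of it is already covered or if less than 50 bp would remain; otherwise shorten it) and is not the filtering script used in the study. In particular, keeping only the longest uncovered piece when a hit is split in two is an assumption, as are the function names and the half-open coordinate convention.

```python
def uncovered_pieces(start, end, occupied):
    """Parts of [start, end) not covered by sorted, disjoint occupied intervals."""
    pieces = []
    cursor = start
    for occ_start, occ_end in occupied:
        if occ_end <= cursor:
            continue
        if occ_start >= end:
            break
        if occ_start > cursor:
            pieces.append((cursor, occ_start))
        cursor = max(cursor, occ_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


def resolve_overlaps(hits, max_contained=0.9, min_rest=50):
    """Return overlap-free hits as (start, end, score), sorted by position.

    hits: iterable of (start, end, score) with half-open coordinates.
    """
    occupied = []   # (start, end) of accepted hits, kept sorted
    accepted = []
    for start, end, score in sorted(hits, key=lambda h: h[2], reverse=True):
        pieces = uncovered_pieces(start, end, occupied)
        if not pieces:
            continue                                  # fully covered
        length = end - start
        overlap = length - sum(e - s for s, e in pieces)
        best = max(pieces, key=lambda p: p[1] - p[0])
        if overlap > 0 and (overlap / length >= max_contained
                            or best[1] - best[0] < min_rest):
            continue                                  # mostly covered or too short
        occupied.append(best)
        occupied.sort()
        accepted.append((best[0], best[1], score))
    return sorted(accepted)
```

For example, a hit spanning 900 to 1,500 that partly overlaps a higher-scoring hit at 0 to 1,000 would be shortened to 1,000 to 1,500 under this sketch.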

Most of the transposons insert at random locations leading to novel and usually unique sequence stretches at both borders around the inserted element and the neighbouring original sequence. The de novo detected full-length LTR set provides defined element borders, a prerequisite for the exact positioning of transposable element junctions. We used 100-bp single transposable element junctions with 50 bp outside and 50 bp inside the element from both sides of the element and merged them to 200 bp joined junctions per element. Junctions from the reverse strand were reverse-complemented. The 200-bp joined junctions from all 20 lines were clustered with vmatch dbcluster (http://www.vmatch.de, version 2.3.0) at 98% identity and 98% mutual length coverage (command-line parameters: 98 98 -e 2 -l 98 -d). About 97% of the clusters belonged to the 1:1 type with a maximum of 1 member per line and were used for the downstream analyses. Using the above-described 200-bp joined junctions instead of full sequences reduces the amount of data for clustering to 2%, from about 10 kb to 200 bp per full-length LTR element, thus allowing a sequence clustering of 2.2 million elements in the first place. By including sequence information outside of the element, the repetitiveness of high-copy transposable element families is removed and at the same time the syntenic context is provided even for elements located on chrUn (that is, not assigned to chromosomal pseudomolecules).
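
The construction of joined junctions described above can be made concrete with a short Python sketch: 50 bp outside and 50 bp inside the element are taken at each border, the two 100-bp junctions are concatenated, and elements on the reverse strand are reverse-complemented. The function names, the 0-based half-open coordinates and the decision to skip elements too close to a sequence end are assumptions made for this illustration.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def joined_junction(sequence, start, end, strand="+", flank=50):
    """Build the joined junction (4 * flank bp) for one element.

    sequence: chromosome or scaffold sequence as a string.
    start, end: 0-based, half-open coordinates of the element.
    Returns None if the flanks do not fit or the element is shorter
    than two flank lengths.
    """
    if start < flank or end + flank > len(sequence) or end - start < 2 * flank:
        return None
    left = sequence[start - flank:start + flank]    # outside + inside, left border
    right = sequence[end - flank:end + flank]       # inside + outside, right border
    joined = left + right
    return reverse_complement(joined) if strand == "-" else joined


# Toy sequence: 60 bp of A, a 120-bp "element" of C, then 60 bp of G.
toy = "A" * 60 + "C" * 120 + "G" * 60
forward = joined_junction(toy, 60, 180, "+")
reverse = joined_junction(toy, 60, 180, "-")
```

Because half of every junction lies outside the element, two copies of the same transposon family inserted at different loci yield different joined junctions, which is what makes clustering at 98% identity locus-specific.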

PAV detection and validation

Owing to higher sensitivity in detecting deletions over insertions, a paired genome alignment strategy was used in which each assembly was aligned to the Morex reference genome reciprocally, treating Morex as both query and reference, using Minimap2 (v.2.17)⁵⁰. From these two alignments, insertions and deletions were called using Assemblytics (v.1.2.1)²⁶. Then, only deletions were selected in both alignments and converted into PAVs with regard to Morex. In addition, a hard filter was used to discard PAVs containing more than 5% gaps (Ns) and nested PAVs. We used a previously described method⁵¹ to map deletions longer than 5 kb in Barke relative to Morex using whole-genome shotgun data for 90 Morex × Barke recombinant inbred lines¹⁹. Mosdepth (v.0.2.9)⁵² was used for determining read depth in genomic intervals.
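
The hard filter on PAVs (more than 5% gap characters, or nested within another PAV) is simple to express in code. The following Python sketch is a hypothetical implementation: the record layout, the function names and the order of the two steps (gap filter first, then removal of PAVs fully contained in a retained PAV on the same chromosome) are assumptions, since the text does not specify them.

```python
def gap_fraction(seq):
    """Fraction of N characters in a sequence (1.0 for an empty sequence)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def filter_pavs(pavs, max_gap_fraction=0.05):
    """Drop gap-rich PAVs, then PAVs nested inside another retained PAV.

    pavs: list of dicts with keys 'chrom', 'start', 'end' (half-open) and 'seq'.
    """
    kept = [p for p in pavs if gap_fraction(p["seq"]) <= max_gap_fraction]
    # Sort by start, longest first, so a containing PAV precedes nested ones.
    kept.sort(key=lambda p: (p["chrom"], p["start"], -p["end"]))
    result = []
    current_chrom, max_end = None, -1
    for pav in kept:
        if pav["chrom"] != current_chrom:
            current_chrom, max_end = pav["chrom"], -1
        if pav["end"] <= max_end:
            continue                      # nested within an earlier PAV
        result.append(pav)
        max_end = pav["end"]
    return result
```

Partially overlapping PAVs are retained by this sketch; only full containment counts as nesting.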

k-mer-based genome-wide association

PAVs overlapping with single-copy regions were identified by BEDTools (v.2.28.0)⁵³. k-mer sequences with step size of 2 bp were retrieved from single-copy regions residing within PAVs. The abundances of the extracted k-mer sequences were counted in sequence reads using BBDuk (BBMap_37.93) (https://sourceforge.net/projects/bbmap/). k-mer counts were obtained for whole-genome shotgun data of 300 diverse varieties of barley generated in the present study and previously published genotyping-by-sequencing data⁹. k-mer counts were imported into R (v.3.5.1)⁵⁴ and normalized for differences in read depth between samples. The normalized k-mer counts were then used for genome-wide association scans using GAPIT3³⁰ and PCA using standard R functions.
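
The k-mer workflow above (extract k-mers at a 2-bp step, count them in reads, normalize by read depth) can be sketched in plain Python. This is a simplified stand-in for the BBDuk and R steps, not a reimplementation of them: the function names are hypothetical, reverse-complement matches are ignored, and the normalization shown (mean count per k-mer scaled by the total number of sequenced bases) is an assumed scheme because the text does not give the formula.

```python
from collections import Counter


def extract_kmers(seq, k=31, step=2):
    """k-mers of a region, sampled every `step` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def count_kmers_in_reads(kmers, reads, k=31):
    """Count exact, forward-strand occurrences of the target k-mers in reads."""
    targets = set(kmers)
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in targets:
                counts[kmer] += 1
    return counts


def normalized_abundance(counts, kmers, total_bases, scale=1e9):
    """Mean count per marker k-mer, per `scale` sequenced bases (assumed scheme)."""
    mean_count = sum(counts[kmer] for kmer in kmers) / len(kmers)
    return mean_count * scale / total_bases


# Toy example with k = 4 (the study used k = 31).
marker_kmers = extract_kmers("ACGTTGCAAG", k=4, step=2)
toy_reads = ["ACGTTG", "TTGCAA", "GGGGGG"]
toy_counts = count_kmers_in_reads(marker_kmers, toy_reads, k=4)
```

Averaging over all k-mers of a cluster yields one abundance value per locus and sample, which is how a large k-mer table reduces to a marker matrix small enough for standard association software.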

Construction of single-copy pan-genome

To identify single-copy regions in each genome, genomic regions covered by 31-mers occurring more than once were masked using BBDuk (BBMap_37.93)⁵⁵. Based on this masking, single-copy regions in each assembly were obtained in .bed format and the corresponding sequences were subsequently retrieved using BEDTools (v.2.28.0)⁵³. Single-copy sequences from all the assemblies were combined to perform an all-against-all BLAST search. The BLAST results were filtered (>90% identity and minimum 80% alignment length) and then clustered using the igraph package⁵⁶. A representative from each cluster (the largest contained sequence) was selected and used for estimating pan-genome size. Clusters shared by all 20 accessions are referred to as the core genome, and clusters with sequences originating from 1 to 19 genotypes are considered the variable genome.
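The clustering and core/variable classification can be sketched as follows. The sketch assumes that clusters are connected components of the filtered hit graph and that the 80% alignment-length threshold is measured against the query length; neither detail is specified above. All names are hypothetical, and a small union-find stands in for the igraph package.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float, int]  # query, subject, % identity, alignment length


def cluster_single_copy(lengths: Dict[str, int], origin: Dict[str, str],
                        hits: Iterable[Hit], n_genomes: int,
                        min_ident: float = 90.0,
                        min_cov: float = 0.8) -> List[Tuple[str, int, str]]:
    """Return (representative, number of genomes, 'core'/'variable') per cluster."""
    parent = {s: s for s in lengths}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for q, s, ident, aln_len in hits:
        if q != s and ident > min_ident and aln_len >= min_cov * lengths[q]:
            parent[find(q)] = find(s)

    clusters = defaultdict(list)
    for s in lengths:
        clusters[find(s)].append(s)

    summary = []
    for members in clusters.values():
        # Longest sequence is the representative; ties are broken by name.
        rep = max(members, key=lambda m: (lengths[m], m))
        n = len({origin[m] for m in members})
        summary.append((rep, n, "core" if n == n_genomes else "variable"))
    return sorted(summary)


def pan_genome_size(summary: List[Tuple[str, int, str]],
                    lengths: Dict[str, int]) -> int:
    """Sum of representative lengths over all clusters."""
    return sum(lengths[rep] for rep, _, _ in summary)
```

For the 20 assemblies in this study, `n_genomes` would be 20, so a cluster counts as core only if it contains sequences from every accession.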

Hi-C library preparation, sequencing and inversion calling

In situ Hi-C libraries were prepared from one-week-old seedlings of the barley IPK core50 collection⁹ (Supplementary Table 5) based on a previously described protocol⁴³. Sequencing, Hi-C raw data processing and inversion calling were performed as previously described³⁴ using the MorexV2 reference genome sequence assembly⁶. The breakpoint regions were identified by pairwise genome alignment using Minimap2 (v.2.17)⁵⁰ and PipMaker (http://pipmaker.bx.psu.edu/cgi-bin/pipmaker?basic)⁵⁷.

Resequencing, SNP calling and PCA

Raw reads (Supplementary Table 4) were trimmed with cutadapt (v.1.15) and aligned to the MorexV2 genome assembly⁶ using Minimap2 (v.2.17)⁵⁰. The alignments were sorted using Novosort (V3.06.05) (http://www.novocraft.com). BCFtools (v.1.8)⁵⁸ was used to call SNPs and short insertions and deletions (indels). The resulting VCF file was converted into Genomic Data Structure (GDS) format using the SeqArray package⁵⁹ in R to obtain a SNP matrix. Finally, hard filtering was applied to remove SNPs having more than 10% missing data and heterozygosity. Previously generated genotyping-by-sequencing data⁹ were aligned to the MorexV2 reference and SNPs were identified using a previously described variant calling pipeline⁹. PCAs were performed using the snpgdsPCA() function of the SNPRelate package⁶⁰.
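A hard filter of this kind might look as follows in Python. The sketch reads the criterion as two separate 10% thresholds, one on the missing rate and one on the heterozygous call rate among called genotypes; this reading, the genotype coding (0 and 2 for homozygotes, 1 for heterozygotes, None for missing) and the function name are assumptions made for illustration. The authors performed this step in R.

```python
from typing import Optional, Sequence


def keep_snp(genotypes: Sequence[Optional[int]],
             max_missing: float = 0.10,
             max_het: float = 0.10) -> bool:
    """Return True if a SNP passes the missing-data and heterozygosity filters."""
    n = len(genotypes)
    if n == 0:
        return False
    missing = sum(g is None for g in genotypes)
    if missing / n > max_missing:
        return False
    called = n - missing
    het = sum(g == 1 for g in genotypes)
    # Heterozygosity is computed over called genotypes only (assumption).
    return called > 0 and het / called <= max_het
```

Applying `keep_snp` to each row of the SNP matrix yields the filtered set passed on to PCA.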

RGT Planet × Hindmarsh mapping population

A cross was made between RGT Planet (maternal plant) and Hindmarsh (pollen donor). In total, 38 F₂ plants from the direct cross and 233 individual heads from F₃ seeds were progressed to the F₆ generation by the single-seed descent method. The F₆ recombinant inbred lines (RILs) (224 in total) were used for construction of a genetic linkage map. Genomic DNA was extracted from the leaves of a single plant per RIL using the cetyl-trimethyl-ammonium bromide method. DNA quality was assessed on 1% agarose gels and DNA was quantified using a NanoDrop spectrophotometer (Thermo Scientific NanoDrop Products). DNA was diluted to 50 ng μl⁻¹ and placed in a 96-well plate for PCR. DArT-seq genotyping-by-sequencing was performed using the DArT-seq platform (DArT PL) according to the manufacturer’s protocol (https://www.diversityarrays.com/). In brief, 100 μl of 50 ng μl⁻¹ DNA was sent to DArT PL, and genotyping-by-sequencing was performed using complexity reduction followed by sequencing on a HiSeq Illumina platform as previously described⁶¹ (Supplementary Table 9). Sequences flanking polymorphisms detected by DArT-seq were aligned against the MorexV2 genome assembly to determine their physical positions (Supplementary Table 7).

Field experiments and phenotypic data

Field experiments were conducted at six sites: Gibson, Western Australia (WA, −33.612176, 121.798438); Williams, Western Australia (−33.577668, 116.734934); Wongan Hills, Western Australia (−30.848953, 116.756461); Merredin, Western Australia (−31.487009, 118.229668); South Perth, Western Australia (−31.991186, 115.887944); and Shepperton, Victoria (−36.487551, 145.388470). The distance between South Perth and Shepperton is over 3,300 km. The Merredin site is located inland and receives little rainfall, whereas the Gibson site receives a high amount of rainfall; the other sites are in between. The experimental design for field trial sites was performed as previously described⁶². In brief, all regional field trials (partially replicated design) were planted in a randomized complete block design using plots of 1 by 5 m², laid out in a row–column format and the middle 3 m was harvested for grain yield. Field trials in South Perth and Shepperton were conducted using double rows with a 40-cm distance within and between rows, owing to space constraints. Seven control varieties were used for spatial adjustment of the experimental data. Measurements were taken at each plot of each field experiment in the study to determine flowering time (days to Zadoks stage (ZS)49), plant height and grain yield. In brief, heading date was recorded as the number of days from sowing to 50% awn emergence above the flag leaf (ZS49), as a proxy for flowering time. Plant height was determined by estimating the average height from the base to the tip of the head of all plants in each plot. Grain yield (kg ha⁻¹) was determined by destructively harvesting all plant material from each plot to separate the grain, and then determining grain mass.
Grain yield data of the field experiments, as well as plant height and heading data, were analysed using linear mixed models in ASReml-R (https://www.vsni.co.uk/software/asreml-r/) to determine best linear unbiased predictions or best linear unbiased estimations for each trait for further analysis. Local best practices for fertilization and disease control were adopted for each trial site.

Quantitative trait loci (QTL) mapping

Software MapQTL6 was used for the QTL analysis⁶³. The genotypic data, phenotypic data and genetic map were formatted and imported to MapQTL6. Interval mapping was conducted for each trait, and then the markers with a logarithm of odds (LOD) value of above 3.0 were selected as cofactors. Multiple QTL model mapping was performed to re-calculate the QTL. If the markers with the highest LOD value were inconsistent with the cofactor markers, then the new markers were selected as cofactors and re-calculated. The QTL results and charts were exported from the software.

Long-read sequence assembly of the Morex cultivar

PacBio libraries were constructed using the SMRTbell Template Prep Kit 1.0 and sized (20–80 kb) on a SAGE Blue Pippin instrument. Sequencing was performed on a Sequel II device at the HudsonAlpha Institute using V.1.0 chemistry and a 10-h movie time. Data were generated from a total of five SMRT cells, yielding 604 Gb of raw sequence reads. A total of 520.72 Gb of this set (104.15×) was used for assembly (Supplementary Tables 11, 12). Previously published Illumina short-read data (ERR3183748 and ERR3183749⁶ (Supplementary Table 12)) were used for polishing and error correction. Before use, Illumina fragment reads were screened for PhiX contamination. Reads composed of >95% simple sequences were removed. Illumina reads shorter than 50 bp after trimming for adaptor and quality (q < 20) were removed. The final read set consists of 605,178,701 reads, representing a total of 43.17× of high-quality Illumina bases. The initial assembly was generated by assembling 32,743,478 PacBio reads (104.15× sequence coverage) using MECAT (v.1.1)⁶⁴ and subsequently polished using Arrow⁶⁵. This produced an initial assembly of 1,577 scaffolds (1,577 contigs), with a contig N50 of 10.4 Mb, 987 scaffolds larger than 100 kb and a total genome size of 4,139.8 Mb (Supplementary Table 13).
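For readers unfamiliar with the summary statistics quoted above, the following Python sketch shows how a contig N50 and a genome size implied by a coverage figure are commonly computed. The function names are hypothetical and the sketch is not part of the authors' pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def implied_genome_size_gb(bases_gb: float, coverage: float) -> float:
    """Genome size (Gb) implied by a sequencing yield and a stated fold coverage."""
    return bases_gb / coverage
```

Dividing the 520.72 Gb used for assembly by the stated 104.15× coverage gives a genome size of about 5.0 Gb underlying the coverage estimate.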

A first round of breaking chimeric scaffolds was done using the POPSEQ genetic map¹⁹ to identify contigs bearing markers from distant genomic regions. A total of 17 misjoins were identified and resolved. Homozygous SNPs and indels were corrected in the release consensus sequence using a subset of about 30× of the Illumina reads described above in this section. Reads were aligned using BWA-MEM⁶⁶. Homozygous SNPs and indels were discovered with GATK’s UnifiedGenotyper tool⁶⁷. A total of 59 homozygous SNPs and 15,759 homozygous indels were corrected. After these correction steps, the assembly contained 4,139.7 Mb of sequence, consisting of 1,594 contigs with a contig N50 of 10.2 Mb. A second round of chimaera breaking was done by inspecting Hi-C contact matrices as described in the TRITEX pipeline⁶. Published Hi-C data of the Morex cultivar were used⁵. Corrected contigs were arranged into pseudomolecules using TRITEX.

Comparison of PacBio continuous long read (CLR) and TRITEX assemblies of the Morex cultivar

Full-length cDNA sequences⁴⁴ were aligned to the assemblies to assess gene space completeness. Only alignments with query coverage ≥90% and identity ≥90% were considered. Whole-genome alignments were done with Minimap2. Structural variant calling with Assemblytics (v.1.2.1)²⁶ (Morex TRITEX versus Morex CLR; Morex CLR versus Barke) and extraction of single-copy regions were done as described in ‘PAV detection and validation’.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Data availability

All raw sequence data collected in this study and sequence assemblies have been deposited at the European Nucleotide Archive (ENA). Accession codes for raw data and assemblies are listed in Supplementary Tables: Supplementary Table 14 (assemblies), Supplementary Table 10 (assembly raw data), Supplementary Table 4 (whole-genome shotgun sequencing), Supplementary Table 5 (Hi-C) and Supplementary Table 9 (DArT-seq). Assemblies, annotations and analysis results were deposited under a DOI in the PGP repository⁶⁸ using the e!DAL submission system⁶⁹ and are accessible under the URL https://doi.org/10.5447/ipk/2020/24. Assemblies and gene annotations can also be downloaded from https://barley-pangenome.ipk-gatersleben.de. The Barley Pedigree Catalogue is available at http://genbank.vurv.cz/barley/pedigree/.

Code availability

Source code is released in a public Bitbucket repository, at https://bitbucket.org/ipk_dg_public/barley_pangenome/.

References

Bayer, P. E., Golicz, A. A., Scheben, A., Batley, J. & Edwards, D. Plant pan-genomes are the new reference. Nat. Plants 6, 914–920 (2020).
Dawson, I. K. et al. Barley: a translational model for adaptation to climate change. New Phytol. 206, 913–931 (2015).
Stein, N. & Muehlbauer, G. J. The Barley Genome (Springer, 2018).
International Barley Genome Sequencing Consortium. A physical, genetic and functional sequence assembly of the barley genome. Nature 491, 711–716 (2012).
Mascher, M. et al. A chromosome conformation capture ordered sequence of the barley genome. Nature 544, 427–433 (2017).
Monat, C. et al. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol. 20, 284 (2019).
Mascher, M. et al. Mapping-by-sequencing accelerates forward genetics in barley. Genome Biol. 15, R78 (2014).
Russell, J. et al. Exome sequencing of geographically diverse barley landraces and wild relatives gives insights into environmental adaptation. Nat. Genet. 48, 1024–1030 (2016).
Milner, S. G. et al. Genebank genomics highlights the diversity of a global barley collection. Nat. Genet. 51, 319–326 (2019).
Muñoz-Amatriaín, M. et al. Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome. Genome Biol. 14, R58 (2013).
Taketa, S. et al. Barley grain with adhering hulls is controlled by an ERF family transcription factor gene regulating a lipid biosynthesis pathway. Proc. Natl Acad. Sci. USA 105, 4062–4067 (2008).
Tettelin, H. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl Acad. Sci. USA 102, 13950–13955 (2005).
Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat. Rev. Genet. 21, 171–189 (2019).
Danilevicz, M. F., Tay Fernandez, C. G., Marsh, J. I., Bayer, P. E. & Edwards, D. Plant pangenomics: approaches, applications and advancements. Curr. Opin. Plant Biol. 54, 18–25 (2020).
Monat, C., Schreiber, M., Stein, N. & Mascher, M. Prospects of pan-genomics in barley. Theor. Appl. Genet. 132, 785–796 (2019).
Coronado, M.-J., Hensel, G., Broeders, S., Otto, I. & Kumlehn, J. Immature pollen-derived doubled haploid formation in barley cv. Golden Promise as a tool for transgene recombination. Acta Physiol. Plant. 27, 591–599 (2005).
Schreiber, M. et al. A genome assembly of the barley ‘transformation reference’ cultivar Golden Promise. G3 10, 1823–1827 (2020).
Gottwald, S., Bauer, P., Komatsuda, T., Lundqvist, U. & Stein, N. TILLING in the two-rowed barley cultivar ‘Barke’ reveals preferred sites of functional diversity in the gene HvHox1. BMC Res. Notes 2, 258 (2009).
Mascher, M. et al. Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ). Plant J. 76, 718–727 (2013).
Hübner, S. et al. Strong correlation of wild barley (Hordeum spontaneum) population structure with temperature and precipitation variation. Mol. Ecol. 18, 1523–1536 (2009).
Chikhi, R., Limasset, A. & Medvedev, P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32, i201–i208 (2016).
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).
Clavijo, B. J. et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 27, 885–896 (2017).
Anderson, S. N. et al. Transposable elements contribute to dynamic genome content in maize. Plant J. 100, 1052–1065 (2019).
Brunner, S., Fengler, K., Morgante, M., Tingey, S. & Rafalski, A. Evolution of DNA sequence nonhomologies among maize inbreds. Plant Cell 17, 343–360 (2005).
Nattestad, M. & Schatz, M. C. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics 32, 3021–3023 (2016).
Gordon, S. P. et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8, 2184 (2017).
Yu, S. et al. A single nucleotide polymorphism of Nud converts the caryopsis type of barley (Hordeum vulgare L.). Plant Mol. Biol. Report. 34, 242–248 (2016).
Arora, S. et al. Resistance gene cloning from a wild crop relative by sequence capture and association genetics. Nat. Biotechnol. 37, 139–143 (2019).
Lipka, A. E. et al. GAPIT: genome association and prediction integrated tool. Bioinformatics 28, 2397–2399 (2012).
Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006).
Ekberg, I. Cytogenetic studies of three paracentric inversions in barley. Hereditas 76, 1–30 (1974).
Ramage, R. & Suneson, C. Translocation-gene linkages on barley chromosome 7. Crop Sci. 1, 319–320 (1961).
Himmelbach, A. et al. Discovery of multi-megabase polymorphic inversions by chromosome conformation capture sequencing in large-genome plant species. Plant J. 96, 1309–1316 (2018).
Ederveen, A., Lai, Y., van Driel, M. A., Gerats, T. & Peters, J. L. Modulating crossover positioning by introducing large structural changes in chromosomes. BMC Genomics 16, 89 (2015).
Bouma, J. & Ohnoutka, Z. Importance and Application of the Mutant ‘Diamant’ in Spring Barley Breeding (IAEA, 1991).
Comadran, J. et al. Natural variation in a homolog of Antirrhinum CENTRORADIALIS contributed to spring growth habit and environmental adaptation in cultivated barley. Nat. Genet. 44, 1388–1392 (2012).
Bustos-Korts, D. et al. Exome sequences and multi-environment field trials elucidate the genetic basis of adaptation in barley. Plant J. 99, 1172–1191 (2019).
Mascher, M. et al. Genebank genomics bridges the gap between the conservation of crop diversity and plant breeding. Nat. Genet. 51, 1076–1081 (2019).
Khan, A. W. et al. Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends Plant Sci. 25, 148–158 (2020).
Dvorak, J., McGuire, P. E. & Cassidy, B. Apparent sources of the A genomes of wheats inferred from polymorphism in abundance and restriction fragment length of repeated nucleotide sequences. Genome 30, 680–689 (1988).
Himmelbach, A., Walde, I., Mascher, M. & Stein, N. Tethered chromosome conformation capture sequencing in Triticeae: a valuable tool for genome assembly. Bio Protoc. 8, e2955 (2018).
Padmarasu, S., Himmelbach, A., Mascher, M. & Stein, N. in Plant Long Non-Coding RNAs (eds Chekanova, J. & Wang, H.-L.) 441–472 (Springer, 2019).
Matsumoto, T. et al. Comprehensive sequence analysis of 24,783 barley full-length cDNAs derived from 12 clone libraries. Plant Physiol. 156, 20–28 (2011).
Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015).
Spannagl, M. et al. PGSB PlantsDB: updates to the database framework for comparative plant genome research. Nucleic Acids Res. 44, D1141–D1147 (2016).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Gutierrez-Gonzalez, J. J., Mascher, M., Poland, J. & Muehlbauer, G. J. Dense genotyping-by-sequencing linkage maps of two synthetic W7984×Opata reference populations provide insights into wheat structural diversity. Sci. Rep. 9, 1793 (2019).
Pedersen, B. S. & Quinlan, A. R. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34, 867–868 (2018).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
R Core Team. R: A Language and Environment for Statistical Computing http://www.R-project.org (R Foundation for Statistical Computing, 2013).
Bushnell, B. BBMap: A Fast, Accurate, Splice-aware Aligner (Lawrence Berkeley National Laboratory, 2014).
Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Syst. 1695, 1–9 (2006).
Schwartz, S. et al. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 10, 577–586 (2000).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Zheng, X. & Gogarten, S. SeqArray: big data management of genome-wide sequence variants. R package version 1.10.6 https://github.com/zhengxwen/SeqArray (accessed January 2017).
Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012).
Akbari, M. et al. Diversity arrays technology (DArT) for high-throughput profiling of the hexaploid wheat genome. Theor. Appl. Genet. 113, 1409–1420 (2006).
Hill, C. B. et al. Hybridisation-based target enrichment of phenology genes to dissect the genetic basis of yield and adaptation in barley. Plant Biotechnol. J. 17, 932–944 (2019).
Van Ooijen, J. MapQTL 5, Software for the Mapping of Quantitative Trait Loci in Experimental Populations (Kyazma, 2004).
Xiao, C. L. et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods 14, 1072–1074 (2017).
Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Arend, D. et al. PGP repository: a plant phenomics and genomics data publication infrastructure. Database (Oxford) 2016, baw033 (2016).
Arend, D. et al. e!DAL—a framework to store, share and publish research data. BMC Bioinformatics 15, 214 (2014).

Acknowledgements

We thank M. Knauft, I. Walde and S. König for technical assistance; D. Schüler for sequence data management; J. Bauernfeind, T. Münch and H. Miehe for IT administration; D. Arend for help with data submission; M. Bayer for advice on transcriptome analysis; and M. Herz for pedigree information. This research was supported by grants from the German Federal Ministry of Education and Research to N.S., M.M., U.S., M.S. and K.F.X.M. (SHAPE, FKZ 031B0190), to U.S. and K.F.X.M. (de.NBI, FKZ 031A536) and to N.S. (COBRA, FKZ 031A323A); the Australian Grain Research and Development Cooperation (9176507) to C.L., K.C., P.L. and P.W.; JST CREST Japan (no. JPMJCR16O4 to K.M. and T.H.); JST Mirai Program Japan (no. 18076896 to K.S.); the National Key R&D Program of China (2018YFD1000701 and 2018YFD1000700) to D.X. and J.Z.; by funding from the China Agriculture Research System (CARS-05) and the Agricultural Science and Technology Innovation Program to C.W. and G.G. Support for 10X sequencing was provided by a research grant from Genome Canada and Genome Prairie to C.P. and J.E.; and by the Natural Science Foundation of China (31620103912) and the National Key R&D Program of China (2018YFD1000706) to G.Z. We acknowledge support from the European Research Council (ERC Shuffle, project identifier: 66918) to R.W.

Author information

These authors contributed equally: Murukarthick Jayakodi, Sudharsan Padmarasu

Authors and Affiliations

Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
Murukarthick Jayakodi, Sudharsan Padmarasu, Cécile Monat, Axel Himmelbach, Yu Guo, Anne Fiebig, Uwe Scholz, Martin Mascher & Nils Stein
Plant Genome and Systems Biology (PGSB), Helmholtz Center Munich, German Research Center for Environmental Health, Neuherberg, Germany
Georg Haberer, Venkata Suresh Bonthala, Heidrun Gundlach, Thomas Lux, Nadia Kamal, Daniel Lang, Klaus F. X. Mayer & Manuel Spannagl
Department of Plant Sciences, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
Jennifer Ens & Curtis J. Pozniak
Western Barley Genetics Alliance, State Agricultural Biotechnology Centre, College of Science, Health, Engineering and Education, Murdoch University, Murdoch, Western Australia, Australia
Xiao-Qi Zhang, Tefera T. Angessa, Gaofeng Zhou, Cong Tan, Camilla Hill, Penghao Wang & Chengdao Li
Agriculture and Food, Department of Primary Industries and Regional Development, South Perth, Western Australia, Australia
Gaofeng Zhou & Chengdao Li
The James Hutton Institute, Dundee, UK
Miriam Schreiber & Robbie Waugh
HudsonAlpha, Institute for Biotechnology, Huntsville, AL, USA
Lori B. Boston, Christopher Plott, Jerry Jenkins, Jane Grimwood & Jeremy Schmutz
Montana BioAg Inc, Missoula, MT, USA
Hikmet Budak
Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (ICS-CAAS), Beijing, China
Dongdong Xu, Jing Zhang, Chunchao Wang & Ganggang Guo
College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
Guoping Zhang
Bioproductivity Informatics Research Team, RIKEN Center for Sustainable Resource Science, Yokohama, Japan
Keiichi Mochida
Kihara Institute for Biological Research, Yokohama City University, Yokohama, Japan
Keiichi Mochida
Institute of Plant Science and Resources, Okayama University, Kurashiki, Japan
Keiichi Mochida, Takashi Hirayama & Kazuhiro Sato
School of Agriculture, Food and Wine, University of Adelaide, Glen Osmond, South Australia, Australia
Kenneth J. Chalmers, Peter Langridge & Robbie Waugh
School of Life Sciences, University of Dundee, Dundee, UK
Robbie Waugh
School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
Klaus F. X. Mayer
Hubei Collaborative Innovation Centre for Grain Industry, Yangtze University, Jingzhou, China
Chengdao Li
German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
Martin Mascher
Center for Integrated Breeding Research (CiBreed), Georg-August-University Göttingen, Göttingen, Germany
Nils Stein

Contributions

N.S. and M.M. designed the study. N.S. coordinated experiments and sequencing. M.M. supervised sequence assembly. M. Spannagl and K.F.X.M. supervised annotation. U.S. supervised data management and submission. S.P., A.H., J.E., D.X., L.B.B. and J.G. performed sequencing experiments. M.J., C.M., Y.G., C.P., J.J. and J.S. performed sequence assembly. M.J. performed structural variation and genome-wide association scan analysis. A.F. submitted sequence data. G.H., T.L., H.G., V.S.B., N.K. and D.L. annotated and analysed genes and transposable elements. S.P., M.J., X.-Q.Z., T.T.A., G. Zhou, C.T., C.H., P.W., M.M. and C.L. analysed polymorphic inversions. H.B., J.G., J.S., J.Z., C.W., G.G., G. Zhang, K.M., T.H., K.S., K.J.C., P.L., C.J.P., C.L., M. Schreiber, R.W. and N.S. contributed sequence data. M.J., S.P., C.L. and M.M. wrote the paper with input from all co-authors.

Corresponding authors

Correspondence to Chengdao Li, Martin Mascher or Nils Stein.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Victor Albert, Scott Jackson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Pan-genome selection in the global barley diversity space.

PCA with genotyping-by-sequencing data of 19,778 varieties of domesticated barley sampled from the gene bank of the IPK⁹. The first six principal components are shown. Samples are coloured to highlight the pan-genome selection (first row), or according to geographic origin (second row), row type (third row) or annual growth habit (fourth row). The proportion of variance explained by the principal components is indicated in the axis labels of the first row. The map was created with the R package mapdata⁵⁴.

Extended Data Fig. 2 Comparison between long-read and short-read assemblies of the Morex cultivar.

a, Co-linearity between Morex V2 (short-read) assembly and the Morex PacBio CLR assembly at the pseudomolecule level. b, Summary statistics of the Morex PacBio CLR assembly and Morex V2 assembly. c, Alignment of NUDUM locus (16 kb) between Morex PacBio CLR and Morex V2. d, Structural variants between Morex V2 and Morex PacBio CLR assemblies as detected and classified by Assemblytics. e, PAVs between Barke and the Morex V2 and Morex CLR assemblies.

Extended Data Fig. 3 Assessment of contiguity and completeness in 20 genome assemblies.

a, Whole-genome alignments of assemblies of 19 diverse barley accessions to the Morex V2 reference assembly. b, Alignment summary of full-length coding sequences (32,878) from the Morex V2 annotation and full-length cDNAs (28,622) in each assembly. Alignments with less than 90% query coverage and 97% (less than 90% for full-length cDNAs) identity were discarded. c, Whole-genome alignments showing examples of large chromosomal inversions identified using Hi-C data.

Extended Data Fig. 4 Pairwise shared syntenic full-length LTR locations.

The wild variety B1K-04-12 is set apart as an outgroup, as it shares only 19–26% of its still-intact full-length LTR positions with the other landraces and cultivars. The highest similarity is found between the Barke and RGT Planet cultivars (67% shared full-length LTRs).

Extended Data Fig. 5 Gene projection and transposable element annotation.

a, Schematic of the gene projection workflow. TE, transposable element. b, Pipeline for annotation and removing transposable elements. c, Steps to identify tandemly arrayed gene (TAG) clusters in each assembly. d, Summary of gene projections and transposable element annotation in 20 accessions. e, Comparison between de novo annotations and gene projections for three genotypes. Reported counts refer to non-transposable-element genes.

Extended Data Fig. 6 Summary of PAVs detected in pan-genome assemblies.

a, Size distribution of PAVs. b, Number of PAVs between 20 genome assemblies. c, Distribution of PAVs along the barley genome. d, Co-linearity between physical position of PAVs detected between the Morex and Barke cultivars, and mapped genetically in the POPSEQ population.

Extended Data Fig. 7 Analysis of the single-copy pan-genome.

a, Pipeline used to select single-copy k-mers in PAVs as markers for genome-wide association scan analysis. b, Summary of single-copy sequence in 20 genome assemblies and results of their clustering. c, Copy number of single-copy sequences in a diversity panel comprising 200 domesticated and 100 wild accessions. Frequency ranges from blue (low) to red (high). d–g, Comparison of PCA on the basis of PAV and SNP variants in whole-genome shotgun data of 200 diverse accessions (d, e) and 19,778 varieties of domesticated barley⁹ (f, g). Top panels show PCA results from 160,716 PAVs; bottom panels show PCA results from 779,503 genotyping-by-sequencing SNPs. The accessions are coloured according to geographical origin and row type (using the colour code defined in Extended Data Fig. 1).

Extended Data Fig. 8 PAV-based genome-wide association scans using whole-genome shotgun and genotyping-by-sequencing data.

a, Manhattan plots of PAV-based genome-wide association scans for morphological traits, including adherence of grain hull, row type, length of rachilla hairs and awn roughness, using whole-genome shotgun data from 200 diverse varieties of domesticated barley. b, PAV-based genome-wide association scan results for these traits using genotyping-by-sequencing data from 1,000 diverse varieties of domesticated barley collected from the gene bank of the IPK⁹. The 200 varieties of barley used for whole-genome shotgun sequencing are a subset of the 1,000 genotyping-by-sequencing genotypes.

Extended Data Fig. 9 Characterization of large inversions in barley.

a, Inversion size distribution. b, Recombination in inverted regions. Recombination rate was determined in the Morex × Barke RIL population¹⁹ (n = 90 genotypes). c, Number of inversions present as singletons or shared between two or more accessions on each chromosome.

Extended Data Table 1 Summary statistics of 20 pan-genome assemblies and annotation

Supplementary information

Supplementary Figure

Supplementary Figure 1 | PCR-based genotyping of the 7H inversion. This is the original gel image from which the blue sections were cropped and used for Fig. 3. Morex and RGT Planet were used as controls. All Valticky lines carry the Morex allele of the 7H inversion. Two Diamant lines (HOR 14972, HOR 4092) carry the RGT Planet allele; one Diamant line (HOR 2073) carries the Morex allele. In another two Diamant lines, neither the RGT Planet allele nor the Morex allele was amplified. One cropped section in Fig. 3 does not contain a molecular weight marker, but the original image makes clear that all correspond to the correct fragment sizes.

Reporting Summary

Supplementary Figure 2 | In-depth analysis of two inversions on 2H and 7H. (a) Schematic illustration showing the precise positions of breakpoints for the frequent 7H inversion between Morex and RGT Planet. (b) PCR assay for genotyping the 7H inversion. The locations of the three PCR primers are shown in (a) with yellow marks (not drawn to scale). (c) PCR assay for genotyping the 2H inversion. Primer locations are shown in Fig. 4c. (d) Hi-C contact probability matrix of RGT Planet computed for chromosome 7H. The intensity of pixels represents the normalized Hi-C links counted between 1 Mb windows on chromosome 7H. The frequent 7H inversion was spotted as a pattern of higher-than-expected interaction frequency against the Morex V2 reference genome, marked by blue lines. (e) QTL results for grain yield, plant height and different growth stages from multiple sites in the RGT Planet × Hindmarsh population.

Supplementary Figure 3 | PCR-based genotyping of the 7H inversion in the pedigree of RGT Planet. Yellow denotes carriers of the RGT Planet allele. Cultivars in blue are non-carriers. Cultivars in red have unknown status, as no fragment was amplified. Cultivars shown in white boxes were not assayed because seeds or DNA were unavailable. Pedigree data were retrieved from the Barley Pedigree Catalogue (http://genbank.vurv.cz/barley/pedigree/).

Supplementary Table

Supplementary Table 1. Summary statistics of repetitive elements in twenty barley genomes.

Supplementary Table 2. Summary of RNA-seq data used for gene annotation.

Supplementary Table 3. Pfam domains most frequently observed in PAV genes.

Supplementary Table 4. Summary of whole-genome shotgun (WGS) sequencing for 200 domesticated and 100 wild accessions.

Supplementary Table 5. Summary of Hi-C data used in this study.

Supplementary Table 6. Inversions detected by Hi-C.

Supplementary Table 7. Genetic map of the RGT Planet and Hindmarsh (RxH) population (sheet 1) and the Morex and Barke (MxB) population (sheet 2).

Supplementary Table 8. Screening of the 7H inversion in a large collection of modern varieties and breeding lines (sheet 1) and the 7H inversion in the lines in the pedigree of RGT Planet (sheet 2) and PCR validation of the 2H inversion identified by PCA (Fig. 4b) (sheet 3).

Supplementary Table 9. Accession IDs of DArTseq reads of RxH recombinants.

Supplementary Table 10. Summary of raw sequencing data generated for pan-genome assemblies.

Supplementary Table 11. PacBio library statistics for the libraries included in the Morex genome assembly and their respective assembled sequence coverage levels.

Supplementary Table 12. Genomic libraries included in the Morex CLR assembly and their respective assembled sequence coverage levels in the final release.

Supplementary Table 13. Summary statistics of the initial output of the Quiver-polished MECAT assembly. The table shows total contigs and total assembled base pairs for each set of scaffolds greater than the size listed in the left-hand column.

Supplementary Table 14. Accession IDs for 20 pan-genome assemblies and Morex PacBio CLR assembly.

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Jayakodi, M., Padmarasu, S., Haberer, G. et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588, 284–289 (2020). https://doi.org/10.1038/s41586-020-2947-8

Received: 03 April 2020
Accepted: 09 September 2020
Published: 25 November 2020
Issue Date: 10 December 2020
DOI: https://doi.org/10.1038/s41586-020-2947-8
