The mosaic oat genome gives insights into a uniquely healthy cereal crop

Cultivated oat (Avena sativa L.) is an allohexaploid (AACCDD, 2n = 6x = 42) thought to have been domesticated more than 3,000 years ago while growing as a weed in wheat, emmer and barley fields in Anatolia1,2. Oat has a low carbon footprint, substantial health benefits and the potential to replace animal-based food products. However, the lack of a fully annotated reference genome has hampered efforts to deconvolute its complex evolutionary history and functional gene dynamics. Here we present a high-quality reference genome of A. sativa and close relatives of its diploid (Avena longiglumis, AA, 2n = 14) and tetraploid (Avena insularis, CCDD, 2n = 4x = 28) progenitors. We reveal the mosaic structure of the oat genome, trace large-scale genomic reorganizations in the polyploidization history of oat and illustrate a breeding barrier associated with the genome architecture of oat. We showcase detailed analyses of gene families implicated in human health and nutrition, which adds to the evidence supporting oat safety in gluten-free diets, and we perform mapping-by-sequencing of an agronomic trait related to water-use efficiency. This resource for the Avena genus will help to leverage knowledge from other cereal genomes, improve understanding of basic oat biology and accelerate genomics-assisted breeding and reanalysis of quantitative trait studies.

) differed significantly between genotypes in sheath. Two-sided Welch t-tests were used to determine if metabolite levels were equal in glossy and glaucous samples with p-values adjusted using Benjamini-Hochberg procedure (glaucous flag leaf n=4; glossy flag leaf n=3; sheath n=3; glume n=2).

Supplementary Figure 26.
Heatmap of detected metabolite levels across tissues and genotypes. Detected levels are scaled and centred feature-wise. Out of the 76 features detected, none differed between genotypes for flag leaves, hentriacontane-14,16-dione and two unknown features differed between genotypes for sheath (two-sided Welch's t-test, p-value adjusted using Benjamini-Hochberg, adjusted p <0.05; glaucous flag leaf n=4; glossy flag leaf n=3; sheath n=3; glume n=2). As only two replicates were used for glume, no statistical analysis was done, but a reduction in hentriacontane-14,16-dione in glossy relative to glaucous glume is shown in Fig. 4g. Hentriacontane-14,16-dione is a beta-diketone, which can self-assemble into tubule structures at the plant cuticle. These tubules are absent in glossy.1 cuticles ( Supplementary Fig. 23). These results indicate that the bright green visual appearance of the glossy.1 mutant panicle and sheath may be caused by a lack of tubule wax crystals primarily composed of hentriacontane-14,16-dione.  Fig. 30). The Myb factor, Wax ester synthase/diacylglycerol acyltransferase 1 (WSD1), and the short-chain dehydrogenase/reductase (SDR) are not found in the barley cluster but are present close to the cer-cq orthologs in several of the Avenas (Supplementary Figs. 31-33). The orientation of the Myb factor and the lipase on Sang chromosome 1C is inferred based on the orientation in OT3098. Gene lengths are scaled relative to other genes within the same chromosome, and include introns. The cluster on ChrUn in OT3098 has been flipped to align with the cluster on 3A in Sang. Relative to barley, the cluster in wheat is known to have undergone duplications 48 -in oat we see both reordering of genes relative to barley, but also that the clusters are not only located on chromosome 2. The clusters on 1C and 2C are present in both A. sativa and in the CD tetraploid A. insularis. The cluster on 1C is located in a region identified as translocated from the D subgenome, in between two blocks syntenic to chromosomes 3A and 3C. The syntenic block closest to the 2C cluster is syntenic to chromosomes 2A and 2D. The 3A cluster is located in a block syntenic with 1C and 3C. These syntenic blocks are shown in Extended Data Fig. 3. The A genome diploid A. atlantica has genes homologous to the cluster genes on both chromosomes 3 and 4. On chromosome 7 the C genome diploid A. eriantha has a lipase/carboxyl transferase and two cytochrome P450s homologous to cluster genes.  Supplementary Tables 1, 4-6, 8-10, 21   Supplementary Tables 2, 3, 7, 11-20, 22 are provided in a separate excel document.     Supplementary Table 10. Multiplicity of oat gene families. Number of orthologous gene families and genes as a function of their multiplicity (first column) on the A-, C-, and Dsubgenomes (0: absent, 1: exactly one member and N: more than one member per subgenome). Multiplicities are not ordered for the three subgenomes. Columns 2 to 6 list the number of gene families and the number of genes mapping to the A-, C-, and D-subgenomes. The proportion of tandems in each group is shown in the seventh column, with groups significantly enriched for tandemly repeated genes in red. The last column counts the number of genomic bins (bin size 100 genes) with significant spatial clustering. 'total' summarises totals for the subgenome assignments of genes and gene families in extant oat (i.e., after occurrence of translocations), while 'ancestral' numbers are based on phylogenetic origin and ancestral subgenomic states (i.e., before occurrences of translocations in the tetraploid and hexaploid ancestors). (*) Note that gene families are based on orthologous inference while tandem genes are solely defined by their genomic neighbourhood and sequence similarity and hence also include distantly related paralogous copies. Further details are provided in Supplementary Methods. Ultra-high molecular weight (uHMW) DNA was prepared using isolated nuclei from 5 g of etiolated leaves and following the standard CTAB method, except that cut tips were used to manipulate the sample and extreme care was taken to avoid mechanical agitation and freezethaw cycles. The size of the isolated DNA fragments was above 300kb as verified by pulsedfield gel electrophoresis ( Supplementary Fig. 35b). The uHMW DNA fragments were used to construct two 10X Genomics Chromium genome libraries (10X Genomics, Pleasanton, CA). Clustering was done by 'cBot,' and samples were sequenced on HiSeqX (HiSeq Control Software HD 3.4.0.38/RTA 2.7.7) with a 2x151 setup using 'HiSeq X SBS' chemistry.

Supplementary
As detailed in Supplementary Table 21, the shotgun and MP libraries were used to generate 2,414Gb of sequencing data, equivalent to 226× genome depth (coverage), based on the assembled genome size of ~11Gb. The 10X Chromium library sequencing produced an additional 586Gb of sequencing data and 47× coverage, yielding a total coverage of 273×.

A. sativa genome assembly
The sequencing data have been processed and assembled using the DeNovoMAGIC™ assembler application version 3.0 (NRGene, Nes Ziona, Israel) and the assembly statistics are summarised in Supplementary Table 1. DeNovoMAGIC™ is a DeBruijn-graph-based assembler designed to extract the underlying information in the raw Illumina reads to solve the complexity of the DeBruijn graph due to genome polyploidy, heterozygosity, and repetitiveness. This is accomplished using accurate-reads-based traversing of the graph that iteratively connects consecutive phased contigs across local repeats to generate long phased scaffolds 58 . The 10X Chromium data were utilised to phase polyploidy/heterozygosity, support scaffold validation and further elongate the phased scaffolds.
In brief, the DeNovoMAGIC™ algorithm is composed of the following steps: 1. Reads pre-processing. PCR duplicates, Illumina adaptor AGATCGGAAGAGC and Nextera linkers (for MP libraries) were removed from the data. The 2x250bp overlapping reads from the 450-470 bp shotgun libraries were merged with a minimal required overlap of 10 bp to create the paired-end 'stitched' reads (PE reads). 2. Error correction. Following pre-processing, PE reads were scanned to detect and filter reads with a putative sequencing error (containing a subsequence that does not reappear several times in other reads). 3. Contigs assembly. The first step of the assembly involves the building of a De Bruijn graph (kmer=127 bp) of contigs from all the PE (including pre-processed 450-700 bp shotgun libraries) and MP reads. Next, PE reads were used to find reliable paths in the graph between contigs for repeat resolving and contigs extension. The 10X Chromium barcoded reads were mapped to contigs to ensure that adjacent contigs were connected only in cases in which those contigs originate from a single stretch of genomic sequence (reads from the same two or more barcodes were mapped to both contigs). 4. Scaffold assembly. Contigs were linked into scaffolds with PE and MP information, estimating gaps between the contigs according to the distance of PE and MP links. In addition, 10X Chromium data were used to validate and support correct phasing during scaffolding. 5. Gap Filling. A final gap filling step used PE and MP links and De Bruijn graph information to detect a unique path connecting the gap edges. 6. Scaffold elongation and refinement. 10X Chromium barcoded reads were mapped to the assembled scaffolds and clusters of reads with the same barcode mapped to adjacent contigs in the scaffolds were identified to be part of a single long molecule. Next, each scaffold was scanned with a 20 kb window to ensure that the number of distinct clusters that cover the entire window (indicating a support for this 20 kb connection by several long molecules) was statistically significant with respect to the number of clusters that span the left and the right edge of the window. In cases where a potential scaffold assembly error was detected the scaffold was broken at the two edges of the suspect 20 kb window. Finally, barcodes that were mapped to scaffold edges were compared (first and last 20kb sequences) to generate a scaffold graph with a link connecting two scaffolds with more than two common barcodes. Linear scaffold paths in the scaffolds graph were composed into the final scaffold output of the assembly.

Chromosome conformation capture sequencing
Chromosome conformation capture sequencing (Hi-C) data was generated using the protocol of Padmarasu et al. (2019) 59 .

Pseudomolecule construction for A. sativa cv. Sang
Pseudomolecule construction for A. sativa cv. Sang was done with the TRITEX pipeline protocol 8 . The NRGene assembly of A. sativa cv. Sang was digested in silico with DpnII using EMBOSS restrict 60 . Hi-C and 10X linked-read data were aligned to the NRGene assembly with Minimap2 61 . Alignment records were converted to binary Sequence Alignment/Map format using SAMtools 62 and sorted with Novosort (http://www.novocraft.com/products/novosort/). A list of Hi-C links was extracted from Hi-C alignments using BEDTools 63 and TRITEX scripts. Linked-read alignment was aggregated into molecules using BEDTools 63 and TRITEX scripts (https://bitbucket.org/tritexassembly/tritexassembly.bitbucket.io). Hi-C links, 10X molecules, and guide map alignments were imported to the R statistical environment 64 and analyzed further with TRITEX scripts. Breakpoints in chimeric scaffolds joining unlinked sequences were detected as drops in physical coverage with Hi-C links and/or 10X molecules. An initial Hi-C map was generated using the minimum spanning tree algorithm described by Beier et al. (2017) 65 . The assembly and Hi-C map were iteratively corrected by inspecting Hi-C contact matrices, guide map alignments and physical coverage with 10X and Hi-C reads. Sequence files in FASTA format and AGP tables for pseudomolecules were compiled using TRITEX scripts. The pseudomolecules of cv. Sang were aligned against the pseudomolecules constructed from a long-read sequence assembly of cv. OT3098. The OT3098 pseudomolecules were downloaded from GrainGenes 54 .

Genome assembly of Avena insularis Durieu and Avena longiglumis Ladizinsky sp. nov Plant Material and DNA extraction
For whole-genome assembly, single plants from A. insularis (BYU209; Ladizinsky (1998); Sicily, Italy 12 ) and A. longiglumis (CN58138; Oran, Algeria) were grown hydroponically in an isolated growth chamber under a 12-h photoperiod. Growing temperatures ranged from 18°C (night) to 20°C (day). The hydroponic growth solution was made using MaxiBloom® Hydroponics Plant Food (General Hydroponics, Sevastopol, CA, USA) at a concentration of 1.7 g/L. BYU209 and CN58138 are publicly available and maintained in germplasm collections at Brigham Young University (Provo, Utah, USA) and the National Plant Genebank (PGRC, Saskatoon, Saskatchewan, CA), respectively.

DNA sequencing, primary assembly and polishing
In preparation for PacBio CLR sequencing, high molecular weight DNA was extracted from 72-h dark-treated leaf samples using a CTAB-Qiagen Genomic-tip protocol 66 . For wholegenome sequencing, large-insert SMRTBell libraries (> 20 kb), selected using a SageElf (Sage Science, Inc., Beverly, MA, USA), were prepared according to standard manufacture protocols and sequenced at the BYU DNA Sequencing Center (Provo, UT, USA) using P6-C4 chemistry on a Sequel II instrument (Pacific BioSciences, Menlo Park, CA, USA). For whole genome polishing, DNA for each species was sent to Genome Quebec (Montreal, Quebec, CA) for NovaSeq (2X 150-bp paired-end) sequencing from standard 500-bp insert libraries. Trimmomatic v0.35 67 was used to remove adapter sequences, leading and trailing bases with a quality score below 20 or average per-base quality of 20 over a 4-nucleotide sliding window. After trimming, any reads shorter than 75 nucleotides in length were removed. A primary contig assembly for both A. insularis and A. longiglumis (Supplementary Table 18) was constructed with using Canu v1.9 68 with default parameters with the following flags: corOutCoverage = 40, correctedErrorRate=0.035, utgOvlErrorRate=0.065, trimReadsCoverage=2, and trimReadsOverlap=500. The primary assembly was polished twice using Arrow from the GenomicConsensus package in the Pacific BioSciences SMRT portal v5.1.0 followed by a single round of insertion/deletion correction detected by the Illumina reads using PILON v0.22 69 .

Chromosome conformation capture sequencing
This sequencing was performed as for cv. Sang.

Pseudomolecule construction for A. longiglumis CN58138 and A. insularis BYU209
Pseudomolecule construction for A. longiglumis CN58138 and A. insularis BYU209 was done as described above for A. sativa cv. Sang with some modifications. The order of single-copy sequence (≥ 1 kb in length) in the pseudomolecule of A. sativa cv. Sang was used as a guide map. Single-copy sequences were extracted using BBDuk (https://jgi.doe.gov/data-andtools/software-tools/bbtools/bb-tools-user-guide/) and TRITEX scripts (https://bitbucket.org/tritexassembly/tritexassembly.bitbucket.io/src/master/miscellaneous/ma sk_assembly.zsh), and aligned to the Canu assemblies of both species using Minimap2 61 . Hi-C data were used for chimera detection and pseudomolecule construction. After iterative correction and refinement of the Hi-C map based on inspection of contact matrices, pseudomolecules were compiled using TRITEX scripts.

Whole Genome Alignments (WGAs) between the Avena reference assemblies
A whole genome alignment (WGA) between A. sativa, A. insularis, A. longiglumis, A. eriantha and A. atlantica was generated using the cactus pipeline version 1.0 70 . Prior to the alignment step, all nucleotide sequences were 20-kmer-softmasked to reduce complexity and facilitate construction of the WGA using the tallymer subtools from the genome tools package version 1.6.1 71 . The pipeline was run stepwise with default settings described at https://github.com/ComparativeGenomicsToolkit/cactus#running-step-by-step. Dot plots of pairwise alignments were made using the supplied scripts by the cactus package.

Synteny framework and genome assembly comparison between A. sativa cv. Sang and OT3098
Synteny between the two oat genome assemblies was established using an all-against-all blastp matrix restricted to the top ten hits for each query and McScanX with default parameters on the 21 chromosomes. Likewise, the full protein similarity matrix with a minimal expectation value (e <= 10 -10 ) was surveyed to identify reciprocal or bidirectional best blast hits (BBH). Multiple collinear assignments -which are common for a hexaploid, were resolved selecting the highest scoring alignment. Only syntenic segments with a minimum number of 10 gene pairs were reported. Results were visualised and analysed with two different dot plot schemes, one using the actual physical gene position as axis locations (method dotA), and a second enumerating gene order along both axes (method dotB). Computed borders between collinear segments including inversions and low gene density regions were manually evaluated and -if necessary, refined in interactive dot plots. Using the resulting gene-based, aligned coordinate system of the collinear parallel and anti-parallel segments, inversions between the OT3098 and Sang assemblies were identified from the endpoints of each segment. All reported inversions were additionally inspected and confirmed. To identify regions of unbalanced/reduced gene densities, we computed for each segment with sliding overlapping windows (window size of 20 and window steps of 10 syntenic genes) their slopes using the dotB coordinate system. Elevated slopes approximated very well the boundaries of these regions and were subsequently manually refined to identify inflection points of the curve. We restricted our report to regions with increased slope values containing at least 50 OT3098 genes (syntenic and non-syntenic). The results of these analyses are available as Supplementary Tables 2 and 7. We further associated the results from the unbalanced/reduced gene density and inversions/translocations analyses (in the form of locations with coordinates and unique identifiers per region/inversion) with the orthologous gene framework between the Sang and the OT3098 assemblies (described above). The resulting association table is provided with the Sang gene annotation.

Hexaploid oat transcriptome atlas Plant Material and Tissue
An overview of the tissue samples and developmental stages used in this work, including the Zadoks decimal stage estimate, is provided in Supplementary Table 20. Oat plants were grown in pots under long-day conditions (18 h day) in a climate chamber with day/night temperature of 20/18°C and light photo flux density of 240 µmol/m 2 /s. Tissue was collected from developing seed, leaf, root, crown and stem. Samples from seed development included three time points collected during the day or night: Early (4 dpa = seed green, densely covered with trichomes, not fully elongated, endosperm not filling); Mid (14 dpa = seed green, trichomes dense at top only, fully elongated, milky white endosperm); Late (20 dpa = seed mostly pale yellow, fully elongated, trichomes dense at top only, soft dough endosperm. Samples were collected during the same time of day and were frozen in liquid nitrogen immediately upon collection. Total RNA isolation was performed using the Qiagen RNeasy plant Mini kit following the manufacturer's instructions. A specification of each sample used in annotation, network analysis and single-gene mapping, and the additional tissue samples used in this work from publicly available data are outlined in Supplementary Table 20.

Library preparation and Sequencing
The methods used for developing seed samples are detailed here (for details pertaining to all samples see Supplementary Table 20). RNA concentration was measured with a Qubit 3.0 Fluorometer (Thermo Fisher Scientific) and RNA quality was evaluated using the Caliper LabChip GX Nucleic Acid Analyzer (PerkinElmer). The Illumina protocol (TruSeq Stranded mRNA Reference Guide, 1000000040498 v00) was adapted to be run on an Agilent BRAVO robot. In brief, mRNA was isolated from total RNA using poly dT-coated beads and then sheared to 150-400 bp fragments through chemical fragmentation. The mRNA was converted into cDNA using reverse transcriptase, and a clean-up step using AMPure XP beads was used to remove enzymes and short fragments. Next the fragments were adenylated (a single adenosine is added to the 3' end of the blunt fragments which prevents self-ligation during the adapter ligation step) followed by ligation of adapters and a clean-up step (AMPure XP beads) to remove the enzymes and unligated adapters. This was followed by PCR amplification and a final PCR clean-up step (AMPure XP beads). Libraries were sequenced on a Illumina HiSeq 2500 (HiSeq Control Software 2.2.58/RTA 1.18.64) with a 2x126 setup using 'HiSeq SBS Kit v4' chemistry. The Bcl to FastQ conversion was performed using bcl2fastq from the CASAVA software suite. The quality scale used was Sanger / phred33 / Illumina 1.8+.
In the evidence-based step, multiple RNA-seq datasets (Supplementary Table 19, datasets: a,b,c,k-r) were used for the genome-guided prediction of gene structures. RNA-seq data were mapped to the genome reference with HISAT2 73 (version 2.1.0, parameter -dta) and assembled to transcripts with Stringtie 74 (version 1.2.3, parameters -m 150 -t -f 0.3). Transdecoder (version 3.0.0) (https://github.com/TransDecoder/TransDecoder) was used to identify potential open reading frames and predict protein sequences. The predicted protein sequences were compared against a protein reference database (UniProt Magnoliophyta, reviewed/Swiss-Prot) using BLASTP 75 (-max_target_seqs 1 -evalue 1e-05) and hmmscan version 3.1b2 76 was used to identify conserved protein family domains for all proteins. BLAST and hmmscan results were then processed by Transdecoder-predict and the best translation per transcript sequence was selected. Finally, results from the two gene prediction approaches were combined and redundant protein sequences were removed. An intermediate step of confidence classification was performed next. Protein sequences with an open reading frame were extracted from the annotation using gffread 77 (v0.11.6) and Biopython 78 . Diamond 79 (v0.9.29.130, e-value 1) was used to compare these to AnnoDB, and a database containing validated magnoliophyta proteins (UniMag) downloaded from Uniprot on 2017-02-20, and PTREP (release 16 80 ), a database that contains hypothetical proteins with deduced amino acid sequences where in many cases, frameshifts have been omitted, and is useful for the identification of divergent TEs without significant similarity at the DNA sequence level. The two former databases were filtered for sequences containing both start and stop codons. Hits with a subject and query coverage above 80% were selected and protein sequences were further classified into high and low confidence. High confidence (HC) hits were complete and had a subject-and query-coverage above the threshold in the UniMag database or no blast hit in UniMag but in AnnoDB and not in PTREP.
A low confidence (LC) hit on the other hand is not complete and has a hit in the UniMag or AnnoDB database but not in PTREP, or no hit in UniMag and AnnoDB and PTREP but the protein sequence is incomplete. This pipeline is available at https://github.com/PGSB-HMGU/plant.annot.git.
A first AUGUSTUS 88 (v3.3.1, options --UTR=on --alternatives-from-evidence=true --allow_hinted_splicesites=atac) prediction was done using the wheat parameters. To avoid potential over-prediction, guiding hints were generated using transcriptome data, predicted protein sequences of HC genes, and TE-hints. Retraining AUGUSTUS was done using all 100% supported, non-redundant transcripts that did not match the TEs in the Hypothetical TREP protein sequences database PTREP 80 , release 19, using the flanking region computed by the script provided as a part of AUGUSTUS and the standard initialization, with 1000 sequences used for evaluation during training and 500 sequences used for final performance testing, following the steps provided by Hoff and Stanke (2019) 88 . AUGUSTUS (v3.3.3, options --UTR=off --alternatives-from-evidence=true --allow_hinted_splicesites=atac) was used to predict genes using the new oat parameters.
EVidenceModeler 89 (EVM, git commit 73350ce, partition options --segmentSize 900000 --overlapSize 50000) was used to merge predictions and alignments from the non-redundant annotation based on the homology information and the Stringtie transcripts, PASA pipeline, and AUGUSTUS. A constraint on the output path in the EVM script was removed to allow for processing multiple weight settings in parallel. All inputs were provided as gene predictions to EVM, and all inputs except for AUGUSTUS were also provided as protein alignment evidence to EVM. Evaluation of the different weight settings was performed using BUSCO 90 , v3.0.2, liliopsida_odb10 created on 2017-12-01, and embryophyta_odb9 created on 2017-02-13, protein mode). BUSCO statistics served as the main guide for choosing which EVM weights to use. PASApipeline 83 v2.4.1 was used to update the annotation resulting from EVM to add UTRs. This was done by loading the EVM annotation into a PASA pipeline database, and running two rounds of updates.
Confidence classification was done on the final PASA pipeline output as described above. Several coverage thresholds were evaluated based on BUSCO results (v4.0.4, liliopsida_odb10 created on 2019-11-20, protein mode) and blastn 75 (v2.9.0+, options -max_target_seqs 1evalue 1e-10) hits against Iso-Seq transcripts, with a threshold of 65% set for both query and subject coverage across all three databases. Predicted proteins were classified as described previously 7 .
Representative transcripts for genes with several isoforms were selected by performing a blast all vs. all search with blastp 75 against AnnoDB. The best transcript was chosen based on coverage and identity. If no hit was found against the database the longest transcript was selected.
Additionally, a few genes, including CslF6 and several prolamins and avenins, were curated manually, with most being integrated into version 1.1 of the gene annotation. Manual curation was based on identifying previously published sequences of the genes of interest, and building multiple sequence alignments (MSAs) of these and our candidate genes. The MSAs were reviewed together with transcript evidence to improve the gene annotations.

Annotation of the A. insularis and A. longiglumis genome sequences
Structural gene annotations for A. insularis and A. longiglumis were done using a lift-over approach. The method is described in detail in Hoff and Stanke (2019) 92 . Briefly, genes from each individual A. sativa subgenome and the previously published A. eriantha and A. atlantica genomes 14 were lifted to the respective progenitor genome sequences as follows: • A. sativa D gene models were lifted to A. insularis D, A. sativa C and A. eriantha gene models were lifted to A. insularis C and A. sativa A and A. atlantica gene models were lifted to A. longiglumis. • To minimise annotation artefacts and archive maximum uniformity between all genome annotations, de novo gene models from A. sativa A and C were additionally lifted to A. atlantica and A. eriantha, respectively. • Lifted annotations were merged if necessary and all annotations were subjected to the confidence classification pipeline detailed in the A. sativa de novo annotation section.

Repeat annotation Identification and clustering of tandem repeats
Tandem repeats consist of monomer sequence units arranged directly next to each other. For a simple and pragmatic classification they are commonly split into three main categories by their unit length: 1-9 bp minisatellites, 10-99 bp microsatellites and ≥ 100 bp satellites. In the oat assemblies tandem repeats were annotated with TandemRepeatsFinder v 4.07b under default parameters (Match Mismatch Delta PM PI Minscore MaxPeriod: 2 7 7 80 10 50 500) 93 .
Overlapping annotations were removed with a priority based approach assigning higher scoring and longer elements first. Elements which overlapped already assigned elements were either discarded (> 90% overlap) or shortened (≤ 90% overlap) if their rest length exceeded 49 bp.
To obtain a collection of non-redundant tandem repeat sequences the consensus sequences of the tandem repeat units (output of TandemRepeatsFinder) were clustered with vmatch dbcluster under stringent conditions into subfamilies (≥ 98% identity and a mutual overlap ≥98%, command line: vmatch dbcluster 98 98 -v -identity 98 -exdrop 3 -seedlength 20 -d -p) (http://www.vmatch.de). For each of the 300 largest tandem repeat clusters the chromosomal location of their members was plotted along the chromosomes. Only 10% of the tandem repeat subfamilies were more or less evenly distributed across all subgenomes, 52 % showed a strong enrichment for the A and D subgenome , 36% were C, 2% A and 1 % CD enriched. Some of the enrichments showed an almost 100% specificity for the ancestral subgenomes, examples are given in Extended Data Fig. 1c. Centromere located and mostly subgenome specific tandem repeat subfamilies could also be identified (11 subgenome specific centromere clusters for the D subgenome, four for C and one unspecific occurring on A, D and C). A comparison of the tandem repeat composition between the short read Sang assembly and the long read OT3098 assembly is given in Supplementary Table 6.

Transposon annotation by homology to a TE-library
Transposons were detected and classified in the hexaploid oat cv Sang and var. OT3098 assemblies by a homology search against the REdat_9.9_Poaceae section of the PGSB transposon library 94 . The TE-library had beforehand been filled up with 744 de novo detected high quality full length LTR-retrotransposon template sequences which occurred at least two times as complete element in the assembly as determined by the pragmatic "80/80 clustering" (≥ 80% identity, ≥ 80% mutual sequence coverage).
The program vmatch (http://www.vmatch.de) was used for the homology search as a fast and efficient matching tool that is well suited for large and highly repetitive genomes. Vmatch was run with the following parameters: identity ≥ 70%, minimal hit length 75 bp, seedlength 12 bp (exact command-line: -d -p -l 75 -identity 70 -seedlength 12 -exdrop 5). To remove overlapping annotations the vmatch output was filtered for redundant hits via a priority based approach.
Higher scoring matches were assigned first and lower scoring hits at overlapping positions were either shortened or removed if they were contained to ≥ 90% in the overlap or if <50 bp of rest length remained. The resulting transposon annotation is overlap free, although disrupted elements from nested insertions have not been de-fragmented into one element. With the described approach the TE content presently annotated by homology amounts to 63.8% divided into 61.7% retrotransposons and 1.8% (still underrepresented) DNA-transposons in both assemblies.

Genome structure, rearrangements and translocations Chromosome nomenclature
We assigned a novel nomenclature to the oat chromosomes by inspecting the gene-based collinearity to barley and orientations of core region gene sets.

Aligning the 6k markers
Manifest sequences of the markers 51 were aligned to the pseudomolecules using gmap 84 v2020.06.01. Only markers aligning perfectly were kept. In cases where a marker was aligning in multiple locations, but not to chrUn, and only once to the chromosome corresponding to the merge group given in the consensus map 10 , this location was used. Markers not aligning perfectly, or aligning in multiple locations but in a way that could not be resolved by the consensus map, were discarded. This analysis was done using an R script (v4.1.0, tidyverse v1.3.1 , ggplot2 95 v3.3.5).

Recombination frequency (r)
Marker data were analysed from 5 bi-parental populations of recombinant inbred lines (RILs) which were assayed using genotyping by sequencing (GBS) following detailed methods described in a companion paper 21 . Briefly, raw GBS sequence data from FASTQ files were re-analyzed using Haplotag software 96  To compute global crossover frequency on a linear scale, a genome-wide map of crossover locations in each RIL progeny was computed. A single crossover was inferred at a random position between two markers with different homozygous states, and 0.5 crossovers were inferred if one allele state was heterozygous. Crossovers were then counted in windows of 20 Mbp (up to 10 Mbp left or right, where possible) at each 1 Mbp increment of each full pseudomolecule. The crossover rate at each 1Mbp increment was expressed as the crossover frequency per progeny per 100Mb.
Recombination matrices among different parts of the genome were computed and examined separately for two of the five RIL populations. The sorted and imputed genotype matrix was used to compute average pairwise recombination frequencies between all possible 16 Mbp windows at 1 Mbp increments. Within each pair of windows, r was computed as the arithmetic mean of the recombination rate between all possible marker pairs in the corresponding two windows. If no marker pairs were available within a given window, the nearest pair of markers to the start and end of the window were used to represent the window. Recombination matrices within and between chromosomes were visualised in two dimensions using a heat map of blended colours representing different average recombination frequencies.

Subgenome-specific kmers
To compute kmer sets with an enriched specificity for either the A-, D-or C-subgenome, we employed whole genome sequence information of the A-(A. longiglumis) and C-lineage (A. eriantha) diploids and the DC-tetraploid A. insularis. KMC tools 98 v3 was used to both determine kmer (K=27) counts and occurrences and to perform the set operations union, intersection and difference in the respective kmer sets. For each analysis, we distinguished between the source genomes/species that provide the kmer sequences and the subgenome labels and the target genome/species to which the source kmers are positioned using vmatch 99 ; www.vmatch.de). For each source kmer set S, we counted canonical kmers in the source genome and selected unique kmers after applying the set operations to query their positions in the target genome. To reconstruct the subgenomic ancestry in oat by kmers, we conducted a two-step process. First we analysed the genomic distribution of A-and C-specific kmers in A. insularis to identify potential translocations that have occurred in the tetraploid ancestor of oat. The subgenome specific kmers were detected as , where Ains and Cins are the A-and C-specific kmer sets, and Sins, Slon and Seri are the total source kmers of A. insularis, A. longiglumis and A. eriantha, respectively.
A total of 198,625,685 and 260,894,419 kmers were found for Cins and Ains, respectively. Subgenomic kmers were mapped to the tetraploid genome, binned into adjacent sets of 2,000 kmers by their genomic position, and sliding windows with an overlap of 1,000 kmers were tested for their A-or C-ancestry using a binomial test with an event probability adjusted for the relative frequencies of A-and C-specific kmers. To reduce noise within the data and to target large-scale translocation in A. insularis and A.sativa, we employed a one-dimensional Gaussian kernel filter to the likelihoods of the genomic kmer bins (Supplementary Fig. 10 and Extended Data Fig. 4a).
Based on the segmentation of the A. insularis genome described above, we identified A-, Cand D-specific kmer sets using the following set operation schema: where Ssat and Slon are the total kmers in A. sativa and A. longiglumis, Asat represents the Aspecific query kmers, and Cins and Dins are the previously identified D-and C-specific kmers in A.insularis. Analogous operations were defined for Dsat and Csat, the D-and C-specific query kmers for oat. Due to the close relationship between the A-and D-lineages, the number of A-(~202×10 6 ) and D-specific kmers (~452×10 6 ) was considerably lower than the C-specific total (~845×10 6 ). The assignment of genomic regions to one of the three subgenomes in oat was performed similar to the approach described for A. insularis, but using bins of 5,000 adjacent kmers and three tests for each possible kmer set. As expected for their higher similarity, A and D subgenomic regions were less distinguishable than comparisons with the C-subgenome. However, for all translocations reported in this study, subgenomic assignments were unambiguous.

Syntenic Blocks
Syntenic orthologous and homoeologous blocks of spatially correlated genes within all seven Avena species and subgenomes were computed from a high quality ortholog set that has been derived from parsing the gene trees of the orthofinder analysis described below. Requiring a monophyletic origin for the Avena genes and exactly one B. distachyon ortholog as outgroup, we identified 9,253 gene families with exactly one gene copy in each of the selected diploids (A. longiglumis and A. eriantha), two copies in the tetraploid A. insularis and three copies in hexaploid oat. These selected gene families were also the basis of the triad analysis in "Definition of triads" under "Gene expression analysis" in the following text. Importantly, we did not force any multiplicity on the tetra-and hexaploid families but intentionally included {0,1,2}-groups. Collinearity of orthologous genes between species is widely used as strong evidence for common ancestry. However, even high quality assemblies contain a small proportion of erroneously oriented scaffolds, and in multiple species comparisons, already one such scaffold can impair collinearity and synteny analysis. To minimise these effects, and given the expected dynamics of genomic rearrangements in the Avena clade, we defined syntenic blocks that did not require a strict collinear order but at least 20 (orthologous respectively homoeologous) genes per block congruently located in close genomic proximity in all seven species or subgenomes. First, we transformed the genomic position of each ortholog/homoeolog to an ordinal position i=1,2,…N (where N is the total number of selected genes on the respective chromosome) and normalised its index by the total N. Next, we generated for each of the 9,253 genegroups a feature vector V obeying a gene order A. eriantha, A. longiglumis, two copies of A. insularis and three copies of A. sativa. The distance between two feature vectors V i , V j is calculated by the following rules: 1. Only columns of the same species are considered.
where P i , P j indicate relative position of gene pair i,j. 4. Out of the four and nine possible comparisons for the tetra-and hexaploid case, the two and three lowest distances are selected 5. The total distance is the sum of all seven pairwise distances obtained from (i) to (iv).
The resulting distance matrix of 9,253×9,253 pairwise distances was transformed into an undirected graph with a genegroup as node and edges between nodes with a maximal distance of 0.2. Lastly, a community clustering using the Louvain algorithm (https://github.com/taynaud/python-louvain) was performed and clusters with less than 20 genes per diploid species were discarded. In total, 357 blocks, 51 per species or subgenomewere detected with a mean size of 65 Mb and a mean gene number of 174 genes ( Supplementary Fig. 36). Out of 9,253 gene families, 8,855 (95.7%) were assigned to one of the syntenic blocks.

Gene family analysis Orthofinder
We identified gene families in oat by a genome-wide phylogenetic comparison of the A. sativa proteome to a representative set of 10 Poaceae species with an emphasis on the Triticeae clade. This set comprised Z. mays (B73v5), O. sativa (IRGSP-1.0), B. distachyon (v3.2), H. vulgare (Morex v3), S. cereale (IRGSC Lo7 v3), hexaploid bread wheat (T. aestivum; IWGSC v1.1), tetraploid A. insularis, and three diploid species, A. longiglumis and A. atlantica as A-lineage and A.eriantha as C-lineage representatives. For the Avena species, both low-and highconfidence genes were included in the phylogenies. For the others with multiple splice variants either available representative transcripts were used or the longest transcript was selected. Using Arabidopsis thaliana (Araport 11) as an additional outgroup, we employed Orthofinder 100 v2.4 to extract orthologs and co-orthologs between the Poaceae species and grouped them into gene families.

Definition of gene multiplicity and tandem genes
From the Orthofinder results described above, only families containing at least one A. sativa gene were retained. Next, we filtered out all singleton LC genes and groups consisting entirely of LC genes and for which at least one gene had a transposon signature. Lastly, HC genes assigned as singletons in the orthofinder run were added to derive the final reported number of 31,219 oat gene families comprising a total of 107,495 LC and HC genes. We denote that these families might contain a small number of closely related out-paralogs that were not sufficiently resolved by the phylogenetic analysis. However, given the distribution of cluster sizes (Supplementary Figure 11) and the vast majority (>93%) of highly confined families with ≤ 6 genes, putative out-paralogous relationships play only a very minor role.
To analyze subgenomic multiplicity, we confined the above data set to those 28,145 families for which all genes in one family had a location on one of the 21 oat chromosomes, thereby omitting clusters for which at least one gene was located on unanchored scaffolds ('chromosome unknown'). Spatial clustering of a particular class of subgenomic multiplicity was assessed for gene bins of size 100 by a hypergeometric test where N is the total number of analysed genes (85,099), K is the total number of genes, and k is the number of genes in the respective bin belonging to the tested class of multiplicity. In practice, we applied the cumulative density function (CDF) of the hypergeometric distribution as implemented in SciPy 101 v1.6.1. Adjacent sliding gene bins overlapped by 50 genes. Reported probabilities were Bonferroni corrected.
In contrast to gene families, tandem genes were not phylogenetically defined but by their sequence homology and genomic proximity. Sequence similarities were derived from an all vs. all blastn comparisons of oat coding sequences of the 28,145 gene families used for the multiplicity analysis and applying a minimum e-value threshold ≤ 10 -30 . To filter for gene neighbourhood, we constructed an undirected graph with homologous genes as nodes (see above) and inserted edges between genes if at most nine unrelated (i.e. non-similar) genes were located between them. Tandem clusters of size ≥ 2 genes were subsequently retrieved as connected components using networkx v2.5 (https://networkx.org). A total of 5,432 tandem clusters comprising 14,776 genes were detected.

Identification of the cellulose and callose synthase gene families
Protein sequences of A. sativa were used to identify cellulose, cellulose-like and callose synthase genes using a set of reference genes comprised of genes retrieved by exploiting publicly available reference sequences 102-104 as well as of gene retrieved from UniProtKB and filtered for manually annotated records (Supplementary Table 22). Reference protein sequences were mapped against A. sativa cv. Sang protein sequences (representative transcripts) using blastp 75 (-max_hsps 1 -evalue 1e-2 -use_sw_tback).
The domain structure of reference sequences was determined through InterProScan5 105 (iprlookup -goterms -pa -f TSV -dp -appl Pfam,TIGRFAM,SUPERFAMIL). The domain structure of A. sativa genes was used from the AHRD result. A gene was considered a hit when the oat protein domain structure overlapped with the domain structure of the reference sequence and 60% of each of the sequences was included in the alignment.
In addition, missed sequences that had the PF03552 and PF00535 protein domains were added to the oat set of candidate genes.The resulting set of oat candidate genes was compared to the Orthofinder result and potentially missed orthologs were included. In an additional iteration step, the set of oat candidate genes was used to search for candidates in the A. sativa protein set again. Moreover, this set was mapped to A. sativa cv. Sang and OT3098 genomes with tblastn 75 to search for genes that were not annotated. This revealed no hits.

Investigation of gene family expansion in CesA, Csl and callose synthase gene families in oat
After identifying the cellulose synthase superfamily genes in Avena sativa, these were compared between Avena sativa and other species to investigate potential expansions and/or contractions that may be relevant to the elevated beta-glucan content of oats. An Orthofinder analysis was conducted as the basis for in-depth gene family analysis. In order to assign cellulose synthase gene superfamily members to the cellulose synthase (CesA) subfamily and the seven cellulose synthase-like subfamilies, CslA, CslC, CslD, CslE, CslF, CslH, and CslJ homology information from other species (e.g. rice, Arabidopsis) was used. A multiple sequence alignment of all sequences belonging to the cellulose gene superfamily was calculated using MUSCLE 106 .

Identification and characterization of storage protein gene models
Prolamin gene models were identified using avenin sequences retrieved from UniProt Knowledgebase appended with avenin transcripts identified by Tanner  Additionally, globulins possessing cupin_1 domain (PF00190) were identified. Sequences were aligned with public avenin and ATI sequences and protein sub-groups were defined. For globulins, the identified gene models were aligned to 11S-, 7S-globulins and globulin-1 sequences from wheat.
To compare protein size and N storage capacity of the storage protein sequences of oats to other crop species, prolamin and globulin protein sequences of wheat (Triticum aestivum IWGSC RefSeq v1.1), rice (Oryza sativa Japonica Group IRGSP-1.0) and soybean (Glycine max v2.1) were retrieved from Ensembl Plants. Protein sequences were aligned using ClustalW and protein subtypes have been identified. Amino acid composition of the identified gene models was determined in CLC Genomics Workbench v21 (Qiagen, Aarhus, Denmark). The protein length normalised sum of asparagine and glutamine residue counts was used to compare N storage capacity. Data was visualised using custom R 64 packages, ggplot2 108 and ggpubr 109 .

Epitope mapping and phylogenetic analysis
The predicted avenin and ATI protein sequences from each of the oat sub-genomes were combined with the predicted gliadin, glutenin, hordein and ATI protein sequences from Hordeum vulgare subsp. vulgare cv. Morex genome assembly v3 and Triticum aestivum assembly v1 annotation v1.1) for phylogenomic analysis. The oat sub-genomes were handled as independent taxa. Sequence alignment and tree constructions were performed as described by Juhasz et al., 2018 37 . Visualisation and annotation of the phylogenetic tree was performed using CLC Genomics Workbench v21. Organism, protein group, and Pfam domain information was used for annotation.
CD-specific T cell epitopes 42 were mapped to the prolamin sequences using 100% sequence identity. Epitope count for each protein was summed and used to annotate the phylogenetic tree. Presence of baker's asthma epitopes was determined using epitopes retrieved from the Immune Epitope Database and Analysis Resource (https://www.iedb.org) and used for annotation.

Promoter analysis
Cis-acting transcription factor binding sites (TFBSs) associated with storage protein gene expression were collected from public promoter motif databases (PLACE 110 , PlantCare 111 ) and appended with motifs from published results [112][113][114] . The 1,000-bp 5′-end non-coding promoter sequences of the identified gene models were extracted from the Sang oat genome, wheat genome (IWGSC assembly v1 7 ), soybean (Glycine max assembly v2.1) and rice genome (Oryza sativa Japonica group assembly IRGSP-1.0). TFBS motifs were mapped to the promoter sequences and the mapped annotations were further analysed in FileMaker Pro (FileMaker Pro Advanced v17). The number of motifs was normalised by the gene size to evaluate motif enrichments. The MEME suite and Simple Enrichment Analysis (SEA 115 ) was used to identify the significantly enriched motifs in the oat, wheat, rice and soybean prolamins and globulins. The total of normalised motif counts was calculated for each prolamin and globulin type in oat, wheat, rice and soybean and visualised using the Morpheus R package (Morpheus, https://software.broadinstitute.org/morpheus).

Protein extraction and LC-MS data acquisition Plant material
Grains from the oat cv. Sang (identical seed batch used for genome sequencing) were visually inspected and milled using a Planetary Micro Mill PULVERISETTE.
Wholemeal flour (20 mg; n=5 biological replicates) was weighed into a 1.5 mL micro-tube and 200 µL (10 µL/mg) of either Urea or Tris-HCl was added with vortex mixing until the flour was thoroughly mixed with the solvent. The tubes were then sonicated for 5 min at room temperature and incubated in a thermomixer (600 rpm, 45 min, 21°C). Next, the tubes were centrifuged for 15 min at 20,800 ×g. For the IPA-DTT extraction protocol, flour samples (20 mg) were resuspended in IPA-DTT with vortex mixing and sonication for 5 min. The sample tubes were incubated at 50°C for 30 min in a thermomixer (600 rpm). Finally, the sample tubes were centrifuged for 15 min at 20,800 ×g.

Protein digestion
Protein extracts (100 µL, n=5) were transferred to 10 kDa MWCO filters (Millipore, Sydney, Australia. Protein digestion protocol was previously described in detail by Bose et al., 2019 116 . In brief, the protein on the filter was washed twice with a buffer consisting of 8 M urea in 0.1 M Tris-HCl (pH 8.5) with centrifugation for 15 min at 20,800 × g. Iodoacetamide (50 mM; 100 μL) prepared in 8 M urea and 100 mM Tris-HCl was added to the filters for cysteine alkylation with incubation in the dark for 20 min prior to centrifugation for 10 min at 20,800 × g. The buffer was exchanged with 100 mM ammonium bicarbonate (pH 8.0) by two consecutive wash/centrifugation steps. The sequencing grade digestion enzyme, trypsin (Promega, Alexandria, Australia) solution (4 μg in 200 μL, 20 μg/mL in 50 mM ammonium bicarbonate and 1 mM CaCl2) was applied to the filter and incubated for 18 h at 37°C in a thermomixer at 600 rpm. The tryptic peptides were collected by centrifugation (20,800 x g, 15 min) followed by an additional wash with 200 µL of 50 mM ammonium bicarbonate, and the combined filtrates were subsequently lyophilized. Digested peptides were resuspended in 100 μL of 1% formic acid before analysis by LC-MS/MS as previously described by Bose et al. (2019) 116 .

Global proteome profiling
The digested peptides were reconstituted in 100 µL of 1% formic acid (FA) and chromatographic separation (4 μL) on an Ekspert nanoLC415 (Eksigent, Dublin, CA, U.S.A.) directly coupled to a TripleTOF 6600 liquid chromatography tandem mass spectrometry (LC-MS/MS, SCIEX, Redwood City, CA, USA). The peptides were desalted for 5 min on a ChromXP C18 (3 μm, 120 Å, 10 mm × 0.3 mm) trap column at a flow rate of 10 μL/min 0.1% FA, and separated on a ChromXP C18 (3 μm, 120 Å, 150 mm × 0.3 mm) column at a flow rate of 5 μL/min at 30˚C. A linear gradient from 3-25% solvent B over 38 min was employed followed by: 5 min from 25% B to 32% B; 2 min 32% B to 80% B; 3 min at 80% B; 1 min 80% B to 3% B; and 8 min re-equilibration. The solvents were as follows: (A) 5% dimethylsulfoxide (DMSO), 0.1% formic acid (FA), 94.9% water; and (B) 5% DMSO, 0.1% FA, 90% acetonitrile, 4.9% water. The instrument parameters were as follows: ion spray voltage 5,500 V, curtain gas 25 psi, GS1 15 psi, and GS2 15 psi, heated interface 150°C. Data were acquired in information-dependent acquisition mode comprising a time-of-flight (TOF)-MS survey scan followed by 30 MS/MS, each with a 40-ms accumulation time. First stage MS analysis was performed in positive ion mode with 0.25-s accumulation time. Tandem mass spectra were acquired on precursor ions >150 counts/s with charge state 2−5 and dynamic exclusion for 15 s with a 100 ppm mass tolerance. Spectra were acquired over two gas phase separation methods: m/z 350-650 and m/z 650-1,250 using the manufacturer's rolling collision energy based on the size and charge of the precursor ion. Protein identification was done using ProteinPilot™ 5.0.3 software (SCIEX) with searches conducted against the database of translated Sang gene models. The total number of proteins in the custom database was 159,362. The resulting mass spectrometry proteomics raw and result data have been deposited to the CSIRO public data access portal (see Materials). Peptides that were detected at the 95% confidence level were used in the phylogenetic tree to evidence protein level expression.

Synteny analysis of TaZIP4-B2
Synteny analysis was performed between A. sativa and T. aestivum 7 using the tool McScan 117 of the jcvi utility library 118 .
Tblastn 75 was used to exploit the A. sativa Sang and OT3098 genome sequences for orthologs of TaZIP4-B2 in order to exclude the possibility that TaZIP4-B2 is missing in A. sativa due to missed annotation in A. sativa Sang or that it is missing in the assembly due to an assembly or sequencing error. However, no evidence for TaZIP4-B2 besides the orthologs of the group 3 chromosomes was found in both the short-read (Sang) and long-read (OT3098) assemblies. The orthogroup OG0011060, containing protein sequences of TaZIP4-B2, its paralogs, and the orthologs in A. sativa, A. eriantha, A. atlantica, A. insularis, and A. longiglumis as well as B. distachyon and H. vulgare as outgroups were extracted from the Orthofinder output (section "Orthofinder"). These, in addition to the orthologs from A. sativa cv. OT3098 were used for a phylogenetic analysis. Protein sequences were aligned using MUSCLE 106 v3.8.1551 and a phylogenetic tree was calculated using fasttree 119 2.1.11 and visualised with iTol 120 6.3.

Gene expression analysis
Read trimming and quality control RNA-seq reads from 62 samples and 6 different tissues (leaf, root, crown, seeds, glumes, and spikelets) were used for gene expression analysis and for network construction (Supplementary  Table 19  Mapping of RNA-seq reads to reference Salmon v1.1.0 122 (index with option --keepDuplicates) was used to build a decoy-aware transcriptome index using the pseudomolecules as decoy sequences. Transcript abundances were quantified using Salmon (quant with options --libType A --gcBias --seqBias --validateMappings) for the trimmed RNA-seq reads mentioned above.

Differential gene expression analysis in a developmental time course of developing seeds
The transcript-level estimates from Salmon were summarised to the gene-level using the tximport package 123 v1.12.3. DESeq2 124 v1.24.0 was used to test for differentially expressed genes (DEGs) between the different stages of seed development (mid vs. early, late vs. early and late vs. mid) while controlling for differences between day and night samples (three biological replicates) (Supplementary Table 20, dataset "c"). A pre-filtering step was included to remove genes with less than 10 counts in total. A two-tailed threshold-based Wald test was used to test the null hypothesis that the absolute value of the logarithmic fold change (LFC) is less than or equal to 1. The Benjamini-Hochberg (BH) 125 procedure was used to adjust the P values for multiple testing. Genes with an adjusted P value of < 0.01 were considered significantly differentially expressed.

Analyses of expressed genes
Average TPM expression was calculated for each of the six tissues. Samples from different experiments and treatments were kept apart, resulting in 11 combinations of tissue/experiment/treatment hereafter called "condition". For average expression values across conditions, a gene was considered expressed when the TPM was above 0.5 in at least one condition. Using the average expression per condition, the expression of each gene across all conditions in which it was expressed (TPM ≥ 0.5), was determined. Thus, an average value across all conditions could be generated instead of a geometric mean across all samples, to account for the variation in the number of samples per condition. It also excludes conditions in which a gene is not expressed. This average across expressed conditions is referred to as the "global analysis" hereafter in the main text and in the supplementary materials and tables.

Definition of triads
For the identification of triads with a 1:1:1 correspondence across the three homoeologous subgenomes (exactly one member on each of the A-, C-, and D-subgenomes), the orthofinder clusters were pruned to exclude A. thaliana and filtered for clusters that include three A. sativa genes, 2 A. insularis genes and one of each A. longiglumis and A. eriantha as well as one B. dystachion gene as an outgroup. In addition, these clusters were required to be monophyletic for the Avena clades. From this, clusters with at least one A. sativa gene located on unanchored scaffolds ('chromosome unknown') were excluded. This resulted in 7,726 clusters with 1:1:1 A:C:D clusters. Next, we also identified a remaining set of 1,508 triads that were located in inter-subgenomic translocations and had an unbalanced copy number for the subgenomes. We adjusted subgenomic assignments of their genes using the translocation history in oat. The first set of triads is referred to as 'ancestral triads' while the second set is called 'relocated triads'.

Relative expression levels of the A-, C-, and D-subgenome homoeologs across triads
The analysis focused exclusively on the gene triads that had a 1:1:1 correspondence across the three homoeologous subgenomes, including 7,726 ancestral triads and 1,508 triads with at least one gene in a translocated region (total of 9,234 triads or 27,702 genes). A triad was considered to be expressed when the sum of the expression of the A-, C-, and D-subgenome homoeologs was ≥ 0.5 TPM. To standardise the relative expression of each homoeolog across the triad, the TPM of each of them was divided by the sum of all of them resulting in values between 0 and 1 for each of them. This normalised expression was calculated for the average across all samples ("global analysis" as described previously) as well as for seed tissue from early, middle and late seed developmental stages (Supplementary Table 19, dataset "c"). The values of the relative contributions of each subgenome per triad were used to plot the ternary diagrams using the R package ggtern 126 .

Definition of homoeolog expression bias categories
Ideal normalised expression bias values were defined for the seven expression categories as described by Ramírez-González et al. (2018) 31 . For every triad, the euclidean distance was calculated between the observed normalised relative expression of the triad and the ideal normalised expression of each of the seven categories. The expression bias category for the triad was then assigned by selecting the expression category with the shortest distance. This was done for the average across all expressed tissues as well as for seed tissue only.

WGCNA network construction
Coexpression networks were built using the RNA-Seq datasets a, c, d, e, f, k, and l (Supplementary Table 19) as described above with the R-package WGCNA 127 . The previously defined transcript-level estimates from Salmon were summarised to the gene-level using the R-package tximport v1.12.3 123 . The expression level of each gene was normalised using the function varianceStabilizingTransformation from DESeq2 124 to eliminate differences in sequencing depth between studies. A threshold of 5 TPM in at least three samples was used an expression threshold and the most variable 20,000 gene (6,887 from the A-subgenome; 5,470 from the C-subgenome; 6,637 from the D-subgenome; 1,006 mapping to the unassigned chromosome) were selected as an input for the network. The plateau of the scale-free topology fit index curve was used to set the soft thresholding for co-expression similarity. The selected soft powers used was 9. Dissimilarity was calculated as 1-adjacency. To identify co-expression modules, genes were clustered based on the dissimilarity measure. To detect gene coexpression modules the dynamic branch cut method cutreeDynamic was applied with the method "hybrid", a deepSplit of 1, and a minClusterSize of 30.

Orthology-based ontology annotation of oat genes
Inferred orthologous relationships were used to transfer automatic and experimentally validated annotations from orthologous genes to the respective genes in the genomes of A sativa cv. Sang. Gene Ontology (GO 128 , Plant Ontology (PO 129 , and Plant Trait Ontology (TO 130 ; term annotations were obtained and pooled from Gene Ontology (http://geneontology.org/-gene-associations), TAIR (https://www.ara-bidopsis.org/), and Gramene (ftp://ftp.gramene.org/pub/-gramene/release52 /data/ontology/) resources. Gene identifiers were mapped to public resources using the UniProtKB mapping Single-gene-mapping of the glossy.1 epicuticular wax mutant Plant material, mutants and F3 segregating pools The glossy panicle mutant designated glossy.1 was identified among the Belinda TILLING population 45 during a field amplification experiment in 2017, and a forward genetics investigation of the genetic cause was initiated. This mutant exhibited a glossy panicle and sheath compared to the glaucous phenotype of the parental cv. Belinda, cv. Sang and other members of the TILLING population. In contrast to the glossy panicle and sheath, the leaf blades were similarly glaucous. The glossy.1 phenotype was stable under greenhouse conditions with an 18-h photoperiod. A cross was made using Belinda as mother and the mutant as pollen donor (Belinda × glossy.1). The resulting F1 seeds were germinated in soil in a greenhouse and allowed to self-fertilise. Upon ocular inspection, the Belinda × glossy.1 mutant F1 plants appeared less glaucous than Belinda, indicative of the mutation being semi-dominant. An F2 population was raised to maturity in the greenhouse and the allelic status of the glossy.1 locus was deduced from the phenotype of F3 progeny (40 F3 siblings per F2 plant). The number of glossy and glaucous F2 progeny was 11 and 45 (14 homozygous, 31 heterozygous) respectively, which was compatible with a single glossy.1 locus being responsible for the glossy phenotype (null hypothesis is that homozygous glossy:heterozygous:homozygous glaucous have distribution 1:2:1, 2 (2, N = 56) = 0.96, P = 0.62). A single seed from each homozygous glossy and homozygous glaucous progeny was sown in vermiculite and the second leaf was collected and pooled (11 and 14 plants, respectively) to form two tissue pools for DNA isolation, referred to hereafter as the glossy and glaucous pools. The basic principle of bulk segregant analysis was used in an integrative genomics approach to conduct genetic mapping of shared alleles in back-cross progeny pools. During the early grain filling stage, pairs of whole glumes were collected from four representative F3 glossy and glaucous sibling plants, and this tissue was processed as outlined in the following sections on RNA isolation, SEM, and GC/MS. Eventual identification of the candidate glossy.1 locus using the forward genetics approach is outlined hereafter, in combination with an intersection of moderate or high impact mutations in other whole genome sequenced TILLING mutants, allowed the identification of one additional mutant (termed glossy.2) with a nonsense mutation in the same candidate gene. A reverse genetics approach on this second mutant included SEM of the glume to investigate the phenotype of glossy.2 at the cuticle surface.

Scanning electron microscopy (SEM)
For SEM, whole glumes, the apical portion of the sheath and basal portion of the flag leaf from representative F3 glossy and glaucous sibling plants were air dried in clean petri dishes at 50°C in darkness for 48-hr. Air dried specimens were carefully glued onto SEM stubs with the glume adaxial surface (side facing the seed), flag leaf abaxial surface, and sheath adaxial surface in contact with the stub and sputter-coated with gold (Cesington 108 auto, 45 seconds, 20mA). For the glossy.1 and glossy.2 mutants, Belinda and Sang, only the glume was analysed. The preparations were viewed using a scanning electron microscope (Hitachi SU3500) applying an accelerating voltage of 5 kV. Images were collected near the centre of the glume for all specimens at ×1000 and ×4000 magnification.

Wax extraction and Gas chromatography-mass spectrometry (GC/MS)
For representative F3 glossy and glaucous sibling plants, the glumes from 10 florets from a single panicle and plant at the early grain filling stage served as a single GC/MS sample and biological replicate. The accompanying flag leaf and the apical 15 cm of sheath (with culm removed) served as single biological replicates. Two replicates were used for glume, three replicates were used for sheath and the glaucous flag leaf, and four replicates were used for the glossy flag leaf. For flag leaf and sheath samples, wax was extracted by immersing the tissue for 10 s in a glass tube containing 10 mL dichloromethane. For glumes, wax was extracted in the same way, but with the glumes within a sample immersed consecutively in the dichloromethane. For glume single samples, wax was extracted by consecutively immersing the tissue for 10 s in a glass tube containing 10 mL dichloromethane. The solvent was evaporated under a stream of nitrogen gas at room temperature. The solid residue was redissolved in 30 µL dichloromethane and transferred to glass vials. Then, 30 µL of trimethylsilyl trifluoroacetamide (MSTFA) was added and the vial left at room temperature for 30 min. Analyses were performed on an Agilent 7890 gas chromatograph connected to a 5975 single quadrupole mass analyzer according to a previously published method 131 , but with the final isocratic temperature adjusted to 340 C. Data were analysed using a Matlab script 132 and peaks annotated by retention index and mass spectra matching in the NIST MS library. Metabolites were identified based on retention index and mass spectrum in the NIST MS library.

GC/MS statistics
The two genotypes and three tissue types were compared using Welch's t-test for sheath and flag leaf respectively. P-values were adjusted for multiple testing using the Benjamini-Hochberg FDR method. As only two replicates were used for glume, no statistics were computed for this tissue type.
Illumina whole-genome resequencing of cultivars, EMS mutants, and back-crossed pools. The glossy and glaucous tissue pools were ground to a fine powder using liquid nitrogen in a mortar and pestle. Genomic DNA was isolated using the Qiagen DNeasy Plant Mini Kit according to the manufacturer's protocol. Shotgun libraries were prepared using 350-bp fragments (chemical fragmentation) and the Illumina TruSeq PCR-free Preparation Kit (Illumina, San Diego, CA). Libraries were sequenced on a NovaSeq6000 (NovaSeq Control Software 1.7.0/RTA v3.4.4) with a 151nt(Read1)-10nt(Index1)-10nt(Index2)-151nt(Read2) setup using 'NovaSeqXp' workflow in 'S4' mode flowcell. The Bcl to FastQ conversion was performed using bcl2fastq_v2.20.0.422 from the CASAVA software suite. The quality scale used was Sanger / phred33 / Illumina 1.8+. The cultivars Sang and Belinda were similarly sequenced to identify mutations novel to the TILLING mutants. Additional TILLING mutants with novel phenotypes were also sequenced outside the main aims of this work, and data from one such mutant (glossy.2) were used in the forward genetics approach for glossy.1. For data availability see Supplementary Table 19.
Glume RNA sequencing A pair of glumes from a single floret and individual plant served as a single biological replicate. Tissue was collected for four replicates each from representative F3 glossy and glaucous sibling plants. Each sample was snap frozen in liquid nitrogen in 2mL screw cap tubes containing two 6mm glass beads. While frozen, tissue was pulverised using a Precellys Tissue Homogenizer and this sample tube then served as the starting vessel for total RNA isolation using the Qiagen RNeasy plant Mini kit following the manufacturer's instructions. Sequencing libraries were prepared as outlined above except that libraries were sequenced on NovaSeq6000 (NovaSeq Control Software 1.7.0/RTA v3.4.4) with a 151nt(Read1)-10nt(Index1)-10nt(Index2)-151nt(Read2) setup using 'NovaSeqXp' workflow in 'S4' mode flow cell. For data availability see Supplementary Table 19.
Variants present in unrelated lines from the TILLING and parent population were filtered using bcftools v1.12 134 . To identify regions conserved in the pool, a sliding mean of the mutant allele frequency was plotted across each chromosome using custom R 64 scripts (v4.1.0; tidyverse 95 v1.3.1; ggplot2 142 v3.3.5; vcfR 143 v1.12.0; svglite 144 v2.0.0; fs 145 v1.5.0; slider 146 v0.2.2; tidymodels 147 v0.1.4 window size of 100 observations). Bootstrap confidence intervals were computed by recomputing the sliding mean across 1,000 bootstrap samples for each chromosome, and calculating the interval capturing 95% of the observations for each position across the genome. SnpEff 148 v4.3.1t was used to identify mutated genes, and genes with moderate-or high-impact mutations were considered for further analysis. Only homozygous mutations were considered as a part of this analysis. Genes were additionally filtered based on being located in a region identified to be conserved in the pool, gene expression in the glume, read support, type of the mutation found in the gene, and whether or not the gene was mutated in other lines sharing the phenotype. Genes on chrUn were included in the analysis, and interesting candidates were localised using the Hi-C data previously used to construct the pseudomolecules.

Beta-diketone biosynthetic gene cluster phylogenetic analysis
Cluster genes were selected based either on proximity to our candidate gene, or on being a part of the same orthogroup as either of the candidate, the barley cer-q, -u, and -c genes or of a gene close to our candidate. Genes in OT3098 were selected based on the source of the projection annotation gene model, and coordinates were updated to match OT3098 v2 using gmap 84 v2020.06.01.