Reference genome assemblies reveal the origin and evolution of allohexaploid oat

Common oat (Avena sativa) is an important cereal crop serving as a valuable source of forage and human food. Although reference genomes of many important crops have been generated, such work in oat has lagged behind, primarily owing to its large, repeat-rich polyploid genome. Here, using Oxford Nanopore ultralong sequencing and Hi-C technologies, we have generated a reference-quality genome assembly of hulless common oat, comprising 21 pseudomolecules with a total length of 10.76 Gb and contig N50 of 75.27 Mb. We also produced genome assemblies for diploid and tetraploid Avena ancestors, which enabled the identification of oat subgenomes and provided insights into oat chromosomal evolution. The origin of hexaploid oat is inferred from whole-genome sequencing, chloroplast genomes and transcriptome assemblies of different Avena species. These findings and the high-quality reference genomes presented here will facilitate the full use of crop genetic resources to accelerate oat improvement.

C ommon oat (A. sativa L., 2n = 6x = 42, AACCDD genomes) is an important cereal crop cultivated worldwide and has long been prized by consumers primarily because it is one of the richest sources of protein, fat and vitamin B1 among all crops 1 . In addition, oats are a widely grown cool-season annual forage species, and represent a major source of high-quality forage for livestock globally 2 . Common oat belongs to the genus Avena in the grass family Poaceae, and the genus comprises a polyploid series of wild, weedy and cultivated species distributed across six continents 3 . The diploids have either the AA or CC genomes; the tetraploids mainly contain the AABB or CCDD 4 (previously AACC) genomes; all hexaploid species, including common oat, share the same AACCDD genomic constitutions 5 . Due to the lack of genome sequences from hexaploid oat and its close diploid and tetraploid relatives, the phylogenetic history and divergence time among the A, C and D genomic lineages remain unclear. Polyploid plants often have notable advantages in biomass production, vigor and adaptability to environmental changes, contributing to the emergence of important agronomic traits in food crops [6][7][8] . Oat has good adaptability to a wide range of climatic conditions, enabling oat to reliably produce grains in marginal regions with harsh conditions. Therefore, crop polyploidization plays an important role in next-generation crop improvement to overcome food security challenges 8 . Many genomes of commercially important crops have been sequenced and assembled, which has improved the understanding of crop evolutionary history and the development of efficient approaches for selecting important traits 9 . Oat has lagged behind in this regard, primarily due to the large genome 10 that contains highly repetitive DNA sequences, and the fact that both the A-and D-subgenomes are similar to the A-genome diploid and were difficult to distinguish from one another in previous studies 11 . Currently, relatively little is known regarding the position and distribution of genes on each of the oat chromosomes and their evolution during the polyploidization events that gave rise to the hexaploid species, which limits the full and effective utilization of oat germplasm. With the recent advances made by Oxford Nanopore Technologies (ONT), the ONT system now offers ultralong sequence reads, delivering high contiguity with low assembly errors caused by long repetitive regions 12 . This technology has facilitated complete telomere-to-telomere (T2T) genome assembly in various species, including Homo sapiens, by resolving long, complex repetitive regions 13,14 . It is very suitable for assembling the large, complex polyploid oat genome, with a high content of repetitive sequences and high subgenomic homology. Here, we used the ultralong ONT system to sequence the genome of the oat variety 'Sanfensan' (A. sativa ssp. nuda), a hexaploid oat landrace that originated from the diversity center of hulless oat, together with assembling and resequencing the genomes of additional diploid and tetraploid Avena species to gain insights into the evolutionary processes that established the dominant subgenome in allopolyploid oat species and demonstrate the utility of the oat reference genome in identifying genes that underlie agronomic traits.

Results
Assembly and annotation of the oat genome. We assembled the Sanfensan genome into 326 contigs based on 1,028 Gb of ultralong reads (N50 length: 52 kb; ~100× genome coverage). The draft genome assemblies were then corrected by using ~650 Gb of cleaned Illumina paired-end reads, resulting in a total assembly size of 10.76 Gb, 99.06% of which was arranged into 21 chromosomes (based on 1,296 Gb of Hi-C data) after dissociating 72 misjoined contigs into 182 contigs using three-dimensional proximity information (Extended Data Fig. 1a). The final assembly contained 436 corrected contigs with N50 of 75.27 Mb and a maximum length of 313.87 Mb. Of these, 323 contigs were anchored onto 21 pseudochromosomes (Table 1, Supplementary Tables 1-3 and Extended Data Fig. 1a). The quality of the Sanfensan genome assembly was supported by assessments including NG plots, long terminal repeat (LTR) assembly index (LAI; 18.34), BUSCO (99.44%) and the coverage of full-length transcripts by the assembled genome (92.00%) (Extended Data Fig. 2), together with the markers from hexaploid oat consensus map 15 and high level of syntenic relationship with the other hexaploid OT3098 v2 reference genome (https://wheat. pw.usda.gov/jb?data=/ggds/oat-ot3098v2-pepsico) (Extended Data Fig. 3). In addition, we found that 99. 87% and 96.30% of the assemblies were covered at more than 20× sequencing depth with ultralong reads and short reads, respectively, indicating high accuracy at the nucleotide level (Extended Data Fig. 2d,e). Finally, 99.75% of the 4,336,693,678 Illumina paired-end reads could be mapped onto the assembly, with 98.22% properly paired alignments. From this  Tables 4 and 5). Moreover, 86.95% (9.35 Gb) of the assembly was annotated as repetitive elements (Supplementary Table 6), which is higher than previously reported genomes of barley (80.80%) 16 and bread wheat (84.70%) 9 .
To distinguish the subgenomes accurately and clarify the polyploidization history of the hexaploid oat, we sequenced and assembled its most likely ancestral species A. longiglumis (2n = 2x = 14, AlAl genome) and A. insularis (2n = 4x = 28, CCDD genome) 5 , resulting in >60× genome coverage for A. longiglumis (218.67 Gb) and A. insularis (374.77 Gb). The assembled A. longiglumis genome was 3.74 Gb in size, 99.20% of which was anchored onto seven chromosomes with the A. atlantica genome as the reference 17 . The tetraploid A. insularis genome assembly was 7.52 Gb in size, 95.08% of which was arranged into 14 chromosomes by Hi-C analysis (Extended Data Fig. 1b). We annotated 43,477 and 89,995 protein-coding genes in the A. longiglumis and A. insularis genomes respectively (Table 1). Similarly, most of the   A. longiglumis (86.83%) and A. insularis (87.11%) genomes are composed of repetitive elements (Table 1 and Supplementary  Table 6). Based on the genomic similarity among these three genomes, we successfully assigned the 21 pseudochromosomes of hexaploid oat to the A, C and D subgenomes (Extended Data Fig. 4a-d). The subgenome assignment was supported by the over-representation of the A-and C-genome-specific satellite repeats, As120a and Am1, respectively 18,19 , in the corresponding subgenomes ( Fig. 1), and by mapping the whole-genome sequencing reads from a range of Avena species (Extended Data Fig. 4e,f). The nomenclature system of OT3098 v2 was adopted for naming the chromosomes of Sanfensan, which was consistent with that approved by the International Oat Nomenclature Committee. Evolutionary position of oat among cereal crops. The Poaceae family consists of many agronomically important species, commonly known as cereals, that are classified into three subfamilies: Oryzoideae (rice), Panicoideae (maize, sorghum) and Pooideae (Triticeae: wheat, barley and rye; Aveneae: oat). Among them, Oryzoideae and Pooideae belong to the BOP clad (Bambusoideae, Oryzoideae and Pooideae), which is one of the two primary groups in the Poaceae family. To further clarify the evolutionary position of oat among cereal crops, we used our oat reference genome assemblies to perform phylogenomic analysis to include oat. Gene family analysis across 23 subgenomes from 16 species belonging to the BOP clade identified 2,237 single-copy orthologs (Supplementary  Tables 7-9). Phylogenomic analyses (Fig. 2a) revealed that the divergence between Aveneae and Triticeae took place after the speciation of Oryzoideae, with the approximate times for the two events being 28.2 and 47.9 million years ago (mya), respectively. The tribes Aveneae and Lolieae were indicated to be more closely related than Triticeae. The diversification of oat species occurred ~8.7 mya, which is earlier than wheat species (~6.6 mya) and falls within the previously estimated speciation time of Avena diploids (5.4-12.9 mya) 17 . We compared the gene families among Aveneae, Triticeae and Lolieae and found 6,425 common gene families in these three tribes, and there are 1,608 gene families specific to Aveneae (Fig. 2b).
The top three most enriched Gene Ontology (GO) terms are 'response to auxin' , 'enzyme inhibitor activity' and 'transcription factor activity' (Supplementary Table 10). The ancestral grass karyotype (AGK) was previously reconstructed as a post-ρ AGK with 12 protochromosomes and a pre-ρ AGK with seven protochromosomes by comparing extant species 20 . Rice was identified as the most slowly evolving species and has 12 chromosomes, most of which closely resemble the post-ρ AGK. Based on the phylogenomic analysis, Triticeae (wheat, barley and rye) and Aveneae (oat) clustered together with Oryzoideae (rice) as an outgroup. We therefore used the rice genome as the ancestral reference and the barley genome as a closely related reference to investigate the chromosomal evolution of oat and to compare it with another allohexaploid cereal species, wheat. We identified 733 and 872 syntenic blocks that included 33,065 and 37,042 orthologous gene pairs from the subgenomes of hexaploid oat based on rice and barely, respectively. (Fig. 2c and Supplementary Table  11). Essentially, the third homologous group of oat and wheat was derived from a single ancient chromosome, AGK1 or Os1, except for the AGK3 or Os3 segment that was translocated to chromosome 3C in A. eriantha. The second group was mainly derived from the insertion of AGK7 or Os7 into AGK4 or Os4, and the remaining five groups were each derived from at least three ancestral chromosomes via complex translocations (Fig. 2c). When the oat assembly was compared with the three subgenomes of common wheat using barley genomes as a reference, a large number of chromosomal rearrangements were identified. For example, although the second homoeologous group of oat and wheat are mainly derived from AGK4 or Os4 and AGK7 or Os7, the arrangement patterns of these two ancestral chromosomes are different.
Polyploidization history and reticulate evolution. Previous phylogenetic studies could not be well distinguish between the A and D subgenomes of Avena, which led to confusion and gave inconsistent results about the origins of hexaploid oat [3][4][5]11,21 . We have now definitely assigned chromosomes to the A and D subgenomes with the resolution afforded by the complete genome assemblies. Based on this foundation, and to identify the subgenome donors accurately, we resequenced the genomes of 14 Avena species representing different genomic subtypes and ploidy levels (As, Al, Ac, Ad, Cv, Cp, AB and CD). A total of 5,014.96 Gb of genome resequencing data (Supplementary Table 1) were mapped against the A, C and D subgenomes (Methods and Supplementary Table 12). The identity distribution clearly showed that the Al/As diploid species, and the C and D subgenomes of A. insularis have the highest similarities with the A, C and D subgenomes of hexaploid oat, respectively (Fig. 3a). The SNP-based phylogenetic tree of the three subgenomes consistently implies that the species cluster according to the genome subtypes, in which Al and As are the most closely related to the A subgenome of hexaploid oat, whereas none of the extant diploids shows a particularly close relationship with the C and D subgenomes (Fig. 3b). These results are further supported by the phylogenetic analyses of transcriptome data from 11 diploid species (Fig. 3c), suggesting that the A-genome progenitor of hexaploid oat was more likely the ancestor of the Al and As diploids than an extant diploid species, which may be the reason that previous studies 22 have considered different A diploids to be the A-genome donors of hexaploid oat.
We also assembled the chloroplast genomes of different Avena species (Supplementary Table 13), as the maternal inheritance characteristic can be used to determine the maternal genome donor to hexaploid oat. By including 25 previously reported chloroplast genomes (Supplementary Table 14) 22,23 , our phylogenetic analysis ( Fig. 3d) showed that the C genome was undoubtedly the male parent in the polyploidization and that the D genome, rather than the A genome, was the maternal donor in hexaploid oat. The evolutionary order of the different A-genome subtypes was Ac-Ad-Al-As (Fig. 3d). The relatively low collinearity between the C genomes of the diploid and polyploid species is consistent with the The yellow and blue arrows and lines connecting the chromosomes represent the observed large chromosomal translocations (>40 Mb) from the A and C genomes, respectively. The dark gray arrow and line indicate the inversion in chromosome 3C between A. insularis and Sanfensan. b, Large C-to-A or C-to-D translocations are supported by mapping reads from the C genome diploid to the hexaploid reference genome; blue arrows indicate C-to-D and C-to-A intergenomic translocations. c, FISH using the C genome-specific repeat as a probe confirms the C-to-A and C-to-D translocations. Fluorescence signals from the A genome-specific repeat (As120a) are shown in green, and signals from the C genome-specific repeat (Am1) are in red. The white arrows indicate C-to-D and C-to-A intergenomic translocations. d, FISH confirms the inversion on chromosome 3C observed in the hexaploid oat genome. Oligo-5SrDNA (red) and Oligo-6C343 (green) gave clear hybridization signals on the short and long arms of the tetraploid 3C, respectively, whereas both signals are observed on the long arm of chromosome 3C in the hexaploid. For karyotyping in panels c and d, at least three slides for each accession and ten chromosomes per slide were examined. Scale bars, 2 μm. e, Comparison of Ka/Ks value distributions between the three subgenomes of hexaploid oat. The central line for each box plot indicates the median. The top and bottom edges of the box indicate the first and third quartiles and the whiskers extend 1.5 times the interquartile range beyond the edges of the box. Numbers of samples used in each assay are indicated as n. The asterisks represent significant differences (two-tailed Wilcoxon rank-sum test, **P < 0.01). f, Two-dimensional hierarchical cluster analysis of gene expression among single-copy homoeologous oat genes compared with organ-specific gene expression. TPM, transcripts per million. g, Analysis of log 2 -fold changes in pairwise gene expression between homoeologous genes showed biased expressions. DEGs, differentially expressed genes. Dot plots show the fold changes for each triplet ordered as shown in the y axis (f). The numbers of significantly differentially expressed triplets (one-tailed Fisher's exact test, P < 0.05) across all organs are shown at the top of the box, and the significance of the differences in the values between subgenomes was estimated using the one-tailed Wilcoxon rank-sum test (*P < 0.05, **P < 0.01).
nuclear-cytoplasmic interaction hypothesis, which suggests that the paternally inherited genome of an allopolyploid is usually more prone to genetic changes than the maternally derived genome 24 .
In addition, our results indicate that the D-genome progenitor of hexaploid oat is more closely related to the A-genome than to the C-genome and may be extinct. The C and A/D lineages diverged approximately 8 mya, followed by the A-genome subtypes (Ac/ Ad) and the D genome around 3.5 mya. Cultivated ACD-genome hexaploid oat originated around 0.5 mya from the hybridization between a paternal Al/As-genome diploid ancestor and a maternal CD-genome tetraploid that is closely related to A. insularis and originated from an allotetraploidy event between a paternal C-genome and a maternal D-genome diploid (Fig. 3e). These findings clarified the evolutionary history of oats based on various pieces of evidence at the genomic level and provided the most possible clues to the subgenomic origin of hexaploid oat, which will be of great value for introgression breeding and the transfer of traits from the closest extant wild relatives (As/Al genome diploids and CD genome tetraploids) to cultivated oat.
Subgenome structure and dominance. Consistent with the inference that the A and D subgenomes of oats are closely related, the synteny between them (54.40% genes in collinear blocks) is more extensive (P < 2.2 × 10 −16 , Fisher's exact test) than that between the C and A/D subgenomes in hexaploid (43.59% in C and A and 41.43% in C and D) and tetraploid (38.81% in C and D) (Extended  Table 17). Chromosomal rearrangements have been implicated as one of the driving forces for shaping adaption and speciation by affecting gene expression through position effects [25][26][27] . In oat, the 1C/1A translocation (previously designated as 7C/17A) is well known to be associated with the division of cultivated oat into A. sativa L and A. byzantina K. Koch (sub)species 28 and variations in crown freezing tolerance and winter field survival 29,30 . Functional enrichment analyses showed that 'multicellular organism development' (GO:0007275, 37/112, BH-Adjusted P = 9.69 × 10 −12 ), 'ubiquitin-dependent protein catabolic process' (GO:0006511, 51/218, BH-Adjusted P = 4.84 × 10 −9 ), and 'oxidation-reduction process' (GO:0055114, 373/3,273, BH-Adjusted P = 3.00 × 10 −6 ) are the top three most enriched biological processes terms for genes positioned within the six large intergenomic translocations that occurred during tetraploidization (Supplementary Table 18). Genes positioned within the 1A/1C translocation were significantly enriched for two biological process terms 'photosynthesis, light harvesting' (GO:0009765, 11/139, BH-Adjusted P = 2.92 × 10 −2 ) and 'regulation of systemic acquired resistance' (GO:0010112, 5/23, BH-Adjusted P = 2.92 × 10 −2 ) (Supplementary Table 19). These processes are essential for responses to abiotic and biotic stress and thus might be involved in local environmental adaption and (or) speciation in oat.
Unlike the chromosomal translocations in the tetraploid A. insularis ( Tables 15 and 16 and Extended Data Fig. 7). These results suggest that the homoeologous rearrangements after hexaploidization played an important role in forming the genome structure of cultivated oats. Moreover, the homoeologous rearrangements in hexaploid oat appeared to be biased among the three subgenomes in that 88.4% (931.94/1054.30 Mb) occurred between the A and D subgenomes, which is much higher than that occurred in A and C (11.2%, 117.71/1054.30 Mb) or D and C (0.04%, 4.66/1054.30 Mb). This finding may support the hypothesis that the presence of two well-conserved homologous genomes in the same nucleus would facilitate inter-subgenome recombination and rearrangement after polyploidization 31 .
Assembly results showed that the C subgenome is ~20% larger than the A and D subgenomes, which can be explained almost entirely by the relative abundances of repeat (Supplementary Table 6). Intriguingly, we observed biased gene fractionation among the subgenomes of hexaploid oat. First, the identification of presence/ absence variations (PAVs) (lost genes in tetraploid C vs D, 1,618 vs 1,315, degrees of freedom (df) = 1, P = 4.13 × 10 −9 ; hexaploid C vs A, 2,367 vs 790, df = 1, P < 2.2 × 10 −16 ; and C vs D, 2,367 vs 1,079, df = 1, P < 2.2 × 10 −16 ; chi-squared test) and pseudogenes (C vs A, 28,056 vs 23,695, df = 1, P < 2.2 × 10 −16 ; C vs D, 28,056 vs 24,009, df = 1, P < 2.2 × 10 −16 ; chi-squared test) consistently showed a higher rate of gene loss in the C subgenome of hexaploid oat (Extended Data Fig. 8a-c). More contracted gene families in this subgenome (C vs A: 1,100 vs 569; C vs D: 1,100 vs 535) also support this result (Extended Data Fig. 8d-f). The contracted gene families are mainly related to 'fructose 6-phosphate metabolism' , 'the glycine metabolism' and 'translational termination' (Supplementary Table 20). Second, using barley as the outgroup, we estimated the nonsynonymous to synonymous substitution rate ratios (Ka/Ks) of the 7,353 single-copy orthologs identified between the oat (sub)genomes. The results showed that the polyploids exhibited an accelerated rate of evolution relative to their A-and C-genome diploid progenitors and that the C subgenome was subject to less purifying selection than the other two subgenomes of hexaploid oat (Fig. 4e and Extended Data Fig. 9). Third, examination of the orthologs expression patterns showed that the number of preferentially expressed genes in the C subgenome was significantly lower than that in the A (up-regulated genes in A vs C, 9,375 vs 7,541, P = 0.009548, Wilcoxon rank-sum test) and D (D vs C, 9,359 vs 8,139, P = 0.04693, Wilcoxon rank-sum test) subgenomes (Fig. 4f,g and Supplementary  Table 21). Interestingly, these expression-biased genes have more AS variants than those with balanced expression (Extended Data Fig. 10a Fig. 10b). We further found that genes with relatively higher TE densities near genes tend to have lower expression levels (Extended Data Fig. 10c). This observation is consistent with the hypothesis that subgenome gene expression dominance is influenced by TE-density differences between subgenomes as observed in other allopolyploids 7,32 . Taken together, these results demonstrate subgenome dominance in hexaploid oat.

Genes related to important agronomic traits.
Oat production is threatened by several agricultural diseases, one of which is crown rust caused by the basidiomycete fungus Puccinia coronata var. avenae. Nucleotide-binding site leucine-rich repeat (NBS-LRR) proteins, which are encoded by a class of resistance genes (R-genes), play important roles in plant immune signaling 33 . We identified 1,269 R-genes across the three subgenomes of Sanfensan, showing a contraction compared with the numbers identified in the tetraploid and diploid (sub)genomes (Fig. 5a). Most of these R-genes occur in clusters at the distal ends of all chromosome arms. Mapping DNA markers cosegregating with or flanking the known crown rust genes (Supplementary Table 22) revealed these genes colocalize with R-genes (Fig. 5b). These discoveries could serve as promising candidates for future mapping and gene cloning studies. Cultivated oats are generally classified into two production-related morphological types, hulled and hulless, which is one of the domestication traits (Fig. 6a). The caryopsis of hulled oat is tightly surrounded by a thick, tough and hard-to-remove hull, whereas hulless oat has a papery hull that is easily threshed when mature 34 . To identify genomic loci contributing to the hulless trait in oat, we performed a genome-wide association analysis based on 49,702 SNPs from 659 diverse oat accessions (Supplementary Table 23). We first infer the population structure and linkage disequilibrium (LD) pattern of the oat collection. Both principal-component analysis (Fig. 6b) and neighbor-joining tree analysis (Fig. 6c) revealed weak population structure, which is consistent with previous studies [35][36][37] . Most of the hulless landraces were tightly clustered reflecting domestication bottleneck for hulless oat 34 . The LD decay rate was measured as the physical distance at which the average pairwise r 2 dropped to 0.2. Genome-wide LD decay rate of this population was estimated at 2.29 Mb (Fig. 6d), The long-range LD observed in oat is similar to that in other self-fertilizing species such as wheat 38, 39 and barley 40,41 . An association scan detected a strong peak on the end of chromosome 4D, which collocated with the previously reported N1 locus 34,42 (Fig. 6e,f). We searched the vicinity of the peak for plausible candidate genes with particular attention to genes within the genomic region covered by markers flanking the N1 locus 42 (Fig. 6f), and came across a gene (A.satnudSFS4D01G000045) annotated as a receptor-like kinase, which is about 30 kb distant from the most highly associated marker. A homologous gene AtVRLK1 in Arabidopsis was found to be involved in secondary cell wall thickening 43 , and the mutant of the homologous gene mis2 in rice displayed an open hulled spikelet 44 . Comparison the coding sequences (CDSs) of A.satnudSFS4D01G000045 identified a SNP in first exon predicted to cause amino acid changes (Fig. 6g). The association of this SNP with the hulless trait was validated by the development of KASP (Kompetitive allele-specific polymerase chain reaction) markers (Fig. 6h). By comparing transcriptome data between 10 hulled and 12 hulless oats, we found A.satnudSFS4D01G000045 is differentially expressed with hulless oats having higher expression levels (P < 0.01, Student's t test) (Fig. 6i). Further examination of the expression patterns in the panicles at different developmental stages revealed this gene is highly expressed during panicle development in Sanfensan (hulless) but is expressed at very low levels in panicles of Ogle (hulled) (Fig. 6j, Student's t test). These results indicated that A.satnudSFS4D01G000045 may be a promising plausible candidate gene that controls the hulled/hulless trait in oat.

Discussion
The Sanfensan oat, together with its diploid and tetraploid ancestors reference genomes presented here, constitutes an important community resource for cereal genomics and provides comprehensive and specific insights into the evolutionary history of oat. This study is the most comprehensive phylogenomic analysis of Avena to date, as it included samples representing all extant Avena genomes and developed the largest number of molecular markers evaluated thus far. Whole-genome sequencing, the de novo assembly of transcriptomes and complete plastid genomes show agreement on conclusion. The proposed model for the chronological formation   of polyploid oat has been clarified, and the investigation into the evolution of the oat subgenomes during polyploidization events offers a window into the study of polyploid genome evolution. The high-quality reference genomes will facilitate the identification of genes that are related to important agronomic traits, including disease resistance and end-use functionality. These genomic resources are an important step toward the genetic improvement of oat.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41588-022-01127-7.  Table 2). The raw ONT long reads were subjected to self-correction using the NextCorrect module implemented in NextDenovo (v2.0-beta.1) (https://github. com/Nextomics/NextDenovo). The corrected reads were then assembled into contigs using NextDenovo (Supplementary Table 3). To improve the accuracy of the assembled contigs, a two-step polishing strategy was applied: (1) the corrected Nanopore reads were first used for initial polishing with Racon (v1.4.21) 45 (three rounds), and (2) the highly accurate short reads were then used to further correct the assemblies with NextPolish (v1.0.5) (four rounds).

Methods
We used Hi-C technology to obtain chromosome-level genome assemblies of Sanfensan and A. insularis. A total of 1,312.83 Gb and 816.93 Gb of raw Hi-C data were generated for Sanfensan and A. insularis, respectively. The polished contigs of Sanfensan and A. insularis were further clustered, ordered and anchored to pseudochromosomes using Hi-C data with the LACHESIS program 46 . We used RaGOO (v1.1) 47 with the default parameters to anchor the contigs of A. longiglumis to seven pseudochromosomes with the previously published As genome of the diploid species A. atlantica 17 as the reference.
Genome quality assessment. Multiple approaches were used to evaluate the quality of the assembled genomes. First, the NG values for all integer thresholds (1-100%) were calculated and used to generate an NG graph for each of the three assembled genomes to estimate the continuity of the assemblies. Second, the short paired-end reads from each species were mapped to their corresponding genome assemblies using BWA (v0.7.10-r789) 48 with default settings. Variants were called using the HaplotypeCaller module of GATK (v4.1.9.0) 49 . The obtained SNPs and InDels were used to estimate the base-level accuracy and relative low heterozygosity of the three assemblies. Third, the BUSCO (v5.2.2) 50 pipeline was used to evaluate the completeness of the gene space. Finally, the LAI 51 was used to evaluate the completeness in the more repetitive genomic regions.
Subgenome assignment. A reference-guided strategy based on subgenome sequence similarity was used to distinguish the subgenomes of A. insularis and Sanfensan. The A. longiglumis reference genome was split into 100-bp nonoverlapping fragments and mapped onto the hexaploid genome assembly using BWA with the default settings. The uniquely mapped fragments were retained (Extended Data Fig. 4b-d). A syntenic block was defined based on the presence of at least five syntenic fragments (Extended Data Fig. 4a,b). The chromosomes with the highest similarity to A. longiglumis were assigned to the A subgenome, and the chromosomes with the lowest similarity were assigned to the C subgenome, because previous studies have shown that the C genome is highly diverged from the A and D genomes 4,5 ; the remaining chromosomes with median similarity were accordingly assigned to the D subgenome (Extended Data Fig. 4a). Using the same method, the A. insularis genomic sequences were aligned to the Sanfensan reference genome, and they showed a high level of similarity with the hexaploid chromosomes that were assigned to the C and D subgenomes (Extended Data Fig. 4b,d). The subgenome assignments were further validated by the quantified depth of coverage of the paired-ended reads from A. longiglumis and A. insularis t (Extended Data Fig. 4e,f) and the abundance of two types of DNA repeats in the A. insularis and Sanfensan genomes, which have been reported to be overrepresented in the Avena C 18 and A 19 genomes, respectively (Fig. 1).
Protein-coding genes were predicted using an evidence-based annotation workflow by integrating evidence from transcriptomic data, homologue searches and ab initio prediction. Transcriptomic data were generated by performing PacBio full-length transcriptome sequencing using total RNA isolated from mixed plant organs. Raw reads were processed with IsoSeq3 pipeline (https://github. com/PacificBiosciences /IsoSeq) to identify full-length, nonchimeric circular consensus sequences (CCSs). The resulting high-quality CCSs were mapped onto the reference genome for de-redundancy. The nonredundant isoforms were then used to determine the locations of potential intron-exon boundaries using GeneMarkST 58 . Protein sequences from A. atlantica, A. eriantha, Brachypodium distachyon, Hordeum vulgare, Oryza sativa and Triticum aestivum were used as protein evidence. Ab initio gene prediction was performed using GeneMark-ET (v4.0) 59 and AUGUSTUS (v2.4) 60 with two rounds of iterative training. All gene predictions were integrated using the recommended settings of EVidenceModeler (v1.1.1) 61 after removing TE-related genes, pseudogenes and noncoding genes using TransposonPSI (v1.0.0) 62 with the default settings.
Noncoding RNAs, including microRNAs, small nuclear RNAs, ribosomal RNAs and regulatory elements, were identified using the Infernal (v1.1.2) 63 program to search against the Rfam database 64 . The ribosomal RNAs, transfer RNAs and microRNAs were identified using RNAmmer (v1.2) 65 , tRNAscan-SE (v2.0) 66 and miRanda (v3.0), respectively. The putative domains and GO terms of the predicted proteins were identified using the InterProScan (v5.22) 67 program with the default settings. Table 7) for which high-quality reference genomes were available, mainly representing the Pooideae subfamilies Aveneae, Lolieae and Triticeae tribes that contain the main cereal crops were used for gene family clustering analyses. Gene families of the 23 subgenomes from 16 species were identified using the OrthoFinder (v2.2.7) 68 program with default settings. The genes that were exclusively found in each tribe (Aveneae, Lolieae and Triticeae) were identified. Significantly overrepresented GO terms in each group were identified using the R package 'topGO' (https://www.bioconductor.org/packages/release/bioc/html/ topGO.html).

Identification of orthologous and homoeologous gene sets.
Using the gene families identified by the OrthoFinder program, 2,237 one-to-one orthologous gene sets were identified for the 23 subgenomes of 16 grass species. We identified 7,353 one-to-one orthologous gene sets for the eight Avena (sub)genomes and H. vulgare cv. 'Morex' . We also identified 12,225 one-to-one homoeologous gene sets (triads) for hexaploid oat (Supplementary Table 9).

Phylogenomic analyses of cereal crops.
To investigate the evolutionary position of oat among cereal crops, a phylogenetic tree was constructed using the 2,237 conserved single-copy genes among the 23 subgenomes of 16 grass species. For this purpose, conserved CDS alignments of these single-copy orthologues were extracted by using Gblocks (v0.9b) 69 and were concatenated to generate a supermatrix. The transversion rates at fourfold degenerate (4DTv) sites were extracted from this supermatrix and subject to analysis with RAxML (v.8.2.7) 70 to generate the maximum likelihood tree using the GTR + I + Γ model. Divergence times were estimated using the MCMCTree program in the PAML(v4.7) package 71 .
To investigate the chromosomal evolution of the Avena genomes, we selected oat diploid, tetraploid and hexaploid species together with wheat to analyze the synteny between these genomes and the rice or barely genome using MCScanX (git-97e74f40) 72 with the default settings, and the identified syntenic blocks were then used to deduce the homologous relationships between rice or barely genes and the protein sequences of Avena and wheat. The collinearity between species was identified and plotted using MCScanX (python version) (Fig. 2c).

The evolution and allopolyploidization history of oat.
For whole-genome resequencing, 14 other Avena accessions were chosen, representing all genome types found among extant Avena diploid, tetraploid and hexaploid species (Supplementary Table 1). All genome sequencing was performed on an Illumina HiSeq X-Ten instrument generating 400-base paired-end libraries.
The raw reads from the 14 newly sequenced accessions as well as A. longiglumis and A. insularis were trimmed using Trimmomatic (v.0.40) 73 . Potential polymerase chain reaction duplicates were removed using SAMtools (v1.9) 74 . The genomic variants were identified using the HaplotypeCaller module in GATK (v4.1.9.0), and SNPs that met any of the following criteria were further discarded: (1) SNPs with quality score <50, QD score <2, FS score >60 or MQ score <40; (2) SNPs with more than two alleles; (3) SNPs at or within 5 bp from any InDels; (4) Genotypes with extremely high (greater than three-fold average depth) or extremely low (less than one-third average depth) coverage were assigned as missing sites; (5) SNPs with a minor allele frequency < 0.05; and (6) SNPs with missing sample rate >0.3. SNPs in syntenic regions across the A, C, and D subgenomes for each accession were isolated and used for maximum-likelihood (ML) tree construction using the RAxML software with 200 bootstrap replicates. The phylogenetic relationships of the species related to the A, C, and D lineages were constructed based on A-, C-, and D-type SNPs, respectively. The level of identity between the individual Sanfensan subgenomes and each sequenced accession was calculated by mapping ~1× clean reads of each accession onto the Sanfensan subgenomes.
All sequenced diploid accessions were further subjected to transcriptome sequencing. The raw reads were cleaned with Trimmomatic. De novo assembly was performed using Trinity (v2.0.3) 75 with the default settings. CDSs were predicted using TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder). Proteomes from these diploid accessions was subjected to gene family analysis against Hordeum vulgare protein sequences by using OrthoFinder, and single-copy gene families that were conserved in all species were retained for further study. The protein sequences from each conserved gene family were aligned using MUSCLE (v3.8.31) 76 with the default parameters, and the alignments were CDS back-translated into CDS from the corresponding protein alignments. The same methods described above were used for phylogenetic tree construction and divergence time estimation.
The chloroplast genomes of all sequenced taxa were assembled with NOVOPlasty (v3.7) 77 using the clean Illumina short reads (Supplementary Table  13). Another 25 Avena chloroplast genomes previously released 22,23 (Supplementary  Table 14) were also downloaded. Multiple sequence alignments were performed using MUSCLE, and the identified informative sites were used for phylogenetic tree construction using RAxML with 100 bootstrap replicates under the GTR+I+Γ evolutionary model, where the chloroplast genome sequence from Triticum aestivum was used as the outgroup. Divergence times were estimated based on independent rates and the Jukes-Cantor 1969 (JC69) model using the MCMCTree program in the PAML (v4.7) package.
Synteny and comparative genomics. The subgenome synteny between Sanfensan and A. insularis was analyzed by plotting the positions of homologous pairs in the subgenome pairs within the context of 21 and 14 chromosomes using Circos (v0.69-9) 78 software (Extended Data Fig. 5a,b). The interchromosomal exchanges between A. insularis and Sanfensan after polyploidization were analyzed by individually mapping reads from A. longiglumis and A. eriantha to the A. insularis reference genome and reads from A. longiglumis, A. eriantha and A. insularis to the Sanfensan reference genome. The single-base depth coverage of the properly paired reads obtained from the A. longiglumis, A. eriantha and A. insularis mapping was calculated using the Mosdepth (v0.3.0) 79 program. The median depth within a sliding window (window size: 1 Mb, step size: 0.5 Mb) was calculated and plotted along the chromosomes of the reference genome ( Fig. 4b and Extended Data Fig. 6a-d).
FISH analysis. Major interchromosomal exchanges between the C and D subgenomes of A. insularis and Sanfensan were detected using FISH with the A and C genome-specific repeats As120a and Am1 as the probes. The large inversion observed on hexaploid chromosome 3 C was validated by FISH using two additional oligonucleotide probes, Oligo-5SrDNA and Oligo-6C343, as previously reported 80 , that gave stable fluorescent signals in regions flanking one of the breakpoints of this inversion on hexaploid chromosome 3C. The nucleotide sequences of the FISH probes were given in Supplementary Table 17. Metaphase chromosome preparation and FISH analysis were performed as described in the previous study 81,82 ; (Fig. 4c, Extended Data Fig. 6e, f).
Ka/Ks analysis. On the basis of the 7,353 one-to-one orthologous gene sets identified among the genome assemblies for Hordeum vulgare, we calculated the nonsynonymous (Ka) and synonymous substitution (Ks) rates for the A-genome (A. atlantica and A. longiglumis) and C-genome (A. eriantha) diploid progenitors of the hexaploid oat, and the subgenomes of A. insularis and Sanfensan. For this purpose, the homoeologous gene pair list was used as the input and the protein sequences from each gene pair were aligned using MUSCLE. PAL2NAL(v14) 83 was used to convert the peptide alignment to a nucleotide alignment, and the Ka and Ks values were computed between gene pairs using Codeml from PAML (v4.7) in free-ratio mode. All estimates with Ks < 0.01 were excluded from the analysis. The significance of the differences in Ka/Ks values between genomes (subgenomes) was estimated using the Wilcoxon rank-sum test for nonnormal distributions in R.

Gene loss and retention.
To measure the effects of polyploidization on gene loss and retention, we performed PAV analyses of the single-copy genes between the polyploids and the diploids. In this case, 14,624 one-to-one orthologues for the three diploids (A. atlantica, A. longiglumis and A. eriantha) were identified. The presence/absence of these genes in each subgenome of the tetraploid and hexaploid species were counted. The distributions of these conserved genes that were absent in the polyploids were displayed with the A. longiglumis genome as the reference. Expanded and contracted gene families for eight Avena (sub)genomes with the barley genome as reference were identified using CAFÉ (v5.2.1) 84 . Significantly overrepresented GO terms of the contracted gene families in the C subgenome of the hexaploid were identified using the R package 'topGO' . We further compared the gene family size changes in the hexaploid subgenomes and the A-and C-genome diploids based on the gene families identified by the OrthoFinder program. The scatter dots and regression lines were plotted (Extended Data Fig. 8e,f).
Gene expression analysis. Gene expression levels in each sample were quantified using the HiSAT2 (v2.2.1) 85 and HTSeq (v0.9.1) 86 pipelines. Differentially expressed genes were identified with the edgeR (v3.38.1) software package 87 (false discovery rate < 0.05 and |log2-fold change > 0.5). To analyze differences in the expression patterns of homoeologous genes, we undertook an initial analysis of the variation in expression among 12,225 homoeologous triplets. Triplet expression vectors were created by concatenating the observed gene expression values for the A, C and D homologs. Triplets that expressed at least one homolog across the sampled tissues were summarized in a triplet expression matrix. The expression values of the triplet expression matrix were log transformed (log 10 (transcripts per million + 1)), and the matrix was subjected to two-dimensional hierarchical clustering using 'hclust' implemented in R with the 'average' correlation distance and clustering. Heatmap visualization was performed using the heatmap.2 command from the R package gplots.
Characterization of AS variants. Transcripts from the RNA-sequencing data were first assembled using StringTie (v2.2.0) 88 with the default parameters based on the bam files generated from HISAT2, which were then integrated with the IsoSeq transcript assemblies using SQANTI3 (v5.0) 89 . The number of AS events in each gene locus was counted using SQANTI3 with the default settings.
Identification of R-genes. R-genes in the five Avena genomes were identified using InterProScan (v5.22) and DeepCoil (v2.0.1) based on domains from the Pfam (v14.2) database. The domains included coiled-coil (CC), kinase (KIN), Toll/ interleukin-1 receptor/resistance protein (TIR), nucleotide-binding site (NBS) and leucine-rich repeat (LRR). Non-canonical R-genes (which lack most conserved motifs but are nevertheless potential R-genes) were determined by BLAST searchers using manually curated R-genes from the PRGdb (v4.0) 90 database as reference sequences. Results with e-values of less than 1E-5 and top-hits with at least 30% query and subject coverage were considered to be the putative R-gene candidates.
To understand whether the identified R-genes were correlated with the map positions of the known quantitative trait loci for crown rust, one of the most serious diseases of oats, DNA markers co-segregating with or flanking the known crown rust genes (Supplementary Table 22) were mapped to the hexaploid Sanfensan reference genome by BLASTn analyses. The distributions of the R-genes and known quantitative trait loci are shown in Fig. 5b. Candidate gene for the hulless trait. A diverse collection of 659 oat accessions that included 510 hulled and 149 hulless oats were used for genotyping by sequencing (GBS) analysis (Supplementary Table 23). Variants were identified using the HaplotypeCaller module of GATK (v4.1.9.0). The resulting variants were further filtered to retain only bi-allelic sites with minor allele frequency > 0.05, missing data rate <20% and heterozygous calls <10% for genome-wide association scans.
Principal-component analysis and neighbor-joining trees were used to infer population structure of the oat collection using TASSEL 5.0 (ref. 91 ). Pairwise r 2 was calculated using an LD sliding window size of 50. The pattern of LD decay was visualized by plotting pairwise r 2 values against the physical distance (Mb). A smoothened LD curve was fit to the data using the R script developed by M. Ali (https://github.com/mohsinali1990/My_scripts/blob/main/LD; decay Plot from TASSEL LDoutput.R).
Genome-wide association scans for the hulless trait were performed with TASSEL 5.0 using a mixed-linear model incorporating the kinship matrix and population structure as covariates. A genome-wide threshold of −log(P) = 6.70, calculated from the formula −log 10 (0.01/effective number of SNPs) was used to identify markers associated with the hulless trait.
Ten hulled and 12 hulless oat accession (Supplementary Table 1) were selected for RNA-sequencing analysis. Genomic sequences of the candidate gene from five hulless and five hulled oats were aligned using the ClustalW 92 program. A total of 25 single-nucleotide changes were identified in the gene coding regions between hulled and hulless oats, with one SNP in exon 1 predicted to cause amino acid changes. This SNP was converted to a KASP marker (Supplementary Note) and validated in 286 oat lines randomly selected from the diverse oat collection.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The genome assemblies and sequence data for A. sativa ssp. nuda cv. Sanfensan, A. insularis (CN 108634) and A. longiglumis (CN 58139) were deposited at NCBI under BioProject codes PRJNA727473, PRJNA731599 and PRJNA716144, respectively. Sanfensan genome assembly (SAMN19770945), ONT data (SAMN19021785), Hi-C data (SAMN19340419), next-generation sequencing (NGS) data (SAMN19582572), IsoSeq data (SAMN19581880) and RNA-seq data (SAMN19582573, SAMN19582574); A. insularis genome assembly (SAMN19771048), ONT data (SAMN19291344), Hi-C data (SAMN19312172), NGS data (SAMN19579880) and IsoSeq data (SAMN19581879); A. longiglumis genome assembly (SAMN19771099), ONT data (SAMN18395928), . The NG contig length is calculated at integer thresholds (1% to 100%) and the contig size (in bp) for that particular threshold is shown on the y-axis. b, Completeness of the three assembled oat genomes and the related cereal crop species (as indicated in panel a), assessed using BUSCO. Bar charts show the percentages of 1,614 highly conserved plant BUSCO genes that are completely present (light and dark blue), fragmented (yellow), or missing (red) in each assemblies. c, The long terminal repeat (LTR) assembly index (LAI) scores of the three assembled Avena genomes. The x-axes show the seven chromosomes in each genome. The LAI scores, represented by dots, were calculated using 3 Mb-sliding windows with 1.5 Mb steps. The dark-blue lines indicate the average LAI score across each whole genome. d, The distribution of the average depth per million base pairs covered by the ONT reads. e, GC content distributions of the three assembled oat genomes. The GC content was determined using 2 kb nonoverlapping sliding windows. Fig. 10 | Comparison of alternative spicing (aS) events and Te density in the 12,225 strict 1:1:1 triplets in each subgenome of hexaploid oat. a, The AS number (≥2) of 12,225 homoeologous triads, as well as the AS number (≥2) of genes that showed balanced expression or preferential expression in the A, C, and D subgenomes (left to right) in each subgenome of hexaploid oat. The central line for each box plot indicates the median value. The top and bottom edges of the box indicate the first and third quartiles and the whiskers extend 1.5 times the interquartile range beyond the edges of the box. b, Comparison of TE densities near genes in the three subgenomes of hexaploid oat. TE densities near genes in the C subgenome are the highest relative to homoeologs in the A and D subgenomes. TE density was calculated in 5 kb windows upstream and downstream of the gene. c, Gene expression is negatively correlated with TE density in hexaploid oat. The 12,225 genes in each subgenome were equally divided into four groups based on their TE density, and their relative expression was then plotted in boxplots. The x-axis represents the TE density in each group. The data are presented as mean±s.d. (n = 9,168, 9,168, 9,168, and 9,171 indepent samples for groups A, B, C, and D, respectively). The central line for each box plot indicates the median value. The top and bottom edges of the box indicate the first and third quartiles and the whiskers extend 1.5 times the interquartile range beyond the edges of the box. Numbers of samples used in each assay in a and c are indicated as n. Two-tailed Student's t-test was used to generate the P values in a (*, P < 0.05, **, P < 0.01) and a pairwise t-test (two-tailed) was used to determine significance in c (means with the same letter are not significantly different at P < 0.05).