Common oat (A. sativa L., 2n = 6x = 42, AACCDD genomes) is an important cereal crop cultivated worldwide and has long been prized by consumers primarily because it is one of the richest sources of protein, fat and vitamin B1 among all crops1. In addition, oats are a widely grown cool-season annual forage species, and represent a major source of high-quality forage for livestock globally2. Common oat belongs to the genus Avena in the grass family Poaceae, and the genus comprises a polyploid series of wild, weedy and cultivated species distributed across six continents3. The diploids have either the AA or CC genomes; the tetraploids mainly contain the AABB or CCDD4 (previously AACC) genomes; all hexaploid species, including common oat, share the same AACCDD genomic constitutions5. Due to the lack of genome sequences from hexaploid oat and its close diploid and tetraploid relatives, the phylogenetic history and divergence time among the A, C and D genomic lineages remain unclear. Polyploid plants often have notable advantages in biomass production, vigor and adaptability to environmental changes, contributing to the emergence of important agronomic traits in food crops6,7,8. Oat has good adaptability to a wide range of climatic conditions, enabling oat to reliably produce grains in marginal regions with harsh conditions. Therefore, crop polyploidization plays an important role in next-generation crop improvement to overcome food security challenges8. Many genomes of commercially important crops have been sequenced and assembled, which has improved the understanding of crop evolutionary history and the development of efficient approaches for selecting important traits9. Oat has lagged behind in this regard, primarily due to the large genome10 that contains highly repetitive DNA sequences, and the fact that both the A- and D- subgenomes are similar to the A-genome diploid and were difficult to distinguish from one another in previous studies11. Currently, relatively little is known regarding the position and distribution of genes on each of the oat chromosomes and their evolution during the polyploidization events that gave rise to the hexaploid species, which limits the full and effective utilization of oat germplasm. With the recent advances made by Oxford Nanopore Technologies (ONT), the ONT system now offers ultralong sequence reads, delivering high contiguity with low assembly errors caused by long repetitive regions12. This technology has facilitated complete telomere-to-telomere (T2T) genome assembly in various species, including Homo sapiens, by resolving long, complex repetitive regions13,14. It is very suitable for assembling the large, complex polyploid oat genome, with a high content of repetitive sequences and high subgenomic homology. Here, we used the ultralong ONT system to sequence the genome of the oat variety ‘Sanfensan’ (A. sativa ssp. nuda), a hexaploid oat landrace that originated from the diversity center of hulless oat, together with assembling and resequencing the genomes of additional diploid and tetraploid Avena species to gain insights into the evolutionary processes that established the dominant subgenome in allopolyploid oat species and demonstrate the utility of the oat reference genome in identifying genes that underlie agronomic traits.


Assembly and annotation of the oat genome

We assembled the Sanfensan genome into 326 contigs based on 1,028 Gb of ultralong reads (N50 length: 52 kb; ~100× genome coverage). The draft genome assemblies were then corrected by using ~650 Gb of cleaned Illumina paired-end reads, resulting in a total assembly size of 10.76 Gb, 99.06% of which was arranged into 21 chromosomes (based on 1,296 Gb of Hi-C data) after dissociating 72 misjoined contigs into 182 contigs using three-dimensional proximity information (Extended Data Fig. 1a). The final assembly contained 436 corrected contigs with N50 of 75.27 Mb and a maximum length of 313.87 Mb. Of these, 323 contigs were anchored onto 21 pseudochromosomes (Table 1, Supplementary Tables 13 and Extended Data Fig. 1a). The quality of the Sanfensan genome assembly was supported by assessments including NG plots, long terminal repeat (LTR) assembly index (LAI; 18.34), BUSCO (99.44%) and the coverage of full-length transcripts by the assembled genome (92.00%) (Extended Data Fig. 2), together with the markers from hexaploid oat consensus map15 and high level of syntenic relationship with the other hexaploid OT3098 v2 reference genome ( (Extended Data Fig. 3). In addition, we found that 99.87% and 96.30% of the assemblies were covered at more than 20× sequencing depth with ultralong reads and short reads, respectively, indicating high accuracy at the nucleotide level (Extended Data Fig. 2d,e). Finally, 99.75% of the 4,336,693,678 Illumina paired-end reads could be mapped onto the assembly, with 98.22% properly paired alignments. From this mapping result, 98,885 homozygous single-nucleotide polymorphisms (SNPs) and 19,444 insertions/deletions (InDels) were identified in Sanfensan, giving an estimated overall nucleotide accuracy rate of 99.999%. We also identified 1,093,198 heterozygous SNPs and 66,019 heterozygous InDels, giving an overall heterozygosity rate of 0.011%, indicating that the Sanfensan genome is largely homozygous. We annotated a total of 120,769 protein-coding genes, 88.41% of which were assigned to a predicted function and 33.69% of which have alternative splicing (AS) variants (Table 1, Supplementary Tables 4 and 5). Moreover, 86.95% (9.35 Gb) of the assembly was annotated as repetitive elements (Supplementary Table 6), which is higher than previously reported genomes of barley (80.80%)16 and bread wheat (84.70%)9.

Table 1 Genome assembly statistics for the diploid, tetraploid and hexaploid Avena species

To distinguish the subgenomes accurately and clarify the polyploidization history of the hexaploid oat, we sequenced and assembled its most likely ancestral species A. longiglumis (2n = 2x = 14, AlAl genome) and A. insularis (2n = 4x = 28, CCDD genome)5, resulting in >60× genome coverage for A. longiglumis (218.67 Gb) and A. insularis (374.77 Gb). The assembled A. longiglumis genome was 3.74 Gb in size, 99.20% of which was anchored onto seven chromosomes with the A. atlantica genome as the reference17. The tetraploid A. insularis genome assembly was 7.52 Gb in size, 95.08% of which was arranged into 14 chromosomes by Hi-C analysis (Extended Data Fig. 1b). We annotated 43,477 and 89,995 protein-coding genes in the A. longiglumis and A. insularis genomes respectively (Table 1). Similarly, most of the A. longiglumis (86.83%) and A. insularis (87.11%) genomes are composed of repetitive elements (Table 1 and Supplementary Table 6).

Based on the genomic similarity among these three genomes, we successfully assigned the 21 pseudochromosomes of hexaploid oat to the A, C and D subgenomes (Extended Data Fig. 4a–d). The subgenome assignment was supported by the over-representation of the A- and C-genome-specific satellite repeats, As120a and Am1, respectively18,19, in the corresponding subgenomes (Fig. 1), and by mapping the whole-genome sequencing reads from a range of Avena species (Extended Data Fig. 4e,f). The nomenclature system of OT3098 v2 was adopted for naming the chromosomes of Sanfensan, which was consistent with that approved by the International Oat Nomenclature Committee.

Fig. 1: Circos display of the genomic features of the assembled diploid A. longiglumis, tetraploid A. insularis and hexaploid A. sativa ssp. nuda genomes.
figure 1

a, The inferred centromere positions of the A and D genome chromosomes. b, The distribution of the C genome-specific repeat Am1 along each chromosome. The Am1-rich regions on chromosomes 1A, 2D, 3D, 4D and 5D of Sanfensan are C genome introgressions. c, The distribution of the A genome-specific repeat As120a along each chromosome. d, k-mer frequencies. e, Tandem repeat (TR) density (MaxPeriod ≤500 bp). f, LTR retrotransposon density. g, Gene density. h, Chromosome names and sizes. The innermost layer shows synteny of hexaploid and its ancestor species, with colored upper-layer links representing syntenic blocks of each hexaploid chromosome and its ancestor chromosome, and the gray low-layer shows chromosome rearrangements during hexaploidization.

Source data

Evolutionary position of oat among cereal crops

The Poaceae family consists of many agronomically important species, commonly known as cereals, that are classified into three subfamilies: Oryzoideae (rice), Panicoideae (maize, sorghum) and Pooideae (Triticeae: wheat, barley and rye; Aveneae: oat). Among them, Oryzoideae and Pooideae belong to the BOP clad (Bambusoideae, Oryzoideae and Pooideae), which is one of the two primary groups in the Poaceae family. To further clarify the evolutionary position of oat among cereal crops, we used our oat reference genome assemblies to perform phylogenomic analysis to include oat. Gene family analysis across 23 subgenomes from 16 species belonging to the BOP clade identified 2,237 single-copy orthologs (Supplementary Tables 79). Phylogenomic analyses (Fig. 2a) revealed that the divergence between Aveneae and Triticeae took place after the speciation of Oryzoideae, with the approximate times for the two events being 28.2 and 47.9 million years ago (mya), respectively. The tribes Aveneae and Lolieae were indicated to be more closely related than Triticeae. The diversification of oat species occurred ~8.7 mya, which is earlier than wheat species (~6.6 mya) and falls within the previously estimated speciation time of Avena diploids (5.4–12.9 mya)17. We compared the gene families among Aveneae, Triticeae and Lolieae and found 6,425 common gene families in these three tribes, and there are 1,608 gene families specific to Aveneae (Fig. 2b). The top three most enriched Gene Ontology (GO) terms are ‘response to auxin’, ‘enzyme inhibitor activity’ and ‘transcription factor activity’ (Supplementary Table 10).

Fig. 2: Phylogenomic relationships of cereal crops.
figure 2

a, Phylogeny and time scale of 23 subgenomes from 16 cereal crops and related grass species, including panicle images of the 11 sampled cereal crops. The number in the upper left corner of each panicle image corresponds to the number following the species name in the phylogenetic tree. Scale bars, 1 cm. b, Venn diagram showing the numbers of shared and unique gene families in the tribes of Aveneae, Lolieae and Triticeae. c, Probable chromosome evolutionary scenario of oat and wheat species. The subgenome chromosomes (1–7) are presented with a color code to show different segments from the 12 chromosomes of rice (Os1–Os12), which can be used as the representative of the ancestral grass chromosomes (AGK1–AGK12). The abbreviations in the top of the subgenomes are accordance with the abbreviations indicated in the brackets of panel d. d, Chromosome evolution based on barely genomes and chromosomal synteny between the three subgenomes of oat and wheat.

Source data

The ancestral grass karyotype (AGK) was previously reconstructed as a post-ρ AGK with 12 protochromosomes and a pre-ρ AGK with seven protochromosomes by comparing extant species20. Rice was identified as the most slowly evolving species and has 12 chromosomes, most of which closely resemble the post-ρ AGK. Based on the phylogenomic analysis, Triticeae (wheat, barley and rye) and Aveneae (oat) clustered together with Oryzoideae (rice) as an outgroup. We therefore used the rice genome as the ancestral reference and the barley genome as a closely related reference to investigate the chromosomal evolution of oat and to compare it with another allohexaploid cereal species, wheat. We identified 733 and 872 syntenic blocks that included 33,065 and 37,042 orthologous gene pairs from the subgenomes of hexaploid oat based on rice and barely, respectively. (Fig. 2c and Supplementary Table 11). Essentially, the third homologous group of oat and wheat was derived from a single ancient chromosome, AGK1 or Os1, except for the AGK3 or Os3 segment that was translocated to chromosome 3C in A. eriantha. The second group was mainly derived from the insertion of AGK7 or Os7 into AGK4 or Os4, and the remaining five groups were each derived from at least three ancestral chromosomes via complex translocations (Fig. 2c). When the oat assembly was compared with the three subgenomes of common wheat using barley genomes as a reference, a large number of chromosomal rearrangements were identified. For example, although the second homoeologous group of oat and wheat are mainly derived from AGK4 or Os4 and AGK7 or Os7, the arrangement patterns of these two ancestral chromosomes are different.

Polyploidization history and reticulate evolution

Previous phylogenetic studies could not be well distinguish between the A and D subgenomes of Avena, which led to confusion and gave inconsistent results about the origins of hexaploid oat3,4,5,11,21. We have now definitely assigned chromosomes to the A and D subgenomes with the resolution afforded by the complete genome assemblies. Based on this foundation, and to identify the subgenome donors accurately, we resequenced the genomes of 14 Avena species representing different genomic subtypes and ploidy levels (As, Al, Ac, Ad, Cv, Cp, AB and CD). A total of 5,014.96 Gb of genome resequencing data (Supplementary Table 1) were mapped against the A, C and D subgenomes (Methods and Supplementary Table 12). The identity distribution clearly showed that the Al/As diploid species, and the C and D subgenomes of A. insularis have the highest similarities with the A, C and D subgenomes of hexaploid oat, respectively (Fig. 3a). The SNP-based phylogenetic tree of the three subgenomes consistently implies that the species cluster according to the genome subtypes, in which Al and As are the most closely related to the A subgenome of hexaploid oat, whereas none of the extant diploids shows a particularly close relationship with the C and D subgenomes (Fig. 3b). These results are further supported by the phylogenetic analyses of transcriptome data from 11 diploid species (Fig. 3c), suggesting that the A-genome progenitor of hexaploid oat was more likely the ancestor of the Al and As diploids than an extant diploid species, which may be the reason that previous studies22 have considered different A diploids to be the A-genome donors of hexaploid oat.

Fig. 3: Phylogenetics and polyploidization of Avena species.
figure 3

a, Sequence similarities of reads from different Avena species that were uniquely mapped to the A, C and D subgenomes of Sanfensan. b, Phylogenetics of the A-, C- and D-genome lineages generated from the SNPs. c, Phylogenetic analysis based on 7,353 nuclear genes using barley as the outgroup. d, Unrooted phylogenetic tree constructed using the complete sequences of the Avena chloroplast genomes. e, Model showing the evolutionary history of hexaploid oat (A. sativa ssp. nuda; AACCDD).

Source data

We also assembled the chloroplast genomes of different Avena species (Supplementary Table 13), as the maternal inheritance characteristic can be used to determine the maternal genome donor to hexaploid oat. By including 25 previously reported chloroplast genomes (Supplementary Table 14)22,23, our phylogenetic analysis (Fig. 3d) showed that the C genome was undoubtedly the male parent in the polyploidization and that the D genome, rather than the A genome, was the maternal donor in hexaploid oat. The evolutionary order of the different A-genome subtypes was Ac-Ad-Al-As (Fig. 3d). The relatively low collinearity between the C genomes of the diploid and polyploid species is consistent with the nuclear–cytoplasmic interaction hypothesis, which suggests that the paternally inherited genome of an allopolyploid is usually more prone to genetic changes than the maternally derived genome24.

In addition, our results indicate that the D-genome progenitor of hexaploid oat is more closely related to the A-genome than to the C-genome and may be extinct. The C and A/D lineages diverged approximately 8 mya, followed by the A-genome subtypes (Ac/Ad) and the D genome around 3.5 mya. Cultivated ACD-genome hexaploid oat originated around 0.5 mya from the hybridization between a paternal Al/As-genome diploid ancestor and a maternal CD-genome tetraploid that is closely related to A. insularis and originated from an allotetraploidy event between a paternal C-genome and a maternal D-genome diploid (Fig. 3e). These findings clarified the evolutionary history of oats based on various pieces of evidence at the genomic level and provided the most possible clues to the subgenomic origin of hexaploid oat, which will be of great value for introgression breeding and the transfer of traits from the closest extant wild relatives (As/Al genome diploids and CD genome tetraploids) to cultivated oat.

Subgenome structure and dominance

Consistent with the inference that the A and D subgenomes of oats are closely related, the synteny between them (54.40% genes in collinear blocks) is more extensive (P < 2.2 × 10−16, Fisher’s exact test) than that between the C and A/D subgenomes in hexaploid (43.59% in C and A and 41.43% in C and D) and tetraploid (38.81% in C and D) (Extended Data Fig. 5). To investigate the chromosomal rearrangement events that occurred during the evolution of polyploid oats, we conducted a comprehensive synteny analysis among the diploid, tetraploid and hexaploid species (Fig. 4a,b, Extended Data Fig. 6a–d and Supplementary Tables 15 and 16). The results show that two large D-to-C and four C-to-D chromosomal translocations (>40 Mb) occurred during the formation of tetraploid A. insularis. A translocation from chromosome 1C to 1A (>100 Mb) and an inversion in 3C (>130 Mb) subsequently occurred in the hexaploid Sanfensan genome; six of these structural rearrangements were further confirmed by FISH (fluorescence in situ hybridization) assays (Fig. 4a–d, Extended Data Fig. 6 and Supplementary Table 17). Chromosomal rearrangements have been implicated as one of the driving forces for shaping adaption and speciation by affecting gene expression through position effects25,26,27. In oat, the 1C/1A translocation (previously designated as 7C/17A) is well known to be associated with the division of cultivated oat into A. sativa L and A. byzantina K. Koch (sub)species28 and variations in crown freezing tolerance and winter field survival29,30. Functional enrichment analyses showed that ‘multicellular organism development’ (GO:0007275, 37/112, BH-Adjusted P = 9.69 × 10−12), ‘ubiquitin-dependent protein catabolic process’ (GO:0006511, 51/218, BH-Adjusted P = 4.84 × 10−9), and ‘oxidation-reduction process’ (GO:0055114, 373/3,273, BH-Adjusted P = 3.00 × 10−6) are the top three most enriched biological processes terms for genes positioned within the six large intergenomic translocations that occurred during tetraploidization (Supplementary Table 18). Genes positioned within the 1A/1C translocation were significantly enriched for two biological process terms ‘photosynthesis, light harvesting’ (GO:0009765, 11/139, BH-Adjusted P = 2.92 × 10−2) and ‘regulation of systemic acquired resistance’ (GO:0010112, 5/23, BH-Adjusted P = 2.92 × 10−2) (Supplementary Table 19). These processes are essential for responses to abiotic and biotic stress and thus might be involved in local environmental adaption and (or) speciation in oat.

Fig. 4: Subgenome evolution in Avena species.
figure 4

a, Synteny between the subgenomes of hexaploid oat and the putative tetraploid and diploid ancestors. The yellow and blue arrows and lines connecting the chromosomes represent the observed large chromosomal translocations (>40 Mb) from the A and C genomes, respectively. The dark gray arrow and line indicate the inversion in chromosome 3C between A. insularis and Sanfensan. b, Large C-to-A or C-to-D translocations are supported by mapping reads from the C genome diploid to the hexaploid reference genome; blue arrows indicate C-to-D and C-to-A intergenomic translocations. c, FISH using the C genome-specific repeat as a probe confirms the C-to-A and C-to-D translocations. Fluorescence signals from the A genome-specific repeat (As120a) are shown in green, and signals from the C genome-specific repeat (Am1) are in red. The white arrows indicate C-to-D and C-to-A intergenomic translocations. d, FISH confirms the inversion on chromosome 3C observed in the hexaploid oat genome. Oligo-5SrDNA (red) and Oligo-6C343 (green) gave clear hybridization signals on the short and long arms of the tetraploid 3C, respectively, whereas both signals are observed on the long arm of chromosome 3C in the hexaploid. For karyotyping in panels c and d, at least three slides for each accession and ten chromosomes per slide were examined. Scale bars, 2 μm. e, Comparison of Ka/Ks value distributions between the three subgenomes of hexaploid oat. The central line for each box plot indicates the median. The top and bottom edges of the box indicate the first and third quartiles and the whiskers extend 1.5 times the interquartile range beyond the edges of the box. Numbers of samples used in each assay are indicated as n. The asterisks represent significant differences (two-tailed Wilcoxon rank-sum test, **P < 0.01). f, Two-dimensional hierarchical cluster analysis of gene expression among single-copy homoeologous oat genes compared with organ-specific gene expression. TPM, transcripts per million. g, Analysis of log2-fold changes in pairwise gene expression between homoeologous genes showed biased expressions. DEGs, differentially expressed genes. Dot plots show the fold changes for each triplet ordered as shown in the y axis (f). The numbers of significantly differentially expressed triplets (one-tailed Fisher’s exact test, P < 0.05) across all organs are shown at the top of the box, and the significance of the differences in the values between subgenomes was estimated using the one-tailed Wilcoxon rank-sum test (*P < 0.05, **P < 0.01).

Source data

Unlike the chromosomal translocations in the tetraploid A. insularis (3.91%, 99.64/2,549.33 Mb), we found that 49.69% (1,054.29/2,121.61 Mb) of the translocated sequences in the hexaploid occurred between homoeologous chromosomes (Supplementary Tables 15 and 16 and Extended Data Fig. 7). These results suggest that the homoeologous rearrangements after hexaploidization played an important role in forming the genome structure of cultivated oats. Moreover, the homoeologous rearrangements in hexaploid oat appeared to be biased among the three subgenomes in that 88.4% (931.94/1054.30 Mb) occurred between the A and D subgenomes, which is much higher than that occurred in A and C (11.2%, 117.71/1054.30 Mb) or D and C (0.04%, 4.66/1054.30 Mb). This finding may support the hypothesis that the presence of two well-conserved homologous genomes in the same nucleus would facilitate inter-subgenome recombination and rearrangement after polyploidization31.

Assembly results showed that the C subgenome is ~20% larger than the A and D subgenomes, which can be explained almost entirely by the relative abundances of repeat (Supplementary Table 6). Intriguingly, we observed biased gene fractionation among the subgenomes of hexaploid oat. First, the identification of presence/absence variations (PAVs) (lost genes in tetraploid C vs D, 1,618 vs 1,315, degrees of freedom (df) = 1, P = 4.13 × 10−9; hexaploid C vs A, 2,367 vs 790, df = 1, P < 2.2 × 10−16; and C vs D, 2,367 vs 1,079, df = 1, P < 2.2 × 10−16; chi-squared test) and pseudogenes (C vs A, 28,056 vs 23,695, df = 1, P < 2.2 × 10−16; C vs D, 28,056 vs 24,009, df = 1, P < 2.2 × 10−16; chi-squared test) consistently showed a higher rate of gene loss in the C subgenome of hexaploid oat (Extended Data Fig. 8a–c). More contracted gene families in this subgenome (C vs A: 1,100 vs 569; C vs D: 1,100 vs 535) also support this result (Extended Data Fig. 8d–f). The contracted gene families are mainly related to ‘fructose 6-phosphate metabolism’, ‘the glycine metabolism’ and ‘translational termination’ (Supplementary Table 20). Second, using barley as the outgroup, we estimated the nonsynonymous to synonymous substitution rate ratios (Ka/Ks) of the 7,353 single-copy orthologs identified between the oat (sub)genomes. The results showed that the polyploids exhibited an accelerated rate of evolution relative to their A- and C-genome diploid progenitors and that the C subgenome was subject to less purifying selection than the other two subgenomes of hexaploid oat (Fig. 4e and Extended Data Fig. 9). Third, examination of the orthologs expression patterns showed that the number of preferentially expressed genes in the C subgenome was significantly lower than that in the A (up-regulated genes in A vs C, 9,375 vs 7,541, P = 0.009548, Wilcoxon rank-sum test) and D (D vs C, 9,359 vs 8,139, P = 0.04693, Wilcoxon rank-sum test) subgenomes (Fig. 4f,g and Supplementary Table 21). Interestingly, these expression-biased genes have more AS variants than those with balanced expression (Extended Data Fig. 10a). For instance, 725 genes were preferentially expressed in the A subgenomes. Of them, 307 have AS variants with an average of 3.10, which was significant more than that of their homeologs in the C (235 genes have AS variants, mean AS variants of 2.77, P = 0.018, Student’s t test, df = 540) and D (246 genes have AS variants, mean AS variants of 3.70, P = 0.0036, Student’s t test, df = 551) subgenomes. These results indicated that biased expression patterns may be closely related to the proteome diversity. Fourth, we found that the C subgenome contained more transposable elements (TEs) (A vs C vs D, 2.88 Gb vs 3.61 Gb vs 2.76 Gb) (Supplementary Table 6) and showed a higher overall TE density near genes than did the A and D subgenomes (C vs A, 0.454 vs 0.438, P = 4.595 × 10−9; C vs D, 0.454 vs 0.437, P = 4.595 × 10−9, Student’s t test, df = 24,448) (Extended Data Fig. 10b). We further found that genes with relatively higher TE densities near genes tend to have lower expression levels (Extended Data Fig. 10c). This observation is consistent with the hypothesis that subgenome gene expression dominance is influenced by TE-density differences between subgenomes as observed in other allopolyploids7,32. Taken together, these results demonstrate subgenome dominance in hexaploid oat.

Genes related to important agronomic traits

Oat production is threatened by several agricultural diseases, one of which is crown rust caused by the basidiomycete fungus Puccinia coronata var. avenae. Nucleotide-binding site leucine-rich repeat (NBS-LRR) proteins, which are encoded by a class of resistance genes (R-genes), play important roles in plant immune signaling33. We identified 1,269 R-genes across the three subgenomes of Sanfensan, showing a contraction compared with the numbers identified in the tetraploid and diploid (sub)genomes (Fig. 5a). Most of these R-genes occur in clusters at the distal ends of all chromosome arms. Mapping DNA markers cosegregating with or flanking the known crown rust genes (Supplementary Table 22) revealed these genes colocalize with R-genes (Fig. 5b). These discoveries could serve as promising candidates for future mapping and gene cloning studies.

Fig. 5: R-genes identified in Avena genomes, and distribution on each chromosome of Sanfensan.
figure 5

a, Comparison of the number of predicted R-genes identified in the genomes of hexaploid oat and its putative progenitors. b, Genomic distribution of the R-genes and the location of known quantitative trait loci for crown rust resistance.

Source data

Cultivated oats are generally classified into two production-related morphological types, hulled and hulless, which is one of the domestication traits (Fig. 6a). The caryopsis of hulled oat is tightly surrounded by a thick, tough and hard-to-remove hull, whereas hulless oat has a papery hull that is easily threshed when mature34. To identify genomic loci contributing to the hulless trait in oat, we performed a genome-wide association analysis based on 49,702 SNPs from 659 diverse oat accessions (Supplementary Table 23). We first infer the population structure and linkage disequilibrium (LD) pattern of the oat collection. Both principal-component analysis (Fig. 6b) and neighbor-joining tree analysis (Fig. 6c) revealed weak population structure, which is consistent with previous studies35,36,37. Most of the hulless landraces were tightly clustered reflecting domestication bottleneck for hulless oat34. The LD decay rate was measured as the physical distance at which the average pairwise r2 dropped to 0.2. Genome-wide LD decay rate of this population was estimated at 2.29 Mb (Fig. 6d), The long-range LD observed in oat is similar to that in other self-fertilizing species such as wheat38,39 and barley40,41. An association scan detected a strong peak on the end of chromosome 4D, which collocated with the previously reported N1 locus34,42 (Fig. 6e,f). We searched the vicinity of the peak for plausible candidate genes with particular attention to genes within the genomic region covered by markers flanking the N1 locus42 (Fig. 6f), and came across a gene (A.satnudSFS4D01G000045) annotated as a receptor-like kinase, which is about 30 kb distant from the most highly associated marker. A homologous gene AtVRLK1 in Arabidopsis was found to be involved in secondary cell wall thickening43, and the mutant of the homologous gene mis2 in rice displayed an open hulled spikelet44. Comparison the coding sequences (CDSs) of A.satnudSFS4D01G000045 identified a SNP in first exon predicted to cause amino acid changes (Fig. 6g). The association of this SNP with the hulless trait was validated by the development of KASP (Kompetitive allele-specific polymerase chain reaction) markers (Fig. 6h). By comparing transcriptome data between 10 hulled and 12 hulless oats, we found A.satnudSFS4D01G000045 is differentially expressed with hulless oats having higher expression levels (P < 0.01, Student’s t test) (Fig. 6i). Further examination of the expression patterns in the panicles at different developmental stages revealed this gene is highly expressed during panicle development in Sanfensan (hulless) but is expressed at very low levels in panicles of Ogle (hulled) (Fig. 6j, Student’s t test). These results indicated that A.satnudSFS4D01G000045 may be a promising plausible candidate gene that controls the hulled/hulless trait in oat.

Fig. 6: Genome-wide scan for the hulless grain trait.
figure 6

a, The spikelets and kernels of hulled and hulless oats. Scale bar, 1 cm. b, Population structure of 659 oat lines. Scatter plots of the first principle component (PC1) versus PC2 and PC1 versus PC3. c, Neighbor-joining tree constructed based on the Euclidean distance metrics between taxa. d, LD decay across the whole genome. The r2 values of LD were plotted against the physical distance (Mb). LD dropped to 0.2 at 2.29 Mb across the whole genome. e, Quantile–quantile plot of the MLM model for hulless grain. Linear regression was used to generate the P values. The dotted lines show the 95% confidence interval for the QQ-plot under the null hypothesis of no association between the SNP and the trait. f, Manhattan plot for candidate region association mapping of N1. The red horizontal line indicates the Bonferroni-adjusted significance threshold (−log10 P = 6.70). The zoomed-in region displays the physical positions of the markers GMI_ES22_c7478_431 and GMI_ES14_c19259_657 flanking the N1 locus42, and the mostly associated markers. g, Gene structure of the candidate gene A.satnudSFS4D01G000045. Green boxes are exons, and lines between the boxes are introns. The location of a SNP that resulted in an amino acid change is shown in the diamond shaped box. h, The association between A.satnudSFS4D01G000045 and the hulless trait was further confirmed by a KASP marker derived from the SNP. HEX, hexachloro fluorescein; FAM, fluorescein amidite. i, Expression level of A.satnudSFS4D01G000045 in RNA samples equally mixed from seven tissue/conditions of 10 hulled and 12 hulless oat lines. The central line for each box plot indicates the median. The top and bottom edges of the box indicate the first and third quartiles and the whiskers extend 1.5 times the interquartile range beyond the edges of the box. j, Comparison of expression levels of A.satnudSFS4D01G000045 in panicles at different developmental stages between Sanfensan (hulless) and Ogle (hulled). The data are presented as mean ± s.d. (n = 3 biological replicates). S1, S2, S3 and S4 represent panicles at the booting (Zadok’s 45), heading (Zadok’s 50 and 58) and grain dough (Zadok’s 83) stages, respectively. Two-tailed Student’s t test was used to generate the P values in panels i and j (*P < 0.05, **P < 0.01).

Source data


The Sanfensan oat, together with its diploid and tetraploid ancestors reference genomes presented here, constitutes an important community resource for cereal genomics and provides comprehensive and specific insights into the evolutionary history of oat. This study is the most comprehensive phylogenomic analysis of Avena to date, as it included samples representing all extant Avena genomes and developed the largest number of molecular markers evaluated thus far. Whole-genome sequencing, the de novo assembly of transcriptomes and complete plastid genomes show agreement on conclusion. The proposed model for the chronological formation of polyploid oat has been clarified, and the investigation into the evolution of the oat subgenomes during polyploidization events offers a window into the study of polyploid genome evolution. The high-quality reference genomes will facilitate the identification of genes that are related to important agronomic traits, including disease resistance and end-use functionality. These genomic resources are an important step toward the genetic improvement of oat.


Plant materials

The hexaploid oat cultivar Sanfensan (2n = 6x = 42, AACCDD), the diploid species A. longiglumis (accession CN 58139, 2n = 2x = 14, AlAl) and the tetraploid species A. insularis (accession CN 108634, 2n = 4x = 28, CCDD) were selected for genome sequencing and assembly. Sanfensan is a traditional hulless oat landrace that has a long cultivation history in Shanxi, China, which has been widely used as a parental line in hulless oat breeding programs, whereas A. insularis and A. longiglumis have been considered as the most likely tetraploid and diploid ancestors of hexaploid oat4.

Genome sequencing and assembly

The Illumina HiSeq X-Ten or MGISEQ-2000 platforms were used to generate short paired-end reads from genomic DNA isolated from Sanfensan, A. insularis, and A. longiglumis. A total of 649.68 Gb, 404.97 Gb, and 204.68 Gb of raw sequencing reads were generated for Sanfensan, A. insularis and A. longiglumis, respectively (Supplementary Table 1). ONT sequencing was used to generate long nanopore reads for these three taxa. Considering the large and complex genome of hexaploid oat, we used ONT ultralong sequencing to sequence the genome of Sanfensan, whereas ONT regular long reads were generated for the A. longiglumis and A. insularis genomes. A total of 1,260.30 Gb, 481.39 Gb, and 268.74 Gb of raw ONT (ultra-)long reads were produced for Sanfensan, A. insularis and A. longiglumis from 71, 7 and 8 libraries, covering their genomes at approximately 100×, 60× and 60× depth, respectively (Supplementary Table 2).

The raw ONT long reads were subjected to self-correction using the NextCorrect module implemented in NextDenovo (v2.0-beta.1) ( The corrected reads were then assembled into contigs using NextDenovo (Supplementary Table 3). To improve the accuracy of the assembled contigs, a two-step polishing strategy was applied: (1) the corrected Nanopore reads were first used for initial polishing with Racon (v1.4.21)45 (three rounds), and (2) the highly accurate short reads were then used to further correct the assemblies with NextPolish (v1.0.5) (four rounds).

We used Hi-C technology to obtain chromosome-level genome assemblies of Sanfensan and A. insularis. A total of 1,312.83 Gb and 816.93 Gb of raw Hi-C data were generated for Sanfensan and A. insularis, respectively. The polished contigs of Sanfensan and A. insularis were further clustered, ordered and anchored to pseudochromosomes using Hi-C data with the LACHESIS program46. We used RaGOO (v1.1)47 with the default parameters to anchor the contigs of A. longiglumis to seven pseudochromosomes with the previously published As genome of the diploid species A. atlantica17 as the reference.

Genome quality assessment

Multiple approaches were used to evaluate the quality of the assembled genomes. First, the NG values for all integer thresholds (1–100%) were calculated and used to generate an NG graph for each of the three assembled genomes to estimate the continuity of the assemblies. Second, the short paired-end reads from each species were mapped to their corresponding genome assemblies using BWA (v0.7.10-r789)48 with default settings. Variants were called using the HaplotypeCaller module of GATK (v4.1.9.0)49. The obtained SNPs and InDels were used to estimate the base-level accuracy and relative low heterozygosity of the three assemblies. Third, the BUSCO (v5.2.2)50 pipeline was used to evaluate the completeness of the gene space. Finally, the LAI51 was used to evaluate the completeness in the more repetitive genomic regions.

Subgenome assignment

A reference-guided strategy based on subgenome sequence similarity was used to distinguish the subgenomes of A. insularis and Sanfensan. The A. longiglumis reference genome was split into 100-bp nonoverlapping fragments and mapped onto the hexaploid genome assembly using BWA with the default settings. The uniquely mapped fragments were retained (Extended Data Fig. 4b–d). A syntenic block was defined based on the presence of at least five syntenic fragments (Extended Data Fig. 4a,b). The chromosomes with the highest similarity to A. longiglumis were assigned to the A subgenome, and the chromosomes with the lowest similarity were assigned to the C subgenome, because previous studies have shown that the C genome is highly diverged from the A and D genomes4,5; the remaining chromosomes with median similarity were accordingly assigned to the D subgenome (Extended Data Fig. 4a). Using the same method, the A. insularis genomic sequences were aligned to the Sanfensan reference genome, and they showed a high level of similarity with the hexaploid chromosomes that were assigned to the C and D subgenomes (Extended Data Fig. 4b,d). The subgenome assignments were further validated by the quantified depth of coverage of the paired-ended reads from A. longiglumis and A. insularis t (Extended Data Fig. 4e,f) and the abundance of two types of DNA repeats in the A. insularis and Sanfensan genomes, which have been reported to be overrepresented in the Avena C18 and A19 genomes, respectively (Fig. 1).

Repeated sequences and protein-coding gene annotation

Each of the three whole-genome assemblies was searched for repetitive sequences including tandem repeats and TEs. Tandem repeats were identified by using GMATA (v2.2)52 and Tandem Repeats Finder (v4.07b)53. A species-specific de novo repeat library was constructed using MITE-Hunter54, LTR_FINDER (v1.0.5)55, and RepeatModeler (v2.0.1) ( RepeatMasker (v1.331)56 was then adapted to search for TEs in the reference genome against Repbase (v19.06)57 and the species-specific de novo repeat library.

Protein-coding genes were predicted using an evidence-based annotation workflow by integrating evidence from transcriptomic data, homologue searches and ab initio prediction. Transcriptomic data were generated by performing PacBio full-length transcriptome sequencing using total RNA isolated from mixed plant organs. Raw reads were processed with IsoSeq3 pipeline ( /IsoSeq) to identify full-length, nonchimeric circular consensus sequences (CCSs). The resulting high-quality CCSs were mapped onto the reference genome for de-redundancy. The nonredundant isoforms were then used to determine the locations of potential intron-exon boundaries using GeneMarkST58. Protein sequences from A. atlantica, A. eriantha, Brachypodium distachyon, Hordeum vulgare, Oryza sativa and Triticum aestivum were used as protein evidence. Ab initio gene prediction was performed using GeneMark-ET (v4.0)59 and AUGUSTUS (v2.4)60 with two rounds of iterative training. All gene predictions were integrated using the recommended settings of EVidenceModeler (v1.1.1)61 after removing TE-related genes, pseudogenes and noncoding genes using TransposonPSI (v1.0.0)62 with the default settings.

Noncoding RNAs, including microRNAs, small nuclear RNAs, ribosomal RNAs and regulatory elements, were identified using the Infernal (v1.1.2)63 program to search against the Rfam database64. The ribosomal RNAs, transfer RNAs and microRNAs were identified using RNAmmer (v1.2)65, tRNAscan-SE (v2.0)66 and miRanda (v3.0), respectively. The putative domains and GO terms of the predicted proteins were identified using the InterProScan (v5.22)67 program with the default settings.

Gene family analysis

Sixteen grass species belong to BOP clade (Supplementary Table 7) for which high-quality reference genomes were available, mainly representing the Pooideae subfamilies Aveneae, Lolieae and Triticeae tribes that contain the main cereal crops were used for gene family clustering analyses. Gene families of the 23 subgenomes from 16 species were identified using the OrthoFinder (v2.2.7)68 program with default settings. The genes that were exclusively found in each tribe (Aveneae, Lolieae and Triticeae) were identified. Significantly overrepresented GO terms in each group were identified using the R package ‘topGO’ (

Identification of orthologous and homoeologous gene sets

Using the gene families identified by the OrthoFinder program, 2,237 one-to-one orthologous gene sets were identified for the 23 subgenomes of 16 grass species. We identified 7,353 one-to-one orthologous gene sets for the eight Avena (sub)genomes and H. vulgare cv. ‘Morex’. We also identified 12,225 one-to-one homoeologous gene sets (triads) for hexaploid oat (Supplementary Table 9).

Phylogenomic analyses of cereal crops

To investigate the evolutionary position of oat among cereal crops, a phylogenetic tree was constructed using the 2,237 conserved single-copy genes among the 23 subgenomes of 16 grass species. For this purpose, conserved CDS alignments of these single-copy orthologues were extracted by using Gblocks (v0.9b)69 and were concatenated to generate a supermatrix. The transversion rates at fourfold degenerate (4DTv) sites were extracted from this supermatrix and subject to analysis with RAxML (v.8.2.7)70 to generate the maximum likelihood tree using the GTR + I + Γ model. Divergence times were estimated using the MCMCTree program in the PAML(v4.7) package71.

To investigate the chromosomal evolution of the Avena genomes, we selected oat diploid, tetraploid and hexaploid species together with wheat to analyze the synteny between these genomes and the rice or barely genome using MCScanX (git-97e74f40)72 with the default settings, and the identified syntenic blocks were then used to deduce the homologous relationships between rice or barely genes and the protein sequences of Avena and wheat. The collinearity between species was identified and plotted using MCScanX (python version) (Fig. 2c).

The evolution and allopolyploidization history of oat

For whole-genome resequencing, 14 other Avena accessions were chosen, representing all genome types found among extant Avena diploid, tetraploid and hexaploid species (Supplementary Table 1). All genome sequencing was performed on an Illumina HiSeq X-Ten instrument generating 400-base paired-end libraries.

The raw reads from the 14 newly sequenced accessions as well as A. longiglumis and A. insularis were trimmed using Trimmomatic (v.0.40)73. Potential polymerase chain reaction duplicates were removed using SAMtools (v1.9)74. The genomic variants were identified using the HaplotypeCaller module in GATK (v4.1.9.0), and SNPs that met any of the following criteria were further discarded: (1) SNPs with quality score <50, QD score <2, FS score >60 or MQ score <40; (2) SNPs with more than two alleles; (3) SNPs at or within 5 bp from any InDels; (4) Genotypes with extremely high (greater than three-fold average depth) or extremely low (less than one-third average depth) coverage were assigned as missing sites; (5) SNPs with a minor allele frequency < 0.05; and (6) SNPs with missing sample rate >0.3. SNPs in syntenic regions across the A, C, and D subgenomes for each accession were isolated and used for maximum-likelihood (ML) tree construction using the RAxML software with 200 bootstrap replicates. The phylogenetic relationships of the species related to the A, C, and D lineages were constructed based on A-, C-, and D-type SNPs, respectively. The level of identity between the individual Sanfensan subgenomes and each sequenced accession was calculated by mapping ~1× clean reads of each accession onto the Sanfensan subgenomes.

All sequenced diploid accessions were further subjected to transcriptome sequencing. The raw reads were cleaned with Trimmomatic. De novo assembly was performed using Trinity (v2.0.3)75 with the default settings. CDSs were predicted using TransDecoder (v5.5.0) ( Proteomes from these diploid accessions was subjected to gene family analysis against Hordeum vulgare protein sequences by using OrthoFinder, and single-copy gene families that were conserved in all species were retained for further study. The protein sequences from each conserved gene family were aligned using MUSCLE (v3.8.31)76 with the default parameters, and the alignments were CDS back-translated into CDS from the corresponding protein alignments. The same methods described above were used for phylogenetic tree construction and divergence time estimation.

The chloroplast genomes of all sequenced taxa were assembled with NOVOPlasty (v3.7)77 using the clean Illumina short reads (Supplementary Table 13). Another 25 Avena chloroplast genomes previously released22,23 (Supplementary Table 14) were also downloaded. Multiple sequence alignments were performed using MUSCLE, and the identified informative sites were used for phylogenetic tree construction using RAxML with 100 bootstrap replicates under the GTR+I+Γ evolutionary model, where the chloroplast genome sequence from Triticum aestivum was used as the outgroup. Divergence times were estimated based on independent rates and the Jukes-Cantor 1969 (JC69) model using the MCMCTree program in the PAML (v4.7) package.

Synteny and comparative genomics

The subgenome synteny between Sanfensan and A. insularis was analyzed by plotting the positions of homologous pairs in the subgenome pairs within the context of 21 and 14 chromosomes using Circos (v0.69-9)78 software (Extended Data Fig. 5a,b). The interchromosomal exchanges between A. insularis and Sanfensan after polyploidization were analyzed by individually mapping reads from A. longiglumis and A. eriantha to the A. insularis reference genome and reads from A. longiglumis, A. eriantha and A. insularis to the Sanfensan reference genome. The single-base depth coverage of the properly paired reads obtained from the A. longiglumis, A. eriantha and A. insularis mapping was calculated using the Mosdepth (v0.3.0)79 program. The median depth within a sliding window (window size: 1 Mb, step size: 0.5 Mb) was calculated and plotted along the chromosomes of the reference genome (Fig. 4b and Extended Data Fig. 6a–d).

FISH analysis

Major interchromosomal exchanges between the C and D subgenomes of A. insularis and Sanfensan were detected using FISH with the A and C genome-specific repeats As120a and Am1 as the probes. The large inversion observed on hexaploid chromosome 3 C was validated by FISH using two additional oligonucleotide probes, Oligo-5SrDNA and Oligo-6C343, as previously reported80, that gave stable fluorescent signals in regions flanking one of the breakpoints of this inversion on hexaploid chromosome 3C. The nucleotide sequences of the FISH probes were given in Supplementary Table 17. Metaphase chromosome preparation and FISH analysis were performed as described in the previous study81,82; (Fig. 4c, Extended Data Fig. 6e, f).

Ka/Ks analysis

On the basis of the 7,353 one-to-one orthologous gene sets identified among the genome assemblies for Hordeum vulgare, we calculated the nonsynonymous (Ka) and synonymous substitution (Ks) rates for the A-genome (A. atlantica and A. longiglumis) and C-genome (A. eriantha) diploid progenitors of the hexaploid oat, and the subgenomes of A. insularis and Sanfensan. For this purpose, the homoeologous gene pair list was used as the input and the protein sequences from each gene pair were aligned using MUSCLE. PAL2NAL(v14)83 was used to convert the peptide alignment to a nucleotide alignment, and the Ka and Ks values were computed between gene pairs using Codeml from PAML (v4.7) in free-ratio mode. All estimates with Ks < 0.01 were excluded from the analysis. The significance of the differences in Ka/Ks values between genomes (subgenomes) was estimated using the Wilcoxon rank-sum test for nonnormal distributions in R.

Gene loss and retention

To measure the effects of polyploidization on gene loss and retention, we performed PAV analyses of the single-copy genes between the polyploids and the diploids. In this case, 14,624 one-to-one orthologues for the three diploids (A. atlantica, A. longiglumis and A. eriantha) were identified. The presence/absence of these genes in each subgenome of the tetraploid and hexaploid species were counted. The distributions of these conserved genes that were absent in the polyploids were displayed with the A. longiglumis genome as the reference. Expanded and contracted gene families for eight Avena (sub)genomes with the barley genome as reference were identified using CAFÉ (v5.2.1)84. Significantly overrepresented GO terms of the contracted gene families in the C subgenome of the hexaploid were identified using the R package ‘topGO’. We further compared the gene family size changes in the hexaploid subgenomes and the A- and C-genome diploids based on the gene families identified by the OrthoFinder program. The scatter dots and regression lines were plotted (Extended Data Fig. 8e,f).

Gene expression analysis

Gene expression levels in each sample were quantified using the HiSAT2 (v2.2.1)85 and HTSeq (v0.9.1)86 pipelines. Differentially expressed genes were identified with the edgeR (v3.38.1) software package87 (false discovery rate < 0.05 and |log2-fold change > 0.5). To analyze differences in the expression patterns of homoeologous genes, we undertook an initial analysis of the variation in expression among 12,225 homoeologous triplets. Triplet expression vectors were created by concatenating the observed gene expression values for the A, C and D homologs. Triplets that expressed at least one homolog across the sampled tissues were summarized in a triplet expression matrix. The expression values of the triplet expression matrix were log transformed (log10(transcripts per million + 1)), and the matrix was subjected to two-dimensional hierarchical clustering using ‘hclust’ implemented in R with the ‘average’ correlation distance and clustering. Heatmap visualization was performed using the heatmap.2 command from the R package gplots.

Characterization of AS variants

Transcripts from the RNA-sequencing data were first assembled using StringTie (v2.2.0)88 with the default parameters based on the bam files generated from HISAT2, which were then integrated with the IsoSeq transcript assemblies using SQANTI3 (v5.0)89. The number of AS events in each gene locus was counted using SQANTI3 with the default settings.

Identification of R-genes

R-genes in the five Avena genomes were identified using InterProScan (v5.22) and DeepCoil (v2.0.1) based on domains from the Pfam (v14.2) database. The domains included coiled-coil (CC), kinase (KIN), Toll/interleukin-1 receptor/resistance protein (TIR), nucleotide-binding site (NBS) and leucine-rich repeat (LRR). Non-canonical R-genes (which lack most conserved motifs but are nevertheless potential R-genes) were determined by BLAST searchers using manually curated R-genes from the PRGdb (v4.0)90 database as reference sequences. Results with e-values of less than 1E-5 and top-hits with at least 30% query and subject coverage were considered to be the putative R-gene candidates.

To understand whether the identified R-genes were correlated with the map positions of the known quantitative trait loci for crown rust, one of the most serious diseases of oats, DNA markers co-segregating with or flanking the known crown rust genes (Supplementary Table 22) were mapped to the hexaploid Sanfensan reference genome by BLASTn analyses. The distributions of the R-genes and known quantitative trait loci are shown in Fig. 5b.

Candidate gene for the hulless trait

A diverse collection of 659 oat accessions that included 510 hulled and 149 hulless oats were used for genotyping by sequencing (GBS) analysis (Supplementary Table 23). Variants were identified using the HaplotypeCaller module of GATK (v4.1.9.0). The resulting variants were further filtered to retain only bi-allelic sites with minor allele frequency > 0.05, missing data rate <20% and heterozygous calls <10% for genome-wide association scans.

Principal-component analysis and neighbor-joining trees were used to infer population structure of the oat collection using TASSEL 5.0 (ref. 91). Pairwise r2 was calculated using an LD sliding window size of 50. The pattern of LD decay was visualized by plotting pairwise r2 values against the physical distance (Mb). A smoothened LD curve was fit to the data using the R script developed by M. Ali (; decay Plot from TASSEL LDoutput.R).

Genome-wide association scans for the hulless trait were performed with TASSEL 5.0 using a mixed-linear model incorporating the kinship matrix and population structure as covariates. A genome-wide threshold of −log(P) = 6.70, calculated from the formula −log10(0.01/effective number of SNPs) was used to identify markers associated with the hulless trait.

Ten hulled and 12 hulless oat accession (Supplementary Table 1) were selected for RNA-sequencing analysis. Genomic sequences of the candidate gene from five hulless and five hulled oats were aligned using the ClustalW92 program. A total of 25 single-nucleotide changes were identified in the gene coding regions between hulled and hulless oats, with one SNP in exon 1 predicted to cause amino acid changes. This SNP was converted to a KASP marker (Supplementary Note) and validated in 286 oat lines randomly selected from the diverse oat collection.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.