Subtelomeric assembly of a multi-gene pathway for antimicrobial defense compounds in cereals

Non-random gene organization in eukaryotes plays a significant role in genome evolution. Here, we investigate the origin of a biosynthetic gene cluster for production of defence compounds in oat, the avenacin cluster. We elucidate the structure and organisation of this 12-gene cluster, characterise the last two missing pathway steps, and reconstitute the entire pathway in tobacco by transient expression. We show that the cluster has formed de novo since the divergence of oats in a subtelomeric region of the genome that lacks homology with other grasses, and that gene order is approximately colinear with the biosynthetic pathway. We speculate that the positioning of the late pathway genes furthest away from the telomere may mitigate a ‘self-poisoning’ scenario in which toxic intermediates accumulate as a result of telomeric gene deletions. Our investigations reveal a striking example of adaptive evolution underpinned by remarkable genome plasticity.

Flow cytometric estimation of the nuclear genome size of Avena strigosa. Nuclei were isolated from A. strigosa accession S75 and maize (Zea mays cv. CE.77; used as an internal reference standard), stained and analyzed simultaneously. The 2C DNA content was calculated using the ratio of the G1 peak means of A. strigosa S75 and the maize standard, giving a 2C value for A. strigosa S75 of 8.486 ± 0.074 pg DNA (mean ± SD). The 1C genome size in base pairs was calculated using the formula 1 pg DNA = 0.978 × 10⁹ bp (ref. 1), giving a 1C A. strigosa S75 genome size of 4.149 ± 0.036 Gb (about 4149 Mb; mean ± SD).

A total of 5515 64-base tag-level haplotypes out of 13,873 examined had good matches (either perfect or with one base mismatch) to single sites on the seven chromosomes.

Supplementary Fig. 6. High confidence gene models identified in the A. strigosa S75 genome assembly. (A) Proportions of gene models supported by evidence from protein-based homology searches and expression data. (B) RNA-seq and homolog-based evidence for the high confidence gene models. RNA-seq reads were mapped to gene coding sequences. The coverage is the ratio of the total aligned read length to the coding sequence length of the gene. The gene models of A. strigosa were aligned to those of barley and Triticum urartu (red wild einkorn wheat; AA genome) by Blastp. The coverage of each A. strigosa gene is the ratio of the aligned protein sequence length to the total protein length of the corresponding gene. Source data are provided as a Source Data file.

Supplementary Fig. 7. Comparison of assembled scaffolds with A. strigosa S75 BACs sequenced using the Sanger method. The depth of the nanopore long reads (blue) was calculated by mapping paired-end reads onto the BAC sequences. The grey blocks show aligned regions between the BAC sequences and the scaffolds. RepeatMasker-annotated transposable elements (Repeats) within the BAC sequences are shown in red. The white blocks indicate the unfilled gaps on the scaffolds. GC content is shown in green.

Supplementary Fig. 8. Occurrence of different types of repetitive elements within the avenacin cluster region. The avenacin gene cluster is indicated by the red line.

Supplementary Fig. 9. CYP94D65 is a C-23 hydroxylase. Left: GC-MS total ion chromatograms of extracts from N. benthamiana leaves expressing tHMGR and bAS1/SAD1 without or with CYP94D65. Right: EI mass spectrum and inferred structure for the oxidized form of β-amyrin generated by CYP94D65.

Co-expression of bAS1/SAD1, CYP51H10/SAD2, CYP72A475/SAD6, CYP94D65 and AAT1/UGT99D1 with the glucosyltransferase UGT91G16 results in the disaccharide (2, retention time = 6.6 min, m/z = 783, top right mass spectrum). The addition of CYP72A476 results in the clear consumption of the disaccharide peak (2) and the appearance of a new, more polar peak with a mass consistent with (2) plus an aldehyde group (3, retention time = 5.2 min, m/z = 797, middle left mass spectrum). Co-expression of bAS1/SAD1, CYP51H10/SAD2, CYP72A475/SAD6, CYP94D65, AAT1/UGT99D1 and UGT91G16 with TG1 results in a reduction of the disaccharide peak (2) and the appearance of the trisaccharide (4, retention time = 5.1 min, m/z = 945, middle right mass spectrum). Co-expression with CYP72A476 results in the conversion of the disaccharide and trisaccharide peaks to new products, (3, retention time = 5.2 min, m/z = 797, bottom left mass spectrum) and (5, retention time = 4.0 min, m/z = 959, bottom right mass spectrum) respectively, whose masses are consistent with the addition of an aldehyde group.
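The genome size conversion described in the flow cytometry caption above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the function name `one_c_size_mb` is hypothetical, and it assumes the SD scales linearly with the mean because the conversion is a simple multiplication.

```python
# Conversion factor stated in the caption: 1 pg DNA = 0.978 x 10^9 bp.
BP_PER_PG = 0.978e9


def one_c_size_mb(two_c_pg: float) -> float:
    """Convert a 2C DNA content (pg) to a 1C genome size (Mb).

    Hypothetical helper for illustration: halve the 2C value to get 1C,
    convert pg to bp, then bp to Mb.
    """
    one_c_pg = two_c_pg / 2.0
    return one_c_pg * BP_PER_PG / 1e6


# Values reported for A. strigosa S75: 2C = 8.486 +/- 0.074 pg (mean +/- SD).
mean_mb = one_c_size_mb(8.486)
sd_mb = one_c_size_mb(0.074)  # a linear conversion scales the SD the same way

print(f"1C genome size: {mean_mb:.0f} Mb, SD {sd_mb:.0f} Mb")
```

Under these assumptions the mean comes to roughly 4150 Mb and the SD to roughly 36 Mb (0.036 Gb), in line with the reported 1C value.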

Figure panel labels: A. atlantica (AsAs genome); A. eriantha (CpCp genome).

[…] Supplementary Fig. 22). The A subgenome of A. sativa has a conserved region of the avenacin cluster on chromosome 7, but the homologs of the last two genes (UGT91 and TG1) are located on chromosome 1. In the C subgenome of A. sativa there is also a roughly similar region, but with several rearranged genes. In the D subgenome of A. sativa, only the homologs of the last two genes (UGT91 and TG1) of the avenacin cluster were found on chromosome 7. The homologs of the other ten genes were found on Chr00, which is not assembled into the chromosomes. Two conserved regions of the genes in between the avenacin cluster and the telomere were also found on Chr00 (further detail for hexaploid oat shown in Supplementary Fig. 23).
Supplementary Fig. 23. Comparison of the A. strigosa avenacin cluster region with related regions in A. atlantica and A. eriantha. The A. strigosa avenacin cluster is indicated by the red line. A. atlantica has a very similar region in the syntenic position on chromosome 1, which is also root-expressed. There are two additional A. atlantica genes in between the homologs of UGT74H7 and UGT91 that are absent from the A. strigosa cluster. However, homologs of these two genes are located ~20 genes to the right of the A. strigosa cluster. The genes in between the avenacin cluster and the telomere are also conserved in A. atlantica. However, A. atlantica has an additional six genes (indicated by grey arrows with red outlines) beyond these at the end of the chromosome. No corresponding syntenic block of six genes was found anywhere in the A. strigosa genome. In A. eriantha there is a loosely similar region but with numerous additional genes. Although most of these A. eriantha genes are expressed at moderate levels in the roots, they are primarily expressed in whole seedlings and in the crowns. Source data are provided as a Source Data file.

Supplementary Fig. 24. Pairwise dN/dS analysis for the avenacin gene cluster region in the sequenced diploid oat species.
The genes of the avenacin gene cluster are indicated by the red line, and the flanking genes by asterisks. Notably, some of the genes in the avenacin cluster in A. eriantha are duplicated; the best-matched homologous genes are marked with a black circle. For each dataset, a likelihood ratio test (LRT) was conducted to assess statistical significance. The p value was determined by comparison to the χ² distribution with one degree of freedom. Only dN/dS ratios with p < 0.05 were considered significant and plotted. Source data are provided as a Source Data file.
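The likelihood ratio test described above can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' pipeline: the caption does not say which models were compared, so the null and alternative log-likelihoods here are hypothetical inputs (for example, dN/dS fixed versus freely estimated), and the function name `lrt_p_value` is invented for illustration. The χ² survival function for one degree of freedom is computed with the standard library identity P(X > x) = erfc(√(x/2)).

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P value of a likelihood ratio test against chi-squared with 1 df.

    lnl_null and lnl_alt are the maximised log-likelihoods of the nested
    null model and the alternative model (hypothetical inputs).
    """
    # Test statistic: twice the log-likelihood difference, floored at zero
    # to guard against small negative values from optimiser noise.
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-squared with one degree of freedom.
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods for one gene pair (illustrative only).
p = lrt_p_value(lnl_null=-2504.7, lnl_alt=-2500.1)
is_significant = p < 0.05  # the threshold used in the caption
print(f"p = {p:.4g}, significant: {is_significant}")
```

With the threshold in the caption, only comparisons where `is_significant` is true would be plotted.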

Supplementary Fig. 25. Comparison of the A. strigosa avenacin cluster region with related regions in the hexaploid A. sativa genome.
The A subgenome of A. sativa has a conserved region of the avenacin cluster on chromosome 7, but the homologs of the last two genes (UGT91 and TG1) are located on chromosome 1. In the C subgenome of A. sativa there is also a roughly similar region, but with several rearranged genes. In the D subgenome of A. sativa, only the homologs of the last two genes (UGT91 and TG1) of the avenacin cluster were found on chromosome 7. The homologs of the other ten genes were found on Chr00, which is not assembled into the chromosomes. Two conserved regions of the genes in between the avenacin cluster and the telomere were also found on Chr00.