Main

The genomes of Arabidopsis3, rice4, poplar, grape and Sorghum5 were first sequenced using high-quality and reiterative Sanger-based approaches producing a series of ‘gold standard’ reference genomes. The advent of next-generation sequencing (NGS) technologies reduced costs of sequencing substantially, which has enabled sequencing of over 100 plant genomes1. The quality of plant genome assemblies depends on genome size, ploidy, heterozygosity and sequence coverage, but most NGS-based genomes have on the order of tens of thousands of short contigs distributed in thousands of scaffolds. The short read lengths of NGS, inherent biases and non-random sequencing errors have resulted in highly fragmented draft genome assemblies that are not complete, which means they are missing biologically meaningful sequences including entire genes, regulatory regions, transposable elements, centromeres, telomeres and haplotype-specific structural variations. It is becoming clear from ENCODE projects that complete genomes are needed to better understand the importance of the non-coding regions of genomes2.

More than 40% of calories consumed by humans are derived from grasses, and the grass family (Poaceae) is arguably the most important plant family with regard to global food security6. The size and complexity of most grass genomes have challenged progress in gene discovery and comparative genomics, although draft genomes are now available for most agriculturally important grasses1. The largest genome assemblies, such as maize (2,300 megabases (Mb))7, barley (5,100 Mb)8 and wheat (hexaploid, 17,000 Mb)9, are highly fragmented as a result of the inability of current sequencing technologies to span complex repeat regions. Near-finished reference genomes are available for rice4, Sorghum5 and Brachypodium10, but more high-quality grass genomes are needed for comparative genomics and gene discovery. Here we present the ‘near-complete’ draft genome of the grass Oropetium thomaeum, the first high-quality reference genome from the Chloridoideae subfamily. The draft genome is near complete because we were able to sequence through complex repeat regions that are unassembled in most draft genomes. Oropetium has the smallest known grass genome at 245 Mb and is also a resurrection plant that can survive extreme water stress, including loss of >95% of cellular water (Fig. 1)11.

Figure 1: Desiccation tolerance in the resurrection grass Oropetium thomaeum.

a, Well watered. b, Desiccated (relative water content <5%) after 9 days of drought stress. c, Condition 24 h post-hydration (relative water content >70%).

Single-molecule real-time (SMRT) sequencing (Pacific Biosciences) produces long and unbiased sequences, which enables assembly of complex repeat structures and GC- and AT-rich regions that are often unassembled or highly fragmented in NGS-based draft genomes. We generated ~72× sequencing coverage of the Oropetium genome using 32 SMRT cells on the PacBio RS II platform (which is equivalent to <1 week of sequencing time and <US$10,000 in reagents). The resulting sequence had a read N50 length of over 16 kilobases (kb), and there was 10× coverage of reads over 20 kb in length (Extended Data Fig. 1a). The raw reads were error-corrected using the hierarchical genome assembly process (HGAP), and the longest reads (>16 kb) were assembled using Celera assembler followed by two rounds of genome polishing using Quiver12. The assembly contains 650 contigs spanning 99% (244 Mb) of the estimated 245 Mb genome size (Extended Data Fig. 1b) with a contig N50 length of 2.4 Mb (Extended Data Fig. 1c). The final assembly consists of 625 contigs after removal of the complete chloroplast genome, mitochondria-derived contigs and contaminants. The 35 largest contigs span half the genome, and the largest 107 contigs contain 90% of the sequence. The 135,324 base-pair (bp) chloroplast genome assembled into a single contig that includes both ~25 kb inverted repeat regions, which typically collapse into a single copy during assembly. The mitochondrial genome was assembled into 20 partially overlapping circular chromosomes, which are the product of intramolecular recombination events and collectively span 1,100 kb.
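The contiguity statistics quoted above (a contig N50 length, and the number of largest contigs that together span half the assembly) can be illustrated with a minimal Python sketch. The function name `n50_and_l50` and the toy contig lengths are assumptions made for illustration; this is not the tooling used to produce the reported values.

```python
def n50_and_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    contigs, sorted from longest to shortest, first reaches half of
    the total assembled bases. L50 is how many contigs that takes.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example with hypothetical contig lengths (arbitrary units).
toy_contigs = [8, 5, 4, 3, 2, 2]
toy_n50, toy_l50 = n50_and_l50(toy_contigs)
```

Under this definition, the statement that the 35 largest contigs span half the genome corresponds to an L50 of 35.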

The Oropetium genome has high contiguity for an uncurated draft plant genome. The average contig N50 length for all published plant genomes is 50 kb compared to 2.4 Mb for Oropetium (Extended Data Fig. 1d, e). After manual curation and data augmentation, only the Arabidopsis (TAIR10)13, rice (V7) and Brachypodium (V 2.1)10 genomes have longer contig N50 lengths. The accuracy rate is very high at 99.99995%, which is similar to Sanger-based approaches and higher than most NGS-based assemblies (Extended Data Fig. 1h). We plotted repeat density and GC content along the length of the contigs to identify factors causing contig breaks (Extended Data Fig. 1f, g). There is no correlation between repeat density and GC content at contig break points. This suggests that contig break points occur at the start of repeats or that most assembly breaks are caused by other factors, such as within-genome heterozygosity or haplotype-specific structural variation. To test this, we also tried ‘diploid-aware’ assemblers Falcon (https://github.com/PacificBiosciences/falcon) and MinHash Alignment Process (MHAP)14. These assemblies had similar metrics but were less contiguous overall (Extended Data Fig. 1i).

The completeness of the Oropetium genome allowed us to accurately survey its highly repetitive features that are often unassembled in most plant genomes. The Oropetium assembly captures all 18 telomeric arrays (Extended Data Table 1) with repeat number ranging from 40 to 900, suggesting that at least some are full length. Three of the nine centromeric satellites are completely assembled into large inverted repeats spanning 400 kb with a base monomer length of 155 bp, and higher order structures of dimers (310 bp), trimers (465 bp) and tetramers (620 bp; Fig. 2, Extended Data Fig. 2 and Supplementary Table 1). The remaining 40 centromeric sequences are incomplete centromere repeat fragments broken during assembly or solo repeats not associated with a larger centromere satellite. Nucleolus organizer regions contain tandem arrays of the 18S, 5.8S and 25S ribosomal RNA (rRNA) genes and typically span several megabase pairs with hundreds of nearly identical 10-kb arrays. Twenty-two full-length rRNA tandem arrays in six contigs are found in the Oropetium assembly (Extended Data Table 2). The largest tandem array contains five identical and one partial 9-kb repeats collectively spanning 51 kb; this is approaching the theoretical limit given the read-length distributions of our data. The remaining rRNA tandem repeats probably collapsed during read correction or genome assembly given their high sequence conservation.

Figure 2: SMRT sequencing enables contiguous sequencing over complex regions.

The distributions of centromere-specific satellite DNA (CenOt), long terminal repeat retrotransposons (LTRs), DNA transposable elements (DNA-TE) and coding DNA sequences (CDS) are plotted. a, The gap-free assembly of a full-length centromeric array and the flanking highly repetitive pericentromeric region. b, The largest contig (7.8 Mb), which has a more typical distribution of elements.

Most repeats are incomplete, unassembled or highly collapsed in Illumina/454 NGS-based genomes, which has led to an underestimation and misclassification of repeat content in most plant genomes. Repetitive elements account for a surprisingly high proportion of the Oropetium genome (43%) compared to 21% in Brachypodium10, 35% in rice4, 54% in Sorghum5 and over 90% in wheat9 (Extended Data Table 3). Similar to these other genomes, the long terminal repeat (LTR) retrotransposons are the most abundant class and account for 35.6% of the Oropetium genome. We identified 3,247 intact LTRs in 358 families, which is similar to rice (3,663) and Brachypodium (2,162), but far less than Sorghum (17,022)15. Only ~2% of the repeats are unclassified, which reflects the completeness of individual repeat elements due to the long reads.

Genome size in the grasses varies by several orders of magnitude as a consequence of polyploidy and genome bloating due to repetitive DNA accumulation16. Oropetium has the smallest known genome among the grasses17 at 90%, 60%, 50%, 30% and 10% the size of Brachypodium10, rice4, Setaria18, Sorghum5 and maize7, respectively. We found that Oropetium has a solo:intact LTR ratio >1, which is similar to small grass genomes like rice and Brachypodium, where proliferating LTRs are removed by illegitimate recombination, whereas large grass genomes like Sorghum and maize have solo:intact LTR ratios <1 (ref. 15). Despite its compact size, the Oropetium genome has a typical number of predicted protein coding genes at 28,446. A pan-cereal whole-genome duplication (WGD) event, called rho, occurred before the diversification of grasses5,19. There appear to have been no further WGDs in the selected grass genomes, including Oropetium, since the shared rho event4,5.

Genome alignments between Oropetium and selected grass genomes are mostly one-to-one after exclusion of the alignments derived from the shared genome duplication events (Extended Data Fig. 3a–e). Overall, 75% of the Oropetium genome, or 89% of its gene space, is contained in conserved syntenic blocks when compared to other grasses. Genomic colinearity across grass genomes is extensive, with a high density of orthologous genes spanning much of the euchromatin (Fig. 3). Insertions of retrotransposons and non-collinear genes that originated elsewhere in the genome contribute greatly to the differences in the intergenic sequences in grasses20.

Figure 3: Compact genome structure of Oropetium.

Oropetium, part of the PACMAD clade, provides the first high-quality reference genome from the Chloridoideae subfamily—a large and diverse group of ~1,600 species that contains the orphan crops tef (Eragrostis tef) and finger millet (Eleusine coracana). Typical micro-colinearity patterns among genomic regions from Oropetium, Setaria, Sorghum, Oryza and Brachypodium are shown. Rectangles show predicted gene models, and colours indicate relative orientations. Matching gene pairs are displayed as grey connections. chr, chromosome.

The relative sizes of syntenic blocks in the grass genomes track closely with the overall genome size difference (Extended Data Fig. 3f). In contrast, the genomic span of coding sequences is similar across genes that are retained in orthologous locations, although coding features are slightly smaller in Oropetium (Extended Data Table 4). The relatively constant sizes of coding sequences among grass genomes confirm that genome size differences are indeed due to variations in the intergenic contents. It was thought that plants have a ‘one-way ticket to genome obesity’ due to the retention of proliferating transposable elements21. However, analysis of carnivorous plants Utricularia gibba (bladderwort, 82 Mb)22 and Genlisea aurea (corkscrew, 63.6 Mb)23 provided evidence that almost all intergenic space can be purged. Small genomes also arise from a reduction in gene number as seen in the aquatic monocotyledon Spirodela polyrhiza, which has the fewest predicted protein coding genes at 19,623 (ref. 24). Oropetium seems to have reduced both its intergenic and intragenic sequence.

As the intergenic sequence in Oropetium is specifically reduced compared with other grasses (Extended Data Fig. 3f), we determined which sequence accounted for its smaller genome size by comparing highly syntenic regions of the larger 730 Mb Sorghum genome. To identify highly orthologous regions we looked for Sorghum genes (promoter, 5′UTR, exons, introns and 3′UTR) with an increased number of conserved noncoding sequences25. We then analysed the top 48 Sorghum genes against their orthologous sequences in Oropetium and found that they were 38% (±0.27, 1 s.d.) larger in Sorghum (Extended Data Fig. 4a). The primary driver of gene-space expansion was highly unique ~1-kb intragenic sequences evenly spaced within the Sorghum genes. One explanation is that these evenly spaced highly unique sequences are degenerate remnants of transposons that have been partly purged from the Sorghum genome. Oropetium has a >1 solo:intact LTR ratio, consistent with active purging of transposons and complete loss of these regions. These results lend support to an emerging theory about the C-value paradox called the Genome Balance Hypothesis26, which suggests that selection on gene networks and pericentromeric growth (centromere movement) is balanced by transposon proliferation and retention. Therefore, these evenly spaced highly unique sequences balance the 6:1 expansion of pericentromeric sequence in Sorghum as compared to Oropetium (Extended Data Fig. 4b).

Desiccation tolerance was a key adaptation that permitted the most recent common ancestor of terrestrial plants to survive on land. Desiccation tolerance is widespread in bryophytes and lichens but rare in flowering plants, although similar mechanisms have evolved in vascular plants for seed and pollen desiccation. Desiccation tolerance to survive prolonged drought evolved independently in diverse monocotyledon and eudicotyledon lineages, and is found in at least 300 species. Gene duplications have provided the raw material for evolutionary innovation across plants. Tandem duplicated genes are often involved in stress responses and are probably important for adaptive evolution in dynamically changing environments. Oropetium has 6,668 tandem duplicated genes in 2,326 clusters, which is a slightly higher number than in other grasses, but a similar proportion (24% of genes). Tandem duplicated genes are enriched for gene ontology terms involved in response to abiotic stresses, gene regulation and cellular metabolism (Supplementary Table 2). In addition, Oropetium has 4,209 homeologous gene pairs retained from the rho WGD event, which are enriched for gene ontology terms related to gene regulation and stress responses such as transcription factor activity, nitrogen metabolism, response to abiotic stimulus, to salt stress and to oxygen-containing compounds (Supplementary Tables 3 and 4). Understanding the genomic mechanisms of extreme desiccation tolerance in resurrection plants such as Oropetium may provide targets for engineering drought and stress tolerance in crop plants.

Pacific Biosciences (PacBio) SMRT sequencing has been used to close gaps in the human genome27, assemble complete bacterial genomes12 and identify novel gene isoforms28. Here we present a several hundred megabase plant genome, sequenced and assembled entirely by SMRT sequencing. The long SMRT reads produced a near-complete draft genome that captured three of nine complete centromeres, all of the telomeres and biologically relevant features of the Oropetium genome. The total time from extracted DNA to a complete assembly was less than one month, and costs for PacBio were comparable to an Illumina-based genome assembly. Our study demonstrates that SMRT sequencing enables a new level of genome assembly required for full ENCODE-type analysis of intergenic sequence, which is not currently possible with other NGS-based methods. The compactness of the Oropetium genome results from purging of both inter- and intragenic sequences, probably through small deletions during illegitimate recombination, as has been shown in other grasses. One hypothesis is that genome size is a function of cell size29, and consistent with this, all small plant genomes sequenced to date, including Arabidopsis (125 Mb), Brachypodium (272 Mb), Selaginella (100 Mb), Spirodela (158 Mb) and Utricularia (82 Mb), are plants of very small stature (Fig. 1). However, we provide evidence for the Genome Balance Hypothesis, which suggests that there is selective pressure on Oropetium to purge proliferating transposons in order to maintain expression balance of networked genes and spacing in centromeres. The complete assembly of complex and highly similar repeat sequences demonstrated here suggests that SMRT sequencing can be used to assemble large and polyploid plant and other eukaryotic genomes, assuming ample sequence coverage and computational resources. SMRT-sequencing-based assemblies provide an opportunity to determine how these regions play a role in genome architecture and dynamics.

Methods

No statistical methods were used to predetermine sample size.

Plant material

Oropetium thomaeum is a compact resurrection plant that has the smallest known genome among the grasses, at 245 Mb and 9 chromosomes (2n = 2x = 18; 1C = 0.25 pg)17. We estimated the genome size to be 250 Mb by flow cytometry and 245 Mb by k-mer analysis (Extended Data Fig. 1b). Oropetium thomaeum plants were originally collected in Jodhpur, Rajasthan, India and propagated as previously described11. Oropetium is a member of the Chloridoideae subfamily, a large and diverse group of roughly 1,600 species that contains the orphan crops tef (Eragrostis tef) and finger millet (Eleusine coracana) as well as some turf grasses (such as Bermuda grass, Cynodon dactylon and Zoysia japonica).

SMRT PacBio sequencing

Fifty micrograms of high-molecular-weight Oropetium gDNA was extracted using a modified nuclei preparation method30 followed by an additional high-salt phenol–chloroform purification to minimize contamination. A 20-kb insert SMRTbell library was generated using a 15 kb lower-end size selection protocol on the BluePippin (Sage Science). Initial titration runs were performed to optimize loading on the SMRT Cell for maximum performance. The Oropetium genome was sequenced using 32 SMRT Cells with 4-h collections and P6-C4 chemistry on the PacBio RS II platform (Pacific Biosciences).

HGAP genome assembly

The Oropetium genome was assembled using the RS_HGAP_Assembly.3 protocol for assembly and Quiver for genome polishing in SMRT Analysis v2.3.0 (ref. 12). This consisted of a three-step process involving (1) generation of preassembled reads with improved consensus accuracy; (2) assembly of the genome through overlap consensus accuracy using Celera; and (3) one round of genome polishing with Quiver. For HGAP, the following parameters were used: PreAssembler Filter v1 (minimum sub-read length = 3,000 bp, minimum polymerase read quality = 0.80, minimum polymerase read length = 3,000 bp); PreAssembler v2 (minimum seed length = 16,000 bp, number of seed read chunks = 6, alignment candidates per chunk = 10, total alignment candidates = 24, min coverage for correction = 6); AssembleUnitig v1 (target genome coverage = 30, overlap error rate = 0.06, minimum overlap = 40 bp and overlap k-mer = 14); and BLASR v1 mapping of reads for genome polishing with Quiver (max divergence percentage = 30, minimum anchor size = 12). A second round of genome polishing was performed using Quiver (SMRT Analysis v2.3.0) to further improve the site-specific consensus accuracy of the assembly. The following Quiver parameters were used for genome polishing: filtering (minimum sub-read length = 3,000 bp, minimum polymerase read quality = 0.80, minimum polymerase read length = 3,000 bp); mapping (maximum divergence percentage = 30, minimum anchor size = 12). Default parameters were otherwise employed for both HGAP assembly and Quiver protocols.

Falcon and MHAP assemblies

We also tested other assemblers to compare the PacBio HGAP assembly results (Extended Data Fig. 1i). Raw PacBio reads were error-corrected and assembled using Falcon and MHAP under default parameters. The Falcon and MHAP assemblies have lower contiguity than the HGAP assembly and have fewer assembled centromere and telomere sequences with a lower average length.

Construction of a genome map using the Irys system for contig anchoring and scaffolding

Genome mapping from BioNano Genomics31 was used to improve the assembly quality of the Oropetium genome with the eventual goal of producing a chromosome-scale assembly. High molecular weight genomic DNA was isolated from fresh Oropetium tissue using the following protocol outline. Three grams of leaves were collected from live Oropetium thomaeum plants and fixed with formaldehyde. After blending with a tissue homogenizer in isolation buffer, a filtration step and Triton-X washing treatment were performed. The nuclei were purified on percoll cushions. The nuclei were washed extensively and embedded in low melting agarose at different dilutions. Finally, the DNA plugs were treated with a lysis buffer containing detergent, proteinase K and β-mercaptoethanol (BME). In total, 53 Gb of data (>100 kb) were collected representing ~200× genome coverage with a molecule N50 length of 169 kb (Extended Data Fig. 5a). The size distribution was lower than expected and is probably a result of impurities during high-molecular-weight gDNA isolation that would cause shearing and inhibition of enzymes. Molecules were de novo assembled as previously described32. Two genome maps were assembled at different stringencies: map set 1 has 402 maps with an N50 length of 725 kb and spans 216 Mb (Extended Data Fig. 5b); the second genome map has 214 maps and an N50 of 1.674 Mb. The two genome maps were combined sequentially with the PacBio assembly to produce a hybrid scaffold. The scaffolding merged 90 contigs producing an assembly of 46 primary scaffolds covering 94% of the sequence assembly with an N50 of 7.8 Mb; in total there are 535 scaffolds with an N50 of 7.1 Mb and total assembled size of 244 Mb.

Variant calling using Illumina data

WGS Illumina sequences from Oropetium gDNA were used to assess the error rate of the PacBio assembly and residual within-genome heterozygosity (Supplementary Table 5). Raw Illumina HiSeq data from three different libraries of 570-bp insert, 1-kb insert and 3-kb insert sizes were trimmed for quality using Trimmomatic (v.0.32; ref. 33). Illumina sequence adaptors were removed, leading low quality (below quality 3) and N base pairs were trimmed, and reads were scanned using a 4-bp sliding window and trimmed when the average quality per base dropped below 30. Read pairs where both reads were ultimately of at least 36 bp in length following this quality control process were retained and used for subsequent analyses.
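The trimming rules described above can be sketched in Python as follows. This is a simplified illustration of the stated thresholds (leading quality 3, 4-bp window, mean quality 30, minimum length 36 bp), not Trimmomatic's actual implementation, which differs in details such as end-of-read handling; adaptor removal is omitted, and the function names `trim_read` and `keep_pair` are hypothetical.

```python
def trim_read(seq, quals, leading_min=3, window=4, window_min_avg=30, min_len=36):
    """Trim one read; return (seq, quals) or None if it ends up too short.

    seq is a string of bases and quals a list of Phred scores of the
    same length. Thresholds default to the values given in the text.
    """
    # Step 1: drop leading bases that are N or below the leading threshold.
    start = 0
    while start < len(seq) and (seq[start] == "N" or quals[start] < leading_min):
        start += 1
    seq, quals = seq[start:], quals[start:]

    # Step 2: slide a fixed-width window from the 5' end and cut the read
    # at the start of the first window whose mean quality is too low.
    cut = len(seq)
    for i in range(max(len(seq) - window + 1, 0)):
        if sum(quals[i:i + window]) / window < window_min_avg:
            cut = i
            break
    seq, quals = seq[:cut], quals[:cut]

    # Step 3: discard reads shorter than the minimum length.
    return (seq, quals) if len(seq) >= min_len else None


def keep_pair(read1, read2, **kwargs):
    """Keep a read pair only if both mates survive trimming."""
    t1 = trim_read(*read1, **kwargs)
    t2 = trim_read(*read2, **kwargs)
    return (t1, t2) if t1 is not None and t2 is not None else None
```

Requiring both mates to pass mirrors the paired-read retention rule stated above; single surviving mates are simply dropped in this sketch.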

Quality trimmed data were aligned to our assembly using BWA mem (v. 0.7.12-r1039)34. Duplicate alignments were marked using Picard tools v.1.104 MarkDuplicates (http://broadinstitute.github.io/picard/). Genome Analysis Toolkit (v.3.3.0)35 IndelRealigner was used to perform local realignment around indels, followed by application of GATK HaplotypeCaller to call variants. Identified single nucleotide polymorphisms were filtered by depth, strand bias, mapping quality and read position. Identified indels were filtered by depth, strand bias and read position.

The native error rate of raw PacBio reads is in the range of 15–20%, raising the possibility that residual sequencing errors may be introduced into the final assembly of the Oropetium genome. Homozygous mismatches are classified as sequencing errors, and heterozygous mismatches indicate sites of heterozygosity. The accuracy rate is very high at 99.99995%, and a relatively high proportion of the errors (two-thirds) are small insertions or deletions (indels). The accuracy rate is similar to those obtained with WGS Sanger approaches5,36 and is higher than those reported for most NGS-based assemblies. The estimated residual within-genome heterozygosity for the Oropetium genome is very low at 0.087%, which probably contributed to the high contiguity of the assembly. This suggests that provided sufficient coverage, a PacBio SMRT-only approach can produce a high-quality complete plant genome.
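To put the reported accuracy in more familiar terms, the following Python sketch converts it into a per-base error rate, an approximate Phred-scaled consensus quality and an expected number of residual errors. It assumes that the 99.99995% figure applies uniformly across the ~244 Mb assembly, which is a simplification; the function name `consensus_error_summary` is hypothetical.

```python
import math


def consensus_error_summary(accuracy, assembly_bp):
    """Summarize a consensus accuracy as error rate, Phred score and error count.

    accuracy is a fraction (for example 0.9999995 for 99.99995%) and
    assembly_bp is the assembly size in base pairs.
    """
    error_rate = 1.0 - accuracy
    return {
        "error_rate": error_rate,
        # Phred scale: Q = -10 * log10(probability of error)
        "phred_q": -10.0 * math.log10(error_rate),
        # Expected errors if the rate were uniform across the assembly
        "expected_errors": error_rate * assembly_bp,
    }


# Values taken from the text: 99.99995% accuracy, ~244 Mb assembly.
summary = consensus_error_summary(0.9999995, 244_000_000)
```

Under this uniformity assumption, the reported accuracy corresponds to roughly one error per two million bases.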

Repeat annotation

To structurally annotate repeat sequences in the Oropetium genome, we began by discovering repetitive elements through application of the REPET v.2.2 packages TEdenovo and TEannot37. The TEdenovo pipeline compares the genome with itself to identify and classify repeated genomic elements. All-by-all alignments were conducted with NCBI-BLAST+ using default TEdenovo parameters. LTRharvest38 was used for structural detection. During clustering, Grouper, Recon and Plier steps were invoked both with and without structural detection. Consensus building was performed using default parameters. During consensus detect features, repeat scout39 was invoked, and Pfam26.0 HMM profiles40 and Repbase (v18.08) nucleotide and amino acid databanks were used. Finally, consensus classification, filtering and clustering were performed using default parameters.

Output from the TEdenovo pipeline was used as input to the TEannot pipeline. This pipeline mines the genome sequence using repeated sequences identified in the previous TEdenovo pipeline to produce classified non-redundant consensus repeat sequences along with short simple repeats, which are exported to GFF3 format. First, a set of perfectly matching sequences from the TEdenovo-output transposable elements (TE) library was selected by running a subset of the TEannot pipeline, producing a working reference TE library. This TE library was used in a full run of the TEannot pipeline. For alignment of the reference TE library, NCBI-BLAST+ was used, and blaster, repeat masker and censor steps were run both on the reference TE library and on randomized chunks. Filtering was applied using default parameters. Short simple repeats were identified using the crossmatch engine. Merging was performed using default parameters. For comparisons, Repbase (v18.08) nucleotide and amino acids databanks were used. Finally, filtering was applied using default parameters, and annotations were exported to GFF3 format.

To classify identified repeats, non-redundant consensus repeat sequences as output by TEannot were annotated via PASTEClassifier v1.0 (https://urgi.versailles.inra.fr/Tools/PASTEClassifier/README). To classify these sequences, Repbase (v18.08)41 nucleotide and amino acid sequences were used, as were Pfam v26.0 (http://pfam.xfam.org/) HMM repeat profiles. Finally, identified LTRs were classified as Gypsy if homology or motif evidence existed for Gypsy and not for Copia, classified as Copia if the opposite were true, and otherwise classified as unknown.

Centromere and telomere identification

Centromeric repeats were identified using an approach outlined in ref. 42. Tandem repeat finder (TRF, Version 4.07b)43 was used to find tandem repeats using the parameters ‘1 1 2 80 5 200 2000 -d -h’ in order to find high order repeats. The resulting ‘.dat’ file was transformed into a GFF3 file, which was used to identify telomeric and centromeric repeats. To identify the centromeric repeats, the largest repeat arrays (period length × copy number) were identified and clustered. Clustered centromeric repeat regions were transformed into FASTA files and aligned using clustalX to identify array sequence composition and orientation. The base centromere repeat was 155 bp, with higher-order dimers (310 bp), trimers (465 bp) and tetramers (620 bp) (Extended Data Fig. 2 and Supplementary Table 1). The three largest centromeric arrays (contigs 003, 028 and 064) were >400 kb and resolved into large inverted repeats, consistent with them being full length. The telomeric repeats were identified by searching the ends of contigs for short (~7 bp) high copy number repeats; 18 telomeric repeat sequences with the monomer ‘AAACCCT’ were identified (Extended Data Table 1).
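The search for telomeric arrays at contig ends can be illustrated with a short Python sketch. It is a simplified stand-in for the TRF-based procedure described above: it only counts perfect tandem copies of the ‘AAACCCT’ monomer (on either strand) near each contig end. The 2,000-bp search window and the function names are assumptions chosen for illustration.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_tandem_run(segment, monomer):
    """Largest number of back-to-back perfect copies of monomer
    (or its reverse complement) found anywhere in segment."""
    best = 0
    for unit in (monomer, revcomp(monomer)):
        for match in re.finditer(f"(?:{unit})+", segment):
            best = max(best, len(match.group(0)) // len(unit))
    return best


def telomere_copies_at_ends(contig, monomer="AAACCCT", window=2000):
    """Count perfect tandem monomer copies within `window` bp of each end."""
    contig = contig.upper()
    return {
        "start": longest_tandem_run(contig[:window], monomer),
        "end": longest_tandem_run(contig[-window:], monomer),
    }
```

Because only perfect copies are counted, a real array containing sequencing errors or variant monomers would be reported as several shorter runs; tolerant matching, as performed by TRF, avoids this.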

Transcriptome assembly

Total RNA was extracted from fresh, desiccated and 24-h post rehydration Oropetium leaf tissues with 2 biological replicates collected for each tissue. RNA-seq libraries were prepared from the total RNA and bar-coded using TruSeq RNA Sample Prep Kits (Illumina) according to the manufacturer’s protocol. Raw Illumina RNA-seq data from the six libraries were trimmed for quality using Trimmomatic (v.0.32; ref. 33). Illumina sequence adaptors were removed, then leading low-quality (below quality 3) and N base pairs were trimmed and, finally, resulting trimmed reads were scanned using a 4-bp sliding window and cut when the average quality per base dropped below 30. Read pairs where both reads were ultimately of at least 36 base pairs in length following this quality control process were retained and used for subsequent analyses. Trinity (v.r20140717)44 was used to assemble quality filtered data. Assembled transcripts were aligned to our genome sequence using NCBI blastn v.2.2.30+ with an e-value cut-off of 1 × 10⁻⁵. Successfully aligned transcripts were clustered at 90% identity using CD-HIT (v. 4.5.4)45, with representative sequences from each cluster retained and used to help parameterize gene calling. Eighty-seven per cent of the trimmed RNA-seq reads aligned to the Oropetium genome, suggesting that the genome is largely complete (Supplementary Table 5). Reads that failed to align may have been contaminants from other organisms.

Gene annotation

Maker v2.31.8 (ref. 46; http://www.yandell-lab.org/software/maker.html) was used to identify putative genes. Aligned and representative sequences from our transcriptome assembly were input to Maker as expressed sequence tag evidence. Rice and Brachypodium proteome sequences were clustered at 90% identity using CD-HIT (v. 4.5.4)45, with representative sequences from each cluster retained and input to Maker as multi-organismal protein homology evidence. The Oropetium repeat database was input to Maker as a custom repeat library. SNAPhmm, Augustus, and GeneMarkHMM were invoked by Maker and were initially trained using rice and maize. Only genes for which the encoded protein was predicted to contain a complete open reading frame were retained.

On the basis of the gene annotations provided by Maker, cufflinks (v2.2.1)47 was used to identify predicted genes without empirical expression evidence. Quality-trimmed data from all six RNA-seq libraries were input simultaneously to cufflinks, with results used to identify genes with and without expression.

Protein sequences from genes predicted by Maker were functionally annotated using NCBI blastp v.2.2.30+ versus the NCBI non-redundant refseq protein database (http://www.ncbi.nlm.nih.gov/refseq/), versus the UniProt database48, and using InterProScan (v. 5.6-48.0)49.

Finally, Maker-predicted genes were pruned based on a Maker-defined annotation edit distance (AED) score that measures distance between the predicted gene and the evidence input to Maker, non-redundant (NR) annotation, Uniprot annotation, InterProScan annotation and expression level as output by cufflinks. Genes were removed that had no alignment evidence (AED = 1), no sequence match to either the NR or Uniprot databases, no InterProScan predicted domains and no expression evidence in our RNA-seq data.
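The pruning rule above removes a gene only when every line of evidence is absent. A minimal Python sketch of that logic follows; the record layout (`GeneEvidence`) and its field names are hypothetical, and treating any expression value above zero as expression evidence is an assumption, because no threshold is stated.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene summary of the evidence types named in the text."""
    gene_id: str
    aed: float                # Maker annotation edit distance (1 = no alignment evidence)
    has_nr_hit: bool          # blastp match in the NR database
    has_uniprot_hit: bool     # blastp match in UniProt
    n_interpro_domains: int   # InterProScan predicted domains
    expression: float         # cufflinks expression estimate


def is_unsupported(gene):
    """True only if the gene lacks all four kinds of support."""
    return (
        gene.aed >= 1.0
        and not (gene.has_nr_hit or gene.has_uniprot_hit)
        and gene.n_interpro_domains == 0
        and gene.expression <= 0.0   # assumption: any value > 0 counts as expressed
    )


def prune(genes):
    """Return the genes that retain at least one line of support."""
    return [g for g in genes if not is_unsupported(g)]
```

A gene with AED = 1 but a single InterProScan domain, for example, would be retained under this rule.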

Synteny and comparative genomics

Genome data sets from Setaria, Sorghum, rice and Brachypodium were downloaded from Phytozome (version 9.1) and subject to pairwise genome alignments against the Oropetium genome. For each pairwise alignment, the coding sequences of predicted gene models are compared to each other using adaptive seeds50. Our synteny search pipeline defines syntenic blocks by chaining the large-scale alignment tool (LAST) hits with a distance cut-off of 20 genes apart, also requiring at least four gene pairs per syntenic block. The syntenic blocks were further screened using QUOTA-ALIGN51 to retain one-to-one blocks and to exclude weak blocks derived from shared ancient duplications. The resulting dot plots were visually inspected to confirm the structural similarity of the Oropetium genome in relation to other genomes (Extended Data Fig. 3a–e).
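The block-definition step (hits chained when no more than 20 genes apart, with at least four gene pairs per block) can be sketched in Python as below. This is a deliberately simplified single-linkage grouping, not the synteny pipeline used here: it assumes all anchors lie on one pair of chromosomes, represents each gene by its rank order along the chromosome, and ignores hit scores and orientation. The function name `syntenic_blocks` is hypothetical.

```python
def syntenic_blocks(anchors, max_dist=20, min_pairs=4):
    """Group anchor gene pairs into candidate syntenic blocks.

    anchors is a list of (query_gene_rank, target_gene_rank) tuples.
    Two anchors are linked when they are at most max_dist genes apart
    in both genomes; linked anchors are merged transitively, and
    groups with fewer than min_pairs anchors are discarded.
    """
    parent = list(range(len(anchors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (q1, t1) in enumerate(anchors):
        for j in range(i + 1, len(anchors)):
            q2, t2 = anchors[j]
            if abs(q1 - q2) <= max_dist and abs(t1 - t2) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i, anchor in enumerate(anchors):
        groups.setdefault(find(i), []).append(anchor)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_pairs)
```

The all-against-all comparison is quadratic in the number of anchors, which is acceptable for illustration but would be replaced by a sorted sweep or dynamic-programming chain on whole genomes.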

Pairwise genomic alignments, described above, combined with OrthoMCL52 analyses filtered to one-to-one hits were used to identify orthologous gene clusters between Oropetium and Sorghum, rice, Vitis and Arabidopsis. The complete Oropetium–Arabidopsis orthologue list was then filtered to focus on genes with functional data in the STRING v9.1 global Arabidopsis protein interaction network53. Gene expression patterns and duplicated genes (tandem and whole-genome duplicates) were mapped onto this network using Cytoscape v3.1.1 (ref. 54) to identify clusters of co-expressed and interacting duplicate genes, respectively (Extended Data Fig. 6). Various network statistics were calculated using NetworkAnalyzer55, including average number of neighbours (that is, protein interactions) and total number of isolated nodes (that is, without known interactors).

Constructing a gene interaction network

We constructed a gene interaction network for Oropetium on the basis of orthologous relationships with Arabidopsis genes with validated interactions and expression data yielding a network with 4,421 nodes (gene products) with 36,918 edges (interactions). This network encompasses most metabolic pathways including photosynthesis, core anabolic and catabolic processes and stress response pathways (Extended Data Fig. 6).