Bread wheat (Triticum aestivum, AABBDD) is one of the most widely cultivated and consumed food crops in the world. However, the complex polyploid nature of its genome makes genetic and functional analyses extremely challenging. The A genome, as a basic genome of bread wheat and other polyploid wheats, for example, T. turgidum (AABB), T. timopheevii (AAGG) and T. zhukovskyi (AAGGAmAm), is central to wheat evolution, domestication and genetic improvement1. The progenitor species of the A genome is the diploid wild einkorn wheat T. urartu2, which resembles cultivated wheat more extensively than do Aegilops speltoides (the ancestor of the B genome3) and Ae. tauschii (the donor of the D genome4), especially in the morphology and development of spike and seed. Here we present the generation, assembly and analysis of a whole-genome shotgun draft sequence of the T. urartu genome. We identified protein-coding gene models, performed genome structure analyses and assessed its utility for analysing agronomically important genes and for developing molecular markers. Our T. urartu genome assembly provides a diploid reference for analysis of polyploid wheat genomes and is a valuable resource for the genetic improvement of wheat.
Bread wheat is one of the most important food crops worldwide, and provides about 20% of the calories consumed by humans5. To accelerate wheat improvement, a substantial amount of research has been conducted on the genome. The International Wheat Genome Sequencing Consortium aims at flow-sorting and sequencing the individual chromosomes of bread wheat, and significant progress has been made with several chromosomes, for example 3B (ref. 6) and 4A (ref. 7). More recently, a whole-genome shotgun sequence analysis of bread wheat and its diploid relatives8 has allocated more than 60% of the genes to the A, B and D genomes with more than 70% confidence. The sequence of diploid progenitor genomes will allow the complete and unambiguous assignment of their homeologous relationships.
We sequenced T. urartu accession G1812 (PI428198) using a whole-genome shotgun strategy on the Illumina HiSeq 2000 platform, and assembled the genome using SOAPdenovo (v. 1.05)9 with 448.49 gigabases (Gb) of filtered high-quality sequence data (Supplementary Information). We estimated the genome size of T. urartu to be 4.94 Gb (Supplementary Information), which is consistent with previous reports of 4.8–5.7 Gb (refs 10, 11). The genome assembly reached 3.92 Gb with a contig N50 size (the contig length at or above which 50% of the assembly is covered) of 3.42 kilobases (kb). After gap closure, the draft assembly was 4.66 Gb with a scaffold N50 length of 63.69 kb (Table 1 and Supplementary Information). The length of the contigs that contained intact or partial genes ranged from 200 base pairs (bp) to 65.8 kb, with an average length of 9.91 kb. The assembly was evaluated by comparisons with published bacterial artificial chromosome and expressed sequence tag (EST) sequences and by validation with PCR (Supplementary Information), and both indicated that the draft sequence had extensive genome coverage with high accuracy. The distribution of GC content in the T. urartu genome was comparable with those in the genomes of rice12, maize13, sorghum14 and Brachypodium distachyon15 (Supplementary Information).
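The N50 statistic quoted for contigs and scaffolds can be illustrated with a minimal sketch. It assumes the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the function name `n50` and the toy lengths are hypothetical and are not taken from the assembly pipeline described in Supplementary Information.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, arbitrary units): total = 32, half = 16.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))  # prints 8
```

Because the statistic depends only on the length distribution, it says nothing about correctness of the sequences, which is why the assembly was evaluated separately against independent data.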
Genome annotation of the assembly was performed as described in Supplementary Information. About 66.88% of the T. urartu assembly was identified as repetitive elements, including long terminal repeat retrotransposons (49.07%), DNA transposons (9.77%) and unclassified elements (8.04%) (Supplementary Information). The proportion of repetitive DNA was lower than the roughly 80% previously reported16, which is probably due to a decreased incorporation of repeat sequence reads into the assemblies.
To facilitate gene prediction, we generated a 116.65-megabase (Mb) transcriptome of T. urartu with 67.14 Gb of RNA-Seq data from eight different tissues and treatments using the HiSeq 2000 platform, and 49,935 assembled transcripts from six tissues using the Roche 454 sequencing platform (Supplementary Information). These data, together with publicly available ESTs from hexaploid wheat, and homologues from sequenced grass genomes12,13,14,15,17, were used as evidence in gene prediction (Supplementary Information). In total, we predicted 34,879 protein-coding gene models. The average gene size was 3,207 bp, with a mean of 4.7 exons per gene, which was similar to that found for B. distachyon (5.2)15 but slightly higher than that of rice (3.8)12, maize (4.1)13 and sorghum (4.3)14. In comparison with the 28,000 genes estimated for the A genome of hexaploid wheat7, our gene set for T. urartu contained 6,800 more members, indicating a more complete representation of genes in our analysis. However, the different approaches used in this work and in a previous study7, and the extensive loss of genes in the hexaploid A genome compared with its diploid progenitor8, may also have contributed to this difference.
We also obtained 14,222,170 small RNA (sRNA) reads (18–30 bp) representing 4,369,970 unique sRNA tags. In total, 412 conserved and 24 new microRNAs (miRNAs) distributed into 116 families were identified. Comparison with the miRNAs of five monocots and five dicots showed that 73 miRNA families were specific to monocots, of which 23 were uniquely present in T. urartu. We predicted 244 target genes for these miRNAs and found that the target gene (TRIUR3_06170) of miRNA MIR5050 responded to cold treatment, which provides a new resource for investigating the regulation of cold adaptation through miRNA (Supplementary Information).
The gene families of T. urartu were compared with those of rice12, maize13, sorghum14 and B. distachyon15 using OrthoMCL18 (Supplementary Information). We identified 24,339 families in the five grasses. Of these, 9,836 families, which contained 68,464 genes, were common to all five species. Another 1,103 families, containing 3,425 genes, were specific to T. urartu (Fig. 1a). GO analysis of the T. urartu-specific families revealed that 556, 230 and 841 genes were involved in biological processes, cellular components and molecular functions, respectively. In total, 2,067 Pfam domains were shared among the five species. Of these, 14 Pfam domains had differences in member numbers in T. urartu compared with the other four grasses (Fig. 1b). These included NB-ARC and serine–threonine/tyrosine-protein kinase domains that were markedly increased in T. urartu, and C3HC4 RING-type and pathogenesis-related transcriptional factor/ERF domains that were significantly decreased. However, determination of the significance and accuracy of these differences will require more detailed analysis.
Given that NB-ARC domain proteins function mainly in disease resistance19, we analysed the genes encoding R proteins in the T. urartu genome and identified 593 such genes, which were more abundant than in B. distachyon (197), rice (460), maize (106) and sorghum (211) (Supplementary Information). Compared with barley genome data17, the proportion of NBS-LRR type genes in T. urartu (1.21%) was also substantially higher than that in barley (0.73%). These analyses indicate that there was a specific expansion of R genes in the T. urartu genome.
The scaffolds and gene models of T. urartu were assigned to chromosomes by using genetically mapped bread wheat ESTs20 as queries to search for homologous sequences in the T. urartu assembly (Supplementary Information). A total of 8,715 scaffolds, harbouring 14,578 genes (41.8% of the total predicted genes), were mapped to 45 chromosomal regions of the wheat A genome. Syntenic alignments between the T. urartu and B. distachyon genomes were constructed by using a set of 14,578 orthologous genes (Fig. 2a). These gene-based alignments are consistent with, and add detail to, the broad framework of genome synteny between wheat and B. distachyon proposed previously15.
The 4.94-Gb T. urartu genome is more than 18 times larger than the 272-Mb genome of B. distachyon. Given that the average gene size of T. urartu is similar to that of B. distachyon, and the predicted gene number (34,879) is only about 1.37-fold that of B. distachyon (25,532), the larger genome size of T. urartu might be due to increased intergenic spaces. We therefore compared the intergenic space of the syntenic blocks between T. urartu and B. distachyon (Supplementary Information). About 21% of T. urartu genes had similarly sized intergenic spaces to those in B. distachyon, but most T. urartu genes were separated by a greatly increased intergenic space enriched in Gypsy and Copia retrotransposons, and were present in separate scaffolds (Fig. 2b). This provides genome-sequence-scale evidence for the role of repeat expansion in genome size enlargement during the evolution of the tribe Triticeae.
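The arithmetic behind this comparison can be reproduced in a few lines. The sketch below uses only the genome sizes and gene counts stated in the text; the "genome length per gene" figure is a crude illustrative quantity introduced here (it is not reported in the paper, and it is not the intergenic-space measurement made on syntenic blocks), and all variable names are hypothetical.

```python
# Values taken from the text.
genome_tu_kb = 4.94e6   # T. urartu estimated genome size: 4.94 Gb, in kb
genome_bd_kb = 272e3    # B. distachyon genome size: 272 Mb, in kb
genes_tu = 34_879       # predicted T. urartu gene models
genes_bd = 25_532       # B. distachyon gene number quoted in the text

size_ratio = genome_tu_kb / genome_bd_kb   # about 18.2-fold
gene_ratio = genes_tu / genes_bd           # about 1.37-fold

# Illustrative only: total genome length divided by gene number.
# This lumps genes and intergenic DNA together, so it is an upper bound
# on mean gene spacing rather than a measured intergenic distance.
kb_per_gene_tu = genome_tu_kb / genes_tu
kb_per_gene_bd = genome_bd_kb / genes_bd

print(f"genome size ratio: {size_ratio:.2f}")
print(f"gene number ratio: {gene_ratio:.2f}")
print(f"kb per gene: T. urartu {kb_per_gene_tu:.1f}, B. distachyon {kb_per_gene_bd:.1f}")
```

With similar average gene sizes in the two species, a genome-size ratio far larger than the gene-number ratio leaves non-genic sequence as the main candidate for the difference, which is the reasoning that motivates the intergenic-space comparison.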
We next demonstrated the utility of the T. urartu draft genome sequence for finding agronomically important genes through identifying the T. urartu homologue of OsGASR7, a gibberellin-regulated gene that controls grain length in rice21. We found two haplotypes (H1 and H2) for TuGASR7 in 92 diverse T. urartu accessions collected from different regions. H1 was significantly associated with greater values of grain length and grain weight (Fig. 3 and Supplementary Information). We also found natural variation of TaGASR7 in bread wheat, with the elite variant associated with improved yield-related traits, suggesting that TaGASR7 is of use for the improvement of wheat yield (Supplementary Information).
The T. urartu assembly also served as a rich resource for the development of genetic markers for molecular breeding through genomic selection. We identified 739,534 insertion-site-based polymorphism (ISBP) markers and 166,309 simple sequence repeats (SSRs) (Supplementary Information). PCR validation showed that 94.5% of the SSRs and 87% of the ISBP markers gave the expected products, and that 33.61% of the SSRs and 10.19% of the ISBP markers were specific to the A genome. Moreover, 28.7% of the SSR loci were polymorphic in bread wheat (Supplementary Information). To enable the identification of single nucleotide polymorphisms (SNPs), we re-sequenced another T. urartu accession (DV2138) and obtained 78.6 Gb of high-quality data. Comparison of the genome data between the two T. urartu accessions (G1812 versus DV2138) allowed the discovery of 2,989,540 SNPs, which will be useful for the future development of SNP markers (Supplementary Information).
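To make the notion of an SSR locus concrete, the following is a minimal sketch of how perfect tandem repeats can be located in a sequence with a regular expression. It is not the marker-discovery pipeline used for the assembly (that is described in Supplementary Information); the function name `find_ssrs`, the motif lengths and the minimum repeat counts are all assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_repeats=None):
    """Find perfect simple sequence repeats in a DNA string.

    min_repeats maps motif length -> minimum number of tandem copies.
    The defaults are hypothetical thresholds, not those of the paper.
    Returns a sorted list of (start, motif, copies) tuples (0-based start).
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4}
    seq = seq.upper()
    hits = []
    for k, min_rep in sorted(min_repeats.items()):
        # A k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT"),
            # so each locus is reported under its shortest motif only.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


# Toy sequence (hypothetical): an (AT)7 repeat and a (GAA)5 repeat.
toy = "GGC" + "AT" * 7 + "CCG" + "GAA" * 5 + "TT"
print(find_ssrs(toy))  # prints [(3, 'AT', 7), (20, 'GAA', 5)]
```

In practice, flanking sequence around each hit would be extracted for primer design, and loci would then be tested by PCR, as was done for the markers reported here.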
Previous studies have revealed that more than half of the 60 meta-quantitative trait loci (meta-QTLs)22,23 related to wheat yield traits are present in the A genome of bread wheat, and three meta-QTLs (MQTL_5, 6 and 7) are located on chromosome 5A (ref. 22). We therefore searched the T. urartu scaffolds using available markers around the three meta-QTLs (Supplementary Information). We found ten scaffolds with a total length of 772,014 bp that were distributed in the 14-centimorgan (cM) region of MQTL_5; nine scaffolds with a combined length of 783,140 bp were located in the 15-cM region containing MQTL_6; and six scaffolds with an overall length of 529,604 bp were assigned to the 9-cM region harbouring MQTL_7. The sequence information of these scaffolds will expedite the development of more polymorphic markers within the three meta-QTL regions and facilitate the identification of their corresponding genes.
Our T. urartu draft genome sequence provides new insights into the A genome that is shared by many polyploid wheat species. The large set of gene models (34,879) and abundant genetic markers anchored in sequence scaffolds, together with the emerging genomic resources from bread wheat8, promise to accelerate deeper and more systematic genomic and breeding studies of bread wheat that are required to meet the future challenges of food security and sustainable agriculture.
The genome of T. urartu accession G1812 was sequenced on the Illumina HiSeq 2000 platform. These data were used to assemble the draft genome sequence with the use of the SOAPdenovo9 software. RNA-Seq data were generated on the same platform and on the Roche 454 platform for genome annotation and transcriptome analysis. Repeat sequences were identified through sequence similarity at the nucleotide and protein levels24. Protein-coding genes were predicted by using an ab initio approach, sequence similarity search and RNA-Seq data to build reliable gene models25. Detailed methodology descriptions are given in Supplementary Information.
This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under accession number AOTI00000000. Sequence assemblies and all short-read data are under project accession numbers SRA030525 (genomic short reads), SRA066084 (resequencing reads), PRJNA182347 (assembly and annotation) and SRA064213 (RNA-Seq). The version described in this paper is the first version, AOTI01000000. Genomic data are also available at the Comprehensive Library for Modern Biotechnology (CLiMB) repository26.
We thank L. Goodman for assistance in editing the manuscript, and M. Bevan and Y.-B. Xue for critical reading of the manuscript. This work was supported by grants from the Ministry of Science and Technology of China (the special fund for the State Key Laboratory of Plant Cell and Chromosome Engineering, 2010DFB33540, 2011CB100304, 2011AA100104 and 2009CB118300).