Rice, one of the world's most important food plants, has important syntenic relationships with the other cereal species and is a model plant for the grasses. Here we present a map-based, finished quality sequence that covers 95% of the 389 Mb genome, including virtually all of the euchromatin and two complete centromeres. A total of 37,544 non-transposable-element-related protein-coding genes were identified, of which 71% had a putative homologue in Arabidopsis. In a reciprocal analysis, 90% of the Arabidopsis proteins had a putative homologue in the predicted rice proteome. Twenty-nine per cent of the 37,544 predicted genes appear in clustered gene families. The number and classes of transposable elements found in the rice genome are consistent with the expansion of syntenic regions in the maize and sorghum genomes. We find evidence for widespread and recurrent gene transfer from the organelles to the nuclear chromosomes. The map-based sequence has proven useful for the identification of genes underlying agronomic traits. The additional single-nucleotide polymorphisms and simple sequence repeats identified in our study should accelerate improvements in rice production.
Rice (Oryza sativa L.) is the most important food crop in the world and feeds over half of the global population. As the first step in a systematic and complete functional characterization of the rice genome, the International Rice Genome Sequencing Project (IRGSP) has generated and analysed a highly accurate finished sequence of the rice genome that is anchored to the genetic map. Our analysis has revealed several salient features of the rice genome:
We provide evidence for a genome size of 389 Mb. This estimate is ∼260 Mb larger than the genome of the fully sequenced dicot model plant Arabidopsis thaliana. We generated 370 Mb of finished sequence, representing 95% coverage of the genome and virtually all of the euchromatic regions.
A total of 37,544 non-transposable-element-related protein-coding sequences were detected, compared with ∼28,000–29,000 in Arabidopsis, with a lower gene density of one gene per 9.9 kb in rice. A total of 2,859 genes seem to be unique to rice and the other cereals, some of which might differentiate monocot and dicot lineages.
Gene knockouts are useful tools for determining gene function and relating genes to phenotypes. We identified 11,487 Tos17 retrotransposon insertion sites, of which 3,243 are in genes.
Between 0.38 and 0.43% of the nuclear genome contains organellar DNA fragments, representing repeated and ongoing transfer of organellar DNA to the nuclear genome.
The transposon content of rice is at least 35% and is populated by representatives from all known transposon superfamilies.
We have identified 80,127 polymorphic sites that distinguish between two cultivated rice subspecies, japonica and indica, resulting in a high-resolution genetic map for rice. Single-nucleotide polymorphism (SNP) frequency varies from 0.53 to 0.78%, which is 20 times the frequency observed between the Columbia and Landsberg erecta ecotypes of Arabidopsis.
A comparison between the IRGSP genome sequence and the 6.3 × indica and 6 × japonica whole-genome shotgun sequence assemblies revealed that the draft sequences provided coverage of 69% by indica and 78% by japonica relative to the map-based sequence.
Rice has played a central role in human nutrition and culture for the past 10,000 years. It has been estimated that world rice production must increase by 30% over the next 20 years to meet projected demands from population increase and economic development1. Rice grown on the most productive irrigated land has achieved nearly maximum production with current strains1. Environmental degradation, including pollution, increase in night time temperature due to global warming2, reductions in suitable arable land, water, labour and energy-dependent fertilizer provide additional constraints. These factors make steps to maximize rice productivity particularly important. Increasing yield potential and yield stability will come from a combination of biotechnology and improved conventional breeding. Both will be dependent on a high-quality rice genome sequence.
Rice benefits from having the smallest genome of the major cereals, dense genetic maps and relative ease of genetic transformation3. The discovery of extensive genome colinearity among the Poaceae4 has established rice as the model organism for the cereal grasses. These properties, along with the finished sequence and other tools under development, set the stage for a complete functional characterization of the rice genome.
The International Rice Genome Sequencing Project
The IRGSP, formally established in 1998, pooled the resources of sequencing groups in ten nations to obtain a complete finished quality sequence of the rice genome (Oryza sativa L. ssp. japonica cv. Nipponbare). Finished quality sequence is defined as containing less than one error in 10,000 nucleotides, having resolved ambiguities, and having made all state-of-the-art attempts to close gaps. The IRGSP released a high-quality map-based draft sequence in December 2002. Three completely sequenced chromosomes have been published5,6,7, as well as two completely sequenced centromeres8,9,10. As the IRGSP subscribed to an immediate-release policy, high-quality map-based sequence has been public for some time. This has permitted rice geneticists to identify several genes underlying traits, and revealed very large and previously unknown segmental duplications that comprise 60% of the genome11,12,13. The public sequence has also revealed new details about the syntenic relationships and gene mobility between rice, maize and sorghum13,14,15.
Physical maps, sequencing and coverage
The IRGSP sequenced the genome of a single inbred cultivar, Oryza sativa ssp. japonica cv. Nipponbare, and adopted a hierarchical clone-by-clone method using bacterial and P1 artificial chromosome clones (BACs and PACs, respectively). This strategy used a high-density genetic map16, expressed-sequence tags (ESTs)17, yeast artificial chromosome (YAC)- and BAC-based physical maps18,19,20, BAC-end sequences21 and two draft sequences22,23. A total of 3,401 BAC/PAC clones (Table 1) were sequenced to approximately tenfold sequence coverage, assembled, ordered and finished to a sequence quality of less than one error per 10,000 bases. A majority of physical gaps in the BAC/PAC tiling path were bridged using a variety of substrates, including PCR fragments, 10-kb plasmids and 40-kb fosmid clones. A total of 62 unsequenced physical gaps, including nine centromere and 17 telomere gaps, remain on the 12 chromosomes (Table 2). Chromosome arm and telomere gaps were measured, and the nine centromere gaps were estimated on the basis of CentO satellite DNA content. The remaining gaps are estimated to total 18.1 Mb.
Ninety-seven per cent of the BAC/PACs and gap sequences (3,360) have been submitted as finished-quality sequence to the PLN division of GenBank/DDBJ/EMBL. These and the remaining draft-sequenced clones were used to construct pseudomolecules representing the 12 chromosomes of rice (Fig. 1). The total nucleotide sequence of the 12 pseudomolecules is 370,733,456 bp, with an N-average continuous sequence length of 6.9 Mb (see Table 1 for a definition of N-average length). Sequence quality was assessed by comparing 1.2 Mb of overlapping sequence produced by different laboratories. The overall accuracy was calculated as 99.99% (Supplementary Table 2). The statistics of sequenced PAC/BAC clones and pseudomolecules for each chromosome are shown in Table 1.
The genome size of rice (O. sativa ssp. japonica cv. Nipponbare) was reported to have a haploid nuclear DNA content of 394 Mb on the basis of flow cytometry24, and 403 Mb on the basis of lengths of anchored BAC contigs and estimates of gap sizes20. Table 2 shows the calculated size for each chromosome and the estimated coverage. Adding the estimated length of the gaps to the sum of the non-overlapping sequence, the total length of the rice nuclear genome was calculated to be 388.8 Mb. Therefore, the pseudomolecules are expected to cover 95.3% of the entire genome and an estimated 98.9% of the euchromatin. An independent measure of genome coverage represented by the pseudomolecules was obtained by searching for unique EST markers19; of 8,440 ESTs, 8,391 (99.4%) were identified in the pseudomolecules.
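The size and coverage figures in this paragraph follow from simple arithmetic on the quoted values. A minimal sketch in Python, assuming that the "sum of the non-overlapping sequence" equals the pseudomolecule total reported above; the variable names are illustrative and not part of the original analysis:

```python
# Values quoted in the text; variable names are illustrative only.
PSEUDOMOLECULE_BP = 370_733_456   # total length of the 12 pseudomolecules
GAP_MB = 18.1                     # estimated total of the remaining gaps
EST_TOTAL = 8_440                 # unique EST markers searched
EST_FOUND = 8_391                 # markers located in the pseudomolecules

sequenced_mb = PSEUDOMOLECULE_BP / 1e6
genome_mb = sequenced_mb + GAP_MB             # sequenced bases plus gap estimate
coverage_pct = 100 * sequenced_mb / genome_mb
est_pct = 100 * EST_FOUND / EST_TOTAL

print(f"genome size  ~ {genome_mb:.1f} Mb")
print(f"coverage     ~ {coverage_pct:.1f}%")
print(f"ESTs located ~ {est_pct:.1f}%")
```

With these inputs the three values round to 388.8 Mb, 95.3% and 99.4%, matching the figures given in the text.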
Typical eukaryotic centromeres contain repetitive sequences, including satellite DNA at the centre and retrotransposons and transposons in the flanking regions. All rice centromeres contain the highly repetitive 155–165 bp CentO satellite DNA, together with centromere-specific retrotransposons25,26. The CentO satellites are located within the functional domain of the rice centromere10,26. Complete sequencing of the centromeres of rice chromosomes 4 and 8 revealed that they consist of 59 kb and 69 kb of clustered CentO repeats (respectively)8,9,10, tandemly arrayed head-to-tail within the clusters. Numerous retrotransposons, including the centromere-specific RIRE7, are found between and around the CentO repeats. CentO clusters show differences in length and orientation for the two centromeres.
BLASTN analysis of the pseudomolecules indicated that about 0.9 Mb of CentO repeats (corresponding to more than 5,800 copies of the satellite) were sequenced and found to be associated with centromere-specific retroelements. Locations of all CentO sequences correspond to genetically identified centromere regions (Supplementary Table 3). Our pseudomolecules cover the centromere regions on chromosomes 4, 5 and 8, and portions of the centromeres on the remaining chromosomes (Fig. 1).
Gene content, expression and distribution
We masked the pseudomolecules for repetitive sequences and used the ab initio gene finder FGENESH to identify only non-transposable-element-related genes. A total of 37,544 non-transposable-element protein-coding sequences were predicted, resulting in a density of one gene per 9.9 kb (Supplementary Tables 4 and 5). As the ability to identify unannotated and transposable-element-related genes improves, the true protein-coding gene number in rice will doubtless be revised.
Full-length complementary DNA sequences are available for rice27, and provide a powerful resource for improving gene model structure derived from ab initio gene finders28. Of the 37,544 non-transposable-element-related FGENESH models, 17,016 could be supported by a total of 25,636 full-length cDNAs (Supplementary Table 6).
A total of 22,840 (61%) genes had a high identity match with a rice EST or full-length cDNA. On average, about 10.7 EST sequences were present for each expressed rice gene. A total of 2,927 genes aligned well with ESTs from other cereal species, and 330 of these genes matched only with a non-rice cereal EST (Supplementary Fig. 1). Except for the short arms of chromosomes 4, 9 and 10, which are known to be highly heterochromatic, the density of expressed genes is greater on the distal portions of the chromosome arms compared with the regions around the centromeres (Supplementary Fig. 2).
A total of 19,675 proteins had matches with entries in the Swiss-Prot database; of these, 4,500 had no expression support. Domain searches revealed a minimum of one motif or domain present in 63% of the predicted proteins, with a total of 3,328 different domains present in the predicted rice proteome. The five most abundant domains were associated with protein kinases (Supplementary Table 7). Fifty-one per cent of the predicted proteins could be associated with a biological process (Supplementary Fig. 3a), with metabolism (29.1%) and cellular physiological processes (11.9%) representing the two most abundant classes.
Approximately 71% (26,837) of the predicted rice proteins have a homologue in the Arabidopsis proteome (Supplementary Fig. 4). In a reciprocal search, 89.8% (26,004) of the proteins from the Arabidopsis genome have a homologue in the rice proteome. Of the 23,170 rice genes with rice EST, cereal EST, or full-length cDNA support, 20,311 (88%) have a homologue in Arabidopsis. Fewer putative homologues were found in other model species: 38.1% in Drosophila, 40.8% in human, 36.5% in Caenorhabditis elegans, 30.2% in yeast, 17.6% in Synechocystis and 10.2% in Escherichia coli.
There are profound differences in plant architecture and biochemistry between monocotyledonous and dicotyledonous angiosperms. Only 2,859 rice genes with evidence of transcription lack homologues in the Arabidopsis genome. We investigated these to learn what functions they encoded. The vast majority had no matches, or most closely matched unknown or hypothetical proteins. The grasses have a class of seed storage proteins called prolamins that is not found in dicots. There are also families of hormone response proteins and defence proteins, such as proteinase inhibitors, chitinases, pathogenesis-related proteins and seed allergens, many of which are tandemly repeated (Supplementary Table 8). Nevertheless, with a large number of proteins of unknown function, the most interesting differences between the genome content of these two groups of angiosperms remain to be discovered.
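The gene counts quoted in this section appear to be mutually consistent. The following Python sketch checks the bookkeeping; reading the 23,170 supported genes as the sum of the 22,840 rice-supported genes and the 330 genes matching only a non-rice cereal EST is our inference from the numbers rather than something stated explicitly, and the variable names are illustrative:

```python
# Counts quoted in the text; the way they are combined is an assumption.
TOTAL_GENES = 37_544
RICE_SUPPORTED = 22_840     # match to a rice EST or full-length cDNA
CEREAL_ONLY = 330           # match only to a non-rice cereal EST
WITH_ARABIDOPSIS = 20_311   # supported genes with an Arabidopsis homologue

supported = RICE_SUPPORTED + CEREAL_ONLY
without_arabidopsis = supported - WITH_ARABIDOPSIS

print(supported)                                    # supported genes
print(without_arabidopsis)                          # lacking an Arabidopsis homologue
print(round(100 * RICE_SUPPORTED / TOTAL_GENES))    # % with rice expression support
print(round(100 * WITH_ARABIDOPSIS / supported))    # % of supported genes with homologue
```

Under this reading the sketch reproduces the 23,170 supported genes, the 2,859 transcribed genes lacking an Arabidopsis homologue, and the 61% and 88% proportions.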
Tos17 is an endogenous copia-like retrotransposon in rice that is inactive under normal growth conditions. In tissue culture, it becomes activated, transposes and is stably inherited when the plant is regenerated29. There are only two copies of Tos17 in the rice cultivar Nipponbare. These features, together with its preferential insertion into gene-rich regions, make Tos17 uniquely suitable for the functional analysis of rice genes by gene disruption. About 50,000 Tos17-insertion lines carrying 500,000 insertions have been produced30. A total of 11,487 target loci were mapped on the 12 pseudomolecules (Supplementary Fig. 5), with at least one insertion detected in 3,243 genes. The density of Tos17 insertions is higher in euchromatic regions of the genome30, in contrast to the distribution of high-copy retrotransposons, which are more frequently found in pericentromeric regions. A similar target site preference has been reported for T-DNA insertions in Arabidopsis31.
Tandem gene families
One surprising outcome of the Arabidopsis genome analysis was the large percentage (17%) of genes arranged in tandem repeats32. A similar analysis of rice gave a comparable percentage (14%). However, manual curation on rice chromosome 10 showed one gene family encoding a glycine-rich protein with 27 copies and one encoding a TRAF/BTB domain protein with 48 copies33. These tandemly repeated families are interrupted by other genes and so are not counted under a strict definition of tandem repeats. We therefore screened for all tandemly arranged genes in 5-Mb intervals. Using these criteria, 29% of the genes (10,837) are amplified at least once in tandem, and 153 rice gene arrays contained 10–134 members (Supplementary Fig. 6). Sixty-five per cent of the tandem arrays with over 27 members, and 33% of all the arrays with over 10 members, contain protein kinase domains (Supplementary Table 9).
Non-coding RNA genes
The nucleolar organizer, consisting of 17S–5.8S–25S ribosomal DNA coding units, is found at the telomeric end of the short arm of chromosome 9 (ref. 34) in O. sativa ssp. japonica, and is estimated to comprise 7 Mb (ref. 35). A second 17S–5.8S–25S rDNA locus is found at the end of the short arm of chromosome 10 in O. sativa ssp. indica34. A single 5S cluster is present on the short arm of chromosome 11 in the vicinity of the centromere36, and encompasses 0.25 Mb.
A total of 763 transfer RNA genes, including 14 tRNA pseudogenes, were detected in the 12 pseudomolecules. In comparison, a total of 611 tRNA genes were detected in Arabidopsis32. Supplementary Fig. 7 shows the distribution of these tRNA genes in each chromosome. Chromosome 4 has a single tRNA cluster6, and chromosome 10 has two large clusters derived from inserted chloroplast DNA7. Except for regions of intermediate density on chromosomes 1, 2, 8 and 12, there seem to be no other large clusters.
MicroRNAs (miRNAs), a class of eukaryotic non-coding RNAs, are believed to regulate gene expression by interacting with the target messenger RNA37. miRNAs have been predicted from Arabidopsis38 and rice39, and we mapped 158 miRNAs onto the rice pseudomolecules (Supplementary Table 10). Among other non-coding RNAs, we identified 215 small nucleolar RNA (snoRNA) and 93 spliceosomal RNA genes, both showing biased chromosomal distributions, in the rice genome (Supplementary Table 11).
Organellar insertions in the nuclear genome
Mitochondria and chloroplasts originated from alpha-proteobacteria and cyanobacteria endosymbionts. A continuous transfer of organellar DNA to the nucleus has resulted in the presence of chloroplast and mitochondrial DNA inserted in the nuclear chromosomes. Although the endosymbionts probably contained genomes of several Mb at the time they were internalized, the organellar genomes diminished so that the present size of the mitochondrial genome is less than 600 kb, and that of the chloroplast is only 150 kb. Homology searches detected 421–453 chloroplast insertions and 909–1,191 mitochondrial insertions, depending upon the stringency adopted (Supplementary Fig. 8 and Supplementary Table 12). Thus, chloroplast and mitochondrial insertions contribute 0.20–0.24% and 0.18–0.19% of the nuclear genome of rice, respectively, and correspond to 5.3 chloroplast and 1.3 mitochondrial genome equivalents. The distribution of chloroplast and mitochondrial insertions over the 12 chromosomes indicates that mitochondrial and chloroplast transfers occurred independently. Two chromosomes harbour more insertions than the others (Supplementary Fig. 8 and Supplementary Table 12), with chromosome 12 containing nearly 1% mitochondrial DNA and chromosome 10 containing approximately 0.8% chloroplast DNA. It is clear that several successive transfer events have occurred, as insertions of less than 10 kb have heterogeneous identities. The longest insertions, however, systematically show >98.5% identity to organellar DNA (Supplementary Table 13), indicating recent insertions for both chloroplast and mitochondrial genomes.
The rice genome is populated by representatives from all known transposon superfamilies, including elements that cannot be easily classified into either class I or II (ref. 40). Previous estimates of the transposon content in the rice genome range from 10 to 25% (refs 21, 40). However, the increased availability of transposon query sequences and the use of profile hidden Markov models allow the identification of more divergent elements41 and indicate that the transposon content of the O. sativa ssp. japonica genome is at least 35% (Table 3). Chromosomes 8 and 12 have the highest transposon content (38.0% and 38.3%, respectively), and chromosomes 1 (31.0%), 2 (29.8%) and 3 (29.0%) have the lowest proportion of transposons. Conversely, elements belonging to the IS5/Tourist and IS630/Tc1/mariner superfamilies, which are generally correlated with gene density, are prevalent on the first three chromosomes and least frequent on chromosomes 4 and 12.
Class II elements, characterized by terminal inverted-repeats and including the hAT, CACTA, IS256/Mutator, IS5/Tourist, and IS630/Tc1/mariner superfamilies, outnumber class I elements, which include long terminal-repeat (LTR) retrotransposons (Ty1/copia, Ty3/gypsy and TRIM) and non-LTR retrotransposons (LINEs and SINEs, or long- and short-interspersed nucleotide elements, respectively), by more than twofold (Table 3). However, the nucleotide contribution of class I is greater than that of class II, due mostly to the large size of LTR retrotransposons and the small size of IS5/Tourist and IS630/Tc1/mariner elements. The inverse is the case for maize, for which class I elements outnumber class II elements42. Given their larger sizes, differential amplification of LTR elements in maize compared with rice is consistent with the genomic expansion found between orthologous regions of rice and maize15,33.
Most class I elements are concentrated in gene-poor, heterochromatic regions such as the centromeric and pericentromeric regions (Supplementary Table 14). In contrast, members of some transposon superfamilies, including IS5/Tourist, IS630/Tc1/mariner and LINEs, have a significant positive correlation with both recombination rate and gene density. There is an effect of average element length associated with these patterns: short elements generally show a positive correlation with recombination rate and gene density, and are under-represented in the centromere regions, whereas larger elements have higher centromeric and pericentromeric abundance.
Intraspecific sequence polymorphism
Map-based cloning to identify genes that are associated with agronomic traits is dependent on having a high frequency of polymorphic markers to order recombination events. In rice, most of the segregating populations are generated from crosses between the two major subspecies of cultivated rice, Oryza sativa ssp. japonica and O. sativa ssp. indica. Although several studies on the polymorphisms detected between japonica and indica subspecies have been reported6,43,44, the analysis reported here uses an approach that ensures comparison of orthologous sequences. O. sativa ssp. indica cv. Kasalath and O. sativa ssp. japonica cv. Nipponbare are the parents of the most densely mapped rice population16. BAC-end sequences were obtained from a Kasalath BAC library of 47,194 clones. Only high quality, single-copy sequences were mapped to the Nipponbare pseudomolecules, and only paired inverted sequences that mapped within 200 kb were considered. A total of 26,632 paired Kasalath BAC-end sequences were mapped to the 12 rice pseudomolecules (Supplementary Table 15). Kasalath BAC clones spanned 308 Mb or 79% of the Nipponbare genome. Sequence alignments with a PHRED quality value of 30 covered 12,319,100 bp (3%) of the total rice genome. A total of 80,127 sites differed in the corresponding regions in Nipponbare and Kasalath. The frequency of SNPs varied between chromosomes (0.53–0.78%). Insertions and deletions were also detected. The ratio of small insertion/deletion site nucleotides (1–14 bases) against the alignment length (0.20–0.27%) was similar among the different chromosomes, and there was no preference for the direction of insertions or deletions. The main patterns of base substitutions observed between Nipponbare and Kasalath are shown in Supplementary Table 16. Transitions (70%) were the most prominent substitutions; this is a substantially higher fraction than found between Arabidopsis ecotypes Columbia and Landsberg erecta32.
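The pairing criteria and the overall polymorphism rate described above can be illustrated with a short Python sketch. It assumes that "paired inverted sequences" means the two end reads of a clone map to the same pseudomolecule in opposite orientations; the tuple layout and function names are hypothetical and are not the pipeline used in this study:

```python
MAX_SPAN_BP = 200_000  # maximum distance between paired ends, from the text

def is_consistent_pair(hit_a, hit_b, max_span=MAX_SPAN_BP):
    """Each hit is (chromosome, position_bp, strand), with strand '+' or '-'.

    A pair is kept when both ends map to the same chromosome, in opposite
    orientations, and no more than max_span apart.
    """
    chrom_a, pos_a, strand_a = hit_a
    chrom_b, pos_b, strand_b = hit_b
    return (chrom_a == chrom_b
            and strand_a != strand_b
            and abs(pos_a - pos_b) <= max_span)

def snp_frequency_pct(polymorphic_sites, aligned_bp):
    """Polymorphic sites as a percentage of high-quality aligned bases."""
    return 100 * polymorphic_sites / aligned_bp

# Genome-wide figures quoted in the text.
overall = snp_frequency_pct(80_127, 12_319_100)
print(f"overall SNP frequency ~ {overall:.2f}%")
```

The genome-wide value computed this way is about 0.65%, which lies within the per-chromosome range of 0.53–0.78% reported above.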
Class 1 simple sequence repeats in the rice genome
Class 1 simple sequence repeats (SSRs) are perfect repeats >20 nucleotides in length45 that behave as hypervariable loci, providing a rich source of markers for use in genetics and breeding. A total of 18,828 Class 1 di, tri and tetra-nucleotide SSRs, representing 47 distinctive motif families, were identified and annotated on the rice genome (Supplementary Fig. 9). Supplementary Table 17 provides information about the physical positions of all Class 1 SSRs in relation to widely used restriction-fragment length polymorphisms (RFLPs)16,46 and previously published SSRs45. There was an average of 51 hypervariable SSRs per Mb, with the highest density of markers occurring on chromosome 3 (55.8 SSR Mb-1) and the lowest occurring on chromosome 4 (41.0 SSR Mb-1). A summary of information about the Class 1 SSRs identified in the rice pseudomolecules appears in Supplementary Table 18. Several thousand of these SSRs have already been shown to amplify well and be polymorphic in a panel of diverse cultivars45, and thus are of immediate use for genetic analysis.
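To make the Class 1 definition concrete, the following Python sketch scans a sequence for perfect di-, tri- and tetra-nucleotide repeats longer than 20 nucleotides. It is an illustration based on regular expressions, not the Simple Sequence Repeat Identification Tool used in this study; the function name and the `min_len` parameter are assumptions, and because the regular expression matches leftmost-first, a repeat may occasionally be clipped by a few bases at its start:

```python
import re

def find_class1_ssrs(seq, min_len=21):
    """Return (start, end, motif, n_units) for perfect 2-4 nt repeats.

    min_len=21 encodes "longer than 20 nucleotides"; coordinates are
    0-based and half-open.
    """
    seq = seq.upper()
    hits = []
    for unit in (2, 3, 4):
        # A motif of the given size followed by one or more exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1+" % unit)
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # homopolymer run, not a di/tri/tetra repeat
            if unit == 4 and motif[:2] == motif[2:]:
                continue  # really a dinucleotide repeat, already reported
            length = match.end() - match.start()
            if length >= min_len:
                hits.append((match.start(), match.end(), motif, length // unit))
    return sorted(hits)

example = "TTT" + "GA" * 11 + "CCC" + "CTT" * 5 + "AAA"
print(find_class1_ssrs(example))
```

In the example, the 22-nt (GA)11 tract is reported, whereas the 15-nt (CTT)5 tract falls below the length threshold.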
Genome-wide comparison of draft versus finished sequences
Two whole-genome shotgun assemblies of draft-quality rice sequence have been published23,47, and reassemblies of both have just appeared48. One of these is an assembly of 6.28 × coverage of O. sativa ssp. indica cv. 93-11. The second is an assembly of ∼6 × coverage of O. sativa ssp. japonica cv. Nipponbare23,48. These assemblies predict genome sizes of 433 Mb for japonica and 466 Mb for indica, which differ from our estimate of 389 Mb for the japonica genome. Contigs from the whole-genome shotgun assemblies of 93-11 and Nipponbare48 were aligned with the IRGSP pseudomolecules. Non-redundant coverage of the pseudomolecules by the indica assembly varied from 78% for chromosome 3 to 59% for chromosome 12, with an overall coverage of 69% (Supplementary Table 19). When genes supported by full-length cDNA coverage were aligned to the covered regions, we found that 68.3% were completely covered by the indica sequences. The average size of the indica contigs is 8.2 kb, so it is not surprising that many did not completely cover the gene models defined here. The coverage of the Nipponbare whole-genome shotgun assembly varied from 68% to 82%, with an overall coverage of 78% of the genome and of 75.3% of the full-length cDNA-supported gene models.
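Non-redundant coverage, as used above, counts each pseudomolecule base once however many contigs align to it. A minimal Python sketch of that calculation follows, assuming alignments are available as half-open (start, end) intervals on one pseudomolecule; the function names are hypothetical, and a gene is treated as completely covered only if it lies within a single merged block, which is a simplification:

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]

def nonredundant_coverage(intervals, chrom_len):
    """Fraction of the chromosome covered by at least one alignment."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / chrom_len

def is_fully_covered(gene, merged):
    """True if the gene interval lies within one merged alignment block."""
    gene_start, gene_end = gene
    return any(start <= gene_start and gene_end <= end for start, end in merged)

alignments = [(50, 150), (0, 100), (300, 400)]
blocks = merge_intervals(alignments)
print(blocks, nonredundant_coverage(alignments, 1_000))
```

Here the two overlapping alignments collapse into one 150-bp block, so a toy 1,000-bp chromosome is 25% covered rather than 30%.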
We undertook a detailed comparison of the first Mb of these assemblies on 1S (the short arm of chromosome 1) with the IRGSP chromosome 1 (Supplementary Fig. 10 and Supplementary Table 20). The numbers from this comparison agree with the whole-genome comparison described above. In addition, we observed that a substantial portion of the contigs from each assembly were non-homologous, misaligned or provided duplicate coverage. Indeed, the whole-genome shotgun assembly differed by 0.05% base-pair mismatches for the two aligned regions from the same Nipponbare cultivar. The two assemblies were further examined for the presence of the CentO sequence (Supplementary Table 21). Sixty-eight per cent of the copies observed in the 93-11 assembly and 32% of the CentO-containing contigs in the whole-genome shotgun Nipponbare assembly were found outside the centromeric regions. In contrast, the CentO repeats were restricted to the centromeric regions in the IRGSP pseudomolecules. It is unlikely that there are dispersed centromeres in indica rice; misassembly of the whole-genome shotgun sequences is a more likely explanation for dispersed CentO repeats. These observations indicate that the draft sequences, although providing a useful preliminary survey of the genome, might not be adequate for gene annotation, functional genomics or the identification of genes underlying agronomic traits.
The attainment of a complete and accurate map-based sequence for rice is compelling. We now have a blueprint for all of the rice chromosomes. We know, with a high level of confidence, the distribution and location of all the main components—the genes, repetitive sequences and centromeres. Substantial portions of the map-based sequence have been in public databases for some time, and the availability of provisional rice pseudomolecules based on this sequence has provided the scientific community with numerous opportunities to evaluate the genome, as indicated by the number of publications in rice biology and genetics over the past few years. Furthermore, the wealth of SNP and SSR information provided here and elsewhere will accelerate marker-assisted breeding and positional cloning, facilitating advances in rice improvement.
The syntenic relationships between rice and the cereal grasses have long been recognized4. Comparing genome organization, genes and intergenic regions between cereal species will permit identification of regions that are highly conserved or rapidly evolving. Such regions are expected to yield crucial insights into genome evolution, speciation and domestication.
Physical map and sequencing
Nine genomic libraries from Oryza sativa ssp. japonica cultivar Nipponbare were used to establish the physical map of rice chromosomes by polymerase chain reaction (PCR) screening19, fingerprinting20 and end-sequencing21. The PAC, BAC and fosmid clones on the physical map were subjected to random shearing and shotgun sequencing to tenfold redundancy, using both universal primers and the dye-terminator or dye-primer methods. The sequences were assembled using PHRED (http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm) and PHRAP (http://www.genome.washington.edu/UWGC/analysistools/Phrap.cfm) software packages or using the TIGR Assembler (http://www.tigr.org/software/assembler/).
Sequence gaps were resolved by full sequencing of gap-bridge clones, PCR fragments or direct sequencing of BACs. Sequence ambiguities (indicated by PHRAP scores less than 30) were resolved by confirming the sequence data using alternative chemistries or different polymerases. We empirically determined that a PHRAP score of 30 or above exceeds the standard of less than one error in 10,000 bp. BAC and PAC assemblies were tested for accuracy by comparing computationally derived fingerprint patterns with experimentally determined patterns of restriction enzyme digests. Sequence quality was also evaluated by comparing independently obtained overlapping sequences.
Small physical gaps were filled by long-range PCR. Remaining physical gaps were measured using fluorescence in situ hybridization analysis. We used the length of CentO arrays26 to estimate the size of each of the remaining centromere gaps.
Annotation and bioinformatics
Gene models were predicted with FGENESH (http://www.softberry.com/berry.phtml?topic=fgenesh), using the monocot-trained matrix on the native and repeat-masked pseudomolecules. Gene models with incomplete open reading frames, those encoding proteins of less than 50 amino acids, or those corresponding to organellar DNA were omitted from the final set. The coordinates of transposable elements, excluding MITEs (miniature inverted-repeat transposable elements), were used to mask the pseudomolecules.
Conserved domain/motif searches and association with gene ontologies were performed using InterProScan (http://www.ebi.ac.uk/InterProScan/) in combination with the InterPro2GO program. For biological processes, the number of detected domains was re-calculated as number of non-redundant proteins.
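The three exclusion criteria can be expressed as a simple filter. The Python sketch below is illustrative only: it assumes that a complete open reading frame means an ATG start, a terminal stop codon, a length divisible by three and no internal stop codon, and that organellar origin has been flagged beforehand; the function name and arguments are hypothetical and do not describe FGENESH output:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def keep_gene_model(cds, is_organellar, min_aa=50):
    """Return True if a predicted coding sequence passes all three filters."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    complete_orf = (
        len(cds) % 3 == 0
        and len(codons) >= 2
        and codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[:-1])
    )
    if not complete_orf or is_organellar:
        return False
    protein_length = len(codons) - 1  # the stop codon is not translated
    return protein_length >= min_aa

print(keep_gene_model("ATG" + "GCT" * 49 + "TAA", is_organellar=False))
print(keep_gene_model("ATG" + "GCT" * 48 + "TAA", is_organellar=False))
```

The first call describes a 50-residue protein and is retained; the second encodes 49 residues and is dropped.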
The predicted rice proteome was searched using BLASTP against the proteomes of several model species for which a complete genome sequence and deduced protein set was available. Each rice chromosome was searched against the TIGR rice gene index (http://www.tigr.org/tdb/tgi/ogi/) and against gene index entries that aligned to gene models corresponding to expressed genes. In addition, five cereal gene indices (http://www.tigr.org/tdb/tgi/) were searched against the rice chromosomes, and gene index matches were recorded. We searched the Oryza sativa ssp. japonica cv. Nipponbare collection of full-length cDNAs (ftp://cdna01.dna.affrc.go.jp/pub/data/), after first removing the transposable-element-related sequences, against the FGENESH models.
Gene models with rice full-length cDNA, EST or cereal EST matches but without identifiable homologues in the Arabidopsis genome were searched for conserved domains/motifs using InterProScan, and for homologues in the Swiss-Prot database (http://us.expasy.org/sprot/) using BLASTP. All proteins with positive BLAST matches were further compared with the nr database (http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases), using BLASTP to eliminate truncated proteins and those with matches to other dicots.
Tandem gene families
The rice genome was subjected to a BLASTP search as previously described32. The search was also performed by permitting more than one unrelated gene within the arrays, and the limit of the search was set to 5-Mb intervals to exclude large chromosomal duplications.
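The relaxed screen can be pictured as grouping same-family genes that fall within a 5-Mb interval on one chromosome, regardless of unrelated genes between them. The Python sketch below makes two assumptions that are not taken from the original method: family membership has already been assigned (for example, from the BLASTP results), and an array is closed once a gene lies more than 5 Mb from the first member. All names are hypothetical:

```python
from collections import defaultdict

WINDOW_BP = 5_000_000  # search limit quoted in the text

def tandem_arrays(genes, window_bp=WINDOW_BP):
    """genes: iterable of (gene_id, chromosome, start_bp, family_id).

    Returns lists of gene ids; each list is one array of two or more
    same-family genes on one chromosome spanning at most window_bp.
    Unrelated genes lying between members are simply ignored.
    """
    by_family = defaultdict(list)
    for gene_id, chrom, start, family in genes:
        by_family[(chrom, family)].append((start, gene_id))

    arrays = []
    for members in by_family.values():
        members.sort()
        current = [members[0]]
        for start, gene_id in members[1:]:
            if start - current[0][0] <= window_bp:
                current.append((start, gene_id))
            else:
                if len(current) > 1:
                    arrays.append([g for _, g in current])
                current = [(start, gene_id)]
        if len(current) > 1:
            arrays.append([g for _, g in current])
    return arrays

toy_genes = [
    ("g1", "chr10", 100_000, "famA"),
    ("x1", "chr10", 150_000, "famB"),   # unrelated gene inside the array
    ("g2", "chr10", 200_000, "famA"),
    ("g3", "chr10", 9_000_000, "famA"), # too far from g1 to join the array
    ("g4", "chr1", 100_000, "famA"),    # different chromosome
]
print(tandem_arrays(toy_genes))
```

In the toy data only g1 and g2 form an array: the intervening famB gene does not break it, while g3 and g4 are excluded by distance and chromosome, respectively.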
Transfer-RNA genes were detected by the program tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/). The miRNA registry in the Rfam database (http://www.sanger.ac.uk/Software/Rfam/) was used as a reference database for miRNAs. In addition, experimentally validated miRNAs of other species, excluding Arabidopsis miRNAs, were used for BLASTN queries against the pseudomolecules. Spliceosomal and snoRNAs were retrieved from the Rfam database and used for queries. BLASTN was used to find the location of snoRNAs and spliceosomal RNAs in the pseudomolecules.
Oryza sativa ssp. japonica Nipponbare chloroplast (GenBank NC_001320) and mitochondrial (GenBank BA000029) sequences were aligned with the pseudomolecules using BLASTN and MUMmer49.
The TIGR Oryza Repeat Database, together with other published and unpublished rice transposable element sequences, was used to create RTEdb (a rice transposable element database)50 and determine transposable element coordinates on the rice pseudomolecules. In the case of hAT, IS256/Mutator, IS5/Tourist and IS630/Tc1/mariner elements, family-specific profile hidden Markov models were applied using HMMER41 (http://hmmer.wustl.edu/). The remaining superfamilies were annotated using RepeatMasker (http://www.repeatmasker.org/).
Flanking sequences of transposed copies of 6,278 Tos17 insertion lines were isolated by modified thermal asymmetric interlaced (TAIL)-PCR and suppression PCR, and screened against the pseudomolecule sequences.
BAC clones from an O. sativa ssp. indica var. Kasalath BAC library were end-sequenced. Sequence reads were omitted if they contained more than 50% nucleotides of low quality or high similarity to known repeats. The remaining sequences were subjected to BLASTN analysis against the pseudomolecules. Gaps within the alignments were classified as small insertions/deletions.
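The read-filtering rule can be sketched as follows. The Phred cutoff of 20 used to call a base low quality, the per-base repeat mask, and the choice to test the two fractions separately are all assumptions made for illustration; the text gives only the 50% threshold.

```python
from typing import Sequence


def keep_read(
    quals: Sequence[int],
    repeat_mask: Sequence[bool],
    min_qual: int = 20,       # assumed Phred cutoff, not stated in the text
    max_fraction: float = 0.5,
) -> bool:
    """Return True if a BAC-end read passes the filter.

    quals: per-base quality scores.
    repeat_mask: True where a base lies in a match to a known repeat.
    A read is omitted when MORE than max_fraction of its bases are
    low quality, or more than max_fraction are repeat-masked.
    """
    n = len(quals)
    if n == 0 or len(repeat_mask) != n:
        return False
    low_quality = sum(1 for q in quals if q < min_qual)
    masked = sum(1 for m in repeat_mask if m)
    return low_quality / n <= max_fraction and masked / n <= max_fraction


clean = keep_read([30] * 10, [False] * 10)
poor = keep_read([10] * 6 + [30] * 4, [False] * 10)
borderline = keep_read([10] * 5 + [30] * 5, [False] * 10)
repetitive = keep_read([30] * 10, [True] * 6 + [False] * 4)
```

A read with exactly half of its bases flagged is kept, because the criterion is "more than 50%".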
The Simple Sequence Repeat Identification Tool (http://www.gramene.org/) was used to identify simple sequence repeat motifs, and the physical position of all Class 1 SSRs was recorded. The copy number of SSR markers was estimated using electronic (e)-PCR to determine the number of independent hits of primer pairs on the pseudomolecules.
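As an illustration of SSR detection, the sketch below scans a sequence for perfect repeats with a regular expression. It is not the Simple Sequence Repeat Identification Tool itself. The 20-bp minimum length used here for Class 1 SSRs and the restriction to di- to tetranucleotide motifs are assumptions, and `find_ssrs` is a hypothetical name.

```python
import re
from typing import List, Tuple

# A motif of 2-4 bases followed by at least one exact copy of itself.
# The lazy quantifier prefers the shortest motif, so ATATAT is
# reported as (AT)3 rather than as a longer unit.
_SSR = re.compile(r"([ACGT]{2,4}?)\1+")


def find_ssrs(seq: str, min_len: int = 20) -> List[Tuple[int, int, str, int]]:
    """Return (start, end, motif, copies) for perfect SSRs >= min_len bp.

    Coordinates are 0-based and half-open. Homopolymer runs are skipped.
    """
    hits = []
    for m in _SSR.finditer(seq.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:      # e.g. AA repeated: a homopolymer
            continue
        length = m.end() - m.start()
        if length >= min_len:
            hits.append((m.start(), m.end(), motif, length // len(motif)))
    return hits


seq = "TTTT" + "AG" * 12 + "CCCC" + "GAT" * 4 + "C"
class1 = find_ssrs(seq)
shorter = find_ssrs(seq, min_len=12)
```

Because matches are non-overlapping, a short repeat immediately upstream can shave a base off a neighbouring SSR; a production tool would also need to handle imperfect and compound repeats.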
Whole-genome shotgun assembly analysis
Contigs from the BGI 6.28 × whole genome assembly of O. sativa ssp. indica 93-11 (GenBank/DDBJ/EMBL accession number AAAA02000001–AAAA02050231) and the Syngenta 6 × whole genome assembly of O. sativa ssp. japonica cv. Nipponbare (AACV01000001–AACV01035047; ref. 48) were aligned with the pseudomolecules using MUMmer49. The number of IRGSP Nipponbare full-length cDNA-supported gene models completely covered by the aligned contigs was tabulated. The 155-bp CentO consensus sequence was used for BLAST analysis against the 93-11 and Nipponbare whole-genome shotgun contigs, and the coordinates of the positive hits recorded. Locations of centromeres for each indica chromosome were obtained with the CentO sequence positions on the IRGSP pseudomolecule of the corresponding chromosome. A detailed comparison of the BGI-assembled and -mapped Syngenta contigs (AACV01000001–AACV01000070) and the 93-11 contigs (AAAA02000001–AAAA02000093) was obtained by BLAST analysis against the IRGSP chromosome 1 pseudomolecule.
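The tabulation of gene models completely covered by aligned contigs can be sketched as an interval problem. The sketch assumes that alignment coordinates on a pseudomolecule have already been parsed from the MUMmer output into 1-based inclusive (start, end) pairs, and that overlapping or abutting alignments may be merged before testing containment; neither detail is stated in the text, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive


def merge_intervals(intervals: Iterable[Interval]) -> List[List[int]]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def count_covered(genes: Dict[str, Interval], alignments: Iterable[Interval]) -> int:
    """Count gene models lying entirely inside the aligned regions."""
    merged = merge_intervals(alignments)
    starts = [s for s, _ in merged]
    covered = 0
    for g_start, g_end in genes.values():
        # Rightmost merged block starting at or before the gene start
        i = bisect_right(starts, g_start) - 1
        if i >= 0 and merged[i][1] >= g_end:
            covered += 1
    return covered


alignments = [(1, 100), (90, 250), (400, 500)]
genes = {"a": (10, 240), "b": (240, 410), "c": (420, 500), "d": (501, 510)}
n_covered = count_covered(genes, alignments)
```

In this toy input, genes `a` and `c` fall wholly within aligned sequence, whereas `b` spans an unaligned gap and `d` lies outside every alignment.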
Detailed procedures for the analyses described above can be found in the Supplementary Information.
International Rice Genome Sequencing Project
Participants are arranged by area of contribution and then by institution.
Physical Maps and Sequencing: Rice Genome Research Program (RGP) Takashi Matsumoto1, Jianzhong Wu1, Hiroyuki Kanamori1, Yuichi Katayose1, Masaki Fujisawa1, Nobukazu Namiki1, Hiroshi Mizuno1, Kimiko Yamamoto1, Baltazar A. Antonio1, Tomoya Baba1, Katsumi Sakata1, Yoshiaki Nagamura1, Hiroyoshi Aoki1, Koji Arikawa1, Kohei Arita1, Takahito Bito1, Yoshino Chiden1, Nahoko Fujitsuka1, Rie Fukunaka1, Masao Hamada1, Chizuko Harada1, Akiko Hayashi1, Saori Hijishita1, Mikiko Honda1, Satomi Hosokawa1, Yoko Ichikawa1, Atsuko Idonuma1, Masumi Iijima1, Michiko Ikeda1, Maiko Ikeno1, Kazue Ito1, Sachie Ito1, Tomoko Ito1, Yuichi Ito1, Yukiyo Ito1, Aki Iwabuchi1, Kozue Kamiya1, Wataru Karasawa1, Kanako Kurita1, Satoshi Katagiri1, Ari Kikuta1, Harumi Kobayashi1, Noriko Kobayashi1, Kayo Machita1, Tomoko Maehara1, Masatoshi Masukawa1, Tatsumi Mizubayashi1, Yoshiyuki Mukai1, Hideki Nagasaki1, Yuko Nagata1, Shinji Naito1, Marina Nakashima1, Yuko Nakama1, Yumi Nakamichi1, Mari Nakamura1, Ayano Meguro1, Manami Negishi1, Isamu Ohta1, Tomoya Ohta1, Masako Okamoto1, Nozomi Ono1, Shoko Saji1, Miyuki Sakaguchi1, Kumiko Sakai1, Michie Shibata1, Takanori Shimokawa1, Jianyu Song1, Yuka Takazaki1, Kimihiro Terasawa1, Mika Tsugane1, Kumiko Tsuji1, Shigenori Ueda1, Kazunori Waki1, Harumi Yamagata1, Mayu Yamamoto1, Shinichi Yamamoto1, Hiroko Yamane1, Shoji Yoshiki1, Rie Yoshihara1, Kazuko Yukawa1, Huisun Zhong1, Masahiro Yano1, Takuji Sasaki, (Principal Investigator)1; The Institute for Genomic Research (TIGR) Qiaoping Yuan2, Shu Ouyang2, Jia Liu2, Kristine M. Jones2, Kristen Gansberger2, Kelly Moffat2, Jessica Hill2, Jayati Bera2, Douglas Fadrosh2, Shaohua Jin2, Shivani Johri2, Mary Kim2, Larry Overton2, Matthew Reardon2, Tamara Tsitrin2, Hue Vuong2, Bruce Weaver2, Anne Ciecko2, Luke Tallon2, Jacqueline Jackson2, Grace Pai2, Susan Van Aken2, Terry Utterback2, Steve Reidmuller2, Tamara Feldblyum2, Joseph Hsiao2, Victoria Zismann2, Stacey Iobst2, Aymeric R. de Vazeille2, C. Robin Buell, (Principal Investigator)2; National Center for Gene Research Chinese Academy of Sciences (NCGR) Kai Ying3, Ying Li3, Tingting Lu3, Yuchen Huang3, Qiang Zhao3, Qi Feng3, Lei Zhang3, Jingjie Zhu3, Qijun Weng3, Jie Mu3, Yiqi Lu3, Danlin Fan3, Yilei Liu3, Jianping Guan3, Yujun Zhang3, Shuliang Yu3, Xiaohui Liu3, Yu Zhang3, Guofan Hong3, Bin Han, (Principal Investigator)3; Genoscope Nathalie Choisne4, Nadia Demange4, Gisela Orjeda4, Sylvie Samain4, Laurence Cattolico4, Eric Pelletier4, Arnaud Couloux4, Beatrice Segurens4, Patrick Wincker4, Angelique D'Hont5, Claude Scarpelli4, Jean Weissenbach4, Marcel Salanoubat4, Francis Quetier, (Principal Investigator)4; Arizona Genomics Institute (AGI) and Arizona Genomics Computational Laboratory (AGCol) Yeisoo Yu6, Hye Ran Kim6, Teri Rambo6, Jennifer Currie6, Kristi Collura6, Meizhong Luo6, Tae-Jin Yang6, Jetty S. S. Ammiraju6, Friedrich Engler6, Carol Soderlund6, Rod A. Wing, (Principal Investigator)6; Cold Spring Harbor Laboratory (CSHL) Lance E. Palmer7, Melissa de la Bastide7, Lori Spiegel7, Lidia Nascimento7, Theresa Zutavern7, Andrew O'Shaughnessy7, Sujit Dike7, Neilay Dedhia7, Raymond Preston7, Vivekanand Balija7, W. Richard McCombie, (Principal Investigator)7; Academia Sinica Plant Genome Center (ASPGC) Teh-Yuan Chow8, Hong-Hwa Chen9, Mei-Chu Chung8, Ching-San Chen8, Jei-Fu Shaw8, Hong-Pang Wu8, Kwang-Jen Hsiao10, Ya-Ting Chao8, Mu-kuei Chu8, Chia-Hsiung Cheng8, Ai-Ling Hour8, Pei-Fang Lee8, Shu-Jen Lin8, Yao-Cheng Lin8, John-Yu Liou8, Shu-Mei Liu8, Yue-Ie Hsing, (Principal Investigator)8; Indian Initiative for Rice Genome Sequencing (IIRGS), University of Delhi South Campus (UDSC) S. Raghuvanshi11, A. Mohanty11, A. K. Bharti11,13, A. Gaur11, V. Gupta11, D. Kumar11, V. Ravi11, S. Vij11, A. Kapur11, Parul Khurana11, Paramjit Khurana11, J. P. Khurana11, A. K. Tyagi, (Principal Investigator)11; Indian Initiative for Rice Genome Sequencing (IIRGS), Indian Agricultural Research Institute (IARI) K. Gaikwad12, A. Singh12, V. Dalal12, S. Srivastava12, A. Dixit12, A. K. Pal12, I. A. Ghazi12, M. Yadav12, A. Pandit12, A. Bhargava12, K. Sureshbabu12, K. Batra12, T. R. Sharma12, T. Mohapatra12, N. K. Singh, (Principal Investigator)12; Plant Genome Initiative at Rutgers (PGIR) Joachim Messing, (Principal Investigator)13, Amy Bronzino Nelson13, Galina Fuks13, Steve Kavchok13, Gladys Keizer13, Eric Linton13, Victor Llaca13, Rentao Song13, Bahattin Tanyolac13, Steve Young13; Korea Rice Genome Research Program (KRGRP) Kim Ho-Il14, Jang Ho Hahn, (Principal Investigator)14; National Center for Genetic Engineering and Biotechnology (BIOTEC) G. Sangsakoo15, A. Vanavichit, (Principal Investigator)15; Brazilian Rice Genome Initiative (BRIGI) Luiz Anderson Teixeira de Mattos16, Paulo Dejalma Zimmer16, Gaspar Malone16, Odir Dellagostin16, Antonio Costa de Oliveira, (Principal Investigator)16; John Innes Centre (JIC) Michael Bevan17, Ian Bancroft17; Washington University School of Medicine Genome Sequencing Center Pat Minx18, Holly Cordum18, Richard Wilson18; University of Wisconsin-Madison Zhukuan Cheng19, Weiwei Jin19, Jiming Jiang19, Sally Ann Leong20
Annotation and Analysis: Hisakazu Iwama21, Takashi Gojobori21,22, Takeshi Itoh22,23, Yoshihito Niimura24, Yasuyuki Fujii25, Takuya Habara25, Hiroaki Sakai23,25, Yoshiharu Sato22, Greg Wilson26, Kiran Kumar27, Susan McCouch26, Nikoleta Juretic28, Douglas Hoen28, Stephen Wright29, Richard Bruskiewich30, Thomas Bureau28, Akio Miyao23, Hirohiko Hirochika23, Tomotaro Nishikawa23, Koh-ichi Kadowaki23, Masahiro Sugiura31
Coordination: Benjamin Burr32
Affiliations for participants: 1National Institute of Agrobiological Sciences/Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries, 2-1-2 Kannondai, Tsukuba, Ibaraki 305-8602, Japan. 2The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA. 3Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (CAS), 500 Caobao Road, Shanghai 200233, China. 4Centre National de Séquençage, INRA-URGV, and CNRS UMR-8030, 2, rue Gaston Crémieux, CP 5706, 91057 EVRY Cedex, France. 5UMR PIA, Cirad-Amis, TA40-03 avenue Agropolis, 34398 Montpellier Cedex 05, France. 6Department of Plant Sciences, BIO5 Institute, The University of Arizona, Tucson, Arizona 85721, USA. 7Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11723, USA. 8Institute of Botany, Academia Sinica, 128, Sec. 2, Yen-Chiu-Yuan Rd, Nankang, Taipei 11529, Taiwan. 9National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan. 10National Yang-Ming University, 155, Sec. 2, Li-Nong St, Peitou, Taipei 112, Taiwan. 11Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110021, India. 12National Research Centre on Plant Biotechnology, Indian Agricultural Research Institute, New Delhi 110012, India. 13Waksman Institute, Rutgers University, Piscataway, New Jersey 08854, USA. 14National Institute of Agricultural Science and Technology, RDA, Suwon, 441-707 Republic of Korea. 15Rice Gene Discovery Unit, Kasetsart University, Nakron Pathom 73140, Thailand. 16Centro de Genomica e Fitomelhoramento, UFPel, Pelotas, RS, l 96001-970, Brazil. 17John Innes Centre, Norwich Research Park, Colney, Norwich NR4 7UH, UK. 18Washington University Genome Sequencing Center, 3333 Forest Park Boulevard, St. Louis, Missouri 63108, USA. 19University of Wisconsin, Department of Horticulture, Madison, Wisconsin 53706, USA. 20University of Wisconsin, Department of Plant Pathology, Madison, Wisconsin 53706, USA. 
21Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan. 22Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan. 23National Institute of Agrobiological Sciences, Tsukuba, Ibaraki 305-8602, Japan. 24Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-8510, Japan. 25Japan Biological Information Research Center, Japan Biological Informatics Consortium, Koto-ku, Tokyo 135-0064, Japan. 26Plant Breeding Dept, Cornell University, Ithaca, New York 14850-1901, USA. 27Cold Spring Harbor Laboratory, PO Box 100, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA. 28Department of Biology, McGill University, 1205 Dr Penfield Avenue, Montreal, Quebec H3A 1B1, Canada. 29Department of Biology, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada. 30Biometrics and Bioinformatics Unit, International Rice Research Institute, DAPO Box 7777, Metro Manila, Philippines. 31Graduate School of Natural Sciences, Nagoya City University, Nagoya 467-8501, Japan. 32Biology Department, Brookhaven National Laboratory, Upton, New York 11973, USA.
Work at the RGP was supported by the Ministry of Agriculture, Forestry and Fisheries of Japan. Work at TIGR was supported by grants to C.R.B. from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative, the National Science Foundation and the US Department of Energy. Work at the NCGR was supported by the Chinese Ministry of Science and Technology, the Chinese Academy of Sciences, the Shanghai Municipal Commission of Science and Technology, and the National Natural Science Foundation of China. Work at Genoscope was supported by le Ministère de la Recherche, France. Funding for the work at the AGI and AGCoL was provided by grants to R.A.W. and C.S. from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative, the National Science Foundation, the US Department of Energy and the Rockefeller Foundation. Work at CSHL was supported by grants from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative and from the National Science Foundation. Work at the ASPGC was supported by Academia Sinica, National Science Council, Council of Agriculture, and Institute of Botany, Academia Sinica. The IIRGS acknowledges the Department of Biotechnology, Government of India, for financial assistance and the Indian Council of Agricultural Research, New Delhi, for support. Work at Rice Gene Discovery was supported by BIOTECH and the Princess Sirindhorn's Plant Germplasm Conservation Initiative Program. Work at PGIR was supported by Rutgers University. The BRIGI was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Financiadora de Estudos e Projetos - Ministério de Ciência e Tecnologia (FINEP-MCT), Fundação de Amparo a Pesquisa do Rio Grande do Sul (FAPERGS) and Universidade Federal de Pelotas (UFPel). 
Work at McGill and York Universities was supported by the National Science and Engineering Research Council of Canada and the Canadian International Development Agency. Funding for H.H. at the National Institute of Agrobiological Sciences was from the Ministry of Agriculture, Forestry, and Fisheries of Japan, and the Program for Promotion of Basic Research Activities for Innovative Biosciences. Funding at Brookhaven National Laboratory was from The Rockefeller Foundation and the Office of Basic Energy Science of the United States Department of Energy. We would like to thank G. Barry and S. Goff for their help in negotiating agreements that permitted the sharing of materials and sequence with the IRGSP. We also acknowledge the work of G. Barry, S. Goff and their colleagues in facilitating the transfer of sequence information and supporting data.
The genomic sequence is available under accession numbers AP008207–AP008218 in international databases (DDBJ, GenBank and EMBL).