Main

Rice has been studied extensively by molecular genetics and constitutes one of the best characterized crop plants with a fine genetic map of 3,267 markers (http://rgp.dna.affrc.go.jp/publicdata/geneticmap2000/index.html)1, a yeast artificial chromosome (YAC) physical map with 80.8% coverage2, sequences for about 10,000 unique expressed sequence tags (ESTs)3, and a transcriptional map indicating the placement of 6,591 unique ESTs2. The Rice Genome Research Program (RGP) in Japan launched its rice genome sequencing project in 1998. It is a partner of the International Rice Genome Sequencing Project (IRGSP), which involves ten countries in Asia, North America, South America and Europe that are working towards the immediate release of high-quality sequence data to the public domain4. The draft sequences of the two main subspecies of rice, japonica and indica, have been reported5,6. Both studies were based on whole-genome shotgun sequencing rather than on the clone-by-clone approach of the IRGSP. Although the release of the draft sequence is of immense scientific value, many challenges in rice genomics demand the availability of a complete, accurate, map-based rice genome sequence.

We determined the sequence of chromosome 1 from 390 overlapping phage (P1)-derived artificial chromosome (PAC) and bacterial artificial chromosome (BAC) clones and assembled it into nine contigs (Fig. 1). The longest contig is 14.4 Mb and spans positions 106.2 centimorgans (cM) to 157.1 cM on the molecular genetic map. Among the eight remaining gaps, gap 4, located at 73.4 cM, corresponds to a portion of the centromeric region and is estimated to be about 1,400 kilobases (kb) by the pachytene fluorescence in situ hybridization (FISH) method7. PAC/BAC clones adjacent to this gap contain copies of the rice centromere-specific sequence RCS2 (ref. 8). Two PAC clones, P0402A09 and P0020E09, are localized to the most distal ends of the short arm and the long arm, and their map positions have been verified by pachytene FISH using pAtT4 (ref. 9), a telomeric clone of Arabidopsis (Supplementary Fig. 1). This indicates that our physical map extends to within less than 50 kb of the telomeres. Integration of the PAC/BAC physical mapping with the results from fibre FISH gives a total length of 45.7 Mb for chromosome 1, corresponding to 181.8 cM on the genetic map, excluding the telomeres.

Figure 1: Physical map of rice chromosome 1.

Positions of the PAC/BAC contigs are indicated by black bars. Purple numbers indicate the physical distances that were calculated on the basis of the nucleotide sequence length of each contig. A representation of the genetic map of chromosome 1 is shown on the left with the positions of the genetic markers found nearest the end of each contig. The centromeric region is shown as a red circle. The green numbers show the gap sizes as measured by fibre FISH and pachytene FISH.

Statistics for the nucleotide sequence of rice chromosome 1 are summarized in Table 1. The non-overlapping sequence covers 43,276,883 nucleotides. In this sequence, 6,756 genes were either identified or predicted. Thus, the average gene density of chromosome 1 is about one gene per 6.4 kb. If this distribution is assumed to be similar throughout the whole genome, then the total number of genes in the rice genome (400 Mb) is roughly 62,500. This number is 2.5 times larger than the gene total of Arabidopsis10. But this difference might easily be the result of an overestimate of rice genes, because the extrapolation assumes a uniform distribution of genes along the chromosomes.
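The extrapolation above can be reproduced as a short calculation. This is only a sketch of the arithmetic using the figures quoted in the text; the function and variable names are illustrative, and rounding the spacing to 6.4 kb before extrapolating is an assumption made to match the quoted result.

```python
# Figures quoted in the text above.
SEQUENCED_BP = 43_276_883      # non-overlapping sequence of chromosome 1
PREDICTED_GENES = 6_756        # genes identified or predicted
GENOME_SIZE_BP = 400_000_000   # rice genome size used in the text (400 Mb)


def gene_spacing_kb(sequence_bp: int, gene_count: int) -> float:
    """Average number of kilobases per gene."""
    return sequence_bp / gene_count / 1_000


def extrapolated_gene_count(genome_bp: int, spacing_kb: float) -> float:
    """Genome-wide gene count, assuming uniform gene density."""
    return genome_bp / (spacing_kb * 1_000)


spacing = gene_spacing_kb(SEQUENCED_BP, PREDICTED_GENES)
# Assumption: round to one decimal place first, as the text quotes 6.4 kb.
rounded_spacing = round(spacing, 1)
estimate = extrapolated_gene_count(GENOME_SIZE_BP, rounded_spacing)

print(f"{rounded_spacing} kb per gene")
print(f"about {estimate:,.0f} genes genome-wide")
```

Because the estimate scales linearly with the assumed density, any chromosome that is less gene-rich than chromosome 1 would lower the genome-wide total.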

Table 1 Compositional analysis of the sequence of rice chromosome 1

Cytogenetic analysis has indicated clear differences in the content of heterochromatin in each of the 12 rice chromosomes, and chromosome 1 shows the least amount of heterochromatic material11. The average exon size is comparable to that of Arabidopsis, but the average intron size is about 3.6 times larger. This means that, although the longer introns engender larger gene sizes in rice, the average transcript size is similar in both species. The G + C content of coding and noncoding regions in rice is higher than in Arabidopsis; the rice coding regions are especially (G + C)-rich. This characteristic is reflected by the biased usage of G/C at the third position of codons within predicted genes (Supplementary Table 1). Buoyant density experiments have shown that rice genes are localized in (G + C)-rich islands that occupy 24% of the genome12. When we plotted the average G + C values against chromosomal position in chromosome 1, however, we did not detect any CpG islands, indicating a neutral nucleotide distribution. The ratios of physical to genetic distance on the short and the long arms are 214 kb cM⁻¹ (r² = 0.983) and 288 kb cM⁻¹ (r² = 0.976), respectively, suggesting that the rate of recombination differs along the two arms of the chromosome.

We compared our finished sequence (493,729 bp from the distal end of the short chromosome arm) with 127,550 indica sequence contigs assembled from the whole-genome shotgun sequences of the Beijing Genomics Institute (BGI, http://btn.genomics.org.cn/rice/) using the japonica sequence as a query for BLASTN (basic local alignment search tool) analysis (Fig. 2). We could detect the corresponding indica sequence in about 78% of the whole region. But there were 65 gaps in the aligned contigs, and a total of 110,389 bases (22%) of japonica sequence could not be identified in the indica assembly. This may partly reflect the sequence difference between the two subspecies, although some artefacts in the whole-genome shotgun assembly cannot be ruled out. Among the 96 predicted genes in this region of the completed japonica sequence, 55 genes are intact, 33 genes are partially predicted and 8 genes are not predicted in the corresponding indica draft sequence. Relative identities near the repeat (retrotransposon-like) regions are lower than in the other regions, suggesting possible misassembly of the draft sequence in these regions.
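The percentages in this comparison follow directly from the counts given above. The sketch below recomputes them; the variable names are illustrative. The combined share of partial and missing genes appears to correspond to the 43% figure for incomplete gene predictions quoted later in the text, although that correspondence is inferred here rather than stated.

```python
# Counts quoted in the text above.
REGION_BP = 493_729        # finished japonica sequence used as the query
UNMATCHED_BP = 110_389     # japonica bases not identified in the indica assembly
GENE_STATUS = {"intact": 55, "partial": 33, "not predicted": 8}

unmatched_fraction = UNMATCHED_BP / REGION_BP
matched_fraction = 1 - unmatched_fraction

total_genes = sum(GENE_STATUS.values())
incomplete_genes = GENE_STATUS["partial"] + GENE_STATUS["not predicted"]
incomplete_fraction = incomplete_genes / total_genes

print(f"matched:   {matched_fraction:.0%}")
print(f"unmatched: {unmatched_fraction:.0%}")
print(f"genes partial or missing in the draft: {incomplete_fraction:.0%} of {total_genes}")
```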

Figure 2: Comparison between the Nipponbare finished sequence and the indica draft sequence.

Our finished continuous sequence (493,729 bp from near the distal end of the short arm) was used as query. All of the indica 93-11 sequence contigs assembled from the whole-genome shotgun sequences from the BGI were downloaded from their website and searched by BLASTN. The highest rank of the hit contigs was aligned to our Nipponbare sequence. Coloured bars represent the following: blue, RGP sequence data; grey, PAC/BAC clones; red, BGI sequence data with >95% identity; green, BGI sequence data with 90–95% identity; and yellow, regions containing repetitive sequences. Numbers in parentheses correspond to the length of PAC/BAC clones and to BGI sequence contigs in bp.

Direct comparison with the japonica draft sequence could not be made because the sequence data are not in the public domain. But previously, 4,467 genes were predicted from a set of 99 BAC contigs assigned to chromosome 1 (ref. 6). It is likely that an estimated 2,835–4,211 gaps (either 63 gaps per megabase or 10% of 42,109 total gaps) for this chromosome prevented an accurate prediction of the number of genes. Not surprisingly, only half of the genes predicted contain complete coding regions. In addition, no basis was provided for the assignment of genes to chromosomes (ref. 6).
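The two ends of the gap estimate can be recovered from the quoted figures. In this sketch the chromosome length of 45 Mb is an assumption inferred from 2,835 / 63; the text does not state which length was used for the lower bound.

```python
GAPS_PER_MB = 63            # gap density quoted in the text
TOTAL_DRAFT_GAPS = 42_109   # total gaps quoted in the text
CHROMOSOME_SHARE = 0.10     # share of total gaps attributed to chromosome 1
CHROMOSOME_MB = 45          # assumption: length implied by 2,835 / 63

# Lower bound: gap density multiplied by chromosome length.
low_estimate = GAPS_PER_MB * CHROMOSOME_MB
# Upper bound: chromosome 1 share of all gaps in the draft.
high_estimate = round(TOTAL_DRAFT_GAPS * CHROMOSOME_SHARE)

print(f"estimated gaps on chromosome 1: {low_estimate:,} to {high_estimate:,}")
```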

We used an automated annotation system, RiceGAAS13, to characterize the gene composition of chromosome 1 (Supplementary Fig. 2, http://RiceGAAS.dna.affrc.go.jp/chromosome1/). The distribution of genes along both arms of the chromosome indicates higher density (18–19 genes per 100 kb) in distal as compared with proximal regions (10–12 genes per 100 kb). This was verified by experimental results obtained by mapping 977 expressed sequence tags on to chromosome 1 (ref. 2). Among the 6,756 predicted genes, 2,073 (31%) were functionally characterized by homology to known proteins using BLASTP, whereas 69% of the predicted genes corresponded to proteins with no known function (Table 2). The protein signature search program InterPro detected protein domains in 3,660 (54%) of the total predicted genes (see http://RiceGAAS.dna.affrc.go.jp/chromosome1/). In particular, 1,170 (33%) of 3,600 hypothetical proteins showed domain homology, suggesting that these proteins may correspond to newly identified proteins in rice. BLASTN analysis was done using the cereal EST entries from the EST database at the National Center for Biotechnology Information (NCBI). Exon regions from all predicted genes were used as queries, and 546,723 unclustered ESTs from wheat, maize, barley and sorghum were searched using a threshold probability value of 10⁻⁵. A total of 2,985 predicted genes, including 756 hypothetical proteins, have cereal homologues. Thus, among the 6,756 predicted genes, 4,803 (71%) show some evidence of homology to a domain, a functional site, a cereal EST or a protein.

Table 2 Functional classification of the proteins encoded on rice chromosome 1

The predicted proteins found on chromosome 1 were categorized into gene families by BLASTP, using a threshold probability score of 10⁻²⁰ over more than 50% of the length of the gene. The most abundant gene family was the serine/threonine receptor kinase family with 132 members distributed along the chromosome (Fig. 3a). A cluster of this gene family was observed at the distal end of the short arm, although some members of the cluster seemed to be pseudogenes. The highest number of tandem repeats detected at a single site was a cluster of ten copies of the hypothetical gene family located on the short arm of chromosome 1. These results are summarized in Fig. 3b, which shows a dot matrix plot of chromosome 1, indicating the predicted genes with significant homology to a given gene. On this plot, which disregards self-homology, a clear diagonal line was obtained, indicating that a significant number of genes are duplicated and arrayed in tandem.
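One way to picture the family assignment is as a grouping of proteins connected by qualifying BLASTP matches. The sketch below applies the two criteria stated above (E value below 10⁻²⁰, match covering more than half of the protein) and then merges linked proteins with a union-find structure. The single-linkage merging and the input tuple layout are assumptions for illustration; the text does not describe how matches were combined into families.

```python
from collections import defaultdict

E_THRESHOLD = 1e-20   # E-value cutoff quoted in the text
MIN_COVERAGE = 0.5    # match must span more than half of the protein


def group_families(hits):
    """Group proteins into families by single-linkage over qualifying hits.

    `hits` is an iterable of (query, subject, evalue, match_len, query_len)
    tuples. This layout is hypothetical, not a real BLAST output format.
    """
    parent = {}

    def find(protein):
        # Return the representative of the set containing `protein`.
        parent.setdefault(protein, protein)
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]  # path halving
            protein = parent[protein]
        return protein

    for query, subject, evalue, match_len, query_len in hits:
        if query == subject:
            continue  # self-against-self matches are ignored
        if evalue < E_THRESHOLD and match_len / query_len > MIN_COVERAGE:
            parent[find(query)] = find(subject)

    families = defaultdict(set)
    for protein in list(parent):
        families[find(protein)].add(protein)

    # Keep groups with at least two members, largest first.
    groups = [sorted(members) for members in families.values() if len(members) > 1]
    return sorted(groups, key=len, reverse=True)
```

Single-linkage grouping can chain loosely related proteins into one family through intermediate members, so a stricter rule (for example, requiring reciprocal matches) would give smaller groups.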

Figure 3: Analysis of gene families and gene clusters.

a, Distribution of gene families on chromosome 1. For each category, BLASTP was carried out against all of the chromosome 1 gene products. Proteins that had a threshold probability (E) value of less than 10⁻²⁰ and that also showed strong homology for more than half of their total length were grouped as a gene family. The six largest gene families are shown. Vertical lines on each bar indicate the positions of gene family members. The bars are oriented with the distal end of the short arm of the chromosome to the left. The scale at the bottom shows the physical length of chromosome 1 in Mb. b, Dot matrix plot showing the positions of homologous genes on chromosome 1. A BLASTP search was carried out using all of the predicted genes. A dot was plotted when the E value between two genes was less than 10⁻²⁰ and the length of the match was over 50%. The colour of each dot represents the E value of the match. The colour spectrum bar at the bottom left shows the E value associated with each colour; for example, green corresponds to E = 10⁻⁴⁰. The matches of E < 10⁻¹⁰⁰ are shown as red dots. ‘Self-against-self’ matches are omitted.

To determine whether any of the proteins on rice chromosome 1 are not present in Arabidopsis, the 6,756 predicted proteins were queried in BLASTP searches against all the Arabidopsis proteins in the Munich Information Center for Protein Sequence (MIPS) database using a threshold probability score of 10⁻⁵. Among 3,161 positive queries, 824 showed strong similarities (probability value less than 10⁻¹⁰⁰) to proteins found in Arabidopsis, whereas 3,595 sequences (53%) did not have positive BLASTP hits with predicted Arabidopsis proteins at a probability threshold of 10⁻⁵. Only 27 of these sequences had homology to known proteins and among them, only Bowman–Birk trypsin inhibitor and cytochrome f (chloroplast) were clearly found in rice chromosome 1. This suggests that almost all of the known proteins found in rice chromosome 1 are also found in Arabidopsis. Among the hypothetical proteins, 3,051 genes have no counterpart in Arabidopsis and 442 (15%) genes have grass orthologues. Analysis of the draft sequence also showed that half of the predicted genes have no homologues in Arabidopsis5,6. Although many of these hypothetical genes could be artefacts resulting from prediction errors, functional characterization of these genes in the future may identify grass-specific or even rice-specific genes.

We also observed rice chloroplast genes in sequential order on the chromosomal DNA. For example, at 149.1 cM we identified 3,564 bp of sequence that matched the rice chloroplast sequence with only a 3-bp difference. This sequence contains three genes14: PSII cytochrome b559, cytochrome f and the chloroplast envelope membrane protein ORF230. We also detected 85 putative transfer RNA genes using tRNAscan-SE15. Analysis of the retrotransposable elements and DNA intermediate transposons, including miniature inverted-repeat transposable elements (MITEs)16, using RepeatMasker is given in Table 1 and summarized in Supplementary Fig. 3. MITEs have a tendency to be dispersed along the chromosome, whereas the retrotransposons and other autonomous type DNA-mediated transposable elements are clustered in the pericentromeric region. Among retroelements, Ty3/Gypsy-type elements are the most frequent (2,157), followed by Ty1/Copia-type elements (384). The sum of the lengths of these three repetitive elements is 6.0 Mb, corresponding to 13% of chromosome 1.

There are at least three compelling reasons for obtaining finished high-quality sequence for the complete rice genome: first, the ability to determine gene function is highly dependent on having accurate sequences; second, as a model plant for the cereal grasses, the complete rice sequence will directly affect what can be accomplished with the other cereal grasses; and last, the identification of genes responsible for agronomic traits of economic importance requires precise map-based genomic sequence. Chromosome 1 contains many biologically important genes. More than 20 gene loci have been identified by genetic analysis, including genes controlling dwarfing and fertility. One of these genes, sd1, has been cloned and shown to encode one of the enzymes in gibberellic acid synthesis17.

The complete genomic sequence of chromosome 1 has yielded several findings that would be observed only using a clone-by-clone sequencing strategy. Gene families comprising active and inactive members and sets of tandemly repeated genes seem to be common features of chromosome 1. This redundancy may account for the unexpectedly large number of predicted genes on this chromosome. The intergenic repetitive fraction of the genome is not well understood and is frequently described as ‘junk’. Repetitive sequences are usually removed or separated from other sequences before whole-genome shotgun assembly because they can cause global misassembly. But we know that functional genes are found in repetitive sequences and that transposable elements embedded in the repetitive sequences can restructure genomes, can control gene action and are likely to be involved in generating some of the allelic variation that has been selected in plants.

In addition, high-quality finished sequence provides the only real opportunity to study gene regulation, because most of the essential regulatory sequences fall outside the transcribed regions and our analysis of a restricted region of the genome showed that 43% of the genes predicted from whole-genome shotgun sequence methods were incomplete. Our results and those from the sequencing of rice chromosome 4 (ref. 18) show clearly the importance of the finished sequence. The IRGSP has an immediate goal of sequencing the rice genome to a minimum standard of the high-throughput genomic sequence (HTG) phase 2 level by the end of 2002 and is committed to a long-term goal of obtaining finished high-quality sequence for the whole genome.

Methods

Chromosome sequencing

We sequenced the whole chromosome 1 of Oryza sativa ssp. japonica, variety Nipponbare, from 390 overlapping PAC/BAC clones. Initially, we constructed a sequence-ready physical map using the RGP Sau3AI PAC and MboI BAC libraries19. We also used HindIII or EcoRI BAC libraries constructed by Clemson University Genomics Institute (CUGI), and BAC clones with draft sequence data provided by Monsanto for gap filling in particular. We carried out shotgun sequencing of RGP and CUGI PAC/BAC clones to obtain sequence data with tenfold overlap. For Monsanto BAC clones20, we complemented the available draft sequence (fivefold redundancy) with an additional fivefold overlap sequence (http://rgp.dna.affrc.go.jp/genomicdata/seqstrategy/newstrategy.html).

After the initial assembly of sequence data, stretches of poor or ambiguous quality and apparent gap regions were identified for further sequencing to obtain greater than 99.99% sequence accuracy. But despite extensive efforts to improve the sequence quality and to fill the gaps, 4 of the 390 PAC/BAC clones sequenced are still at phase 1 (GenBank, http://www.ncbi.nlm.nih.gov/HTGS/) because the consensus sequence could not be ordered correctly owing to numerous repeats. The remainder comprises 16 phase 2 and 370 phase 3 clones. The nine contigs for chromosome 1 representing the non-overlapping segments of continuous sequence were conjoined by inserting into the gap regions nucleotides that were calculated on the basis of the results of FISH experiments. All of the sequence information of chromosome 1 has been submitted to the DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp/) with the accession number BA000010 (Con Division).
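The conjoining step can be pictured as concatenating the contigs with a placeholder run of the estimated gap length between each pair. The sketch below assumes that the inserted nucleotides are written as the ambiguity code "N"; the text states only that the number of inserted nucleotides was calculated from the FISH results, so the filler character and the function name are illustrative.

```python
def conjoin_contigs(contigs, gap_sizes_bp, filler="N"):
    """Join contigs in order, padding each gap with `filler` characters.

    `contigs` is a list of sequence strings ordered along the chromosome;
    `gap_sizes_bp` gives the estimated size of the gap after each contig
    except the last. The "N" filler is an assumption, not taken from the text.
    """
    if not contigs:
        return ""
    if len(gap_sizes_bp) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")

    pieces = []
    for index, contig in enumerate(contigs):
        pieces.append(contig)
        if index < len(gap_sizes_bp):
            pieces.append(filler * gap_sizes_bp[index])
    return "".join(pieces)


# Toy example with made-up sequences and gap sizes.
print(conjoin_contigs(["ACGT", "GG", "TTA"], [3, 1]))
```

With nine contigs the function would take eight gap sizes, and the length of the result is the summed contig length plus the summed gap estimates.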

Gene prediction and functional classification

We carried out gene prediction using our in-house automated gene prediction system RiceGAAS13. The algorithm for gene domain prediction in RiceGAAS was designed by combining several prediction programs including GENSCAN21 for maize, GENSCAN21 for Arabidopsis, RiceHMM (http://rgp.dna.affrc.go.jp/RiceHMM/index.html) and the exon-finding program MZEF (http://argon.cshl.org/genefinder/), with homology search results from BLASTN and BLASTX (http://www.ncbi.nlm.nih.gov/BLAST/). These results were merged and integrated for gene prediction. Domain search was done using InterPro (http://www.ebi.ac.uk/interpro/scan.html), and repeats were identified using RepeatMasker (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker). The predicted proteins were used to query the nonredundant protein database using BLASTP and categorized according to functional categories defined for Arabidopsis by MIPS (http://mips.gsf.de/cgi-bin/proj/thal/filter_funcat.pl?all) with a threshold probability value of 10⁻²⁰.