Main

We report here the completion of the fully annotated genome sequence of the simple eukaryote Schizosaccharomyces pombe, a fission yeast. It becomes the sixth eukaryotic genome to be sequenced, following Saccharomyces cerevisiae1, Caenorhabditis elegans2, Drosophila melanogaster3, Arabidopsis thaliana4 and Homo sapiens5,6. The entire sequence of the unique regions of the three chromosomes is complete, with gaps in the centromeric regions of about 40 kb, and about 260 kb in the telomeric regions. The completion of this sequence, the availability of sophisticated research methodologies, and the expanding community working on S. pombe, will accelerate the use of S. pombe for functional and comparative studies of eukaryotic cell processes.

Schizosaccharomyces pombe is a single-celled, free-living archiascomycete fungus sharing many features with cells of more complicated eukaryotes. From gene sequence comparisons and phylogenetic analyses, it has been suggested that fission yeast diverged from budding yeast around 330–420 million years (Myr) ago, and from Metazoa and plants around 1,000–1,200 Myr ago7, although a more recent estimate has put these times at 1,144 and 1,600 Myr, respectively8. Some gene sequences are as diverged between the two yeasts as they are from their human homologues, probably reflecting a more rapid evolution within fungal lineages than in the Metazoa. S. pombe was first described in the 1890s and has been extensively studied since the 1950s9,10, resulting in the characterization of around 1,200 genes (http://www.genedb.org/pombe). The ease with which it can be genetically manipulated is second only to that of S. cerevisiae among eukaryotes and it has served as an excellent model organism for the study of cell-cycle control, mitosis and meiosis11, DNA repair and recombination12, and the checkpoint controls important for genome stability13.

The 13.8-Mb genome of S. pombe is distributed between chromosomes I (5.7 Mb), II (4.6 Mb) and III (3.5 Mb)14, together with a 20-kb mitochondrial genome15. Tandem arrays of 100–120 repeats of a 10.4-kb fragment containing the 5.8S, 18S and 25S ribosomal RNA genes account for around 1.1 Mb16. The three centromeres are 35, 65 and 110 kb long for chromosomes I, II and III, respectively, totalling 0.2 Mb. This leaves about 12.5 Mb of unique sequence, similar in size to that of S. cerevisiae, and substantially smaller than those of the three other sequenced model eukaryotes, C. elegans (97 Mb), Arabidopsis (125 Mb) and Drosophila (137 Mb). All of the unique sequence and most of the three centromeres of the Urs Leupold 972h- strain9 have been sequenced by the Wellcome Trust Sanger Institute and the 13 other laboratories that make up the S. pombe European Sequencing Consortium (EUPOM), together with 100 kb of sequence generated by the Cold Spring Harbor Laboratory (GenBank accession numbers AL355920, AL355921, AL391034 and AL391016). Here, we present and discuss the genome sequence and composition, and carry out an initial overview of gene function, making comparisons with other eukaryotic organisms, particularly S. cerevisiae.
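The size budget in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable and function names are illustrative and not part of the original analysis.

```python
# Sizes quoted in the text; names are illustrative only.
chromosomes_mb = {"I": 5.7, "II": 4.6, "III": 3.5}
centromeres_kb = {"I": 35, "II": 65, "III": 110}
rdna_mb = 1.1            # tandem rRNA repeat arrays ("around 1.1 Mb")
repeat_unit_kb = 10.4    # size of one rDNA repeat
repeat_copies = (100, 120)

def genome_budget():
    """Return (total, centromeric, unique) sequence in Mb."""
    total_mb = sum(chromosomes_mb.values())
    centromere_mb = sum(centromeres_kb.values()) / 1000
    unique_mb = total_mb - rdna_mb - centromere_mb
    return total_mb, centromere_mb, unique_mb

def rdna_range_mb():
    """Span of rDNA array sizes implied by 100-120 copies of a 10.4-kb unit."""
    return tuple(n * repeat_unit_kb / 1000 for n in repeat_copies)

total_mb, centromere_mb, unique_mb = genome_budget()
print(f"total {total_mb:.1f} Mb, centromeres {centromere_mb:.2f} Mb, "
      f"unique {unique_mb:.2f} Mb")
print("rDNA range (Mb):", rdna_range_mb())
```

The unique-sequence figure comes out just under 12.5 Mb, in line with the "about 12.5 Mb" stated in the text, and the quoted 1.1 Mb of rDNA falls inside the range implied by 100–120 copies.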

Mapping, sequencing and sequence analysis

A clone map was generated by the integration of the two pre-existing maps17,18. End sequencing and restriction digestion of cosmids were used to construct a minimal tile path for sequencing. Problems with the earlier maps included the existence of chimaeric clones, mismapped cosmids, bacterial insertion elements and unfilled gaps. Small gaps were covered using a long-range polymerase chain reaction (PCR) strategy and plasmid libraries, and a bacterial artificial chromosome (BAC) library provided clones for gap closure across regions not represented in the cosmid libraries. The final 12.5-Mb sequence of the S. pombe genome is a composite of 452 cosmids, 22 plasmids, 15 BAC clones and 13 PCR products.

Most sequencing was performed using random sequencing of sub-cloned DNA followed by directed sequencing19. DNA from clones was shattered (usually by sonication) and fragments of 1.4–2 kb were cloned, typically, into M13 or pUC18. Random sub-clones were sequenced with dye-terminator chemistry and analysed on automated sequencers. Most laboratories used Phred software for sequence base calling and Phrap or Gap4 for contig assembly20. Gaps and low-quality regions of the sequence were resolved using primer walking, PCR and re-sequencing clones, under conditions that gave increased read lengths. Some laboratories also used direct blotting procedures, classical radioactive sequencing and nested deletions. All sequences were finished to a high degree of accuracy, with at least two high-quality reads on each strand, or, if this could not be accomplished, an additional read on the same strand using an alternative chemistry. The depth of coverage was on average eightfold. Sequences were collected centrally at the Wellcome Trust Sanger Institute, where the quality was examined by comparison of overlapping regions and by checking for frameshifts in coding regions. The sequencing error rate was less than 1 in 180,000 base pairs (bp), calculated from the number of single-base differences observed in overlapping sequences from different sources. All identified sequencing errors have been resolved with the exception of four single-base differences found in homopolymeric tracts located outside coding regions, possibly generated by slippage during DNA replication.

Gene prediction was carried out with GENEFINDER (P. Green and L. Hillier, unpublished software) trained on experimentally confirmed S. pombe genes to recognize intronic and coding regions. Additional information was provided using a Hidden Markov Model trained on intron sequences using HMMER (http://hmmer.wustl.edu/hmmer-html/). Searches were performed against public databases (SWISS-PROT and TrEMBL21, EMBL22 and Pfam23), using BLAST24, MSPcrunch25, FASTA26 and Genewise27. The predictions were refined manually within the Artemis analysis and annotation tool28 using protein homology and expressed sequence tag (EST) data29. Because most S. pombe genes have a prospective homologue in other organisms, putative functions were assigned on the basis of similarities to known genes, using the SWISS-PROT21, Pfam23, Proteome30, SGD31 and MIPS databases32. Identification of transfer RNA was carried out using the tRNAscan-SE software33.

Prediction of genes in fission yeast is a problem of intermediate complexity. It is more difficult than the analysis of tightly packed genomes that have little or no splicing, as found in prokaryotes and budding yeast, but less difficult than gene prediction in multicellular eukaryotes, which have lower gene density, high levels of splicing, and long introns. There are 4,730 confirmed and predicted introns in S. pombe, many more than the 272 now predicted for S. cerevisiae. S. pombe introns average only 81 nucleotides in length and so are shorter and easier to predict than those found in Metazoa and plants. Of the 4,730 introns in S. pombe, 638 have been confirmed experimentally by messenger RNA and EST data29, and many more by homology.

Genome content

We predicted a maximum of 4,940 protein coding genes (including 11 mitochondrial genes) and 33 pseudogenes. The three gene maps showing these predictions can be viewed at ftp://ftp.sanger.ac.uk/pub/yeast/pombe/GeneMaps/. All open reading frames (ORFs) over 100 amino acids with an initiator methionine and not overlapping with other known genes are included in this set. Also included are 147 confirmed or predicted protein-coding sequences of 25–99 amino acids. Any remaining undiscovered genes are likely to have either a highly spliced structure with small exons, or to be smaller than 100 amino acids. There are a further 116 questionable proteins considered less likely to be coding because they are small, have no detectable homologies, and display low coding potential. Removal of these questionable genes reduces the predicted gene complement from 4,940 to 4,824.

Even our upper estimate of 4,940 genes for S. pombe is substantially less than the 5,570–5,651 genes predicted for S. cerevisiae34,35, the 6,752 genes predicted for Mesorhizobium loti, the largest published prokaryote genome sequence to date36, and the 7,825 genes estimated in the 8.67-Mb genome of the prokaryote Streptomyces coelicolor (J. Parkhill and S. Bentley, personal communication). We conclude that a free-living eukaryotic cell can be constructed with fewer than 5,000 genes, and that the distinction between eukaryotic and prokaryotic cell organization is not determined simply by total number of genes but depends on the types of genes present and how they interact with each other and the environment. Comparing the genome content of species at different levels of organization, it seems that fewer than 500 genes are sufficient to generate a parasitic prokaryotic cell such as Mycoplasma genitalium37, about 1,500 genes for a free-living prokaryotic cell such as Aquifex aeolicus38, 5,000 genes for a free-living eukaryotic cell (S. cerevisiae and S. pombe; ref. 39 and this paper), and around 15,000 genes for multicellular eukaryotic organisms such as Drosophila and C. elegans2,3, whereas 30,000–40,000 genes gives rise to human consciousness5,6.

Gene density is similar for chromosomes I and II, with one gene every 2,483 and 2,457 bp respectively, but is less dense for chromosome III, at one gene every 2,790 bp. This is not due to differences in the average length of the genes, which are similar (1,407–1,446 bp) for all three chromosomes (Table 1). Protein-coding genes are absent from the centromeres, although tRNA genes are found in these regions. Gene density is also lower at the telomeres. The gene density for the complete genome is one gene every 2,528 bp, compared with one gene every 2,088 bp for S. cerevisiae. The protein-coding sequence is predicted to occupy 60.2% (57% excluding introns) of the sequenced portion of the S. pombe genome, compared with 71% in S. cerevisiae (70.5% excluding introns). The overall guanine and cytosine (GC) content is 36.0%, compared with 38.3% in S. cerevisiae, and for the protein-coding portion is identical in the two yeasts at 39.6%.

Table 1 Genome content for the three chromosomes

We have identified a total of 174 tRNAs, 45 of which have introns; all the tRNA families needed to decode all codons are present. The spliceosomal RNAs (U1–U6) are found together with 16 small nuclear RNA genes (snRNAs) and 33 small nucleolar RNAs (snoRNAs). These are dispersed mostly as singletons throughout the genome. The 5.8S, 18S and 26S ribosomal RNA genes are grouped together as 100–120 tandem repeats in two arrays on chromosome III40, but the thirty 5S ribosomal RNA genes are distributed throughout the genome41, providing opportunities for unequal crossing over when they are in tandem orientation and close proximity. This can lead to local duplications and deletions of genes located between the 5S RNA genes42. There are 11 intact transposable elements (Tf2 type) (Table 1), accounting for 0.35% of the genome. This is significantly less than the 2.4% (59 elements) found in S. cerevisiae43 and the 10% found in Arabidopsis4, and is also likely to be much less than the numbers in Drosophila and humans44,45. There are 25 wtf elements (‘with tf1- or tf2-type’ long terminal repeats, LTRs), which appear to be spliced membrane proteins of S. pombe. These elements are often flanked by LTRs, and so may have been duplicated by retrotransposition. There are also 180 solo LTRs, marking former transposition events, compared with 268 found in S. cerevisiae. The density of transposable element remnants on chromosome III of S. pombe is twice that of chromosomes I and II (Table 1).

We examined 73 genetically and physically mapped genes from the three gene maps; comparison of these maps shows that they are essentially co-linear and that the level of recombination is similar throughout the three chromosomes. More detailed comparisons of the genetic and physical maps may reveal subtle variations in recombination around centromeres, telomeres, the mating-type locus, and sites of meiotic DNA double-strand breaks. Several inconsistencies in the genetic maps were identified, including the reversal of a chromosome II fragment near the telomere between trp1 and spo4 (ref. 46), the relocation of cut1 and wee1 from the telomere region to the centromere region of chromosome III, and changes in position of lys1 and top1.

Centromere structures

The outline structure of the centromeres has previously been deduced by Southern blotting and by sequencing about 14% of the centromere repeat regions47,48,49. Here, we sequenced most (81%) of the three centromeres; this has allowed schematic maps of the centromeres to be verified (Fig. 1). The nomenclature used follows that of the Yanagida group50,51; however, other designations of the centromere elements have been used52. The most complete sequence is for centromere 1, which is the shortest at 35 kb and is missing only one 2.5-kb fragment. This centromere consists of a central core (cnt1) of 4.1 kb and 28% GC content, flanked by two 5.6-kb imperfect imr1 repeats (imr1L, imr1R) with 29% GC content, and two pairs of 4.4-kb dg and 4.8-kb dh repeats (dg1, dh1) of 33–34% GC content. A repeat of around 0.3 kb, known as cen 253 (EMBL X13757), is found adjacent to the dh repeats. The maps of the other two centromeres have the same basic structure with central cnt regions flanked by imr repeats and by variable numbers of dg and dh repeats separated by cen 253. Cnt1, -2 and -3 share 48% identity over a 1,405-bp region, and dh1, -2 and -3 share 48% identity over a 1,811-bp region. However, the most striking conservation is observed in the dg regions, which share 97% identity over a 1,780-bp region. This highly conserved segment represents an element that is essential for centromere function; deletion of this region from the dg repeat, termed the K/K″ repeat by the Clarke group, results in a complete loss of centromere activity in both mitosis and meiosis53. There must be a special mechanism to maintain such a high level of sequence conservation between the different centromeres. The total calculated lengths of centromeres 1, 2 and 3 are respectively 35, 65 and 110 kb, inversely related to the lengths of the chromosomes at 5.7, 4.6 and 3.5 Mb. Possibly more extended centromeric regions are required for proper mitotic and meiotic behaviour when the chromosome arms are shorter. As noted above there are no protein-coding genes in the centromeric region but there are many tRNA genes (Fig. 1). tRNA clusters flank centromeres 2 and 3 and are also found within the imr regions of all three centromeres50. These tRNA genes might contribute to centromere function by defining domain boundaries important for centromere activity54.

Figure 1: Schematic maps of the three S. pombe centromeres showing the repeated elements.

The key is given at the bottom of the figure and the relevant clones are indicated under each centromere map. The maps are not drawn to scale.

The S. pombe centromeres are considerably longer than their S. cerevisiae equivalents, which contain a core region sufficient for centromere activity of only 120 bp55,56 and a nuclease-protected region of 150–160 bp including the 120-bp conserved core57. It is not clear why S. pombe centromeres are 300–1,000 times larger than their S. cerevisiae equivalents, but one possibility is that their kinetochore structures are different.
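The "300–1,000 times" range follows from dividing the S. pombe centromere lengths given earlier (35, 65 and 110 kb) by the 120-bp S. cerevisiae core. A brief check, with illustrative names only:

```python
# Centromere lengths from the text; names are illustrative only.
pombe_centromeres_bp = {"cen1": 35_000, "cen2": 65_000, "cen3": 110_000}
cerevisiae_core_bp = 120

def fold_difference(lengths_bp, reference_bp):
    """Ratio of each centromere length to the reference core length."""
    return {name: length / reference_bp for name, length in lengths_bp.items()}

ratios = fold_difference(pombe_centromeres_bp, cerevisiae_core_bp)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}-fold larger")
```

Using the 150–160-bp nuclease-protected region as the reference instead would lower each ratio by roughly a quarter.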

Intergene regions

The total intergene length distributions for S. pombe and S. cerevisiae are shown in Fig. 2. The length is calculated from the stop codon to the next start codon for tandemly oriented genes, from the start codon to the start codon for divergently oriented genes, and from the stop codon to the stop codon for convergently oriented genes. Intergenic regions in S. pombe have a mode of 423 bp and a mean of 952 bp, both longer than the equivalent values for S. cerevisiae (200 and 515 bp respectively). Analysis of the divergent intergene regions reveals that pairs of upstream regions range in length from 200 to 2,100 bp, with a peak between 200 and 1,200 bp (Fig. 2). This is longer than the equivalent distributions in S. cerevisiae, which range from 200 to 900 bp, with a peak from 200 to 700 bp (Fig. 2). Analysis of convergent intergene regions shows a peak in length for pairs of downstream regions of 200–800 bp for S. pombe and 100–500 bp for S. cerevisiae (Fig. 2). Therefore there is a smaller difference between the two yeasts for the intergenic regions between convergent genes (downstream regions) than for those between the divergent genes (upstream regions).
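The orientation rule described above can be sketched in code. This is a minimal illustration, not the pipeline used for Fig. 2: the `Gene` record, the coordinate convention (1-based, inclusive coding limits) and the toy genes are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    left: int    # leftmost coding base (1-based, inclusive); hypothetical convention
    right: int   # rightmost coding base (1-based, inclusive)
    strand: str  # "+" or "-"

def classify_intergenic(genes):
    """Return (gene_a, gene_b, kind, length) for each adjacent gene pair.

    tandem:     same strand, so the gap runs from a stop codon to the next start
    divergent:  "-" then "+", so the gap runs start codon to start codon
    convergent: "+" then "-", so the gap runs stop codon to stop codon
    """
    ordered = sorted(genes, key=lambda g: g.left)
    regions = []
    for a, b in zip(ordered, ordered[1:]):
        length = b.left - a.right - 1  # negative if the two genes overlap
        if a.strand == b.strand:
            kind = "tandem"
        elif a.strand == "-" and b.strand == "+":
            kind = "divergent"
        else:
            kind = "convergent"
        regions.append((a.name, b.name, kind, length))
    return regions

# Toy example with made-up coordinates.
toy = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 901, 1500, "+"),
    Gene("geneC", 2001, 2600, "-"),
    Gene("geneD", 3001, 3500, "+"),
]
regions = classify_intergenic(toy)
for region in regions:
    print(region)
```

Because the gap is always measured between the coding limits of neighbouring genes, the three cases differ only in which codons (start or stop) bound the region, which is what determines whether it contains upstream or downstream sequence.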

Figure 2: Intergene regions.

Distribution of intergene regions given for all genes and for divergent and convergent pairs of genes, for both S. pombe and S. cerevisiae. A total of 4,890 intergene regions from S. pombe were analysed from a database prepared just before completion of the whole genome, and 5,788 intergene regions from S. cerevisiae were analysed. Histograms show the number of regions in 200-bp bins.

Several explanations can account for these results. The 5′ mRNA regions may be systematically longer in S. pombe than in S. cerevisiae, although there is no evidence for this. For example, the spacing between the TATA-box region and the transcriptional start in S. pombe is shorter than that in S. cerevisiae58,59. Alternatively, the promoter regions may be of greater complexity in S. pombe and therefore longer. Again there is no direct evidence to support this view, but there are other examples of more-extended organization of chromatin elements in S. pombe, including larger centromeres and regions of DNA replication origin60. The existence of truly intergenic spacer regions in S. pombe is supported by the identification of several 4–8-kb extended gene-free regions, which fall outside the broad distribution of lengths associated with average intergenic regions. These are low complexity sequences with a (G - C)/(G + C) strand switch61. There are about ten gene-free regions per chromosome, which are usually flanked by tandemly oriented genes. One of these gene-free regions, between SPAC4G8.03c and SPAC4G8.04, corresponds to a prominent meiotic DNA break site or cluster of sites (J. A. Young, R. W. Schreckhise and G. R. Smith, manuscript in preparation).
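The (G - C)/(G + C) quantity mentioned above is commonly called GC skew, and a strand switch appears as a change of sign along the sequence. The following is a minimal sketch of a windowed calculation; the window size, step and toy sequence are arbitrary choices for illustration, not parameters from the original analysis.

```python
def gc_skew(seq):
    """(G - C) / (G + C) for one sequence; 0.0 if it contains no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)

def windowed_skew(seq, window, step):
    """GC skew in consecutive windows, returned as (start_offset, skew) pairs."""
    return [(i, gc_skew(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]

def sign_changes(skews):
    """Offsets at which the skew changes sign relative to the previous window."""
    switches = []
    for (_, prev), (pos, cur) in zip(skews, skews[1:]):
        if prev * cur < 0:
            switches.append(pos)
    return switches

# Toy sequence: G-rich first half, C-rich second half.
toy = "GGGAGGTG" + "CCCACCTC"
skews = windowed_skew(toy, window=8, step=8)
print(skews, sign_changes(skews))
```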

Introns

A total of 4,730 introns is distributed among 43% of S. pombe genes, with 15 being the largest number of introns found within a single gene (Table 2). Introns varied from 29 to 819 nucleotides long, with a mean length of 81 and a mode of 48 nucleotides. In S. cerevisiae, introns are much rarer, with only 5% of genes having introns. Most introns in S. pombe follow the rule of GT donor and AG acceptor, but there are three examples that have GC donors62. The average positions of introns within genes were assessed by mapping them with respect to the start and stop codons. This analysis does not take into account any introns in 5′ and 3′ untranslated regions. For the genes with 1–6 introns there is a 5′ bias from the values expected if introns were evenly distributed throughout the genes (Table 2). A 5′ bias is also seen in S. cerevisiae, where it has been hypothesized to be due to in vivo reverse transcription generating complementary DNAs primed from the 3′ ends of the mRNAs, followed by replacement of the original chromosomal gene with the cDNA by homologous recombination63. Because cDNAs are extended from their 3′ ends, there will be a tendency for introns at 5′ ends not to be removed from the chromosomal genes. Of genes that have two or more introns, 614 have two introns, 324 have three, 148 have four, 70 have five and 40 have six (Table 2). Thus the number of genes having an extra intron decreases by about half as intron number increases from two to six per gene. These observations may be of relevance to speculations concerning the mechanisms by which introns are generated and removed64. The relatively large number of introns in S. pombe provides opportunities for alternative splicing to generate protein variants, which could have regulatory roles as well as increasing the range of protein types present in the cell65.
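The "about half" statement can be checked directly from the counts quoted above by taking the ratio of each class to the one before it. The names below are illustrative.

```python
# Number of genes with a given intron count, as quoted in the text (Table 2).
genes_by_intron_count = {2: 614, 3: 324, 4: 148, 5: 70, 6: 40}

def successive_ratios(counts):
    """Ratio of genes with n+1 introns to genes with n introns."""
    keys = sorted(counts)
    return {f"{n}->{m}": counts[m] / counts[n] for n, m in zip(keys, keys[1:])}

ratios = successive_ratios(genes_by_intron_count)
for step, ratio in ratios.items():
    print(f"{step} introns: {ratio:.2f}")
```

The four ratios lie between roughly 0.46 and 0.57, consistent with an approximately geometric decline.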

Table 2 Introns per gene and average positions of introns within genes

Genome duplications and comparisons

Comparisons of chromosomal sequences and searches for tracts of conserved gene order did not reveal evidence for large-scale genome duplications in S. pombe. This differs from reports for S. cerevisiae and Arabidopsis, which have suggested that both of these organisms have undergone some large-scale genome duplication4,66. However, blocks of duplicated sequence totalling about 50 kb retaining a conserved gene order can be found at the sub-telomeric regions of chromosomes I and II. Twenty-four genes (in groups of two or four) are 100% identical at the DNA level, and twenty of these are localized in sub-telomeric regions, suggesting frequent exchange of genetic information at these positions. Most of these genes code for proteins belonging to families specific to fission yeast and are predicted to be cell-surface proteins. Interestingly, in S. cerevisiae 7 of the 16 genes (in groups of two, three or four) that are 100% identical at the DNA level are also located in sub-telomeric regions. These gene products include members of the budding-yeast-specific PAU and COS families, which are also predicted to be cell-surface proteins39. In the highly plastic telomeric and sub-telomeric regions of malaria and several other protozoan parasites, genes coding for species-specific cell-surface proteins are also found, for example, the Var, Rifin and Stevor families of Plasmodium falciparum67. These data suggest that recombination events between telomeric regions may be a major mechanism involved in the generation of organism-specific cell-surface molecules. These molecules may also be of importance for cell identity and for processes that generate hypervariable cell-surface molecules relevant for self and non-self recognition.

We next compared the proteins of S. pombe with those of the unicellular eukaryote S. cerevisiae and the metazoan C. elegans (Fig. 3), using BlastP24 with a cutoff E-value of 0.001 and no low-complexity filtering. Excluding genes coded by the mitochondria and transposons, we used a data set of 4,876 proteins from S. pombe, 5,777 proteins from S. cerevisiae (Cerpep 14 May 2001; ftp://ftp.sanger.ac.uk/pub/yeast/SCreannotation/cerpep) and 19,622 proteins from C. elegans (ftp://ftp.sanger.ac.uk/pub/databases/wormpep). About two-thirds of the S. pombe proteins (3,281) have homologues in common with both S. cerevisiae and C. elegans (Fig. 3). A smaller number, 769 (16%), have homologues in S. cerevisiae but not in C. elegans and many fewer, 145 (3%), have homologues in C. elegans but not in S. cerevisiae. A total of 681 proteins (14%) seems to be unique to S. pombe. A comparison between S. cerevisiae and the other two organisms gave similar results, with 3,605 (62%) of the proteins having homologues in both other organisms, 918 (16%) having homologues only in S. pombe and 150 (3%) only in C. elegans, leaving 1,104 proteins (19%) unique to S. cerevisiae. Thus, S. cerevisiae proteins with homologues only in S. pombe total 918 whereas the reverse comparison totals 769 (Fig. 3), indicating that there might be more gene duplications in S. cerevisiae, accounting for the extra proteins found in this organism.
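The four-way partition used in this comparison can be illustrated as follows. The classification function is a sketch under stated assumptions: the input format (best E-value per target proteome, or `None` when there is no hit) is hypothetical, and treating the 0.001 cutoff as inclusive is a guess. The second part simply recomputes the quoted percentages from the quoted counts.

```python
E_CUTOFF = 1e-3  # cutoff from the text; inclusive comparison is an assumption

def has_hit(evalue):
    """True if a best-hit E-value exists and passes the cutoff."""
    return evalue is not None and evalue <= E_CUTOFF

def classify(best_sc, best_ce):
    """Assign one S. pombe protein to a category from its best E-values
    against S. cerevisiae (best_sc) and C. elegans (best_ce)."""
    in_sc, in_ce = has_hit(best_sc), has_hit(best_ce)
    if in_sc and in_ce:
        return "both"
    if in_sc:
        return "cerevisiae_only"
    if in_ce:
        return "elegans_only"
    return "unique"

def percentages(counts):
    """Total and rounded percentage share for each category."""
    total = sum(counts.values())
    return total, {k: round(100 * v / total) for k, v in counts.items()}

# Counts quoted in the text.
pombe_counts = {"both": 3281, "cerevisiae_only": 769, "elegans_only": 145, "unique": 681}
cerevisiae_counts = {"both": 3605, "pombe_only": 918, "elegans_only": 150, "unique": 1104}

print(percentages(pombe_counts))
print(percentages(cerevisiae_counts))
```

Both sets of counts sum to the data-set sizes given above (4,876 and 5,777), and the rounded shares match the percentages in the text.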

Figure 3: Comparison of proteins in S. pombe (S.p.), S. cerevisiae (S.c.) and C. elegans (C.e.).

a, Pie chart comparing the homology of proteins of S. pombe with those of S. cerevisiae and C. elegans. b, Pie chart comparing the homology of proteins of S. cerevisiae with those of S. pombe and C. elegans. For example, S.p. proteins in S.c. and C.e. means S. pombe proteins with homologues found in S. cerevisiae and C. elegans. The absolute numbers of proteins are given for both yeasts.

To investigate gene duplication further, we carried out an ‘all against all’ comparison using the same protein data sets and NCBI BlastClust68 (ftp://ncbi.nlm.nih.gov/blast/documents/README.bcl) to distinguish protein clusters from proteins represented uniquely. Of the 4,876 protein-coding genes of S. pombe, 4,515 have no other sequence relatives within the organism and can be considered unique. The remaining 361 are distributed among protein cluster groups with two or more members (Table 3). Using the same parameters in S. cerevisiae, 5,061 genes are unique and 716 fall into groups with two or more members (Table 3). This supports the idea that there is less gene redundancy in S. pombe than in S. cerevisiae, which may help functional analyses of those genes that are not duplicated in S. pombe.
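The distinction between unique proteins and cluster members can be illustrated with a small single-linkage grouping. This is a stand-in sketch, not BlastClust itself: the list of similar pairs is assumed to come from some prior all-against-all search, and the protein names are made up. The last lines recompute the clustered fraction from the counts quoted above.

```python
def single_linkage(proteins, similar_pairs):
    """Group proteins so that any two linked by a chain of similar pairs
    end up in the same cluster (union-find with path compression)."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())

def unique_and_clustered(clusters):
    """Count proteins in singleton clusters and in clusters of two or more."""
    unique = sum(1 for c in clusters if len(c) == 1)
    clustered = sum(len(c) for c in clusters if len(c) > 1)
    return unique, clustered

# Toy data with hypothetical protein names.
toy_clusters = single_linkage(["p1", "p2", "p3", "p4", "p5"],
                              [("p1", "p2"), ("p2", "p3")])
print(unique_and_clustered(toy_clusters))

# Clustered fraction from the counts in the text.
pombe_pct = 100 * 361 / (4515 + 361)
cerevisiae_pct = 100 * 716 / (5061 + 716)
print(f"S. pombe {pombe_pct:.1f}%, S. cerevisiae {cerevisiae_pct:.1f}%")
```

On the quoted counts, about 7.4% of S. pombe proteins fall into clusters compared with about 12.4% in S. cerevisiae.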

Table 3 Gene duplication in S. pombe and S. cerevisiae

Human disease genes

To assess the usefulness of S. pombe for investigating the functions of genes related to human disease, we used the same method and dataset of human disease genes as that employed for analysis of the Drosophila genome69. Protein-coding genes of S. pombe were identified that generate products with similarities to proteins coded by 289 genes that are mutated, amplified or deleted in human disease. A total of 172 S. pombe proteins have similarity with members of this data set of human disease proteins, and 122 of these have E-values greater than 1 × 10⁻⁴⁰. These values indicate that either they are not significant or they have only limited similarities with the equivalent human proteins, reflecting, for example, shared domains such as related protein-interacting regions or catalytic sites. However, despite this limitation, they may still be useful for investigating the biochemical activities and interactions of human disease proteins in S. pombe. The other 50 S. pombe proteins (Tables 4 and 5) have E-values lower than 1 × 10⁻⁴⁰. The more significant similarities seen with this class mean that genes coding for these proteins are more likely to be useful for investigating not only the biochemical but also the biological functions of the human genes, and some could provide good models for studying the associated human disease pathways. The largest group of human disease-related genes are those implicated in cancer. There are 23 such genes (Table 4), and they are involved in DNA damage and repair, checkpoint controls, and the cell cycle, all processes involved in maintaining genomic stability. The cell cycle and checkpoint background of S. pombe make it a good model organism for studying these particular cancer disease pathways. Other categories that are also represented in S. pombe are those involved in metabolic (12 genes), neurological (13 genes), cardiac (1 gene) and renal (1 gene) disease (Table 5).
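The split into weaker and stronger matches is a simple threshold on E-value. The sketch below shows the idea on made-up hits; the protein names and E-values are hypothetical, and assigning a value exactly equal to the threshold to the weaker class is an assumption, since the text does not say. The final lines check that the quoted counts are mutually consistent.

```python
THRESHOLD = 1e-40  # E-value boundary used in the text

def split_by_evalue(hits, threshold=THRESHOLD):
    """Partition {protein: best E-value} into (strong, weak) matches.
    Lower E-values are more significant."""
    strong = {name: e for name, e in hits.items() if e < threshold}
    weak = {name: e for name, e in hits.items() if e >= threshold}
    return strong, weak

# Hypothetical hits for illustration only.
toy_hits = {"protA": 1e-75, "protB": 3e-12, "protC": 1e-41}
strong, weak = split_by_evalue(toy_hits)
print(sorted(strong), sorted(weak))

# Consistency of the counts quoted in the text.
total_similar, weak_count = 172, 122
strong_count = total_similar - weak_count
by_category = {"cancer": 23, "metabolic": 12, "neurological": 13,
               "cardiac": 1, "renal": 1}
print(strong_count, sum(by_category.values()))
```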

Table 4 Schizosaccharomyces pombe genes related to human cancer genes
Table 5 Schizosaccharomyces pombe genes related to human disease genes

A similar analysis in S. cerevisiae identified 182 proteins with similarities to the human disease set, with most of the genes coding for these proteins being shared by the two yeasts. Only two of the genes (SPAC630.13c and SPBC530.12c), found in S. pombe but not S. cerevisiae, code for proteins with any significant similarity to human disease proteins. These are tuberous sclerosis 2 (TSC2), involved in cancer, and ceroid lipofuscinosis PPT1, involved in metabolism. Both yeasts seem to be similarly useful as model organisms for the study of human disease gene function, although their differing biologies may favour one organism for certain genes and the other organism for other genes.

Protein domains

Listed in Table 6 are the ten most frequent protein domains found in S. pombe, with 11 more domains of interest in the top 40 most frequent, as determined by InterPro matches70, together with the frequency of these domains for the other fully sequenced eukaryotic genomes. These domains are divided into three categories (1–3).

Table 6 Protein domain analysis and comparison with other eukaryotes

The first category (1) consists of five domains found in the top ten most frequent domains in S. pombe that are also found in the top ten of at least four of the other eukaryotes. They are the ATP/GTP binding site, the WD40 repeat, the eukaryotic protein kinase catalytic core, the RNA binding region RNP-1, and the zinc finger C2H2-type transcriptional activator. These universal and commonly exploited domains also feature highly in other eukaryotes. Because total gene number increases with the complexity of an organism, the proportion of these domains is approximately similar in each of the sequenced eukaryotic genomes. Energy utilization exploiting the ATP/GTP binding site, protein phosphorylation dependent on the catalytic protein kinase domain, and transcriptional activation using the zinc finger C2H2 domain must define biochemical mechanisms that are readily exploited to generate new biological pathways.

In the second category (2), the domains are present in a similar absolute number in the eukaryotic genomes analysed. Amongst those more frequently found in this category are the BRCT, replication factor C, minichromosome maintenance proteins (MCMs), Fizzy, DNA-directed DNA polymerase β family and helicase C-terminal domains. Some of these are involved in core cell activities like DNA replication, DNA repair and cell-cycle progression, perhaps explaining why they are present in similar absolute number regardless of genome size71. Systematic searches for other domains present in similar absolute numbers in genomes of all eukaryotes might identify other, at present unrecognized, functions involved in similar core cell activities.

The third category (3) includes domains whose occurrence rises dramatically with increasing genome size within the Metazoa. This category includes the SH3, PH and tyrosine/dual-specificity phosphatase domains. These are involved in intra- and intercellular signalling pathways, which might be expected to become increasingly elaborate as multicellular complexity increases69,71.

Two other domains in the top ten for both the yeasts are the sugar and ABC transporters (Table 6). S. cerevisiae has significantly more of these domains and the amino-acid permease domain than does S. pombe72, which may explain why it is a more versatile organism, growing on a greater range of media. The Zn(II)Cys(6) transcription-factor domain is found only in the two yeasts, supporting the idea that it is specific to fungi. The chromodomain is found more frequently in S. pombe—seven examples compared with two in S. cerevisiae—possibly reflecting differences in higher-order chromatin structure.

Defining the eukaryotic cell

The genome sequence of S. pombe increases the range of available complete eukaryotic genome sequences to two unicellular free-living organisms (S. cerevisiae and S. pombe), one plant (Arabidopsis), and three metazoans (C. elegans, Drosophila and humans). This range of organisms allows a comparison between eukaryotic and prokaryotic genomes (represented by 37 bacteria and 8 archaea), with the intention of identifying those genes important for eukaryotic cell organization. We have made an initial analysis to identify the more conserved genes falling in this category by comparing the predicted protein sequences coded by the above genomes. The percentage similarity was derived from the hit bit score divided by the self bit score for each protein (see Table 7 legend). We selected those proteins with a high percentage similarity score in all of the eukaryotes, and a low one in all of the prokaryotes. Three thresholds (50%, 45% and 40%) were used to identify proteins that are highly conserved in the fully sequenced eukaryotes and three corresponding thresholds (20%, 15% and 12% respectively) to identify proteins not found in the fully sequenced prokaryotes (Table 7a). For an initial discussion of these proteins, thresholds of 50% and 20% were selected. This analysis identifies genes coding for proteins that are highly conserved in yeasts, plants and metazoans (by using a threshold of 50% similarity) and yet are not well conserved in prokaryotes (by using a threshold of 20% similarity). The proteins identified using these criteria are likely to be important for maintaining eukaryotic cell organization, although the high threshold of 50% means that other proteins required for this may well be excluded.
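The selection rule described above (percentage similarity as hit bit score divided by self bit score, required to be high in every eukaryote and low in every prokaryote) can be sketched as follows. The function names, the treatment of boundary values and the toy bit scores are assumptions for illustration; a genome with no hit is represented here by a bit score of 0.

```python
def percent_similarity(hit_bits, self_bits):
    """Hit bit score as a percentage of the query's self bit score."""
    return 100.0 * hit_bits / self_bits

def is_eukaryote_specific(self_bits, euk_hit_bits, prok_hit_bits,
                          euk_min=50.0, prok_max=20.0):
    """True if the best hit in every eukaryote reaches euk_min percent and
    the best hit in every prokaryote stays below prok_max percent.
    Whether the boundaries are inclusive is an assumption."""
    euk_ok = all(percent_similarity(b, self_bits) >= euk_min for b in euk_hit_bits)
    prok_ok = all(percent_similarity(b, self_bits) < prok_max for b in prok_hit_bits)
    return euk_ok and prok_ok

# The three paired thresholds given in the text: (eukaryote min, prokaryote max).
THRESHOLD_PAIRS = [(50.0, 20.0), (45.0, 15.0), (40.0, 12.0)]

# Hypothetical protein: self score 800, best hits in five eukaryotes and three prokaryotes.
self_bits = 800
euk_hits = [640, 520, 480, 450, 380]
prok_hits = [90, 40, 0]

results = [is_eukaryote_specific(self_bits, euk_hits, prok_hits, e, p)
           for e, p in THRESHOLD_PAIRS]
print(results)
```

In this toy case the weakest eukaryotic hit is 47.5% of the self score, so the protein fails the 50%/20% pair but passes the two more permissive pairs, illustrating how the stricter threshold can exclude proteins that are still broadly conserved.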

Table 7 Identifying conserved genes important for defining the eukaryotic cell and multicellularity

Using these thresholds, 62 genes were identified and grouped according to function (Table 8). More information about these genes can be found on the GeneDB website (http://www.genedb.org/pombe) and the PombePD website (http://proteome.com/databases). Two of these groups code for proteins associated with characteristics considered to distinguish eukaryotic cells from prokaryotic cells: the organization of DNA in chromosomes within a nucleus, and the formation of 40S and 60S ribosomal subunits, which are larger than the prokaryotic 30S and 50S subunits. The first group includes the H3 and H4 core histone proteins required for packaging DNA into nucleosomes, the Hda1 histone deacetylase, which suggests histone acetylation is critical for eukaryotic chromatin, and the Ran GTPase Spi1, a key element for nuclear membrane transport. One putative protein in this category (SPAC890.07c) is possibly involved in export of mRNA binding proteins and another may be localized in the nucleus (SPCP1E11.08). The second group includes two Rps and six Rpl proteins, components of the 40S and 60S ribosomal subunits respectively; these eight proteins may contribute to differences in protein translation between prokaryotes and eukaryotes.

Table 8 Classification of conserved genes important for defining the eukaryotic cell

Two further groups in Table 8 are relevant for the more elaborate organization and compartmentation of eukaryotic cells. One consists of cytoskeletal proteins: the actins Act1 and Act2, the tubulins Nda2, Nda3 and Tub1, and the cytoskeleton-associated proteins Arp2 and Cdc42. The actin and tubulin polymers provide not only internal structure but also the means for transport of components and information from one region of the cell to another, important matters given the increased size of eukaryotic cells. The bacterial FtsA and Hsp70 proteins have structures similar to that of actin, and FtsZ has a structure similar to that of tubulin, but all three show only very limited primary sequence similarity73,74,75. Arp2 is an actin-related protein required for actin organization, and the Cdc42 GTPase is a signalling molecule important for cell shape and for communicating signals from the cytoskeleton. One protein (SPAC926.07c) is predicted to be a dynein light chain. The second group consists of GTP binding proteins and their regulators Ypt1, -2, -3 and -7, Arf1, Aps1, Gdi1 and Sar1, which are required for membrane transport. Membrane-bound organelles and structures are characteristic features of eukaryotic cells, and membrane fusion and fragmentation are important in organelle formation and function. Cam1 (calmodulin) is a protein that exploits compartmentalization of Ca2+ to regulate cellular processes. One protein (SPBC1539.08) is a putative ADP ribosylation factor and may be involved in transport.

A small group (Table 8) includes cell-cycle and checkpoint control proteins. The Cdc2 protein kinase (Cdc28 in S. cerevisiae) is a cyclin-dependent kinase (CDK) controlling the onset of S phase and mitosis in the two yeasts, with closely related CDKs controlling these cell-cycle transitions in other eukaryotes. The CDK system for cell-cycle control evolved with the appearance of eukaryotic cells, whose cell cycle differs from that of prokaryotes in two ways: DNA synthesis, which uses multiple origins of replication, and mitosis, which brings about chromosome segregation. It has been argued that, in the primeval eukaryote, there was a single CDK that underwent a monotonic change during the cell cycle, initiating S phase early in the cycle at a low activity and mitosis late in the cycle at a high activity76. Two checkpoint proteins, Rad24 and Rad25, are 14-3-3 proteins thought to regulate the Cdc25 phosphatase controlling the Cdc2 CDK77. If DNA becomes damaged, these checkpoint proteins prevent the onset of mitosis until the damage is repaired. This pathway is essential for maintaining genomic stability and seems to be characteristic of eukaryotic cells.

Three further groups reflect biochemical processes that are important in eukaryotic cell regulation. The first group consists of Lsm2 and Smd2, which are required for RNA splicing. The second group consists of the Ubc, Ubi and Ubl proteins together with Uip1 and Pad1 (Table 8), all required to bring about controlled proteolysis of proteins. A further protein putatively involved in proteolysis is a prohibitin complex subunit (SPAC1782.06c). The third group consists of protein kinases and phosphatases, and includes Cka1, Dis2, Hhp1, Ppa1, Ppa2, Ppe1 and Sds21 and putative serine/threonine protein phosphatases (SPAC22H10.04 and SPBC26H8.05c). The presence of these three regulatory processes unique to eukaryotic cells allows protein levels and activities to be specifically and rapidly changed without relying on changes in transcription rate. In prokaryotic cells, gene regulation often operates through changes in transcription rate, followed by dilution of remaining proteins as a consequence of rapid cellular growth. The slower growth rates of eukaryotic cells mean that mechanisms in addition to dilution by growth are required to modulate protein activity; these mechanisms may be provided by RNA splicing, proteolysis and phosphorylation.

Two genes code for a putative zinc-finger protein (SPBC24C6.11) with a possible role in cell polarity and a putative autophagy protein (SPBP8B7.24c) that may mediate attachment of autophagosomes to microtubules. Extension of this analysis at different thresholds of similarity should identify further proteins of unknown function that are important for eukaryotic cell organization.

We performed a similar analysis to identify highly conserved genes that may be important for maintaining multicellular eukaryotic organization (Table 7b). We compared the proteins in prokaryotes and in S. cerevisiae and S. pombe, which are all unicellular, with those of C. elegans, Drosophila, Arabidopsis and humans, which are all multicellular. The same thresholds were used to identify those proteins that are highly conserved in the four multicellular eukaryotes (50%, 45% and 40%) and to identify which of these proteins were not found to be highly conserved in the unicellular organisms (20%, 15% and 12%). The number of genes coding for proteins that fall into these categories was very small: one to three depending on the thresholds used. These genes code for a putative transcription factor, an RNA-binding protein and a selenium-binding protein.
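
How the number of candidates depends on the paired thresholds can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the analysis behind Table 7b: the protein names and percentage similarities are invented, the single "prokaryote_best" column is a hypothetical stand-in for the best hit across all prokaryotic genomes, and the choice of query proteome is left open because it is not specified here.

```python
from typing import Dict, List, Tuple

# The three threshold pairs quoted in the text: (in-group minimum, out-group maximum).
THRESHOLD_PAIRS: List[Tuple[float, float]] = [(50.0, 20.0), (45.0, 15.0), (40.0, 12.0)]

# Group membership follows the text; "prokaryote_best" is a hypothetical label.
MULTICELLULAR = ["C. elegans", "Drosophila", "Arabidopsis", "H. sapiens"]
UNICELLULAR = ["S. cerevisiae", "S. pombe", "prokaryote_best"]


def count_candidates(
    similarity: Dict[str, Dict[str, float]],
    in_group: List[str],
    out_group: List[str],
    pairs: List[Tuple[float, float]] = THRESHOLD_PAIRS,
) -> Dict[Tuple[float, float], int]:
    """Count proteins conserved in every in-group organism and in no out-group organism.

    `similarity` maps protein -> {organism: percentage similarity}.
    Assumption: a missing organism is treated as 0% similarity.
    """
    counts = {}
    for in_min, out_max in pairs:
        n = 0
        for scores in similarity.values():
            conserved = all(scores.get(org, 0.0) >= in_min for org in in_group)
            absent = all(scores.get(org, 0.0) < out_max for org in out_group)
            if conserved and absent:
                n += 1
        counts[(in_min, out_max)] = n
    return counts


# Invented percentage similarities for four hypothetical proteins.
toy = {
    "protA": {"C. elegans": 55, "Drosophila": 60, "Arabidopsis": 52, "H. sapiens": 100,
              "S. cerevisiae": 10, "S. pombe": 8, "prokaryote_best": 5},
    "protB": {"C. elegans": 47, "Drosophila": 46, "Arabidopsis": 48, "H. sapiens": 100,
              "S. cerevisiae": 11, "S. pombe": 9, "prokaryote_best": 3},
    "protC": {"C. elegans": 42, "Drosophila": 44, "Arabidopsis": 41, "H. sapiens": 100,
              "S. cerevisiae": 14, "S. pombe": 13, "prokaryote_best": 2},
    "protD": {"C. elegans": 70, "Drosophila": 72, "Arabidopsis": 68, "H. sapiens": 100,
              "S. cerevisiae": 60, "S. pombe": 58, "prokaryote_best": 30},
}
counts = count_candidates(toy, MULTICELLULAR, UNICELLULAR)
print(counts)
```

Because each pair lowers both limits together, relaxing the in-group threshold admits more proteins while the stricter out-group threshold excludes others, so the count need not rise steadily from one pair to the next.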

As more sequences become available, the groups of genes we have identified as being important for eukaryotic and multicellular organization will inevitably be modified. However, our results allow us to speculate on the evolutionary transitions from prokaryotes to eukaryotes and to multicellularity. The transition to multicellularity may not have required the evolution of many new genes that are absent from unicellular organisms. The pathways necessary for multicellular organization could already have been in existence in unicellular eukaryotes. For example, intercellular signalling may have evolved to meet the sexual needs of primeval, single-celled eukaryotes to seek out and identify an appropriate mating partner. Once signalling between cells had evolved, it could be readily exploited to generate the signalling pathways required for multicellular organization. The highly conserved genes specific to eukaryotes may be necessary for eukaryotic cell organization to be generated. In contrast, the transition from unicellularity to multicellularity may not have required many new genes. Instead it may have used genes already present in unicellular eukaryotes, perhaps by the shuffling of functional domains, to give rise to new combinations, which allowed the development of pathways required for the evolution of multicellularity2,69,71,78. If these speculations are correct, they imply that the evolutionary transition from unicellular prokaryotic to unicellular eukaryotic life may have been more complex than the transition to multicellular life. This might provide some explanation as to why it took around 2,300 million years (Myr) to evolve from the first prokaryote to the first eukaryote (thought to have arisen about 3,800 Myr and 1,500 Myr ago, respectively) but only 500 Myr for the evolution of the first multicellular organisms, which arose about 1,000 Myr ago. Further analyses and comparisons should continue to illuminate the question of which genes define eukaryotic cells and which define multicellular organisms.