Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters


Genome sequencing has led to the discovery of tens of thousands of potential new genes. Six years after the sequencing of the well-studied yeast Saccharomyces cerevisiae and the discovery that its genome encodes 6,000 predicted proteins, more than 2,000 have not yet been characterized experimentally, and determining their functions seems far from a trivial task. One crucial constraint is the generation of useful hypotheses about protein function. Using a new approach to interpret microarray data, we assign likely cellular functions with confidence values to these new yeast proteins. We perform extensive genome-wide validations of our predictions and offer visualization methods for exploration of the large numbers of functional predictions. We identify potential new members of many existing functional categories including 285 candidate proteins involved in transcription, processing and transport of non-coding RNA molecules. We present experimental validation confirming the involvement of several of these proteins in ribosomal RNA processing. Our methodology can be applied to a variety of genomics data types and organisms.

Change history

  • 19 June 2002

    added supplementary figure callouts




We thank M. Boguski, S. Friend, L. Hartwell and A. W. Murray for support, advice and encouragement; J. Burchard, J. Castle, Y. He, M. Margarint and E. Tan for help with BLAST and clustering; L. Garwin, M. Groudine, J. Johnson, P. Linsley, P. Lum, D. Marks, C. Roberts, M. Roth, C. Sander, E. Schadt and S. Tapscott for comments and useful discussions on this work; and B. Blencowe and S. McCracken for lab space, reagents and assistance with experiments in Fig. 6. This work was supported by Rosetta Inpharmatics, a CIHR Operating Grant to T.R.H. and the Ontario Premier's Research Excellence Award to T.R.H.

Author information

Competing interests

The authors declare no competing financial interests.

Correspondence to Steven J. Altschuler.

Supplementary information

  1. Web Figure A

  2. Web Figure B

  3. Web Figure C

  4. Web Figure D

  5. Web Figure E

Figure 1: Overview of prediction and validation approach.
Figure 2: Exploratory visualization of annotated clusters.
Figure 3: Functional prediction for POL30.
Figure 4: Validation of predictions using known annotations, for a variety of parameters and choices shown in Fig. 1.
Figure 5: Cellular Role category predictions for 1,644 genes (2,368 annotations) previously unclassified by Cellular Role.
Figure 6: Northern-blot analysis of total RNA extracted from strains with TET promoters integrated upstream of genes predicted with high confidence to function in rRNA processing and modification.