The backlog of gene products without an assigned biochemical or genetic function has become an urgent problem for biologists. As the cost of sequencing continues to fall, and computational methods rapidly identify new genes, there is a great need to integrate gene sequences with the rest of biology. Bioinformaticians have made some headway in assigning predicted functions to new genes, primarily by using clustering algorithms, which assume that co-regulated genes have similar functions. These algorithms have limitations, however, that reflect the underlying parameters chosen by the investigator, and also the trade-off between an algorithm's accuracy and power. Now, Wu et al. describe a new approach to interpreting microarray data that eliminates some of the weaknesses of traditional clustering algorithms, facilitating large-scale predictions of Saccharomyces cerevisiae gene function.

Wu et al. tested their approach on a previously published microarray data set of 300 experiments, involving a variety of mutants and drug treatments of S. cerevisiae. Using several sets of parameters and algorithms, they generated partially overlapping transcriptional clusters. As each clustering algorithm defines similarity in a different way, combining several algorithms identified clusters that might have been missed by a single approach. An extra algorithm was added to search for genes that are co-regulated in only a subset of experimental conditions; such groups might likewise be missed by standard algorithms, which build clusters based on global similarity in gene-expression profiles. Using statistical methods, the authors assigned a function to each cluster using the annotations in the Yeast Proteome Database (YPD) and the Munich Information Center for Protein Sequences (MIPS). Out of more than 13,000 derived clusters, 44% were assigned a function with a high level of confidence, arguing for the physiological relevance of these groupings.
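To make the idea of statistically assigning a function to a cluster concrete, the sketch below scores how surprising it is for a cluster to contain many genes from one annotation category. It uses a one-sided hypergeometric test, a common choice for this kind of enrichment analysis; the text above does not say which statistical methods the authors used, so the test itself, the function name `enrichment_p_value` and the example gene names are all assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(cluster, annotated, genome_size):
    """Hypothetical helper: one-sided hypergeometric enrichment test.

    cluster      -- set of gene names in one transcriptional cluster
    annotated    -- set of all genes carrying a given annotation
                    (assumed to be drawn from the same genome)
    genome_size  -- total number of genes that could have been clustered

    Returns the probability of seeing at least as many annotated genes
    in a random gene set of the same size as the cluster.
    """
    n = len(cluster)                 # cluster size
    K = len(annotated)               # annotated genes in the genome
    k = len(cluster & annotated)     # annotated genes inside the cluster

    # Sum the upper tail P(X >= k) of the hypergeometric distribution.
    tail = sum(
        comb(K, i) * comb(genome_size - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / comb(genome_size, n)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the study.
    cluster = {f"gene{i}" for i in range(10)}
    rrna_processing = {f"gene{i}" for i in range(8)} | {
        f"other{i}" for i in range(192)
    }
    p = enrichment_p_value(cluster, rrna_processing, genome_size=6000)
    print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests that the category is over-represented in the cluster. With thousands of clusters and many categories under test, a real analysis would also need to correct for multiple testing.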

To improve the predictive power of this approach, the authors removed particular algorithms that tended to generate large clusters, noting that small clusters were more apt to generate accurate predictions. They also added data from other experiments in the public domain, concentrating on areas of function in which the initial data set performed poorly: the cell cycle, signal transduction and differentiation. When applied to the larger, 424-experiment data set, this refined group of clusterings yielded predictions in 23 functional categories that were consistent with current annotation in at least 30% of the cases (a conservative estimate, with many giving a much higher rate of success).
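The two refinements described above, favouring small clusters and checking predictions against existing annotation, can be sketched in a few lines. This is only an illustration under stated assumptions: the authors removed whole algorithms rather than applying a size cut-off, and the text does not describe how they computed consistency, so the functions `filter_small_clusters` and `consistency_rate`, the `max_size` parameter and the toy data are all hypothetical.

```python
def filter_small_clusters(clusters, max_size):
    """Keep only clusters with at most max_size genes (hypothetical cut-off)."""
    return [c for c in clusters if len(c) <= max_size]


def consistency_rate(predictions, known):
    """Fraction of predictions that agree with existing annotation.

    predictions -- list of (gene, predicted_function) pairs
    known       -- dict mapping a gene to the set of functions already
                   annotated for it; genes absent from the dict are
                   unannotated and cannot be scored
    """
    scored = [(gene, func) for gene, func in predictions if gene in known]
    if not scored:
        return 0.0
    hits = sum(1 for gene, func in scored if func in known[gene])
    return hits / len(scored)


if __name__ == "__main__":
    # Toy data for illustration only.
    clusters = [{"g1", "g2"}, {"g3", "g4", "g5", "g6", "g7"}]
    print(filter_small_clusters(clusters, max_size=3))

    predictions = [
        ("g1", "rRNA processing"),
        ("g2", "cell cycle"),
        ("g3", "signal transduction"),  # g3 is unannotated, so not scored
    ]
    known = {
        "g1": {"rRNA processing"},
        "g2": {"signal transduction"},
    }
    print(consistency_rate(predictions, known))
```

Only genes that already have an annotation can be scored, so a rate of this kind describes performance on characterized genes and is at best a proxy for accuracy on the uncharacterized ones.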

All told, the authors assigned probable functions to 1,650 poorly characterized yeast proteins, 285 of which are proposed to be involved in non-coding RNA metabolism. As always, biochemistry is the ultimate validation. Mutational analysis of five genes predicted to be involved in ribosomal RNA processing confirmed their involvement in the accumulation or processing of different species of ribosomal RNA. As the authors point out, the flexibility of the approach indicates that there might be room for improvement, as well as for its application to the forthcoming data sets from higher eukaryotes.