An untruth sometimes said about scientists is that they lack artistic creativity. As any reader of Nature Genetics can attest, scientific papers are rife with creative language describing experimental data, natural phenomena and interpretation. So it's likely that not a few biologists grumbled when the Gene Ontology (GO) consortium announced its efforts to design a common language for describing the functions of genes across organisms; a tool, it was claimed, that would serve to unify biology1. Some two years later, GO is beginning to realize its lofty goal and, moreover, it is being applied in unanticipated ways. The tool described by Carolina Perez-Iratxeta and colleagues2 on page 316 is one such example.

An ontology defines a controlled, consistent vocabulary to describe concepts and relationships, thereby enabling knowledge-sharing3. The GO consortium was inspired by the recognition of a bottleneck in the transfer of information between those studying different model organisms, owing to the absence of a shared vocabulary4. To circumvent this problem, the consortium began developing three ontologies applicable to all eukaryotes: the biological process in which the gene product participates, the molecular function of the gene product and the cellular component within which the gene product acts. Although GO terms are consistent, they are not complete, thereby allowing a dynamic vocabulary that evolves within the constraints of the ontology. The consortium initially included FlyBase, Mouse Genome Informatics and the Saccharomyces Genome Database, but has subsequently grown to include the Arabidopsis Information Resource, WormBase, PomBase, the Rat Genome Database and DictyBase among others, which are now united by the use of a single shared vocabulary4.
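To make the idea of a shared, controlled vocabulary concrete, here is a minimal sketch in Python. It is not the consortium's data model: the class `GOTerm`, the helper `shared_terms`, the identifier `EX:0001` and the gene names are all hypothetical and chosen only for illustration. The one thing taken from the text is the set of three ontologies.

```python
from dataclasses import dataclass

# The three ontologies named in the text; every term belongs to exactly one.
ASPECTS = {"biological_process", "molecular_function", "cellular_component"}


@dataclass(frozen=True)
class GOTerm:
    """A controlled-vocabulary term (hypothetical, simplified structure)."""
    term_id: str  # illustrative identifier, not a real GO accession
    name: str
    aspect: str

    def __post_init__(self):
        # Reject terms that fall outside the controlled set of ontologies.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect}")


def shared_terms(annotations, gene_a, gene_b):
    """Return the terms that annotate both genes."""
    return annotations[gene_a] & annotations[gene_b]


# Hypothetical annotations for genes curated by two different databases.
contraction = GOTerm("EX:0001", "muscle contraction", "biological_process")
annotations = {
    "fly_gene_1": {contraction},
    "mouse_gene_1": {contraction},
}
common = shared_terms(annotations, "fly_gene_1", "mouse_gene_1")
```

Because both databases draw on the same term rather than free text, comparing annotations across organisms reduces to a set intersection.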

Perez-Iratxeta and colleagues2 report an approach in which they produce a score that links the functional annotation of proteins, described using GO terms, with the description of an inherited disease, expressed in Medical Subject Headings (MeSH; the National Library of Medicine's controlled vocabulary). Linking this score to information from RefSeq yielded a list of most likely candidate genes for 455 mapped diseases of unknown genetic defect. As a blind test, the authors looked at 100 genes for which disease-causing mutations had already been identified. In 55 of the cases, the disease-related gene was identified. As the authors point out, the data-mining system is highly dependent on the information that is being mined. Thus, continued improvements in GO and its increased integration in databases will enhance the authors' data-mining system (available on the Genes2Disease website).
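The general shape of such a scoring scheme can be sketched as follows. This is an illustrative simplification and not the authors' published method: the scoring rule (averaging, over a gene's GO terms, the strongest association with any of the disease's MeSH terms), the function names `score_gene` and `rank_candidates`, and all weights and gene names are assumptions made for the example.

```python
def score_gene(disease_mesh_terms, gene_go_terms, association):
    """Score one gene against a disease (hypothetical scoring rule).

    `association` maps (mesh_term, go_term) pairs to a weight; pairs that
    are absent count as 0.0. The score is the mean, over the gene's GO
    terms, of the best weight linking that term to any disease MeSH term.
    """
    if not gene_go_terms:
        return 0.0  # an unannotated gene cannot be scored
    best = [
        max((association.get((m, g), 0.0) for m in disease_mesh_terms),
            default=0.0)
        for g in gene_go_terms
    ]
    return sum(best) / len(best)


def rank_candidates(disease_mesh_terms, region_genes, association):
    """Rank the genes in a mapped region from most to least likely."""
    scored = [
        (gene, score_gene(disease_mesh_terms, go_terms, association))
        for gene, go_terms in region_genes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up weights and annotations, for illustration only.
association = {
    ("Muscular Diseases", "muscle contraction"): 0.9,
    ("Muscular Diseases", "DNA repair"): 0.1,
}
region_genes = {
    "gene_A": {"muscle contraction"},
    "gene_B": {"DNA repair"},
    "gene_C": set(),  # no GO annotation available
}
ranking = rank_candidates({"Muscular Diseases"}, region_genes, association)
```

The unannotated `gene_C` scores zero whatever its true role, which illustrates the authors' point that the system can only be as good as the annotation it mines.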

Credit: Katie Ris

The efforts of the GO consortium are helping to lead us beyond a Babel-like period and unify the field. And the tool reported by Perez-Iratxeta and colleagues2 suggests that speaking the same language will have unexpected benefits.