Sequencing a genome is one thing; annotating all of its protein-coding genes is an even bigger challenge. Many research groups direct automated and manual annotation efforts toward locating genes in the genome and determining which of them are actually translated into proteins.

Tim Hubbard, head of the Vertebrate Genome Analysis Project at the Wellcome Trust Sanger Institute, was looking for ways to improve current annotation efforts. Together with Jyoti Choudhary, head of protein mass spectrometry at the Institute, and Jennifer Harrow, who leads the Institute's manual annotation team, he devised a strategy to incorporate shotgun proteomics data into the annotation of the mouse genome.

“We were looking to take mass spec data from any instrument,” says Choudhary, “and have it go through one pipeline in a statistically grounded framework.” To lay the groundwork, Markus Bosch, a joint PhD student between the groups of Hubbard and Choudhary, wrote an algorithm that addressed a key problem in mass spectrum analysis: the high false-positive rate in matching spectra to peptides. Bosch combined the database search engine Mascot with the machine-learning algorithm Percolator to weed out incorrect peptide-spectrum matches. The combination provided high sensitivity and allowed a significance measure to be assigned to every matched peptide.
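The article does not reproduce the Mascot Percolator code itself, but the general idea behind this style of rescoring can be illustrated with a small, self-contained sketch. The Python snippet below is an assumption-laden stand-in, not the team's pipeline: it uses synthetic peptide-spectrum match (PSM) features and scikit-learn's LinearSVC to separate known-incorrect decoy matches from target matches, re-ranks every PSM by the classifier score, and then applies a q-value cutoff, which is the kind of conservative, probability-based thresholding Choudhary describes below.

```python
# Illustrative sketch (not the authors' code): Percolator-style rescoring of
# peptide-spectrum matches (PSMs) with a target-decoy strategy.
# The PSM features here are synthetic; a real pipeline would use Mascot output.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy feature matrix: each row is a PSM (e.g. search score, mass error, charge).
targets = rng.normal(loc=[8.0, 0.0, 2.0], scale=1.0, size=(500, 3))  # mostly correct
decoys  = rng.normal(loc=[5.0, 1.0, 2.0], scale=1.0, size=(500, 3))  # known-incorrect

X = np.vstack([targets, decoys])
y = np.concatenate([np.ones(len(targets)), np.zeros(len(decoys))])   # 1 = target, 0 = decoy

# Train a linear classifier to separate targets from decoys, then use its
# decision value as a new, better-calibrated PSM score.
clf = LinearSVC(C=1.0).fit(X, y)
new_scores = clf.decision_function(X)

# Estimate q-values from the target/decoy counts above each score threshold.
order = np.argsort(-new_scores)                  # sort PSMs by descending score
is_decoy = (y[order] == 0)
fdr = np.cumsum(is_decoy) / np.maximum(np.cumsum(~is_decoy), 1)
qvals = np.minimum.accumulate(fdr[::-1])[::-1]   # enforce monotonic q-values

accepted = (y[order] == 1) & (qvals <= 0.01)     # keep target PSMs at 1% FDR
print(f"{accepted.sum()} target PSMs accepted at 1% FDR")
```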

Choudhary stresses the importance of being conservative in peptide assignments. “What people have tried to do in the past,” she notes, “is to maximize spectrum matching and say they identified a large number of peptides. But this is not representative of the quality of the call.” Adding probability scores to a call allows one to set a higher threshold for bona fide matches.

The core of the team's pipeline is the GenoMS database, which contains peptides derived from in silico digests of well-annotated proteins in public databases as well as from computational ab initio protein predictions. The team then searched entries from existing proteomic datasets stored in PeptideAtlas, along with new spectra generated from mouse embryonic stem cells and brain cells, against the GenoMS database. They validated over 30% of known protein-coding genes with high probability scores.
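How GenoMS itself is built is not spelled out in the article, but the step it rests on, an in silico digest that turns protein sequences into searchable peptides, is straightforward to illustrate. The sketch below uses made-up sequences and a standard trypsin cleavage rule; the function name tryptic_digest and the toy inputs are assumptions for illustration only, not part of the team's pipeline.

```python
# Illustrative sketch (not the GenoMS build scripts): an in silico tryptic
# digest that turns annotated and ab initio-predicted proteins into the
# peptide entries a search database needs. All sequences here are made up.
import re

def tryptic_digest(protein, min_len=7, max_missed=1):
    """Cut after K or R (but not before P) and return peptides,
    allowing up to `max_missed` missed cleavage sites."""
    fragments = re.split(r"(?<=[KR])(?!P)", protein)
    peptides = set()
    for i in range(len(fragments)):
        for j in range(i, min(i + max_missed + 1, len(fragments))):
            pep = "".join(fragments[i:j + 1])
            if len(pep) >= min_len:
                peptides.add(pep)
    return peptides

# Assumed toy inputs: one "well-annotated" protein and one ab initio prediction.
annotated = {"known_protein_1": "MKWVTFISLLFLFSSAYSRGVFRRDAHK"}
predicted = {"abinitio_model_1": "MLPQRSTVKEELGATRNNPFWK"}

# Map each peptide back to the protein entries it could have come from.
search_db = {}
for name, seq in {**annotated, **predicted}.items():
    for pep in tryptic_digest(seq):
        search_db.setdefault(pep, set()).add(name)

for pep, sources in sorted(search_db.items()):
    print(pep, "->", ", ".join(sorted(sources)))
```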

For Harrow, the most exciting findings did not lie in the robust validation of existing annotations but in the discovery that a small number of pseudogenes are actually translated into proteins. “We have always seen pseudogenes expressed,” Harrow says, “but we have never seen translated ones.” Of the 10,000 pseudogenes present in the mouse genome, the team saw evidence of translation for just 19. Validation of these expressed pseudogenes is ongoing, but Harrow finds it reassuring that only a few turned up, which indicates that the results are specific.

Of course, finding evidence of pseudogene translation says nothing about the function of the resulting protein; by common definition, a pseudogene is a defective, nonfunctional DNA segment that merely resembles a gene. Choudhary plans to delve into the functional role of these newly discovered 'pseudoproteins' by examining their localization and identifying their binding partners.

Having established the reliability of the pipeline, Hubbard plans to expand it to the human genome. “What we focused on up until now [in genome analysis] was entirely based on annotation by transcription,” he says; “it is worthwhile trying to integrate the proteins.”

And if some pseudogenes are not so 'pseudo' after all, the same may yet turn out to be true for some noncoding RNAs.