To the editor

As many genome projects move into the postsequencing phase, a major focus for the next few years will be to accurately annotate gene functions on a genomic scale. A powerful approach for predicting the exact biochemical function of a new protein is to use bioinformatics to find its characterized orthologs, which typically retain the same function in the course of evolution. Computer-based tools (e.g., BLAST, PSI-BLAST, COG) have been developed that enable searches for orthologs in databases. However, these similarity-based analyses are problematic and can generate large numbers of grossly misannotated sequences.

Using the isocitrate and isopropylmalate dehydrogenase family as a model, and with insight into how the distinct functions of orthologs and paralogs are conferred and have evolved, we believe it is feasible to identify orthologs confidently. Our strategy is based on a recent finding from protein engineering studies: substitutions of only a few amino acid residues in these enzymes are sufficient to exchange substrate and coenzyme specificities. Hence, a few major specificity determinants can serve as reliable markers for determining orthologous or paralogous relationships. The approach has corrected similarity-based functional misassignments of some 30 or so family members. To illustrate our approach, we present two case studies below.

The sequence (gene number Aq1512) from the Aquifex aeolicus genome sequencing project is predicted to encode an NADP-isocitrate dehydrogenase (NADP-IDH, EC 1.1.1.42) based on its high sequence identity with bacterial NADP-IDH [1]. Structure-based sequence alignment shows that all of the substrate-binding and catalytic residues identified in Escherichia coli NADP-IDH are conserved in this protein, including Ser113 and Asn115, which are the major determinants of specificity towards isocitrate [2]. In contrast, the key residues Lys344 and Tyr345, which interact with NADP in the E. coli NADP-IDH, are replaced by Asp and Ile, as seen in NAD-dependent isopropylmalate dehydrogenase (EC 1.1.1.85) [3,4]. This observation allows us to reassign the protein as an NAD-isocitrate dehydrogenase (NAD-IDH, EC 1.1.1.41). We subcloned the coding region of the genomic DNA and expressed the enzyme in an idh-deficient strain of E. coli. As expected, the enzyme is NAD-dependent. Judged by the ratio of kcat/KM values, this thermophilic enzyme favors NAD over NADP by a factor of 86.
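The marker-based reasoning in this case study can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure we actually used: the function names, the marker tables, and the input format (a mapping from E. coli NADP-IDH position to the aligned one-letter residue) are assumptions made for the example, and the tables contain only the determinants named above. A second helper shows how a coenzyme preference is expressed as a ratio of kcat/KM values.

```python
# Hypothetical marker tables; positions follow E. coli NADP-IDH numbering.
# Only the determinants named in the text are included.
SUBSTRATE_MARKERS = {
    ("S", "N"): "isocitrate",  # Ser113 / Asn115
}
COENZYME_MARKERS = {
    ("K", "Y"): "NADP",  # Lys344 / Tyr345, as in E. coli NADP-IDH
    ("D", "I"): "NAD",   # Asp / Ile, as in NAD-dependent isopropylmalate dehydrogenase
}


def classify(residues):
    """Return (substrate, coenzyme) for a protein.

    `residues` maps an E. coli NADP-IDH position to the one-letter residue
    found at the aligned position in the query (taken from a structure-based
    alignment, which this sketch does not compute).
    """
    substrate_key = (residues.get(113), residues.get(115))
    coenzyme_key = (residues.get(344), residues.get(345))
    substrate = SUBSTRATE_MARKERS.get(substrate_key, "unknown")
    coenzyme = COENZYME_MARKERS.get(coenzyme_key, "unknown")
    return substrate, coenzyme


def preference_ratio(kcat_a, km_a, kcat_b, km_b):
    """Preference for coenzyme A over B: (kcat/KM)_A divided by (kcat/KM)_B."""
    return (kcat_a / km_a) / (kcat_b / km_b)


# Aligned residues of Aq1512 as described in the text.
aq1512 = {113: "S", 115: "N", 344: "D", 345: "I"}
print(classify(aq1512))  # ('isocitrate', 'NAD')
```

Any combination of residues that is not in the tables is reported as "unknown" rather than guessed, which mirrors the point that only experimentally established determinants should be used as markers.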

As a second example, two Arabidopsis cDNA clones, NAD-IDH1 and NAD-IDH2 (U81993 and U81994) [5], have been identified by homology searches of the EST database. It has been suggested that a single-subunit form of the Arabidopsis enzyme may exist and that these two clones may represent isozymes. Careful examination of the active site residues in these two sequences shows that the Mg2+-binding residues equivalent to Arg129, Asp129 and Asp311 found in E. coli NADP-IDH are all missing [3,4]. Hence, these gene products should correspond to different regulatory subunits of NAD-IDH, neither of which can form an active enzyme. This is consistent with the observation that the cDNA fails to complement yeast NAD-IDH mutants [5]. It is also noteworthy that the physiologically active form of tobacco NAD-IDH is composed of two regulatory subunits and one catalytic subunit [6].

Our studies demonstrate that the correct biochemical function of new genes can be assigned with certainty, an important first step in characterizing their roles in various cellular processes. With the progress of genome-wide efforts to determine representative three-dimensional structures for all protein families, we believe our approach could become more powerful and broadly applicable. Extending similar studies to other protein families is much needed in order to take full advantage of the enormous wealth of biological information coming out of the EST and genome projects.