Main

Two strategies currently prevail for predicting the structure of a protein. Comparative modeling (CM) begins by identifying template proteins with strong sequence homology; by determining conserved regions and comparing against the known structure of the template, one can assemble a structure for the query protein. The second strategy, known as threading, extends the idea of CM to potentially evolutionarily distant proteins; the query sequence is subjected to a range of possible structural configurations, and once a strong match is found, the tentative fold can be further refined. Thus arises the 'protein structure prediction problem': because both strategies rely to varying extents on the existence of an optimally comprehensive database of structures, the extent to which this database is incomplete limits the potential accuracy of any prediction.

But Jeffrey Skolnick, director of the University at Buffalo Center of Excellence in Bioinformatics, recently came to a rather startling conclusion: the Protein Data Bank (PDB), which currently contains more than 29,000 protein structures, may now be sufficiently complete to comfortably solve this problem for any given single-domain protein. “As the threading algorithms kept getting better,” says Skolnick, “I kept finding more and more structures I could fit. If the [PDB] was really incomplete, that left just two possibilities. Either the results are wrong, and we don't understand what's going on; or else it really was that as your algorithm got better, you could recognize the more distantly related structures...and you could build folds.”

Skolnick and colleague Yang Zhang addressed this possibility in a new study, assembling a query set of 1,489 single-domain proteins from the PDB, for which they attempted to derive folds from a template library of 3,575 nonhomologous proteins. Using two methods (TASSER, a threading and refinement algorithm, and MODELLER, a CM algorithm), they generated continuous structures for all their queries that were reasonably close to native. After examining the extent to which TASSER could improve alignments of unrelated proteins, they found that for nearly 97% of their single-domain proteins (which had an average sequence identity of 13%), they could actually bring the predicted structure closer to native, to within a root mean square deviation (RMSD) of 4 Å or less. These findings imply that the PDB could be considerably closer to complete, in terms of fold representation, than expected.
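For readers unfamiliar with the metric, the root mean square deviation used as the 4 Å yardstick above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: it assumes the two structures have already been superimposed and that atoms are paired one to one, and the function names, the `cutoff` parameter, and the toy coordinates are all hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two lists of (x, y, z) points.

    Assumption: both structures are already superimposed and the i-th
    point in one list corresponds to the i-th point in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between each pair of corresponding atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def within_cutoff(coords_a, coords_b, cutoff=4.0):
    """True if the RMSD is at or below the cutoff (same units as the input)."""
    return rmsd(coords_a, coords_b) <= cutoff

# Hypothetical toy coordinates in angstroms, not real protein data
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
```

Here every atom in `model` is displaced by exactly 1 Å from its partner in `native`, so `rmsd(native, model)` evaluates to 1.0 and `within_cutoff(native, model)` is true. In practice, structure comparison tools first find the rotation and translation that minimize this value before reporting it.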

Such findings are bound to rouse controversy. “Most people, when I tell them this, initially are very skeptical,” says Skolnick. However, he continues, if these data do reflect the actual state of available information in the PDB, they raise a strong argument for creating databases that can do a better job of matching targets to distantly related templates. “I want to strongly emphasize that the models based on the templates provided by the PDB are imperfect and gappy but can have continuous models built from them—and I would argue that this should suggest a revisiting of the structural genomics fold selection strategy. If it's true, I think it's a very important implication.”