Computational techniques that can 'dock' small molecules into the structures of protein targets and 'score' their potential complementarity with putative binding sites have become popular in lead identification and optimization, and many different programs are now available that can perform these tasks. But just how good are they? To survey the current state of the art in this field, Warren and colleagues set out to compare as many docking programs and scoring functions as possible, and the intriguing results of their study have recently been described in the Journal of Medicinal Chemistry.

For their analysis, the authors evaluated 10 docking programs and 37 scoring functions against eight proteins representing seven protein types, including a kinase, two proteases, a nuclear hormone receptor and a polymerase. Three tasks were assessed: binding-mode prediction, virtual screening for lead identification, and rank-ordering by affinity for lead optimization.

In the first task, the docking protocols were used to predict bound conformations for 136 compounds for which protein–ligand crystal structures were available. Overall success rates were quite good across all protein targets, and every docking program was able to generate ligand conformations similar to the crystallographically determined structure for at least one of the targets.
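
To make "similar to the crystallographically determined structure" concrete, the sketch below scores a docked pose by its root-mean-square deviation (RMSD) from the crystal pose and counts a prediction as successful below a cutoff. The 2.0 Å cutoff is a widely used convention and an assumption here, not a value taken from the study. The function names are hypothetical, and the code assumes both poses list the same atoms in the same order and the same coordinate frame, ignoring ligand symmetry.

```python
import math

def rmsd(pose_a, pose_b):
    """RMSD in angstroms between two poses given as lists of (x, y, z) tuples.

    Assumes identical atom ordering and a shared coordinate frame
    (no superposition is performed).
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))

def success_rate(pairs, cutoff=2.0):
    """Fraction of (docked, crystal) pose pairs with RMSD at or below cutoff.

    The 2.0 angstrom default is an assumed convention, not the study's value.
    """
    if not pairs:
        raise ValueError("no pose pairs supplied")
    hits = sum(1 for docked, crystal in pairs if rmsd(docked, crystal) <= cutoff)
    return hits / len(pairs)

# Toy data: one perfect pose and one pose shifted by 3 angstroms along x.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 3.0, y, z) for x, y, z in crystal]
rate = success_rate([(crystal, crystal), (shifted, crystal)])
```

In this toy case one of the two poses falls within the cutoff, so the success rate is one half. A real evaluation would typically use heavy atoms only and correct for symmetry-equivalent atoms.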

For the second task, a test of virtual screening capability, the authors used a challenging test compound set similar to a typical corporate collection: it contained a large number of diverse chemical classes, each of which contained a number of active and inactive close chemical analogues. For all but one target, at least one docking program–scoring function pair was very successful at identifying active molecules from the pool of decoy molecules, although no single program performed well for all of the targets. The ability to identify chemically diverse leads across diverse targets is also important and, except for one target, at least one algorithm identified at least one member of all the active chemotypes within the top 10% of the docking-score-ordered list.
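
The two screening criteria described above, enrichment of actives over decoys and recovery of every active chemotype in the top 10% of the score-ordered list, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' protocol: the record layout (`score`, `active`, `chemotype`) and the function names are hypothetical, and lower scores are assumed to be better, which is not true of every scoring function.

```python
import math

def top_slice(compounds, fraction=0.10):
    """Return the best-scoring fraction of the list (assumes lower score = better)."""
    ranked = sorted(compounds, key=lambda c: c["score"])
    n_top = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:n_top]

def enrichment_factor(compounds, fraction=0.10):
    """Hit rate in the top slice divided by the hit rate in the whole set."""
    total_actives = sum(1 for c in compounds if c["active"])
    if total_actives == 0:
        raise ValueError("the set contains no actives")
    top = top_slice(compounds, fraction)
    hit_rate_top = sum(1 for c in top if c["active"]) / len(top)
    hit_rate_all = total_actives / len(compounds)
    return hit_rate_top / hit_rate_all

def chemotypes_recovered(compounds, fraction=0.10):
    """Return (active chemotypes seen in the top slice, all active chemotypes)."""
    top = top_slice(compounds, fraction)
    found = {c["chemotype"] for c in top if c["active"]}
    everything = {c["chemotype"] for c in compounds if c["active"]}
    return found, everything

# Toy library of 20 compounds; three actives spanning two chemotypes.
library = [{"score": float(i), "active": False, "chemotype": "decoy"} for i in range(20)]
library[0].update(active=True, chemotype="A")
library[1].update(active=True, chemotype="B")
library[10].update(active=True, chemotype="A")

ef = enrichment_factor(library)
found, everything = chemotypes_recovered(library)
```

Here the top 10% holds two compounds, both active, so the hit rate is enriched relative to the 15% base rate, and both chemotypes are represented even though one active analogue ranks poorly.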

However, in the final task, rank-ordering by affinity, there was no statistically significant relationship between docking scores and ligand affinity for any of the eight protein targets. Furthermore, in most cases reproduction of the correct binding mode did not improve rank-order or potency-prediction performance. These findings, which represent the first extensive evaluation of this aspect of docking and scoring, demonstrate that considerable improvements are needed in compound scoring by docking algorithms before such approaches will be consistently valuable in lead optimization.
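
One common way to test whether docking scores track affinity is a rank correlation such as Spearman's coefficient, sketched below in plain Python. Whether the study used this particular statistic is not stated in this summary, so treat it as an assumed stand-in. The function names are hypothetical, and a separate significance test would still be needed before calling any relationship statistically meaningful.

```python
import math

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(scores, affinities):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(scores) != len(affinities) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(scores), average_ranks(affinities))

# Toy data: docking scores versus measured potencies (units are illustrative).
rho_perfect = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
rho_inverse = spearman([1.0, 2.0, 3.0, 4.0], [40.0, 30.0, 20.0, 10.0])
```

A coefficient near +1 or -1 indicates that the scores order the compounds consistently with affinity (the sign depends on the score convention), whereas a value near zero corresponds to the lack of relationship reported for this task.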

The authors also make a number of other interesting observations related to each task. Overall, the results of this systematic and extensive study highlight the strengths and weaknesses of the current docking and scoring approaches, and should provide a useful benchmark against which future progress in this field can be measured.