Chemical database mining, in which the structures of molecules known to have a particular activity are used to formulate 'queries' to search for other molecules in the database that are likely to show similar activity, can be a valuable approach for lead discovery, especially when little or no information on the macromolecular target is available. Writing in the Journal of Medicinal Chemistry, Tropsha, Kohn and colleagues have recently described a novel general approach to database mining, which integrates rigorously validated quantitative structure–activity relationship (QSAR) models. This approach gave an exceptionally high hit rate in a test of its ability to identify anticonvulsant compounds from a set of 250,000 molecules.

A key determinant of the success of database mining is the way in which the query molecules and those within the database are described computationally. Many types of chemical 'descriptors' can be calculated for a given molecule, and so an important question is how to select descriptors that will maximize the chances of identifying interesting molecules when a database is searched.

Addressing this question was a major focus of the authors' study, which was based on a series of compounds termed functionalized amino acids (FAAs) that were previously shown by Kohn and co-workers to have anticonvulsant activity in animal models of epilepsy (indeed, an FAA was recently designated for Phase III trials). However, the macromolecular targets of FAAs are not known, meaning that ligand-based approaches, such as database mining, represent the most efficient way to rationally discover new chemical entities with anticonvulsant activities that might be structurally dissimilar from FAAs.

Tropsha et al. had previously developed and rigorously validated several QSAR models for FAAs, and this validation is itself a key determinant of the success of the approach in the present study. Each model was built by first using many different types of descriptors to characterize known FAAs and then applying statistical methods to find those descriptors that best correlated the structure of the compounds with their anticonvulsant activity in animals. Several of these models showed strong predictive ability, and the authors reasoned that the descriptors used in the best of the models would be a good choice for use in database mining.
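
The idea of selecting descriptors by how well they track activity, and then checking predictive ability on held-out compounds, can be illustrated with a minimal sketch. The highlight does not specify which statistical methods the authors used, so the Pearson ranking and the leave-one-out nearest-neighbour check below are simplified stand-ins, and the descriptor names and values are invented toy data rather than anything from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_descriptors(table, activity):
    """Order descriptor names by |correlation| with activity, strongest first."""
    return sorted(table, key=lambda name: abs(pearson(table[name], activity)), reverse=True)


def loo_q2(xs, ys):
    """Leave-one-out cross-validated q^2 for a 1-nearest-neighbour predictor."""
    preds = []
    for i, x in enumerate(xs):
        # Predict compound i from its nearest neighbour among the other compounds.
        others = [(abs(x - xj), yj) for j, (xj, yj) in enumerate(zip(xs, ys)) if j != i]
        preds.append(min(others)[1])
    mean_y = sum(ys) / len(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - press / total


# Invented toy data: four compounds, two hypothetical descriptors.
activity = [1.0, 2.0, 3.0, 4.0]
descriptors = {
    "d_informative": [1.0, 2.1, 2.9, 4.2],
    "d_noise": [3.0, 1.0, 4.0, 2.0],
}

ranked = rank_descriptors(descriptors, activity)
q2 = {name: loo_q2(values, activity) for name, values in descriptors.items()}
```

In this toy case the informative descriptor ranks first and has the higher cross-validated q^2, which is the kind of evidence that would justify carrying a descriptor set forward into a database search.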

The authors therefore computed the descriptors from each of the ten best models for 250,000 compounds in two publicly available databases. The FAAs, as described by the descriptors in each of the ten models, were then used as queries to search the two databases, and 4,000 compounds that met a similarity threshold set by the authors (with respect to the query structures) were identified. By retaining only those molecules that appeared in all ten of the individual 'hit lists' (each corresponding to a particular model/descriptor set), the authors reduced this list to just 50 compounds.
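
The two filtering steps described above, a per-model similarity search followed by a consensus across hit lists, can be sketched as follows. This is an illustration under stated assumptions: Euclidean distance in descriptor space is assumed as the similarity measure (the highlight does not name one), and the function names, thresholds and compound data are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hit_list(queries, database, threshold):
    """IDs of database compounds within `threshold` of at least one query."""
    return {
        cid
        for cid, vec in database.items()
        if min(euclidean(vec, q) for q in queries) <= threshold
    }


def consensus_hits(hit_lists):
    """Keep only compounds that appear in every per-model hit list."""
    return set.intersection(*hit_lists)


# Invented toy data: the same three compounds described by two hypothetical
# descriptor sets, one per model (assumed similarity measure and threshold).
model_a = {
    "queries": [(0.0, 0.0)],
    "database": {"c1": (0.1, 0.0), "c2": (0.5, 0.5), "c3": (0.0, 0.2)},
}
model_b = {
    "queries": [(1.0,)],
    "database": {"c1": (1.1,), "c2": (1.05,), "c3": (2.0,)},
}

per_model = [
    hit_list(m["queries"], m["database"], threshold=0.3)
    for m in (model_a, model_b)
]
consensus = consensus_hits(per_model)
```

Here each model alone admits two of the three compounds, but only `c1` survives the intersection, mirroring how requiring agreement across all descriptor sets shrinks a long hit list to a short one.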

Finally, the authors used the ten QSAR models to predict the anticonvulsant activity of the 50 compounds, and 22 compounds were selected on the basis of these predictions. Four of these compounds, and five compounds derived by minor modifications, were chosen for synthesis, and seven of the nine compounds were confirmed to have good anticonvulsant activity in animals. This hit rate of seven out of nine suggests that the authors' strategy of integrating rigorously validated QSAR models with database mining could be a valuable general tool in the design and discovery of novel bioactive chemical entities.
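
To put the numbers quoted above in proportion, the short calculation below works out how much of the list survived each computational filter and the final experimental hit rate. It uses only the figures reported in the text; the variable names are illustrative.

```python
# Compound counts at each computational stage, as reported in the text.
funnel = [
    ("database compounds", 250_000),
    ("similarity hits", 4_000),
    ("consensus hits (all ten lists)", 50),
    ("selected by QSAR prediction", 22),
]

# Fraction of compounds retained at each step relative to the previous one.
retention = {
    funnel[i][0]: funnel[i][1] / funnel[i - 1][1]
    for i in range(1, len(funnel))
}

# Experimental outcome: nine compounds synthesized, seven confirmed active.
tested, active = 9, 7
hit_rate = active / tested

for stage, fraction in retention.items():
    print(f"{stage}: {fraction:.2%} of previous stage retained")
print(f"experimental hit rate: {hit_rate:.1%}")
```

Note that the nine tested compounds are not a strict subset of the 22 selected, because five of them were derived by minor modifications, so the hit rate is reported separately from the funnel.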