Nature Publishing Group
The Pharmacogenomics Journal
2002, Volume 2, Number 4, Pages 259-271
Original Article
Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data
P E Blower1, C Yang1, M A Fligner2, J S Verducci2, L Yu1, S Richman3 and J N Weinstein3

1Leadscope Inc, Columbus, OH, USA

2Department of Statistics, The Ohio State University, Columbus, OH, USA

3Laboratory of Molecular Pharmacology, Division of Basic Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

Correspondence to: P E Blower, Leadscope Inc, 1275 Kinnear Road, Columbus, OH 43212, USA Tel: 001 641 675 376 E-mail: Pblower@leadscope.com http://www.leadscope.com or J N Weinstein, Building 37, Room 5068, NIH, 9000 Rockville Pike, Bethesda, MD 20892, USA Tel: 001 301 496 9571 E-mail: jw4i@nih.gov http://discover.nci.nih.gov

Abstract

Genomic studies are producing large databases of molecular information on cancers and other cell and tissue types. Hence, we have the opportunity to link these accumulating data to the drug discovery processes. Our previous efforts at 'information-intensive' molecular pharmacology have focused on the relationship between patterns of gene expression and patterns of drug activity. In the present study, we take the process a step further, relating gene expression patterns not just to the drugs as entities, but to ~27 000 substructures and other chemical features within the drugs. This coupling of genomic information with structure-based data mining can be used to identify classes of compounds for which detailed experimental structure-activity studies may be fruitful. Using a systematic substructure analysis coupled with statistical correlations of compound activity with differential gene expression, we have identified two subclasses of quinones whose patterns of activity in the National Cancer Institute's 60-cell line screening panel (NCI-60) correlate strongly with the expression patterns of particular genes: (i) the growth inhibitory patterns of an electron-withdrawing subclass of benzodithiophenedione-containing compounds over the NCI-60 are highly correlated with the expression patterns of Rab7 and other melanoma-specific genes; (ii) the inhibitory patterns of indolonaphthoquinone-containing compounds are highly correlated with the expression patterns of the hematopoietic lineage-specific gene HS1 and other leukemia genes. As illustrated by these proof-of-principle examples, we introduce here a set of conceptual tools and fluent computational methods for projecting directly from gene expression patterns to drug substructures and vice versa. The analysis is presented in terms of the NCI-60 cell lines and microarray-based gene expression patterns, but the concept and methods are broadly applicable to other large-scale pharmacogenomic database sets as well.
The approach (SAT for Structure-Activity-Target) provides a systematic way to mine databases for the design of further structure-activity studies, particularly to aid in target and lead identification.

The Pharmacogenomics Journal (2002) 2, 259-271. doi:10.1038/sj.tpj.6500116

Keywords

molecular class; NCI cell lines; statistical correlation

INTRODUCTION

Genomic and proteomic technologies will revolutionize drug discovery and development. That much is universally agreed. But, to date, pharmacological and biological databases have not been linked in a way that permits fluent exploration of the relationships that a medicinal chemist or drug designer would most like to see: relationships between the molecular structural features of compounds and the genes or gene products in cells that predict activity of the compound against the cell. The principal aim of this paper is to present a method for doing so. We will use, as an example, the set of compounds tested in the National Cancer Institute's 60-cell line anti-cancer drug screen and a microarray-generated database of gene expression profiles obtained for the 60 cell lines.1,2 However, the formalism and methods can be applied more broadly in the context of pharmacogenomic studies.

Since 1990, the National Cancer Institute (NCI) has screened more than 70 000 chemical compounds for their growth inhibitory activity against a panel of 60 human cancer cell lines (the NCI60) in microtiter plate format.3,4,5 For each compound and cell line, growth inhibition after 48 h of drug treatment is assessed from changes in total cellular protein using a sulforhodamine B assay. A vector of 60 growth inhibition values, one for each cell line, represents the activity pattern of a compound. These patterns of drug activity across the NCI60 have been found to encode incisive information about mechanisms of drug activity.3,4,5,6,7,8,9 The utility of that information has been enhanced by correlating the activity patterns with molecular structure descriptors of the tested compounds10,11 and with molecular characteristics of the test cell lines or tissue types.1,2,6,12,13,14,15,16,17,18,19,20,21 The NCI60 panel includes melanomas (8 cell lines), leukemias (6), and cancers of breast (8), prostate (2), lung (9), colon (7), ovary (6), kidney (8) and central nervous system (6) origin.

In a recent study,1,2 one of the present authors and collaborators used cDNA microarrays to generate gene expression profiles for the NCI60 cells. The database generated is being applied to problems in cancer diagnosis, prognosis, prevention and therapy. Most pertinent to the present study, these biological data can be mapped into pharmacological information, as described in Scherf et al.1 However, the correlation itself does not provide the type of insight most useful for drug design and selection of therapy. From the medicinal chemist's or drug designer's perspective, in particular, a method for projecting the genomic information through the pharmacology to molecular descriptors and structural features would be highly desirable. This paper describes such a method. It uses data mining software22 to identify structural features that are found in compounds whose activity patterns are highly correlated with expression patterns of selected genes. Representative compounds are then used to probe relevant genes and thus gain insight into possible molecular mechanisms of drug action.

The conceptual background of this work6 involves the databases shown in Figure 1. Database [A] contains the activity patterns of tested compounds, [S] contains molecular structural features of the compounds, and [T] contains gene expression patterns, including those for possible targets or modulators of activity in the cells. The databases [A] and [T] to be analyzed here are publicly available at http://www.leadscope.com and http://discover.nci.nih.gov. For [S] in this study we used a set containing 27 000 2D structural features.22

In practice, the overall [T]-database can include cell properties at the DNA, RNA, protein, functional and pharmacological levels, but in the present analysis we will consider only mRNA transcript expression patterns. More specifically, we will analyze transcript patterns obtained for the 60 cell lines using pin-spotted cDNA microarrays1,2 prepared by robotically spotting 9706 human cDNAs on glass microscopic slides. The cDNAs represented approximately 8000 different genes.1 Approximately 3700 of them were previously characterized human proteins, an additional 1900 had homologues in other organisms, and the remaining 4100 were identified only as ESTs. cDNA prepared from a pool of 12 cell lines selected for diversity from the set of 60 was used as an internal standard.2

The chemical structure database [S] can, in principle, be encoded in terms of any set of 1-, 2-, or 3-dimensional molecular structural descriptors or physico-chemical properties that are experimentally measured or theoretically calculated. The set of 27 000 2D structural features22 used here includes the familiar chemical building blocks of a compound: functional groups, aromatics, heterocycles, pharmacophores, etc, organized hierarchically from general to specific. Each row in the database [S] corresponds to a structural feature, F. For compound C and feature F, the entry $S_{FC} = 0$ if F does not occur in C; otherwise, $S_{FC} = 1/N_F$, where $N_F$ is the number of compounds containing F.
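As an illustration of this weighting, the following minimal Python sketch builds a small [S] table from a compound-to-feature mapping. The function name `feature_matrix`, the compound labels and the feature names are hypothetical and chosen only for this example; they are not part of the data mining software used in the study.

```python
from typing import Dict, List, Set


def feature_matrix(
    compound_features: Dict[str, Set[str]], features: List[str]
) -> Dict[str, Dict[str, float]]:
    """Return S[F][C] = 1/N_F if compound C contains feature F, else 0.

    N_F is the number of compounds that contain feature F.
    """
    compounds = sorted(compound_features)
    S: Dict[str, Dict[str, float]] = {}
    for f in features:
        holders = {c for c in compounds if f in compound_features[c]}
        n_f = len(holders)
        # The division is only evaluated for compounds that contain F,
        # so a feature that occurs nowhere simply yields a row of zeros.
        S[f] = {c: (1.0 / n_f if c in holders else 0.0) for c in compounds}
    return S


# Hypothetical toy data: three compounds and three features.
toy = {
    "cpd1": {"quinone", "thiophene"},
    "cpd2": {"quinone"},
    "cpd3": {"indole"},
}
S = feature_matrix(toy, ["quinone", "thiophene", "indole"])
print(S["quinone"])  # cpd1 and cpd2 each get weight 0.5, cpd3 gets 0.0
```

Because each non-empty row sums to 1, multiplying a row of [S] by any per-compound quantity gives the average of that quantity over the compounds containing the feature.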

To relate the drug activity profiles to the gene expression patterns, activity and expression values were standardized, and the matrix [ATᵀ] of Pearson product-moment correlation coefficients was then obtained by matrix multiplication. Finally, the matrix [SATᵀ] (see Figure 1) was generated to associate a structural feature F with a gene. Each element in the [SATᵀ] matrix reflects the tendency of a particular substructure to occur in compounds that are active in cell lines that express large amounts of the given gene product. Whereas previous studies of these databases have focused on identification of compounds, the present development (SAT for Structure-Activity-Target) takes this process a step further, identifying structural features that are associated with the observed correlations between gene expression and growth inhibition.
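The two matrix products described above can be sketched in a few lines of Python. This is a toy illustration that assumes complete data (no missing values) and uses invented numbers; the helper names `standardize_rows`, `correlation_matrix` and `matmul` are hypothetical and do not come from the study's software.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize_rows(M: Matrix) -> Matrix:
    """Convert each row to Z-scores using the sample standard deviation."""
    out = []
    for row in M:
        n = len(row)
        mean = sum(row) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (n - 1))
        out.append([(x - mean) / sd for x in row])
    return out


def correlation_matrix(A: Matrix, T: Matrix) -> Matrix:
    """Pearson r between every row of A and every row of T.

    Computed as the product of the standardized A with the transpose of
    the standardized T, divided by n - 1 (n = number of cell lines).
    """
    ZA, ZT = standardize_rows(A), standardize_rows(T)
    n = len(A[0])
    return [[sum(a * t for a, t in zip(za, zt)) / (n - 1) for zt in ZT] for za in ZA]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Plain matrix product X @ Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Toy data: 3 compounds x 4 cell lines, and 2 genes x 4 cell lines.
A = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]]
T = [[2.0, 4.0, 6.0, 8.0], [1.0, 0.0, 1.0, 0.0]]
R = correlation_matrix(A, T)  # compounds x genes

# 2 features x 3 compounds, weighted 1/N_F as defined for [S].
S = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
SAT = matmul(S, R)  # features x genes: average correlation per feature
print(SAT)
```

In this toy case the first feature occurs in compounds 1 and 3, so its entry for the first gene is the average of their two correlations with that gene.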

Other techniques have been used to model the anti-cancer activity of NCI compounds using various molecular property sets. For example, Shi et al10,11 calculated chemical structural descriptors of ellipticine analogs using molecular modeling software and derived correlations with growth inhibition using genetic function approximation. Fan et al used similar methods to analyze topoisomerase I inhibitors.23 Cho et al24 analyzed structure-activity relationships for the NCI-H23 cell line using a recursive partitioning technique with several types of atom and physico-chemical property class descriptors. The MCASE program, developed by Klopman et al,25,26 was used to dissect the compounds of an NCI60 training set into all possible structural fragments with 2-10 non-hydrogen atoms and statistically correlate each fragment with drug activity. In that study, MCASE identified multi-drug resistance (MDR) reversal agents. Roberts et al22 described general purpose data mining techniques that combine structural analysis and dynamic property filtering, and illustrated how these techniques could be used to identify compound classes in the NCI cancer screening data with unexpectedly high growth inhibitory activity. Each of these analyses can be thought of as implicitly or explicitly involving an [SA] matrix but not an [SATᵀ] matrix. In other words, they did not carry the analysis through to the gene expression level as is done here. The present study describes a way to explore the large-scale correlation of molecular structural features with molecular characteristics of cells. It provides a systematic and fluent approach to the mining of genomic, proteomic and other 'omic'27,28,29,30 databases to aid in the identification of new targets and lead compounds.

RESULTS

Cell Groupings and Genes Selected for Analysis

When the NCI60 cell lines were clustered on the basis of gene expression pattern, they tended to group by tissue of origin,1,2 but there were exceptions. Table 1 lists seven relatively robust cell clusters based on the hierarchical cluster tree in Figure 2a of Reference 2. Grouping by tissue of origin is most apparent for the CNS, colon, leukemia, melanoma and renal panels. The melanoma cluster included two lines, MDA-MB-435 and MDA-N, that were derived from a pleural effusion in a patient with breast cancer, but showed the distinctive gene expression patterns of melanotic melanomas. MDA-N is an erb/B2 transfectant of MDA-MB-435. MDA-MB-435 may represent a second primary in the patient, a breast cancer with strong neuroendocrine features, or a contamination prior to derivation of MDA-N. The LOX IMVI cell line, which was not included in the melanoma cluster, is amelanotic and undifferentiated; it lacks the set of genes characteristic of melanin production and melanoma.1,2 The breast and non-small cell lung cancers did not form large coherent groups, and there were only two prostate cancer lines. The CNS cell cluster contained one breast cancer line, BT-549. The ovarian cancers were intermediate, with four cell lines in a coherent group and the other two elsewhere in the tree. There was a mixed cluster of 11 cell lines containing three non-small cell lung cancers, two each of breast, ovarian and prostate, and one each of melanoma and renal. Nine additional cell lines were not included in the cell clusters because they belonged to no coherent group with more than three cell lines. Some of these groupings are, to a degree, arbitrary, but the melanotic melanoma and leukemia categories are well defined.

Using the Studentized range test based on the seven cell panels, we selected 476 genes to distinguish 'extreme' panels, ie panels whose average expression level for the gene was significantly higher, or lower than those seen in other panels. Partial results are shown in Table 2, which reports genes with high Studentized range scores. The first column gives the clone identifier, and the last column gives a brief description of the corresponding gene identity. The second column contains the value of the Studentized range statistic, and the columns labeled Min and Max are the cell clusters that yield the minimum and maximum, respectively, of the means in equation (2). The column labeled upper 10% mean gives the average correlation of the highest 10% of the 4463 gene-compound correlations for the gene.
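The 'upper 10% mean' column can be illustrated with a short Python sketch. The function name `upper_fraction_mean` is hypothetical, and the rounding rule (rounding the number of retained correlations up to the next integer) is an assumption, since the text does not say how the top 10% of 4463 values was counted.

```python
import math
from typing import Sequence


def upper_fraction_mean(correlations: Sequence[float], fraction: float = 0.10) -> float:
    """Average of the highest `fraction` of a gene's compound correlations.

    Assumption: the count is rounded up, so 10% of 4463 values keeps 447.
    """
    k = max(1, math.ceil(fraction * len(correlations)))
    top = sorted(correlations, reverse=True)[:k]
    return sum(top) / k


# Toy example: 20 invented correlation values from -0.50 to 0.45.
values = [i / 20 for i in range(-10, 10)]
print(upper_fraction_mean(values))  # mean of the two largest values
```

A gene with a high value of this statistic has at least a small group of compounds whose activity patterns track its expression pattern closely, which is what makes it a useful guide for picking genes to follow up.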

Of the 476 genes in the full version of Table 2, 391 (82%) had melanoma or leukemia as the Min or Max cell cluster. Guided by values for the upper 10% mean for genes in Table 2, we identified several genes that are selectively over-expressed in melanoma cell lines and well correlated with the activity patterns of specific compounds. These included Rab7 human small GTP binding protein (refs 31,32,33; ID 486233), ASAH lysosomal ceramidase (ref 34; ID 363919), human mRNA for KIAA0110 (ref 35; ID 323730), MMP14 membrane-type matrix metalloproteinase (refs 36,37; ID 270505), hMYH human mutY (refs 38,39; ID 268727), RET Ret proto-oncogene (ID 485268), and human fetal brain mRNA for vacuolar ATPase (ID 488599). Separately, we found several other genes showing analogous behavior for leukemias. Included were LCP1 lymphocyte cytosolic protein 1 (L-plastin; refs 40,41; ID 486676), HS1 hematopoietic lineage cell-specific protein (refs 42,43; ID 260052), SF2 pre-mRNA splicing factor (ID 357011), CENPC centromere auto-antigen C (ID 488194), and CARS-cyp human Clk-associated RS cyclophilin (ref 44; ID 179994).

For each of these genes, the corresponding column of the [SATᵀ] matrix contains the average correlation coefficient between activity and expression level for each of the structural features. In other words, this column identifies structural features of compounds for which the compound activities are well correlated with the expression patterns of the specific genes. For each gene, these correlations were standardized over all features to create a feature z-score. Figure 2 shows two compound classes, benzodithiophenedione45,46 and indolonaphthoquinone,47,48 that are well correlated with several melanoma genes and several leukemia genes, respectively. That is, they have the highest feature z-scores for these genes.
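The standardization over features can be sketched as follows. The feature names and values are invented for illustration, `feature_z_scores` is a hypothetical helper, and the use of the sample standard deviation is an assumption because the text does not specify which form was used.

```python
import statistics
from typing import Dict


def feature_z_scores(avg_corr_by_feature: Dict[str, float]) -> Dict[str, float]:
    """Standardize one gene's per-feature average correlations into z-scores."""
    values = list(avg_corr_by_feature.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return {f: (v - mean) / sd for f, v in avg_corr_by_feature.items()}


# Invented average correlations of four features with a single gene.
column = {
    "anthraquinone": 0.05,
    "indole": 0.10,
    "thiophene": 0.15,
    "benzodithiophenedione": 0.60,
}
z = feature_z_scores(column)
best = max(z, key=z.get)
print(best, round(z[best], 2))
```

Ranking features by this z-score for a given gene is what singles out compound classes whose members, on average, correlate unusually well with that gene.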

Anticancer Quinones

Since both compound classes in Figure 2 are heterocyclic quinones, we surveyed other classes of cytotoxic quinone anti-cancer agents.49,50,51 There have been a number of recent studies of indolequinone anti-tumor agents52,53,54,55 based on Mitomycin C and the aziridinyl analog EO9 (see Figure 3). These compounds are substrates for NAD(P)H:quinone oxidoreductases (NQO1 and DT diaphorase), and some show selective activity against DT diaphorase-rich55 cell lines. A number of topoisomerase II inhibitors56 such as doxorubicin (adriamycin), daunorubicin, and mitoxantrone (see Figure 3) possess an anthraquinone substructure. Actinomycin D is a DNA intercalating agent that contains a quinoneimine in a phenoxazine ring. Lastly, some naphthimidazole-4,9-diones show good activity and selectivity in the NCI60 panel. Based on these anti-cancer agents, we constructed a substructure query for each class. The results shown in Table 3 give information on the average correlations of common quinone classes with several of the genes listed above.

The two substructures in Figure 2 resulted in a class of 23 dihydrobenzodithiophene-4,8-diones45,46 and a class of 20 indolo-1,4-naphthoquinones,47,48 respectively. These two classes had the highest and second highest feature z-scores with Rab7 and LCP1, respectively, of any structural feature we studied for these genes. Furthermore, among the 407 quinones in the full set of 4463 compounds, nine of the 22 compounds best correlated with gene Rab7 were benzodithiophenediones, and 13 of the 54 best correlated with LCP1 were indolonaphthoquinones.

Table 3 shows the feature z-scores for several classes of compounds using selected genes from Table 2. For example, using compound-gene correlation coefficients for the Rab7 gene31,32,33 from Table 2, the z-score for the benzodithiophenedione subset is 10.5. Thus, the benzodithiophenedione class is highly enriched with compounds for which the 60-cell activity patterns are well correlated with the expression pattern of Rab7. Similarly, using compound-gene correlation coefficients for the LCP1 gene40,41 from Table 2, the z-score for the indolonaphthoquinone subset is 5.64.

For comparison, the right four columns in Table 3 give average GI50 values and feature z-scores for the compound classes with the leukemia and melanoma cell line panels. For each compound, we calculated its average GI50 values over the melanoma and leukemia cell line panels, giving AveMEL and AveLEU values for the compound. Then, using these two values, we calculated the mean GI50 values and feature z-scores for each compound class listed in Table 3. In contrast, note that actinomycins and anthraquinones, two potent and widely used classes of cancer chemotherapeutic agents, have average or negative correlations with the expression patterns of genes listed in Table 3.

The benzodithiophenediones and indolonaphthoquinones display opposite behavior with respect to the genes in Table 3 in the following sense: the benzodithiophenediones have well above average correlation with Rab7, KIAA0110 and MMP14, and below average correlation with LCP1, HS1 and CARS-cyp, whereas the indolonaphthoquinone class shows the opposite behavior. This difference is particularly pronounced for the leukemia gene CARS-cyp (ref 44; ID 179994). For this gene, the 23 benzodithiophenediones have a feature z-score of -7.25, whereas the 20 indolonaphthoquinones have a feature z-score of +5.75. Similar differences are seen with the leukemia gene HS1 (ID 260052) and the melanoma gene MMP14 (ID 270505). Furthermore, even though the benzodithiophenediones, as a class, are strongly correlated with over-expression of melanoma genes, members of this class with strong electron-withdrawing substituents correlate better with LCP1. In a COMPARE analysis,5,6,7,8,9,46 none of the benzodithiophenediones had a Pearson correlation coefficient >0.6 against any compound in the NCI 'Standard Agent' database, perhaps indicating that the benzodithiophenediones act by a mechanism different from that of any of the standard agents.

Gene Expression Correlations of Representative Compounds

In the last section, we began by identifying genes that differentiated particular cell subsets and projected them through the cells and compounds to correlated substructures. In this section, we do the reverse, starting with substructures of interest and projecting through compounds and then through cells to correlated genes. To understand more fully how compound activity might be related to molecular biology in the NCI60 panel, we used one representative from each of the two quinone classes (NSC 656238 from the benzodithiophenedione class and NSC 661223 from the indolonaphthoquinone class; see Figure 4) as probes to look for genes whose expression patterns are highly correlated with the compounds' activity profiles. Partial results are shown in Table 4a and b. The full version of Table 4, which gives correlations between 3748 genes and the three compounds of Figure 4, is available at http://www.leadscope.com and http://discover.nci.nih.gov. In the tables, genes are in rows, and the three columns labeled NSC 656238, NSC 661223 and NSC 682991 contain compound-gene correlation coefficients. Table 4a shows selected genes with high positive correlations with NSC 656238, and Table 4b shows selected genes with high positive correlations with NSC 661223. Note that the correlations are quite high and that for all of the listed genes, the correlation coefficients with these two compounds have opposite signs, in agreement with the trends seen in Table 3. In the full version of Table 4, there are 335 distinct genes that have high correlations (≥ +0.5 or ≤ -0.5) with at least one of the two probe compounds. To put these values in context, Figure 5 shows the distribution of Pearson correlation coefficients between the 4463 compound activity patterns and the 3748 gene expression patterns over the NCI60 cell lines. Of the 17 million gene-compound correlations, only 1% are in either tail region, above 0.45 or below -0.45.

Of the 335 genes mentioned above, 325 or 97% of them have correlation coefficients with opposite signs for compounds NSC 656238 and NSC 661223. This pattern of opposite signs is also present, but to a lesser extent for the full set of 3748 genes. In comparing the correlations of the activities of NSC 656238 and NSC 661223 with the expression levels of the full set of genes, approximately 70% of the correlation pairs are of opposite sign (r = -0.62), and those with the same sign tend to be closer to zero. Finally, although NSC 682991 is in the benzodithiophenedione class, it behaves in terms of its correlations with the full set of 3748 genes more like the indolonaphthoquinone NSC 661223, r = 0.77. Its correlations are much less related to those of NSC 656238 (r = -0.27), even though both are in the benzodithiophenedione class. A possible explanation of this apparent paradox will be offered in the Discussion section.
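The sign comparison described above can be reproduced on any pair of compound-gene correlation profiles. The Python sketch below uses invented numbers and hypothetical helper names (`opposite_sign_fraction`, `pearson`); it treats a pair as 'opposite' only when both values are non-zero and differ in sign, which is an assumption about how ties at zero were handled.

```python
import math
from typing import Sequence


def opposite_sign_fraction(r1: Sequence[float], r2: Sequence[float]) -> float:
    """Fraction of genes whose correlations with two compounds differ in sign."""
    opposite = sum(1 for a, b in zip(r1, r2) if a * b < 0)
    return opposite / len(r1)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented gene correlations for two probe compounds (one value per gene).
probe_a = [0.60, -0.50, 0.20, 0.10]
probe_b = [-0.70, 0.40, -0.10, 0.05]

# Genes with a high correlation (|r| >= 0.5) for at least one probe.
high = [i for i, (a, b) in enumerate(zip(probe_a, probe_b)) if max(abs(a), abs(b)) >= 0.5]

print(opposite_sign_fraction(probe_a, probe_b))  # 0.75 for this toy data
print(high)                                      # indices of the 'high' genes
print(pearson(probe_a, probe_b) < 0)             # profiles are anti-correlated
```

Restricting the comparison to the 'high' genes, as in the 335-gene subset, is a matter of applying `opposite_sign_fraction` to the values at those indices only.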

DISCUSSION

Genomic studies are producing large databases of molecular information on cancers and other cell and tissue types. As is universally recognized, these databases represent an unparalleled opportunity for pharmaceutical advance. The challenge is to link the data to the drug discovery and development processes. An 'information-intensive' approach6 formulated several years ago (by one of the present authors and colleagues) provided a blueprint for one productive way to meet that challenge. It provided a way to organize and inter-relate potential therapeutic targets, molecular mechanisms of action of compounds tested and modulators of activity within cancer cell lines. It also suggested a way to project genomic information on the cells used for testing through the activity patterns of compounds to molecular structural characteristics of those compounds.6 However, that suggestion was not pursued, and it was not converted into a fluent methodology for exploration or into a software package for doing so. Required was a way to couple the genomic (or proteomic) information with structure-based data mining to provide insights fruitful for follow-up in experimental structure-activity studies. Here we have presented such a method, based on the relational database system schematized in Figure 1. Included are gene expression levels for 3748 genes in 60 cell lines (T-matrix), activity values for 4463 compounds in 60 cell lines (A-matrix), and binary indices of occurrence of 27 000 structural features in 4463 compounds (S-matrix). As a proof-of-principle example of the approach, we have used it to identify subclasses of quinones well correlated with genes that are selectively expressed either in melanomas or in leukemias. A brief discussion of these agents and their genomic associations follows.

Of the 4463 compounds in the NCI set used in this analysis, 462 (10.4%) are quinones, quinoneimines or quinone methides. The mechanisms of quinone cytotoxicity49,50,51 are complex and varied. However, two principal pathways are well established. First, quinones act as redox-active molecules that can undergo either 1- or 2-electron reductions, depending on the cellular environment. The mechanism for 1-electron reduction involves redox cycling between quinone and semiquinone radical states, leading to consumption of NADH and formation of hydroperoxy radicals. Depending on the cellular environment, other reactive oxygen species, including superoxides, hydrogen peroxides and hydroxyl radicals, can be formed. These reactive species can, in turn, cause peroxidation of lipids, oxidation and strand breaks in DNA, consumption of reducing equivalents (eg, NAD(P)H or glutathione), and oxidation of other macromolecules. In the second pathway, unhindered quinones act as Michael acceptors, causing cellular damage through alkylation of thiol or amino groups of glutathione, proteins and DNA. Mitomycin C and EO9, for example, undergo reductive alkylation53 by mechanisms that involve opening the aziridinyl ring.

In the present study, we found that several genes selectively over-expressed in melanomas have expression patterns that are well correlated with the activity patterns of a subclass of benzodithiophenedione compounds. This class shows a distinctive substituent effect: benzodithiophenediones with strong electron-withdrawing substituents (eg, NSC 682991; see Figure 4) show low or negative correlation with many of the genes that are over-expressed in melanomas (see Table 4a), whereas members with electron-donating substituents (eg, NSC 656238) show high positive correlations with those genes. For example, NSC 656238 is 10 times more potent against the melanoma cell lines than is NSC 682991. Electron-withdrawing substituents such as nitro groups raise the reduction potential of the quinone moiety, making it a better oxidant than it is in compounds with electron-donating groups. A plausible hypothesis for the cytotoxicity of a benzodithiophenedione is that it may disrupt an essential cellular redox process. This hypothesis is consistent with the roles of genes over-expressed in melanomas. In particular, Rab7 (refs 31,32,33) is the gene most strongly correlated with the electron-donating benzodithiophenediones. For example, the correlation coefficient with NSC 656238 is 0.67. Genes in the Rab family encode small GTP binding proteins that ensure specificity of the docking of transport vesicles. In particular, Rab7 has recently been identified as a key regulatory protein for aggregation and fusion of late endocytic lysosomes. Cells expressing a dominant-negative Rab7 mutant have been reported not to form lysosomal aggregates.31 The dispersed lysosomes exhibit sharply higher pH, presumably due to disruption of the vacuolar proton pump. Interestingly, in this context, another gene highly correlated with NSC 656238 is ACP5 (r = 0.51). ACP5 (Clone ID 127821) is a unique lysosomal membrane ATPase responsible for maintaining the pH.
Several other lysosomal proteins are also well correlated with the electron-donating benzodithiophenediones. Two other ATPases, ATP6B2 (Clone ID 380399) and ATP6E (Clone ID 417475), have correlation coefficients of 0.40 and 0.46, respectively, with NSC 656238. Both of these ATPases are reported to be lysosomal H+ transporters. Other lysosomal genes, ASAH (Clone ID 417819) and LAMP2 (357407), also show high correlations (0.50 and 0.40, respectively) with this quinone. Thus, genes well-correlated with this particular quinone class seem to be enriched in lysosomal proteins that are involved in vacuolar proton pump activity.

This substituent effect suggests a possible link between the oxidation potential of quinones, the proton pump, and the electron transport chain. A plausible hypothesis is that NSC 656238 may act as a surrogate oxidizing agent in the electron transport chain. Ubiquinone-10 is the electron acceptor for mitochondrial oxidative phosphorylation. Menadione (2-methylnaphthoquinone), a compound known to compete with ubiquinone in the oxidative phosphorylation chain, also shows a reasonable correlation with Rab7 (r = 0.40). The reduction potentials of menadione and ubiquinone are known,57,58 but the reduction potential of NSC 656238 has not been reported. We speculate that its oxidizing potential allows it to compete successfully with ubiquinone in the electron transport chain, as does menadione. The oxidizing potential of the quinone moiety would be a key factor in such a mechanism. Although compound NSC 682991 is a better oxidant, it may be reduced by cellular protective agents such as glutathione. Thus, at low concentrations, it may not be available to compete with ubiquinone and therefore may be effective only at higher concentrations.

We have illustrated a way to couple information on differential gene expression with structure-based data mining. The approach provides insights that may allow selective targeting of cellular mechanisms preferentially operating in specific tissues. The benzodithiophenedione series that emerged from this study is a clear example. This is a well-defined and structurally homogeneous series of quinones, which are well correlated with the expression patterns of Rab7 and other melanoma genes. The substituent effect seen in this series suggests a relationship between the oxidation potential of a compound and its correlation with the expression patterns of specific genes. This relationship prompts new questions that can be pursued experimentally: Is there a quantitative relationship between the oxidation potential of the benzodithiophenedione series and melanoma cytotoxicity? If so, is there a direct relationship between the selective cytotoxicity of NSC 656238 and Rab7 or the ATPases that are over-expressed in melanomas? The data currently available do not permit answers to these questions, but the analyses described here do provide indirect evidence of connections that can be tested in experimental structure-activity studies.

In this article, we have described a general analytical method, designated SAT, for discovering relationships between compound classes and potential molecular targets. The method uses statistical techniques to select genes with characteristic expression patterns, then applies structure-based data mining software to identify compound substructural classes that are well-correlated with the expression patterns of those genes. Selected members of the class identified can then be used as molecular probes to identify additional compound-gene associations and thereby refine hypotheses or focus further experiments. This semi-empirical method projects genomic information from cells through compound activity patterns to molecular structural features of drugs or potential drugs. It can also do the reverse, identifying genes whose expression levels (or other characteristics) correlate strongly with structural features of a particular drug, or drug candidate. The SAT approach to pharmacogenomic analysis can shed light on molecular mechanisms and has the potential to accelerate the process of drug discovery in several ways: (i) it can be used to prioritize genes for follow-up studies as potential therapeutic targets; (ii) because the analysis projects genomic information to molecular substructure through the [S] matrix, it allows extraction of a preliminary structure-activity relationship (SAR) directly from the SAT correlations; (iii) the preliminary SAR can, in turn, be used for early pharmacophore development or to select new, untested drug candidates from an actual or virtual library of compounds; and (iv) it can be used to prioritize candidate compounds for detailed gene expression analysis or other biological studies.

METHODS

Databases

For the target matrix [T] in this study, we used a 3748-gene subset of the 9704-cDNA database. The subset was selected to include only genes whose identities had been sequence-verified (ref 1) and which had <10% missing data values over the 60 cell lines. For the activity matrix [A], we selected a set of 4463 compounds that had been tested in the NCI Developmental Therapeutics Program's sulforhodamine B assay two or more times and for which we had structure records. The compound activity values used were based on GI50, the concentration needed for 50% growth inhibition. More specifically, activity was parameterized as -log(GI50). These databases are available at http://www.leadscope.com and http://discover.nci.nih.gov.
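The two preprocessing rules above (keep genes with <10% missing values; express activity as -log(GI50)) can be sketched as follows. This is an illustration only: the function names are hypothetical, missing values are assumed to be encoded as NaN, and the logarithm is assumed to be base 10 with GI50 in molar units, none of which is stated in the text.

```python
import numpy as np

def select_genes_by_completeness(expr, max_missing_frac=0.10):
    """Return indices of rows (genes) whose fraction of missing values
    (NaN, by assumption) is below max_missing_frac."""
    missing_frac = np.isnan(expr).mean(axis=1)
    return np.flatnonzero(missing_frac < max_missing_frac)

def activity_from_gi50(gi50_molar):
    """Parameterize activity as -log10(GI50).
    Base 10 and molar units are assumptions, not taken from the source."""
    return -np.log10(np.asarray(gi50_molar, dtype=float))

# Toy data: 2 genes x 4 cell lines; the first gene has 25% missing values.
expr = np.array([[1.0, np.nan, 0.5, 0.2],
                 [0.3, 0.1, 0.4, 0.9]])
kept = select_genes_by_completeness(expr)

# Toy GI50 values of 1 micromolar and 10 nanomolar.
activity = activity_from_gi50([1e-6, 1e-8])
```

With this parameterization, a more potent compound (lower GI50) receives a larger activity value.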

Correlation Matrix

The activity matrix [A] (4463 × 60) contains compound activity data for the 60 cell lines, and the target matrix [T] (3748 × 60) contains gene expression patterns over these same cell lines (ref 6). For each compound, the activity level can be considered as a variable defined over the 60 cell lines; likewise, for each target gene, the expression level can be considered as a variable over these same 60 cell lines. If there were no missing data, the Pearson correlations between the compound activities and gene expressions could be computed by first standardizing the rows of each matrix into Z-scores,

$$ z_{ik} = \frac{x_{ik} - \bar{x}_i}{s_i} $$

(with $x_{ik}$ the value in row $i$ for cell line $k$, and $\bar{x}_i$ and $s_i$ the mean and standard deviation of row $i$), then forming the matrix product $[AT^T]$, and finally dividing each entry by $n - 1 = 59$ to obtain the correlation matrix (ref 6). However, approximately 7% of the compound activity data and about 2% of the gene expression data were missing, so the algorithm for obtaining all pair-wise correlations had to be modified. The correlation coefficient between the activity of the $i$th compound and the expression level of the $j$th target was actually computed as

$$ r_{ij} = \frac{s_{A_i T_j}}{s_{A_i}\, s_{T_j}} $$

where $s_{A_i}$ and $s_{T_j}$ are the standard deviations of the activity of the $i$th compound and the expression of the $j$th target, respectively, and $s_{A_i T_j}$ is the covariance between these variables. In formula (1) all $N_i$ activity values available for the $i$th compound and all $N_j$ expression values available for the $j$th target are used to compute the activity and expression means and standard deviations. To account for the pattern of missing data, the denominator used in computing the covariance $s_{A_i T_j}$ is

[Equation image 6500116e3.gif, defining the divisor $N^*_{ij}$, is not reproduced in this text.]

where $N_{ij}$ is the number of cell lines for which both activity and expression were measured. This divisor $N^*_{ij}$ has been shown (ref 59) to give an unbiased estimate of the correlation if data points are missing at random. Missing data in [A] and [T] are not entirely random in distribution, but the effect of the non-randomness is expected to be second-order.
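The two computations described in this section can be sketched as below, first for complete data (Z-scores, matrix product, division by n - 1) and then for a single compound-gene pair with missing values. The function names are hypothetical and NaN is assumed to mark a missing value. Because the equation for the divisor is not available in this text, the code substitutes one divisor that is unbiased for the covariance when values are missing completely at random and that reduces to N - 1 for complete data; it should be read as an assumption, not as the authors' formula.

```python
import numpy as np

def correlation_complete(A, T):
    """Pearson correlations between rows of A and rows of T (no missing data):
    standardize rows to Z-scores, multiply, divide by n - 1."""
    n = A.shape[1]
    ZA = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, ddof=1, keepdims=True)
    ZT = (T - T.mean(axis=1, keepdims=True)) / T.std(axis=1, ddof=1, keepdims=True)
    return ZA @ ZT.T / (n - 1)

def correlation_pairwise(a, t):
    """Correlation of one compound profile a with one gene profile t,
    where NaN marks a missing value."""
    ok_a, ok_t = ~np.isnan(a), ~np.isnan(t)
    both = ok_a & ok_t
    n_i, n_j, n_ij = ok_a.sum(), ok_t.sum(), both.sum()
    # Means and standard deviations use ALL available values of each variable.
    mean_a, mean_t = a[ok_a].mean(), t[ok_t].mean()
    s_a, s_t = a[ok_a].std(ddof=1), t[ok_t].std(ddof=1)
    # ASSUMED divisor (not the source's equation): unbiased for the covariance
    # under random missingness; equals n - 1 when nothing is missing.
    n_star = n_ij * (1 - 1 / n_i - 1 / n_j + n_ij / (n_i * n_j))
    cov = np.sum((a[both] - mean_a) * (t[both] - mean_t)) / n_star
    return cov / (s_a * s_t)

# Toy data: 3 compounds and 2 genes over 60 cell lines.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 60))
T = rng.normal(size=(2, 60))
R = correlation_complete(A, T)          # 3 x 2 correlation matrix
r00 = correlation_pairwise(A[0], T[0])  # agrees with R[0, 0] for complete data
```

One consequence of estimating the means, standard deviations and covariance from different subsets of cell lines is that the resulting coefficient is not strictly guaranteed to lie within [-1, 1] when many values are missing.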

Selection of Genes

We examined several methods for identifying genes that are strongly correlated with compound activities. Included were: (1) selection of genes differentially expressed in particular tissues of origin; and (2) selection of genes highly correlated with the activity levels of large numbers of compounds. Here, we focus on the first method. First, the cluster analyses in Ross et al (ref 1) and Scherf et al (ref 2) were used to group the cell lines into seven subsets, which corresponded roughly to tissue of origin. To find genes whose average level of expression distinguished among the seven panels, we used a version of the Studentized range procedure. Because some cell subsets (eg the leukemia panel) had expression levels that were more tightly clustered (less variable) than did the other panels, we used an unpooled estimate of variance. The unpooled Studentized range statistic is given by the expression:

[Equation image 6500116e4.gif, defining the unpooled Studentized range statistic, is not reproduced in this text.]

where $\bar{X}_1$ is the mean expression level for the panel with the highest average expression for the given gene, and $s_1^2$ is the corresponding variance; $\bar{X}_0$ and $s_0^2$ are analogous values for the cell panel with the lowest mean expression for the gene. We calculated the statistic for all 3748 genes and selected those for which the Studentized range value was greater than 5.08. Based on a Bonferroni adjustment for all 21 possible pair-wise comparisons between subsets, we would expect the values of fewer than 1% of the genes to exceed 5.08.
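A rough sketch of a statistic of this kind is given below. The exact scaling of the authors' statistic is not available in this text, so the code assumes a Welch-type standard error built from the two panel variances and panel sizes; the function name and toy data are hypothetical. Because the 5.08 cutoff is tied to the authors' own scaling, it is deliberately not applied here.

```python
import numpy as np
from math import comb

def unpooled_range_statistic(expr_row, panels):
    """Difference between the highest and lowest panel means for one gene,
    scaled by an unpooled standard error.

    expr_row: expression of one gene over all cell lines (NaN = missing).
    panels:   list of index arrays, one per cell-line panel.
    """
    stats = []
    for idx in panels:
        x = expr_row[idx]
        x = x[~np.isnan(x)]
        stats.append((x.mean(), x.var(ddof=1), x.size))
    hi = max(stats, key=lambda s: s[0])  # panel with the highest mean
    lo = min(stats, key=lambda s: s[0])  # panel with the lowest mean
    # ASSUMED scaling (Welch-type); the source's exact expression is unknown.
    se = np.sqrt(hi[1] / hi[2] + lo[1] / lo[2])
    return (hi[0] - lo[0]) / se

# Seven panels give C(7, 2) = 21 pair-wise comparisons, as stated in the text.
n_pairs = comb(7, 2)

# Toy gene with three panels of three cell lines each.
row = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 11.0, 12.0])
panels = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
stat = unpooled_range_statistic(row, panels)
```

Using only the variances of the two extreme panels, rather than a pooled variance, keeps a tightly clustered panel such as the leukemia lines from being judged against the larger spread of the other panels.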

The rows of [T] corresponding to genes with high values of the Studentized range statistic (see Table 2) formed a submatrix $[T_S]$. For each of the selected genes in $[T_S]$ we examined the distribution of the 4463 gene-compound Pearson correlation coefficients (columns of the correlation matrix $[AT_S^T]$) to identify genes well correlated with compound activity. This distribution of correlation coefficients was summarized for each gene by computing the average correlation of the highest 10% of the 4463 gene-compound correlations. These values are reported in Table 2. Large values indicate a gene that is well correlated with compound activity.
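The per-gene summary described above (the mean of the highest 10% of gene-compound correlations) can be sketched as follows. The function name is hypothetical, and rounding the 10% count down to a whole number of compounds is an assumption.

```python
import numpy as np

def top_decile_mean(R):
    """For each gene (column of R), average the highest 10% of its
    correlations with the compounds (rows of R)."""
    n_top = max(1, R.shape[0] // 10)     # assumed rounding: floor
    sorted_desc = -np.sort(-R, axis=0)   # each column in descending order
    return sorted_desc[:n_top].mean(axis=0)

# Toy correlation matrix: 20 compounds x 2 genes.
col = np.arange(20) / 20.0
R = np.column_stack([col, -col])
summary = top_decile_mean(R)
```

For the 4463 compounds in the text, this rounding would average the 446 largest correlations for each gene.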

The examples of SAT analysis presented here focus on one particular basis for selection of genes: differential expression between tissues of origin. However, many other bases for gene selection could equally be the starting point. For example, one could choose to focus on genes with the highest variance in expression level over the NCI60 cell lines, or on genes that simply happen to be the subject of one's research. The following steps are independent of the basis on which the gene is selected. Analogously, if one were starting from a structural feature, or features, and finding related genes, any basis for selection of the feature could serve as a starting point.

Mining of Structural Features

Once a set of genes was selected, the final step was to identify structural features well correlated with the expression levels of those genes. Any structural feature $F$ (eg, 1,4-benzoquinone) is either present or absent in each compound. Let $N_k$ denote the number of compounds with feature $F_k$, where $k$ is an index over the structural features. The structural feature matrix [S] has potentially 27 000 rows corresponding to the full set of 2D structures considered and 4463 columns corresponding to compounds in the NCI database subset analyzed. For feature $F_k$, the corresponding row in [S] has entry $1/N_k$ for compounds in which the feature is present or 0 for compounds for which the feature is absent. The structural information was incorporated by forming the matrix product $[SAT^T]$ (ref 6) in the LeadScope software (LeadScope, Inc, Columbus, OH, USA). Each row corresponds to a structural feature $F_k$, each column corresponds to a gene, and each element is the mean of the correlation coefficient between the gene and all compounds containing feature $F_k$. Although the size of matrix $[SAT^T]$ would be 27 000 × 3748 if all structural features were represented in at least one compound and all genes were used, in practice we always selected smaller subsets of features and genes for analysis.
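The construction of [S] and the effect of the matrix product can be illustrated with a small example. This is a sketch in NumPy, not the LeadScope implementation; the function name and toy values are hypothetical, and the compound-gene correlation matrix is taken as given.

```python
import numpy as np

def feature_matrix(presence):
    """Build [S] from a features x compounds presence table:
    row k holds 1/N_k where feature k is present and 0 elsewhere."""
    presence = np.asarray(presence, dtype=float)
    counts = presence.sum(axis=1, keepdims=True)  # N_k for each feature
    return np.divide(presence, counts,
                     out=np.zeros_like(presence), where=counts > 0)

# Toy data: 2 features x 4 compounds, and a 4 x 2 compound-gene
# correlation matrix standing in for the [A T^T] product.
presence = np.array([[1, 1, 0, 0],
                     [0, 1, 1, 1]])
R = np.array([[0.8, -0.1],
              [0.6,  0.1],
              [0.0,  0.3],
              [-0.2, 0.5]])

S = feature_matrix(presence)
SAT = S @ R  # element (k, j): mean correlation with gene j over compounds with feature k
```

Weighting each row by 1/N_k is what turns the product into a per-feature average rather than a sum, so features shared by many compounds are not favored merely by their size.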

The $j$th column of the matrix $[SAT^T]$ can be analyzed to identify structural features most highly associated with expression of the $j$th gene. To identify structural features that are enriched with compounds for which the 60-cell activity patterns are highly associated with expression patterns of a gene, for each structural feature we calculated a feature z-score:

$$ z_{\text{feature}} = \frac{\text{feature mean} - \text{overall mean}}{\text{standard error}} $$

In this equation, the feature mean is the average correlation with the jth gene of all compounds containing the feature, the overall mean is the average correlation with the jth gene of all compounds, and the standard error is the standard deviation of the correlation with the jth gene of all compounds divided by the square root of the number of compounds with the feature. Further details are given in Reference 22.
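The feature z-score defined above can be sketched for one gene as follows. The function name and toy values are hypothetical; the use of the sample standard deviation (ddof=1) is an assumption, and every feature is assumed to occur in at least one compound.

```python
import numpy as np

def feature_z_scores(presence, r_gene):
    """Feature z-scores for one gene.

    presence: features x compounds table (1 = feature present).
    r_gene:   correlation of every compound with the chosen gene.
    """
    presence = np.asarray(presence, dtype=float)
    overall_mean = r_gene.mean()
    overall_sd = r_gene.std(ddof=1)         # assumed: sample standard deviation
    n_k = presence.sum(axis=1)              # compounds per feature (assumed > 0)
    feature_mean = (presence @ r_gene) / n_k
    standard_error = overall_sd / np.sqrt(n_k)
    return (feature_mean - overall_mean) / standard_error

# Toy data: 2 features x 4 compounds, correlations with a single gene.
presence = np.array([[1, 1, 0, 0],
                     [0, 1, 1, 1]])
r_gene = np.array([0.8, 0.6, 0.0, -0.2])

z = feature_z_scores(presence, r_gene)
ranking = np.argsort(-z)  # feature indices in order of decreasing z-score
```

Sorting features by decreasing z-score gives the ordering from which the highest-scoring structural classes are read off.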

After selection of a gene column in $[SAT^T]$, the structural classes with the highest feature z-scores (ie those features that tend to have the highest average correlation) were identified. For example, using the melanoma gene Rab7 (ID 486233 in Table 2; refs 31-33), we found that the 2-arylcarbonylthiophene class had the highest feature z-score. For the leukemia gene LCP1 (ID 486676 in Table 2; refs 40, 41), the 7-carbonylindole class had the highest feature z-score. By sorting the structural classes in order of decreasing feature z-score and examining the compounds in the high-scoring structural classes, we could usually postulate a structural class that defined the membership more precisely than did the highest scoring feature in the hierarchy. We then formulated a substructure query to define the postulated class. In the two cases described here, we defined the substructure queries labeled benzothiophenedione and indolonaphthoquinone shown in Figure 2. Note that these substructures are more precise extensions of the highest scoring features in the hierarchy; viz, 2-arylcarbonylthiophene and 7-carbonylindole.

DATA

The [A] and [T] databases analyzed here are publicly available at http://www.leadscope.com and http://discover.nci.nih.gov. The [S] matrix is available from LeadScope, Inc on request. These sites also provide the full version of Table 4, which gives correlations between 3748 genes and the three compounds of Figure 4.

DUALITY OF INTEREST

Authors PE Blower, C Yang and L Yu are employees of LeadScope, Inc, Columbus, OH, USA, which produces the software used in this study.

References

1 Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 2000; 24: 227-235.

2 Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L et al. A gene expression database for the molecular pharmacology of cancer. Nat Genet 2000; 24: 236-244.

3 Boyd MR, Paull KD. Some practical considerations and applications of the National Cancer Institute in vitro anti-cancer drug discovery screen. Drug Dev Res 1995; 34: 91-109.

4 Monks AP, Scudiero DA, Johnson GS, Paull KD, Sausville EA. The NCI anti-cancer drug screen: a smart screen to identify effectors of novel targets. Anti-Cancer Drug Des 1997; 12: 533-541.

5 Paull KD, Shoemaker RH, Hodes L, Monks A, Scudiero DA, Rubinstein L et al. Display and analysis of patterns of differential activity of drugs against human tumor cell lines: development of mean graph and COMPARE algorithm. J Natl Cancer Inst 1989; 81: 1088-1092.

6 Weinstein JN, Myers TG, O'Connor PM, Friend SH, Fornace AJ, Kohn KW et al. An information intensive approach to the molecular pharmacology of cancer. Science 1997; 275: 343-349.

7 Weinstein JN, Kohn KW, Grever MR, Viswanadhan VN, Rubinstein LV, Monks AP et al. Neural computing in cancer drug development: predicting mechanism of action. Science 1992; 258: 447-451.

8 Paull KD, Hamel E, Malspeis L. Prediction of biochemical mechanism of action from the in vitro antitumor screen of the National Cancer Institute. In: Foye WE (ed) Cancer Chemotherapeutic Agents American Chemical Soc Books, 1993, pp 1574-1581.

9 Weinstein JN, Myers TG, Buolamwini JK, Raghavan K, van Osdol W, Licht J et al. Predictive statistics and artificial intelligence in the US National Cancer Institute's drug discovery program for cancer and AIDS. Stem Cells 1994; 12: 13-22.

10 Shi LM, Myers TG, Fan Y, O'Connor PM, Paull KD, Friend SH et al. Mining the National Cancer Institute anticancer drug discovery database: cluster analysis of ellipticine analogs with p53-inverse and central nervous system-selective patterns of activity. Mol Pharmacol 1998; 53: 241-251.

11 Shi LM, Fan Y, Myers TG, O'Connor PM, Paull KD, Friend SH et al. Mining the NCI anticancer drug discovery databases: genetic function approximation for the QSAR study of anticancer ellipticine analogues. J Chem Inf Comput Sci 1998; 38: 189-199.

12 Wu L et al. Multidrug-resistant phenotype of disease-oriented panels of human tumor cell lines used for anticancer drug screening. Cancer Res 1992; 52: 3029-3034.

13 Lee J-S et al. Rhodamine efflux patterns predict P-glycoprotein substrates in the National Cancer Institute drug screen. Mol Pharmacol 1994; 46: 627-638.

14 Alvarez M et al. Generation of a drug resistance profile by quantitation of MDR-1/P-glycoprotein expression in the cell lines of the NCI anticancer drug screen. J Clin Invest 1995; 95: 2205-2214.

15 Bates SE et al. Molecular targets in the National Cancer Institute drug screen. J Cancer Res Clin Oncol 1995; 121: 495-500.

16 Izquierdo MA et al. Overlapping phenotypes of multidrug resistance among panels of human cancer-cell lines. Int J Cancer 1996; 65: 230-237.

17 Koo H-M et al. Enhanced sensitivity to 1-beta-D-arabinofuranosylcytosine and topoisomerase II inhibitors in tumor cell lines harboring activated ras oncogenes. J Natl Cancer Inst 1996; 56: 5211-5216.

18 O'Connor PM et al. Characterization of the p53-tumor suppressor pathway in cells of the National Cancer Institute anticancer drug screen and correlations with the growth-inhibitory potency of 123 anticancer agents. Cancer Res 1997; 57: 4285-4300.

19 Freije JM et al. Identification of compounds with preferential inhibitory activity against low-Nm23-expressing human breast carcinoma and melanoma cell lines. Nat Med 1997; 3: 395-401.

20 Wosikowski K et al. Identification of epidermal growth factor receptor and c-erbB2 pathway inhibitors by correlation with gene expression patterns. J Natl Cancer Inst 1997; 89: 1505-1513.

21 Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J et al. Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci USA 2001; 98: 10787-10792.

22 Roberts G, Myatt GJ, Johnson WP, Cross KP, Blower PE. LeadScope: software for exploring large sets of screening data. J Chem Inf Comput Sci 2000; 40: 1302-1314.

23 Fan Y, Weinstein JN, Kohn KW, Shi LM, Pommier Y. Molecular modeling studies of the DNA-topoisomerase I ternary cleavable complex with camptothecin. J Med Chem 1998; 41: 2216-2226.

24 Cho SJ, Shen CF, Hermsmeier MA. Binary formal inference-based recursive modeling using multiple atom and physicochemical property class pair and torsion descriptors as decision criteria. J Chem Inf Comput Sci 2000; 40: 668-680.

25 Klopman G, Shi LM, Ramu A. Quantitative structure-activity relationship of multi-drug resistance reversal agents. Mol Pharmacol 1997; 52: 323-334.

26 Klopman G, Tu M. Diversity analysis of 14 156 molecules tested by the National Cancer Institute for anti-HIV activity using the quantitative structure-activity relational expert system MCASE. J Med Chem 1999; 42: 992-998.

27 Weinstein JN. Fishing Expeditions. Science 1998; 282: 627-628.

28 Weinstein JN. Pharmacogenomics: teaching old drugs new tricks. N Engl J Med 2000; 343: 1408-1409.

29 Weinstein JN, Buolamwini JK. Molecular targets in cancer drug discovery: cell-based profiling. Curr Pharm Des 2000; 6: 473-483.

30 Weinstein JN. Searching for pharmacogenomic markers: the synergy between omic and hypothesis-driven research. Disease Markers 2001; 17: 77-88.

31 Bucci C, Thomsen P, Nicoziani P, McCarthy J, van Deurs B. Rab7: a key to lysosome biogenesis. Mol Biol Cell 2000; 11: 467-480.

32 Meresse S, Steele-Mortimer O, Finlay BB, Gorvel JP. The rab7 GTPase controls the maturation of Salmonella typhimurium-containing vacuoles in HeLa cells. EMBO J 1999; 18: 4394-4403.

33 Press B, Feng Y, Hoflack B, Wandinger-Ness A. Mutant Rab7 causes the accumulation of cathepsin D and cation-independent mannose 6-phosphate receptor in an early endocytic compartment. J Cell Biol 1998; 140: 1075-1089.

34 Hong SB, Li CM, Rhee HJ, Park JH, He X, Levy B et al. Molecular cloning and characterization of a human cDNA and gene encoding a novel acid ceramidase-like protein. Genomics 1999; 62: 232-241.

35 Nagase T, Miyajima N, Tanaka A, Sazuka T, Seki N, Sato S et al. Prediction of the coding sequences of unidentified human genes III. The coding sequences of 40 new genes (KIAA0081-KIAA0120) deduced by analysis of cDNA clones from human cell line KG-1. DNA Res 1995; 2: 37-43.

36 Holmbeck K, Bianco P, Caterina J, Yamada S, Kromer M, Kuznetsov SA et al. MT1-MMP-deficient mice develop dwarfism, osteopenia, arthritis and connective tissue disease due to inadequate collagen turnover. Cell 1999; 99: 81-92.

37 Apte SS, Fukai N, Beier DR, Olsen BR. The matrix metalloproteinase-14 (MMP-14) gene is structurally distinct from other MMP genes and is co-expressed with the TIMP-2 gene during mouse embryogenesis. J Biol Chem 1997; 272: 25511-25517.

38 Shinmura K, Yamaguchi S, Saitoh T, Takeuchi-Sasaki M, Kim SR, Nohmi T et al. Adenine excisional repair function of MYH protein on the adenine:8-hydroxyguanine base pair in double-stranded DNA. Nucleic Acids Res 2000; 28: 4912-4918.

39 Ohtsubo T, Nishioka K, Imaiso Y, Iwai S, Shimokawa H, Oda H et al. Identification of human MutY homolog (hMYH) as a repair enzyme for 2-hydroxyadenine in DNA and detection of multiple forms of hMYH located in nuclei and mitochondria. Nucleic Acids Res 2000; 28: 1355-1364.

40 Wang J, Brown EJ. Immune complex-induced integrin activation and L-plastin phosphorylation require protein kinase A. J Biol Chem 1999; 274: 24349-24356.

41 Jones SL, Wang J, Turck CW, Brown EJ. A role for the actin-bundling protein L-plastin in the regulation of leukocyte integrin function. Proc Natl Acad Sci USA 1998; 95: 9331-9336.

42 Ingley E, Sarna MK, Beaumont JG, Tilbrook PA, Tsai S, Takemoto Y et al. HS1 interacts with Lyn and is critical for erythropoietin-induced differentiation of erythroid cells. J Biol Chem 2000; 275: 7887-7893.

43 Brunati AM, Donella-Deana A, James P, Quadroni M, Contri A, Marin O et al. Molecular features underlying the sequential phosphorylation of HS1 protein and its association with c-Fgr protein-tyrosine kinase. J Biol Chem 1999; 274: 7557-7564.

44 Nestel FP, Colwill K, Harper S, Pawson T, Anderson SK. RS cyclophilins: identification of an NK-TR1-related cyclophilin. Gene 1996; 180: 151-155.

45 Chao YH, Kuo SC, Ku K, Chiu I, Wu CH, Mauger A et al. Synthesis and cytotoxicity of Methyl-4,8-dihydrobenzo[1,2-b:5,4-b']dithiophene-4,8-dione derivatives. Bioorg Med Chem 1999; 7: 1025-1031.

46 Chao YH, Kuo SC, Wu CH, Lee CY, Mauger A, Sun IC et al. Synthesis and cytotoxicity of 2-acetyl-4,8-dihydrobenzodithiophene-4,8-dione derivatives. J Med Chem 1998; 41: 4658-4661.

47 Kundel MW, Kirkpatrick DL, Johnson JI, Powis G. Cell line-directed screening assay for inhibitors of thioredoxin reductase signaling as potential anti-cancer drugs. Anti-Cancer Drug Des 1997; 12: 659-670.

48 Rogge M, Fischer G, Pindur U, Schollmeyer D. alpha-Anellated carbazoles with anti-tumor activity: synthesis and cytotoxicity. Monatsh Chem 1996; 127: 97-102.

49 Monks TJ, Hanzlik RP, Cohen GM, Ross D, Graham DG. Quinone chemistry and toxicity. Toxicol Appl Pharmacol 1992; 112: 2-16.

50 O'Brien PJ. Molecular mechanisms of quinone cytotoxicity. Chem Biol Interact 1991; 80: 1-41.

51 Bolton JL, Trush MA, Penning TM, Dryhurst G, Monks TJ. Role of quinones in toxicology. Chem Res Toxicol 2000; 13: 135-160.

52 Phillips RM, Naylor MA, Jaffar M, Doughty SW, Everett SA, Breen AG et al. Bioreductive activation of a series of indolequinones by human DT-diaphorase: structure-activity relationships. J Med Chem 1999; 42: 4071-4080.

53 Xing C, Wu P, Skibo EB, Dorr RT. Design of cancer-specific antitumor agents based on aziridinylcyclopent[b]indoloquinones. J Med Chem 2000; 43: 457-466.

54 Beall HD, Hudnott AR, Winski S, Siegel D, Swann E, Ross D et al. Indolequinone antitumor agents: relationship between quinone structure and rate of metabolism by recombinant human NQO1. Bioorg Med Chem Lett 1998; 8: 545-548.

55 Fitzsimmons SA, Workman P, Grever M, Paull K, Camalier R, Lewis AD. Reductase enzyme expression across the National Cancer Institute tumor cell line panel: correlation with sensitivity to mitomycin C and EO9. J Natl Cancer Inst 1996; 88: 259-269.

56 Sengupta SK. Inhibitors of DNA-transcribing enzymes. In: Foye WE (ed) Cancer Chemotherapeutic Agents American Chemical Society: Washington, DC, 1993, pp 205-260.

57 Mitchell J, Marrian DH. Radiosensitization of cells by a derivative of 2-methyl-1,4-naphthoquinone. In: Morton RA (ed) Biochemistry of Quinones Academic Press: New York, 1965, pp 503-541.

58 Nesta P. Radiation chemistry of quinonoid compounds. In: Patai S, Rappoport S (eds) The Chemistry of the Quinonoid Compounds John Wiley & Sons: New York, 1988, pp 879-898.

59 Schaffer J. The Analysis of Incomplete Multivariate Data New York: Chapman and Hall, 1996.

Figures

Figure 1 Conceptual framework for statistical analysis relating structural features of compounds to patterns of gene expression in the NCI60 human cancer cell lines. Database [A] contains compound activity patterns, [S] contains molecular structural features of the tested compounds and [T] contains differential gene expression for potential molecular targets in the cells. Modified from Figure 1 of Reference 6.

Figure 2 Substructural queries defining the benzothiophenedione and indolonaphthoquinone classes.

Figure 3 Cytotoxic quinone anti-cancer agents in clinical use.

Figure 4 Representative quinones from the benzothiophenedione and indolonaphthoquinone classes used as probes to identify genes for which the expression patterns are highly correlated with the compound's activity.

Figure 5 Histogram of Pearson correlation coefficients of 3748 genes and 4463 compounds. The correlation coefficients across the NCI60 cell lines are approximately normally distributed around zero.

Tables

Table 1 Cell line clusters

Table 2 Selected genes with high values of the Studentized range statistic over cell clusters

Table 3 Statistics for average compound-gene correlations for several classes of compounds and selected genes

Table 4 Selected compound-gene correlations for NSC 656238, NSC 661223 and NSC 682991 (see Figure 4)

Received 7 January 2002; accepted 8 April 2002