Selecting the best targets is a key challenge for drug discovery, and achieving this effectively, efficiently and systematically is particularly important for prioritizing candidates from the sizeable lists of potential therapeutic targets that are now emerging from large-scale multi-omics initiatives, such as those in oncology. Here, we describe an objective, systematic, multifaceted computational assessment of biological and chemical space that can be applied to any human gene set to prioritize targets for therapeutic exploration. We use this approach to evaluate an exemplar set of 479 cancer-associated genes, reveal the tension between biological relevance and chemical tractability, and describe major gaps in available knowledge that could be addressed to aid objective decision-making. We also propose drug repurposing opportunities and identify potentially druggable cancer-associated proteins that have been poorly explored with regard to the discovery of small-molecule modulators, despite their biological relevance.
Identifying and validating disease-causing genes that are viable as drug targets is a key challenge in drug discovery.
Large-scale multi-omics initiatives are deepening our understanding of cancer and providing an unbiased view of possible molecular mechanisms of the disease. Such studies usually result in sizeable lists — often hundreds — of potential cancer drug targets, most of which are not members of well-understood cancer pathways.
The selection of a small number of genes for in-depth biological validation is thus often done in an ad hoc manner, which risks introducing bias or neglecting potentially druggable and therapeutically important novel targets.
We describe an objective, systematic, multifaceted computational approach to assessing biological and chemical space that draws simultaneously on unprecedented volumes of multidisciplinary data to assess large gene lists.
We utilize our new approach to evaluate 479 cancer genes from the Cancer Gene Census as an exemplar list and demonstrate the power of such an unbiased approach in rapidly unveiling potential therapeutic opportunities.
This analysis reveals the tension between biological relevance and chemical tractability, and highlights major gaps in available knowledge that can be addressed to aid objective decision-making.
We hypothesize drug repurposing opportunities and identify potentially druggable cancer proteins that are as yet poorly explored in the chemical space — despite their biological relevance — and we propose these proteins for in-depth chemical and biological studies.
We also illustrate how the mapping of biological and chemical data distillations onto cellular networks can provide deeper insights and potentially guide rational drug combination experiments.
We provide a live web-based portal to allow simultaneous annotation of up to 500 genes that can be applied to any human gene list. We propose that by using our approach alongside a researcher's own biological knowledge, stronger, more rational and unbiased decisions about target selection can be made that could lead to the discovery of a new generation of novel and chemically tractable therapeutic targets.
Cancer is the leading cause of death in developed countries and the second commonest in developing countries1. Opportunities to help reduce this huge adverse impact of cancer through the discovery of new drugs are benefiting from an increasing biological understanding of the disease, driven by technological advances including next-generation sequencing2 and other genomic technologies, genome-wide association studies (GWAS)3, proteomics analyses4, RNA interference studies5 and chemical biology6,7. These are fuelling efforts to identify vulnerabilities in cancer cells and hence new drug targets8. Nevertheless, cancer incidence continues to rise (see the cancer fact sheet on the World Health Organization (WHO) website), drug resistance to both cytotoxic and molecularly targeted drugs continues to emerge9, and the genetic complexity and heterogeneity10 associated with many cancers are becoming increasingly apparent. Consequently, the discovery and application of innovative targeted therapeutics that can have a genuine clinical impact is becoming a more difficult task11,12.
Although several novel first-in-class molecularly targeted drugs for cancer were approved in 2011–2012, including abiraterone13, crizotinib14 and vemurafenib15, the rate of success in cancer drug development is among the lowest of all therapeutic areas16,17. Not only can the causative genetics of cancer be complex, but targets that are biologically compelling may fall outside the 'druggable' or 'ligandable'18 parameters that are readily accessible by conventional medicinal chemistry approaches. Advances in drugging target classes that have traditionally been considered to be intractable, such as protein–protein interactions — exemplified by the discovery of the B cell lymphoma 2 (BCL-2) and BCL-XL inhibitor navitoclax19,20 — may well expand the target space that is accessible to small-molecule drug discovery. Nonetheless, reliance on tried and tested targets persists in drug discovery pipelines (as highlighted in an article on the Forbes website).
Large-scale genomics initiatives such as The Cancer Genome Atlas (TCGA), the Cancer Genome Project and the International Cancer Genome Consortium21 are providing a growing list of genes that are causally involved in cancer. Other complementary large-scale approaches such as synthetic lethality screens22 and GWAS3 are also contributing to the generation of lists of candidate genes that have a biologically and pathologically compelling role in cancer11. But which of these genes will lead to the next cancer drug?
Although systematic approaches to discover cancer genes are now commonplace, assessing and prioritizing them for biological validation and therapeutic exploitation remains a largely ad hoc exercise that does not necessarily make use of the full breadth and depth of data available from the abundant large-scale initiatives. In the field of tropical disease research, the TDR Targets resource23 introduced the importance of estimating a protein's potential therapeutic attractiveness through the combination of high-level druggability, orthology and predicted lethality to the infectious organism to prioritize targets for further investigation. Since then, the data landscape has changed substantially with the availability of large-scale, open-access resources24,25, making it possible for the first time to address target assessment comprehensively, objectively and in depth.
In order to therapeutically exploit a gene, it is crucial to establish the biological role of the gene and/or protein in disease causation and the cognate biochemical pathway. This requires extensive biological validation, which can be a challenging, long and expensive process26,27 and is not scalable to the size of the gene lists generated by large-scale efforts. Therefore, there is a clear need for a systematic, objective, data-driven assessment of such gene lists, based on information integrated from different disciplines and sources, with the goal of prioritizing genes for further detailed biological validation studies. Such an approach must provide sufficient detail to allow quantitative and thorough examination of the data supporting a target's suitability for therapeutic development, but it must also be applicable at a scale that can meet the demands of the large gene lists emerging from multi-omics efforts. When combined with the researchers' own biological and disease-specific knowledge, this assessment can form a powerful data-driven guide to selecting targets for further experiments before making the major resource-intensive commitment to a drug discovery project. Here, we demonstrate such a systematic, unbiased and objective computational approach.
The workflow with the annotation scheme and assessment criteria is summarized in Fig. 1. Further information on the data sets and analysis is provided in Box 1, with full details in Supplementary information S1 (notes). We have applied our approach to assess the Cancer Gene Census28 list, which is a manually curated set of genes that have mutations or other genomic abnormalities associated with cancer and are likely to be causative, as identified from genetic studies. This data set exemplifies large gene lists that have been generated from initial experimental exploration and that have varying degrees of biological validation. It highlights the roles of the genes in specific cancers, although many of the genes may well have roles in other malignancies, especially in view of the number of potential drug targets suggested by ongoing large-scale efforts.
By carrying out unbiased computational analyses of disparate data, including chemoinformatics, gene expression, mutations and three-dimensional structure, we have been able to prioritize — for further experimental work — the most chemically tractable targets within the gene list. In addition, we suggest potential alternative repurposing indications for known drugs and chemical tools. Here, we have focused on the chemical tractability of targets with small-molecule drugs rather than biologics, which would require a different prioritization strategy.
Disease-relevant versus drugged classes
At the time of our analysis (July 2012), the Cancer Gene Census list contained a total of 488 genes or loci. After the removal of duplicates and genes with no curated protein sequences, a total of 479 genes from the Census were selected for the subsequent analysis. We classified the 479 genes into major functional classes (Fig. 2a), as detailed in section 3 in Supplementary information S1 (notes). For simplicity, we assigned every gene product to a single functional class, although occasionally more than one class was applicable (for example, receptor tyrosine kinases can fit into both 'enzyme' and 'receptor' classes, but here they are placed in the enzyme class). The main functional classes represented in the Census are as follows: transcription factors and regulators (29%), enzymes (25%) and enzyme regulators (7%). Transcription factors in this set comprise primarily low-structural-complexity proteins such as C2H2 zinc finger domain-, basic leucine zipper domain- and helix–loop–helix domain-containing proteins, but also include three nuclear hormone receptors (peroxisome proliferator-activated receptor-γ (PPARγ), retinoid receptor-α and nuclear receptor subfamily 4 group A3). In general, most of the Census proteins are localized in either the nucleus (49%) or the cytoplasm (25%), whereas 10% are cell membrane-associated or extracellular.
We compared the functional class distribution (Fig. 2a) of the Census proteins to that of the targets of small-molecule pharmaceuticals approved by the US Food and Drug Administration (FDA), across all therapeutic areas as well as for oncology alone29. Figure 2b highlights the functional class enrichments between the data sets. There is a greater than twofold enrichment of enzymes in the current target set29 compared to the Census proteins (Census proteins: 25%; drug targets (all therapeutic areas): 55%; drug targets (oncology): 64%). The success of enzymes as targets of launched drugs is probably because enzymes typically possess a well-defined catalytic site, and because of the relative ease of setting up chemical screening assays. This enrichment points to the need for a concerted effort, using systems-based approaches, to identify enzymes that affect oncoproteins regardless of their own oncogenic role.
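The enzyme enrichment quoted above is a simple ratio of class fractions. The following minimal Python sketch reproduces that arithmetic from the percentages given in the text; the function and variable names are illustrative assumptions and are not part of the published workflow.

```python
# Fractions quoted in the text for the enzyme class, expressed as proportions.
census_fraction = 0.25
drug_target_fractions = {
    "all therapeutic areas": 0.55,
    "oncology": 0.64,
}


def fold_enrichment(target_fraction, reference_fraction):
    """Ratio of a class's share in one set to its share in a reference set."""
    if reference_fraction <= 0:
        raise ValueError("reference fraction must be positive")
    return target_fraction / reference_fraction


# Enrichment of enzymes among drug targets relative to the Census proteins.
enzyme_enrichment = {
    label: fold_enrichment(fraction, census_fraction)
    for label, fraction in drug_target_fractions.items()
}

for label, value in enzyme_enrichment.items():
    print(f"enzymes, drug targets ({label}) vs Census: {value:.2f}-fold")
```

Both ratios exceed two, which is consistent with the greater than twofold enrichment stated in the text.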
Conversely, transmembrane proteins such as G protein-coupled receptors (GPCRs) and ion channels are less frequently targeted in oncology, although they have been the focus of many drug discovery projects in other therapeutic areas owing to their importance as molecular gateways to cells and their suitability for binding to small molecules. They are also substantially under-represented in the list of Census proteins (Census proteins: 5%; drug targets (all therapeutic areas): 48%; drug targets (oncology): 7%).
The enrichment of transcription factors in the Census list (Census proteins: 17%; current drug targets (all therapeutic areas): 5.6%; drug targets (oncology): 9%) supports their importance in cancer30,31, as exemplified by nuclear factor-κB (NF-κB)32, signal transducers and activators of transcription (STATs)33 and the transcription factor ETS34, but also reveals the tension between their importance in the molecular pathology of cancer and their chemical tractability. In Fig. 2b we illustrate the enrichment of transcription factors and regulators as a group, and highlight the enrichment of nuclear hormone receptors (a subgroup of ligand-activated transcription factors) separately. Historically, transcription factors have been largely inaccessible to drug discovery efforts owing to the need to target more challenging protein–DNA or protein–protein interfaces and a general absence of an enclosed hydrophobic pocket, with the exception of nuclear hormone receptors, which have a small-molecule ligand-binding domain and thus can be more readily targeted with low-molecular-weight drugs, as exemplified by tamoxifen (an oestrogen receptor (ER) antagonist) and flutamide (an androgen receptor (AR) antagonist). Nonetheless, recent progress has been made with stabilized peptides that have the potential to target protein–protein interfaces in transcription factor complexes; for example, in the Notch pathway31,35. Alternatively, transcription factors could be indirectly targeted in cancer using systems biology approaches to analyse up- and/or downstream members of pathways involving transcription factors to identify proteins that are more chemically tractable36.
Opportunities for drug repurposing
One of the advantages of integrative and unbiased large-scale analyses is the ability to identify links in existing knowledge that may not always be readily apparent, such as potential opportunities to repurpose existing drugs or chemical tools. Examples of such studies have been reported for infectious37,38 and inflammatory39 diseases.
Using precedence or homology to targets of approved drugs29, we have identified proteins from the Cancer Gene Census that are themselves targets of approved drugs or that are members of the same protein family as an existing drug target and thus can be hypothesized to be druggable. These proteins, examples of drugs and associated therapeutic indications are detailed in section 4 in Supplementary information S1 (notes). We have found that a total of 28 Census proteins have >50% homology to a known drug target. Twenty-five of these proteins are themselves targets of launched drugs, and of these 25 proteins, three (PPARγ, DNA methyltransferase 3A and aldehyde dehydrogenase) do not currently have small-molecule drugs indicated for cancer therapy. A fourth protein, thyroid-stimulating hormone receptor (TSHR), is the target of a biologic that has been indicated as a diagnostic and adjuvant therapy to avoid hypothyroidism after thyroid ablation in patients with thyroid cancer. Antagonists of this receptor have not been indicated for the treatment of cancer.
PPARγ is a type II nuclear hormone receptor that has been reported in the Census because of the observation of a dominant translocation in patients with follicular thyroid cancer28; its 'insufficiency' through reduced protein levels or activity drives oncogenesis40. It is also the target for the thiazolidinedione class of antidiabetic drugs, which are currently used for the treatment of type 2 diabetes. Studies have demonstrated the antitumour activity of PPARγ agonists, which is mediated through the transactivation of genes that regulate cell proliferation, apoptosis and differentiation41. It has been suggested that treatment with a PPARγ agonist delays the progression of thyroid carcinogenesis40. There are various clinical trials currently underway involving PPARγ agonists in cancers including malignant liposarcoma42 and non-small-cell lung cancer (ClinicalTrials.gov identifiers: NCT01199068; NCT01199055). Based on our analysis, we hypothesize a potential new application for PPARγ agonists in follicular thyroid cancer.
TSHR is a class 1 GPCR and is the target of the biologic recombinant human TSHα. The Census reports a dominant missense mutation in TSHR that is associated with thyroid adenoma, and studies have shown that constitutive activation of TSHR by somatic mutations can lead to toxic thyroid adenoma43,44,45. TSHR mRNA has been studied as a potential biomarker of circulating thyroid carcinoma cells46. In several studies, small-molecule antagonists and inverse agonists of TSHR have been shown to block stimulating antibodies for the potential therapy of hyperthyroidism47,48. These combined findings suggest that TSHR antagonists may have therapeutic potential in thyroid cancers.
Smoothened (SMO) is the target of vismodegib, which has recently been approved for the treatment of basal cell carcinoma49. SMO is listed in the Census because a dominant missense mutation in the SMO gene has been associated with basal cell carcinoma. Interestingly, by exploring the TCGA database of copy number variation and gene expression data from 599 patients with glioblastoma multiforme50, we have identified that the SMO gene exhibits copy number variation ranging from three to seven copies in 35% of patients. Additionally, according to the same database, SMO is overexpressed in 93% of the patients (at least twofold overexpression in comparison with matched normal tissue). If these findings are validated and found to be clinically significant, they would suggest another application for vismodegib in the treatment of SMO-amplified glioblastoma multiforme.
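The two cohort-level statistics used in the SMO example (the share of patients with a copy number gain and the share with at least twofold overexpression relative to matched normal tissue) can be computed along the following lines. This is a minimal sketch with an invented four-patient cohort; the record layout, the thresholds as parameters and all names are assumptions for illustration, and it does not reflect the TCGA data format or the exact processing used in the analysis.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PatientProfile:
    """Hypothetical per-patient record for one gene (not a TCGA schema)."""
    copy_number: int           # estimated gene copies in the tumour
    tumour_expression: float   # expression level in tumour tissue
    normal_expression: float   # expression level in matched normal tissue


def fraction_amplified(profiles: Sequence[PatientProfile], min_copies: int = 3) -> float:
    """Share of patients whose tumour carries at least `min_copies` copies."""
    if not profiles:
        raise ValueError("cohort is empty")
    return sum(p.copy_number >= min_copies for p in profiles) / len(profiles)


def fraction_overexpressed(profiles: Sequence[PatientProfile], min_fold: float = 2.0) -> float:
    """Share of patients with a tumour/normal expression ratio of at least `min_fold`."""
    if not profiles:
        raise ValueError("cohort is empty")
    hits = sum(p.tumour_expression / p.normal_expression >= min_fold for p in profiles)
    return hits / len(profiles)


# Invented toy cohort, for illustration only.
toy_cohort = [
    PatientProfile(copy_number=2, tumour_expression=10.0, normal_expression=10.0),
    PatientProfile(copy_number=3, tumour_expression=25.0, normal_expression=10.0),
    PatientProfile(copy_number=5, tumour_expression=40.0, normal_expression=10.0),
    PatientProfile(copy_number=2, tumour_expression=22.0, normal_expression=10.0),
]

amplified = fraction_amplified(toy_cohort)
overexpressed = fraction_overexpressed(toy_cohort)
print(f"amplified: {amplified:.0%}, overexpressed: {overexpressed:.0%}")
```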
Additional applications can be envisaged for drugs that are in development or even earlier chemical tool compounds. For example, fostamatinib — an inhibitor of spleen tyrosine kinase (SYK) that is currently in development for the treatment of rheumatoid arthritis — has shown clinical activity in non-Hodgkin's lymphoma and chronic lymphocytic leukaemia51, and was found to induce tumour cell death in retinoblastoma models52. SYK is causally implicated in all of these cancer types and was found to have a role in head and neck squamous cell carcinoma77. Based on the oncogenic mutations reported in the Census, we propose that SYK inhibitors such as fostamatinib may also be useful in the treatment of myelodysplastic syndromes and peripheral T cell lymphoma.
Scale and utility of structural data
The availability of three-dimensional structures for a protein greatly empowers small-molecule drug discovery, as structural information can be used in hit generation and lead optimization as well as in understanding mechanisms of drug binding and resistance. Where available, multiple structural snapshots of the same protein provide an enhanced level of information to support drug discovery, as they allow the exploration of alternative functionally relevant conformations. In the absence of appropriate experimental structures, homology models have proved to be useful tools for this purpose. However, the ability to generate informative models decreases with lower homology and with fewer available relevant structural templates (see the notes on homology in Supplementary information S1 (notes)).
Our analysis shows that out of the 479 Census proteins, 257 (54%) are structurally characterized (Fig. 3a). By contrast, only approximately one-quarter of the human proteome has been structurally characterized. The higher degree of structural characterization of Census proteins is probably because many of these proteins have a high biological importance and hence have been more extensively studied. However, a detailed inspection reveals a grossly unequal representation of structural data. The distribution of the number of structures per protein (see section 5 in Supplementary information S1 (notes)) shows that out of the 257 structurally characterized proteins, one-third have only a single structure determined, whereas 14 better-known cancer proteins (such as RAS, heat shock protein 90 (HSP90) and tumour suppressor p53) each have more than 30 structural snapshots determined. It is important to note that many proteins have only been structurally characterized to a limited degree, and for some (for example, fibroblast growth factor receptor 3) the structure of the catalytic domain — which is important for drug-binding — has not yet been determined. Indeed, only 20% of the amino acid content of Census proteins has been structurally determined, highlighting the degree of partial structural characterization.
For about one-sixth of the Census proteins (79 of 479), structures have been determined in complex with small-molecule ligands; the majority of these proteins are enzymes. For 11 proteins, structural complexes with 10 or more different ligands have been independently determined. However, the majority of the structurally characterized proteins (178 out of 257) have not been solved in complex with any ligand (see section 5 in Supplementary information S1 (notes)). Some of these proteins may not naturally bind to small-molecule ligands; however, 36 of these proteins are enzymes, representing a set of cancer-associated proteins for which ligand-bound structures would be greatly beneficial.
Structural annotation can be expanded by identifying structurally characterized homologues, thus allowing the indirect analysis of otherwise uncharacterized proteins. A further 119 Census proteins can be structurally annotated in this way (Fig. 3a); of these 119 proteins, 69 have 50–89% homology to the nearest determined structure, which makes them suitable candidates for potentially useful homology modelling. Conversely, 103 Census proteins cannot currently be structurally annotated (see Supplementary information S1 (notes) for structural characterization criteria). These Census proteins may be structurally uncharacterized owing to technical difficulties or simply because of a relative lack of scientific interest in them so far. Nonetheless, for the structurally uncharacterized fraction of the Census proteins, the future availability of three-dimensional structures would help considerably in understanding their functional roles in cancer and would probably aid drug discovery efforts. Furthermore, efforts to provide multiple structural snapshots of Census proteins — determined under different biologically relevant conditions — would be both informative and valuable for objectively assessing the suitability of these proteins to bind small-molecule drugs.
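The structural coverage figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only counts quoted in the text; the category labels and variable names are assumptions introduced for illustration.

```python
TOTAL_CENSUS = 479

# Counts quoted in the text; the labels are illustrative.
structural_annotation = {
    "experimental structure": 257,
    "structurally characterized homologue only": 119,
    "no structural annotation": 103,
}
ligand_status = {
    "ligand-bound structure": 79,
    "no ligand-bound structure": 178,
}

# The three annotation categories should partition the full gene list.
assert sum(structural_annotation.values()) == TOTAL_CENSUS
# Ligand status subdivides the experimentally characterized proteins.
assert sum(ligand_status.values()) == structural_annotation["experimental structure"]


def percent(count, total):
    """Express a count as a percentage of a total."""
    return 100.0 * count / total


coverage_percent = {
    label: percent(count, TOTAL_CENSUS)
    for label, count in structural_annotation.items()
}
ligand_bound_of_census = percent(ligand_status["ligand-bound structure"], TOTAL_CENSUS)
ligand_bound_of_characterized = percent(
    ligand_status["ligand-bound structure"],
    structural_annotation["experimental structure"],
)

for label, value in coverage_percent.items():
    print(f"{label}: {value:.1f}% of Census proteins")
print(f"ligand-bound: {ligand_bound_of_census:.1f}% of Census proteins, "
      f"{ligand_bound_of_characterized:.1f}% of structurally characterized proteins")
```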
Chemical tractability using structure
Although the precedence-based druggability assessment described above provides a useful indication of druggability and also identifies drug repurposing opportunities, this approach has its limitations. First, it highlights protein families that have been drugged successfully in the past; therefore, it is historically biased as it ignores the possibility of a druggable but as yet undrugged protein family. Second, it assumes that all members of a given protein family are equally druggable. However, using the structural data available, the application of structure-based druggability predictions can extend the druggability boundaries and help to address some of these caveats.
Briefly, the structure-based druggability assessment identifies all cavities on a given three-dimensional structure and assesses their likely druggability based on physicochemical parameters that are independent of the homology of the protein to known drug targets. Two models of druggability predictions are used here: strict druggability (in which the cavity is compatible with binding to a 'rule of five'53 (RO5)-compliant drug-like molecule) and the more relaxed chemical tractability (see Box 1 and Supplementary information S1 (notes)). Interestingly, we have found that a total of 103 proteins from the Cancer Gene Census are predicted to be druggable using the strict model, which can be extended to 211 proteins using the relaxed chemical tractability model.
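For readers unfamiliar with the 'rule of five', the ligand-side criteria it refers to can be written as a short check. The sketch below applies the commonly cited thresholds (molecular weight of at most 500 Da, calculated logP of at most 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors, with one violation tolerated) to precomputed descriptors. It is not the cavity-scoring algorithm used in the structure-based assessment, which evaluates the protein pocket rather than a ligand; the class, function names and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed ligand descriptors (hypothetical container)."""
    molecular_weight: float   # daltons
    logp: float               # calculated octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    return sum([
        d.molecular_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def is_ro5_compliant(d: Descriptors, max_violations: int = 1) -> bool:
    """Treat a molecule as compliant if it has at most `max_violations` violations."""
    return ro5_violations(d) <= max_violations


# Invented descriptor values, for illustration only.
small_ligand = Descriptors(molecular_weight=350.4, logp=2.1, h_bond_donors=2, h_bond_acceptors=5)
large_ligand = Descriptors(molecular_weight=720.0, logp=6.3, h_bond_donors=4, h_bond_acceptors=12)

print(ro5_violations(small_ligand), is_ro5_compliant(small_ligand))
print(ro5_violations(large_ligand), is_ro5_compliant(large_ligand))
```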
Examining the druggability results by functional class (Fig. 3b) reveals that enzymes are the most druggable among the Census proteins: 63% of enzymes with known structures are predicted to be druggable using the strict model. Conversely, 94% of transcription factors with a known structure are predicted to be undruggable using the strict model.
The strength of the structure-based algorithm is that it allows the identification of novel and potentially druggable proteins that lie outside its historical training set, such as histone-binding proteins (Supplementary information S1 (notes)). However, this approach is limited by incomplete structural characterization of the proteins. Out of the 27 enzymes with known structures that fail the druggability prediction assessment, we have found that structures of the possible druggable domains are not available for 15 enzymes (for example, the catalytic domain of fibroblast growth factor receptor 3) but these enzymes can be annotated as being druggable because of their homology to druggable structures. Five additional enzymes are predicted to be chemically tractable using the relaxed model. These include the oncoprotein MDM2, which has two druggable domains according to this model, both of which are targets of inhibitors that are currently in Phase I trials54.
A further potential limitation of de novo structure-based druggability assessment is that some proteins can be predicted to be undruggable because the available structures are in an undruggable conformation. For example, isocitrate dehydrogenase 1 (IDH1) has multiple structures in three major conformations, only one of which is druggable using the strict model. This transient site phenomenon, which has also been identified in other proteins55, can be partially addressed by using as many structural snapshots of the same protein as possible and by including homologues to aid the understanding of possible variations in the three-dimensional protein structure.
Chemical landscape for Census proteins
Small-molecule chemical tools56 can substantially enhance biological validation and mechanistic evaluation, and in particular aid the understanding of the specific functional role of a protein in cancer. Additionally, the presence of published chemical screens for a target or a very close homologue indicates the existence of viable binding or biochemical assays — a key component needed for expediting the therapeutic exploration of a potential target. To identify which proteins from the Cancer Gene Census have active compounds in the literature that can be potentially utilized as tools, we used canSAR25 — an integrated database that brings together biological, chemical and pharmacological data on all human proteins — to report the number of active compounds for each Census protein and its homologues. In addition to identifying possible tool compounds, this analysis helps in predicting the chemical tractability of the potential targets.
Submicromolar active compounds identified in binding or biochemical assays have been reported in the literature for 86 of the 479 Census proteins (see section 7 in Supplementary information S1 (notes)). Of these, submicromolar drug-like compounds have been reported for 73 proteins. Generally, we can hypothesize that chemical hits can be mapped based on target homology. An additional four Census proteins have at least 50% sequence homology to a target with active compounds. Assay protocols have been published for all of these putative targets; furthermore, 69 of these targets have compounds that are active in cellular assays and so they are likely to be cell-penetrant chemical tools. An analysis of compounds that are active against Census proteins is provided in Supplementary information S1 (notes).
Multifaceted assessment of Census proteins
As seen above, pathogenic importance in cancer does not always correlate with chemical tractability. We have ranked the members of the Cancer Gene Census list based on evidence of chemical tractability: namely, family precedence, availability of submicromolar compounds and structure-based druggability (see Supplementary information S2 (table)). Figure 4 shows the overlap among these three pieces of evidence. We have found that a total of 132 Census proteins have at least one piece of evidence for chemical tractability (Fig. 4) and that 173 can be designated as tractable based on ≥50% homology to a tractable protein.
Interestingly, we have identified 27 Census proteins for which there are active compounds in the literature that have submicromolar potency, despite the fact that these proteins were not predicted to be druggable by either structure- or precedence-based methods (Fig. 4). No three-dimensional structures have been determined for 11 of these proteins; the remainder (for example, the serine/threonine kinase BUB1B) are only partially structurally determined and the druggable region is missing, further highlighting the importance and limitations of utilizing three-dimensional structural data.
The combination of these orthogonal assessments — that is, structure-based, ligand-based and precedence-based assessments — can more comprehensively identify possible tractable targets for future drug discovery (Fig. 4). Importantly, using structure-based assessments, we have identified 46 Census proteins that are considered to be druggable, but for which few or no active compounds have yet been published. Table 1 details these 46 targets grouped into functional classes. Of these, 16 are enzymes — predominantly methyltransferases, helicases and ligases. Other targets encompass a diverse range of functions, and include histone-binding proteins and regulatory subunits of enzymes.
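The overlap analysis described in this section amounts to set operations over three lists of proteins, one per line of evidence. The following minimal sketch uses invented placeholder identifiers and a simple count of supporting evidence as the ranking key; the actual ranking scheme may weight the evidence differently, so the ordering rule here is an assumption.

```python
# Invented placeholder identifiers; each set lists proteins supported by one line of evidence.
evidence = {
    "precedence": {"GENE_A", "GENE_B", "GENE_C"},
    "compounds": {"GENE_B", "GENE_C", "GENE_D"},
    "structure": {"GENE_C", "GENE_E"},
}


def evidence_counts(evidence_sets):
    """Number of evidence lines supporting each protein."""
    counts = {}
    for genes in evidence_sets.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return counts


def rank_by_evidence(evidence_sets):
    """Order proteins by decreasing evidence count, then alphabetically."""
    counts = evidence_counts(evidence_sets)
    return sorted(counts, key=lambda gene: (-counts[gene], gene))


ranking = rank_by_evidence(evidence)

# Proteins with at least one line of evidence (the union of the three sets).
any_evidence = set().union(*evidence.values())

# Active compounds reported, but not predicted by structure or precedence.
compounds_only = evidence["compounds"] - evidence["structure"] - evidence["precedence"]

# Predicted druggable from structure, but without reported active compounds.
druggable_unexplored = evidence["structure"] - evidence["compounds"]

print(ranking)
print(sorted(compounds_only), sorted(druggable_unexplored))
```

The last two set differences correspond to the two groups highlighted in the text: proteins with potent compounds that neither prediction method flagged, and structurally druggable proteins that remain chemically unexplored.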
Some of the Census genes are oncogenes and others are tumour suppressors. Chemical intervention strategies for addressing these two classes will be different. It is easier, for example, to inhibit an abnormally activated enzyme but much more difficult to correct a loss-of-function phenotype. Out of the 46 targets, 26 are annotated by the Census as oncogenes in the reported cancer type, 19 are annotated as tumour suppressors28, and one — CBL, encoding an E3 ubiquitin protein ligase — is annotated as having both functions. Indeed, it is becoming increasingly common to identify genes that can act both as oncogenes and as tumour suppressors, depending on the specific genetic context57,58,59.
When there is no small-molecule chemical matter or crystal structure available for a protein, knowledge of close homologues provides a starting point in the initial search for chemical tools or potential druggable proteins. By mapping the annotation from homologous targets (with ≥50% homology), we have found that the number of potentially druggable proteins without chemical screening data increases from 46 to 83. Furthermore, the number of proteins with active compounds increases to 90, providing more chemical options for exploring cancer biology.
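Annotation transfer by homology, as described above, can be sketched as a single-step lookup: a protein without direct evidence inherits an annotation if it has a sufficiently close homologue that carries it. The example below assumes that 'homology' is measured as percent sequence identity with a 50% cutoff and uses invented identifiers; the function name and input layout are hypothetical.

```python
from typing import Iterable, Set, Tuple


def transfer_by_homology(
    directly_annotated: Set[str],
    homology: Iterable[Tuple[str, str, float]],
    min_identity: float = 50.0,
) -> Set[str]:
    """Return proteins that gain the annotation through a close homologue.

    `homology` holds (query, homologue, percent_identity) triples. Transfer is
    single-step: only directly annotated homologues can pass on the annotation.
    """
    inferred = set()
    for query, homologue, identity in homology:
        if (
            query not in directly_annotated
            and homologue in directly_annotated
            and identity >= min_identity
        ):
            inferred.add(query)
    return inferred


# Invented toy data, for illustration only.
direct = {"PROT_1", "PROT_2"}
pairs = [
    ("PROT_3", "PROT_1", 62.0),  # close homologue of an annotated protein
    ("PROT_4", "PROT_2", 41.0),  # below the identity cutoff
    ("PROT_5", "PROT_9", 88.0),  # homologue itself is not annotated
    ("PROT_2", "PROT_1", 75.0),  # already directly annotated
]

inferred = transfer_by_homology(direct, pairs)
expanded = direct | inferred
print(sorted(inferred), len(expanded))
```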
Potential novel druggable targets
Of particular interest are the 46 cancer-associated targets (Table 1) we have identified that are predicted to be druggable and are structurally characterized but appear to have undergone little or no active small-molecule chemical screening efforts as far as can be gleaned from the literature. These 46 proteins that are potentially druggable but as yet unexplored chemically are promising potential drug targets. The existence of three-dimensional structures for these proteins makes them excellent candidates for structure-based hit identification — for example, using fragment-based approaches. Some of these targets have only one druggable structure available, and further structural characterization may support or contradict the evidence for the existence of a chemically tractable pocket. Nonetheless, the list of 46 cancer-associated targets contains some interesting proteins that merit further investigation. It is worth noting that this list is based on the strict definition of druggability and can be expanded to 143 targets when considering a less stringent definition (see Supplementary information S2 (table)). Of considerable interest, 27 of the 46 targets are labelled in the Cancer Gene Census as oncogenes, having dominant genetics, and thus small molecules that could block their activity would be especially desirable as chemical tools. We describe some examples of these targets below (Fig. 5).
The guanine nucleotide-binding protein Gs α-subunit isoform GNAS (Fig. 5a) is identified in the Census as having a dominant activating mutation in pituitary adenoma28. It is involved in the hormonal regulation of adenylyl cyclase, thus stimulating the synthesis of cyclic AMP. Further activating mutations of the GNAS gene have been identified in kidney, thyroid, adrenocortical, colorectal and Leydig cell tumours60. Small-molecule inhibitors of this enzyme would therefore serve as tools for validating the role of GNAS in the biology of these cancers and may have potential therapeutic applications.
The enzyme IDH1 (Fig. 5b) is a cytosolic dehydrogenase that is involved in the third step of the citric acid cycle. Mutations in the IDH1 gene are found in 80% of grade II–III gliomas and secondary glioblastomas in humans, and have more recently been linked to various types of leukaemia, including acute myeloid leukaemia61. IDH1 mutations result in loss of the enzyme's ability to catalyse the conversion of isocitrate to α-ketoglutarate; instead, mutant proteins gain a neomorphic enzymatic activity, catalysing the reduction of α-ketoglutarate to the 'oncometabolite' 2-hydroxyglutarate, which plays a part in the development and progression of malignant brain tumours62. Despite the growing interest in the role of this enzyme, particularly in gliomas63, at the time of our analysis no chemical tool compounds had been published to help understand its biology. However, after the completion of our analysis, the results of a high-throughput screen of more than 3,000 compounds against IDH1 were deposited into the PubChem database64, heralding the beginnings of efforts towards the chemical modulation of this important target.
Calcium-activated nucleotidase 1 (CANT1) (Fig. 5c), a member of the apyrase family, is a soluble nucleotidase that preferentially hydrolyses di- and triphosphates in a calcium-dependent manner65. Various CANT1–ETV4 (ETS variant 4) fusion transcripts have been identified in prostate cancer28,66. In addition, CANT1 is overexpressed in prostate tumours, and a reduction in CANT1 expression in prostate cancer cell lines results in decreased proliferation and migration67. The catalytic site of CANT1 is predicted to be druggable based on our analysis, and target validation studies would benefit from the availability of a small-molecule chemical inhibitor of CANT1.
A family of related enzymes, namely DEAD box helicases (DEAD box helicase 5 (DDX5), DDX6, DDX10, DDX17 and eukaryotic translation initiation factor 4A (EIF4A)), are listed in the Census because activating mutations in these enzymes have been implicated in several solid cancers and leukaemias28. According to our analysis, the catalytic domain of this family is highly flexible and thus the formation of a druggable cavity is dependent on conformational rearrangement. The paucity of chemical probe data for members of this family may be partly due to this transient binding site or to difficulties in assay development. Nonetheless, the fact that this protein family is frequently associated with cancer, often with activating mutations, probably warrants investment into biological target validation, assay development and compound screening to identify inhibitors of these enzymes.
A total of 19 of the predicted druggable proteins in Table 1 are products of tumour suppressor genes. Therapeutic targeting of these proteins using small molecules will be more complex as it will require agonists or activators, or the modulation of an alternative protein that regulates their function. BRG1 (also known as SMARCA4) (Fig. 5d) is a transcriptional activator, and reported mutations have indicated its role as a tumour suppressor in various cancers68. It has two druggable domains: the AAA ATPase domain and a bromodomain. Developing small-molecule chemical tools to examine the specific roles of these domains would enhance our ability to understand the role of SMARCA4 in cancer. Moreover, it has been shown that rather than acting solely as a tumour suppressor, SMARCA4 is involved in senescence and in the activation of p53 (Ref. 69). SMARCA4 has also been shown to be overexpressed in various cancers70,71. This suggests that small-molecule modulators of SMARCA4, either activators or inhibitors, would be useful as chemical tools and could have therapeutic potential.
A network view of evidence-based assessment
Pathogenic driver genes in cancer function within complex protein interaction networks12. Mapping the information discussed in this present study onto a protein interaction network has the potential to allow the identification of key druggable intervention points and also to help pinpoint potential alternative targets that in turn regulate undruggable disease drivers. Figure 6 shows an interaction network of the proteins from the Cancer Gene Census; the interaction network is annotated according to biological roles (Fig. 6a), the precedence- and chemistry-based approaches (Fig. 6b), the structure-based assessment approaches (Fig. 6c) and, finally, a combination of all these annotations (Fig. 6d).
The resulting global interaction network reflects biological processes that are important to multiple cancers, from which some interesting patterns emerge with respect to therapeutic opportunities. With regard to biological role, oncoproteins and tumour suppressors tend to occupy distinct regions of the network: some subnetworks are predominantly oncogenic, whereas others are predominantly tumour-suppressive (Fig. 6a). Targets of approved pharmaceuticals tend to be well connected, but interestingly are not major hubs of the network, and they primarily cluster within the same oncogenic subnetwork (Fig. 6b). This partially reflects a historical bias, as molecularly targeted oncology drugs are typically inhibitors or antagonists that are aimed at blocking the function of oncoproteins, which are clustered into a single major oncogenic subnetwork. However, Fig. 6b also shows that this historical activity appears to have neglected, to a large degree, other oncoproteins in smaller subnetworks. This finding raises the question of whether the previous focus of drug discovery efforts within limited cancer subnetworks may help to explain the high rates of emerging resistant clones that may exploit alternative routes through the global network9,10. Despite the fact that historical drug discovery activity has focused on limited subnetworks, we have found that targets that are predicted to be druggable fall within multiple subnetworks, both oncogenic and tumour-suppressive (Fig. 6c).
Finally, combining all of the above annotations reveals druggable oncogenic subnetworks outside the regions of historical bias (Fig. 6d). This observation could provide support for approaches that explicitly aim to target these alternative, neglected subnetworks in future drug discovery programmes. Furthermore, applying the above analysis to a network derived from data relating to a specific cancer type may similarly highlight key druggable subnetworks in that particular cancer, and this could prove to be useful in identifying targets for rational drug combinations12.
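The idea of overlaying annotations on an interaction network to find druggable oncogenic subnetworks that lack approved-drug targets can be illustrated with a small sketch. How subnetworks were delineated is not specified here, so the code below assumes, purely for illustration, that a subnetwork is a connected component and that "predominantly oncogenic" means more than half of its members are oncogenes. All function names, node names and annotations are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Split an undirected graph into connected components (sets of nodes)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def neglected_druggable_oncogenic(components, role, druggable, drug_targets):
    """Keep mostly oncogenic components with a druggable node and no drug target."""
    hits = []
    for component in components:
        n_oncogenes = sum(1 for n in component if role.get(n) == "oncogene")
        mostly_oncogenic = n_oncogenes > len(component) / 2
        if mostly_oncogenic and component & druggable and not component & drug_targets:
            hits.append(component)
    return hits


# Toy network with placeholder protein names.
nodes = {"P1", "P2", "P3", "P4", "P5", "P6", "P7"}
edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P5"), ("P6", "P7")]
role = {
    "P1": "oncogene", "P2": "oncogene", "P3": "tumour_suppressor",
    "P4": "oncogene", "P5": "oncogene",
    "P6": "tumour_suppressor", "P7": "tumour_suppressor",
}
druggable = {"P1", "P4", "P6"}   # predicted druggable (assumed labels)
drug_targets = {"P2"}            # targets of approved drugs (assumed labels)

components = connected_components(nodes, edges)
hits = neglected_druggable_oncogenic(components, role, druggable, drug_targets)
print([sorted(c) for c in hits])
```

In a densely connected real interactome, connected components would be too coarse, and a community-detection or clustering step would be needed in their place; the annotation overlay itself would be unchanged.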
The global, systematic, objective and multidisciplinary computational analysis presented here allows for the effective, unbiased and data-driven assessment and prioritization of large, biologically compelling gene lists for the purpose of drug discovery. This approach, applied here to the Cancer Gene Census as an exemplar gene list, can be adapted to any gene set of interest that may emerge from large-scale omics studies, functional screening initiatives or GWAS39,72 from any therapeutic area. The full data set of 479 Census genes, with all the annotations from the analysis presented here, is available in Supplementary information S2 (table).
We emphasize that the large-scale approach described here is not a substitute for a detailed biological understanding of individual pathogenic targets and pathways; rather, it can be used alongside and indeed inform such classical in-depth studies. Currently, the selection of targets for validation tends to be an ad hoc process that does not take advantage of the volume, breadth and depth of multidisciplinary data that are becoming increasingly available from large-scale initiatives worldwide.
As detailed biological validation of each individual target is a challenging, long and expensive process26,27, objective prioritization to a more manageable gene list is essential. We have shown that, by integrating information from different large-scale research initiatives (comprising information on protein functional class, homology to targets of approved drugs, three-dimensional structure and the existence of published active small molecules), we can effectively annotate a biologically and pathogenically important gene list containing potential drug targets, exemplified here by the Cancer Gene Census. We have also detected potential 'quick wins' through the identification of targets for repurposing known drugs or active chemical tool compounds that can be tested for activity in cancer models. Such small-molecule probe compounds have the potential to facilitate biological research and target validation and act as pathfinders for drug discovery7. The ability to identify published active compounds rapidly and systematically makes hypothesis testing using small-molecule chemical tools achievable without needing initial medicinal chemistry investment.
Of interest, our analysis has revealed the intriguing tension between targets that are chemically tractable with current medicinal chemistry capabilities versus those that are biologically important in cancer. Many oncoproteins are intractable with conventional medicinal chemistry approaches. These targets can be potentially addressed with small-molecule compounds through two strategies: by investing in medicinal chemistry approaches to expand the boundaries of druggability20; or via the identification of alternative druggable targets within the pathway or subnetwork in question12,36.
Although the Cancer Gene Census has the advantage of being a richly annotated and well-curated data set, the protocol and analysis demonstrated here for cancer can be applied to any human gene set from any disease area. In addition, the data in canSAR relate to all human proteins. Whereas certain biological data such as gene mutations and copy number variations focus on cancer, the three-dimensional structure data, druggability assessments and chemical bioactivity data are comprehensive and relate to the entire proteome and multiple model organisms. This makes the analysis and underlying tools applicable to most human diseases.
When combined with disease-specific biological information, together with an understanding of the target and cognate pathway involved in disease causation, the wealth of integrated multidisciplinary data can help provide a powerful, unbiased rationale for target prioritization and selection. This analysis utilizes large-scale multidisciplinary knowledge, including peer-reviewed and curated public chemoinformatic data from the medicinal chemistry literature24,25. In the future, this information could be further enriched by mining the patent literature. Well-curated patent resources do not currently exist in the public domain, and commercial databases are limited for large-scale analyses. Also, patents generally lack detailed structure–activity relationship data and precise individual compound potencies. In addition to the well-curated and peer-reviewed chemical data literature sources currently included in our approach, it should be possible to apply our methodology to other chemical databases in the public domain such as the PubChem database64, which will increase the breadth of the chemical annotation, albeit at the expense of including data that are not curated or peer-reviewed.
By mapping the multidisciplinary information discussed in the present study onto a protein interaction network, we reveal that targeted cancer drugs have historically focused on limited oncogenic subnetworks, neglecting other potentially druggable oncogenic and tumour-suppressive subnetworks that may be crucial for disease pathology and drug resistance. Mapping the chemogenomic and druggability data onto a disease-derived protein interaction network and combining them with biological data and existing knowledge of the disease will help identify key druggable nodes and suggest alternative approaches to modulate less druggable targets. In addition, the identification of chemically neglected subnetworks may reveal alternative therapeutic approaches and synergistic combinations that could inform target selection either for drug discovery or for drug repurposing, with the aim of identifying rational drug combinations that have the potential to overcome the major challenge of drug resistance9,10,12.
Perhaps surprisingly, our analysis shows that the horizon of cancer drug discovery may not be as dominated by intractable targets as is frequently feared73. Although efforts to extend the target space and drug the 'undruggable'73 continue to be necessary, our analysis shows that there are many protein classes with active sites or binding pockets that are compatible with conventional small-molecule drug intervention, but they have yet to be explored chemically. Some of these classes are beginning to be targeted (for example, bromodomains)74,78 but, on the basis of available evidence, others do not appear to be receiving extensive medicinal chemistry investments.
Of particular interest for new drug discovery, we have identified 46 proteins in the Cancer Gene Census (as highlighted in Table 1, with specific examples discussed above) that need to be examined carefully as they may provide the next wave of tractable targets for cancer. These proteins, which are predicted to be druggable but for which there is a lack of chemical compounds to modulate them, represent potential novel biological targets for chemical exploitation. This novelty carries with it an increased development risk compared to tried and tested targets, but this risk could be reduced by the discovery and use of small-molecule chemical tools7,56. Additionally, and despite the widely accepted importance of structural biology in modern drug discovery, we have noted that only 26% of human proteins are structurally characterized (either themselves or via their very close orthologues). Moreover, as we show, many of these proteins are only partially characterized. Indeed, only ∼12% of the amino acid content of the proteome is structurally characterized, highlighting the urgent need for further investment to expand the number and diversity of available protein structures82.
To empower the research community to carry out its own objective computational assessments, we provide a live web-based tool to allow the rapid annotation of human gene lists with the type of integrative information used in our own analysis (see the canSAR website). This tool has the advantage of allowing researchers to refresh their assessments regularly in response to the fast-changing information landscape, which is continually updated on the canSAR website25. We strongly recommend that the information outputs that researchers obtain from their use of our approach and computational tools are considered together with their own internal experimental data or in-house analyses — for example, of proprietary patent databases.
In conclusion, recent surveys26,27 have emphasized the importance of very careful and thorough target validation to reduce the risk of failure later in the drug development pipeline. However, in-depth 'wet biology' assessments to achieve this goal are time-consuming and expensive, which precludes conducting such studies on large numbers of targets to match the scale of omics initiatives. In addition to having biological and pathogenic importance as well as acceptable druggability or chemical tractability, targets need to have available biological assays for identifying small-molecule chemical probes and progressing them into drug discovery programmes. The application of systematic, multidisciplinary and unbiased computational assessments, such as the one we describe and exemplify here, can help to prioritize those targets and target classes that will benefit most from further investment in target validation, assay development, chemical biology or medicinal chemistry to discover new small-molecule probes and drugs. This will have a major impact on opening up new target space for drug discovery in cancer and, by extrapolation, other disease areas.
This work was supported by Cancer Research UK (grant numbers C309/A8274 and C309/A11566). P.W. is a Cancer Research UK Life Fellow. The authors acknowledge additional funding from Cancer Research UK to the Cancer Research UK Cancer Centre and from the UK National Health Service (NHS) to the National Institute for Health Research (NIHR) Biomedical Research Centre at The Institute of Cancer Research and Royal Marsden Hospital, UK. The authors thank K. Bulusu for technical help, and thank J. Blagg, M. Garnett and U. McDermott for valuable discussions and comments. Author contributions: B.A.L. conceived the project and designed the analysis; M.P., M.H.B. and B.A.L. performed the data analysis and informatics and wrote the paper; P.W. provided biological analysis and insights and wrote the paper; J.T. developed the target annotation tool.
Descriptions of Supplementary Table 2
Nature Reviews Cancer (2017)