Abstract
Small molecules are usually compared by their chemical structure, but there is no unified analytic framework for representing and comparing their biological activity. We present the Chemical Checker (CC), which provides processed, harmonized and integrated bioactivity data on ~800,000 small molecules. The CC divides data into five levels of increasing complexity, from the chemical properties of compounds to their clinical outcomes. In between, it includes targets, off-targets, networks and cell-level information, such as omics data, growth inhibition and morphology. Bioactivity data are expressed in a vector format, extending the concept of chemical similarity to similarity between bioactivity signatures. We show how CC signatures can aid drug discovery tasks, including target identification and library characterization. We also demonstrate the discovery of compounds that reverse and mimic biological signatures of disease models and genetic perturbations in cases that could not be addressed using chemical information alone. Overall, the CC signatures facilitate the conversion of bioactivity data to a format that is readily amenable to machine learning methods.
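As an illustration of what comparing compounds by bioactivity signature can look like, the following is a minimal sketch and not the CC implementation. It assumes that each compound is represented by a fixed-length numeric vector and that cosine similarity is an acceptable measure; the paper may use a different metric. The compound names and values are hypothetical toy data, not CC signatures. Under these assumptions, the most similar library compound is a candidate to mimic the query signature and the most anti-correlated one is a candidate to reverse it.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("signatures must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # An all-zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, sig)) for name, sig in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy signatures (made-up numbers, not CC data).
library = {
    "compound_A": [0.9, 0.1, -0.3, 0.5],
    "compound_B": [-0.8, -0.2, 0.4, -0.6],
    "compound_C": [0.1, 0.9, 0.2, -0.1],
}
query = [1.0, 0.0, -0.4, 0.6]

ranking = rank_by_similarity(query, library)
mimic_candidate = ranking[0][0]    # highest similarity to the query
reversal_candidate = ranking[-1][0]  # most anti-correlated with the query

for name, score in ranking:
    print(f"{name}\t{score:+.3f}")
```

Because every signature has the same vector format, the same ranking function can in principle be applied to any data level, from chemical properties to clinical outcomes.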
Data availability
All gene expression signatures have been deposited in the GEO (GSE137202).
Code availability
To facilitate access to our data, we built a web-based resource (https://chemicalchecker.org), which includes all the bioactivity signatures in HDF5 format and the full code of the CC resource.
Change history
21 May 2020
A Correction to this paper has been published: https://doi.org/10.1038/s41587-020-0564-6
References
Sterling, T. & Irwin, J. J. ZINC 15—ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337 (2015).
Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res. 45, D945–D954 (2017).
Wang, Y. et al. PubChem BioAssay: 2017 update. Nucleic Acids Res. 45, D955–D963 (2017).
Wishart, D. S. Chapter 3: small molecules and disease. PLOS Comput. Biol. 8, e1002805 (2012).
Duran-Frigola, M., Rossell, D. & Aloy, P. A chemo-centric view of human health and disease. Nat. Commun. 5, 5676 (2014).
Rouillard, A. D. et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database 2016, baw100 (2016).
Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs from 1981 to 2014. J. Nat. Prod. 79, 629–661 (2016).
Rodrigues, T., Reker, D., Schneider, P. & Schneider, G. Counting on natural products for drug design. Nat. Chem. 8, 531–541 (2016).
Welsch, M. E., Snyder, S. A. & Stockwell, B. R. Privileged scaffolds for library design and drug discovery. Curr. Opin. Chem. Biol. 14, 347–361 (2010).
Bleicher, K. H., Böhm, H.-J., Müller, K. & Alanine, A. I. Hit and lead generation: beyond high-throughput screening. Nat. Rev. Drug Discov. 2, 369–378 (2003).
Holbeck, S. L., Collins, J. M. & Doroshow, J. H. Analysis of food and drug administration–approved anticancer agents in the NCI60 panel of human tumor cell lines. Mol. Cancer Therap. 9, 1451–1460 (2010).
Seashore-Ludlow, B. et al. Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer Discov. 5, 1210–1223 (2015).
Campillos, M., Kuhn, M., Gavin, A.-C., Jensen, L. J. & Bork, P. Drug target identification using side-effect similarity. Science 321, 263–266 (2008).
Petrone, P. M. et al. Rethinking molecular similarity: comparing compounds on the basis of biological activity. ACS Chem. Biol. 7, 1399–1409 (2012).
Papadatos, G., Gaulton, A., Hersey, A. & Overington, J. P. Activity, assay and target data curation and quality in the ChEMBL database. J. Comput. Aided Mol. Des. 29, 885–896 (2015).
Duran-Frigola, M., Mateo, L. & Aloy, P. Drug repositioning beyond the low-hanging fruits. Curr. Opin. Syst. Biol. 3, 95–102 (2017).
Nguyen, D. T. et al. Pharos: collating protein information to shed light on the druggable genome. Nucleic Acids Res. 45, D995–D1002 (2017).
Duran-Frigola, M., Fernandez-Torras, A., Bertoni, M. & Aloy, P. Formatting biological big data for modern machine learning in drug discovery. WIREs Comp. Mol. Sci. 9, e1408 (2018).
Corsello, S. M. et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nat. Med. 23, 405–408 (2017).
Jokinen, E. & Koivunen, J. P. MEK and PI3K inhibition in solid tumors: rationale and evidence to date. Ther. Adv. Med. Oncol. 7, 170–180 (2015).
Lamb, J. et al. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935 (2006).
Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
Filzen, T. M., Kutchukian, P. S., Hermes, J. D., Li, J. & Tudor, M. Representing high throughput expression profiles via perturbation barcodes reveals compound targets. PLoS Comput. Biol. 13, e1005335 (2017).
Chen, B. et al. Reversal of cancer gene expression correlates with drug efficacy and reveals therapeutic targets. Nat. Commun. 8, 16022 (2017).
Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740–754 (2016).
Encinas, M. et al. Sequential treatment of SH-SY5Y cells with retinoic acid and brain-derived neurotrophic factor gives rise to fully differentiated, neurotrophic factor-dependent, human neuron-like cells. J. Neurochem. 75, 991–1003 (2000).
Tanzi, R. E. The genetics of Alzheimer disease. Cold Spring Harb. Perspect. Med. 2, a006296 (2012).
Carvalho-Silva, D. et al. Open Targets Platform: new developments and updates two years on. Nucleic Acids Res. 47, D1056–D1065 (2019).
Perszyk, R. E. et al. GluN2D-containing N-methyl-d-aspartate receptors mediate synaptic transmission in hippocampal interneurons and regulate interneuron activity. Mol. Pharmacol. 90, 689–702 (2016).
Harold, D. et al. Genome-wide association study identifies variants at CLU and PICALM associated with Alzheimer’s disease. Nat. Genet. 41, 1088–1093 (2009).
Anselmo, A. C., Gokarn, Y. & Mitragotri, S. Non-invasive delivery strategies for biologics. Nat. Rev. Drug Discov. 18, 19–40 (2018).
Depper, J. M., Leonard, W. J., Robb, R. J., Waldmann, T. A. & Greene, W. C. Blockade of the interleukin-2 receptor by anti-Tac antibody: inhibition of human lymphocyte activation. J. Immunol. 131, 690–696 (1983).
Benson, J. M. et al. Therapeutic targeting of the IL-12/23 pathways: generation and characterization of ustekinumab. Nat. Biotechnol. 29, 615–624 (2011).
Reddy, M. et al. Modulation of CLA, IL-12R, CD40L, and IL-2Ralpha expression and inhibition of IL-12- and IL-23-induced cytokine secretion by CNTO 1275. Cell Immunol. 247, 1–11 (2007).
Xu, M. J., Johnson, D. E. & Grandis, J. R. EGFR-targeted therapies in the post-genomic era. Cancer Metastasis Rev. 36, 463–473 (2017).
Masuelli, L. et al. Apigenin induces apoptosis and impairs head and neck carcinomas EGFR/ErbB2 signaling. Front. Biosci. 16, 1060–1068 (2011).
Hu, W. J., Liu, J., Zhong, L. K. & Wang, J. Apigenin enhances the antitumor effects of cetuximab in nasopharyngeal carcinoma by inhibiting EGFR signaling. Biomed. Pharmacother. 102, 681–688 (2018).
Sawai, A. et al. Inhibition of Hsp90 down-regulates mutant epidermal growth factor receptor (EGFR) expression and sensitizes EGFR mutant tumors to paclitaxel. Cancer Res. 68, 589–596 (2008).
Williams, A. J. et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov. Today 17, 1188–1198 (2012).
Rodgers, G. et al. Glimmers in illuminating the druggable genome. Nat. Rev. Drug Discov. 17, 301–302 (2018).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Lee, Y. S. et al. A computational framework for genome-wide characterization of the human disease landscape. Cell Syst. 8, 152–162 (2019).
Mendez-Lucio, O., Baillif, B., Clevert, D. A., Rouquie, D. & Wichard, J. De novo generation of hit-like molecules from gene expression signatures using artificial intelligence. Nat. Commun. 11, 10 (2020).
Reymond, J.-L. The Chemical Space Project. Acc. Chem. Res. 48, 722–730 (2015).
Irwin, J. J., Gaskins, G., Sterling, T., Mysinger, M. M. & Keiser, M. J. Predicted biological activity of purchasable chemical space. J. Chem. Inf. Model. 58, 148–164 (2018).
Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333–337 (2014).
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Axen, S. D. et al. A simple representation of three-dimensional molecular structure. J. Med. Chem. 60, 7393–7409 (2017).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Lipinski, C. A. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today Technol. 1, 337–341 (2004).
Congreve, M., Carr, R., Murray, C. & Jhoti, H. A ‘rule of three’ for fragment-based lead discovery? Drug Discov. Today 8, 876–877 (2003).
Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).
Cheng, H. et al. ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol. 10, e1003926 (2014).
Gilson, M. K. et al. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 44, D1045–D1053 (2016).
Hastings, J. et al. ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res. 44, D1214–D1219 (2016).
Thiele, I. et al. A community-driven global reconstruction of human metabolism. Nat. Biotechnol. 31, 419–425 (2013).
Cerami, E. G. et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 39, D685–D690 (2011).
Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).
Pryszcz, L. P., Huerta-Cepas, J. & Gabaldon, T. MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res. 39, e32 (2011).
Kruger, F. A. & Overington, J. P. Global analysis of small molecule binding to related protein targets. PLoS Comput. Biol. 8, e1002333 (2012).
Zwierzyna, M. & Overington, J. P. Classification and analysis of a large collection of in vivo bioassay descriptions. PLoS Comput. Biol. 13, e1005641 (2017).
Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017).
Li, T. et al. A scored human protein–protein interaction network to catalyze genomic interpretation. Nat. Methods 14, 61–64 (2017).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
Kandasamy, K. et al. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 11, R3 (2010).
Mi, H. et al. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 45, D183–D189 (2017).
Kelder, T. et al. WikiPathways: building research communities on biological pathways. Nucleic Acids Res. 40, D1301–D1307 (2012).
Mosca, R., Ceol, A. & Aloy, P. Interactome3D: adding structural details to protein networks. Nat. Methods 10, 47–53 (2013).
Leiserson, M. D. et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet. 47, 106–114 (2015).
Iorio, F. et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc. Natl Acad. Sci. USA 107, 14621–14626 (2010).
Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
Basu, A. et al. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell 154, 1151–1161 (2013).
Chabner, B. A. NCI-60 cell line screening: a radical departure in its time. J. Natl Cancer Inst. 108, djv388 (2016).
Azur, M. J., Stuart, E. A., Frangakis, C. & Leaf, P. J. Multiple imputation by chained equations: what is it and how does it work? Int. J. Meth. Psychiatr. Res. 20, 40–49 (2011).
Nelson, J. et al. MOSAIC: a chemical-genetic interaction data repository and web resource for exploring chemical modes of action. Bioinformatics 34, 1251–1252 (2017).
Wawer, M. J. et al. Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling. Proc. Natl Acad. Sci. USA 111, 10911–10916 (2014).
Brown, A. S. & Patel, C. J. A standard database for drug repositioning. Sci. Data 4, 170029 (2017).
Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).
Kuhn, M., Letunic, I., Jensen, L. J. & Bork, P. The SIDER database of drugs and side effects. Nucleic Acids Res. 44, D1075–D1079 (2016).
Kuhn, M. et al. Systematic identification of proteins that elicit drug side effects. Mol. Syst. Biol. 9, 663 (2013).
Duran-Frigola, M. & Aloy, P. Analysis of chemical and biological features yields mechanistic insights into drug side effects. Chem. Biol. 20, 594–603 (2013).
Davis, A. P. et al. The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res. 45, D972–D978 (2017).
Ryu, J. Y., Kim, H. W. & Lee, S. Y. Deep learning improves prediction of drug–drug and drug–food interactions. Proc. Natl Acad. Sci. USA 115, 4304–4311 (2018).
Grover, A. & Leskovec, J. node2vec: scalable feature learning for networks. Preprint at https://arxiv.org/abs/1607.00653 (2016).
Matsui, Y., Ogaki, K., Yamasaki, T. & Aizawa, K. PQk-means: billion-scale clustering for product-quantized codes. Preprint at https://arxiv.org/abs/1709.03708 (2017).
Maaten, L. v. d. Barnes–Hut-SNE. Preprint at https://arxiv.org/abs/1301.3342 (2013).
McInnes, L. & Healy, J. Accelerated hierarchical density based clustering. Proc. 2017 IEEE International Conference on Data Mining Workshops (IEEE, 2017).
Webber, W., Moffat, A. & Zobel, J. A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28, 1–38 (2010).
Lo, Y. C. et al. Large-scale chemical similarity networks for target profiling of compounds identified in cell-based chemical screens. PLoS Comput. Biol. 11, e1004153 (2015).
Rennie, J. D. M., Shih, L., Teevan, J. & Karger, D. R. Tackling the poor assumptions of naive Bayes text classifiers. Proc. International Conference on International Conference on Machine Learning 616–623 (AAAI Press, 2003).
Irwin, J. J. & Shoichet, B. K. ZINC–a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182 (2005).
Fernandez-Torras, A., Duran-Frigola, M. & Aloy, P. Encircling the regions of the pharmacogenomic landscape that determine drug response. Genome Med. 11, 17 (2019).
Badia, R. et al. SAMHD1 is active in cycling cells permissive to HIV-1 infection. Antiviral Res. 142, 123–135 (2017).
Saxena, V., Orgill, D. & Kohane, I. Absolute enrichment: gene set enrichment analysis for homeostatic systems. Nucleic Acids Res. 34, e151 (2006).
Acknowledgements
We thank the SB&NB laboratory members for their support and helpful discussions. We are grateful to the Broad Institute and National Center for Advancing Translational Sciences (NCATS-NIH) for providing compounds on request, and J. Duran-Frigola for the website design. We also thank the IRB Barcelona Biostatistics and Bioinformatics Unit and the IRB Functional Genomics Facility. P.A. acknowledges the support of the Spanish Ministerio de Economía y Competitividad (grant no. BIO2016-77038-R), the INB/ELIXIR-ES (grant no. PT17/0009/0007), the European Research Council (SysPharmAD, grant no. 614944) and ‘La Caixa’ BioMedTec (grant no. CTEC_15).
Author information
Contributions
M.D.-F., E.P. and P.A. designed the study, analyzed the results and wrote the manuscript. M.D.-F. did the computational analysis, together with M.B., T.J.-B. and D.A. O.G.-P. implemented the web server. E.P. and V.A. carried out the experimental validations. All authors have read and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Data 1 and 2 legends, Supplementary Figs. 1–17 and Supplementary Tables 1–3.
Supplementary Data 1
Reversion of transcriptional signatures of fAD mutations.
Supplementary Data 2
Small-molecule analogs of biologics.
About this article
Cite this article
Duran-Frigola, M., Pauls, E., Guitart-Pla, O. et al. Extending the small-molecule similarity principle to all levels of biology with the Chemical Checker. Nat Biotechnol 38, 1087–1096 (2020). https://doi.org/10.1038/s41587-020-0502-7
DOI: https://doi.org/10.1038/s41587-020-0502-7