The genome of a eukaryotic cell is vulnerable to both intrinsic and extrinsic threats owing to its constant exposure to a myriad of heterogeneous compounds. Despite the availability of innate DNA damage responses, some genomic lesions trigger malignant transformation of cells. Accurate prediction of carcinogens remains challenging owing to the limited information about bona fide (non-)carcinogens. We developed Metabokiller, an ensemble classifier that accurately recognizes carcinogens by quantitatively assessing their electrophilicity and their potential to induce proliferation, oxidative stress, genomic instability, epigenome alterations, and anti-apoptotic response. In addition to predicting carcinogenicity, Metabokiller is fully interpretable and outperforms existing best-practice methods for carcinogenicity prediction. Metabokiller identified potentially carcinogenic human metabolites. To cross-validate Metabokiller predictions, we performed multiple functional assays using Saccharomyces cerevisiae and human cells with two Metabokiller-flagged human metabolites, namely 4-nitrocatechol and 3,4-dihydroxyphenylacetic acid, and observed strong agreement between Metabokiller predictions and experimental validations.
The raw RNA sequencing files are available at ArrayExpress under accession E-MTAB-11179. The processed datasets, which detail the compound SMILES, compound names, PubChem IDs, InChIs, bioactivity status and source information, are accessible from GitHub at https://github.com/the-ahuja-lab/Metabokiller/tree/main/datasets and from Zenodo at https://doi.org/10.5281/zenodo.6683106. Source data are provided with this paper.
A Python package for Metabokiller is available from PyPI at https://pypi.org/project/Metabokiller/, from the project GitHub page at https://github.com/the-ahuja-lab/Metabokiller and from Zenodo at https://doi.org/10.5281/zenodo.6683106. Code used for building the machine-learning models is provided on the project GitHub page.
The authors would like to thank the IT-HelpDesk team of IIIT-Delhi for providing assistance with the computational resources. We thank all the members of the Ahuja lab for their intellectual contributions at various stages of this project. We also thank K. Datta for providing critical insights into this study and K. Chakraborty for sharing yeast strains. The Ahuja lab is supported by the Ramalingaswami Re-entry Fellowship (BT/HRD/35/02/2006), a re-entry scheme of the Department of Biotechnology, Ministry of Science & Technology, Government of India, Start-Up Research Grant (SRG/2020/000232) from the Science and Engineering Research Board and an intramural Start-up grant from Indraprastha Institute of Information Technology-Delhi. The Sengupta lab is funded by the INSPIRE faculty grant from the Department of Science & Technology, India.
A provisional patent has been filed (reference no. 202111052929, application no. TEMP/E-1/60118/2021-DEL) describing the computational architecture of Metabokiller. The Metabokiller Python package is free to use for academic institutions and for any academic-related project; for commercial usage, users must contact the authors.
Peer review information
Nature Chemical Biology thanks Michael Fasullo, Hongsheng Liu and Stefano Monti for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Schematic representation depicting the step-by-step workflow used to build all six individual biochemical models and the ensemble model (Metabokiller). An up/downsampling approach was used to counteract the class imbalance. The Signaturizer library was used to generate bioactivity features. Hyperparameter tuning was performed to obtain the best-performing model parameters. The ensemble model (Metabokiller) was built using biochemical features of experimentally validated carcinogens/non-carcinogens generated using the six models. The majority voting method was used to assign the final carcinogenicity status.
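The majority-voting step can be illustrated with a minimal sketch. It assumes the vote is taken directly over six per-property probabilities, which is only one reading of the caption; the property keys, the 0.5 threshold and the tie rule are hypothetical choices made for illustration and are not taken from the Metabokiller package.

```python
from typing import Mapping

# The six properties named in the abstract; the key spellings are
# hypothetical labels chosen for this sketch.
PROPERTIES = (
    "electrophilicity",
    "proliferation",
    "oxidative_stress",
    "genomic_instability",
    "epigenetic_alterations",
    "anti_apoptotic_response",
)


def majority_vote(probabilities: Mapping[str, float], threshold: float = 0.5) -> str:
    """Label a compound from six per-property probabilities by majority vote."""
    missing = [name for name in PROPERTIES if name not in probabilities]
    if missing:
        raise ValueError(f"missing property scores: {missing}")
    # Each property casts one vote when its probability reaches the threshold.
    votes = sum(probabilities[name] >= threshold for name in PROPERTIES)
    # Assumption: a strict majority is required, so a 3-3 tie is
    # resolved as non-carcinogen.
    return "carcinogen" if votes > len(PROPERTIES) / 2 else "non-carcinogen"
```

With six voters a tie is possible, so any real implementation needs an explicit tie rule; the one used here is a placeholder.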
(a) Box plot depicting the AUCROC values of the bootstrapping (n = 20 repetitions) of the indicated models. (b-f) Box plots depicting the AUCROC, accuracy, F1 Score, precision, and recall of the indicated models as inferred from the 10-fold cross-validation. (g) Box plot depicting the model performance of the twenty Gradient Boosting Machine (GBM)-based models generated using bootstrapping technique (n = 20 repetitions). (h) Variables factor map (PCA) depicting the direction and contribution of all the six variables (individual models) representing the experimentally validated carcinogens (MKETn) in the Eigenspace. (i) Principal Component Analysis revealing the chemical heterogeneity between the carcinogens and non-carcinogens in the indicated datasets. The heatmap at the bottom depicts the relative enrichment of the indicated functional groups (RNH2: primary amine, R2NH: secondary amine, R3N: tertiary amine, ROPO3: monophosphate, ROH: alcohol, RCHO: aldehyde, RCOR: ketone, RCOOH: carboxylic acid, RCOOR: ester, ROR: ether, RCCH: terminal alkyne, RCN: nitrile) in both classes. (j) Bar graphs depicting the accuracy of Metabokiller on the indicated unseen datasets. In the box plots, center lines represent the medians; box limits indicate the 25th and 75th percentiles as determined by R software (ggplot2); whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles; outliers are represented by dots.
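The box-plot convention described in this caption (median, 25th and 75th percentiles, whiskers within 1.5 times the interquartile range, remaining points as outliers) can be written out as a short sketch. The function and field names are hypothetical, and the quantile rule is assumed to be linear interpolation, the default in R; the whiskers are assumed to end at the most extreme data point inside the fences.

```python
from typing import List, NamedTuple, Sequence


class BoxStats(NamedTuple):
    median: float
    q1: float
    q3: float
    lower_whisker: float
    upper_whisker: float
    outliers: List[float]


def quantile(sorted_values: Sequence[float], p: float) -> float:
    """Quantile by linear interpolation between neighbouring order statistics."""
    position = (len(sorted_values) - 1) * p
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def box_stats(values: Sequence[float]) -> BoxStats:
    """Summary statistics for one box in a box plot."""
    if not values:
        raise ValueError("at least one value is required")
    data = sorted(values)
    q1, med, q3 = quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return BoxStats(med, q1, q3, inside[0], inside[-1], outliers)
```

For example, `box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])` places the upper whisker at 9 and reports 100 as the only outlier.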
(a) Heatmap depicting the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions on the Independent Dataset (I.D.) for indicated methods/tools. (b) Venn diagram depicting predicted carcinogenic human metabolites, further segregated based on prediction probability cutoffs. (c) Variables factor map depicting the contribution of all the six individual models in predicting carcinogenic metabolites from HMDB (probability cutoff ≥ 0.5). (d) Projection of the predicted carcinogens (indicated as red dots; probability cutoff ≥ 0.7) on the human metabolic space, achieved using iPath Web Server. (e) Schematic representation of the steps involved in processing pan-cancer metabolomics dataset. Of note, Pearson correlation was computed between log2 fold change (tumor vs healthy) and biochemical/carcinogenicity probabilities. (f) Heatmap detailing the correlation values further segregated based on cancer type. (g) Volcano plots depicting the differentially enriched/de-enriched metabolites in the indicated cancer datasets. Gray dots highlight the metabolites that do not qualify for the enrichment cutoff (log2 fold change ≥ 1 or ≤ -1, and p-value (adjusted) < 0.05), and green and red dots represent the metabolites that qualify for the enrichment cutoff and are predicted as non-carcinogenic and carcinogenic by Metabokiller respectively. The p-value was computed using two-sided Mann–Whitney U test and corrected using Benjamini-Hochberg method. (h) Structural information of some of the well-characterized oncometabolites reported in the literature and predicted by Metabokiller.
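The enrichment cutoff in panel (g) combines a Benjamini-Hochberg adjustment with a fold-change filter. The sketch below shows how such a rule might be applied; the function names are hypothetical, the input p-values would come from the Mann–Whitney U test mentioned in the caption, and the code is not taken from the authors' pipeline.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def passes_cutoff(log2_fc: float, adj_p: float,
                  fc_cutoff: float = 1.0, alpha: float = 0.05) -> bool:
    """Enrichment rule from the caption: |log2 fold change| >= 1 and adjusted p < 0.05."""
    return abs(log2_fc) >= fc_cutoff and adj_p < alpha


def volcano_color(log2_fc: float, adj_p: float, predicted_carcinogen: bool) -> str:
    """Dot color as described in the caption (gray, red or green)."""
    if not passes_cutoff(log2_fc, adj_p):
        return "gray"
    return "red" if predicted_carcinogen else "green"
```

For raw p-values `[0.01, 0.04, 0.03, 0.005]`, the adjusted values are approximately `[0.02, 0.04, 0.04, 0.02]`.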
(a) Schematic representation highlighting the predicted-carcinogenic metabolic intermediates of the tyrosine metabolism pathway and aminobenzoate degradation pathway. (b) Box plots depicting the fluorescence intensity of propidium iodide staining indicating cell viability in the indicated conditions (n = 8 biological replicates) after 9 hours (left) and 12 hours (right) of treatment. Of note, heat-killed (HK) yeast cells were used as a positive control. Two-sided Mann–Whitney U test was used to compute statistical significance between the test conditions and the negative control. For left panel, the p-values are 0.0009 (HK); for 4NC: 0.96 (0.1 µM), 0.87 (1 µM), and 0.02 (10 µM); for DP: 0.59 (0.1 µM), 0.64 (1 µM), and 0.83 (10 µM). For right panel, the p-values are 0.0009 (HK); for 4NC: 0.63 (0.1 µM), 0.75 (1 µM), and 0.2 (10 µM); for DP: 0.42 (0.1 µM), 0.26 (1 µM), and 0.17 (10 µM). (c) Growth curve profiles of the treated and untreated wild-type yeast during transient exposure with the indicated conditions (n = 8 biological replicates with technical duplicates). Data points represent mean ± SD. Two-sided Student’s t-test was used to compute statistical significance between the positive (H2O2 treated yeast cells) and negative control (untreated yeast cells). The p-values are 0.9 (0 hrs), 1.5 × 10−6 (1.5 hrs), 4.85 × 10−6 (3 hrs), 4.45 × 10−16 (4.5 hrs), 1.62 × 10−10 (6 hrs), 2.27 × 10−18 (7.5 hrs), 6.41 × 10−13 (9 hrs), 1.04 × 10−23 (10.5 hrs), 5.82 × 10−34 (12 hrs). (d) Box plot depicting the results of reactive oxygen species (ROS) levels inferred using DCFH-DA dye-based assay in the indicated conditions (n = 8 biological replicates). Of note, ROS levels were measured 12 hours post-incubation. Notably, hydrogen peroxide (H2O2) treated yeast cells were used as a positive control. Two-sided Mann–Whitney U test was used to compute statistical significance between the test conditions and the negative control. 
The p-values are 0.003 (H2O2); for 4NC: 0.069 (0.1 µM), 0.1 (1 µM), and 0.001 (10 µM); for DP: 0.016 (0.1 µM), 0.087 (1 µM), and 0.07 (10 µM). The p-value cutoff for all the plots is 0.05. *, **, ***, and **** refer to p-values <0.05, <0.01, <0.001, and <0.0001, respectively. In the box plots, center lines show the medians; box limits indicate the 25th and 75th percentiles; whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles; outliers are represented by dots.
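The asterisk notation in this caption maps directly onto p-value thresholds, as the following sketch shows. The function name and the "ns" label for non-significant results are assumptions added for illustration.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the asterisk notation defined in the caption."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie between 0 and 1")
    # Check the strictest threshold first so the most stars win.
    for cutoff, stars in ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")):
        if p_value < cutoff:
            return stars
    # Assumption: results at or above the 0.05 cutoff are labeled "ns".
    return "ns"
```

Applied to values reported in the caption, 0.0009 (heat-killed control) maps to `***`, 0.02 (4NC, 10 µM, 9 hours) maps to `*`, and 0.96 maps to `ns`.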
(a) Bar plots depicting the total read counts (in millions) of the indicated RNA sequencing samples. (b) Box plot representing the distribution of the transformed read count data in the indicated conditions (n = 3 biological replicates). (c) Correlation plot showing the relationship between the individual RNA sequencing samples. Of note, 75% of the normalized and transformed data was used for the correlation analysis. (d-e) Box plots depicting the relative log expression of the 3 biological replicates of the indicated conditions before and after upper quantile normalization. (f) Volcano plot indicating the differentially expressed genes between the treated (metabolite treatment) and untreated conditions. The p-value was computed using the Wald test and corrected using the Benjamini-Hochberg method. (g) Metascape-based Functional Gene Ontology analysis identified the involvement of differentially expressed genes in the indicated prominent biological processes. (h) Schematic representation depicting the genomic alterations in the CAN1 gene in the indicated replicates. In the box plots, center lines represent the medians; box limits indicate the 25th and 75th percentiles; whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles; outliers are represented by dots.
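Panels (d-e) compare relative log expression (RLE) before and after normalization. The sketch below assumes that "upper quantile normalization" means scaling each sample by the 75th percentile of its nonzero counts, and that RLE is each gene's log2 expression minus that gene's median across samples. The function names and the pseudocount of 1 are hypothetical, and the authors' actual tooling is not stated in this section.

```python
import math
from statistics import median, quantiles
from typing import Dict, List


def upper_quartile_scale(counts: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Divide each sample's counts by the 75th percentile of its nonzero counts."""
    scaled = {}
    for sample, values in counts.items():
        nonzero = [v for v in values if v > 0]
        # quantiles() needs at least two points; index 2 is the 75th percentile.
        upper = quantiles(nonzero, n=4, method="inclusive")[2]
        scaled[sample] = [v / upper for v in values]
    return scaled


def relative_log_expression(counts: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Per-gene log2 expression minus the gene's median across samples."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Assumption: a pseudocount of 1 avoids taking the log of zero.
    logged = {s: [math.log2(v + 1) for v in counts[s]] for s in samples}
    gene_medians = [median(logged[s][g] for s in samples) for g in range(n_genes)]
    return {s: [logged[s][g] - gene_medians[g] for g in range(n_genes)] for s in samples}
```

If two samples differ only by sequencing depth, for example `{"A": [10, 20, 30, 40], "B": [20, 40, 60, 80]}`, their RLE values are nonzero before scaling and collapse to zero after `upper_quartile_scale`, which is the pattern the before/after box plots are meant to reveal.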
Statistical source data for Fig. 1.
Statistical source data for Fig. 2.
Statistical source data for Fig. 3.
Statistical source data for Fig. 4.
Statistical source data for Fig. 5.
Statistical source data for Extended Data Fig. 2.
Statistical source data for Extended Data Fig. 3.
Statistical source data for Extended Data Fig. 4.
Statistical source data for Extended Data Fig. 5.
About this article
Cite this article
Mittal, A., Mohanty, S.K., Gautam, V. et al. Artificial intelligence uncovers carcinogenic human metabolites. Nat Chem Biol (2022). https://doi.org/10.1038/s41589-022-01110-7