Screening agrochemicals and pharmaceuticals for potential liver toxicity is required for regulatory approval and is an expensive and time-consuming process. The identification and utilization of early exposure gene signatures and robust predictive models in regulatory toxicity testing has the potential to reduce time and costs substantially. In this study, comparative supervised machine learning approaches were applied to the rat liver TG-GATEs dataset to develop feature selection and predictive testing. We identified ten gene biomarkers using three different feature selection methods that predicted liver necrosis with high specificity and selectivity in an independent validation dataset from the Microarray Quality Control (MAQC)-II study. Nine of the ten genes that were selected with the supervised methods are involved in metabolism and detoxification (Car3, Crat, Cyp39a1, Dcd, Lbp, Scly, Slc23a1, and Tkfc) and transcriptional regulation (Ablim3). Several of these genes are also implicated in liver carcinogenesis, including Crat, Car3 and Slc23a1. Our biomarker gene signature provides high statistical accuracy and a manageable number of genes to study as indicators to potentially accelerate toxicity testing based on their ability to induce liver necrosis and, eventually, liver cancer.
Pathological and biochemical data in non-human mammals are used extensively by the agrochemical and pharmaceutical sectors for assessing mammalian toxicity and effects on human health of molecular innovations. This effort is extensive; in addition to other cost and effort, required mammalian toxicity assessment packages can use ~ 6000 animals per molecule studied. Despite such careful screening, major setbacks to pharmaceutical product development pipelines still result where human toxicity is detected during late stages. When toxicity is not determined in this testing, a danger to public health arises if adverse effects on humans are only observed in the population after years of deployment. These risks can be greatly mitigated if early biomarkers of eventual toxicity can be found. Toxicogenomics, or the application of genomics methods to predict adverse effects of exogenous molecule exposure1, is gaining popularity with advances in computing and availability of curated data sets. Toxicogenomics databases have been designed and, through rigorous experiments on rat and human cell models, provide an avenue to understand the molecular basis of adverse conditions due to chemical toxicant exposures. Computational methods provide an opportunity to develop this much-desired capability2. These methods are relatively low cost to develop and test, can expedite data analysis, can reduce cost by reducing the scale of animal studies, and can reduce time to market for a safe product.
Toxicogenomics analyses are commonly categorized in the big data paradigm because of the large number of gene profiles that arise from a small number of samples, hence the need for data reduction tools. Classical statistical methods of identifying differentially expressed genes from microarray or RNA sequencing data result in lists comprising thousands of genes, which is not ideal for laboratory testing. Machine learning approaches such as feature selection and classification often use robust statistical modeling to reduce the number of features or variables used in the models3,4,5,6. Feature selection and classification can both be achieved by supervised methods for classification or by unsupervised learning methods5 that are primarily used for discovery.
Studies have shown that the use of supervised classification predictive models can help to find discriminative gene signatures across multiple platforms of microarray data3,4,5,6. Previously, several studies have used machine learning methods for prediction of biological end points7,8,9. Despite many attempts in the field, however, predictive ability remains relatively poor due to systematic noise associated with design of gene expression experiments10, high number of features in the signature, low predictive performance11,12,13,14, or poor performance of identified biomarkers at validation stage15. Innovations in data analysis pipeline design and modeling are still sorely needed.
The goal of this study was to construct a suitable modeling framework based on machine learning for feature selection, feature ranking, and predictive analysis applicable to liver toxicity. The developed framework was applied to the TG-GATES data set to select and rank the gene expression features that can serve as biomarkers for liver toxicity in rats16. After determining these features, a set of predictive models were optimized. Finally, the model was applied to untrained MAQC-II data to evaluate liver toxicity predictions17,18. The targeted conclusion of our study was to determine a small set of genes that successfully predicted liver necrosis and could be used for predictive testing in animals.
Gene expression data for male rat in vivo experimental models, generated on Affymetrix microarray chips, were obtained from the TG-GATES database (https://dbarchive.biosciencedbc.jp/en/open-tggates/data-2.html). The in vivo models were categorized by whole organism outcomes of exposure related to cellular injury19,20. The treatments included 42 chemical compounds (Table 1, Supplementary Fig. 1A) at control, low, middle, and high dose levels and 8 time points: 3 h, 6 h, 9 h and 24 h for single dose; and 4 days, 8 days, 15 days and 29 days for repeat dose. In the single dose experiments, groups of 20 animals were administered a compound and five animals per time point were then sacrificed 3, 6, 9 or 24 h after administration (Supplementary Fig. 1B)16. Livers were harvested at the indicated time points. RNA was isolated, and gene expression patterns were analyzed using the common array platform, the Affymetrix Rat 230 2.0 microarray, which contains probes for 31,099 genes.
Data from the Microarray Quality Control Project (MAQC-II) were used for validation and for assessing classification performance of the top selected features17. From the six datasets, we focused on the National Institute of Environmental Health Sciences (NIEHS) data set for validation since it pertains to the toxic effects of chemicals on the liver. The study was similar to TG-GATES in that it used microarray gene expression data acquired from the livers of rats exposed to various hepatotoxicants. Gene expression data, collected from 418 rats exposed to one of eight compounds (1,2-dichlorobenzene, 1,4-dichlorobenzene, bromobenzene, monocrotaline, N-nitrosomorpholine, thioacetamide, galactosamine, and diquat dibromide), were used to build classifiers for prediction of liver necrosis. Each of the eight compounds was studied and analyzed using the same array platform (Affymetrix Rat 230 2.0 microarray) and the same data retrieval and analysis processes. Similar to the TG-GATES studies, four to six male, 12-week-old F344 rats were treated with a low, mid, or high dose of the toxicant and sacrificed 6, 24 or 48 h later. At necropsy, liver was harvested for RNA extraction, histopathology, and clinical chemistry assessments17.
Normalization and initial feature reduction by differential gene expression
To select the best dose and the earliest time point of liver toxicant exposure, ethinyl estradiol (EE) data were used as described before21,22,23. Briefly, EE treatment data from the common array platform, the Affymetrix Rat 230 2.0 microarray, which reports expression values for 31,099 genes, were obtained from the TG-GATES database. Data were normalized using the robust multi-array average (RMA) expression measure (Affy (v 1.57.0) package from Bioconductor)24,25. RMA was calculated on raw microarray gene expression values under standard normalization options (https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/affy/html/AffyBatch-class.html). After normalization, the data were centered and scaled for differential expression analysis. To identify differentially expressed genes upon EE treatment, statistical analyses were performed on normalized gene expression values from dose response and time course data using the Limma (v 3.34.9) package from Bioconductor26,27. Design matrices, constructed in R, identified the coefficients of interest, specifically high dose treatments (denoted with 1) and control dose treatments (denoted with −1). Gene expression data were first fitted to a multiple linear model based on the design matrix. The fitted linear model was then passed to an empirical Bayes step with the contrast matrix representing the differences between high and control doses for each molecule26,28,29. T-statistics and F-statistics were computed from the model. Features with p-value < 0.05 were retained for the subsequent feature selection methods. The resulting differentially expressed gene list was used to perform hierarchical clustering using Cluster 3 software30. Clustered data were visualized using Java TreeView (https://jtreeview.sourceforge.net/). Gene set enrichment analysis software was used to identify enriched functional gene groupings31,32. Principal component analysis was performed using StrandNGS (Version 3.1.1, Bangalore, India).
Graphs for biochemical analysis (blood alkaline phosphatase levels, total bilirubin, body weight, liver weight and triglyceride levels) and average gene expression values were plotted using GraphPad Prism 8 software (GraphPad Software Inc., La Jolla, CA, www.graphpad.com).
To prepare data for feature selection and classification using machine learning, microarray data (Affymetrix Rat 230 2.0) for compounds that induce necrosis were obtained from the TG-GATES database and the MAQC-II project. To avoid batch effects, data were normalized using the robust multi-array average (RMA) expression measure (Affy (v 1.57.0) package from Bioconductor)24,25. RMA was calculated on raw microarray gene expression values under standard normalization options (https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/affy/html/AffyBatch-class.html). After normalization, the data were centered and scaled for gene expression analysis.
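As an illustration of the high-versus-control contrast, the following minimal Python sketch fits an ordinary per-gene linear model with the +1/−1 dose coding and keeps genes with p < 0.05. It is not the Limma workflow used in the study: the empirical Bayes moderation of the variances is omitted, and the function name `high_vs_control_tests` and the synthetic data are assumptions made for this example only.

```python
import numpy as np
from scipy import stats


def high_vs_control_tests(expr, dose_code):
    """Per-gene ordinary least-squares test of the high-vs-control effect.

    expr      : (n_samples, n_genes) normalized, centered and scaled expression
    dose_code : (n_samples,) vector, +1 for high dose and -1 for control
    Returns (t, p): per-gene t statistics and two-sided p-values.
    """
    expr = np.asarray(expr, dtype=float)
    dose_code = np.asarray(dose_code, dtype=float)
    n_samples = expr.shape[0]

    # Design matrix: intercept column plus the +1/-1 dose column.
    design = np.column_stack([np.ones(n_samples), dose_code])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)

    # Residual variance per gene (no empirical Bayes shrinkage in this sketch).
    resid = expr - design @ coef
    dof = n_samples - design.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof

    # Standard error of the dose coefficient for every gene.
    unscaled_var = np.linalg.inv(design.T @ design)[1, 1]
    t = coef[1] / np.sqrt(sigma2 * unscaled_var)
    p = 2.0 * stats.t.sf(np.abs(t), dof)
    return t, p


# Synthetic demo: 5 control and 5 high-dose samples, 50 genes, gene 0 responsive.
rng = np.random.default_rng(0)
dose = np.array([-1] * 5 + [1] * 5)
demo_expr = rng.normal(size=(10, 50))
demo_expr[dose == 1, 0] += 5.0

t_stats, p_vals = high_vs_control_tests(demo_expr, dose)
selected = np.flatnonzero(p_vals < 0.05)  # genes passed on to feature selection
```

With this coding, the dose coefficient equals half the difference between the high-dose and control means, and the resulting t statistic coincides with a pooled-variance two-sample t-test. Limma's moderated statistics would differ because they borrow variance information across genes.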
Feature selection and comparative supervised machine learning
To assess the hypothesis that an early exposure gene signature is associated with liver toxicity, we applied a methodology33,34 that combines traditional statistical modeling with machine learning methods to perform predictor selection and ranking. These selected biomarkers formed the inputs for an integrative modeling process to determine the performance of significant markers for classification.
First, to determine a gene feature’s measure of importance in predicting the necrosis response we used a set of feature selection approaches (marginal screening, embedded, and wrapper) on all predictors (i.e. genes and liver phenotypes)39 and an empirical ranking score based on the feature importance measure34,40. Methods for feature selection included Mann–Whitney, t-test, DCor as marginal screening methods; Boruta, RFE with both RF and SVM as wrapper methods; and RF, Elastic Net, Lasso, Ridge Regression Cross Validation (RidgeCV) and SVM as embedded methods. For each approach, the top N features were noted and utilized in the outer cross-validation loop of the integrative modeling process. Most algorithms are part of scikit-learn, scipy, and BorutaPy packages.
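A marginal screening step of the kind listed above can be sketched as follows. This is a minimal example that assumes a samples-by-genes matrix and binary necrosis labels; the helper name `rank_genes_mann_whitney` and the synthetic data are illustrative and are not taken from the study's repository.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def rank_genes_mann_whitney(expr, labels, top_n=10):
    """Rank genes by two-sided Mann-Whitney p-value (smallest first).

    expr   : (n_samples, n_genes) expression matrix
    labels : (n_samples,) binary vector, 1 = necrosis, 0 = no necrosis
    Returns the column indices of the top_n genes and all p-values.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    pos, neg = expr[labels == 1], expr[labels == 0]

    # One test per gene; each gene is screened on its own (marginal screening).
    pvals = np.array([
        mannwhitneyu(pos[:, j], neg[:, j], alternative="two-sided").pvalue
        for j in range(expr.shape[1])
    ])
    order = np.argsort(pvals, kind="stable")
    return order[:top_n], pvals


# Synthetic demo: 20 necrosis and 20 control samples, genes 3 and 7 informative.
rng = np.random.default_rng(1)
demo_labels = np.array([1] * 20 + [0] * 20)
demo_expr = rng.normal(size=(40, 200))
demo_expr[:20, [3, 7]] += 3.0

top_genes, all_pvals = rank_genes_mann_whitney(demo_expr, demo_labels, top_n=10)
```

Because each gene is tested independently of the others, this filter is fast on tens of thousands of probes but cannot account for redundancy between correlated genes, which is one reason to compare it against wrapper and embedded methods.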
Cross-validation (out-of-sample testing) was used to obtain the rankings by assessing every feature’s predictive power on unseen data41,42, with all samples from the same compound grouped together in the same fold and with a validation set43. Models were built for each feature selection approach and each predictive modeling approach. Predictive statistics were gathered, as well as receiver operating characteristic (ROC) curves for each combination, to visualize the classification performance (true positive rate vs. false positive rate) of the classifiers. Predictive modeling approaches included logistic regression, RF, support vector machine (SVM), Lasso and ElasticNet36,44,45,46. We built models incrementally from one feature to 100 features to understand the tradeoffs involved in choosing a cutoff for the number of features N to select.
Parameter tuning and performance evaluation were performed using the MAQCII-NIEHS (GSE16716) dataset as the validation set, utilizing the area under the cross-validated ROC curve (AUC) as a quantitative performance metric. For parameter tuning, we tested the tree depth of Boruta at 4, 5, 6, and no limit. We chose to focus on a depth of 4 to avoid overfitting. We experimented with alpha values for Elastic Net and Lasso using the scikit-learn GridSearchCV, which selects the best performing parameters. In addition, we experimented with the C value for SVC. For the rest of the algorithms, default parameters were used. All parameters are listed in Table 2. Cross-validation47 partitions the samples into training and testing sets and proceeds by fitting the model on the training set and evaluating the AUC on the testing set. Repeating the procedure independently, we summarized the AUCs of all iterations for comparison48. To compare the performance of the developed classification model using gene biomarkers with that of the traditional diagnostic model, we obtained the AUC measures from all models over all randomization runs and performed a two-sample t-test to detect differences. For each combination of feature selection and classification methods, we reported the area under the curve (AUC), F-statistics and MCC49 (Table 3). Results were visualized using Tableau software (Seattle, WA, USA, https://www.tableau.com/).
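The grouped cross-validation and the incremental model building can be sketched together as below. This is a minimal example under stated assumptions: it uses scikit-learn's `GroupKFold` with compound identifiers as groups, logistic regression as the only classifier, and columns that are already sorted by feature rank. The function name `auc_by_feature_count` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold


def auc_by_feature_count(X_ranked, y, groups, max_features=100, n_splits=10):
    """Mean cross-validated AUC when using the top 1, 2, ..., k ranked features.

    X_ranked : (n_samples, n_features), columns sorted from most to least important
    y        : (n_samples,) binary labels
    groups   : (n_samples,) compound identifier of every sample
    """
    X_ranked, y, groups = np.asarray(X_ranked), np.asarray(y), np.asarray(groups)
    cv = GroupKFold(n_splits=n_splits)  # a compound never spans two folds
    aucs = {}
    for k in range(1, min(max_features, X_ranked.shape[1]) + 1):
        fold_scores = []
        for train_idx, test_idx in cv.split(X_ranked, y, groups):
            if len(np.unique(y[test_idx])) < 2:
                continue  # AUC is undefined when a fold holds a single class
            model = LogisticRegression(max_iter=1000)
            model.fit(X_ranked[train_idx, :k], y[train_idx])
            prob = model.predict_proba(X_ranked[test_idx, :k])[:, 1]
            fold_scores.append(roc_auc_score(y[test_idx], prob))
        aucs[k] = float(np.mean(fold_scores))
    return aucs


# Synthetic demo: 20 compounds, 3 control and 3 treated samples per compound.
rng = np.random.default_rng(0)
demo_groups = np.repeat(np.arange(20), 6)
demo_y = np.tile([0, 0, 0, 1, 1, 1], 20)
demo_X = rng.normal(size=(120, 5))
demo_X[:, 0] += 2.0 * demo_y  # only the top-ranked feature carries signal

demo_aucs = auc_by_feature_count(demo_X, demo_y, demo_groups,
                                 max_features=5, n_splits=5)
```

Plotting the returned AUC against the number of features gives a curve from which a cutoff can be read. For an unbiased estimate, the feature ranking itself would also need to be recomputed inside each training fold, which this sketch leaves out for brevity.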
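A grid search over the SVC regularization parameter C might look like the following sketch. The grid values, the linear kernel, and the use of grouped cross-validation inside the search are assumptions for illustration; the parameters actually used in the study are those listed in Table 2, and the study tuned against the MAQC-II validation set rather than internal folds.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC


def tune_svc_c(X, y, groups, c_grid=(0.01, 0.1, 1.0, 10.0), n_splits=5):
    """Pick the SVC C value with the best mean cross-validated AUC."""
    search = GridSearchCV(
        estimator=SVC(kernel="linear"),       # kernel choice is an assumption
        param_grid={"C": list(c_grid)},       # hypothetical grid
        scoring="roc_auc",
        cv=GroupKFold(n_splits=n_splits),     # keep each compound in one fold
    )
    search.fit(X, y, groups=groups)
    return search.best_params_["C"], search.best_score_


# Synthetic demo: 20 compounds with 6 samples each, one informative feature.
rng = np.random.default_rng(0)
demo_groups = np.repeat(np.arange(20), 6)
demo_y = np.tile([0, 0, 0, 1, 1, 1], 20)
demo_X = rng.normal(size=(120, 5))
demo_X[:, 0] += 2.0 * demo_y

best_c, best_auc = tune_svc_c(demo_X, demo_y, demo_groups)
```

The same pattern applies to the alpha parameter of Lasso and Elastic Net by swapping the estimator and the parameter grid.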
Identification of dose and time-point to perform the feature selection
To select the dose and time point for deriving a gene signature, we utilized the ethinyl estradiol (EE) dataset (Fig. 1A), as prolonged EE exposure causes hepatocellular carcinoma in rats. The glucuronide metabolite of EE is known to cause cholestatic hepatotoxicity by changing expression of ABCB11 and ABCC2 and disrupting bile flow and bile salt excretion50. In the TG-GATES data set, high-dose EE treatment caused a statistically significant change in clinical pathology parameters such as alkaline phosphatase by day 4, and total bilirubin levels by week 2 (Fig. 1B)51. Statistically significant body weight, liver weight and triglyceride changes were not detected until day 4 of the high-dose EE treatment (Fig. 1C). Pathology analysis of hematoxylin and eosin (HE) images of liver samples showed that EE exposure resulted in hepatocyte necrosis, centrolobular hypertrophy, sinusoid dilatation, Kupffer cell proliferation and eosinophilic infiltration in the periportal region. Necrosis was the only apical change that was common to livers exposed to any of the three doses at the earlier time point (4 days) (Table 4). We decided to focus on necrosis as an end-point, since it is predictive of liver carcinogenesis52. Next, we analyzed the dose response of gene expression across different time points (24 h, 4, 8 and 29 days), which showed that the manifestation of clinical pathologic indicators of liver damage, metabolic changes, and liver necrosis after high-dose EE exposure at the earlier time point was consistent with gene expression. Many genes were up- or downregulated in the liver by the high-dose EE exposure at all time points assayed (Fig. 1D). Based on these observations, we focused on the high-dose exposure data to identify time points that would give us an early gene expression signature.
To identify the earliest time point data to be used in feature selection, we utilized 3, 6, 9, and 24 h and the 4, 8, 15 and 29 days’ time-points. Hierarchical clustering of 1387 differentially expressed genes identified eight clusters with distinct gene expression kinetics and function (C1–8, Fig. 2A–C and Supplementary Fig. 2). C1–4 were characterized by genes that were upregulated at later time points compared to earlier time points. C5 contained genes that were down-regulated at later time points by high-dose EE treatment. C6 had genes that were specifically upregulated at 24 h. These genes were involved in chromatin-DNA binding, potentially pointing out the primary transcriptional changes related to ethinyl estradiol exposure that would drive later liver toxicity. C7 and C8 contained genes that were upregulated at earlier times (3, 6 and 9 h of EE treatment). Principal component analysis of the data utilizing 1387 genes showed that different time points had a unique gene expression profile. Since 24 h time point was quite distinct from earlier time points in the PCA analysis and C6 indicated a robust gene expression program specific to this time point, we chose this time point for the further analysis (Fig. 2D). This time point was chosen for ensuing feature selection and classification since it has a distinct gene expression profile, and ensures expression and sufficient accumulation of markers.
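For readers who wish to reproduce this kind of exploratory analysis in code, the sketch below clusters gene-by-time-point profiles hierarchically and projects the time points with PCA. It substitutes SciPy and scikit-learn for the Cluster 3 and StrandNGS software used in the study; average linkage with correlation distance is an assumption, since the linkage settings are not stated in the text, and the function names and synthetic profiles are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA


def cluster_genes(expr_genes_by_time, n_clusters=8):
    """Cut a hierarchical clustering tree into at most n_clusters gene clusters."""
    # Correlation distance groups genes by the shape of their time course.
    tree = linkage(expr_genes_by_time, method="average", metric="correlation")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def project_samples(expr_samples_by_genes, n_components=2):
    """Coordinates of each sample (here: time point) on the leading components."""
    return PCA(n_components=n_components).fit_transform(expr_samples_by_genes)


# Synthetic demo: 10 early-response and 10 late-response genes over 8 time points.
rng = np.random.default_rng(0)
early = np.array([2, 2, 2, 0, 0, 0, 0, 0], dtype=float)  # up at 3, 6 and 9 h
late = early[::-1]                                        # up at 8, 15 and 29 days
profiles = np.vstack([early] * 10 + [late] * 10)
profiles += rng.normal(scale=0.1, size=profiles.shape)

gene_clusters = cluster_genes(profiles, n_clusters=2)
time_point_coords = project_samples(profiles.T, n_components=2)
```

In the demo, the early and late genes fall into two separate clusters, mirroring how clusters such as C7 and C8 separate from C1–4 by their expression kinetics.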
Gene expression feature reduction by differential expression analysis
Our data (Figs. 1 and 2), generated using classical approaches to identify differentially expressed genes, showed that more advanced statistical and computational approaches are needed to reduce the number of gene features that can discriminate between control and toxicant-treated individuals, and to generate models that can predict with high accuracy whether toxicant exposure would result in future liver carcinogenesis. To achieve our goal and avoid overfitting or underfitting our data, we utilized the 24 h exposure microarray data for 42 compounds that result in necrosis from the TG-GATES database, and we performed feature selection on the 31,099 genes to identify a small set of features predictive of necrosis. We chose methods from filter, wrapper and embedded approaches. Methods for feature selection included Mann–Whitney, t-test, DCor as filter methods; Boruta, RFE with both RF and SVM as wrapper methods; and RF, Elastic Net, Lasso, Ridge Regression Cross Validation (RidgeCV) and SVM as embedded methods (Table 2). When we tested AUC with up to 50 (Supplementary Fig. 3A) or 100 (Supplementary Fig. 3B) features, accuracy in the majority of models dropped off after 20 or 25 features (Fig. 3A). Thus, we chose the smallest feature set, the top 10 genes, that provided the desired level of accuracy for each method.
Given a set of 10 features from each of the feature selection methods above, we conducted tenfold cross-validation (with all samples from the same compound grouped together in the same fold) utilizing the TG-GATEs dataset as the training set, and the MAQC-II dataset as an independent validation set. With this extensive testing and independent assessment, the resulting gene signature is more likely to be a generalizable predictor. Based on ROC values, filter and wrapper feature selection methods in combination with Logistic Regression, RF and SVM performed with high accuracy (AUC > 0.75, F1 score > 0.75). To perform a more detailed analysis, we focused on the five best performing feature selection methods (DCor, Boruta, RFE_RF, Mann–Whitney and Random Forest) and five classification methods (ElasticNet, Lasso, RF, SVM and Logistic Regression) (Fig. 3B), and unbiased performance error estimates of the models were obtained from the MAQC-II dataset (Table 5). The Mann–Whitney-RF combination had the highest F1 and MCC (F1 = 0.91, ROC = 0.91, sensitivity = 0.85, specificity = 0.97, MCC = 0.82), followed by Mann–Whitney-SVM (F1 = 0.89, ROC = 0.89, sensitivity = 0.88, specificity = 0.91, MCC = 0.79), Boruta-RF (F1 = 0.89, ROC = 0.89, sensitivity = 0.79, specificity = 1.0, MCC = 0.81), and DCor-RF (F1 = 0.89, ROC = 0.89, sensitivity = 0.82, specificity = 0.97, MCC = 0.80) (Fig. 4A, Tables 5 and 6). Overall, the top genes that contributed to the information were similar between Mann–Whitney, DCor and Boruta: five of the ten genes in the signature (Scly, Slc23a1, Dcd, Tkfc and RGD1309534) were the top contributors to the performance of the signature in all three methods (Fig. 4B). The best performing feature selection method, Mann–Whitney, selected the Scly, Dcd, RGD1309534, Slc23a1, Bhmt2, Tkfc, Srebf1, Ablim3, Extl1 and Cyp39a1 genes (Fig. 4B).
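To make the wrapper category concrete, the sketch below runs recursive feature elimination around a random forest with scikit-learn's `RFE`. The number of trees, the elimination step, and the helper name `select_with_rfe_rf` are assumptions for illustration and do not reproduce the settings in Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


def select_with_rfe_rf(X, y, n_features=10, step=5, seed=0):
    """Recursive feature elimination driven by random forest importances.

    Returns a boolean mask of the retained features and the elimination ranking
    (rank 1 marks a retained feature).
    """
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    # At every round the `step` least important features are dropped and the
    # forest is refitted on the remaining ones.
    rfe = RFE(estimator=forest, n_features_to_select=n_features, step=step)
    rfe.fit(X, y)
    return rfe.support_, rfe.ranking_


# Synthetic demo: 60 samples, 30 features, the first three are informative.
rng = np.random.default_rng(0)
demo_y = np.repeat([0, 1], 30)
demo_X = rng.normal(size=(60, 30))
demo_X[:, :3] += 2.5 * demo_y[:, None]

support, ranking = select_with_rfe_rf(demo_X, demo_y, n_features=5)
```

Unlike a filter such as Mann–Whitney, the wrapper re-evaluates feature importance after every elimination round, so the retained set reflects how the features perform jointly in the model.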
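The performance measures quoted above can be computed from predicted and true labels as in the following sketch. It assumes binary labels with necrosis coded as 1 and F1 computed for the necrosis class; the helper name `classification_summary` and the toy predictions are illustrative, not data from the study.

```python
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef


def classification_summary(y_true, y_pred):
    """Sensitivity, specificity, balanced accuracy, F1 and MCC for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }


# Toy example: 10 necrosis and 10 control samples, 8 and 9 classified correctly.
toy_true = [1] * 10 + [0] * 10
toy_pred = [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1
summary = classification_summary(toy_true, toy_pred)
```

For hard class predictions, the area under the ROC curve reduces to the balanced accuracy, that is, the mean of sensitivity and specificity, which is a convenient consistency check on reported values.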
In this study, we built an ML-based predictive process composed of ten genes that should be regulated in rat liver after 24 h of toxicant exposure and that accurately predicts a liver necrosis phenotype, an indicator of liver carcinogenicity after long-term molecule exposure52. We compared various feature selection and classification methods to identify early gene biomarkers of liver toxicity using an extensive gene expression database, TG-GATEs, and an independent validation dataset, MAQC-II. Initially, we focused on necrosis, which is a valid end point to predict liver cancer52, as necrotic cell death is a common feature in liver disease53,54,55. Given that necrosis is a fairly common end point for adverse processes, we anticipate that our methods are applicable to other apical end-points. Rather than depending solely on parametric models, the methods utilized in the feature selection and predictive analysis are adaptive, and involve models requiring the optimization of a tuning or smoothing parameter to control the trade-off between model generality and complexity. Appropriate choice of tuning parameters is critical for feature selection stability and good performance of the resulting predictive model estimator. TG-GATEs microarray gene expression data contain few samples (n) and a very large number of features or genes (p). In machine learning, this p ≫ n problem usually has major consequences for prediction modeling. For example, overfitting may occur, which can make the prediction model unreliable when it is used on other data sets56. Our study design, with an extensive, independent validation and careful feature selection and curation, likely overcomes this hurdle.
Parameter tuning has traditionally been a manual task because of the limited number of trials. Recently, automated surrogate-based parameter optimization has been applied successfully to a wide variety of feature selectors and classifiers57,58 and to deep belief networks59,60. These methods combine computational power with a model of the behavior of the error function in the parameter space, and they improve on manual parameter tuning. To improve the performance of our feature selection and predictive analysis steps, we utilized the MAQCII-NIEHS (GSE16716) dataset as the surrogate for pre-tuning the parameters of these methods17. Since we used an independent validation set (MAQC-II) to select prediction models with higher accuracy, we avoided overfitting issues that typically afflict studies that only employ cross-validation. We also utilized methods that deal directly with binary classification rather than regression methods that generally predict multiple apical end-points from the TG-GATEs database.
We have previously used t-test and RF coupled with logistic regression to identify biomarkers of breast cancer risk61. The dataset used in that work contained far fewer features from a smaller population. Since in the present study we are dealing with many more features from a larger number of experiments, we used an expanded list of feature selection methods that fall into one of three main categories: Mann–Whitney, t-test, DCor as filter methods; Boruta, RFE with both RF and SVM as wrapper methods; and RF, Elastic Net, Lasso, Ridge Regression Cross Validation (RidgeCV) and SVM as embedded methods. For assessing classification performance we used logistic regression, RF, support vector machine (SVM), Lasso and ElasticNet. Instead of relying on one machine learning method7,8,9, we used an exhaustive approach in which we compared combinations of the aforementioned feature selection and classification methods and tested their performance rigorously on a validation set. Our process addresses several limitations of traditional methods for multimodal signature studies in terms of data handling (the number of features is orders of magnitude greater than the number of samples, there are heterogeneous features from different modalities, and there are multiple phenotypic responses to the same conditions) as well as procedure (increased performance over a single approach and assessment of key features in the context of phenotype)35,36. The net outcomes were a minimal descriptive set of 10 biomarkers (key star features) related to liver toxicity (specifically, necrosis), a ranked list of biomarkers that describe a phenotype, a classifier useful for toxicity screening, a confidence measure for the classifier, and a classifier performance evaluated on MAQC-II data unseen during training43,62,63. The number of features used for classification is very low, which reduces the risk of overfitting.
In addition, we used an iterative process in which we selected features and tested their performance on the validation set. This exhaustive process ensured that only the best predictors with a minimum number of genes were used and that their performance was validated in an independent dataset (MAQC-II) to avoid low reproducibility of the identified biomarkers.
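The exhaustive comparison of feature selection and classification methods can be sketched as a nested loop over scikit-learn pipelines, each fitted on the training set and scored on a held-out validation set. Only two selectors and three classifiers are shown; the selector settings, the function name `compare_combinations`, and the synthetic data are assumptions for this example. For two classes, `f_classif` ranks features identically to an equal-variance two-sample t-test.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def compare_combinations(X_train, y_train, X_valid, y_valid, n_features=10):
    """Validation AUC for every (feature selector, classifier) pair."""
    selectors = {
        "t-test": SelectKBest(f_classif, k=n_features),
        "RF": SelectFromModel(
            RandomForestClassifier(n_estimators=200, random_state=0),
            threshold=-np.inf, max_features=n_features),
    }
    classifiers = {
        "LogReg": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=200, random_state=0),
        "SVM": SVC(kernel="linear"),
    }
    results = {}
    for s_name, selector in selectors.items():
        for c_name, classifier in classifiers.items():
            # Fresh copies so that no fitted state leaks between combinations.
            pipe = Pipeline([("select", clone(selector)),
                             ("classify", clone(classifier))])
            pipe.fit(X_train, y_train)
            if hasattr(pipe, "decision_function"):
                scores = pipe.decision_function(X_valid)
            else:
                scores = pipe.predict_proba(X_valid)[:, 1]
            results[(s_name, c_name)] = roc_auc_score(y_valid, scores)
    return results


# Synthetic demo: features 0-2 carry signal in both training and validation sets.
rng = np.random.default_rng(0)
y_tr, y_va = np.tile([0, 1], 40), np.tile([0, 1], 20)
X_tr, X_va = rng.normal(size=(80, 40)), rng.normal(size=(40, 40))
X_tr[:, :3] += 2.0 * y_tr[:, None]
X_va[:, :3] += 2.0 * y_va[:, None]

combo_auc = compare_combinations(X_tr, y_tr, X_va, y_va, n_features=5)
```

Keeping the selector inside the pipeline means that feature selection is fitted on the training data only, so the validation score is not contaminated by information from the validation samples.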
To avoid overfitting while building our prediction models and to eventually utilize the biomarker genes in a practical laboratory test for unknown chemicals, we limited our gene list to 10 candidates. The genes that were selected with various methods are involved in metabolism and detoxification (Car3, Crat, Cyp39a1, Dcd, Lbp, Scly, Slc23a1, and Tkfc) and transcriptional regulation (Ablim3, Srebf1). Several of these genes were implicated in liver carcinogenesis including Crat64, Car365 and Slc23a166.
In summary, using feature selection, modeling and validation with an independent data set, we found a robust set of genes that appeared to be broadly generalizable for prediction. We selected the top genes and the best models to predict whether a compound would cause liver necrosis. This selected pipeline provided predictions with high accuracy. Given the broad set of conditions and a manageable set of predictor genes, we anticipate that this signature can be used to predict future carcinogenic effects of long-term exposure to liver toxicants in rodent models and accelerate the predictability of toxic effects in humans.
The datasets analyzed during the current study are available in the Life Science Database Archive, https://dbarchive.biosciencedbc.jp/en/open-tggates/download.html. A public GitHub repository with datasets and code is available here: https://github.com/brandis2/TG-GATES.
MAQC-II: Microarray Quality Control II
TG-GATEs: Toxicogenomics Project-Genomics Assisted Toxicity Evaluation System
RFE: Recursive feature elimination
SVM: Support vector machine
RidgeCV: Ridge regression cross validation
ROC: Receiver operating characteristic
Maggioli, J., Hoover, A. & Weng, L. Toxicogenomic analysis methods for predictive toxicology. J. Pharmacol. Toxicol. Methods 53, 31–37. https://doi.org/10.1016/j.vascn.2005.05.006 (2006).
Laura Suter-Dick, F. P. Predictive Toxicology (Springer, New York, 2014).
Dolinski, K. & Troyanskaya, O. G. Implications of Big Data for cell biology. Mol. Biol. Cell 26, 2575–2578. https://doi.org/10.1091/mbc.E13-12-0756 (2015).
Längkvist, M., Karlsson, L. & Loutfi, A. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recogn. Lett. 42, 11–24. https://doi.org/10.1016/j.patrec.2014.01.008 (2014).
Yang, S., Guo, L., Shao, F., Zhao, Y. & Chen, F. A systematic evaluation of feature selection and classification algorithms using simulated and real miRNA sequencing data. Comput. Math. Methods Med. 2015, 11. https://doi.org/10.1155/2015/178572 (2015).
Zhao, Z. & Liu, H. Proceedings of the 24th International Conference on Machine Learning 1151–1157 (ACM, Oregon, 2007).
Manzouri, F., Heller, S., Dümpelmann, M., Woias, P. & Schulze-Bonhage, A. A comparison of machine learning classifiers for energy-efficient implementation of seizure detection. Front. Syst. Neurosci. https://doi.org/10.3389/fnsys.2018.00043 (2018).
Lane, T. et al. Comparing and validating machine learning models for mycobacterium tuberculosis drug discovery. Mol. Pharm. 15, 4346–4360. https://doi.org/10.1021/acs.molpharmaceut.8b00083 (2018).
Sakr, S. et al. Comparison of machine learning techniques to predict all-cause mortality using fitness data: the Henry ford exercIse testing (FIT) project. BMC Med. Inform. Decis. Mak. 17, 174. https://doi.org/10.1186/s12911-017-0566-6 (2017).
Kitchen, R. R. et al. Relative impact of key sources of systematic noise in Affymetrix and Illumina gene-expression microarray experiments. BMC Genom. 12, 589. https://doi.org/10.1186/1471-2164-12-589 (2011).
Kohonen, P. et al. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nat. Commun. 8, 15932–15932. https://doi.org/10.1038/ncomms15932 (2017).
Kim, J. & Shin, M. An integrative model of multi-organ drug-induced toxicity prediction using gene-expression data. BMC Bioinform. 15(Suppl 16), S2–S2. https://doi.org/10.1186/1471-2105-15-S16-S2 (2014).
Jennen, D. et al. Drug-induced liver injury classification model based on in vitro human transcriptomics and in vivo rat clinical chemistry data. Syst. Biomed. 2, 63–70. https://doi.org/10.4161/sysb.29400 (2014).
Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., Benítez, J. M. & Herrera, F. A review of microarray datasets and applied feature selection methods. Inf. Sci. 282, 111–135. https://doi.org/10.1016/j.ins.2014.05.042 (2014).
Yang, Z.-Y. et al. Multi-view based integrative analysis of gene expression data for identifying biomarkers. Sci. Rep. 9, 13504. https://doi.org/10.1038/s41598-019-49967-4 (2019).
Igarashi, Y. et al. Open TG-GATEs: a large-scale toxicogenomics database. Nucleic Acids Res. 43, D921–D927. https://doi.org/10.1093/nar/gku955 (2014).
Shi, L. et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. 28, 827–838. https://doi.org/10.1038/nbt.1665 (2010).
Shi, L. et al. The MicroArray Quality Control (MAQC) project shows inter-and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24, 1151 (2006).
Villeneuve, D. L. & Garcia-Reyero, N. Vision & strategy: predictive ecotoxicology in the 21st century. Environ. Toxicol. Chem. 30, 1–8. https://doi.org/10.1002/etc.396 (2011).
Villeneuve, D. L. & Garcia-Reyero, N. Vision & strategy: predictive ecotoxicology in the 21st century. Environ. Toxicol. Chem. 30, 1–8. https://doi.org/10.1002/etc.1396 (2011).
Madak-Erdogan, Z. et al. Design of pathway preferential estrogens that provide beneficial metabolic and vascular effects without stimulating reproductive tissues. Sci. Signal 9, 53. https://doi.org/10.1126/scisignal.aad8170 (2016).
Madak-Erdogan, Z. et al. Free fatty acids rewire cancer metabolism in obesity-associated breast cancer via estrogen receptor and mTOR signaling. Cancer Res. 79, 2494–2510. https://doi.org/10.1158/0008-5472.CAN-18-2849 (2019).
Chen, K. L. A., Zhao, Y. C., Hieronymi, K., Smith, B. P. & Madak-Erdogan, Z. Bazedoxifene and conjugated estrogen combination maintains metabolic homeostasis and benefits liver health. PLoS ONE 12, e0189911. https://doi.org/10.1371/journal.pone.0189911 (2017).
Gautier, L., Cope, L., Bolstad, B. M. & Irizarry, R. A. Affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–315. https://doi.org/10.1093/bioinformatics/btg405 (2004).
Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121. https://doi.org/10.1038/nmeth.3252 (2015).
Phipson, B., Lee, S., Majewski, I. J., Alexander, W. S. & Smyth, G. K. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Ann. Appl. Stat. 10, 946–963. https://doi.org/10.1214/16-AOAS920 (2016).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47. https://doi.org/10.1093/nar/gkv007 (2015).
Alston-Knox, C., Kuhnert, P., Lowchoy, S., McVinish, R. & Mengersen, K. Bayesian Model Comparison: Review and Discussion (Springer, New York, 2005).
Smyth, G. K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. https://doi.org/10.2202/1544-6115.1027 (2004).
de Hoon, M. J., Imoto, S., Nolan, J. & Miyano, S. Open source clustering software. Bioinformatics 20, 1453–1454. https://doi.org/10.1093/bioinformatics/bth078 (2004).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102, 15545–15550. https://doi.org/10.1073/pnas.0506580102 (2005).
Mootha, V. K. et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267–273. https://doi.org/10.1038/ng1180 (2003).
Li, H. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annu. Rev. Stat. Appl. 2, 73–94 (2015).
Shen, Q., Diao, R. & Su, P. Feature selection ensemble. Turing 10, 289–306 (2012).
Braundmeier-Fleming, A. et al. Stool-based biomarkers of interstitial cystitis/bladder pain syndrome. Sci. Rep. 6, 26083. https://doi.org/10.1038/srep26083 (2016).
Candel, S. et al. Microbial profiles and tumor markers from culdocentesis: a novel screening method for epithelial ovarian cancer [3H]. Obstet. Gynecol. 129, 82S (2017).
Hagler, M. A. et al. Identification of novel microRNA profiles in patients with myxomatous mitral valve disease. Circulation 132, A19746–A19746 (2015).
Robison, H. V. E., Erskine, C., Auvil, L., Escalante, P., & Bailey, R., editors. Profiling cytokine-chemokine dynamics using silicon photonic microring resonators. Bioorganic Chemistry Gordon Research Conference (2016).
Su, W., Bogdan, M. & Candes, E. False discoveries occur early on the lasso path. http://arxiv.org/abs/1511.01957 (2015).
Gross, S. M. & Tibshirani, R. Collaborative regression. Biostatistics 16, 326–338 (2014).
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI 1137–1145 (Montreal, Canada).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Nilsson, R., Peña, J. M., Björkegren, J. & Tegnér, J. Consistent feature selection for pattern recognition in polynomial time. Vol. 8 (2007).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Bureau, A. et al. Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 28, 171–182 (2005).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005).
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence 1137–1143 (Morgan Kaufmann Publishers Inc., Montreal, 1995).
Hanson, C., Cairns, J., Wang, L. & Sinha, S. Computational discovery of transcription factors associated with drug response. Pharmacogenom. J. 16, 573–582. https://doi.org/10.1038/tpj.2015.74 (2016).
Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 21, 6. https://doi.org/10.1186/s12864-019-6413-7 (2020).
Metzler, M., Blaich, G. & Tritscher, A. M. Role of metabolic activation in the carcinogenicity of estrogens: studies in an animal liver tumor model. Environ. Health Perspect. 88, 117–121. https://doi.org/10.1289/ehp.9088117 (1990).
Hall, A. P. et al. Liver hypertrophy: a review of adaptive (adverse and non-adverse) changes—conclusions from the 3rd international ESTP expert workshop. Toxicol. Pathol. 40, 971–994. https://doi.org/10.1177/0192623312448935 (2012).
Allen, D. G., Pearse, G., Haseman, J. K. & Maronpot, R. R. Prediction of rodent carcinogenesis: an evaluation of prechronic liver lesions as forecasters of liver tumors in NTP carcinogenicity studies. Toxicol. Pathol. 32, 393–401. https://doi.org/10.1080/01926230490440934 (2004).
Chalasani, N. et al. Clinical advances in liver, pancreas, and biliary tract: causes, clinical features, and outcome from a prospective study of drug-induced liver injury in the United States. Gastroenterology 135, 1924–1934 (2008).
Malhi, H., Gores, G. J. & Lemasters, J. J. Apoptosis and necrosis in the liver: a tale of two deaths? Hepatology 43, S31–S44. https://doi.org/10.1002/hep.21062 (2006).
Bessems, J. G. M. & Vermeulen, N. P. E. Paracetamol (acetaminophen)-induced toxicity: molecular and biochemical mechanisms, analogues and protective approaches. Crit. Rev. Toxicol. 31, 55–138. https://doi.org/10.1080/20014091111677 (2001).
Zucchini, W., MacDonald, I. L. & Langrock, R. Hidden Markov Models for time series: an introduction using R (2nd edition). J. Stat. Softw. 80, 1–12 (2017).
Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F. & Leyton-Brown, K. in Automated Machine Learning: Methods, Systems, Challenges (eds F. Hutter, L. Kotthoff, & J. Vanschoren) 81–95 (Springer, New York, 2019).
Thornton, C., Hutter, F., Hoos, H. H. & Leyton-Brown, K. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. http://arxiv.org/abs/1208.3719 (2012).
Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. in Proceedings of the 24th International Conference on Neural Information Processing Systems 2546–2554 (Curran Associates Inc., Granada, 2011).
Oktay, K. et al. A computational statistics approach to evaluate blood biomarkers for breast cancer risk stratification. Horm. Cancer 11, 17–33. https://doi.org/10.1007/s12672-019-00372-3 (2020).
Austin, P. C. & Tu, J. V. Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. J. Clin. Epidemiol. 57, 1138–1146. https://doi.org/10.1016/j.jclinepi.2004.04.003 (2004).
Heidema, A. G. et al. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 7, 23. https://doi.org/10.1186/1471-2156-7-23 (2006).
Gao, T. et al. DNA methylation of oxidative stress genes and cancer risk in the Normative Aging Study. Am. J. Cancer Res. 6, 553–561 (2016).
Tawa, G. J. et al. Characterization of chemically induced liver injuries using gene co-expression modules. PLoS ONE 9, e107230. https://doi.org/10.1371/journal.pone.0107230 (2014).
Lv, H. et al. Vitamin C preferentially kills cancer stem cells in hepatocellular carcinoma via SVCT-2. npj Precis. Oncol. 2, 1. https://doi.org/10.1038/s41698-017-0044-8 (2018).
This work was supported by grants from Corteva Agriscience (Dow Agrisciences Day Award to RB and ZME), the University of Illinois, Office of the Vice Chancellor for Research, College of ACES FIRE grant (to ZME), National Center for Supercomputing Applications Faculty Fellowship (to ZME) and National Institute of Food and Agriculture, U.S. Department of Agriculture, award ILLU-698-909 (to ZME). Authors from Corteva Agriscience (NE, KJ) contributed to the development of the research question and design for this study. All other funders had no input in the design and implementation of this study.
There are competing interests between the authors (ZME, RB) and Corteva Agriscience (NE, KJ); specifically, the research was supported by Corteva Agriscience. The other authors declare no competing interests.
Smith, B.P., Auvil, L.S., Welge, M. et al. Identification of early liver toxicity gene biomarkers using comparative supervised machine learning. Sci Rep 10, 19128 (2020). https://doi.org/10.1038/s41598-020-76129-8