## Introduction

Drug safety issues have been the leading cause of attrition during preclinical development as well as in late-stage clinical trials of new drugs1,2,3,4. After analyzing attrition data for small molecule drug candidates from four large pharmaceutical companies, a study found that preclinical toxicology was the largest cause of attrition at the candidate nomination stage, and that clinical safety was also a leading cause of attrition in phase I (first) and phase II (second) clinical trials5. Even in late-stage clinical trials, safety issues remain the leading cause of failure, accounting for 25% of phase II and 14% of phase III failures from 2013 to 20156. Toxicity testing for drugs in development continues to rely heavily on animal models, which are expensive and low throughput, and whose results are difficult to translate to humans. To predict the potential toxicological effects of thousands of environmental chemicals, including drugs and drug candidates in the early stages of development, alternative strategies are needed to supplement traditional toxicity testing methods. A number of in silico approaches have recently been developed to predict adverse drug reactions using publicly available drug datasets7,8,9. Prediction models have been built from chemical structure10,11,12, protein target information13,14, phenotypic data7,15, or combinations of these data types, using various machine learning methods. Some of these approaches have shown promising results, yet they suffer from a number of limitations. Chemical structure-based models rely on structural similarity and are therefore often poorly predictive for drugs that are new structural entities. Target information and phenotypic observations are not always available, especially for new drug candidates, where early assessment is most critical.

Preclinical in vitro safety profiling of compounds with biochemical and cellular assays offers an informative and relatively cost-effective approach to complement in silico methods16. Systematically testing large chemical libraries to establish a consistent and robust set of in vitro activity profiles is challenging, but would add tremendous value to improving drug toxicity evaluation17.

A major effort addressing this challenge is the U.S. Tox21 (Toxicology in the 21st Century) program, a collaborative effort initiated in 2008 with an emphasis on developing new methodology to evaluate the potential risks of environmental chemicals to human health. The Tox21 program18,19,20,21 is a collaboration between the National Toxicology Program (NTP) of the National Institute of Environmental Health Sciences (NIEHS), the National Center for Computational Toxicology (NCCT) of the U.S. Environmental Protection Agency (EPA), the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH), and the U.S. Food and Drug Administration (FDA). Tox21 adopted high-throughput screening (HTS) techniques to efficiently test large numbers of chemicals, using the data generated to (1) identify patterns of compound-induced biological responses that provide insight into toxicity pathways and compound mechanisms of toxicity; (2) prioritize compounds for more extensive toxicological evaluation; and (3) develop predictive models of biological responses in humans. The ultimate goal of the Tox21 program is to identify in vitro chemical signatures that can act as predictive surrogates for in vivo toxicity.

Tox21 has established a library of ~10,000 chemicals for the production phase of the program, including the NCATS Pharmaceutical Collection (NPC), which contains drugs used in the clinic22. This library has been screened against 47 cell-based assays in a quantitative high-throughput screening (qHTS) format23,24,25,26, generating nearly 70 million data points to date. Recently, we evaluated the utility of these data toward achieving the Tox21 goals27. Computational models were built using the in vitro assay activity profiles and/or compound structure data to predict in vivo toxicity. While useful for generating hypotheses on compound mechanisms of toxicity, the assay data-based models achieved reasonable but less than ideal performance for most in vivo toxicity endpoints. This may be accounted for by species differences (most in vivo toxicity endpoints are of animal-model origin, whereas the Tox21 assays use human cells) and by insufficient coverage of biological space by the assays screened so far, which have focused primarily on nuclear receptors28 and stress response pathways29.

To overcome these limitations and re-evaluate the utility of the in vitro human cell-based assay data, here we accessed publicly available human toxicity data and rebuilt models to predict adverse drug effects in humans. In addition, we tested whether the expanded biological space coverage provided by additional in vitro data could improve the predictive performance of the models. As surrogates for such in vitro data, we incorporated into these models the known drug target annotation (DTA) information available for some drugs, such as the protein/gene/pathway target of a drug (e.g., estrogen receptor, TNF signaling pathway) or its mechanism of action (e.g., dopamine D2-receptor antagonist). Based on the results, we propose a short list of targets/pathways not currently represented in the Tox21 datasets, which can serve as a guide for new assay development and screening toward establishing a robust set of in vitro compound activity profiles. Data generated for these additional targets and pathways may improve the power of the in vitro assay data to predict in vivo human toxicity.

## Results

### In vitro assay performance and activity

As stated in Methods, all data associated with the 47 assays subject to the present analyses are publicly available30. Thirty of the 47 assays have been described in detail in our previous study27. The performance statistics of all 47 assays in qHTS format are summarized in Table S1. Similar to the 30 previously described assays, most of the 17 more recently screened assays performed well, with signal-to-background (S/B) ratios ≥ 3-fold, coefficients of variation (CVs) ≤ 10%, and Z’ factors ≥ 0.531. The overall performance of an assay is better represented by data reproducibility25, measured as the active match, inactive match, inconclusive, and mismatch rates across the three copies of the compound library, with compounds plated in different well locations in each copy (Fig. 1(A)). Forty-two of the 47 assays scored (score = 2 × %active match + %inactive match − %inconclusive − 2 × %mismatch) > 80 (grade A or B) in reproducibility, with < 1% mismatches in activity (Table S2). Eleven of these had reproducibility scores between 80 and 90 (grade B) with mismatch rates < 1%. The remaining five assays scored below 80 but above 70 (three scored > 75), with mismatch rates of 0.4–2%. The ROR (retinoid-related orphan receptor gamma) and RAR (retinol signaling pathway) antagonist assays were the two lowest scoring assays. For the same sample, the average AC50 (half-maximal activity concentration) differences between the three runs were < 2-fold for all assays (Table S2). The activity distribution of the compounds screened against the 47 assays is shown in Fig. 1(B). Active rates ranged from 0.27% (NFκB agonist mode assay) to 27.4% (DT40 Rad54/Ku70 mutant assay), with an average of 5.7%. The activity patterns of the NPC compounds across all 47 assays and 156 readouts are illustrated in Fig. 2, in comparison with their target/mode of action (MOA) annotations and observed adverse effects.
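The reproducibility score above is a weighted sum of the four outcome rates (in percent). A minimal sketch, assuming the grade boundaries A ≥ 90 and B ≥ 80 implied by the scores reported above (the function names are ours):

```python
def reproducibility_score(pct_active_match, pct_inactive_match,
                          pct_inconclusive, pct_mismatch):
    """Score = 2 * %active match + %inactive match - %inconclusive - 2 * %mismatch,
    computed from the outcome rates across the three library copies."""
    return (2 * pct_active_match + pct_inactive_match
            - pct_inconclusive - 2 * pct_mismatch)

def reproducibility_grade(score):
    """Letter grade; the A/B boundary at 90 is inferred from the text above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    return "C or below"
```

For example, an assay with 10% active match, 89% inactive match, 0.5% inconclusive, and 0.5% mismatch scores 2 × 10 + 89 − 0.5 − 1 = 107.5 (grade A).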

### Comparing prediction performance with traditional classifiers

Given that the goal of this study was to evaluate the value of different data types, in vitro assay data in particular, in predicting human in vivo toxicity, and not to build optimal models for ADE prediction, we chose to apply a single method to all data types so that the results would be consistent and comparable. We chose the weighted feature significance (WFS) method, which we have previously shown to be a robust and flexible method suited to this purpose32. In that study, we compared WFS with more traditional classifiers such as support vector machine (SVM) and Naïve Bayes for toxicity prediction on several different datasets32. The three methods showed variable but mostly comparable performance depending on the dataset, while WFS outperformed the other two in some cases, especially on structurally diverse datasets. To test whether the performance of WFS would hold up on ADE modeling, we again compared WFS with three traditional classifiers: SVM, random forest, and Naïve Bayes. These three classifiers were applied to build ADE prediction models using the assay data in combination with selected DTAs as an example, and their performance was compared with that of WFS. WFS (average AUC-ROC = 0.63) outperformed random forest (average AUC-ROC = 0.57) and Naïve Bayes (average AUC-ROC = 0.61), and performed similarly to SVM (average AUC-ROC = 0.65) (Figure S2). WFS thus appeared to be a suitable method for our purposes, and the models can be optimized further once data from new assays with expanded biological space coverage become available.

## Discussion

In this study, we tested the applicability of different data types, in vitro cell-based assay data in particular, to building predictive models of adverse drug effects in humans, using in-house-generated in vitro assay data on 1,511 approved drugs together with publicly available human adverse effect data for the same drugs. We also conducted the first meta-analysis comparing the performance of animal in vivo toxicity data in predicting human adverse outcomes with that of in vitro assay data. Animal toxicity data do not appear to have a clear advantage over human cell-based data in predicting human in vivo effects: models built with in vivo animal toxicity endpoints showed similarly moderate performance to those built with the in vitro assay data in predicting adverse drug effects in humans. This result again confirms that species differences, as well as data sparsity and lack of consistency, limit the reliability of extrapolating animal in vivo toxicity data to human in vivo effects.

However, most models built with in vitro human cell-based assay data alone did not show good predictive capacity. In comparison, models built with chemical structure information showed better predictive performance for many ADEs. We hypothesized that the low performance of the in vitro assay data may be due to the limited biological space coverage of the current panel of assays. Therefore, we combined the in vitro assay data with DTAs collected from the literature (2,370 DTAs) to build new models, which showed remarkable improvements in predictive performance. To identify which DTAs contributed the most to the predictions, each DTA was then evaluated individually, resulting in a set of 58 DTAs that were predictive of at least one of the human ADEs with AUC-ROC or balanced accuracy (BA) > 0.6. Adding this set of 58 DTAs to the in vitro assay data significantly improved model performance. Moreover, the assay + DTA models outperformed the structure-based models by at least 0.1 in AUC-ROC on a number of ADEs, including mania, abnormal behavior, hypercholesterolaemia, ventricular extrasystoles, hiccups, and erythema multiforme, with AUC-ROC values ranging from 0.70 to 0.83, whereas none of the structure-based models outperformed the assay + DTA models by 0.1 in AUC-ROC.

Chemical structure-based prediction of toxicity relies on structural similarity, assuming that chemicals sharing similar structural features will exhibit similar biological or toxicological effects. The structure-based models in our study showed good predictive performance for many ADEs. However, structure-based models are not reliable when applied to completely new scaffolds not present in the training data. In addition, slightly altered structural features may dramatically change the interaction between a chemical and its targets, leading to unexpected toxicity. Models based on in vitro assay data circumvent this problem, as predictions are based on similarity in activity profiles with no structure information required, on the assumption that chemicals with similar activity profiles most likely hit the same targets and thus produce similar toxicity outcomes. Nevertheless, in vitro assay data-based models have their own limitations. Chemicals are often metabolized in the human body, whereas most in vitro assays lack metabolic capacity; such models would therefore likely fail to predict human toxicity caused by metabolites. One solution to this problem is to introduce metabolic capacity into in vitro assays20,33. Another is to combine in vitro assay data with structure data and other available target information. As we have shown in our previous predictive modeling efforts27, and as further confirmed by the current study, the combined models achieved the best performance in predicting human toxicity.

We have shown that DTA information can significantly improve the predictive performance of the assay data-based models. More importantly, data on just a small set of additional DTAs (2% of the entire 2,370-DTA set) that contributed the most to the models can already expand the biological space coverage sufficiently to produce predictive models of human toxicity when combined with in vitro assay data. While the entire DTA set improved model performance by 22–28% on average, the selected set of 58 DTAs alone improved performance by 15–18% on average. That is, 2% of the DTA information accounted for ~70% of the improvement in the predictive capacity of the models. It is not surprising that the DTA-based models showed the best performance, as these data come from the literature, can be considered validated experimental or assay data, and cover the drug target space known in the literature well. In addition to limited target space coverage, the current assay data used for modeling are primary HTS data without further validation, and are thus inevitably confounded by noise and assay artifacts. These results highlight the importance of data quality and of selecting the right assays. Validated DTA data appear to be the best choice for ADE or human in vivo toxicity prediction; however, DTA-based models cannot be applied to new compounds for which such annotations are unavailable. It is, therefore, important to generate high-quality, validated assay data with good coverage of biological space.

In summary, qHTS in vitro assay activity profiles were evaluated for their capacity to predict human toxicity, manifested as adverse effects of approved drugs, alone and in conjunction with structure and/or DTA information. Models built with in vitro human assay data alone showed limited power to predict human effects, possibly due to the limited biological space coverage of the current suite of assays and the lack of further validation, but their performance was close to that of models built with animal in vivo toxicity data. Both chemical structure information and additional DTA annotations significantly improved the predictive performance of the assay data-based models, resulting in robust models for many adverse drug effects. Most importantly, a small set of targets selected to complement the biological space covered by the in vitro assays was shown to produce models that performed nearly as well as models built with all DTA information included. This set of targets can serve as a guide for assay development aimed at generating in vitro data that better predict human toxicity.

## Data and Methods

### In vitro assay and structure data

qHTS data generated on the NPC portion of the Tox21 10 K collection through the end of 2015 were used for modeling, comprising 47 assays with 156 readouts. All data and detailed descriptions of these assays are publicly available through the NCATS website (https://tripod.nih.gov/tox21/assays/) and PubChem30. A complete list of assays and readouts can be found in Table S5. Curve rank was used as the measure of compound activity28. The detailed process of data normalization, correction, classification of concentration-response curves, and activity assignment was described previously44. For modeling purposes, compounds with an absolute curve rank > 0.5 were labeled active (1), and inactive (0) otherwise. Structure fingerprints were generated for compounds in the NPC library using Leadscope® (Leadscope, Inc., Columbus, Ohio, USA) for the structure-based models.
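The activity call is a simple threshold on the absolute curve rank; a minimal sketch with hypothetical readout names:

```python
def activity_call(curve_rank):
    """Binarize qHTS activity: 1 (active) if |curve rank| > 0.5, else 0 (inactive)."""
    return 1 if abs(curve_rank) > 0.5 else 0

# Example: binarizing a compound's curve ranks across readouts
# (readout names and values are illustrative only)
curve_ranks = {"readout_1": -1.1, "readout_2": 0.3}
activity = {name: activity_call(rank) for name, rank in curve_ranks.items()}
```

Note that a compound sitting exactly at a curve rank of ±0.5 is labeled inactive, since the cutoff is strictly greater than 0.5.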

### In vivo toxicity data

Animal toxicity data were retrieved from the Registry of Toxic Effects of Chemical Substances (RTECS) database compiled by Leadscope® (Leadscope, Inc., Columbus, Ohio, USA). This compilation contains 48 acute toxicity endpoints from various species, including rodents, dogs, birds, and other mammals, for > 10,000 molecules, 2,968 of which overlap with compounds in the NPC library. In addition, RTECS annotates compounds by toxicity category, such as primary irritant, mutagen, reproductive effector, and tumorigen, for a total of 52 endpoints. For the acute toxicity endpoints, we followed the Globally Harmonized System (GHS) classification to determine the toxicity cutoff value46. Chemicals labeled by the GHS as Categories 1–3 (signal word “DANGER”) were considered toxic. The complete list of in vivo toxicity endpoints and their toxicity cutoff values can be found in Table S6.
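The GHS-based toxicity call for the acute endpoints reduces to a category check (a sketch of the rule described above; the function name is ours):

```python
def ghs_acute_toxic(ghs_category):
    """GHS acute toxicity Categories 1-3 carry the signal word 'DANGER' and are
    labeled toxic (1) for modeling; higher categories are labeled non-toxic (0)."""
    return 1 if ghs_category in (1, 2, 3) else 0
```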

### Drug target annotations (DTA)

Medical Subject Headings (MeSH) (http://www.ncbi.nlm.nih.gov/mesh) pharmacological action (PA) terms were used for compound mode of action (MOA) annotations. Gene target annotations for drugs were downloaded from DrugBank (http://www.drugbank.ca/)47. In release 3.0 of the database, 3,228 gene targets were linked to 5,785 drugs. Additional drug target annotations were downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.jp/kegg/) in January 2016. In that version of the database, 3,555 drugs were mapped to 254 human pathways, 4,536 drugs were annotated with 997 gene targets, and 792 drugs were annotated with 33 enzymatic targets. Combining all drug target and MOA annotations yielded a total of 2,370 annotations mapped to 2,567 unique compounds (CAS numbers) in the NPC collection. For modeling purposes, a “1” was assigned to all known drug-target associations and a “0” was assigned when no known association was reported for a drug-target pair.
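The drug-target annotation encoding described above amounts to building a binary drug × DTA matrix from the annotation pairs. A minimal sketch with hypothetical drug names (the DTA labels echo the examples in the Introduction):

```python
# Hypothetical (drug, DTA) annotation pairs; real pairs come from MeSH PA,
# DrugBank, and KEGG as described above
annotations = [
    ("drug_a", "estrogen receptor"),
    ("drug_a", "TNF signaling pathway"),
    ("drug_b", "dopamine D2-receptor antagonist"),
]

drugs = sorted({drug for drug, _ in annotations})
dtas = sorted({dta for _, dta in annotations})
pairs = set(annotations)

# 1 for a known drug-target association, 0 when none is reported
dta_matrix = [[1 if (drug, dta) in pairs else 0 for dta in dtas] for drug in drugs]
```

Note that a 0 here means "no association reported", not an experimentally confirmed absence of interaction, which is one source of label noise for the models.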

### Modeling

Models were built for the human ADEs using assay activity (activity-based models), compound structure (structure-based models), combinations of structure and activity data with or without drug target annotations (DTAs), and animal toxicity endpoints. The Weighted Feature Significance (WFS) method previously developed at NCATS32 was applied to construct the models. Briefly, WFS is a two-step scoring algorithm. In the first step, a Fisher’s exact test is used to determine the significance of enrichment of each feature among the drugs with a given ADE relative to the drugs without that ADE reported, and a p-value is calculated for every feature present in the dataset. For assay activity data, each assay readout was treated as a feature, with the feature value set to 1 for active compounds and 0 for inactive compounds. For animal in vivo toxicity data, each toxicity endpoint was treated as a feature, with the value set to 1 for toxic compounds and 0 for non-toxic compounds. Missing data were omitted from the p-value calculations. For structure data, the feature value was set to 1 for drugs containing a given structural feature and 0 for drugs lacking it. For DTA data, each DTA was treated as a feature, with the value set to 1 for drugs reported to have that DTA and 0 for drugs not known to have it. If a feature is less frequent in the active compound set than in the non-active compound set, its p-value is set to 1.
These p-values form what we call a “comprehensive” feature fingerprint, which is then used to score each drug for its potential to cause a given ADE according to Equation (1), where $p_i$ is the p-value for feature i; C is the set of all features present in a drug; M is the set of features encoded in the “comprehensive” feature fingerprint (i.e., features present in at least one drug with that ADE); $N_X$ is the number of features in set X; and α is a weighting factor, set to 1 in all the models described here. A high WFS score indicates a strong potential to cause the ADE.

$$WFS=\frac{\sum \mathrm{log}({p}_{i})}{\min (\mathrm{log}({p}_{i}))\times (\alpha {N}_{C-M}+{N}_{M\cap C})}$$
(1)
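The two scoring steps can be sketched as follows. This is a minimal illustration of our reading of Equation (1), not the original implementation: the Fisher test is taken as one-sided for enrichment, the numerator sum is taken over the features in M ∩ C, and all function names are ours.

```python
import numpy as np
from scipy.stats import fisher_exact

def feature_pvalues(X, y):
    """Step 1: for each binary feature (column of X), a one-sided Fisher's exact
    test for enrichment among ADE-positive drugs (y == 1); p is forced to 1 when
    the feature is less frequent in the positive set, per the text above."""
    pvals = np.ones(X.shape[1])
    pos, neg = X[y == 1], X[y == 0]
    for j in range(X.shape[1]):
        a, b = pos[:, j].sum(), neg[:, j].sum()      # feature present
        c, d = len(pos) - a, len(neg) - b            # feature absent
        _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
        if a / max(len(pos), 1) < b / max(len(neg), 1):
            p = 1.0
        pvals[j] = p
    return pvals

def wfs_score(present, pvals, in_fingerprint, alpha=1.0):
    """Step 2, Equation (1): `present` is the boolean feature mask C for one drug;
    `in_fingerprint` is the mask M (features seen in >= 1 ADE-positive drug)."""
    logp = np.log(pvals)
    mc = present & in_fingerprint                # M intersect C
    n_cm = int(np.sum(present & ~in_fingerprint))  # N_{C-M}
    denom = logp[in_fingerprint].min() * (alpha * n_cm + int(mc.sum()))
    return 0.0 if denom == 0 else float(logp[mc].sum() / denom)
```

With this normalization the score falls in [0, 1]: a drug whose features are all maximally significant scores 1, while features outside the fingerprint (the α-weighted term) dilute the score.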

For each model, compounds were randomly split into two groups of approximately equal size, one used for training and the other for testing. This random split, followed by model training and testing, was repeated 10 times to evaluate the robustness of the models, and the average AUC-ROC across the 10 runs was calculated for each model. Model performance was assessed by the area under the receiver operating characteristic (ROC) curve (AUC-ROC), where the ROC curve plots sensitivity [TP/(TP + FN)] against 1 − specificity [specificity = TN/(TN + FP)]48. A perfect model has an AUC-ROC of 1, while an AUC-ROC of 0.5 indicates a random classifier.
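AUC-ROC can be computed without tracing the full curve, via its equivalence to the Mann–Whitney U statistic. A self-contained sketch of the repeated ~50/50 holdout evaluation, where `fit` is a hypothetical stand-in for any of the training procedures described above:

```python
import numpy as np

def auc_roc(scores, labels):
    """AUC-ROC as the probability that a random positive outscores a random
    negative, with ties counting 1/2 (Mann-Whitney U / (n_pos * n_neg))."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def repeated_holdout_auc(fit, X, y, n_repeats=10, seed=0):
    """Average test AUC-ROC over n_repeats random ~50/50 train/test splits.
    `fit(X_train, y_train)` must return a function mapping X_test to scores."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        train, test = idx[: len(y) // 2], idx[len(y) // 2:]
        score = fit(X[train], y[train])
        aucs.append(auc_roc(score(X[test]), y[test]))
    return float(np.mean(aucs))
```

The pairwise formulation is O(n_pos × n_neg) but exact, which is adequate for datasets of this size; rank-based implementations scale better for very large test sets.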

Each of the 2,370 DTAs was evaluated for its capacity to predict the human ADEs using the ROC approach. The 58 DTAs found to be predictive of at least one human ADE at AUC-ROC > 0.6 or balanced accuracy (BA = (sensitivity + specificity)/2) > 0.6 were selected to compare their impact on model performance with that of the full DTA set.
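Balanced accuracy is simply the mean of sensitivity and specificity, computed from the confusion-matrix counts:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """BA = (sensitivity + specificity) / 2, with sensitivity = TP/(TP + FN)
    and specificity = TN/(TN + FP)."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2
```

For example, `balanced_accuracy(8, 2, 9, 1)` gives (0.8 + 0.9)/2 = 0.85. Unlike raw accuracy, BA is insensitive to class imbalance, which matters here since most ADEs are reported for only a small fraction of drugs.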

### Data Availability

The datasets generated and/or analyzed during the current study are available in PubChem [https://www.ncbi.nlm.nih.gov/pcassay?term=tox21] and from the corresponding author on reasonable request.