Introduction

Estrogen receptors (ERs) play essential roles in cell differentiation [1], reproductive function [2,3,4], and morphogenesis [4]. ERs exist in two major subclasses: those that act via a classical genomic mechanism of transcriptional regulation (nuclear ERα and ERβ) and those that act via nonclassical mechanisms (estrogen-related receptors and membrane-bound G-protein-coupled ERs) [5]. Nuclear ERα has a large binding pocket, which allows for nonspecific ER binding by compounds that are estrogen-like [6]. In the classical genomic mechanism, nuclear ERα or ERβ binds to an estrogenic compound. This ligand binding triggers a conformational change and activates the receptor [1, 4, 7]. Two activated nuclear ERs can then dimerize, bind to the estrogen-response element (ERE) promoter region on the cell’s DNA, and recruit cofactors required for transcription [1, 7]. The resulting increased production of mRNA can trigger cell proliferation downstream [7]. This cell proliferation has been linked to adverse effects such as uterine and breast cancers [4, 8]. Therefore, screening new compounds (e.g., drugs as well as commercial and personal care products) for undesired nuclear ER interactions early in development may be valuable.

Traditional experimental testing to identify toxicants relies on costly and time-consuming in vivo animal testing, which is impractical for efficiently assessing the toxicity potential of the tens of thousands of registered compounds that require screening [9]. Computational modeling and in vitro high-throughput screening (HTS) assays are promising alternative methods for toxicity evaluation. However, traditional computational methods, such as quantitative structure–activity relationship (QSAR) models, are often limited when they are developed using small datasets. QSAR models trained with datasets of insufficient size are limited by narrow coverage of chemical space [10], activity cliffs [11], and overfitting [12], which in turn reduces their utility for predicting more complex chemical modes of action.

Over the past 20 years, deep learning has emerged as an integral field of machine learning, especially with regard to the processing of big data [13]. Deep learning has advanced many fields, including voice and image recognition, language processing, and bioinformatics [14]. Most current deep-learning studies employ biologically inspired deep neural networks (DNNs) [15]. Both classic QSAR models and DNNs usually undergo training to predict a single activity (e.g., a single-toxicity endpoint). However, many toxicologically relevant modes of action require complex biological pathway perturbations to elicit an adverse biological effect, and consequently, the evaluation of the overall potential of a compound to exert an adverse outcome requires the prediction of multiple biological endpoints in a comprehensive manner. Multitask learning allows for the development of models that can simultaneously predict multiple activities, and is a potential solution to this challenge. The application of a multitask-learning approach can improve the ability of a model developed for related endpoints to generalize to new compounds due to information sharing during model development, thereby increasing prediction accuracy on new compounds. Successful modeling efforts using both normal and multitask deep learning demonstrate the potential for this technique to improve drug discovery [16,17,18,19] and toxicology [20, 21]. However, currently, no universal criteria for the selection of machine- versus deep-learning methods exist [22,23,24,25,26].

The development of in vitro testing protocols using robots [27] rather than humans allows for the rapid generation of data through HTS programs, advancing computational modeling into a big-data era [28,29,30,31,32,33]. One of the first significant HTS programs in toxicology was the Environmental Protection Agency (EPA) Toxicity Forecaster (ToxCast) initiative, which used an extensive battery of HTS assays to screen over 1000 compounds [34, 35]. The success of ToxCast led to the development of the Toxicity in the 21st Century (Tox21) collaboration of the EPA, Food and Drug Administration (FDA), National Center for Advancing Translational Sciences, and National Toxicology Program, which has a goal of testing ~10,000 compounds in HTS assays [36,37,38]. The direct result of these HTS efforts is the generation of large datasets that researchers can use in computational toxicity-modeling studies.

The availability of big data in public repositories brings urgent needs for researchers to create innovative computational models that can overcome the limitations associated with models based on small datasets. The application of nonanimal models for toxicity evaluation using computational toxicology is becoming feasible with newly developed algorithms and modeling strategies [39,40,41,42,43,44]. Recently, Browne et al. [42] and Judson et al. [43] described models trained using a subset of 18 ToxCast and Tox21 in vitro assays that are mechanistically relevant to the classical ER pathway. However, despite the success of these models, they require experimental concentration-response data, which make them inapplicable to new, untested compounds for which only structural information is available. Our goal was to address these limitations by evaluating machine- and deep-learning approaches for their ability to predict compound activity using models based on mechanistically related suites of assays. In this study, we assessed the applicability of traditional machine-learning (ML) algorithms and deep-learning approaches, including multitask learning with DNNs, to model these 18 mechanistic in vitro assays addressing ER pathway perturbations. The consensus predictions from averaging the predicted probabilities in relevant assays showed advantages compared with individual models, including multitask-learning models. The agonist, antagonist, or binding score was determined for new compounds based on consensus predictions and compared with their known experimental in vitro and in vivo toxicities. The results from this study suggest that a lack of universal criteria for chemical descriptor and algorithm selection for computational toxicology modeling continues to exist, and consensus predictions will still be the best strategy for computational chemical toxicity evaluation purposes.

Materials and methods

ER HTS assay dataset

The toxicity dataset used for modeling is the output of 18 high-throughput in vitro assays from the ToxCast and Tox21 programs (Table 1) [42, 43]. In total, the ToxCast and Tox21 programs tested 8589 compounds against these 18 assays. However, the chemical fingerprints calculated in this study are two-dimensional, which exclude the differences between stereoisomers and cannot deal with inorganic compounds. Therefore, the chemical structures needed further curation before modeling. The CASE Ultra v1.8.0.0 DataKurator tool was used to accomplish this chemical structure standardization. All salts and mixtures were separated into their constituent parts, and the largest organic fraction was kept. Compounds with duplicate structures but different activities in the same assays were evaluated, and the compound with the most active responses across all assays was retained. Compounds with missing/inconclusive results in all 18 assays were removed from the dataset.
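The two record-level curation rules described above (resolving duplicate structures and dropping compounds with no conclusive result) can be illustrated with a minimal sketch. This is not the CASE Ultra DataKurator procedure; the function name `curate`, the structure keys, and the 1/0/`None` encoding of active/inactive/missing results are assumptions made for illustration, as is the tie-breaking rule (the first record seen is kept).

```python
def curate(records, n_assays=18):
    """Apply two curation rules to (structure_key, activity_vector) records.

    Each vector holds 1 (active), 0 (inactive), or None (missing/inconclusive).
    This encoding is an assumption made for this sketch.
    """
    kept = {}
    for key, vector in records:
        if len(vector) != n_assays:
            raise ValueError("expected one entry per assay")
        # Rule 1: drop compounds with no conclusive result in any assay.
        if all(v is None for v in vector):
            continue
        # Rule 2: among duplicate structures, keep the record with the
        # most active responses (ties keep the first record seen).
        n_active = sum(1 for v in vector if v == 1)
        if key not in kept or n_active > kept[key][0]:
            kept[key] = (n_active, vector)
    return {key: vector for key, (_, vector) in kept.items()}


# Toy example with three assays instead of 18, for brevity.
toy_records = [
    ("cmpd-A", [1, 0, None]),
    ("cmpd-A", [1, 1, None]),        # duplicate structure, more actives
    ("cmpd-B", [None, None, None]),  # nothing conclusive
    ("cmpd-C", [0, None, 0]),
]
curated = curate(toy_records, n_assays=3)
```

In the toy example, the second `cmpd-A` record is retained and `cmpd-B` is removed.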

Table 1 Estrogen receptor Toxicity Forecaster (ToxCast) agonism, antagonism, and binding assays.

The final dataset used for modeling in this study consisted of 7576 unique compounds, each of which showed conclusive active or inactive test results in at least one of the 18 nuclear ER-related in vitro assays (Supplementary Table SI). Inconclusive results were treated as missing data for modeling purposes. Each chemical was assigned an activity vector consisting of 18 active, inactive, or missing/inconclusive results for all assays.

Chemical descriptors

Three types of two-dimensional binary chemical fingerprints, Molecular ACCess System (MACCS), Extended Connectivity FingerPrint (ECFP), and Functional Connectivity FingerPrint (FCFP) descriptors, were generated for all compounds in Python v3.6.2 using the cheminformatics package RDKit v2017.09.1 (http://rdkit.org/). MACCS descriptors are a set of 167 fingerprints based on chemical substructures widely used in cheminformatics modeling [45]. ECFP and FCFP descriptors are substructure fingerprints calculated using a modified version of the Morgan algorithm (i.e., by evaluating the environment surrounding particular atoms in a molecule using a specified bond radius) [46]. FCFP descriptors can represent functional group information about a molecule rather than a specific substructure, whereas ECFP descriptors can represent specific chemical information about a molecule. For example, FCFP descriptors detect the presence of an aryl halide rather than the specific presence of chlorine bonded to a benzene ring that ECFP descriptors detect. In this study, 1024 ECFP and FCFP descriptors were calculated for all compounds using a bond radius of 3.

QSAR model development

Four ML algorithms were used to develop QSAR models for each ToxCast assay endpoint: Bernoulli Naive Bayes (BNB), k-Nearest Neighbors (kNN), Random Forest (RF), and Support Vector Machines (SVM). In this study, all four ML algorithms were implemented in Python v3.6.2 using scikit-learn v0.19.0 (http://scikit-learn.org/) [47]. Briefly, BNB models apply Bayes’ theorem to datasets with binary features by “naively” assuming that features are independent of one another [48]. kNN models predict the activity of a compound from the activities of its k nearest neighbors, which are identified by a subspace similarity search [49]. RF models are ensemble models that construct a series of decision trees using a random selection of features and training set compounds [50]. RF models ultimately produce an average of the output from each decision tree to prevent overfitting. SVM models represent training compounds in the descriptor space, and attempt to locate the optimal hyperplane that separates active and inactive compounds [51]. The ML algorithms were tuned to identify the optimal input parameters for model performance, as described previously [23]. Briefly, hyperparameters (i.e., parameters set before model training) were optimized using an exhaustive grid-search algorithm [23]. Each ML algorithm was fit to the ER HTS training data using each possible set of hyperparameters to identify the best-performing model. The model with the best combination of hyperparameters was retained and then used for the prediction of the test set.
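The exhaustive grid search described above can be sketched in a few lines of plain Python. The sketch below is illustrative only: the grid values, the function names `grid_search` and `toy_score`, and the scoring function itself are hypothetical stand-ins (in practice the score would be a cross-validated metric such as the AUC), and the study's actual grids are those described in [23].

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every hyperparameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for a kNN-style model; the values are illustrative only.
example_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}


def toy_score(params):
    """Stand-in for a cross-validated AUC (not a real model evaluation)."""
    bonus = 0.02 if params["weights"] == "distance" else 0.0
    return 0.70 + 0.01 * params["n_neighbors"] + bonus


best_params, best_score = grid_search(example_grid, toy_score)
```

Because every combination is evaluated, the cost grows multiplicatively with the number of values per hyperparameter, which is why grids are usually kept small.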

Both normal and multitask DNNs were implemented in Python v3.6.2 using keras v2.1.2 (http://keras.org) and TensorFlow v1.4.0 (https://www.tensorflow.org/). DNNs consist of an input layer that contains information about the features of the data, such as chemical fingerprints, used to train the model, and an output layer, which is a prediction for the activity of interest [15]. A series of “dense” layers connect the input and output layers, such that every node in each layer shares a weighted connection with every node in the previous and next layers. These weighted connections undergo optimization in the model-training process. All DNNs in this study were implemented with three hidden layers of width equal to the number of fingerprints in the input layer (i.e., 167 for MACCS descriptors and 1024 for ECFP and FCFP descriptors). Before model training, the weights between the neurons of each layer were randomly initialized using the He normal method [52]. These weights were optimized during training to achieve the minimum binary cross-entropy. To this end, the following standard deep-learning methods were implemented: stochastic gradient descent optimization [53] (learning rate = 0.01, Nesterov momentum [54] = 0.9), rectified linear unit hidden-layer activation [55], and automatic learning-rate reduction [56] (90% reduction upon 50 consecutive epochs with no loss improvement, minimum = 0.0001). Dropout [57] (rate = 0.5) and L2 [58] (β = 0.001) regularizations and early stopping [59] (upon 200 epochs with no loss improvement) were implemented to avoid overfitting. The model output layer used a sigmoid activation function [60] so that the predicted result was interpretable as a probability.
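The interplay between learning-rate reduction and early stopping can be made concrete with a small simulation. The sketch below replays a sequence of per-epoch losses using the settings stated above (initial rate 0.01, 90% reduction after 50 epochs without improvement, floor of 0.0001, stop after 200 epochs without improvement). It is a simplified illustration, not the callback logic of any particular library: the function name `simulate_schedule` and the exact counter behavior (no cooldown, no minimum improvement margin) are assumptions.

```python
def simulate_schedule(losses, lr=0.01, factor=0.1, lr_patience=50,
                      min_lr=0.0001, stop_patience=200):
    """Replay per-epoch losses; return (learning-rate history, stop epoch).

    The stop epoch is None if early stopping is never triggered.
    """
    best = float("inf")
    since_best = 0    # epochs since the best loss (drives early stopping)
    since_reduce = 0  # stagnant epochs since the last rate change
    history = []
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            since_best = 0
            since_reduce = 0
        else:
            since_best += 1
            since_reduce += 1
        if since_reduce >= lr_patience:
            lr = max(lr * factor, min_lr)  # 90% reduction, bounded below
            since_reduce = 0
        history.append(lr)
        if since_best >= stop_patience:
            return history, epoch
    return history, None


# A loss that never improves after the first epoch.
flat_history, stop_epoch = simulate_schedule([1.0] * 301)
```

With a completely flat loss, the rate drops to 0.001 at epoch 50, reaches the 0.0001 floor at epoch 100, and training stops at epoch 200 (epochs counted from zero).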

Model performance was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC). Each model developed in this study computes a probability that a tested compound will be active in a given bioassay. Tested compounds are classified as active when they exceed a determined probability threshold. The ROC curve for model performance is a plot of the true-positive rate (Eq. 1) against the false-positive rate (Eq. 2) using various probability thresholds for the classification of active compounds [61]. The area under this plotted curve (AUC) is interpretable as a measure of the likelihood of a model to distinguish active from inactive compounds correctly. An AUC of 0.5 represents a random model performance as the baseline. The AUC is a suitable metric for this study due to the highly imbalanced nature of the assay data used to train the models. In modeling studies using imbalanced datasets (e.g., HTS assay data), the default probability threshold of 0.5 is not always appropriate [62]. Using the AUC as an evaluation method takes this consideration into account by evaluating model performance at several different probability thresholds.

$$\mathrm{TPR} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}, \tag{1}$$
$$\mathrm{FPR} = \frac{\text{False positives}}{\text{False positives} + \text{True negatives}}. \tag{2}$$
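As a worked illustration of Eqs. 1 and 2, the following minimal sketch sweeps the probability threshold over the predicted values, computes the (FPR, TPR) pair at each threshold, and integrates the resulting curve with the trapezoidal rule. The function names and the toy labels and probabilities are illustrative assumptions, not data from this study.

```python
def roc_points(y_true, y_prob):
    """Return (FPR, TPR) pairs obtained by sweeping the probability threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both active and inactive compounds are required")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # Eq. 2 and Eq. 1
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy imbalanced example: 2 actives, 4 inactives.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_auc = auc(roc_points(toy_labels, toy_probs))
```

For the toy data the AUC is 0.875, which matches the fraction of active/inactive pairs (7 of 8) in which the active compound receives the higher probability.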

External validation

External validation on new compounds is needed to demonstrate the predictivity of the developed models. To this end, external validation was performed using two datasets: the Collaborative Estrogen Receptor Activity Prediction Project (CERAPP) in vitro agonist, antagonist, and binding datasets [63] and the Estrogenic Activity Database (EADB) in vivo rodent uterotrophic dataset [64]. Before model validation, the CASE Ultra v1.8.0.0 DataKurator tool was used to prepare the structures of new compounds as previously described. Only compounds absent from the training dataset were kept. The final curated CERAPP in vitro agonist, antagonist, and binding validation sets contained 368, 264, and 569 compounds, respectively (Supplementary Table SII). The final curated EADB in vivo rodent uterotrophic agonist validation set contained 966 compounds (Supplementary Table SIII).

Three new parameters were created to evaluate a chemical’s potential to act as a nuclear ER agonist, antagonist, or binder based on its predicted activity in relevant assays: agonist score (\(S_{Ag}\), Eq. 3), antagonist score (\(S_{Ant}\), Eq. 4), and binding score (\(S_B\), Eq. 5). In these equations, \(P(Ai)\) is the probability for a predicted compound to be active in Assay i. The 18 total assays contain 16 agonism assays (A1–A16), 13 antagonism assays (A1–A11, A17, and A18), and 11 binding assays (A1–A11). These three parameters integrate relevant models of ER agonism, antagonism, and binding to evaluate new compounds for their toxicity potential at nuclear ERs. The performance of models during external validation was evaluated using ROC curve plots and AUC calculations, as previously described for the cross-validation procedure.

$$S_{Ag} = \frac{\sum_{i=1}^{16} P(Ai)}{16} \tag{3}$$
$$S_{Ant} = \frac{\sum_{i=1}^{11} P(Ai) + \sum_{i=17}^{18} P(Ai)}{13} \tag{4}$$
$$S_B = \frac{\sum_{i=1}^{11} P(Ai)}{11} \tag{5}$$
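Eqs. 3 to 5 are plain averages over fixed assay subsets, which the following minimal sketch makes explicit. The assay index sets follow the definitions in the text (A1–A16 for agonism; A1–A11, A17, and A18 for antagonism; A1–A11 for binding); the function names and the toy probabilities are illustrative assumptions.

```python
AGONIST_ASSAYS = list(range(1, 17))                # A1-A16
ANTAGONIST_ASSAYS = list(range(1, 12)) + [17, 18]  # A1-A11, A17, A18
BINDING_ASSAYS = list(range(1, 12))                # A1-A11


def mean_probability(probs, assays):
    """Average the predicted active probabilities over the given assays."""
    return sum(probs[i] for i in assays) / len(assays)


def er_scores(probs):
    """Compute S_Ag, S_Ant, and S_B from a dict {assay index: P(Ai)}."""
    return {
        "S_Ag": mean_probability(probs, AGONIST_ASSAYS),
        "S_Ant": mean_probability(probs, ANTAGONIST_ASSAYS),
        "S_B": mean_probability(probs, BINDING_ASSAYS),
    }


# Toy compound: strong binding-assay predictions, moderate agonism-specific
# predictions, weak antagonism-specific predictions.
toy_probs = {i: 0.8 for i in range(1, 12)}
toy_probs.update({i: 0.4 for i in range(12, 17)})
toy_probs.update({17: 0.1, 18: 0.1})
toy_scores = er_scores(toy_probs)
```

For the toy compound, S_B = 0.8, S_Ag = 10.8/16 = 0.675, and S_Ant = 9.0/13 ≈ 0.692. Because the 11 binding-related assays contribute to all three scores, they dominate each average.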

Results

Dataset

Figure 1 shows a summary of the 7576 unique compounds tested against at least one of the 18 ToxCast and Tox21 nuclear ER-related in vitro assays. HTS assay data usually contain missing and inconclusive data points, and the results are biased (i.e., more inactive than active) [28, 29]. In total, these compounds consist of over 53,000 total conclusively active or inactive assay hit calls, indicating that missing/inconclusive results exist in the dataset. The results show a diverse number of conclusive activities per compound, ranging from 2 to 18 hit calls in these assays (Fig. 1a). Only 476 compounds showed conclusive results for all 18 assays, representing 6.3% of the full dataset. The low active response ratio across all assays (i.e., the active ratio ranges from 1:16 to 1:3) compared with inactive responses reflects the nature of HTS results for chemical toxicity testing [28, 29]. Furthermore, no individual assay has conclusive results for all 7576 compounds. Instead, the size of each assay dataset ranges from 883 to 7263 compounds, depending on the assay nature (Table 1, Fig. 1b). For example, NVS_NR_bER (A1, 1004 compounds), NVS_NR_hER (A2, 1076 compounds), and NVS_NR_mERa (A3, 883 compounds) show the lowest number of tested compounds, and they are NovaScreen assays. TOX21_ERa_BLA_Agonist_ratio (A14), TOX21_ERa_LUC_BG1_Agonist (A15), TOX21_ERa_BLA_Antagonist_ratio (A17), and TOX21_ERa_LUC_BG1_Antagonist (A18) are Tox21 assays that consist of 7263 compounds with conclusive results, representing the richest individual assay datasets. Therefore, these 18 assay datasets represent a large range of data size and chemical diversity, which are suitable for modeling studies to evaluate the ML algorithms.

Fig. 1: Summary of estrogen receptor high-throughput screening dataset.

Distributions of compounds in the ToxCast and Tox21 dataset (n = 7576) by the number of conclusive active or inactive results per compound (top) and individual assay datasets (n = 18) by the number of active and inactive compounds (bottom).

The data used in this study also show a bias toward inactive responses. Out of the full dataset, only six of these compounds showed active results across all 18 assays: Bisphenol AF (CAS 1478-61-1), 2-ethylhexyl 4-hydroxybenzoate (CAS 5153-25-3), 4-tert-octylphenol (CAS 140-66-9), diethylstilbestrol (CAS 56-53-1), 4-cumylphenol (CAS 599-64-4), and hexestrol (CAS 84-16-2). These six compounds show uterotrophic activity in at least one guideline-like study [65]. By comparison, 4698 compounds show only inactive results in one or more of these 18 assays, representing a majority (62.0%) of all compounds. The individual assay datasets reveal a similar trend, with small ratios of active versus inactive results. For example, ATG_ERE_CIS_up (A13), which is an mRNA-induction assay, has the highest active ratio of ~1:3. Compared with this assay, TOX21_ERa_BLA_Agonist_ratio (A14), which is a beta-lactamase-induction assay, has the lowest active ratio of ~1:16. Some previous studies showed that downsampling to remove some inactive compounds from training datasets was beneficial to the resulting QSAR models [66, 67]. However, in this study, the full dataset was retained to preserve an ample chemical space for the prediction of new compounds.

QSAR model development

Four ML (BNB, kNN, RF, and SVM) and two DNN algorithms were paired with ECFP, FCFP, and MACCS descriptors individually to develop 18 models for each ER assay (Fig. 2). Simpler algorithms, such as logistic regression, were not used in this study since previous studies have shown the advantages of advanced ML algorithms [23, 68]. In total, 273 models (216 ML models, 54 normal DNN models, and 3 multitask DNN models) were developed for all of the ER assay data. In 2007, the Organization for Economic Co-Operation and Development (OECD) published a guidance document on the validation of QSAR models developed for risk-assessment purposes [69]. The guidelines set forth by this document require that models undergo statistical evaluation for goodness-of-fit, robustness, and predictivity, including model cross-validation [69]. Cross-validation procedures that leave compounds out during each iteration provide reliable model evaluations [70]. In this study, all models were evaluated using a fivefold cross-validation procedure, with 20% of the dataset left out for prediction purposes during each iteration. Each assay dataset was randomly split into five equal subsets maintaining the original proportion of active and inactive responses. In this procedure, four subsets (80% of the total compounds) were combined as a training set, and the remaining 20% was used as a test set. This procedure was repeated five times, such that each compound was used in a test set one time. The six resulting models for each assay-descriptor combination were averaged to give a consensus prediction, as described in previous publications [66, 71,72,73].
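The stratified fivefold split and the consensus averaging described above can be sketched as follows. This is a minimal pure-Python illustration, not the code used in the study: the function names, the round-robin assignment within each class, and the toy labels and probabilities are assumptions.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split indices into n_folds subsets that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class out to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def consensus(prob_lists):
    """Average per-compound probabilities across several models."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


# Toy assay: 10 actives and 40 inactives (active ratio 1:4).
toy_labels = [1] * 10 + [0] * 40
toy_folds = stratified_folds(toy_labels)

# Each fold serves once as the test set; the other four form the training set.
toy_splits = [
    (sorted(i for k, fold in enumerate(toy_folds) if k != test_k for i in fold),
     sorted(toy_folds[test_k]))
    for test_k in range(len(toy_folds))
]

# Consensus of two hypothetical models for two compounds.
toy_consensus = consensus([[0.2, 0.8], [0.4, 0.6]])
```

Each toy fold contains 2 actives and 8 inactives, so every training set holds 80% of the compounds and every compound appears in exactly one test set.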

Fig. 2: Consensus QSAR modeling workflow used in this study.

The consensus QSAR modeling workflow employed here consists of three main stages: generation of three sets of binary fingerprints for each compound in the curated dataset, development of 273 total QSAR models using classic machine-learning, normal deep neural network, and multitask deep neural network approaches, and averaging the resulting predictions to give one consensus prediction.

Table 2 shows the fivefold cross-validation results for each model. The AUC values for all resulting models ranged between 0.562 and 0.870. The highest AUC value ranged between 0.645 and 0.870 for each assay, indicating that at least one descriptor–algorithm combination yielded a satisfactory model for each endpoint. OT_ER_ERaERb_0480 (A6) had the best-performing models, with AUC values ranging between 0.609 and 0.870. Compared with this assay, TOX21_ERa_LUC_BG1_Agonist (A15) and ACEA_T47D_80 h_Positive (A16) consistently had lower-performing models, with AUC values of 0.562–0.660 and 0.562–0.645, respectively. In previous studies, QSAR model performance was high when modeling simple endpoints (e.g., physical–chemical properties) but became lower for complex biological activities (e.g., cellular responses) [29]. A15 and A16 are nuclear ER agonism assays that represent protein production induced by ER-mediated transcriptional activation [74] and the resulting cell proliferation [75, 76] (Table 1). Among the biological processes represented by these 18 assays, transcriptional activation and cell proliferation represent the farthest downstream processes in the classical genomic ER signaling pathway [43], which may be the reason that they are the most difficult to model.

Table 2 Performance of individual models for 18 ToxCast and Tox21 ER assays using a fivefold cross-validation.

Notably, no algorithm could outperform the others across all of the 18 assay endpoints and three descriptor sets (Table 2). However, compared with normal DNNs, multitask DNNs had better predictivity for 16 out of 18, 18 out of 18, and 13 out of 18 assay endpoints using MACCS, FCFP, and ECFP descriptors, respectively (Table 2), indicating the advantage of using multitask learning to model these mechanistically related endpoints. The three consensus models showed better or similar results compared with all other algorithms. For example, when using MACCS descriptors, the fivefold cross-validation results of the consensus model achieved AUC values as high as 0.870, representing the best performance for 10 out of 18 assay endpoints (55.6%) compared with individual models. When using the FCFP descriptors, the consensus model achieved AUC values as high as 0.829, representing the best performance for 8 out of 18 assay endpoints (44.4%) compared with individual models. When using the ECFP descriptors, the consensus model achieved AUC values as high as 0.833, representing the best performance for 5 out of 18 assay endpoints (27.8%) compared with individual models. No individual model showed better performance than the consensus model across all 18 assay endpoints.

External validations

External validation is necessary to prove the predictivity of the resulting QSAR models. An external validation procedure was conducted using two new datasets: the in vitro CERAPP dataset consisting of 368 new agonists, 264 new antagonists, and 569 new binders, and the in vivo EADB uterotrophic dataset consisting of 966 new agonists. Before performing external validation, compounds that were also included in the model-training set were removed from both datasets, resulting in 569 and 966 unique compounds that were not tested in the ToxCast and Tox21 ER HTS assays, and are new to the developed models. Since each assay is only relevant to a specific target of a binding mechanism, using the parameters \(S_{Ag}\), \(S_{Ant}\), and \(S_B\), which were defined to integrate all relevant models, can estimate the estrogenic activities of new compounds more reliably compared with using a single QSAR model for the external compounds (Eqs. 3–5). For example, the \(S_B\) parameter represents the likelihood of a compound to be an in vitro ER binder (Eq. 5). This parameter includes 11 assays (A1–A11) that represent receptor binding [77,78,79,80], receptor dimerization [81,82,83], and DNA binding [83] (Table 1). The \(S_{Ag}\) parameter (Eq. 3) represents the likelihood of a compound to be an in vitro ER agonist, and includes five additional assays (A12–A16) that represent RNA transcription [84], protein production [74], and cell proliferation [75, 76]. The \(S_{Ant}\) parameter (Eq. 4) includes all assays used to calculate \(S_B\) and two extra assays (A17 and A18) that represent transcriptional suppression [74].

Table 3 shows the results of these external validations. The AUC values of the prediction results using the \(S_{Ag}\) parameter for the new agonists in the CERAPP and EADB datasets ranged from 0.732 to 0.906 and from 0.640 to 0.802, respectively. The highest-performing models for the CERAPP dataset were RF models, regardless of the descriptors used. The combination of normal DNNs with FCFP descriptors showed the best performance for the EADB dataset. The AUC values of the prediction results using the \(S_{Ant}\) parameter for the new antagonists in the CERAPP dataset ranged from 0.711 to 0.869. The highest-performing model for this dataset used multitask DNNs with FCFP descriptors and achieved an AUC value of 0.869. The AUC values of the prediction of new binders in the CERAPP dataset using the \(S_B\) parameter ranged from 0.622 to 0.754. The highest-performing model for the CERAPP dataset was the combination of normal DNNs with MACCS descriptors. Although the consensus model did not show the best performance in the external predictions, its prediction accuracy was similar to the best-performing model in the four datasets (Table 3).

Table 3 External validation of ER agonists, antagonists, and binders.

Discussion

Computational methods offer potential advantages for rapid early screening of compounds for possible estrogenic and antiestrogenic effects. In 2015, the US EPA published a computational model that incorporated concentration-response data from 18 quantitative HTS assays from the ToxCast and Tox21 programs [42, 43]. The success of this model to predict in vivo uterotrophic activity led to the acceptance of its results as an alternative to rodent uterotrophic testing [85]. However, this model requires experimental concentration-response data for evaluating compounds, and cannot be applied to new compounds that did not yet undergo testing in these assays. Furthermore, not all of the included assays are readily available to be applied. This issue was solved in the current study by developing machine- and deep-learning models to predict the ER activity of new compounds directly from chemical structures. Multitask deep learning outperformed normal deep learning for the prediction of in vitro activity in almost all cases across 18 ToxCast and Tox21 assays. None of the six algorithms used for modeling could consistently outperform all others across 18 assays, regardless of the descriptors used. Consensus modeling is, therefore, still the most suitable and robust modeling approach. These advantages are evident in this study, with consensus models yielding the highest AUC for 11 of the 18 total assays across all descriptor–algorithm combinations (61%, Table 2). The combination of all descriptor–algorithm sets to generate one consensus prediction instead of selecting an algorithm that is specific to a descriptor set is still the best strategy for future model development.

The \(S_{Ag}\), \(S_{Ant}\), and \(S_B\) parameters used for the prediction of the in vitro agonist, antagonist, and binding activities of external validation datasets are also based on the concept of consensus modeling (Eqs. 3–5). Each of these parameters incorporates predictions using assays that represent between three and six different biological processes relevant to the activity of interest. For example, the \(S_{Ag}\) parameter includes 16 assays related to nuclear ER agonism, which represent six biological processes: receptor binding, receptor dimerization, DNA binding, RNA transcription, protein production, and cell proliferation (Table 1). Furthermore, these assays represent four general types of technology: radioligand, fluorescence, bioluminescence, and electrical impedance [42, 43] (Table 1). By incorporating assays that represent a variety of technologies, the results are more reliable because technology-specific artifacts will affect fewer probabilities.

The predictivity of new compounds, especially toxic compounds, can be explained by revealing their nearest-neighbor compounds. For example, 6α-hydroxyestradiol (CAS 1229-24-9) was classified as a binder and a strong agonist in the CERAPP dataset [63]. This compound is an estrogenic product from the liver metabolism of the prominent endogenous estrogen estradiol (E2) [86]. 6α-hydroxyestradiol showed both the highest \(S_B\) score (\(S_B\) = 0.882) and the highest \(S_{Ag}\) score (\(S_{Ag}\) = 0.879) among all new compounds using the consensus models. 6α-hydroxyestradiol was predicted to be active in all binding-related assays (A1–A11) and all agonism-specific assays (A12–A16). Its nearest neighbor in the training set was alfatradiol (CAS 57-91-0), a stereoisomer of E2 that behaves as a nuclear ER agonist in both in vitro [63] and in vivo [65] assays. Alfatradiol also showed active responses in all binding and agonist assays used to train the models in this study. Among the EADB in vivo uterotrophic agonists, mestilbol (CAS 18839-90-2) showed the highest \(S_{Ag}\) score (\(S_{Ag}\) = 0.870). Mestilbol is a synthetic monomethyl ether derivative of diethylstilbestrol (CAS 56-53-1), which is its nearest neighbor in the training set. Diethylstilbestrol (DES) is a well-known synthetic nonsteroidal estrogen that was previously prescribed to pregnant women to prevent miscarriages [87]. DES is a known strong agonist of the ER that showed uterotrophic activity in several independent guideline-like studies [65]. Another external compound, pipendoxifene (CAS 198480-55-6), was classified as an ER antagonist in the CERAPP dataset [63] and was predicted correctly. Pipendoxifene is an investigational drug currently undergoing clinical trials as a selective ER modulator (SERM) [88]. Pipendoxifene is under development to treat ER-positive breast cancers as well as osteoporosis [89]. Pipendoxifene showed mixed (either active or inactive) results in binding assay model predictions, but was predicted as an antagonist in the specific assays (A17 and A18). Among these assays, this compound’s two nearest neighbors were raloxifene hydrochloride (CAS 82640-04-8) and bazedoxifene acetate (CAS 198481-33-3), which are FDA-approved SERMs for the treatment of osteoporosis [89, 90]. Clinical trials of these compounds indicated ER antagonist activity in breast and uterine tissue [89, 90].

The predictive accuracy of this study can be improved by implementing applicability domains. The QSAR models were based on chemical structures, and therefore are most reliable when predicting new compounds that are chemically and structurally similar to compounds in the training dataset. A common method to implement a QSAR model applicability domain is only to predict compounds that are within a certain similarity threshold with their nearest neighbor in the training set [91, 92]. Figure 3 shows the effect of only predicting compounds within a Jaccard similarity of 0.8, 0.4, or 0.3 using models with MACCS, FCFP, or ECFP descriptors, respectively, on the fivefold cross-validation and external validation results. For external validation, new compounds were predicted if the \(S_{Ag}\), \(S_{Ant}\), and \(S_B\) parameters can be calculated with at least half of their constituent assay models (Eqs. 35). Using these thresholds allows for 42–83% coverage of the external predictions. Implementing these applicability domains enhanced the cross-validation performance of all the algorithms, including consensus predictions, for 18 ER assays (Fig. 3a, c, e). The average AUC value for each algorithm improved from 0.600–0.759 to 0.617–0.800 using the applicability domains (i.e., Jaccard similarity 0.8 for MACCS, 0.3 for ECFP, and 0.4 for FCFP descriptors). The use of the applicability domains also enhanced most external predictions (Fig. 3b, d, f). For CERAPP compounds, the AUC values improved from 0.622–0.906 to 0.696–0.923 using the applicability domain. However, for the EADB compounds, implementing the applicability domain did not improve the results significantly (Fig. 3b, d, f). Although the \(S_{Ag}\), \(S_{Ant}\), and \(S_B\) parameters as currently calculated show good predictivity (Table 3), utilizing applicability domains and reducing the weight of binding assays in the calculations is expected to enhance the results further. 
Defining the applicability domain is also one of the principles for validation of QSAR use for regulatory purposes, and thus is a prudent consideration if the ultimate purpose of the QSAR model is to make a regulatory decision [93].
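
The nearest-neighbor applicability domain described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the representation of a fingerprint as a set of on-bit indices, the toy fingerprints, and the choice of an inclusive threshold (similarity greater than or equal to the cutoff) are all assumptions made for illustration.

```python
from typing import Iterable, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Assumption: two empty fingerprints are treated as having zero similarity.
    return len(a & b) / union if union else 0.0


def nearest_neighbor_similarity(query: Set[int], training: Iterable[Set[int]]) -> float:
    """Highest Jaccard similarity between the query and any training compound."""
    return max((jaccard(query, t) for t in training), default=0.0)


def in_domain(query: Set[int], training: List[Set[int]], threshold: float) -> bool:
    """True if the query's nearest training neighbor meets the threshold (assumed inclusive)."""
    return nearest_neighbor_similarity(query, training) >= threshold


def coverage(queries: List[Set[int]], training: List[Set[int]], threshold: float) -> float:
    """Fraction of query compounds that fall inside the applicability domain."""
    if not queries:
        return 0.0
    return sum(in_domain(q, training, threshold) for q in queries) / len(queries)


# Hypothetical toy fingerprints (on-bit indices), not real MACCS/FCFP/ECFP data.
training_set = [{1, 2, 3, 4}, {2, 3, 5}]
new_compounds = [{1, 2, 3, 4, 6}, {7, 8}, {2, 3, 5}]

# Keep only compounds inside the domain at a 0.8 threshold before predicting them.
predictable = [q for q in new_compounds if in_domain(q, training_set, 0.8)]
print(len(predictable), coverage(new_compounds, training_set, 0.8))
```

In this sketch, a stricter threshold reduces coverage because fewer new compounds have a sufficiently similar training-set neighbor, which is the trade-off between reliability and coverage that an applicability domain introduces.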

Fig. 3: Effect of applicability domains on model predictions.

Predictivity of individual and consensus QSAR models using MACCS descriptors for (a) cross-validation and (b) external validation with a chemical similarity threshold of 0.8, using FCFP descriptors for (c) cross-validation and (d) external validation with a chemical similarity threshold of 0.4, and using ECFP descriptors for (e) cross-validation and (f) external validation with a chemical similarity threshold of 0.3. All AUC values are reported as the mean value ± standard deviation.

In this study, 7576 compounds that were tested in ToxCast and Tox21 assays related to nuclear ER agonism, antagonism, and binding were used for exhaustive modeling using classic machine learning, normal deep learning, and multitask deep-learning approaches. To this end, 273 individual QSAR models were developed for 18 assay datasets related to nuclear ER activity. QSAR models developed using multitask deep learning outperformed models developed with normal deep learning (i.e., trained for a single endpoint) for almost all endpoints. However, no individual algorithm consistently outperformed all others across the 18 endpoints. The consensus models generated by averaging the predictions of the individual models had similar or higher predictivity than the individual models. Three parameters were defined to incorporate predictions from models that represent mechanistically relevant assays to predict a compound’s likelihood of behaving like a nuclear ER agonist, antagonist, or binder. External validation based on these parameters showed reliable predictivity for new compounds that did not undergo experimental testing in the 18 assays. The results of this study demonstrate the advantages of multitask deep learning for the QSAR modeling of mechanistically related assay endpoints. Furthermore, consensus modeling remains the most reliable strategy for QSAR modeling in the current big-data era, as no algorithm or chemical descriptor set is universally better than the others.