Introduction

Estrogen receptors (ERs) play a crucial role in the initiation and advancement of breast cancer, a prevalent malignancy that affects millions worldwide1. Breast cancer is a diverse ailment, and its different subcategories are frequently identified by whether ERs are present or absent2 ERs are proteins located in breast cells that engage with the hormone estrogen, which is a vital regulator of numerous physiological processes, including the development and upkeep of breast tissue3. In this context, ERs serve as molecular switches that can either promote or hinder the growth and proliferation of breast cancer cells, depending on the presence or absence of estrogen.

There are two primary estrogen receptors: ERα and ERβ. ERα is predominantly situated in breast tissue and can also be found in the uterus, ovaries, and other reproductive organs. When estrogen activates ERα, it is associated with the stimulation of cell growth and replication, which is essential for the development and maintenance of breast tissue. In contrast, ERβ is found in breast tissue, although in smaller quantities compared to ERα, and it is also commonly distributed in various other tissues throughout the body, including the prostate, colon, and bone4. The function of ERβ is more complex and not as well-understood as that of ERα. However, recent research has emerged emphasizing ERβ's anti-cancer properties and its potential as a predictor of treatment effectiveness, irrespective of the presence of ERα5,6. Grasping the role of ERs in breast cancer is of great importance in both clinical practice and scientific exploration. This comprehension has paved the way for the development of tailored treatments specifically designed to address ER-positive breast cancers, resulting in improved treatment outcomes and overall patient survival rates. These therapeutic strategies involve substances that mitigate the effects of estradiol by competitively binding to ER, such as selective estrogen receptor modulators (SERMs, like tamoxifen), which decrease the levels of natural estrogens, such as aromatase inhibitors (e.g., letrozole, anastrozole, and exemestane), or a selective estrogen receptor degrader (SERD, like fulvestrant), which completely counter and degrade ER7. However, around 15–30% of those affected do not see benefits from the usual treatments owing to the innate resistance mechanisms, and during treatment, around 30–40% will acquire resistance8,9. Therefore, the development of treatment resistance is a significant factor leading to unfavorable outcomes and remains a substantial challenge in the management of ER-positive breast cancer.

To address the problem of resistance, researchers are exploring various computer-assisted approaches for drug design. These methods include quantitative structure–activity relationship (QSAR)10,11,12, machine learning (ML)-based models13,14,15, deep learning (DL)-based models16, molecular docking10,17,18, molecular dynamic simulations18,19, and pharmacophore analysis18, among others. It's important to note that most of these research endeavors primarily focus on targeting ERα rather than ERβ20. To date, there is only one ML-based approach (named ERpred21) that is developed for predicting the effectiveness of inhibitors against ERα and ERβ. ERpred is a random forest-based model trained on 659 and 714 compounds for ERα and ERβ, respectively. Although ERpred provided reasonable prediction performance, there are three major issues that need to be addressed. Firstly, because the existing datasets used to develop ERpred contained a small number of compounds (Table 1), their predictive ability might be unsatisfactory for real-life applications. Secondly, ERpred did not conduct a comparative analysis among different ML classifiers and molecular descriptors in the identification of ERα and ERβ inhibitors. Lastly, ERpred was developed using only single ML algorithm and molecular descriptor. On the other hand, ensemble learning approach can automatically integrate several different ML classifiers to enhance the predictive performance.

Table 1 Comparison of training and independent test datasets used in ERpred and this study.

Keeping these issues in mind, we introduce StackER, a stacked ensemble learning approach for the accelerated and accurate identification of inhibitors against ERα and ERβ using SMILES information only. To obtain an accurate prediction model, first, we established an up-to-date dataset by collecting positive and negative samples from the ChEMBL database. Second, we investigated and evaluated variant ML models in predicting ERα and ERβ inhibitors by employing nine different types of SMILES-based feature descriptors (i.e., AP2D, AP2DC, FP4, FP4C, KR, KRC, MACCS, Pubchem, and RDK5) cooperating with eight popular ML algorithms (i.e., RF, generalized linear model (GLM), support vector machine (SVM), extreme gradient boosting (XGB), k-nearest neighbors (KNN), partial least squares regression (PLS), recursive partitioning and regression tree (rpart), and multi-layer perceptron (MLP)). Their predictive performances were obtained by performing the tenfold cross-validation and independent tests. In the meanwhile, all the ML classifiers were applied to generate probabilistic features (PFs). Finally, the optimal PFs were identified through a two-step feature selection method and used for the development of an efficient stacked model. Experimental results based on the cross-validation and independent tests showed that StackER can achieve a better overall performance as compared to several conventional ML classifiers and the existing method in precisely predicting inhibitors against ERα and ERβ. Furthermore, StackER was applied to identify important features for being ERα and ERβ inhibitors to be substructures with fluorine and nitrogen-containing and cyclohexanone derativatives, respectively, while our proposed model was used to identify FDA-approved drugs as potential ERα inhibitors in efforts to facilitate drug repurposing.

Materials and methods

Data collection and curation

The datasets for ERα and ERβ (CHEMBL206 and CHEMBL242, respectively) were obtained from the ChEMBL database22 (version 33, accessed on August 20, 2023). Initially, there were 15,446 compounds for ERα and 8979 compounds for ERβ in the dataset. In this study, we collected the IC50 bioactivity data for inhibitory activity against ERα and ERβ from the initial dataset, resulting in 5180 compounds for ERα and 3605 compounds for ERβ. These curated datasets underwent further pre-processing, which involved standardizing the chemical structure representations (SMILES), removing duplicates, and eliminating salt components. All of these pre-processing steps were carried out using the R programming language23. Then, we obtained the subsequent dataset consisting of 2532 and 1577 compounds for ERα and ERβ, respectively. To generate active and inactive compounds, we applied the same criteria as employed in previous studies21,24,25,26. As a result, we obtained actives and inactives (ERα and ERβ) of (1145 and 736) and (851 and 471), respectively. Finally, we randomly selected 80% of all compounds for each subtype to construct the training datasets, whereas the remaining compounds were used to construct the independent test datasets. The detail of the training and independent test datasets involved in this study is provided in Table 1.

Descriptor extraction

For each compound, we generated multiple sets of fingerprint descriptors using the PaDEL-Descriptor software27 and RDKit (https://www.rdkit.org). Molecular fingerprints are widely employed in the field of cheminformatics because they effectively capture the structural characteristics of chemical compounds28,29,30. In this study, we considered nine different categories of molecular fingerprints, which include AP2D, AP2DC, KR, KRC, MACCS, Pubchem, FP4, FP4C, and RDK531,32,33,34,35,36. A summary of these descriptor types is recorded in Table 2. In essence, we used the chemical structures represented in SMILES format as input to compute the fingerprint descriptors. Before the calculation of these descriptors, we standardized the tautomeric forms and removed any salt components. In total, we extracted eight molecular descriptors using the R programming environment (version 4.3.023) and the RDK5 fingerprint descriptor was extracted using the Python programming environment37.

Table 2 Summary of nine molecular fingerprints used in this study.

Two-step feature selection strategy

From the viewpoint of classification, the feature selection procedure is an important step to exclude noisy features while improving performance. Herein, we applied a two-step feature selection method to determine m informative features to construct the final model. In the first step, RF method was used to assess the importance score of each feature. The RF method used herein was implemented in the R programming environment (version 4.3.1)38. Then, all the features were ranked according to their importance scores. The RF method is widely applied in various biological and chemical classification problems21,24,39,40. After obtaining the ranked features, we constructed n feature sets containing the m top-ranked important features ranging from top mstart to mend with an interval of s. The values of mstart, mend, s, and n depend on the feature dimension. In the second step, ML models were trained using all the n feature sets and their performance were assessed using the tenfold cross-validation test. The optimal feature set having the highest Matthews correlation coefficient (MCC) was utilized to construct the final model in this study.

StackER framework

Stacking is a powerful ensemble learning strategy that allows the integration of the outputs of heterogynous prediction models as mean to construct a stacked model. Numerous studies have highlighted that the stacked models outperform single-based models in terms of high accuracy and low error. As shown in Fig. 1, the stacking strategy uses a two-layer learning framework, where the corresponding classifiers at each layer are referred as base-classifier and meta-classifier. In brief, the base-classifier is constructed using the original feature descriptors and used to generate PFs. Then, the PFs are considered as the input feature for the meta-classifier construction. A detailed description of the stacking strategy is provided in details as follows.

Figure 1
figure 1

Workflow of the StackER development for identifying inhibitors against ERα and ERβ. This framework involves four primary steps, which include dataset preparation, base-classifier construction, meta-classifier optimization, and performance evaluation and model interpretation.

In the first-layer, we employed eight ML algorithms (i.e., GLM, MLP, KNN, RF, PLS, rpart, SVM and XGB) cooperating with nine molecular descriptors (i.e., AP2D, AP2DC, KR, KRC, FP4, FP4C, MACCS, Pubchem, and RDK5) to construct heterogeneous base-classifiers. As a result, we obtained a total of 72 base-classifiers, which were implemented based on the caret package for the R programming environment (version 4.3.1)38, their parameters were tuned using the grid optimization algorithm24,26,41,42,43 (Supplementary Table S1). After that, we employed these base-classifiers to generate PFs. The PF generation based on the stacking strategy is as following: (i) we used the tenfold cross-validation procedure to randomly divide the training dataset (DTRN) into 10 equal-sized subsets, where \({D}_{TRN}=\{{S}_{1}, {S}_{2},\dots ,{S}_{10}\}\); (ii) for the kth iteration, we treated \({D}_{TRN}-{S}_{k}\) and \({S}_{k}\) as the current training and testing sets. The base-classifier trained with \({D}_{TRN}-{S}_{k}\) was used to generate the prediction output (\({P}_{k}\)); and we obtained 10 prediction outputs \(\{{P}_{1}, {P}_{2},\dots ,{P}_{10}\}\) of DTRN. Then, the PF was obtained by averaging the 10 prediction outputs. Finally, in this layer, 72 PFs of all the 72 base-classifiers were obtained and represented with a 72-D probabilistic feature vector (APF). In the second layer, the meta-classifier was constructed using the SVM method (called mSVM) cooperated with APF. To optimize the performance of the meta-classifier, the two-step feature selection method was employed to determine a set of optimal PFs (called OPF). As a result, the values of mstart, mend, s, and n are 5, 50, 5, and 14, respectively. The optimal feature set having the highest MCC was utilized to construct the final stacked models for the identification of inhibitors against ERα and ERβ. Moreover, we employed six well-known performance metrics, including MCC, area under the receiver operating characteristic (ROC) curve (AUC), accuracy (ACC), balanced accuracy (BACC), specificity (Sp), and sensitivity (Sn) to assess the performance of the proposed model and conventional ML models. The details of these six performance metrics are mentioned in the Supplementary Information.

Case study and docking study of FDA-approved drugs

In this study, we obtained a library of FDA-approved small molecule drugs, consisting of 2,735 compounds, from the DrugBank database (version 5.1.10; released on January 4, 2023). After removing salt and inorganic compounds, as well as eliminating duplicate and disconnected SMILES representations and SMILES with explicit valence, the remaining number of compounds was reduced to 1,737. We then calculated molecular descriptors for all these compounds, which were used as input for prediction with our StackER model. The top fifteen compounds identified by our stack model were subsequently subjected to docking analysis, facilitating drug repurposing efforts. The target protein (PDB ID: 3ERT) was obtained from the Protein Data Bank (https://www.rcsb.org) and adjusted for docking. This optimization involved energy minimization in the SwissPDB Viewer44 and the addition of polar hydrogens and removal of water molecules in AutoDockTools version 1.5.7. Likewise, in order to ensure docking compatibility with AutoDock Vina45, ligands were prepared using AutoDockTools. Both the optimized protein and ligands were saved in pdbqt file formats. To enable accurate binding affinity calculations, we used the amino acid residues in the active site of the ERα protein to define a grid with dimensions of 50 × 40 × 48, with its center coordinates set at X = 29.621, Y = −0.545, Z = 26.455. The binding affinity of the ligands was determined by docking them inside the predetermined grid box of the target protein. The exhaustiveness was set to 32, and the energy range was set to 4, with the maximum energy difference between the best and worst binding mode not exceeding 3 kcal/mol. The binding potential of individual ligands can be represented by docking score or energy, where lower scores indicate higher binding affinity46,47. Finally, the analysis of the docked protein–ligand binding complexes was carried out using Discovery Studio software.

Results and discussion

Analysis of applicability domain

The applicability domain (AD) of a QSAR model delineates a region within the chemical space where the model is expected to provide accurate predictions48. To understand this, we employed t-distributed stochastic neighbor embedding (t-SNE) to visually represent the feature space associated with the nine molecular fingerprints. The visualizations in Supplementary Figs. 1 and 2 depict the compounds from both the training and independent datasets, denoted in green and red, respectively, for ERα and ERβ. The AD boundary was established based on the characteristics of the training dataset, and compounds falling within this boundary are considered to be within the model's applicability domain. As seen in Supplementary Figs. 1 and 2, all nine molecular fingerprints for both protein subtypes exhibited overlapping chemical spaces between the training and independent datasets, indicating their suitability for the models developed in this study.

Construction of StackER

In this section, we constructed different SVM-based meta-classifiers by taking advantages of our two new probabilistic feature vectors (i.e., APF and OPF) to provide more accurate ERα and ERβ inhibitors prediction. In addition, to improve the predictive performance, we used the two-step feature selection strategy to independently optimize the APF for each subtype. The two-step feature selection strategy determined 35 and 35 important PFs for developing SVM-based meta-classifiers for ERα and ERβ, respectively. Table 3 lists the performance evaluation results of four SVM-based meta-classifiers in terms of both the cross-validation and independent tests. In the case of ERα, OPF provided a better performance than APF in terms of BACC, Sn, and MCC on the training dataset, while the performance of OPF was comparable to APF in terms of BACC (0.894 versus 0.989) and MCC (0.786 versus 0.792). Impressively, OPF performed better than APF in terms of both the cross-validation and independent tests for ERβ subtype. To be specific, on the independent test dataset, the BACC, MCC, and AUC values of OPF were 0.849, 0.712, and 0.974, which were 4.02, 7.10, and 7.37%, respectively, higher than APF. Therefore, we applied the OPF to develop SVM-based meta-classifiers (called StackER) for ERα and ERβ in the following studies.

Table 3 Cross-validation and independent test results of different feature representations.

Stacking strategy contributes to performance improvement

To verify the necessity of the stacking strategy in this study, we compared the performance of StackER against conventional ML classifiers for predicting inhibitors against ERα and ERβ. As mentioned above, these ML classifiers were independently developed using 8 ML methods cooperating with 9 molecular descriptors for each subtype. The performance results of all the ML classifiers are summarized in Supplementary Tables 2, 3, 4 and 5. In addition, we selected 5 top-ranked ML classifiers having high cross-validation MCC for conducting a comparative analysis herein (Fig. 2). As shown in Fig. 3 and Tables 4 and 5, the 5 top-ranked ML classifiers for predicting inhibitors against ERα consist of RF-Pubchem, RF-MACCS, RF-FP4C, GLM-KRC, and SVM-MACCS with respective MCC values of 0.785, 0.784, 0.780, 0.778, and 0.768 (Table 4), while the 5 top-ranked ML classifiers for predicting inhibitors against ERβ consist of RF-FP4C, RF-MACCS, RF-Pubchem, RF-AP2DC, SVM-MACCS with respective MCC values of 0.755, 0.736, 0.732, 0.730, and 0.719 (Table 5).

Figure 2
figure 2

Performance comparison of StackER and top-five prediction models for ERα (A,B) and ERβ (C,D) subtypes assessed by the tenfold cross-validation (A,C) and independent tests (B,D).

Figure 3
figure 3

MCC values of top-30 ML classifiers for ERα (A,B) and ERβ (C,D) assessed by the tenfold cross-validation (A,C) and independent tests (B,D).

Table 4 Performance comparison of StackER and top-five prediction models in identifying active and inactive compounds for ERα.
Table 5 Performance comparison of StackER and top-five prediction models in identifying active and inactive compounds for ERβ.

From Fig. 4 and Tables 4 and 5, several points can be observed: (i) StackER achieved a better performance in terms of all six performance metrics over the tenfold cross-validation test for both for ERα and ERβ. Specifically, StackER provided MCC values of 0.847 and 0.829 for ERα and ERβ, which were 6.18–7.95% and 7.41–10.99%, respectively; (ii) As for the independent test results, StackER performed better than almost all of the 5 top-ranked ML classifiers in terms of MCC, with the exception of SVM-MACCS for ERα. However, the performance of StackER was most comparable to that of SVM-MACCS (0.786 versus 0.796) for ERα. In addition, for ERβ, StackER significantly outperformed SVM-MACCS in terms of ACC, BACC, Sp, MCC, and AUC; (iii) StackER attained outstanding AUC values of 0.974 and 0.973 for ERα and ERβ, which were 6.18–7.95% and 7.41–10.99%, respectively; (iv) The PFs were able to create a clearer boundary between the two clusters (i.e., active and inactive) compared to Pubchem and MACCS, demonstrating that the information derived from the PFs is more crucial than conventional molecular descriptors for capturing the distinct patterns between active and inactive samples. Taken together, our comparative results reveal the effectiveness of the stacking strategy to the performance improvement.

Figure 4
figure 4

t-SNE distribution of our probabilistic features (PFs) and two informative conventional molecular descriptors for ERα (A–C) and ERβ (D–F) on the training dataset.

Performance comparison with the existing method

As mentioned above, ERpred21 is the only SMILE notation-based approach for predicting ERα and ERβ inhibitors. Since ERpred was not developed based on the up-to-date dataset constructed herein, we implemented ERpred using the same procedure as mentioned in the previous study. Table 6 shows the detailed performance comparison between StackER and ERpred. As can be seen from Table 6, for both ERα and ERβ, StackER is superior to ERpred in terms of almost all performance metrics, including ACC, BACC, Sp, MCC, and AUC, on both the training and independent test datasets. In particular, StackER outperformed ERpred as judged by the independent test results, with a 1.75–4.21% increase in Sp, 1.59–0.3. 54% increase in MCC, and 1.33–5.55% increase in AUC, thereby highlighting the effectiveness and robustness of StackER. Furthermore, as StackER attained impressive Sp and MCC values, it could be implied that our proposed model might effectively narrow down the number of candidate drugs against ERα and ERβ.

Table 6 Performance comparison of StackER and the existing method on the same training and independent test datasets.

Model interpretation and feature importance analysis

In this section, we utilized the SHAP method49 to assess the contribution of each feature on the prediction outputs and identify the most important feature that might be responsible for potential inhibitory effects against ER. Figure 5A, B showcases the top 20 most influential features of StackER for predicting ERα and ERβ inhibitors, respectively, where high and low SHAP values demonstrate that the prediction outputs favour active and inactive classes, respectively. The top-five base-classifiers that were important for predicting ERα and ERβ inhibitors involved (SVM-KR, MLP-MACCS, GLM-KRC, KNN-Pubchem, and RF-RDK5) and (PLS-KR, MLP-Pubchem, GLM-Pubchem, MLP-AP2DC, and MLP-MACCS), respectively. To gain a more profound understanding of the specific features for ERα and ERβ, we also applied the SHAP method to MLP-Pubchem. Figure 5C, D displays the top 20 crucial features for ERα and ERβ, respectively. Furthermore, the particulars of these analyzed substructure fragments, including their general structures and SMARTs patterns, can be found in Table 7.

Figure 5
figure 5

Feature importance analysis based on the SHAP method for StackER (A,B) and MLP-Pubchem (C,D). The impact of each feature on the identification of inhibitors against ERα (A,C) and ERβ (A,C). Mean absolute SHAP values, where positive and negatives SHAP values influences the predictions toward positive and negative samples, respectively.

Table 7 Top 10 important features for ERα and ERβ as determined by SHAP method.

Upon comparing the top 20 important features for ERα and ERβ, we observed that the two subtypes shared five common features, namely Pubchem697, Pubchem667, Pubchem696, Pubchem308, and Pubchem450, which correspond to 2-methylheptane, prop-2-en-1-ol, octane, hydroxyl group, and formimidamide (Table 7). Notably, among these, Pubchem697 and Pubchem667 exhibited a significant impact on both subtypes as ER inhibitors, as indicated by SHAP (Fig. 5C, D). Interestingly, Pubchem697, representing 2-methylheptane, a branched alkane isomeric to octane (i.e., Pubchem696), showed a high SHAP value only for ERα. This feature was also emphasized in our previous work on ERpred21, further underscoring its significance. Researchers observed that in derivatives of tamoxifen, a well-known ER inhibitor, the elongated alkyl side chains led to the degradation of ER50. In addition, researchers devised a set of diphenylalkane derivatives, incorporating several elongated alkyl chains linked to the hydroxyl group. Subsequently, they assessed the compounds' biological characteristics, encompassing their effects on ER degradation, anti-proliferative properties, transcriptional activity, and binding affinity51. Furthermore, in their analysis of the novel compound docking, the scientists observed the interaction between the carboxylic acid of Glu351 in ERα and the hydrogen atom bound to nitrogen. This interaction served as the foundation for the bonding between the ERα hydrophobic groove and the elongated alkyl chain. Consequently, the essential factors contributing to the downregulation of ERα can be attributed to both the nitrogen group and the diphenylheptane with a specific length of extended alkyl chain52.

Pubchem667, corresponding to prop-2-en-1-ol, was found to be a potent ER antagonist in a study conducted by Anita et al.53. Their research focused on examining the apoptosis in human MCF-7 breast cancer cells and the inhibition of cell proliferation induced by an analogue of Eugenol (4-[4-hydroxy-3-(prop-2-en-1-yl) phenyl]-2-(prop-2-en-1-yl)). Additionally, in the work of Reddy et al.54, various compounds containing the prop-2-en-1-ol substructure were tested in vitro, demonstrating their strong efficacy across a broad spectrum of human tumor cell lines, including MCF-7, which is an ER-positive breast cancer cell line. Pubchem308, as shown in Fig. 5C, D and detailed in Table 7, represents a hydroxyl group that gains significance when it is a part of other significant molecular structures, such as bisphenol A (BPA), bisphenol C (BPC), and bisphenol P (BPP). These compounds have been identified as endocrine-disrupting chemicals55. The authors of these studies demonstrated that ERα-related transcriptional activity is dependent on the existence of the 4-hydroxyl group in both the A-phenyl and B-phenyl rings of BPA derivatives, which clearly exhibits ER inhibitory effects both in laboratory experiments and in living organisms56,57.

Furthermore, nitrogen-containing characteristics, specifically Pubchem391, Pubchem345, and Pubchem375, with high SHAP values, were found to be more prominent in ERα. Conversely, alcohol-containing features, like Pubchem777, Pubchem590, and Pubchem617, had a greater impact on ERβ (Fig. 5C, D and Table 7). The mentioned nitrogen-containing features, associated with N,N-dimethylmethanamine, ethanamine, and methanediamine, respectively, serve as precursors for several significant ER inhibitors, including tamoxifen, 4-hydroxy tamoxifen, raloxifene, and their analogues58,59,60,61. In addition, Makar et al.62, highlighted the importance of the N,N-dimethylamine side chain in the triphenylethylene-based ER inhibitor tamoxifen. This side chain altered the conformation of helix-12 and inhibited co-activator binding, underscoring its significance in ER inhibition.

The most prominent feature for ERα was identified as fluoromethane (Pubchem287; see Fig. 5C and Table 7), a compound known for its ability to notably enhance various pharmaceutical properties, including potency, metabolic stability, hydrogen bonding, improved binding interactions, and pharmacokinetics, across a range of medications63. Furthermore, the inclusion of fluorine atoms in about 20–25% of known drugs highlights the element's importance in medicinal chemistry64,65. Scott et al.66 conducted a study on the impact of fluorinated analogues on a clinical SERD candidate, and they concluded that the resulting molecules exhibited high quality and advanced profiling stages. Recently, Lu et al.63 reported their work on designing and synthesizing fluorinated SERDs based on the clinical drug candidate G1T48 (NCT03455270). Their findings suggested that introducing fluorine atom substitutions into SERDs enhanced overall therapeutic effectiveness, making them superior clinical candidates for orally treating ER-positive breast cancer. As for the top feature for ERβ, Pubchem777 (Fig. 5D and Table 7), which relates to 4-methylcyclohexanol, when oxidized to cyclohexanone and its derivatives, serves as a valuable scaffold in the development of anticancer agents67 Such compounds have the potential to act as potent inhibitors of tamoxifen-resistant MCF-7 cancer cells68,69. Consequently, the presence of these top features for ERα and ERβ inhibitors underscores the capability of our StackER model to discern the features of significant importance in the field of medicinal chemistry.

Case study: potential ER inhibitors from FDA-approved drugs

In this section, we applied our StackER model to identify promising ERα inhibitors among existing approved drugs, seeking to maximize therapeutic benefits while minimizing the risks of toxicity. We obtained the data from the DrugBank and applied various filtering criteria, as outlined in the “Materials and methods” section. Following this filtering process, we had a total of 1,737 compounds available for predicting their potential as ER inhibitors using our StackER model. In this context, our primary focus was on identifying potential inhibitors for ERα, as the role of ERβ in breast cancer is intricate and subject to ongoing debate. The results of our predictions for the top 15 potential ERα inhibitors, as determined by our developed model, are presented in Table 8. Notably, among these top 15 compounds, six are directly associated with ERα treatment, including SERMs, SERDs, or substrates of BC resistance proteins. The remaining eight compounds consist of diverse medications, such as antidepressants, antihistamines, anti-cancer agents, and anti-COVID agents.

Table 8 Probability, docking scores, and description of 15 selected FDA-approved drugs against ERα as deduced from our StackER model.

With this in mind, we conducted docking simulations for all of the top compounds using Autodock Vina with the default parameters. The five compounds with the highest docking scores were identified, and their interactions with ERα were further investigated (as shown in the Fig. 6 and Table 8), comparing them to the co-crystal ligand, tamoxifen (OHT). Tamoxifen is a widely recognized SERM used for breast cancer treatment70,71 with a long list of side-effects72. It forms hydrogen bonds (H-bonds) with Glu353 and Arg394 at distances of 3 Å and 1.98 Å, respectively (as depicted in Fig. 6A). In a similar manner, the top-docked compound, lasofoxifene, with a docking score of −11.6 kcal/mol, is a non-steroidal SERM73 and also establishes H-bonds with Glu353 and Arg394 at distances of 2.05 Å and 2.02 Å, respectively (illustrated in Fig. 6B).

Figure 6
figure 6

Binding interactions of ERα with OHT (A) and the top 5 FDA-approved drugs—Lasofoxifene (B), Lomitapide (C), Bazedoxifene (D), Berotralstat (E), and Raloxifene (F). Residues forming hydrogen bonds are represented in dark green and light green colors while residues forming pi-sigma, pi-alkyl, pi-sulfur and halogen interactions are depicted in purple, pink, orange and blue colors, respectively.

It's worth noting that the docking score for OHT was −9.6 kcal/mol. However, OHT did not rank among the top 15 of the predicted potential compounds. This discrepancy may be attributed to the fact that OHT was discovered a long time ago, and our model is trained on the latest data, which includes newer generations of more potent SERMs. In addition, Lainé et al.73, recently discovered that lasofoxifene has the potential to treat mutant types of ER+ metastatic breast cancer. Additionally, among the top 5 docked candidates, three are non-steroidal SERMs (i.e., lasofoxifene, bazedoxifene, and raloxifene). The remaining two, lomitapide and berotralstat, are a lipid-lowering drug and a kallikrein inhibitor, respectively. These two compounds could be strong candidates for drug repurposing.

Lomitapide, initially developed for the treatment of a rare genetic disorder known as familial hypercholesterolemia74, achieved a docking score of −11.2 kcal/mol. Figure 6C illustrates the docking interactions of lomitapide with ERα. Notably, lomitapide features two trifluoromethyl groups linked to a nitrogen atom at one end and a carbon atom at the other end, establishing interactions with Ala350, Asp351, Glu419, Glu420, and Gly521, respectively. The substitution of fluorine has been extensively explored in drug design and development to enhance biological activity, metabolic or chemical reactivity, and metabolic or chemical stability75. This is primarily attributed to the properties exhibited by fluorine, including lipophilicity, electronegativity, electrostatic interactions, and size76. Additionally, lomitapide forms one conventional H-bond with His524 and two carbon-based H-bonds with Asp351 and Gly521. In the work by Zuo et al.77, the anti-cancer effects of lomitapide were observed in colorectal cancer. Similarly, in a study by Lee et al.78, the authors revealed that lomitapide induces autophagy-dependent cell death in HCT116 colorectal cancer cells. More recently, Wang et al.79, demonstrated that lomitapide has the ability to inhibit a key enzyme responsible for the downstream proliferation of pancreatic cancer cells. Moreover, the impressive anti-tumor properties of lomitapide were demonstrated in triple-negative breast cancer (TNBC) cell lines, where researchers observed substantial induction of apoptosis, diminished capacity of TNBC cells to form spheres and colonies while also hindering cell cycle progression80.

In contrast, berotralstat has received approval for its use in hereditary angioedema, a rare genetic disorder characterized by recurrent, unpredictable episodes of swelling that affect subcutaneous or submucosal tissues. Berotralstat is an orally administered synthetic small-molecule inhibitor targeting a serine protease called plasma kallikrein. The stimulation of plasma kallikrein leads to the plasma kallikrein/kinin system activation and enhancement. This activation plays a role in the classical complement cascade pathway, the alternative complement pathway, and blood coagulation81,82,83,84. Nevertheless, no previous reports have documented the anti-cancer properties of berotralstat. In our study, berotralstat emerged as one of the top 5 candidates in both the prediction and docking studies for its potential as an ERα inhibitor, boasting a probability score of 0.7008 and a docking score of -10.2 kcal/mol (as shown in Table 8). Figure 6E displays the interacting residues of berotralstat with ERα. It is notable that berotralstat forms two conventional H-bonds with Gly420 and Leu536, while also establishing pi-sigma and pi-sulfur interactions with Leu525 and Met343, respectively. Furthermore, akin to lomitapide, berotralstat contains a trifluoromethyl group, which interacts with Glu353 and Leu387. This further underscores the significance of fluorine as a sidechain substitution.

It's worth mentioning that the most crucial feature for ERα, as determined by our StackER model and SHAP analysis (as mentioned in the previous section and shown in Fig. 5C and Table 7), was Pubchem287, corresponding to fluoromethane. Consequently, these findings shed light on the potential repurposing of lomitapide and berotralstat as novel therapeutic options for the treatment of ERα-induced cancers.

Conclusions

In this study, a novel SMILES-based stacked ensemble learning approach, terms StackER, is developed for the accelerated and accurate identification of inhibitors against ERα and ERβ. First, we collected an up-to-date dataset from the ChEMBL database to develop an efficient and effective prediction model. Second, we trained and evaluated several ML classifiers trained with eight ML algorithms combined with nine molecular descriptors. Finally, an optimized stacked approach was constructed based on the combination of selected ML classifiers derived from the two-step feature selection method. Experimental results based on the cross-validation and independent tests, highlighted the effectiveness and robustness of StackER by outperforming the existing method (i.e., ERpred) and several conventional ML classifiers. Three important factors can be attributed to the performance improvement of our developed model: (i) StackER is optimized based on the up-to-date dataset having a larger sample size; (ii) StackER takes advantage of several state-of-the-art ML algorithms and molecular descriptors; and (iii) StackER is developed using the ensemble learning strategy along with the two-step feature selection method. We anticipate that StackER will provide useful insights for the accelerated and large-scale discovery of high potential breast cancer drugs and inspire follow-up research in the future. Although StackER has attained superior predictive performance in comparison to several conventional ML classifiers and the existing method, it still has a few shortcomings, which can be addressed in follow-up works. The first point pertains to developing a two-layer prediction framework that is capable of identifying ERα and ERβ inhibitors (actives or inactives) as well as the inhibitory activity against ERα and ERβ (IC50 bioactivity). The second point is to utilize efficient molecular representation learning (MRL), such as Mol2vec85, geometry-enhanced MRL86 and self-supervised pretrained learning87 strategies. The last point pertains to incorporating StackER with novel ML frameworks, such as a pre-trained language model88 and DL-based framework25,89.