Introduction

The BCR-ABL tyrosine kinase (TK) oncoprotein is present in 95% of patients suffering from chronic myeloid leukemia (CML). Tyrosine kinase inhibitors (TKIs) have therefore been used in the therapy of most CML patients, with imatinib being the first drug directed against the BCR-ABL TK. Imatinib competitively targets the ATP-binding site in the TK domain of the BCR-ABL oncoprotein and reduces its activity. Owing to point mutations in the BCR-ABL kinase domain, some patients, particularly those in the advanced phases of CML, develop imatinib resistance. To overcome this resistance, novel analogues of imatinib such as ponatinib, nilotinib, dasatinib, and bosutinib have been developed as TKIs and tested in patients with BCR-ABL-positive CML. Hence, the design and development of more potent BCR-ABL TKIs, specifically imatinib derivatives, is of great importance and would help in the treatment of CML patients1,2,3,4,5.

Quantitative structure–activity relationship (QSAR) modelling is an approach that can be applied to the construction of pharmacophore models, the discovery of new drugs, and the assessment of the activity or behaviour of compounds6,7,8. QSAR is a predictive and diagnostic process for finding quantitative relationships between chemical structures and a biological activity or property. A QSAR model is the final outcome of computational methods that begin with an appropriate description of the molecular structure and conclude with interpretations, assumptions, and judgments on the behaviour of molecules in the biological and physicochemical systems under examination9,10. The main goal of QSAR model development is to find a class of molecular descriptors that reflects variations in the structural properties of the molecules.

The Monte Carlo algorithm of the CORrelation And Logic (CORAL) software has been applied to QSAR modelling of different endpoints11,12,13,14,15. Random distribution of the dataset into training and validation subsets, generation of optimal descriptors of correlation weights (DCW), and construction of predictive models that take into account the physicochemical conditions of the corresponding experiments are distinctive options available in the CORAL software for the development of QSAR models16,17,18,19,20,21,22. The literature survey shows that the Index of Ideality of Correlation (IIC) has been applied to improve the statistical quality of QSAR models23,24,25,26,27,28. In addition, most descriptors used in common QSAR models have no physical meaning and cannot be associated with a mechanistic interpretation. By contrast, QSAR models developed with the CORAL software rely on molecular descriptors based on SMILES notation, which allow a mechanistic interpretation and can be associated with molecular fragments.

The objective of the present work is to apply the inbuilt Monte Carlo algorithm of the CORAL software to build QSAR models that predict the inhibition potencies (pIC50) of 306 Imatinib derivatives against BCR-ABL tyrosine kinase (TK). The balance of correlation method with IIC is used to develop the QSAR models. The reliability and predictability of the designed QSAR models are assessed with three random splits.

Method

Data

Zin et al.29 extracted the inhibition potential of 306 compounds against the human BCR-ABL tyrosine kinase from the ChEMBL v23 (2017) database30. The inhibition potential of the compounds was defined as the half maximal inhibitory concentration in mol/L (IC50). The experimental inhibition data for BCR-ABL tyrosine kinase were then transformed to the negative logarithm (pIC50). The endpoint pIC50 was taken as the dependent parameter for constructing the QSAR models. The pIC50 values ranged from 4.03 to 9.37. Three splits were created from the dataset (n = 306), and the compounds of each split were randomly divided into the training (34%), invisible training (35%), calibration (15%) and validation (16%) sets. The SMILES notations, split distribution, experimental pIC50, predicted pIC50, and applicability domain of each compound are given in Table S1. The task of each set in developing the QSAR models has already been described in the literature31,32.
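The data preparation described above can be sketched in a few lines of Python. This is only an illustration, not the procedure implemented in CORAL: the function names, the rounding rule for the set sizes, and the random seed are assumptions introduced here.

```python
import math
import random


def pic50_from_ic50(ic50_molar):
    """Convert an IC50 given in mol/L to pIC50 = -log10(IC50)."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def random_split(n_compounds, fractions=(0.34, 0.35, 0.15, 0.16), seed=0):
    """Randomly assign compound indices to the four sets.

    The fractions mirror the percentages quoted in the text; the rounding
    rule (remainder goes to the validation set) is an assumption.
    """
    names = ("training", "invisible_training", "calibration", "validation")
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    sizes = [round(f * n_compounds) for f in fractions[:-1]]
    sizes.append(n_compounds - sum(sizes))
    split, start = {}, 0
    for name, size in zip(names, sizes):
        split[name] = indices[start:start + size]
        start += size
    return split


if __name__ == "__main__":
    print(pic50_from_ic50(1e-9))  # an IC50 of 1 nM
    print({k: len(v) for k, v in random_split(306).items()})
```

Repeating `random_split` with three different seeds would give three independent splits of the kind used in this work.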

Optimal SMILES-based descriptors

In the CORAL software, three types of optimal descriptors i.e. SMILES-based, graph-based and hybrid descriptors (combination of SMILES and Graph) can be employed to develop QSAR models.

The optimal descriptor is a mathematical function of so-called correlation weights (CW). Correlation weights are numerical coefficients associated with various molecular features extracted from SMILES symbols. In other words, the univariate models investigated in this research are based on the “descriptors of correlation weights” (DCW). The Monte Carlo algorithm was used to calculate the DCW. In the present research, the SMILES-based descriptor was employed to build the QSAR models. The optimal descriptors used to build the pIC50 models are calculated as follows:

$$\mathrm{DCW}\left({\mathrm{T}}^{*}, {\mathrm{N}}^{*}\right)={}^{\mathrm{SMILES}}\mathrm{DCW}\left({\mathrm{T}}^{*}, {\mathrm{N}}^{*}\right)$$
(1)
$${}^{\mathrm{SMILES}}\mathrm{DCW}\left({\mathrm{T}}^{*}, {\mathrm{N}}^{*}\right)=\sum \mathrm{CW}\left({\mathrm{SSS}}_{\mathrm{K}}\right)+\mathrm{CW}\left(\mathrm{HALO}\right)+\mathrm{CW}\left(\mathrm{NOSP}\right)+\mathrm{CW}\left(\mathrm{HARD}\right)+\mathrm{CW}\left(\mathrm{PAIR}\right)+\mathrm{CW}\left({\mathrm{C}}_{\mathrm{max}}\right)+\mathrm{CW}\left({\mathrm{N}}_{\mathrm{max}}\right)+\mathrm{CW}\left({\mathrm{O}}_{\mathrm{max}}\right)$$
(2)

Here, T denotes the threshold and N denotes the number of epochs. T is an integer used to split the SMILES attributes (i.e. Sk, SSk, and SSSk) into two classes, active and rare. If a molecular attribute A occurs fewer than T times in the SMILES of the training set, it is classified as rare and omitted from the construction of the model, i.e. its correlation weight is set to CW(A) = 0. T* and N* are the numerical values of T and N that yield the best statistical result of a model for the calibration set.

The details of the notation used in Eq. (2) are as follows: SSSk, a local SMILES attribute, is a combination of three SMILES atoms; NOSP, HALO, and BOND are global SMILES attributes: NOSP indicates the presence or absence of nitrogen (N), oxygen (O), sulfur (S), and phosphorus (P); HALO indicates the presence or absence of fluorine, chlorine, and bromine; BOND indicates the presence or absence of double (‘=’), triple (‘#’) and stereochemical (‘@’ or ‘@@’) bonds; PAIR denotes the combination of BOND and NOSP; HARD indicates the presence or absence of NOSP, HALO, and BOND; Cmax represents the maximum number of rings; Nmax and Omax are the total numbers of nitrogen and oxygen atoms in the molecular structure. CW(A) is the correlation weight of a SMILES attribute, e.g. SSSk, NOSP, BOND, HALO, PAIR, Cmax, Nmax, and Omax. These correlation weights are calculated using the Monte Carlo optimization33,34,35,36,37.

The obtained numerical data in terms of DCW is used to determine the inhibition potential for Imatinib derivatives (pIC50) by the least square method using the following one-variable model:

$${pIC}_{50}={\mathrm{C}}_{0}+{\mathrm{C}}_{1}\times \mathrm{DCW}\left({\mathrm{T}}^{*}, {\mathrm{N}}^{*}\right)$$
(3)
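Equations (1)–(3) can be illustrated with a minimal Python sketch. It assumes that the SMILES attributes of each compound have already been extracted into a list of strings, that the threshold T is applied to the total number of occurrences in the training set, and that the correlation weights are supplied from elsewhere (in CORAL they come from the Monte Carlo optimization). All function names are hypothetical.

```python
from collections import Counter


def active_attributes(training_attribute_lists, threshold):
    """Attributes occurring at least `threshold` times in the training set.

    Counting total occurrences (rather than compounds) is an assumption.
    """
    counts = Counter(a for attrs in training_attribute_lists for a in attrs)
    return {a for a, n in counts.items() if n >= threshold}


def dcw(attributes, weights, active):
    """Sum of correlation weights; rare or unknown attributes contribute 0."""
    return sum(weights.get(a, 0.0) for a in attributes if a in active)


def fit_line(x, y):
    """Least-squares estimate of (C0, C1) in y = C0 + C1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    c1 = sxy / sxx
    return mean_y - c1 * mean_x, c1


if __name__ == "__main__":
    # Toy attribute lists and weights, for illustration only.
    train = [["c...c...c...", "NOSP"], ["c...c...c..."], ["HALO"]]
    weights = {"c...c...c...": 1.5, "NOSP": 0.4, "HALO": -0.2}
    active = active_attributes(train, threshold=2)
    print([dcw(attrs, weights, active) for attrs in train])
```

With the DCW values of the training compounds as `x` and their experimental pIC50 as `y`, `fit_line` returns the coefficients C0 and C1 of Eq. (3).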

Monte Carlo optimization

In the present research, a modified target function (TFm), i.e. the balance of correlation with IIC, was employed to compute the DCW32. The following mathematical relationships are used to compute TFm:

$$TF={R}_{training}+{R}_{invTraining}-\left|{R}_{training}-{R}_{invTraining}\right|\times Const$$
(4)
$${TF}_{m}=TF+{IIC}_{CAL} \times Const$$
(5)

Here, Rtraining and RinvTraining indicate the correlation coefficients for the training and invisible training sets, respectively. The empirical constant (Const) is usually fixed.
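As a small worked illustration of Eqs. (4) and (5), the sketch below evaluates TF and TFm for given correlation coefficients. It assumes that the constant in Eq. (4) corresponds to the weight for dR (dRweight = 0.1) and the constant in Eq. (5) to the weight of IIC (IICweight = 0.2), the values reported in the Results section; the input correlation coefficients are invented.

```python
def target_function(r_training, r_inv_training, d_r_weight=0.1):
    """Balance of correlations, Eq. (4)."""
    return (r_training + r_inv_training
            - abs(r_training - r_inv_training) * d_r_weight)


def modified_target_function(r_training, r_inv_training, iic_cal,
                             d_r_weight=0.1, iic_weight=0.2):
    """Balance of correlations with the IIC term, Eq. (5)."""
    tf = target_function(r_training, r_inv_training, d_r_weight)
    return tf + iic_cal * iic_weight


if __name__ == "__main__":
    # Hypothetical correlation coefficients and IIC value.
    print(modified_target_function(0.90, 0.80, iic_cal=0.70))
```

The penalty term lowers TF when the training and invisible training sets disagree, while the IIC term rewards models that also behave well on the calibration set.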

The index of ideality of correlation for the calibration set (IICCAL) is calculated using the following equations:

$${\mathrm{IIC}}_{\mathrm{CAL}}={\mathrm{R}}_{\mathrm{CAL}}\times \frac{\mathrm{min}({}^{-}{\mathrm{MAE}}_{\mathrm{CAL}}, {}^{+}{\mathrm{MAE}}_{\mathrm{CAL}})}{\mathrm{max}({}^{-}{\mathrm{MAE}}_{\mathrm{CAL}}, {}^{+}{\mathrm{MAE}}_{\mathrm{CAL}})}$$
(6)
$${}^{-}{\mathrm{MAE}}_{\mathrm{CAL}}=\frac{1}{{}^{-}\mathrm{N}}\sum_{\mathrm{k}=1}^{{}^{-}\mathrm{N}} \left|{\Delta }_{\mathrm{k}}\right| \quad {\Delta }_{\mathrm{k}}<0, {}^{-}\mathrm{N \, is \, the \, number \, of } \, {\Delta }_{\mathrm{k}}<0$$
(7)
$${}^{+}{\mathrm{MAE}}_{\mathrm{CAL}}=\frac{1}{{}^{+}\mathrm{N}}\sum_{\mathrm{k}=1}^{{}^{+}\mathrm{N}}\left|{\Delta }_{\mathrm{k}}\right| \quad {\Delta }_{\mathrm{k}}\ge 0, {}^{+}\mathrm{N \, is \, the \, number \, of } \, {\Delta }_{\mathrm{k}}\ge 0$$
(8)
$${\Delta }_{\mathrm{k}}={\mathrm{Observed}}_{\mathrm{k}}-{\mathrm{Calculated}}_{\mathrm{k}}$$
(9)

Here, k is the compound index (k = 1, 2, …, N), and Observedk and Calculatedk are the observed and calculated values of the endpoint for the k-th compound.
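The IIC calculation can be sketched as follows. The sketch assumes the usual form of the index, in which the mean absolute errors are taken separately over the negative and the non-negative residuals; returning 0 when one of the two groups is empty is a convention chosen here, not something stated in the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def iic(observed, calculated):
    """Index of ideality of correlation for one set of compounds."""
    deltas = [o - c for o, c in zip(observed, calculated)]
    negative = [abs(d) for d in deltas if d < 0]
    non_negative = [abs(d) for d in deltas if d >= 0]
    if not negative or not non_negative:
        return 0.0  # assumption: index undefined, treated as 0
    mae_minus = sum(negative) / len(negative)
    mae_plus = sum(non_negative) / len(non_negative)
    ratio = min(mae_minus, mae_plus) / max(mae_minus, mae_plus)
    return pearson_r(observed, calculated) * ratio


if __name__ == "__main__":
    # Toy observed and calculated values, for illustration only.
    print(iic([1.0, 2.0, 3.0], [2.0, 1.5, 2.5]))
```

The index equals the correlation coefficient only when positive and negative residuals are balanced; an asymmetric error distribution reduces it.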

Applicability domain

According to the third OECD principle, a QSAR model should have a defined applicability domain (AD). The AD is the physicochemical, structural, or biological space, knowledge, or information on which the training set of the model was built and within which the model can be used to make predictions for new compounds38,39.

In Monte Carlo-based QSAR modelling with the CORAL program, the distribution of SMILES attributes in the training, invisible training and calibration sets is used to define the AD40,41. If a substance does not fall within the AD, it is identified as an outlier and its prediction cannot be considered reliable.

In CORAL, a compound is recognized in the scope of AD if the following inequality is fulfilled, otherwise, it is recognized as an outlier:

$${\mathrm{Defect}}_{\mathrm{molecule}} <2\times {\overline{\mathrm{Defect}} }_{TRN}$$
(10)

where \({\overline{\mathrm{Defect}} }_{\mathrm{TRN}}\) is an average of the statistical defect (D) for the dataset of the training set.

The statistical defect (D) can be described as the sum of statistical defects of all attributes present in the SMILES notation.

$${\mathrm{Defect}}_{\mathrm{Molecule}}=\sum_{\mathrm{k}=1}^{{\mathrm{N}}_{\mathrm{A}}}{\mathrm{Defect}}_{{\mathrm{A}}_{\mathrm{K}}}$$
(11)

NA is the number of active SMILES attributes for the given compound.

The “statistical defect”, Defect(A), for an attribute of SMILES can be defined by the following mathematical equation:

$${\mathrm{Defect}}_{{\mathrm{A}}_{\mathrm{K}}}=\frac{\left|{\mathrm{P}}_{\mathrm{TRN}}({\mathrm{A}}_{\mathrm{K}})-{\mathrm{P}}_{\mathrm{CAL}}({\mathrm{A}}_{\mathrm{K}})\right|}{{\mathrm{N}}_{\mathrm{TRN}}({\mathrm{A}}_{\mathrm{K}})+{\mathrm{N}}_{\mathrm{CAL}}({\mathrm{A}}_{\mathrm{K}})} \quad \mathrm{ If }{\mathrm{A}}_{\mathrm{K}}>0$$
(12)
$${\mathrm{Defect}}_{{\mathrm{A}}_{\mathrm{K}}}=1 \quad \mathrm{ If }{\mathrm{A}}_{\mathrm{K}}=0$$

\({P}_{TRN}({A}_{K})\) and \({P}_{CAL}({A}_{K})\) are the probabilities of an attribute Ak in the training and calibration sets, respectively; \({N}_{TRN}({A}_{K})\) and \({N}_{CAL}({A}_{K})\) are the numbers of occurrences of Ak in the training and calibration sets, respectively.
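Equations (10)–(12) can be illustrated with the following sketch. Two points are assumptions made here rather than statements from the text: the probability of an attribute is taken as its number of occurrences divided by the number of compounds in the set, and the condition AK > 0 is read as the attribute occurring at least once in the training or calibration set. The function names and numbers are hypothetical.

```python
def attribute_defect(n_trn, n_cal, size_trn, size_cal):
    """Statistical defect of one SMILES attribute, Eq. (12)."""
    if n_trn + n_cal == 0:
        return 1.0  # attribute absent from both sets
    p_trn = n_trn / size_trn  # assumed definition of the probability
    p_cal = n_cal / size_cal
    return abs(p_trn - p_cal) / (n_trn + n_cal)


def molecule_defect(attributes, counts_trn, counts_cal, size_trn, size_cal):
    """Sum of the attribute defects of one compound, Eq. (11)."""
    return sum(
        attribute_defect(counts_trn.get(a, 0), counts_cal.get(a, 0),
                         size_trn, size_cal)
        for a in attributes
    )


def in_domain(defect_molecule, mean_defect_training):
    """Applicability domain criterion, Eq. (10)."""
    return defect_molecule < 2 * mean_defect_training


if __name__ == "__main__":
    # Toy occurrence counts for a 100-compound training set and a
    # 50-compound calibration set.
    counts_trn = {"c...c...c...": 80, "NOSP": 60}
    counts_cal = {"c...c...c...": 42, "NOSP": 25}
    d = molecule_defect(["c...c...c...", "NOSP"], counts_trn, counts_cal, 100, 50)
    print(d, in_domain(d, mean_defect_training=0.01))
```

A compound containing an attribute that was never seen during model building receives a defect of at least 1 and is therefore very likely to be flagged as an outlier.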

Validation of the model

The statistical quality of the QSAR models built for the pIC50 of Imatinib derivatives was evaluated with three approaches: (i) internal validation or cross-validation, by determining R2, IIC, CCC, Q2, and the F-test on the training set; (ii) external validation, by determining Q2F1, Q2F2, Q2F3, CRp2, s, MAE, r̅m2, and Δrm2 on the test set compounds; and (iii) data randomization or Y-scrambling (Table 1). The mathematical relationships of these statistical parameters have been provided in the literature42,43,44,45,46. In Table 1, Yobs is the observed endpoint; Yprd is the predicted endpoint; R2 and \({R}_{0}^{2}\) are the squared correlation coefficients between the observed and predicted endpoints with and without intercept, respectively; and \({R}_{r}^{2}\) is the squared mean correlation coefficient of the randomized models.
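Some of the external validation metrics named above can be sketched in Python. The definitions below are the ones commonly used in the QSAR literature for Q2F1, Q2F2, Q2F3 and the concordance correlation coefficient (CCC); it is assumed, not verified, that they coincide with the expressions in Table 1.

```python
def _sum_sq(values, center):
    return sum((v - center) ** 2 for v in values)


def _press(y_obs, y_pred):
    return sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))


def q2_f1(y_ext, y_pred, y_train):
    """1 - PRESS / SS of the external data around the training-set mean."""
    mean_train = sum(y_train) / len(y_train)
    return 1 - _press(y_ext, y_pred) / _sum_sq(y_ext, mean_train)


def q2_f2(y_ext, y_pred):
    """1 - PRESS / SS of the external data around the external-set mean."""
    mean_ext = sum(y_ext) / len(y_ext)
    return 1 - _press(y_ext, y_pred) / _sum_sq(y_ext, mean_ext)


def q2_f3(y_ext, y_pred, y_train):
    """1 - (PRESS / n_ext) / (training-set SS / n_train)."""
    mean_train = sum(y_train) / len(y_train)
    num = _press(y_ext, y_pred) / len(y_ext)
    den = _sum_sq(y_train, mean_train) / len(y_train)
    return 1 - num / den


def ccc(y_obs, y_pred):
    """Concordance correlation coefficient."""
    n = len(y_obs)
    mo, mp = sum(y_obs) / n, sum(y_pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    return 2 * sxy / (_sum_sq(y_obs, mo) + _sum_sq(y_pred, mp)
                      + n * (mo - mp) ** 2)


if __name__ == "__main__":
    # Toy values, for illustration only.
    y_train, y_ext, y_pred = [0.0, 2.0, 4.0], [1.0, 2.0, 3.0], [1.5, 2.0, 2.5]
    print(q2_f1(y_ext, y_pred, y_train), q2_f2(y_ext, y_pred),
          q2_f3(y_ext, y_pred, y_train), ccc(y_ext, y_pred))
```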

Table 1 The mathematical equation of different statistical benchmark of the predictive potential for CORAL models.

Results and discussion

QSAR models

Using the data described in the “Data” section, three splits were generated randomly. Each split was further divided into four sets, namely the training, invisible training, calibration and validation sets. To establish the QSAR models, the balance of correlation with the IIC technique was employed. The values of IICweight (weight of IIC) and dRweight (weight for dR in the balance of correlations) were 0.2 and 0.1, respectively. The preferable values of T* and N* were 1 and 15 for all splits. With these values of T* and N*, the pIC50 (endpoint) for each split was computed, and the developed QSAR models are as follows:

$$\mathrm{Split }1\quad {pIC}_{50}=3.6679\left(\pm 0.0196\right)+0.2889(\pm 0.0016)\times DCW(1, 15)$$
(13)
$$\mathrm{Split }2\quad {pIC}_{50 }=1.5438\left(\pm 0.0259\right)+0.2660(\pm 0.0017)\times DCW(1, 15)$$
(14)
$$\mathrm{Split }3\quad {pIC}_{50 }=3.4165\left(\pm 0.0126\right)+0.2696(\pm 0.0010)\times DCW(1, 15)$$
(15)
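Once the DCW(1, 15) value of a compound is known, Eqs. (13)–(15) reduce to a one-line calculation. The sketch below copies the reported coefficients; the DCW value used in the example is invented for illustration.

```python
# (intercept C0, slope C1) for each split, copied from Eqs. (13)-(15)
MODELS = {
    1: (3.6679, 0.2889),
    2: (1.5438, 0.2660),
    3: (3.4165, 0.2696),
}


def predict_pic50(split, dcw_value):
    """Predicted pIC50 = C0 + C1 * DCW(1, 15) for the chosen split."""
    c0, c1 = MODELS[split]
    return c0 + c1 * dcw_value


if __name__ == "__main__":
    # Hypothetical DCW value of 10.0 evaluated with the split 1 model.
    print(round(predict_pic50(1, 10.0), 4))
```

Because the correlation weights are optimized separately for each split, a DCW value is only meaningful together with the model of the same split.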

The statistical characteristics of the QSAR models given by Eqs. (13)–(15) are listed in Table 2. The results in Table 2 demonstrate that all generated QSAR models are statistically adequate and satisfy the various validation criteria. The robustness of the established QSAR models was demonstrated by the R2 and Q2 values, which were above the thresholds of 0.5 and 0.7 reported in the literature47,48. In addition, the R2m metric for the validation set of all designed QSAR models was satisfactory and followed the criteria suggested by Roy et al.49. The \({\overline{R} }_{m}^{2}\)-scaled and \({\Delta R}_{m}^{2}\)-scaled metrics, introduced by Roy et al. as modified R2m metrics50, were also computed; their values were 0.6928 and 0.0216, 0.6878 and 0.0929, and 0.7339 and 0.1230 for splits 1 to 3, respectively. The trustworthiness of the constructed QSAR models was also confirmed by the Y-randomization test.

Table 2 The summary statistical characteristics and criteria of predictability of the QSAR models for three random splits.

In the Y-randomization test, new random models were developed over several repetitions, and their R2 values were found to be below 0.1 (see Table S2 in the supplementary information). This result indicates that the correlation between pIC50 and the molecular attributes is not due to chance. Moreover, for all three splits, CR2p was greater than 0.75, which confirmed the non-chance correlation of the developed models51.
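The logic of the Y-randomization test can be sketched as follows. This is a simplification: the endpoint values are shuffled against a fixed descriptor and only R2 is recomputed, whereas a full test would repeat the Monte Carlo optimization for each scrambled dataset. The expression used for CR2p, R × sqrt(R2 − Rr2), is the form commonly attributed to Roy and co-workers and is assumed here to match Table 1.

```python
import math
import random


def r_squared(x, y):
    """Squared Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def mean_scrambled_r2(x, y, n_trials=100, seed=0):
    """Average R2 obtained after randomly permuting the endpoint values."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        shuffled = list(y)
        rng.shuffle(shuffled)
        total += r_squared(x, shuffled)
    return total / n_trials


def c_rp2(r2, r2_random_mean):
    """CR2p = R * sqrt(R2 - Rr2); clipped at zero if Rr2 exceeds R2."""
    return math.sqrt(r2) * math.sqrt(max(r2 - r2_random_mean, 0.0))


if __name__ == "__main__":
    # Toy descriptor and endpoint values, for illustration only.
    x = [float(i) for i in range(20)]
    y = [0.3 * v + 4.0 + (0.2 if i % 2 else -0.2) for i, v in enumerate(x)]
    r2 = r_squared(x, y)
    r2_rand = mean_scrambled_r2(x, y)
    print(r2, r2_rand, c_rp2(r2, r2_rand))
```

A low average R2 for the scrambled data combined with a CR2p close to the original R2 suggests that the model is not the result of a chance correlation.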

The AD status of each compound in models 1 to 3, determined from its statistical defect value, is shown in Table S1. The percentages of compounds within the AD of the models were 81, 83, and 87% for splits 1–3, respectively. Thus, more than 80% of the compounds in each split fall within the AD of the corresponding model.

Figures 1 and 2 show the experimental pIC50 versus the predicted pIC50 and the residual pIC50 versus the predicted pIC50 for the three models, respectively. As can be seen in Fig. 1, there is good agreement between the experimental and predicted data for the suggested models. Figure 2 shows that the residuals are scattered around the horizontal zero line. All these results confirm that the constructed QSAR models are robust and well fitted.

Figure 1

The graph of the experimental versus predicted values of pIC50 for split 1 to split 3.

Figure 2

The graph of the residuals versus predicted values of pIC50 for split 1 to split 3.

Interpretation of the QSAR model

Mechanistic interpretation of the models helps in understanding the effect of the descriptors on the predicted endpoint. For QSAR models built with the CORAL program, the mechanistic interpretation relies on the correlation weights (CW) of the SMILES attributes obtained from several runs of the Monte Carlo optimization. Across the runs of a model, the CW of a SMILES attribute may be always positive, always negative, or both positive and negative. Attributes with a positive CW in all runs are considered promoters of increase of the endpoint, and attributes with a negative CW in all runs are considered promoters of decrease. Consequently, promoters of increase of pIC50 have positive CW and promoters of decrease of pIC50 have negative CW. If a structural attribute has both positive and negative CW values across the runs, its role is undefined. Table 3 lists the structural features identified as promoters of increase or decrease of pIC50 in three runs of the Monte Carlo optimization with the optimum T* and N*, along with the interpretation of the promoters (NT is the number of attributes in the training set, NiT is the number of attributes in the invisible training set, and NC is the number of attributes in the calibration set). According to the results, the SMILES-based descriptors identified as promoters of increase of pIC50 were c…c…c…, c…c…1… and Cmax.3……, and the promoter of decrease of pIC50 was C…(…(….

Table 3 List of structural attributes (SAk) as promoters of increase/decrease extracted from the three splits of the constructed model.

Comparison with prior reports

Kyaw Zin and colleagues29 reported QSAR models for the same data, relying on deep neural nets (DNN) and hybrid sets of 2D/3D/MD descriptors to predict the inhibition potencies of 306 imatinib derivatives. The dataset was divided into two sets, i.e. a training set (260 compounds) and a test set (46 compounds). They built multiple DNN and RF regressors with hybrid 2D/3D/MD descriptors and, through rigorous validation tests, reported that their DNN regression models gave excellent external prediction performance for the pIC50 dataset. The R2 values of the training and validation sets were 0.99 and 0.68, respectively, and the MAE values of the training and test sets were 0.08 and 0.67, respectively.

The comparison of the present QSAR models with the previous study shows that the previous models required the structure, physicochemical parameters, or prior calculation of chemical descriptors for their construction, whereas with the CORAL software only a text file containing the SMILES notations of the compounds and the endpoint was used for model development. Here, we used three splits to establish three QSAR models using four sets (training, invisible training, calibration and validation sets), whereas the previously constructed models used a single split with two sets (training and test sets). In the present research, the molecular features responsible for the increase or decrease of the endpoint were also identified for mechanistic interpretation.

In terms of statistical characterization, the QSAR models proposed here with CORAL for the prediction of pIC50 compare favourably with the reported model on the validation set. The statistical parameters \({Q}_{F1}^{2}\), \({Q}_{F2}^{2}\), \({Q}_{F3}^{2}\), \({CR}_{p}^{2}\), CCC and IIC were not reported in the previous study. The R2 values of the training and validation sets for splits 1 to 3 are in the ranges 0.76–0.85 and 0.71–0.78, respectively, and the MAE values of the training and validation sets for splits 1 to 3 are in the ranges 0.41–0.54 and 0.46–0.54, respectively. Therefore, the QSAR models established here are more reliable and have better predictability.

Conclusion

In this work, QSAR models were created using the Monte Carlo method to predict the pIC50 of 306 Imatinib derivatives and were validated with several parameters. The QSAR models were established using a modified target function (TFm). The statistical quality of the constructed models was assessed using internal and external validation metrics such as R2, IIC, CCC, Q2, \({Q}_{F1}^{2}\), \({Q}_{F2}^{2}\), \({Q}_{F3}^{2}\), F, s, MAE, RMSE, \(\overline{{R }_{m}^{2}}\), \(\overline{{\Delta R }_{m}^{2}}\), scaled-\(\overline{{\mathrm{R} }_{\mathrm{m}}^{2}}\), scaled-\(\overline{\Delta {\mathrm{R} }_{\mathrm{m}}^{2}}\), \({CR}_{p}^{2}\), and the Y-randomization test. For the validation sets of splits 1 to 3, the values of R2, Q2, and IIC were in the ranges 0.7180–0.7755, 0.6891–0.7561, and 0.4431–0.8611, respectively. The applicability domain (AD) was used to identify the outliers in the generated QSAR models. The structural features acting as promoters of pIC50 increase or decrease were also identified.