Abstract
In this paper, multiple linear regression (MLR) was used to build a quantitative structure-property relationship (QSPR) model of the n-octanol/water partition coefficient (logP_{o/w}) of 195 substituted aromatic drugs. The molecular descriptors were calculated for each compound with VLifeMDS. A genetic algorithm/multiple linear regression (GA/MLR) procedure was applied to select the most relevant descriptors for the QSPR model. The robustness of the model was characterized by statistical validation and by its applicability domain (AD). The predictions from MLR are in good agreement with the experimental values: R^{2} and Q^{2} _{LOO} are 0.9433 and 0.9341, respectively. The AD of the model was analyzed based on the Williams plot. The effects of the selected descriptors are described.
Introduction
Lipophilicity is the tendency of a compound to partition into a nonpolar organic phase rather than an aqueous phase. The typical quantitative descriptor of lipophilicity is the partition coefficient P of a given compound between two immiscible solvents^{1}. Traditionally, n-octanol has been widely used as the nonpolar phase and water as the polar phase. The logarithm of the measured partition coefficient is termed logP_{o/w} ^{2}.
n-Octanol is considered a good mimic of phospholipid membrane characteristics because of its amphiphilic nature^{3}. Among other physicochemical properties, lipophilicity plays a key role in molecular discovery activities in a variety of domains, including agrochemicals, cosmetics, material sciences, environmental chemistry, food chemistry, and particularly medicinal chemistry^{4}. A correct estimation of logP_{o/w} is essential for the discovery and development of efficient therapeutic molecules^{5}. Although lipophilicity cannot characterize the whole physicochemical nature of a compound, the properties governing lipophilicity have a basic effect on the actions of organic molecules such as drugs or drug candidates. Many drugs go through a series of partitioning steps: (a) leaving the aqueous extracellular fluids, (b) passing through lipid membranes, and (c) entering other aqueous environments before reaching the receptor. In this sense, a drug undergoes the same partitioning phenomenon that happens to any chemical in a separatory funnel containing water and a nonpolar solvent. A compound must therefore have an optimal lipophilicity, because a very lipophilic solute will remain trapped in the membrane^{6}. Lipophilicity is one of the main factors influencing the pharmacokinetic behavior of β-blockers, in several ways: (1) oral absorption, (2) penetration into the central nervous system (CNS), (3) renal clearance, (4) degree of biotransformation and plasma half-life, (5) cardioselectivity, and (6) corneal penetration^{7, 8}. For example, the most lipophilic β-blockers (such as propranolol) penetrate readily into the CNS and cause central effects (somnolence), whereas the more hydrophilic drugs have low CNS penetration and negligible central effects^{8}. The in situ rat gut technique is an informative tool yielding realistic absorption rates. In a 1981 study of 18 sulfonamides, the absorption rate constant k_{a} was correlated with the lipophilicity parameter^{9}.
Good gastrointestinal (GI) absorption was for many years a problem in the development of penicillins. Yoshimura^{10} carried out a systematic study in mice and rats and showed that the two major molecular properties influencing the GI absorption of penicillins are their stability in acidic solutions and their lipophilicity. Corneal penetration is a critical condition for the therapeutic success of ocularly administered drugs such as β-blockers used as antiglaucoma agents. In 1983, an important study showed that lipophilicity clearly plays a key role in penetration through the intact cornea. In a series of 12 β-blockers, the logPC (permeability coefficient) exhibited a parabolic relation with lipophilicity^{11}. For a homogeneous set of phenols, a parabolic relation was found between human skin permeability (K_{p}) and logP_{o/w} ^{12}. In 1991, for 11 aromatic acids (model compounds and anti-inflammatory drugs), the binding constant to bovine serum albumin (in logarithmic form) was correlated with a hydrophobic index obtained by RP-HPLC^{13}. In another study, the unbound fraction in plasma (f_{u}), taken as the biological response, showed a sigmoidal relation with logP_{o/w} ^{14}. Interestingly, parabolic relations between protein binding and lipophilicity are also known, suggesting the limited dimensions of some binding sites. When large molecules such as cephalosporins were tested for their association constant (K_{a}) to human serum albumin, a fair parabolic relation was found with lipophilicity^{15}. In another important study, the concentration of 10 basic drugs in plasma and 8 non-metabolizing tissues was examined after administration to rabbits. These drugs were weakly basic benzodiazepines and strongly basic neurological drugs. Good linear relations (R^{2} = 0.92 to 0.97) were found between the tissue-to-plasma concentration ratios of unbound, non-ionized drugs and their logP_{o/w}.
The slope of the linear regressions increased in the series: muscle < skin < bone < brain < gut < heart < lung < adipose^{16}. In many studies on drug permeation through biological membranes (gut wall, skin, blood-brain barrier, and Caco-2 cell monolayer), relationships between permeation and lipophilicity have been developed with homologous series of compounds of a diverse nature (acidic, basic and neutral) to investigate the influence of lipophilicity on passive diffusion. For example, sigmoidal relationships were established between permeability coefficients in rat jejunum and logP_{o/w} for seven steroids^{17} and 11 β-blockers^{18}. Even so, despite the good solubility of most organic compounds in n-octanol and its ease of handling in the laboratory, the experimental determination of logP_{o/w} remains a resource- and time-consuming process. Methods to estimate logP_{o/w} are used mainly in medicinal chemistry and molecular design activities. Estimation approaches include group and atom contribution methods^{19, 20} and quantitative structure-property relationships (QSPR) derived from statistical regressions^{21,22,23}. Group and atom contribution models have usually been based on fragments, derived either from atoms or groups of atoms, which are assigned incremental logP_{o/w} contributions^{24}. QSPR have been developed as alternative strategies for estimating lipophilicity. The assumption of QSPR for logP_{o/w} is that physicochemical properties can be correlated with molecular structural characteristics (geometric and electronic) expressed in terms of appropriate molecular descriptors^{25}. In recent years, enhancements in logP_{o/w} QSPR have been suggested through the use of molecular descriptors derived from semi-empirical molecular orbital (MO) theory (quantum mechanics) calculations^{26}.
For example, Bodor^{27}, using AM1 semi-empirical MO theory, reported a standard deviation of 0.306 logP_{o/w} units for an 18-parameter linear correlation developed for estimating lipophilicity for a heterogeneous data set of 302 organic compounds. In 1999, Eisfeld and Maurer^{28} proposed a logP_{o/w} correlation with dipole moment, polarizability, electrostatic potential and molar volume as chemical descriptors, based on a heterogeneous set of 202 compounds with a reported standard deviation and maximum absolute error of 0.287, respectively. Yaffe^{29}, using QSPR based on fuzzy ARTMAP and back-propagation neural networks, estimated logP_{o/w} for a heterogeneous set of 442 organic compounds.
In this work we develop a QSPR model of logP_{o/w} for 195 substituted aromatic drugs. Many of these drugs are very important in medicinal chemistry. Examples include alprazolam, which is mostly used to treat anxiety disorders, panic disorders, and nausea due to chemotherapy; dapsone, which is commonly used in combination with rifampicin and clofazimine for the treatment of leprosy; procaine, a local anesthetic of the amino ester group that is used primarily to reduce the pain of intramuscular injection of penicillin and is also used in dentistry; and warfarin, which can help prevent the formation of blood clots and reduce the risk of embolism^{30}. The 195 drugs studied in this paper form a homogeneous set of aromatic drugs.
Computational approach
All calculations were run on a Dell Inspiron N5010 laptop computer with an Intel® Core™ i7 processor and the Windows 7 operating system. The molecular structures of all compounds were drawn in HyperChem 8.0 (Hypercube, Inc., Gainesville, 2011) and pre-optimized using the MM^{+} molecular mechanics method (Polak–Ribiere algorithm). The final geometries of the minimum energy conformations were obtained by a more precise optimization with the semi-empirical PM3 method, applying a root mean square gradient limit of 0.05 kcal mol^{−1} Å^{−1} as the stopping criterion for optimized structures. The molecular descriptors were calculated with the VLifeMDS software (version 4.4). A GA/MLR procedure was used for the selection of descriptors using the QSARINS (QSAINSubria version 2.2.1 2015) software package. MLR was also performed with QSARINS.
Data set selection
For the present study, the logP_{o/w} values of 195 drug compounds were collected from the literature^{31}. The molecules cover a wide range of lipophilicity (logP_{o/w} from −2.17 to 6.03). In order to obtain a validated and, therefore, predictive QSPR model, the available data set should be divided into training and test sets. Commonly, this splitting is performed using random or rational splitting methods^{32}. Here the data set was split randomly into a training set of 147 compounds and a prediction set of 48 compounds (see Table 1).
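The random 147/48 split described above can be illustrated with a short sketch. This is not the procedure used in QSARINS; the function name `random_split` and the seed are assumptions introduced only for illustration.

```python
import numpy as np

def random_split(n_compounds, n_train, seed=0):
    """Randomly assign compound indices to a training and a prediction set.

    The seed is a hypothetical choice so that the split is reproducible.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_compounds)          # shuffled compound indices
    train_idx = np.sort(order[:n_train])          # first n_train go to training
    test_idx = np.sort(order[n_train:])           # the rest form the prediction set
    return train_idx, test_idx

# 195 compounds split into 147 training and 48 prediction compounds
train_idx, test_idx = random_split(195, 147)
```

A rational split (for example, one that forces both sets to span the full logP_{o/w} range) would replace the permutation step; the random version is shown because it is the one reported in the text.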
Computational methods
Descriptor generation
Molecular descriptors are generated from molecular structures. Although different descriptors require different processing steps, numerous steps are common to these procedures. Molecular descriptors are powerful tools for approximating selected properties of chemical structures in an easy-to-handle form that allows efficient comparison and selection of compounds possessing the required chemical, structural, pharmacological or biological features. In this study, molecular descriptors were calculated for each compound with VLifeMDS on the minimum energy conformations. VLifeMDS calculates about 500 different molecular descriptors from the following categories: topological, electronic, electrostatic, E-state, information theory based, physicochemical and semi-empirical.
Descriptor selection
After descriptor generation, a pool of molecules with their corresponding descriptors becomes available for model calculation. However, a limited number of modeling descriptors, related to the studied response, must be selected from the available pool. Descriptor selection is the process of selecting a subset of relevant variables for use in model construction. In QSARINS this is done using a GA/MLR procedure. This technique is able to explore a broad range of solutions, searching for the best ones by maximizing or minimizing a selected fitness function. It mimics natural selection, in which the best solutions replace the less well performing ones. In biological terms, one would say that the best genes in the population displace the less fit. In our case, every descriptor represents a gene, and a set of descriptors represents a chromosome. The fitness of a chromosome is related to the performance of the corresponding model. Starting with a pool of chromosomes, small subsets of chromosomes are picked randomly, and the best become parents. Couples of parent chromosomes are then crossed at a random position (crossing-over), thus obtaining the offspring, whose chromosomes are a combination of the parent ones. If one or more of the new chromosomes outperform the least fit chromosomes in the parent population, they replace them. By repeating this procedure many times, and also introducing random mutations (descriptor substitution) in the chromosomes, the result at the end of the procedure is a population of models with better performance than the models introduced at the beginning. In order to prevent a completely random start of the GA, in QSARINS the best set of descriptors extracted from the all-subset process is used as the core of the chromosomes of the initial population. In QSARINS, the GA can be tuned by changing the population size, the mutation rate, and the number of generations.
A fundamental option is the selection of the fitness function used by the GA. In this work, the leave-one-out cross-validated coefficient (Q^{2} _{LOO}) was used as the fitness function throughout the GA process. The GA selection is stopped when increasing the model size does not improve the Q^{2} value significantly. R^{2} and adjusted R^{2}, used as fitness functions, are useful for selecting models with high fitting and the minimum number of descriptors. However, it is essential to note that they are fitting criteria, so they provide no information on the predictive ability of the models. For this reason, Q^{2} _{LOO} is used here as the fitness function for the selection of predictive models^{33}. The important parameters of the GA process were set as follows: population size 100, maximum number of descriptors allowed in a model 10, and reproduction/mutation trade-off 0.5. Finally, we obtained a 10-descriptor subset, which retains most of the interpretive information for logP_{o/w}. These ten descriptors were calculated for each compound in the data set. The selected descriptors are: SKMostHydrophobic Area, SAHydrophobic Area, SKAverage, XKAverage Hydrophobicity, PSA, Average Potential, Polar Surface Area Excluding P & S, 4Path Count, ChiV6chain and AlphaR.
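The GA steps described in this section (random tournament, crossing-over at a random position, mutation by descriptor substitution, replacement of the least fit chromosome) can be sketched as follows. This is a simplified illustration and not the QSARINS implementation: the chromosome size is fixed at `k`, the initial population is fully random rather than seeded from an all-subset search, and the names `q2_loo` and `ga_select`, the tournament size of 4 and all default parameter values are assumptions.

```python
import numpy as np

def q2_loo(X_sub, y):
    """Leave-one-out Q2 of an OLS model, using the hat-matrix shortcut
    PRESS = sum((e_i / (1 - h_ii))^2)."""
    X = np.column_stack([np.ones(len(y)), X_sub])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, k=3, pop_size=20, n_gen=50, mutation_rate=0.5, seed=0):
    """Toy GA: each chromosome is a set of k descriptor (column) indices."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    pop = [rng.choice(n_desc, size=k, replace=False) for _ in range(pop_size)]
    fit = [q2_loo(X[:, c], y) for c in pop]
    for _ in range(n_gen):
        # pick a small random subset of chromosomes; the best two become parents
        cand = rng.choice(pop_size, size=4, replace=False)
        p1, p2 = sorted(cand, key=lambda i: fit[i], reverse=True)[:2]
        # crossing-over at a random position
        cut = rng.integers(1, k)
        child = np.concatenate([pop[p1][:cut], pop[p2][cut:]])
        # mutation: substitute one descriptor at random
        if rng.random() < mutation_rate:
            child[rng.integers(k)] = rng.integers(n_desc)
        # repair duplicated descriptors so the chromosome keeps k distinct genes
        child = np.unique(child)
        while len(child) < k:
            child = np.unique(np.append(child, rng.integers(n_desc)))
        score = q2_loo(X[:, child], y)
        # the offspring replaces the least fit chromosome if it performs better
        worst = int(np.argmin(fit))
        if score > fit[worst]:
            pop[worst], fit[worst] = child, score
    best = int(np.argmax(fit))
    return np.sort(pop[best]), fit[best]

# Synthetic demonstration data (not the drug data set of this paper)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 8))
y_demo = X_demo[:, [1, 4, 6]] @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)
best_subset, best_q2 = ga_select(X_demo, y_demo, k=3, pop_size=10, n_gen=40)
```

Because an offspring only ever replaces the least fit chromosome, the best Q^{2} _{LOO} in the population cannot decrease from one generation to the next, which mirrors the behaviour described in the text.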
Modeling method in QSARINS
The data sets used in QSPR analysis are, as previously mentioned, composed of descriptors that should be correlated with the corresponding experimental responses. At this step it is necessary to apply a quantitative method able to find the relationship between a limited number of structural descriptors and the modeled response. In QSARINS, the method used is the MLR approach, which for a model with m descriptors can be expressed by the following formula:
\[ y_{i} = b_{0} + \sum_{j=1}^{m} b_{j} x_{ij} + e_{i} \qquad (2) \]
where a linear relationship is computed between the studied responses (y_{i}) and the selected values of the descriptors (x_{ij}); e_{i} is the random error (also called the model residual). The intercept (b_{0}) and the coefficients (b_{j}) are the quantities to be estimated. Equation (2) can be rewritten in a more compact form using matrix notation:
\[ \mathrm{y} = \mathrm{X}\mathrm{b} + \mathrm{e} \qquad (3) \]
where y is the vector of the responses, b the vector of the coefficients and e the vector of the errors. X is the model matrix, whose columns are the descriptors. In this software, the ordinary least squares (OLS) technique is used to estimate the vector of the coefficients:
\[ \hat{\mathrm{b}} = (\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathrm{y} \qquad (4) \]
where \(\hat{{\rm{b}}}\) is the vector that estimates the vector b of the coefficients, X^{T} is the transposed X matrix and ^{−1} denotes matrix inversion. OLS minimizes the sum of squares of the differences between the experimental responses and the ones calculated by the model. To work correctly, OLS assumes that: (1) a linear relationship exists between the descriptors and the response, (2) the response errors are independent and identically distributed, (3) the descriptors are not too correlated with one another, and (4) there are more compounds than modeling descriptors (a ratio that should always be higher than 5:1). Once the coefficients of the model are calculated, it is possible to obtain the vector of calculated responses \(\hat{{\rm{y}}}\), as in the following formula:
\[ \hat{\mathrm{y}} = \mathrm{X}\hat{\mathrm{b}} = \mathrm{X}(\mathrm{X}^{T}\mathrm{X})^{-1}\mathrm{X}^{T}\mathrm{y} = \mathrm{H}\mathrm{y} \qquad (5) \]
where H is the leverage (or hat) matrix that relates the calculated and the experimental responses. The diagonal elements of the hat matrix, h_{ii}, can be used to determine the distance of the ith object from the centre of the chemical space of the model^{34, 35}, and thus to check the structural applicability domain (AD) of the model.
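The OLS estimate and the hat matrix described above can be sketched in a few lines of NumPy. The example assumes that the intercept b_{0} is handled by prepending a column of ones to the descriptor matrix; the function name `fit_ols` and the synthetic data are introduced only for illustration and are not taken from QSARINS or from the data set of this paper.

```python
import numpy as np

def fit_ols(descriptors, y):
    """Return the model matrix X, the OLS coefficients and the hat matrix H."""
    X = np.column_stack([np.ones(len(y)), descriptors])  # column of ones = intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    b_hat = XtX_inv @ X.T @ y          # b_hat = (X^T X)^-1 X^T y
    H = X @ XtX_inv @ X.T              # y_hat = H y
    return X, b_hat, H

# Synthetic data: 30 "compounds", 3 descriptors
rng = np.random.default_rng(1)
D = rng.normal(size=(30, 3))
y = 0.5 + D @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=30)

X, b_hat, H = fit_ols(D, y)
y_hat = H @ y
leverages = np.diag(H)                 # h_ii, used later for the Williams plot
```

For larger or nearly collinear descriptor sets, `np.linalg.lstsq` is numerically safer than forming the explicit inverse; the explicit form is kept here only because it matches the formulas in the text.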
Model evaluation
Evaluation of a QSPR model is a very important step, and goodness-of-fit is an essential aspect of it. The goodness-of-fit of the models is quantified by R^{2}, the squared correlation coefficient; R^{2} _{adj}, the adjusted squared correlation coefficient; s, the standard error of the regression; and F, the Fisher ratio for regression. R^{2} is a statistic that gives information about the goodness of fit of a model. R^{2} is defined as:
\[ R^{2} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} \qquad (6) \]
where RSS is the residual sum of squares and TSS is the total sum of squares. Adjusted R^{2} detects possible overfitting of a model and so, used as a fitness function, is useful for selecting models with high fitting and the minimum number of descriptors. Adjusted R^{2} is defined as:
\[ R^{2}_{adj} = 1 - (1 - R^{2})\,\frac{n - 1}{n - m - 1} \qquad (7) \]
where n is the number of members of the training set and m is the number of descriptors included in the model. Adjusted R^{2} is a better measure than R^{2} of the proportion of variance in the data explained by the correlation. The standard error indicates the degree of dispersion of the random error. The F-ratio in regression is defined as the ratio between the variance explained by the model and the residual variance. The larger R^{2}, R^{2} _{adj} and F, and the smaller s, the better the fitting ability of the model.
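As a worked check of the fitting criteria, the sketch below computes R^{2} _{adj} and F directly from R^{2}, n and m. The F expression used, F = (R^{2}/m) / ((1 − R^{2})/(n − m − 1)), is the usual regression F-ratio and is an assumption here, since the text defines F only in words; the function names are hypothetical. The inputs are the values reported in this paper (R^{2} = 0.9433, n = 147, m = 10).

```python
def adjusted_r2(r2, n, m):
    """Adjusted R2 for n training compounds and m descriptors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)

def f_ratio(r2, n, m):
    """Explained variance over residual variance (standard regression F)."""
    return (r2 / m) / ((1.0 - r2) / (n - m - 1))

r2_adj = adjusted_r2(0.9433, 147, 10)   # approximately 0.9391
f_value = f_ratio(0.9433, 147, 10)      # approximately 226
```

The computed R^{2} _{adj} reproduces the reported 0.9391, and the F value agrees with the reported 226.3247 to within the rounding of R^{2}.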
Model validation
Model calculation and evaluation are the basic steps in QSPR analysis, but they are not sufficient to guarantee the validity of the model. Validation is fundamental to ensure the reliability of the data predicted by the model, and both internal and external validation are considered necessary^{35}.
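Internal validation is obtained by analyzing each of the individual objects that make up the final equation. The procedure is leave-one-out (LOO) cross-validation: each compound of the training set is removed in turn, the model is refitted on the remaining compounds, and the removed compound is predicted. With \(\hat{y}_{i/i}\) denoting the prediction for compound i from the model fitted without it, Q^{2} _{LOO} is calculated on the training set as:
\[ Q^{2}_{LOO} = 1 - \frac{PRESS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i/i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} \qquad (8) \]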
where TSS is the total sum of squares, that is, the sum of squared deviations from the data set mean, and PRESS is the sum of squares of the prediction errors. The larger Q^{2} _{LOO}, the greater the internal predictive ability of the model. However, a perturbation of only one compound at a time is too weak to demonstrate real model robustness. QSARINS therefore also includes the stronger Leave-More (or Many)-Out (LMO) technique. This technique studies the behavior of the model when a larger number of compounds are eliminated. LMO is used to counteract the slight over-optimism of LOO cross-validation. The model under analysis can be considered stable if the R^{2} and Q^{2} values calculated in every LMO iteration, and their averages (R^{2} _{LMO} and Q^{2} _{LMO}), are close to the R^{2} and Q^{2} _{LOO} values of the model^{36}.
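The LOO procedure can be written as an explicit refitting loop, as in the sketch below. The function name `q2_loo_refit` and the synthetic data are assumptions made for illustration; QSARINS may compute the same quantity differently.

```python
import numpy as np

def q2_loo_refit(descriptors, y):
    """Q2_LOO by refitting the OLS model n times, each time without one compound."""
    X = np.column_stack([np.ones(len(y)), descriptors])
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                        # leave compound i out
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        press += (y[i] - X[i] @ b) ** 2                 # squared prediction error
    tss = np.sum((y - y.mean()) ** 2)                   # deviations from data set mean
    return 1.0 - press / tss

# Synthetic data: 40 "compounds", 2 descriptors
rng = np.random.default_rng(3)
D = rng.normal(size=(40, 2))
y = 1.0 + D @ np.array([0.8, -0.5]) + rng.normal(scale=0.2, size=40)
q2 = q2_loo_refit(D, y)
```

For OLS the same PRESS can be obtained without refitting, because the LOO residual of compound i equals e_{i}/(1 − h_{ii}); the explicit loop is shown because it follows the description in the text step by step. An LMO variant would remove a random group of compounds in each iteration instead of a single one.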
To show that the model is not the result of chance correlation, the Y-scrambling procedure can be applied. In this process, the responses are shuffled at random, so that no correlation should exist between them and the descriptors. As a consequence, the performance of the corresponding scrambled models should decrease drastically. If the original model under validation is good, the values of R^{2} and Q^{2} of every iteration, and their averages (R^{2} _{yscr} and Q^{2} _{LOOyscr}), must be much smaller than the values of the original model. If Q^{2} _{LOOyscr} < 0.2 and R^{2} _{yscr} < 0.2, there is no risk of chance correlation in the developed model.
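A minimal sketch of Y-scrambling is given below. It tracks only R^{2} _{yscr}; Q^{2} _{LOOyscr} would be obtained in the same way by replacing the fitting criterion with a cross-validated one. The function names, the number of iterations and the synthetic data are assumptions introduced for illustration.

```python
import numpy as np

def r2_ols(descriptors, y):
    """R2 of an OLS model with intercept."""
    X = np.column_stack([np.ones(len(y)), descriptors])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    return 1.0 - rss / np.sum((y - y.mean()) ** 2)

def y_scrambling(descriptors, y, n_iter=100, seed=0):
    """Average R2 over models fitted to randomly shuffled responses."""
    rng = np.random.default_rng(seed)
    scores = [r2_ols(descriptors, rng.permutation(y)) for _ in range(n_iter)]
    return float(np.mean(scores))

# Synthetic data with a genuine descriptor-response relationship
rng = np.random.default_rng(7)
D = rng.normal(size=(60, 3))
y = D @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

r2_original = r2_ols(D, y)
r2_yscr = y_scrambling(D, y)
```

With a genuine relationship in the data, `r2_original` is close to 1 while `r2_yscr` collapses towards the value expected by chance, which is the pattern the 0.2 threshold in the text is meant to detect.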
External validation is also necessary in the process of model validation. It checks the ability of the model to predict new compounds. This is done by applying the model equation, obtained on the training set, to one or more prediction data sets, that is, to excluded compounds that have never been used in model calculation, and measuring the performance by means of different criteria, such as RMSE^{37}, Q^{2} _{F1} ^{38}, Q^{2} _{F2} ^{39}, Q^{2} _{F3} ^{40}, CCC^{41} and Q^{2} _{EXT} ^{42}.
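The external Q^{2} _{F1} for the test set is determined with the following equation:
\[ Q^{2}_{F1} = 1 - \frac{PRESS}{SS_{EXT}(\bar{y}_{TR})} = 1 - \frac{\sum_{i=1}^{n_{EXT}} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n_{EXT}} (y_{i} - \bar{y}_{TR})^{2}} \qquad (9) \]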
where \({\bar{y}}_{TR}\) indicates the response mean of the training set, PRESS is the predictive sum of squares, and \(S{S}_{EXT}({\bar{y}}_{TR})\) is the total sum of squares of the external set calculated using the training set mean. Consequently, this formula gives valid values when the test set spans the whole response domain of the model, because in this case the test set mean approaches the training set mean.
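Q^{2} _{F2} is defined as:
\[ Q^{2}_{F2} = 1 - \frac{PRESS}{SS_{EXT}(\bar{y}_{EXT})} = 1 - \frac{\sum_{i=1}^{n_{EXT}} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n_{EXT}} (y_{i} - \bar{y}_{EXT})^{2}} \qquad (10) \]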
where \({\bar{y}}_{EXT}\) indicates the response mean of the external test set and \(S{S}_{EXT}({\bar{y}}_{EXT})\) is the total sum of squares of the external set calculated using the external set mean. The function Q^{2} _{F2} does not account for information about the reference model, because \({\bar{y}}_{EXT}\) encodes information derived from the external set, and this information changes with the objects belonging to the external set.
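Q^{2} _{F3} is defined as:
\[ Q^{2}_{F3} = 1 - \frac{PRESS / n_{EXT}}{TSS / n_{TR}} = 1 - \frac{\left[\sum_{i=1}^{n_{EXT}} (y_{i} - \hat{y}_{i})^{2}\right] / n_{EXT}}{\left[\sum_{i=1}^{n_{TR}} (y_{i} - \bar{y}_{TR})^{2}\right] / n_{TR}} \qquad (11) \]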
where TSS is the total sum of squares of the training set, n_{EXT} is the number of compounds in the test set and n_{TR} is the number of compounds in the training set. The expression for Q^{2} _{F3} reduces to the expression for Q^{2} _{LOO} when the training and test sets coincide (n_{EXT} = n_{TR}), or, in other words, when all available data are used both for fitting and for assessing the predictive ability of the model.
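CCC is the concordance correlation coefficient, defined as:
\[ CCC = \frac{2\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2} + \sum_{i=1}^{n} (y_{i} - \bar{y})^{2} + n(\bar{x} - \bar{y})^{2}} \qquad (12) \]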
It is well suited to measuring the agreement between experimental and predicted data, which should be the real aim of any predictive QSPR model. Here x_{i} and y_{i} correspond to the abscissa and ordinate values of the plot of the experimental data values versus the ones calculated using the model, n is the number of chemicals, and \(\bar{x}\) and \(\bar{y}\) correspond to the averages of the abscissa and ordinate values, respectively. This coefficient measures both precision (how far the observations are from the fitting line) and accuracy (how far the regression line deviates from the line of slope 1 passing through the origin, the concordance line). Consequently, any divergence of the regression line from the concordance line results in a value of CCC smaller than 1.
A basic property of a function for assessing model fit from external evaluation data is that the external observations are independent of each other. This means that the Q^{2} value derived from the whole external data set, Q^{2} _{EXT}, and the average of the Q^{2} values obtained by taking each external object separately, one at a time, should coincide. The optimized model was applied to predict the logP_{o/w} values of the 48 drugs in the prediction set, which were not used in the optimization procedure. The predictive ability of a model on the external validation set can be expressed by Q^{2} _{EXT}:
\[ Q^{2}_{EXT} = \frac{1}{n_{EXT}} \sum_{i=1}^{n_{EXT}} Q^{2}_{i} \qquad (13) \]
where Q^{2} _{i} is the external Q^{2} calculated taking into account only the ith object of the test set and n_{EXT} is the total number of external objects.
An additional measure of the accuracy of the proposed QSPR is the RMSE (root mean squared error), which summarizes the overall error of the model:
\[ RMSE = \sqrt{\frac{\sum_{i=1}^{n_{EXT}} (y_{i} - \hat{y}_{i})^{2}}{n_{EXT}}} \qquad (14) \]
where \({\hat{y}}_{i}\) is the predicted value for the ith test object, y_{i} is its observed value, and n_{EXT} is the total number of test objects. This parameter depends only on the deviations between predicted and observed values, and it can always be calculated, even when there is only one test object. It is calculated as the square root of the sum of squared prediction errors divided by their total number. This parameter was calculated to compare the accuracy and the stability of our model in the training (RMSE_{TR}) and prediction (RMSE_{EXT}) sets. It is important to note that RMSE values must not only be low but also as similar as possible for the training, cross-validation and external prediction sets. Such values suggest that the proposed model has both predictive ability (low values) and sufficient generalizability (similar values).
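The external validation criteria can be collected in one small function, sketched below under the definitions given in this section. The function name `external_metrics` and the toy numbers are assumptions chosen so that the results can be checked by hand; Q^{2} _{EXT} is not included.

```python
import numpy as np

def external_metrics(y_train, y_ext, y_ext_pred):
    """Q2_F1, Q2_F2, Q2_F3, CCC and RMSE for an external prediction set."""
    y_train, y_ext, y_ext_pred = map(np.asarray, (y_train, y_ext, y_ext_pred))
    n_tr, n_ext = len(y_train), len(y_ext)
    press = np.sum((y_ext - y_ext_pred) ** 2)
    q2_f1 = 1 - press / np.sum((y_ext - y_train.mean()) ** 2)   # training mean
    q2_f2 = 1 - press / np.sum((y_ext - y_ext.mean()) ** 2)     # external mean
    tss_tr = np.sum((y_train - y_train.mean()) ** 2)
    q2_f3 = 1 - (press / n_ext) / (tss_tr / n_tr)
    x, y = y_ext, y_ext_pred                                    # experimental vs calculated
    ccc = 2 * np.sum((x - x.mean()) * (y - y.mean())) / (
        np.sum((x - x.mean()) ** 2)
        + np.sum((y - y.mean()) ** 2)
        + n_ext * (x.mean() - y.mean()) ** 2
    )
    rmse = np.sqrt(press / n_ext)
    return {"Q2_F1": q2_f1, "Q2_F2": q2_f2, "Q2_F3": q2_f3, "CCC": ccc, "RMSE": rmse}

# Toy example: training responses, external responses, external predictions
metrics = external_metrics([0.0, 2.0, 4.0], [1.0, 3.0], [1.5, 2.5])
```

In this toy case the training and external means coincide, so Q^{2} _{F1} and Q^{2} _{F2} are equal, which illustrates the remark in the text that Q^{2} _{F1} is most meaningful when the test set mean approaches the training set mean.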
The AD is a theoretical region in chemical space, defined by the model descriptors and the modeled response, and thus by the nature of the chemicals in the training set, as represented in each model by specific molecular descriptors. As even a robust, significant and validated QSPR cannot be expected to reliably predict the modeled property for the whole universe of chemicals, its domain of application must be defined, and only the predictions for chemicals that fall in this domain can be considered reliable. The Williams plot of the regression permits graphical detection of both the outliers for the response (Y-outliers) and the structurally influential chemicals in a model (X-outliers). It consists of plotting the standardized residuals on the y-axis and the leverage values from the hat matrix diagonal on the x-axis. The leverage (h) of a compound measures its influence on the model. The leverage of a compound with descriptor row vector x_{i} in the original variable space is defined as:
\[ h_{i} = \mathrm{x}_{i} (\mathrm{X}^{T}\mathrm{X})^{-1} \mathrm{x}_{i}^{T} \qquad (15) \]
where X is the model matrix derived from the training set descriptor values; the leverage values of the training set are the diagonal elements of the hat (or influence) matrix H (h_{i} = diag(H)). The leverage values are always between 0 and 1. The warning leverage h^{*} is defined as follows:
\[ h^{*} = \frac{3p'}{n} \qquad (16) \]
where n is the number of training set compounds and p′ is the number of model parameters plus one. Observations with standardized residuals outside the (−3; +3) range, which lie outside the horizontal reference lines on the plot, are treated as response outliers in QSARINS (standardized residuals beyond \(\pm 3\sigma \), where σ is the standard deviation of the residuals). The standardized residual (SR_{i}) for each sample is calculated as in equation (17):
\[ SR_{i} = \frac{y_{i} - \hat{y}_{i}}{\sqrt{\frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{n}}} \qquad (17) \]
where y_{i} and \({\hat{y}}_{i}\) are the measured and predicted values of the property, respectively, and n is the number of compounds in each set of data. To visualize the AD of a QSPR model, the plot of standardized residuals versus leverage values (h) (the Williams plot) can be used for immediate and simple graphical detection of both the response outliers and the structurally influential chemicals in a model (h > h^{*}). Concerning the residuals, all chemicals falling above or below the user-defined threshold are not well predicted and are thus considered outliers. Too many outliers, especially underestimated ones, are symptomatic of a poor model, which is the reason for counting the outliers. Leverage values represent the degree of influence that the structure of every single chemical has on the model. A compound with high leverage in a QSPR model is a driving force for the variable selection if this compound is in the training set (good leverage). A high-leverage compound in the prediction set is far from the chemical domain of the training compounds, so its predicted value could be unreliable, being the result of substantial extrapolation of the model. In that case, the structural information of the chemicals included in the training set may not be sufficient for a reliable prediction of chemicals lying outside the training AD^{43}.
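The quantities behind a Williams plot can be sketched as follows. The residuals are standardized by the root mean square of the training residuals, which is an assumption (other conventions divide by n − p′ or use studentized residuals), and the function names and synthetic data are introduced only for illustration. The helper `warning_leverage` reproduces the threshold reported in this paper for n = 147 and 10 descriptors.

```python
import numpy as np

def warning_leverage(n_train, n_descriptors):
    """h* = 3 p' / n, with p' = number of descriptors plus one."""
    return 3.0 * (n_descriptors + 1) / n_train

def williams_quantities(D_train, y_train, D_query, y_query):
    """Leverages and standardized residuals of query compounds for a training model."""
    X_tr = np.column_stack([np.ones(len(y_train)), D_train])
    X_q = np.column_stack([np.ones(len(y_query)), D_query])
    XtX_inv = np.linalg.inv(X_tr.T @ X_tr)
    b_hat = XtX_inv @ X_tr.T @ y_train
    # h_i = x_i (X^T X)^-1 x_i^T for every query row
    h = np.einsum("ij,jk,ik->i", X_q, XtX_inv, X_q)
    # assumption: sigma is the root mean square of the training residuals
    sigma = np.sqrt(np.mean((y_train - X_tr @ b_hat) ** 2))
    sr = (y_query - X_q @ b_hat) / sigma
    return h, sr, warning_leverage(len(y_train), D_train.shape[1])

# Synthetic training set: 40 "compounds", 2 descriptors
rng = np.random.default_rng(5)
D = rng.normal(size=(40, 2))
y = 1.0 + D @ np.array([0.8, -0.5]) + rng.normal(scale=0.2, size=40)

h, sr, h_star = williams_quantities(D, y, D, y)
flagged = (h > h_star) | (np.abs(sr) > 3)     # influential or response outliers
h_star_paper = warning_leverage(147, 10)      # 3 * 11 / 147
```

Plotting `sr` against `h` with a vertical line at `h_star` and horizontal lines at ±3 gives the Williams plot; prediction set compounds are handled by passing their descriptors and responses as the query arguments.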
Results and Discussions
Multiple regression analysis
MLR analysis was used to derive the QSPR model. The data set was randomly divided into a training set and a test set: 147 drugs were selected as the training set for modeling, and 48 drugs were chosen as the prediction set and used for external validation of the MLR model. Using the MLR method, a linear model was obtained in which the molecular descriptors were used as independent variables. Table 2 lists the descriptors, their coefficients and the model parameters.
Here n is the number of compounds used for regression, R^{2} is the squared correlation coefficient, R^{2} _{adj} is the adjusted squared correlation coefficient, s is the standard error of the regression and F is the Fisher ratio for regression. R^{2} is a measure of how well the regression line approximates the real data points; the closer its value is to 1, the better the fit of the model. The high value obtained (R^{2} = 0.9433) indicates that the regression line fits the data well. Equation 18 has an R^{2} _{adj} value of 0.9391, which indicates very good agreement between the correlation and the variation in the data. s represents the average distance of the observed values from the regression line; conveniently, it expresses how wrong the regression model is on average in the units of the response variable. Smaller values are better because they indicate that the observations are closer to the fitted line (here s = 0.4031). The F-test reflects the ratio of the variance explained by the model to the variance due to the error in the model, and the high value obtained (F = 226.3247) indicates that the model is statistically significant. The predicted and experimental values of logP_{o/w} and the residuals (experimental logP_{o/w} − predicted logP_{o/w}) are presented in Table 1. The plot of predicted versus experimental logP_{o/w} and the plot of residuals versus experimental logP_{o/w} obtained by MLR modeling, which shows the random distribution of the residuals about zero mean, are given in Fig. 1A and B. These results show that the predicted values are in good agreement with the experimental values. Leave-one-out and leave-many-out cross-validations were performed on the training set. Q^{2} _{LOO} and Q^{2} _{LMO} describe the stability of a regression model by focusing on the sensitivity of the model to the elimination of one or more data points.
The values Q^{2} _{LOO} = 0.9341 and Q^{2} _{LMO} = 0.9318 illustrate the stability of the model. In the present study, R^{2} _{yscr} = 0.0685 and Q^{2} _{LOOyscr} = −0.0901 show that the model is not the result of chance correlation (see Fig. 2). External validation is an indispensable validation method used to determine the true predictive ability of a QSPR model. The large values of Q^{2} _{EXT} = 0.8982, Q^{2} _{F1} = 0.8941, Q^{2} _{F2} = 0.8921, Q^{2} _{F3} = 0.9118 and CCC = 0.9463 illustrate the predictive capability of the model on the external prediction set. In the Williams plot for the AD (see Fig. 3), Sulfasalazine in the test set lies to the right of the vertical line, which indicates that it has a high leverage value (h > h^{*} = 0.224); because its standardized residual is low, it is considered to belong to the model AD. Doxorubicin in the training set also lies to the right of the vertical line, which indicates that it has a high leverage value (h > h^{*} = 0.224), with a low standardized residual. Chemicals with high leverage have a stronger influence on the model than other chemicals, and they are influential. In the standardized residuals plot, Enalaprilat in the training set and PhePhe in the test set have standardized residuals outside the (−3; +3) range, which confirms that there are two outliers. Furthermore, there is no clear pattern in the residuals, so nothing seems to be wrong with the model. The fitting criteria, internal validation criteria and external validation criteria are shown in Table 3.
Interpretation of descriptors
SKMostHydrophobic Area, SAHydrophobic Area and SKAverage
SKMostHydrophobic Area is the most hydrophobic value on the van der Waals (vdW) surface. The van der Waals surface of a molecule is the surface that encloses the molecule when hard cutoffs at the van der Waals radii of the individual atoms are used, and it represents a surface through which the molecule can be conceived as interacting with other molecules. Hydrophobic materials respond to water in the opposite way to hydrophilic materials: hydrophobic (water-hating) materials have little or no tendency to absorb water, and water tends to bead on their surfaces. Hydrophobic materials possess low surface tension values and lack active groups in their surface chemistry for the formation of hydrogen bonds with water. Hydrophobicity is very important for the solubility of drugs. Drugs that are extremely hydrophobic are poorly absorbed, because they are practically insoluble in aqueous body fluids and, therefore, cannot gain access to the surface of cells. For a drug to be readily absorbed, it must be largely hydrophobic, yet have some solubility in aqueous solutions. This is one reason why many drugs are weak acids or weak bases. Some drugs are highly lipid-soluble and are transported in the aqueous solutions of the body on carrier proteins such as albumin. The results indicate that logP_{o/w} increases as the SKMostHydrophobic Area increases. SAHydrophobic Area is a van der Waals surface descriptor giving the hydrophobic surface area. The lipid solubility of a compound is of special importance to drug discovery and development, because it is directly related to the ability of a drug candidate to cross biological membranes. The requirement is that drug molecules must be soluble enough in lipid to get into membranes, but not so soluble that they become trapped in them. These membranes are not exclusively anhydrous fatty or oily structures.
As a first approximation, membranes can be considered bilayers composed of lipids consisting of a polar cap and large hydrophobic tail. Phosphoglycerides are major components of lipid bilayers. Other groups of bifunctional lipids include the sphingomyelins, galactocerebrosides, and plasmalogens. The hydrophobic portion is composed largely of unsaturated fatty acids, mostly with cis double bonds. In addition, there are considerable amounts of cholesterol esters, protein, and charged mucopolysaccharides in the lipid membranes. The final result is that these membranes are highly organized structures composed of channels for transport of important molecules such as metabolites, chemical regulators (hormones), amino acids, glucose, and fatty acids into the cell and removal of waste products and biochemically produced products out of the cell. Apparently, increasing the SAHydrophobic Area increases logPo/w. SKAverage is the Average hydophobicity function value. According to Supplementary information, some molecules have a positive Hydrophobicity function, others are negative. If the desired compound is more soluble in nonpolar than polar phase, the Average hydophobicity function value is higher. Finally, increasing the SKAverage increases logP_{o/w}. SKMostHydrophobic Area, SAHydrophobic Area and SKAverage are calculated by SlogP method^{44}. This method represents a new atom type classification system for use in atombased calculation logP_{o/w}.
XKAverageHydrophobicity
XKAverageHydrophobicity is the average hydrophobic value on the van der Waals (vdW) surface. This descriptor is calculated by the XlogP method^{45}. In this method, atoms are classified by their hybridization states and their neighboring atoms. XlogP is based on the summation of atomic contributions and includes correction factors for some intramolecular interactions. XKAverageHydrophobicity increases as logP_{o/w} increases.
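The atom-additive idea described above (a sum of atomic contributions plus correction factors) can be sketched as follows. This is a minimal illustration only: the atom-type labels, the contribution values and the correction term are hypothetical placeholders chosen for the example, and they are not the published XlogP parameters^{45}.

```python
from collections import Counter

def additive_logp(atom_types, contributions, corrections=()):
    """Estimate logP as a sum of per-atom-type contributions plus corrections.

    atom_types    : list of atom-type labels, one per atom
    contributions : dict mapping atom-type label -> contribution
    corrections   : iterable of correction terms (e.g. intramolecular effects)
    """
    counts = Counter(atom_types)
    base = sum(contributions[t] * n for t, n in counts.items())
    return base + sum(corrections)

# Hypothetical atom types and values, for illustration only.
contributions = {"C.ar": 0.30, "O.hydroxyl": -0.40, "H": 0.05}
atoms = ["C.ar"] * 6 + ["O.hydroxyl"] + ["H"] * 6

estimate = additive_logp(atoms, contributions, corrections=[0.10])
print(round(estimate, 2))
```

In a real atom-additive scheme, the contribution table would be fitted to experimental logP_{o/w} values and the atom types would be assigned from hybridization state and neighboring atoms.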
PSA, Polar Surface Area Excluding P & S and Average Potential
The polar surface area (PSA) of a molecule is defined as the sum of the contributions to the molecular surface area of polar atoms such as oxygen, nitrogen and their attached hydrogens. This parameter is easy to understand and, most importantly, provides good correlation with experimental transport data. PSA correlates with passive molecular transport through membranes, which allows prediction of human intestinal absorption, Caco-2 monolayer permeability, and blood-brain barrier penetration. Molecules with a polar surface area greater than 140 Å^{2} tend to be poor at permeating cell membranes. For molecules to penetrate the blood-brain barrier, a PSA of less than 90 Å^{2} is usually needed. In a newer approach proposed by Ertl^{46}, PSA is calculated by summing tabulated surface contributions of polar fragments. PSA increases as logP_{o/w} decreases. Polar Surface Area Excluding P & S is the total polar surface area excluding phosphorus and sulphur. According to Table 2, this descriptor has a positive coefficient. This suggests that molecules containing S and P tend to dissolve in the polar phase, whereas molecules containing other atoms tend to dissolve in the nonpolar phase. Thus, the presence of S and P atoms in a molecule does not favor lipophilicity. Polar Surface Area Excluding P & S increases as logP_{o/w} increases. Average Potential is the average of the total electrostatic potential on the van der Waals surface of the molecule. According to Table 2, Average Potential increases as logP_{o/w} decreases.
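The two PSA rules of thumb quoted above can be written as a simple screening helper. This is only a sketch: the function name and the returned keys are hypothetical, and the thresholds (140 Å^{2} and 90 Å^{2}) are the approximate guideline values stated in the text, not sharp physical limits.

```python
def psa_flags(psa):
    """Apply the PSA rules of thumb to a polar surface area given in Å^2."""
    if psa < 0:
        raise ValueError("polar surface area cannot be negative")
    return {
        # PSA above 140 Å^2: tends to permeate cell membranes poorly.
        "poor_membrane_permeation": psa > 140.0,
        # PSA below 90 Å^2: usually needed for blood-brain barrier penetration.
        "bbb_penetration_plausible": psa < 90.0,
    }

# Example PSA values (arbitrary inputs chosen for illustration).
for value in (45.0, 110.0, 160.0):
    print(value, psa_flags(value))
```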
4PathCount, ChiV6chain and AlphaR
4PathCount is the total number of fragments of fourth order (four-bond paths) in a compound. It describes the connectivity of the atoms within the molecule and also reflects its branching and its flexibility or rigidity. Lipophilicity decreases with branching, because branching of the chain makes the molecule more compact and thereby decreases its surface area. Thus, more branching reduces the size of the molecule, making it harder to solvate in the nonpolar phase. As a result, the lipophilicity of the unbranched (normal) isomers is in all instances higher than that of the branched compounds. According to Table 2, 4PathCount has a negative coefficient, which indicates that this descriptor increases as logP_{o/w} decreases. ChiV6chain is the atomic valence connectivity index for six-membered rings. This descriptor indicates the importance of molecular bulk for lipophilicity. Lipophilicity increases with molecular bulk because large molecules are better solvated in a nonpolar phase such as n-octanol. The descriptor is calculated from the molecular graph. Increasing ChiV6chain increases logP_{o/w}. AlphaR is the sum of the α values of all non-hydrogen atoms in a reference alkane. The reference alkane is the molecular graph obtained when all heteroatoms are replaced by carbon and all multiple bonds are replaced by single bonds. The parameter α is related to the size of an atom, and the term ∑α is a measure of molecular bulk. When ∑α is compared to that of the corresponding reference alkane, a measure of the heteroatom count and size of a molecule can be obtained.
In the extended topochemical atom (ETA) scheme^{47}, the α value of a non-hydrogen atom is defined as

α = [(Z − Z^{v}) / Z^{v}] × [1 / (PN − 1)]

where Z and Z^{v} are the atomic number and the number of valence electrons, respectively, and PN is the period number. Hydrogen is taken as the reference, and its α is set to zero. Table 4 shows the α values of different atoms. According to Table 2, the coefficient of AlphaR is negative. These results indicate that the electronegativity of the atoms must be considered. Molecules containing atoms such as Cl, Br, S and P have higher α values and hence greater size and electronegativity. As a result, more electronegative molecules are better solvated in the aqueous phase^{47}. Finally, AlphaR increases as logP_{o/w} decreases.
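As a worked illustration of the α parameter, the sketch below evaluates α from Z, Z^{v} and PN, assuming the ETA definition α = [(Z − Z^{v})/Z^{v}] × [1/(PN − 1)] with α = 0 for hydrogen. The function names and the small element table are hypothetical helpers written for this example and are not part of VLifeMDS; the reference-alkane sum simply treats every non-hydrogen atom as carbon, as described in the text.

```python
# (Z, Z^v, PN): atomic number, valence electron number, period number.
ATOM_DATA = {
    "C": (6, 4, 2), "N": (7, 5, 2), "O": (8, 6, 2), "F": (9, 7, 2),
    "P": (15, 5, 3), "S": (16, 6, 3), "Cl": (17, 7, 3), "Br": (35, 7, 4),
}

def eta_alpha(z, zv, pn):
    """Return alpha for one atom; hydrogen (PN = 1) is the reference, alpha = 0."""
    if pn == 1:
        return 0.0
    return ((z - zv) / zv) * (1.0 / (pn - 1))

def sum_alpha(heavy_atoms):
    """Sum of alpha over the non-hydrogen atoms of the actual molecule."""
    return sum(eta_alpha(*ATOM_DATA[a]) for a in heavy_atoms)

def sum_alpha_reference(heavy_atoms):
    """Sum of alpha for the reference alkane: every heavy atom counted as carbon."""
    return len(heavy_atoms) * eta_alpha(*ATOM_DATA["C"])

# Chlorobenzene: six carbons and one chlorine (hydrogens contribute zero).
chlorobenzene = ["C"] * 6 + ["Cl"]
print(round(sum_alpha(chlorobenzene), 3), sum_alpha_reference(chlorobenzene))
```

Under this definition carbon has α = 0.5, so the reference-alkane sum depends only on the number of non-hydrogen atoms, while the difference between the two sums reflects the heteroatom content.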
Conclusion
In this work, MLR was used to construct a linear QSPR model to predict the logP_{o/w} of a wide and homogeneous set of aromatic drugs. The MLR method was able to model the relationship between logP_{o/w} and the descriptors. The GA/MLR method was applied for descriptor selection, and the results show that it is a very effective descriptor selection approach for QSPR analysis. Internal and external validation indicate that the MLR model fits the data well and is robust and predictive. From the model validation, it can be concluded that the presented model is valid and can be used effectively to predict logP_{o/w}. Moreover, the mechanism of the model was interpreted and its applicability domain was defined.
References
Daina, A., Michielin, O. & Zoete, V. A Simple, Robust, and Efficient Description of n-Octanol/Water Partition Coefficient for Drug Design Using the GB/SA Approach. J. Chem. Inf. Model. 54, 3284–3301 (2014).
Kerns, E. H. & Di, L. Drug-like Properties: Concepts, Structure Design and Methods: from ADME to Toxicity Optimization (Academic Press, Elsevier, 2008).
Liu, X., Testa, B. & Fahr, A. Lipophilicity and its relationship with passive drug permeation. Pharm. Res. 28, 962–977 (2011).
Pliška, V., Testa, B. & van de Waterbeemd, H. Lipophilicity: The Empirical Tool and the Fundamental Objective. An Introduction. In Lipophilicity in Drug Action and Toxicology; Methods and Principles in Medicinal Chemistry (Weinheim, Wiley-VCH Verlag GmbH, Germany, 1996).
Yazdanian, M. Overview of determination of biopharmaceutical properties for development candidate selection. Curr. Protoc. Pharmacol. Chapter 9, Unit 9.17 (2013).
Conradi, R. A., Burton, P. S. & Borchardt, R. T. Physicochemical and biological factors that influence a drug’s cellular permeability by passive diffusion. In: Lipophilicity in drug action and toxicology (Weinheim, VCH Publishers, 2008).
Taylor, D. C., Pownall, R. & Burke, W. The absorption of β-adrenoceptor antagonists in rat in-situ small intestine; the effect of lipophilicity. J. Pharm. Pharmacol. 37, 280–283 (1985).
Woods, P. B. & Robinson, M. L. An investigation of the comparative liposolubilities of β-adrenoceptor blocking agents. J. Pharm. Pharmacol. 33, 172–173 (1981).
Plá-Delfina, J. M. & Moreno, J. Intestinal absorption-partition relationships: a tentative functional nonlinear model. J. Pharmacokinet. Biopharm. 9, 191–215 (1981).
Yoshimura, Y. & Kakeya, N. Structure-gastrointestinal absorption relationship of penicillins. Int. J. Pharm. 17, 47–57 (1983).
Schoenwald, R. D. & Huang, H. S. Corneal penetration behavior of β-blocking agents I: Physiochemical factors. J. Pharm. Sci. 72, 1266–1272 (1983).
El Tayar, N. et al. Percutaneous penetration of drugs: A quantitative structure-permeability relationship study. J. Pharm. Sci. 80, 744–749 (1991).
Kaibara, A., Hirose, M. & Nakagawa, T. Evaluation of hydrophobic interaction between acidic drugs and bovine serum albumin by reversed-phase high-performance liquid chromatography. Chem. Pharm. Bull. 39, 720–723 (1991).
Láznicek, M., Kvĕtina, J., Mazák, J. & Krch, V. Plasma protein binding-lipophilicity relationships: interspecies comparison of some organic acids. J. Pharm. Pharmacol. 39, 79–83 (1987).
Demotes-Mainard Péhourcq, F., Radouane, A., Labat, L. & Bannwarth, B. Influence of Lipophilicity on the Protein Binding Affinity of Cephalosporins. Pharm. Res. 12, 1535–1538 (1995).
Yokogawa, K. et al. Relationships in the Structure–Tissue Distribution of Basic Drugs in the Rabbit. Pharm. Res. 7, 691–696 (1990).
Komiya, I., Park, J. Y., Kamani, A., Ho, N. F. H. & Higuchi, W. I. Quantitative mechanistic studies in simultaneous fluid flow and intestinal absorption using steroids as model solutes. Int. J. Pharm. 4, 249–262 (1980).
Taylor, D. C., Pownall, R. & Burke, W. The absorption of beta-adrenoceptor antagonists in rat in-situ small intestine; the effect of lipophilicity. J. Pharm. Pharmacol. 37, 280–283 (1985).
Leo, A. Comprehensive Medicinal Chemistry (Oxford, Pergamon, 1990).
Meylan, W. M. & Howard, P. H. Estimating log P with Atom/Fragments and Water Solubility with log P. Perspectives Drug Discovery Design. 19, 67–84 (2000).
Yang, S. S., Lu, W. C., Gu, T. H., Yan, L. M. & Li, G. Z. QSPR Study of n-Octanol/Water Partition Coefficient of Some Aromatic Compounds Using Support Vector Regression. QSAR. Comb. Sci. 28, 175–182 (2009).
Schüürmann, G. Quantum Chemical Estimation of Octanol/Water Partition Coefficient - First Results with Aromatic Phosphorothionates. Fresenius. Environ. Bull. 4, 238–243 (1995).
Gombar, V. K. & Enslein, K. Assessment of n-Octanol-Water Partition Coefficient: When Is the Assessment Reliable? J. Chem. Inf. Comput. Sci. 36, 1127–1134 (1996).
Leo, A., Hansch, C. & Elkins, D. Partition Coefficients and their Uses. Chem. Rev. 71, 525–616 (1971).
Sabljic, A. & Horvatic, D. Graph III: A Computer Program for Calculating Molecular Connectivity Indices on Microcomputers. J. Chem. Inf. Comput. Sci. 33, 292–295 (1993).
Duprat, A. F., Huynh, T. & Dreyfus, G. Toward a Principled Methodology for Neural Network Design and Performance Evaluation in QSPR. Application to the Prediction of logP. J. Chem. Inf. Comput. Sci. 38, 586–594 (1998).
Bodor, N. & Huang, M. J. An extended version of a novel method for the estimation of partition coefficients. J. Pharm. Sci. 81, 272–281 (1992).
Eisfeld, W. & Maurer, G. Study on the Correlation and Prediction of Octanol/Water Partition Coefficients by Quantum Chemical Calculations. J. Phys. Chem. B. 103, 5716–5729 (1999).
Yaffe, D., Cohen, Y., Espinosa, G., Arenas, A. & Giralt, F. Fuzzy ARTMAP and Back-Propagation Neural Networks Based Quantitative Structure-Property Relationships (QSPRs) for Octanol-Water Partition Coefficient of Organic Compounds. J. Chem. Inf. Comput. Sci. 42, 162–183 (2002).
Ravina, E. The Evolution of Drug Discovery: From Traditional Medicines to Modern Drugs (John Wiley & Sons, 2011).
Avdeef, A. Absorption and Drug Development: Solubility, Permeability, and Charge State (John Wiley & Sons, 2003).
Martin, T. M. et al. Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling? J. Chem. Inf. Model. 52, 2570–2578 (2012).
Haupt, R. L. & Haupt, S. E. Practical Genetic Algorithms (Wiley, New Jersey, 2004).
Tropsha, A., Gramatica, P. & Gombar, V. K. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR. Comb. Sci. 22, 69–77 (2003).
Gramatica, P. Principles of QSAR models validation: internal and external. QSAR. Comb. Sci. 26, 694–701 (2007).
Gramatica, P., Chirico, N., Papa, E., Cassani, S. & Kovarich, S. QSARINS: A new Software for the Development, Analysis, and Validation of QSAR MLR Models. J. Comput. Chem. 34, 2121–2132 (2013).
Papa, E., Kovarich, S. & Gramatica, P. Development, Validation and Inspection of the Applicability Domain of QSPR Models for Physicochemical Properties of Polybrominated Diphenyl Ethers. QSAR. Comb. Sci. 28, 790–796 (2009).
Shi, L. M. et al. QSAR models using a large diverse set of estrogens. J. Chem. Inf. Comput. Sci. 41, 186–195 (2001).
Schüürmann, G., Ebert, R. U., Chen, J., Wang, B. & Kühne, R. External validation and prediction employing the predictive squared correlation coefficient test set activity mean vs training set activity mean. J. Chem. Inf. Model. 48, 2140–2145 (2008).
Consonni, V., Ballabio, D. & Todeschini, R. Comments on the Definition of the Q^{2} Parameter for QSAR Validation. J. Chem. Inf. Model. 49, 1669–1678 (2009).
Chirico, N. & Gramatica, P. Real External Predictivity of QSAR Models: How to Evaluate It? Comparison of Different Validation Criteria and Proposal of Using the Concordance Correlation Coefficient. J. Chem. Inf. Model. 51, 2320–2335 (2011).
Consonni, V., Ballabio, D. & Todeschini, R. Evaluation of model predictive ability by external validation techniques. J. Chemometrics. 24, 194–201 (2010).
Gramatica, P., Giani, E. & Papa, E. Statistical external validation and consensus modeling: A QSPR case study for K_{oc} prediction. J. Mol. Graph. Model. 25, 755–766 (2007).
Wildman, S. A. & Crippen, G. M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).
Wang, R., Fu, Y. & Lai, L. A New Atom-Additive Method for Calculating Partition Coefficients. J. Chem. Inf. Comput. Sci. 37, 615–621 (1997).
Ertl, P., Rohde, B. & Selzer, P. Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties. J. Med. Chem. 43, 3714–3717 (2000).
Roy, K. & Das, R. N. On some novel extended topochemical atom (ETA) parameters for effective encoding of chemical information and modeling of fundamental physicochemical properties. SAR and QSAR in Environmental Res. 22(5–6), 451–472 (2011).
Acknowledgements
We wish to thank Prof. Paola Gramatica for her valuable help with the use of the QSARINS software. We are grateful to the University of Kurdistan Research Councils and Islamic Azad University for partial support of this work.
Author information
Contributions
Saadi Saaidpour designed the research and analyzed the data. Asrin Bahmani performed the research and wrote the paper. Amin Rostami provided guidance on the whole study. All authors were involved in revising the manuscript, and all authors read and approved the final manuscript.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bahmani, A., Saaidpour, S. & Rostami, A. A Simple, Robust and Efficient Computational Method for n-Octanol/Water Partition Coefficients of Substituted Aromatic Drugs. Sci Rep 7, 5760 (2017). https://doi.org/10.1038/s41598-017-05964-z
DOI: https://doi.org/10.1038/s41598-017-05964-z