A Simple, Robust and Efficient Computational Method for n-Octanol/Water Partition Coefficients of Substituted Aromatic Drugs

In this paper, multiple linear regression (MLR) was used to build a quantitative structure-property relationship (QSPR) model for the n-octanol/water partition coefficient (logPo/w) of 195 substituted aromatic drugs. Molecular descriptors were calculated for each compound with VLifeMDS. A genetic algorithm/multiple linear regression (GA/MLR) procedure selected the most relevant descriptors for the QSPR model. The robustness of the model was characterized by statistical validation and by its applicability domain (AD). The MLR predictions are in good agreement with the experimental values: R² and Q²LOO are 0.9433 and 0.9341, respectively. The AD of the model was analyzed with a Williams plot. The effects of the selected descriptors are described.

n-Octanol is considered a good mimic of phospholipid membranes because of its amphiphilic nature [3]. Among physicochemical properties, lipophilicity plays a key role in molecular discovery in a variety of domains, including agrochemicals, cosmetics, materials science, environmental chemistry, food chemistry and, particularly, medicinal chemistry [4]. A correct estimate of logPo/w is essential for the discovery and development of efficient therapeutic molecules [5]. Although lipophilicity cannot characterize the whole physicochemical nature of a compound, the properties governing lipophilicity have a basic effect on the actions of organic molecules such as drugs or drug candidates. Many drugs go through a series of partitioning steps: (a) leaving the aqueous extracellular fluids, (b) passing through lipid membranes, and (c) entering other aqueous environments before reaching the receptor. In this sense, a drug undergoes the same partitioning phenomenon as any chemical in a separatory funnel containing water and a non-polar solvent. A compound must therefore have an optimal lipophilicity, because a solute that is too lipophilic will remain trapped in the membrane [6]. Lipophilicity is one of the main factors influencing the pharmacokinetic behavior of β-blockers, in several ways: (1) oral absorption, (2) penetration into the central nervous system (CNS), (3) renal clearance, (4) degree of biotransformation and plasma half-life, (5) cardioselectivity, and (6) corneal penetration [7,8]. For example, the most lipophilic β-blockers (such as propranolol) penetrate readily into the CNS and cause central effects (somnolence), whereas the more hydrophilic drugs have low CNS penetration and negligible central effects [8]. The in situ rat gut technique is an informative tool that yields realistic absorption rates. In a 1981 study of 18 sulfonamides, the absorption rate constant ka was correlated with the lipophilicity parameter [9].
Good gastrointestinal (GI) absorption was for many years a problem in the development of penicillins. Yoshimura [10] carried out a systematic study in mice and rats and showed that the two major molecular properties influencing the GI absorption of penicillins are their stability in acidic solutions and their lipophilicity. Corneal penetration is a critical condition for the therapeutic success of ocularly administered drugs such as the β-blockers used as antiglaucoma agents. In 1983, an important study showed that lipophilicity clearly plays a key role in penetration through the intact cornea: in a series of 12 β-blockers, the logPC (permeability coefficient) exhibited a parabolic relation with lipophilicity [11]. For a homogeneous set of phenols, a parabolic relation was found between human skin permeability (Kp) and logPo/w [12]. In 1991, for 11 aromatic acids (model compounds and anti-inflammatory drugs), the binding constant to bovine serum albumin (in logarithmic form) was correlated with a hydrophobic index obtained by RP-HPLC [13]. In another study, the unbound fraction in plasma (fu), taken as the biological response, showed a sigmoidal relation with logPo/w [14]. Interestingly, parabolic relations between protein binding and lipophilicity are also known, indicating the limited dimensions of some binding sites. When large molecules such as cephalosporins were tested for their association constant (Ka) to human serum albumin, a fair parabolic relation was found with lipophilicity [15]. In another important study, the concentrations of 10 basic drugs in plasma and in 8 non-metabolizing tissues were examined after administration to rabbits. These drugs were weakly basic benzodiazepines and strongly basic neurological drugs. Good linear relations (R² = 0.92 to 0.97) were found between the tissue-to-plasma concentration ratios of the unbound, non-ionized drugs and their logPo/w.
The slope of the linear regressions increased in the order: muscle < skin < bone < brain < gut < heart < lung < adipose [16]. In many studies on drug permeation through biological membranes (gut wall, skin, blood-brain barrier, and Caco-2 cell monolayer), relationships between permeation and lipophilicity have been developed with homologous series of compounds of diverse nature (acidic, alkaline and neutral) to investigate the influence of lipophilicity on passive diffusion. For example, sigmoidal relationships were established between permeability coefficients in rat jejunum and logPo/w for seven steroids [17] and 11 β-blockers [18]. Despite the good solubility of most organic compounds in n-octanol and its ease of handling in the laboratory, the experimental determination of logPo/w remains a resource- and time-consuming process. Methods to estimate logPo/w are mainly used in medicinal chemistry and molecular design. Estimation approaches include group and atom contribution methods [19,20] and quantitative structure-property relationships (QSPR) derived from statistical regressions [21-23]. Group and atom contribution models are usually based on fragments, derived either from atoms or from groups of atoms, which are assigned incremental logPo/w contributions [24]. QSPR models have been developed as an alternative strategy for estimating lipophilicity. The assumption of QSPR for logPo/w is that physicochemical properties can be correlated with molecular structural characteristics (geometric and electronic) expressed in terms of appropriate molecular descriptors [25]. In recent years, enhancements in logPo/w QSPR have been suggested through the use of molecular descriptors derived from semi-empirical molecular orbital (MO) calculations [26].
For example, Bodor [27], using AM1 semi-empirical MO theory, reported a standard deviation of 0.306 logPo/w units for an 18-parameter linear correlation developed to estimate lipophilicity for a heterogeneous data set of 302 organic compounds. In 1999, Eisfeld and Maurer [28] proposed a logPo/w correlation with dipole moment, polarizability, electrostatic potential and molar volume as chemical descriptors, based on a heterogeneous set of 202 compounds with a reported standard deviation and maximum absolute error of 0.287, respectively. Yaffe [29], using fuzzy ARTMAP and back-propagation neural network based QSPR, estimated logPo/w for a heterogeneous set of 442 organic compounds.
In this work, we develop a QSPR model for the logPo/w of 195 substituted aromatic drugs. Many of these drugs are important in medicinal chemistry. Alprazolam is mostly used to treat anxiety disorders, panic disorders, and nausea due to chemotherapy. Dapsone is commonly used in combination with Rifampicin and Clofazimine for the treatment of leprosy. Procaine, a local anesthetic of the amino ester group, is used primarily to reduce the pain of intramuscular injection of penicillin and is also used in dentistry. Warfarin treatment can help prevent the formation of future blood clots and reduce the risk of embolism [30]. All 195 drugs in this paper form a homogeneous set of aromatic drugs.

Computational approach
All calculations were run on a Dell Inspiron N5010 laptop computer with an Intel® Core™ i7 processor and the Windows 7 operating system. The molecular structures of all compounds were drawn in HyperChem 8.0 (Hypercube, Inc., Gainesville, 2011) and pre-optimized using the MM+ molecular mechanics method (Polak-Ribiere algorithm). The final geometries of the minimum-energy conformations were obtained by a more precise optimization with the semi-empirical PM3 method, applying a root mean square gradient limit of 0.05 kcal mol⁻¹ Å⁻¹ as the stopping criterion for optimized structures. The molecular descriptors were calculated with the VLifeMDS software (version 4.4). A GA/MLR procedure was used for descriptor selection with the QSARINS software package (QSAR-INSubria, version 2.2.1, 2015). MLR was also performed with QSARINS.

Data set selection
For the present study, logPo/w values of 195 drug compounds were collected from the literature [31]. The molecules cover a wide range of lipophilicity (from −2.17 to 6.03). To obtain a validated and, therefore, predictive QSPR model, the available data set should be divided into training and test sets. Commonly, this splitting is performed using random or rational splitting methods [32]. The data set was split randomly into a training set of 147 compounds and a prediction set of 48 compounds (see Table 1).
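A random split of this kind can be sketched in a few lines. The function name `random_split` and the fixed seed are illustrative assumptions; the paper does not state how its random split was generated, so this is only one plausible way to partition 195 compounds into 147 and 48.

```python
import random

def random_split(n_compounds, n_train, seed=0):
    """Randomly partition compound indices into training and prediction sets."""
    indices = list(range(n_compounds))
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# 195 compounds in total, 147 for training, the remaining 48 for prediction.
train_idx, pred_idx = random_split(195, 147)
print(len(train_idx), len(pred_idx))
```

A rational split (for example, one based on the ordered response values) would replace the shuffle with a deterministic selection rule; the rest of the workflow stays the same.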

Computational methods
Descriptor generation. Molecular descriptors are generated from molecular structures. Although different descriptors require different processing steps, many steps are common to these procedures. Molecular descriptors are powerful tools for the approximation of selected properties of chemical structures in an easy-to-handle form that allows efficient comparison and selection of compounds possessing required chemical,

Scientific Reports | 7: 5760 | DOI:10.1038/s41598-017-05964-z

Descriptor selection. After descriptor generation, a pool of molecules with their corresponding descriptors becomes available for model calculation. However, only a limited number of modeling descriptors, related to the studied response, must be selected from the available pool. Descriptor selection is the process of selecting a subset of relevant variables for use in model construction. In QSARINS this is done using a GA/MLR procedure. This technique is able to explore a broad range of solutions, searching for the best ones by maximizing or minimizing a selected fitness function. It mimics natural selection, in which the best solutions replace the less performing ones. In biological terms, one would say that the best genes in the population displace the less fit. In our case, every descriptor represents a gene, and a set of descriptors represents a chromosome. The fitness of a chromosome is related to the performance of the corresponding model. Starting with a pool of chromosomes, small subsets of chromosomes are picked randomly, and the best become parents. Couples of parent chromosomes are then crossed at a random position (crossing-over), thus producing offspring whose chromosomes are a combination of the parent ones. If one or more of the new chromosomes outperform the least fit members of the parent population, they replace them.
By repeating this procedure many times, and also introducing random mutations (descriptor substitution) in the chromosomes, the procedure ends with a population of models that perform better than the models introduced at the beginning. To prevent a completely random start of the GA, QSARINS uses the best set of descriptors extracted from the all-subset process as the core of the chromosomes of the initial population.
In QSARINS, the GA can be tuned by changing the population size, the mutation rate, and the number of generations. A fundamental option is the choice of the fitness function used by the GA. Fitting criteria can select models that fit well with a minimum number of descriptors, but they provide no information on the predictive ability of the models. For this reason, the leave-one-out cross-validated coefficient (Q²LOO) was used in this work as the fitness function throughout the GA process, for the selection of predictive models [33]. The GA selection was stopped when increasing the model size no longer improved the Q² value significantly. The main GA parameters were set as follows: population size 100, maximum number of descriptors allowed in a model 10, and reproduction/mutation trade-off 0.5. Finally, we obtained a 10-descriptor subset that retains most of the interpretive information for logPo/w. The selected descriptors, calculated for each compound in the data set, are: SKMostHydrophobic Area, SAHydrophobic Area, SKAverage, XKAverageHydrophobicity, PSA, Average Potential, Polar Surface Area Excluding P & S, 4PathCount, ChiV6chain and AlphaR.
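The GA/MLR idea described above can be illustrated with a toy implementation. This is a minimal sketch on synthetic data, not the QSARINS algorithm: the function names `q2_loo` and `ga_select`, the boolean-mask chromosome, the tournament of three, the bit-flip mutation (standing in for descriptor substitution), and the small default settings are all assumptions made for illustration. The fitness function is the leave-one-out Q², computed with the standard hat-matrix shortcut for OLS.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q2 of an OLS model (with intercept), via the hat matrix."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                      # hat matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))   # leave-one-out residuals
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, max_size=3, pop_size=30, generations=300,
              mutation_rate=0.5, seed=0):
    """Toy GA: a chromosome is a boolean mask over descriptors, fitness is Q2LOO."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]

    def repair(mask):
        # Keep between 1 and max_size descriptors switched on.
        on = np.flatnonzero(mask)
        if len(on) == 0:
            mask[rng.integers(n_desc)] = True
        elif len(on) > max_size:
            mask[rng.choice(on, size=len(on) - max_size, replace=False)] = False
        return mask

    def fitness(mask):
        return q2_loo(X[:, mask], y)

    pop = [repair(rng.random(n_desc) < 0.5) for _ in range(pop_size)]
    scores = [fitness(c) for c in pop]
    for _ in range(generations):
        # Pick small random subsets of chromosomes; the best of each is a parent.
        parents = []
        for _ in range(2):
            picks = rng.choice(pop_size, size=3, replace=False)
            parents.append(pop[max(picks, key=lambda i: scores[i])])
        # Cross the two parents at a random position.
        cut = rng.integers(1, n_desc)
        child = np.concatenate([parents[0][:cut], parents[1][cut:]])
        # Random mutation: flip one gene (a simplification of descriptor substitution).
        if rng.random() < mutation_rate:
            child[rng.integers(n_desc)] ^= True
        child = repair(child)
        score = fitness(child)
        # The offspring replaces the least fit chromosome if it performs better.
        worst = int(np.argmin(scores))
        if score > scores[worst]:
            pop[worst], scores[worst] = child, score
    best = int(np.argmax(scores))
    return np.flatnonzero(pop[best]), scores[best]

# Synthetic example: only descriptors 0 and 3 carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=60)
selected, score = ga_select(X, y)
print(selected, round(score, 3))
```

With the settings reported in the text, one would use a population size of 100, a maximum of 10 descriptors and a mutation trade-off of 0.5; the smaller defaults here only keep the example fast.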
Modeling method in QSARINS. The data sets used in QSPR analysis are, as previously mentioned, composed of descriptors that should be correlated with the corresponding experimental responses. At this step it is necessary to apply a quantitative method able to find the relationship between a limited number of structural descriptors and the modeled response. In QSARINS, the method used is MLR, which can be expressed by the following formula:

$$y_i = b_0 + \sum_{j=1}^{m} b_j x_{ij} + e_i \qquad (2)$$

where a linear relationship is computed between the studied responses (y_i) and the selected values of the descriptors (x_ij); e_i is the random error (also called the model residual). The intercept (b_0) and the coefficients (b_j) are to be estimated. Equation (2) can be rewritten in a more compact form using matrix notation:

$$y = Xb + e \qquad (3)$$

where y is the vector of responses, b the vector of coefficients and e the vector of errors. X is the model matrix, whose columns are the descriptors. In this software, the ordinary least squares (OLS) technique is used to estimate the vector of coefficients:

$$\hat{b} = (X^T X)^{-1} X^T y \qquad (4)$$

where \hat{b} is the vector that estimates the vector b of the coefficients, X^T is the transposed X matrix and −1 denotes matrix inversion. OLS minimizes the sum of squares of the differences between the experimental responses and those calculated by the model. To work correctly, OLS assumes that: (1) a linear relationship exists between the descriptors and the response, (2) the response errors are independent and identically distributed, (3) the descriptors are not too correlated with each other, and (4) there are more compounds than modeling descriptors (a ratio that should always be higher than 5:1). Once the coefficients of the model are calculated, the vector of calculated responses ŷ is obtained as:

$$\hat{y} = X\hat{b} = X (X^T X)^{-1} X^T y = H y \qquad (5)$$

Model evaluation. Evaluation of a QSPR model is a very important step, and goodness-of-fit is acknowledged to be very important for QSPR models.
The goodness-of-fit of the models is quantified by R², the squared correlation coefficient; R²adj, the adjusted squared correlation coefficient; s, the standard error of the regression; and F, the Fisher ratio for regression. R² is defined as:

$$R^2 = 1 - \frac{RSS}{TSS} \qquad (6)$$

where RSS is the residual sum of squares and TSS is the total sum of squares. Adjusted R² detects possible overfitting of a model; used as a fitness function, it is useful for selecting models that fit well with the minimum number of descriptors. Adjusted R² is defined as:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - m - 1} \qquad (7)$$

where n is the number of members of the training set and m is the number of descriptors included in the model. Adjusted R² is a better measure than R² of the proportion of variance in the data explained by the correlation. The standard error indicates the degree of dispersion of the random error. The F-ratio in regression is defined as the ratio between the variance explained by the model and the residual variance. The larger R², R²adj and F, and the smaller s, the better the fitting ability of the model.
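The OLS estimate and the fit statistics above can be computed directly with NumPy. The sketch below is illustrative only: the function name `fit_mlr` and the tiny four-point data set are assumptions, and the expressions used for s and F (residual variance with n − m − 1 degrees of freedom, and explained variance over residual variance) are the usual textbook forms rather than formulas quoted from the paper.

```python
import numpy as np

def fit_mlr(X, y):
    """OLS fit with intercept, returning coefficients and goodness-of-fit statistics."""
    n, m = X.shape                               # n compounds, m descriptors
    A = np.column_stack([np.ones(n), X])         # model matrix with intercept column
    b = np.linalg.solve(A.T @ A, A.T @ y)        # b = (X^T X)^-1 X^T y
    y_hat = A @ b
    rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)
    s = np.sqrt(rss / (n - m - 1))               # standard error of the regression
    f_ratio = ((tss - rss) / m) / (rss / (n - m - 1))
    return {"b": b, "r2": r2, "r2_adj": r2_adj, "s": s, "F": f_ratio}

# Toy data: y = 1 + 2x plus small residuals that are orthogonal to the model.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 4.9, 7.1])
fit = fit_mlr(X, y)
print(fit["b"], fit["r2"], fit["r2_adj"])
```

Solving the normal equations with `np.linalg.solve` mirrors equation (4) literally; for ill-conditioned descriptor matrices, `np.linalg.lstsq` is the more robust choice.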
Model validation. Model calculation and evaluation are the basic steps in QSPR analysis, but they are not sufficient to guarantee the validity of the model. Validation is fundamental to ensure the reliability of the data predicted by the models, and both internal and external validation are considered necessary [35].
Internal validation analyzes the contribution of each individual object to the final equation. The procedure used is leave-one-out (LOO) cross-validation, which was carried out on the training set to calculate Q²LOO:

$$Q^2_{LOO} = 1 - \frac{PRESS}{TSS} \qquad (8)$$

where TSS is the total sum of squares, that is, the sum of squared deviations from the data set mean, and PRESS is the sum of squares of the prediction errors. The larger Q²LOO, the greater the predictive ability of the model. However, perturbing only one compound at a time is too weak a test to demonstrate real model robustness. QSARINS therefore also includes the stronger leave-more-out (or leave-many-out, LMO) technique, which studies the behavior of the model when a larger number of compounds are eliminated. LMO is used to counteract the slight overoptimism of LOO cross-validation. The model under analysis can be considered stable if the R² and Q² values calculated in every LMO iteration, and their averages (R²LMO and Q²LMO), are close to the R² and Q²LOO values of the model [36].
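Both cross-validation schemes can be sketched as explicit refitting loops. The code below is a minimal illustration on synthetic data: the function names, the 30% leave-out fraction, the number of LMO iterations and the use of the retained-set mean in the LMO denominator are assumptions, since the paper does not give these details. The hat-matrix shortcut is included only to show that it reproduces the explicit leave-one-out loop for OLS.

```python
import numpy as np

def _fit_predict(X_tr, y_tr, X_new):
    """Fit OLS with intercept on the training part and predict the new rows."""
    A = np.column_stack([np.ones(len(y_tr)), X_tr])
    b, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ b

def q2_loo(X, y):
    """Q2LOO = 1 - PRESS/TSS, refitting the model once per left-out compound."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        pred = _fit_predict(X[keep], y[keep], X[i:i + 1])[0]
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def q2_loo_hat(X, y):
    """Same quantity from a single fit, using leave-one-out residuals e_i / (1 - h_ii)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)

def q2_lmo(X, y, fraction=0.3, iterations=200, seed=0):
    """Average Q2 over random leave-many-out splits (one possible definition)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_out = max(1, int(round(fraction * n)))
    values = []
    for _ in range(iterations):
        out = rng.choice(n, size=n_out, replace=False)
        keep = np.setdiff1d(np.arange(n), out)
        pred = _fit_predict(X[keep], y[keep], X[out])
        press = np.sum((y[out] - pred) ** 2)
        values.append(1.0 - press / np.sum((y[out] - y[keep].mean()) ** 2))
    return float(np.mean(values))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.2, size=40)
print(q2_loo(X, y), q2_loo_hat(X, y), q2_lmo(X, y))
```

A stable model in the sense described above would show a Q²LMO average close to Q²LOO.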
To show that the model is not the result of chance correlation, the Y-scrambling procedure can be applied. In this process, the responses are shuffled at random, so that no correlation between them and the descriptors should exist. As a consequence, the performance of the corresponding scrambled models should decrease drastically. If the original model under validation is good, the R² and Q² values of every iteration, and their averages (R²yscr and Q²LOO-yscr), must be much smaller than the values of the original model. If Q²LOO-yscr < 0.2 and R²yscr < 0.2, there is no risk of chance correlation in the developed model.

External validation is also necessary in the process of model validation. It checks the ability of the model to predict new compounds. This is done by applying the model equation, obtained on the training set, to one or more prediction set(s), that is, to the excluded compounds that have never been used in model development. Q²F1 is defined as:

$$Q^2_{F1} = 1 - \frac{\sum_{i=1}^{n_{EXT}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{TR})^2} = 1 - \frac{PRESS}{SS_{EXT}(\bar{y}_{TR})} \qquad (9)$$

where \bar{y}_{TR} is the response mean of the training set, PRESS is the predictive sum of squares, and SS_{EXT}(\bar{y}_{TR}) is the total sum of squares of the external set calculated with the training set mean. Consequently, this formula gives valid values when the test set spans the whole response domain of the model, because in this case the test set mean approaches the training set mean. Q²F2 is defined as:

$$Q^2_{F2} = 1 - \frac{\sum_{i=1}^{n_{EXT}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{EXT})^2} = 1 - \frac{PRESS}{SS_{EXT}(\bar{y}_{EXT})} \qquad (10)$$

where \bar{y}_{EXT} is the response mean of the external test set and SS_{EXT}(\bar{y}_{EXT}) is the total sum of squares of the external set calculated with the external set mean. Q²F2 does not carry information about the reference model, because \bar{y}_{EXT} encodes information derived from the external set, and this information changes continuously with the objects belonging to the external set. Q²F3 is defined as:

$$Q^2_{F3} = 1 - \frac{\left[\sum_{i=1}^{n_{EXT}} (\hat{y}_i - y_i)^2\right] / n_{EXT}}{\left[\sum_{i=1}^{n_{TR}} (y_i - \bar{y}_{TR})^2\right] / n_{TR}} \qquad (11)$$

where n_{TR} and n_{EXT} are the numbers of training and external objects. The concordance correlation coefficient (CCC) is defined as:

$$CCC = \frac{2 \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{x} - \bar{y})^2} \qquad (12)$$

It is well suited to measure the agreement between experimental and predicted data, which should be the real aim of any predictive QSPR model. Here x_i and y_i are the abscissa and ordinate values of the plot of the experimental values versus the values calculated by the model, n is the number of chemicals, and \bar{x} and \bar{y} are the averages of the abscissa and ordinate values, respectively. This coefficient measures both precision (how far the observations are from the fitting line) and accuracy (how far the regression line deviates from the concordance line, the line of slope 1 passing through the origin); consequently, any divergence of the regression line from the concordance line gives a CCC value smaller than 1.
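The external validation criteria discussed here can be computed from three lists of numbers. The sketch below assumes the standard literature definitions of Q²F1, Q²F2, Q²F3 and CCC; the function name `external_metrics` and the small example values are illustrative and are not data from the paper.

```python
def external_metrics(y_train, y_ext, y_pred):
    """Q2F1, Q2F2, Q2F3 and CCC for an external set (standard definitions assumed)."""
    n_tr, n_ext = len(y_train), len(y_ext)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_ext) / n_ext
    mean_pred = sum(y_pred) / n_ext
    # Predictive sum of squares on the external set.
    press = sum((o - p) ** 2 for o, p in zip(y_ext, y_pred))
    # External total sums of squares around the training and external means.
    ss_ext_tr = sum((o - mean_tr) ** 2 for o in y_ext)
    ss_ext_ext = sum((o - mean_ext) ** 2 for o in y_ext)
    # Training total sum of squares, needed for Q2F3.
    tss_tr = sum((o - mean_tr) ** 2 for o in y_train)
    # Terms of the concordance correlation coefficient.
    cov = sum((o - mean_ext) * (p - mean_pred) for o, p in zip(y_ext, y_pred))
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return {
        "Q2F1": 1.0 - press / ss_ext_tr,
        "Q2F2": 1.0 - press / ss_ext_ext,
        "Q2F3": 1.0 - (press / n_ext) / (tss_tr / n_tr),
        "CCC": 2.0 * cov / (ss_ext_ext + ss_pred + n_ext * (mean_ext - mean_pred) ** 2),
    }

y_train = [1.0, 2.0, 3.0, 4.0]
y_ext = [1.5, 2.5, 3.5]
perfect = external_metrics(y_train, y_ext, y_ext)                # ideal predictions
shifted = external_metrics(y_train, y_ext, [2.5, 3.5, 4.5])      # constant bias of +1
print(perfect)
print(shifted)
```

The biased example shows why CCC is useful: the shifted predictions are perfectly correlated with the observations, yet CCC drops well below 1 because the regression line departs from the concordance line.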
An essential property of a function for assessing model fit from external evaluation data is that the external observations are independent of each other. This means that the Q² value derived from the whole external data set (Q²EXT) and the average of the Q² values obtained by taking each external object separately, one at a time, should coincide. The optimized model was applied to predict the logPo/w values of the 48 drugs in the prediction set, which were not used in the optimization procedure. The predictive ability of a model on the external validation set can be expressed by Q²EXT:

$$Q^2_{EXT} = \frac{1}{n_{EXT}} \sum_{i=1}^{n_{EXT}} Q^2_i \qquad (13)$$

where Q²_i is the external Q² calculated taking into account only the ith object of the test set and n_EXT is the total number of external objects.
An additional measure of the accuracy of the proposed QSPR is the root mean squared error (RMSE), which summarizes the overall error of the model:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n_{EXT}} (\hat{y}_i - y_i)^2}{n_{EXT}}} \qquad (14)$$

where \hat{y}_i is the predicted value for the ith test object, y_i is its observed value, and n_EXT is the total number of test objects. This parameter depends only on the deviations between predicted and observed values, and it can always be calculated, even when there is only one test object. It is calculated as the square root of the sum of squared prediction errors divided by their total number. This parameter was calculated to compare the accuracy and the stability of our models in the training (RMSE_TR) and prediction (RMSE_EXT) sets. It is important to note that RMSE values must be not only low but also as similar as possible for the training, cross-validation and external prediction sets. This indicates that the proposed model has both predictive ability (low values) and sufficient generalizability (similar values).

The AD is a theoretical region in chemical space, defined by the model descriptors and the modeled response, and thus by the nature of the chemicals in the training set, as represented in each model by specific molecular descriptors. As even a robust, significant and validated QSPR cannot be expected to reliably predict the modeled property for the whole universe of chemicals, its domain of application must be defined, and only the predictions for chemicals that fall within this domain can be considered reliable. The Williams plot of the regression permits graphical detection of both the outliers for the response (Y-outliers) and the structurally influential chemicals in a model (X-outliers). It consists of plotting the standardized residuals on the y-axis and the leverage values from the hat matrix diagonal on the x-axis. The leverage (h) of a compound measures its influence on the model.
The leverage of a compound in the original variable space is defined as:

$$h_i = x_i^T (X^T X)^{-1} x_i \qquad (15)$$

where x_i is the descriptor vector of the ith compound, X is the model matrix derived from the training set descriptor values, and the leverage values of the training set are the diagonal elements of the hat (or influence) matrix H (h_i = diag(H)). The leverage values are always between 0 and 1. The warning leverage h* is defined as:

$$h^* = \frac{3p'}{n} \qquad (16)$$

where n is the number of training set compounds and p′ is the number of model parameters plus one. Observations with standardized residuals outside the (−3; +3) range, which lie outside the horizontal reference lines on the plot, are treated as response outliers in QSARINS (standardized residuals beyond ±3σ, where σ is the standard deviation of the residuals). The standardized residual (SR_i) of each sample is calculated as in equation (17):

$$SR_i = \frac{y_i - \hat{y}_i}{\sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}} \qquad (17)$$

where y_i and ŷ_i are, respectively, the measured and predicted values of the property, and n is the number of compounds in each set of data. To visualize the AD of a QSPR model, the plot of standardized residuals versus leverage values (h) (Williams plot) can be used for an immediate and simple graphical detection of both the response outliers and the structurally influential chemicals in a model (h > h*). Concerning the residuals, all chemicals falling above or below the user-defined threshold are not well predicted and are thus considered outliers. Too many outliers, especially underestimated ones, are symptomatic of a poor model, which is why the outliers are counted. Leverage values represent the degree of influence that the structure of each chemical has on the model. A compound with high leverage in a QSPR model is a driving force for the variable selection if it is in the training set (good leverage). A high-leverage compound in the prediction set lies far from the chemical domain of the training compounds, so its predicted value could be unreliable, being the result of substantial extrapolation of the model. Therefore, the structural information of the chemicals included in the training set may not be sufficient for a reliable prediction of chemicals lying outside the training AD [43].
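The quantities behind a Williams plot can be sketched as follows. The function names, the intercept column added to the model matrix, and the per-set root-mean-square scaling of the residuals are assumptions made for illustration (QSARINS may scale residuals differently); the random descriptor matrix only stands in for real data. With 147 training compounds and 10 descriptors, the warning leverage evaluates to 3 × 11 / 147 ≈ 0.224, the value quoted in the results.

```python
import numpy as np

def warning_leverage(n_train, n_descriptors):
    """h* = 3 p' / n, with p' taken as the number of descriptors plus one."""
    return 3.0 * (n_descriptors + 1) / n_train

def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i, using the training model matrix with intercept."""
    A_tr = np.column_stack([np.ones(len(X_train)), X_train])
    A_q = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.inv(A_tr.T @ A_tr)
    return np.einsum("ij,jk,ik->i", A_q, xtx_inv, A_q)

def standardized_residuals(y_obs, y_pred):
    """Residuals divided by their root mean square within the given set."""
    resid = np.asarray(y_obs) - np.asarray(y_pred)
    return resid / np.sqrt(np.mean(resid ** 2))

def flag_compounds(h, sr, h_star, sr_limit=3.0):
    """Boolean masks for structurally influential compounds and response outliers."""
    return h > h_star, np.abs(sr) > sr_limit

h_star = warning_leverage(147, 10)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(147, 10))       # placeholder descriptors, not real data
h_train = leverages(X_train, X_train)
print(round(h_star, 3), h_train.sum())
```

For the training set, the leverages are the diagonal of the hat matrix, so they sum to the number of fitted parameters (11 here) and each lies between 0 and 1.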

Results and Discussions
Multiple regression analysis. MLR analysis was used to derive a QSPR model. The data set was randomly divided into a training set and a test set: 147 drugs were selected as the training set for modeling, and 48 drugs were chosen as the prediction set and used for external validation of the MLR model. Using the MLR method, a linear model was obtained in which the molecular descriptors were the independent variables. Table 2 lists the descriptors, their coefficients and the model parameters.
Here, n is the number of compounds used for regression, R² is the squared correlation coefficient, R²adj is the adjusted squared correlation coefficient, s is the standard error of the regression and F is the Fisher ratio for regression.

In the Williams plot (Fig. 3), Sulfasalazine in the test set lies to the right of the vertical line, which indicates that it has a high leverage value (h > h* = 0.224) and a low standardized residual; it belongs to the model AD. Doxorubicin in the training set also lies to the right of the vertical line, which indicates that it has a high leverage value (h > h* = 0.224) and a low standardized residual. These chemicals with high leverage have a stronger influence on the model than other chemicals, and they are influential. In the standardized residuals plot, Enalaprilat in the training set and Phe-Phe in the test set have standardized residuals outside the (−3; +3) range, which confirms that there are two outliers. Furthermore, there is no clear pattern in the residuals, so nothing seems to be wrong with the model. The fitting criteria, internal validation criteria and external validation criteria are shown in Table 3.

Interpretation of descriptors
SKMostHydrophobic Area, SAHydrophobic Area and SKAverage. SKMostHydrophobic Area is the most hydrophobic value on the van der Waals (vdW) surface. The van der Waals surface of a molecule is the surface defined by the hard cutoffs of the van der Waals radii of the individual atoms, and it represents a surface through which the molecule might be conceived as interacting with other molecules. Hydrophobic (water-hating) materials respond to water in the opposite way to hydrophilic materials: they have little or no tendency to absorb water, and water tends to bead on their surfaces. Hydrophobic materials possess low surface tension values and lack active groups in their surface chemistry for the formation of hydrogen bonds with water. Hydrophobicity is very important for the solubility of drugs. Drugs that are extremely hydrophobic are poorly absorbed, because they are totally insoluble in aqueous body fluids and therefore cannot gain access to the surface of cells. For a drug to be readily absorbed, it must be largely hydrophobic, yet have some solubility in aqueous solutions. This is one reason why many drugs are weak acids or weak bases. Some drugs are highly lipid-soluble, and they are transported in the aqueous solutions of the body on carrier proteins such as albumin. The results indicate that logPo/w increases as SKMostHydrophobic Area increases. SAHydrophobic Area is a van der Waals surface descriptor giving the hydrophobic surface area. The lipid solubility of a compound is of special importance to drug discovery and development, because it is directly related to the ability of a drug candidate to cross biological membranes. The requirement is that drug molecules must be soluble enough in lipid to get into membranes, but not so soluble that they become trapped in them. These membranes are not exclusively anhydrous fatty or oily structures. As a first approximation, membranes can be considered bilayers composed of lipids consisting of a polar cap and a large hydrophobic tail. Phosphoglycerides are major components of lipid bilayers. Other groups of bifunctional lipids include the sphingomyelins, galactocerebrosides, and plasmalogens. The hydrophobic portion is composed largely of unsaturated fatty acids, mostly with cis double bonds. In addition, there are considerable amounts of cholesterol esters, protein, and charged mucopolysaccharides in the lipid membranes. The final result is that these membranes are highly organized structures containing channels for the transport of important molecules such as metabolites, chemical regulators (hormones), amino acids, glucose, and fatty acids into the cell, and for the removal of waste products and biochemically produced products from the cell. Apparently, increasing the SAHydrophobic Area increases logPo/w. SKAverage is the average hydrophobicity function value. According to the Supplementary information, some molecules have a positive hydrophobicity function value and others a negative one. If a compound is more soluble in the non-polar phase than in the polar phase, its average hydrophobicity function value is higher. Thus, increasing SKAverage increases logPo/w. SKMostHydrophobic Area, SAHydrophobic Area and SKAverage are calculated by the SlogP method [44], which uses a new atom type classification system for atom-based calculation of logPo/w.

Table 4. The list of α values of atoms commonly occurring in organic compounds.

XKAverageHydrophobicity. XKAverageHydrophobicity is the average hydrophobic value on the van der Waals (vdW) surface. This descriptor is calculated by the XlogP method [45]. In this method, the atoms are classified by their hybridization states and their neighboring atoms. XlogP is based on the summation of atomic contributions and includes correction factors for some intramolecular interactions. logPo/w increases as XKAverageHydrophobicity increases.

PSA, Polar Surface Area Excluding P & S and Average Potential. The polar surface area (PSA) of a molecule is defined as the sum of the contributions to the molecular surface area of polar atoms such as oxygen, nitrogen and their attached hydrogens. This parameter is easy to understand and, most importantly, correlates well with experimental transport data. PSA correlates with passive molecular transport through membranes, which allows prediction of human intestinal absorption, Caco-2 monolayer permeability, and blood-brain barrier penetration. Molecules with a polar surface area greater than 140 Å² tend to be poor at permeating cell membranes. For molecules to penetrate the blood-brain barrier, a PSA of less than 90 Å² is usually needed. In a newer approach by Ertl [46], PSA is calculated by summing tabulated surface contributions of polar fragments. logPo/w decreases as PSA increases. Polar Surface Area Excluding P & S is the total polar surface area excluding phosphorus and sulfur. According to Table 2, this descriptor has a positive coefficient. This suggests that molecules containing S and P tend to dissolve in the polar phase, whereas molecules containing other atoms tend to dissolve in the non-polar phase. Thus, the presence of S and P atoms in a molecule does not favor lipophilicity. logPo/w increases as Polar Surface Area Excluding P & S increases. Average Potential is the average of the total electrostatic potential on the van der Waals surface of the molecule. According to Table 2, logPo/w decreases as Average Potential increases.

4PathCount, ChiV6chain and AlphaR. 4PathCount is the total number of fragments of fourth order (four-bond paths) in a compound. It describes the connectivity of the atoms within the molecule and also reflects its branching and its flexibility or rigidity. In fact, lipophilicity decreases with branching.
This is because branching of the chain makes the molecule more compact and thereby decreases its surface area. Thus, more branching reduces the size of the molecule, making it harder to solvate in the non-polar phase. As a result, the lipophilicity of the normal (unbranched) isomers is in all instances higher than that of the branched compounds. According to Table 2, 4PathCount has a negative coefficient, which indicates that logPo/w decreases as this descriptor increases. ChiV6chain is the atomic valence connectivity index for six-membered rings. This descriptor indicates the importance of molecular bulk for lipophilicity. Lipophilicity increases with molecular bulk because large molecules are better solvated in a non-polar phase such as n-octanol. This descriptor is calculated from the molecular graph. Apparently, increasing ChiV6chain increases logPo/w. AlphaR is the sum of the α values of all non-hydrogen atoms in a reference alkane. The reference alkane is the molecular graph obtained when all heteroatoms are replaced by carbon and all multiple bonds are replaced by single bonds. The parameter α is related to the size of an atom, and the term ∑α is a measure of molecular bulk. When ∑α is compared with that of the corresponding reference alkane, a measure of the heteroatom count and size of a molecule can be obtained. The parameter α is calculated as:

$$\alpha = \frac{Z - Z^v}{Z^v} \cdot \frac{1}{PN - 1} \qquad (18)$$

where Z and Z^v are the atomic number and the number of valence electrons, respectively, and PN is the period number. The hydrogen atom is taken as the reference, and α for hydrogen is taken to be zero. Table 4 shows the α values of different atoms. According to Table 2, the coefficient of AlphaR is negative. These results indicate that the electronegativity of the atoms must be considered. Molecules that contain atoms such as Cl, Br, S and P have a higher α, and thus a larger size and electronegativity. As a result, more electronegative molecules are better solvated in the aqueous phase [47]. Thus, logPo/w decreases as AlphaR increases.
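The α values and the AlphaR term can be illustrated with a short calculation. This sketch assumes the usual extended topochemical atom definition, α = [(Z − Zv)/Zv] · [1/(PN − 1)], with α = 0 for hydrogen; the atom table, the function names and the chlorobenzene example are illustrative choices and are not taken from Table 4.

```python
# (atomic number Z, valence electrons Zv, period number PN) for some common atoms
ATOMS = {
    "C": (6, 4, 2), "N": (7, 5, 2), "O": (8, 6, 2), "F": (9, 7, 2),
    "P": (15, 5, 3), "S": (16, 6, 3), "Cl": (17, 7, 3), "Br": (35, 7, 4),
}

def alpha(symbol):
    """alpha = (Z - Zv) / Zv * 1 / (PN - 1); hydrogen is the reference (alpha = 0)."""
    if symbol == "H":
        return 0.0
    z, zv, pn = ATOMS[symbol]
    return (z - zv) / zv / (pn - 1)

def sum_alpha(heavy_atoms):
    """Sum of alpha over the non-hydrogen atoms of a molecule."""
    return sum(alpha(a) for a in heavy_atoms)

def alpha_r(heavy_atoms):
    """Same sum for the reference alkane: every heavy atom is replaced by carbon."""
    return sum(alpha("C") for _ in heavy_atoms)

# Chlorobenzene (C6H5Cl): six carbons and one chlorine as heavy atoms.
chlorobenzene = ["C"] * 6 + ["Cl"]
print(alpha("C"), alpha("Cl"), sum_alpha(chlorobenzene), alpha_r(chlorobenzene))
```

Because every heavy atom of the reference alkane is a carbon with α = 0.5, AlphaR in this sketch reduces to half the number of non-hydrogen atoms, which is why it acts as a measure of molecular bulk.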

Conclusion
In this work, MLR was used to construct a linear QSPR model to predict the logPo/w of a wide and homogeneous set of aromatic drugs. The MLR method was able to model the relationship between logPo/w and the descriptors, and the GA/MLR method was applied for descriptor selection. The results show that GA/MLR is a very effective descriptor selection approach for QSPR analysis. Internal and external validation indicate that the goodness of fit, robustness and predictive ability of the MLR model are high. From the model validation, it can be concluded that the presented model is valid and can be used effectively to predict logPo/w. Moreover, the mechanism of the model was interpreted and its applicability domain was defined.