Correlation between the structure and skin permeability of compounds

A three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set of 274 compounds by applying a support vector machine (SVM) together with a genetic algorithm. The optimal SVM model has a coefficient of determination R² of 0.946 and a root mean square (rms) error of 0.253 for the training set of 139 compounds, and an R² of 0.872 and an rms error of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance while handling more samples in the test set. Applying an SVM algorithm to develop a nonlinear QSAR model for skin permeability was therefore successful.

Modeling the penetration of manmade and naturally derived chemicals through human skin is of great importance for the pharmaceutical and cosmetic industries, as well as for toxicology and the risk assessment of environmental and occupational hazards. Determining the skin permeability of chemicals experimentally is very time-consuming and expensive. Further, there are many ethical challenges associated with human and animal testing for the assessment of skin permeability [1,2].
Quantitative structure-activity/toxicity relationship (QSAR/QSTR) models [3-6] can be used to predict the physicochemical properties of compounds, even those that have not yet been synthesized. Several researchers have carried out QSAR studies on the skin permeability of chemicals, expressed as the logarithm of the skin permeability coefficient, log Kp.
Patel et al. developed QSAR models for the skin permeability of 158 chemicals with multiple linear regression (MLR) analysis [7]. The model based on four descriptors fits the data very well, with a coefficient of determination R² of 0.90. Fujiwara et al. proposed MLR QSARs for the skin permeability of 94 structurally diverse compounds [8]. The models obtained from ten skin permeability data sets have high R² values, with an average R² of 0.815. Magnusson et al. introduced a regression model (R² = 0.760) for the skin permeability of 269 compounds [9]. They found that molecular weight was the main determinant of log Kp and that the QSAR model could be improved by adding other descriptors such as melting point and hydrogen-bond acceptor capability. Chauhan and Shakya built a QSAR model for skin permeability from a training set of 150 compounds through partial least-squares regression [10]. The model, with an R² of 0.936 for the training set, was validated with a test set of 53 compounds; the root mean square (rms) error and R² for the test set were 0.670 and 0.542, respectively. Xu et al. proposed an expanded version of a linear free-energy relationship model for the skin permeability of complex chemical mixtures [11]. The model (R² = 0.70) showed a better fit and predictive power than the simple model (R² = 0.21). Chen et al. generated an MLR model for skin permeability with four molecular descriptors [12]. The model has an R² of 0.858 for the training set (85 compounds) and 0.839 for the test set (21 compounds), values that are accurate and acceptable. All of the QSAR models cited above were obtained with linear techniques.
Generally, nonlinear QSAR models show better statistical performance than linear QSAR models because of the nonlinear correlation between molecular physicochemical properties and structure descriptors. Neely et al. constructed a nonlinear artificial neural network (ANN) model for the skin permeability of 160 molecular structures [13]. The ANN model (10-3-7-1), based on ten descriptors and two hidden layers, had an absolute-average percentage deviation, rms error, and R of 8.0%, 0.34, and 0.93, respectively. Khajeh and Modarress introduced a novel nonlinear QSAR model for the skin permeability of 283 compounds with a hybrid of ANN and a fuzzy inference system, the adaptive neuro-fuzzy inference system (ANFIS) [14] […] 0.890, respectively. The model has good predictive ability, although nine compounds appear in duplicate in the data set. The ANN algorithm can easily fall into a local minimum and suffers from slow convergence [15]. The support vector machine (SVM) algorithm is based on the principle of structural risk minimization. SVMs can effectively avoid local optima and have distinct advantages in solving practical problems involving limited training samples and high-dimensional, nonlinear data. The aim of this study was to develop a nonlinear SVM QSAR model for the skin permeability of a sufficiently large data set of 274 compounds.
ChemDraw Ultra 8.0 in ChemOffice 2004 was used to draw the structures of the 274 compounds, which were converted into three-dimensional structures with Chem3D Ultra 8.0 and optimized with the semi-empirical AM1 method in MOPAC. Dragon 6.0 [17] was used to calculate 4885 molecular descriptors for each compound. After descriptors that were constant or that had pairwise correlation coefficients above 0.90 were deleted, 1820 descriptors (including Neoplastic-80) remained for descriptor selection. Stepwise MLR analysis in IBM SPSS Statistics 19 was performed to select the optimal subset of descriptors and to develop MLR models.
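The descriptor pre-filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the greedy rule (keep the first descriptor of a correlated pair), the function names, and the toy `demo` table are all assumptions, since the text only states that constant descriptors and descriptors with correlation coefficients above 0.90 were removed.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def filter_descriptors(table, r_max=0.90):
    """Drop constant columns, then drop any column whose |r| with an
    already-kept column exceeds r_max (greedy, order-dependent: an assumption).
    `table` maps a descriptor name to its list of values over all compounds."""
    kept = []
    for name, values in table.items():
        if max(values) == min(values):
            continue  # constant descriptor carries no information
        if any(abs(pearson(values, table[k])) > r_max for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept

# Hypothetical toy data: four descriptors over four compounds.
demo = {
    "d_const": [1.0, 1.0, 1.0, 1.0],
    "d_a": [0.1, 0.4, 0.35, 0.8],
    "d_a_scaled": [0.2, 0.8, 0.7, 1.6],  # perfectly correlated with d_a
    "d_b": [3.0, 1.0, 4.0, 1.5],
}
print(filter_descriptors(demo))
```

Which member of a highly correlated pair is retained affects the final descriptor pool, so a real workflow would need to fix that choice explicitly.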
For nonlinear regression, SVM algorithms map the input variables into a high-dimensional feature space, in which linear regression analysis is carried out [18,19]. For sample data (y_1, x_1), …, (y_l, x_l), with x ∈ R^n and y ∈ R, the optimal regression function is obtained by solving a constrained minimization problem, subject to Eqs. (3-4). In SVM regression, the ε-insensitive loss function is employed to minimize the training error. By applying a kernel function k(x, y), Eq. (6) can be expressed in kernel form. The Gaussian radial basis function (RBF) was used as the kernel in this work.
The prediction performance of an SVM model depends strongly on the parameters C and γ. Both C and γ were optimized with the genetic algorithm. In this study, the models were developed with the LibSVM toolbox [20] running on the Matlab platform; the toolbox can be downloaded freely from https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
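For orientation, the textbook ε-insensitive support vector regression formulation with a Gaussian RBF kernel is sketched below. It is the standard form found in the SVM literature and is assumed, not confirmed, to match the equations used in this work. The symbols w, b, ξ, ξ*, α and α* are the conventional ones, and the equation numbers are chosen only to line up with the in-text references. The kernel arguments are written x and x′ here so that y can stay reserved for the response.

```latex
\begin{align}
f(x) &= \langle w, x \rangle + b \tag{1}\\
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad
  & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{l} (\xi_i + \xi_i^{*}) \tag{2}\\
\text{subject to} \quad
  & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \qquad
    \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^{*} \tag{3}\\
  & \xi_i \ge 0, \qquad \xi_i^{*} \ge 0, \qquad i = 1, \ldots, l \tag{4}\\
L_{\varepsilon}\bigl(y, f(x)\bigr) &= \max\bigl(0,\; \lvert y - f(x) \rvert - \varepsilon\bigr) \tag{5}\\
f(x) &= \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*})\, \langle x_i, x \rangle + b \tag{6}\\
f(x) &= \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*})\, k(x_i, x) + b \tag{7}\\
k(x, x') &= \exp\bigl(-\gamma \lVert x - x' \rVert^{2}\bigr) \tag{8}
\end{align}
```

In this form C sets the penalty on deviations larger than ε and γ sets the width of the RBF kernel, which is why both have a strong effect on prediction performance.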

Results and discussion
After stepwise MLR analysis in IBM SPSS Statistics 19 of the skin permeability log Kp of the 274 compounds against the 1820 descriptors, a three-descriptor QSAR model was obtained, comprising AlogP, X3v, and Neoplastic-80.
The Ghose-Crippen-Viswanadhan octanol-water partition coefficient (AlogP) is based on the AlogP model [21] and is calculated by:
$$\mathrm{AlogP} = \sum_i n_i a_i \qquad (9)$$
where n_i is the number of atoms of type i and a_i is the corresponding hydrophobicity constant. Previous work has shown that AlogP is positively correlated with skin permeability log Kp. In this work, the descriptor was converted to a new descriptor, cos²[(4.31 + AlogP)/8.66]. A regression analysis of cos²[(4.31 + AlogP)/8.66] against the skin permeability log Kp of the 274 compounds gave regression Eq. (10) and its statistical parameters, where n is the number of samples in the training set, R² is the coefficient of determination, R²adj is the adjusted R², se is the standard error of the estimate, and F is the Fisher ratio. Figure 1 shows the correlation between cos²[(4.31 + AlogP)/8.66] and log Kp. The descriptor cos²[(4.31 + AlogP)/8.66] (or AlogP) describes the hydrophobic character of a compound and is related to log Kp.
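The transformed hydrophobicity descriptor can be evaluated as in the sketch below. Two points are assumptions: that the expression denotes the squared cosine, and that its argument is in radians. The function name and the sample AlogP values are purely illustrative.

```python
import math

def transformed_alogp(alogp, shift=4.31, scale=8.66):
    """Return cos^2[(shift + AlogP) / scale], with the argument in radians (assumed)."""
    return math.cos((shift + alogp) / scale) ** 2

# Illustrative AlogP values only; they are not taken from the data set.
for value in (-4.31, 0.0, 4.35):
    print(value, round(transformed_alogp(value), 3))
```

Under these assumptions the transform decreases monotonically while (4.31 + AlogP)/8.66 stays between 0 and π/2, so it preserves the ordering of compounds by AlogP while changing the shape of the relationship with log Kp.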
Connectivity indices are widely used in QSARs. They are based on the H-depleted molecular graph, whose vertices correspond to non-hydrogen atoms, and are correlated with the number of connected non-hydrogen atoms [17]. In the general formula for calculating connectivity indices, n is the number of vertices; k is an integer ranging from 0 to 5, denoting the total number of kth order paths present in the molecular graph; and δ is the vertex degree. Valence connectivity indices (Xkv) account for the presence of heteroatoms in the molecule, as well as of double and triple bonds, by replacing the vertex degree with the valence vertex degree. The valence connectivity index of order 3, X3v, describes molecular size and shape.
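A path-type connectivity index of a given order can be sketched on an H-depleted graph as follows. This follows the usual Kier-Hall definition (sum over paths of the inverse square root of the product of vertex degrees) and is not Dragon's implementation. The function name and the adjacency-list encoding are assumptions. Supplying valence vertex degrees in `delta` would give the valence variant; for saturated hydrocarbons the two sets of degrees coincide.

```python
def path_connectivity_index(adjacency, delta, order):
    """Sum, over all simple paths with `order` edges, of
    (product of delta over the vertices of the path) ** -0.5."""
    seen = set()
    total = 0.0

    def extend(path):
        nonlocal total
        if len(path) == order + 1:
            # A path and its reverse are the same path: count it once.
            key = min(tuple(path), tuple(reversed(path)))
            if key not in seen:
                seen.add(key)
                product = 1.0
                for vertex in path:
                    product *= delta[vertex]
                total += product ** -0.5
            return
        for neighbour in adjacency[path[-1]]:
            if neighbour not in path:
                extend(path + [neighbour])

    for start in adjacency:
        extend([start])
    return total

# n-Butane carbon skeleton: C0-C1-C2-C3
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
butane_delta = {v: len(nbrs) for v, nbrs in butane.items()}

# 2-Methylbutane carbon skeleton: C0-C1(-C4)-C2-C3
isopentane = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
isopentane_delta = {v: len(nbrs) for v, nbrs in isopentane.items()}

print(path_connectivity_index(butane, butane_delta, 3))  # 0.5
print(path_connectivity_index(isopentane, isopentane_delta, 3))
```

n-Butane has a single three-bond path with degrees 1, 2, 2, 1, giving (1·2·2·1)^(-1/2) = 0.5. The branched isomer has two such paths, which illustrates how the index responds to molecular shape as well as size.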
By correlating log Kp with the two descriptors, cos²[(4.31 + AlogP)/8.66] and X3v, we obtained regression Eq. (12). Compared with Eq. (10), the quality of Eq. (12) improved noticeably when the descriptor X3v was added. Figure 2 shows the correlation between the experimental log Kp and the values calculated with Eq. (12). As illustrated in Fig. 2, two samples, ouabain (No. 5 in Table S1) and fluocinonide (No. 11), had larger prediction errors for log Kp. Thus, more molecular descriptors should be added.
The descriptor Neoplastic-80, the Ghose-Viswanadhan-Wendoloski antineoplastic-like index at the qualifying range that covers approximately 80% of the drugs studied, depends on AlogP and reflects molecular polarity and hydrophobicity [17]. For a molecule that has a benzene ring, heterocyclic ring, aliphatic amine, carboxamide group, alcoholic hydroxyl group, carboxy ester and/or keto group, Neoplastic-80 equals 1 when its AlogP value is in the range of −1.5 to 4.7, its molar refractivity is 43-128, its molecular weight is 180-470, and its total number of atoms is 21-63; otherwise Neoplastic-80 equals zero. A molecule with a larger Neoplastic-80 might have a smaller log Kp value. Regression analysis between the log Kp of the 274 compounds and the three descriptors stated above resulted in Eq. (13). The correlation coefficient R of 0.945 in Eq. (13) was slightly higher than the 0.942 of the model 13. Moreover, Eq. (13) accurately predicts the skin permeability log Kp of the compounds, including the two samples (Nos. 5 and 11 in Table S1 in "Supplementary Materials") stated above: Fig. 3 shows no samples with obviously larger errors. When AlogP itself, together with X3v and Neoplastic-80, was used directly to develop the MLR model, the correlation coefficient R was only 0.939, lower than the 0.945 of Eq. (13). Thus the three descriptors cos²[(4.31 + AlogP)/8.66], X3v, and Neoplastic-80, shown in Table S1 in "Supplementary Materials", were used to develop QSAR models.
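The qualifying rule for Neoplastic-80 stated above can be written as a small predicate. This is a sketch of the rule as described in the text, not Dragon's code. Treating the range limits as inclusive, and passing the presence of a qualifying functional group as a precomputed boolean, are assumptions, and the sample property values are hypothetical.

```python
def neoplastic_80(alogp, molar_refractivity, molecular_weight, n_atoms,
                  has_qualifying_group):
    """Return 1 if the molecule meets every Neoplastic-80 condition, else 0.

    `has_qualifying_group` should be True when the molecule contains at least
    one of the listed groups (benzene ring, heterocyclic ring, aliphatic amine,
    carboxamide, alcoholic hydroxyl, carboxy ester, keto group).
    Range limits are treated as inclusive, which is an assumption.
    """
    in_range = (
        -1.5 <= alogp <= 4.7
        and 43 <= molar_refractivity <= 128
        and 180 <= molecular_weight <= 470
        and 21 <= n_atoms <= 63
    )
    return 1 if (has_qualifying_group and in_range) else 0

# Hypothetical property values, for illustration only.
print(neoplastic_80(2.0, 80.0, 300.0, 40, True))   # all conditions met
print(neoplastic_80(5.2, 80.0, 300.0, 40, True))   # AlogP outside the range
```

Because the index is binary, it acts in the regression as an offset applied to drug-like molecules and not as a continuous property.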
A correlation analysis between the skin permeability log Kp of the 139 compounds in the training set and the three descriptors resulted in Eq. (14) (i.e., the MLR model). The characteristics of the molecular descriptors in the MLR model are listed in Table 1. As can be observed in Table 1, the three descriptors, cos²[(4.31 + AlogP)/8.66], X3v, and Neoplastic-80, were all significant and contributed to log Kp, because their significance values (P values) are less than 0.05. In addition, their variance inflation factors (VIF) were far less than ten, suggesting that the three descriptors describe different structural factors affecting skin permeability log Kp. The t-test can be used to measure the significance of a descriptor's contribution to molecular physicochemical properties: the higher the absolute value of the t statistic, the greater the significance of the descriptor. According to the t-test values in Table 1, the absolute values increased in the sequence Neoplastic-80, X3v, and cos²[(4.31 + AlogP)/8.66], and the significance of the descriptors increased in the same sequence.
The MLR model was further used to predict the skin permeability log Kp of the 135 compounds in the test set. The correlation coefficient R of the test set was 0.928. The rms errors for the training set, test set and total set were 0.343, 0.302, and 0.323, respectively. The predicted log Kp values are illustrated in Fig. 4 and listed in Table S1 in "Supplementary Materials".
The three molecular descriptors used in Eq. (14) were used as input variables to develop SVM models for skin permeability log Kp from the training set of 139 compounds, by applying the LibSVM toolbox on the MATLAB R2014a software platform. A genetic algorithm was adopted to optimize the SVM parameters C and γ under the […] Table S1 in "Supplementary Materials" and illustrated in Fig. 5. The coefficient of determination R² and rms error for the training set of 139 compounds were 0.946 and 0.253, respectively; those for the test set of 135 compounds were 0.872 and 0.302, respectively; and those for the total set were 0.925 and 0.270, respectively. The rms errors from the SVM model for the training set and total set (0.253 and 0.270) were lower than those of Eq. (14) (the MLR model; 0.343 and 0.323), while the test-set rms error was the same for both models (0.302). This indicates non-linear relationships between the skin permeability log Kp and the molecular descriptors used.
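The two statistics used to compare the models can be computed as in the sketch below. The exact definitions used in this work are not stated, so both are assumptions: the rms error is taken as the square root of the mean squared residual (denominator n), and R² as the squared Pearson correlation between observed and predicted values. The toy log Kp values are hypothetical.

```python
import math

def rms_error(observed, predicted):
    """Root mean square error with denominator n (assumed definition)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Squared Pearson correlation of observed vs. predicted (assumed definition)."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    return sop ** 2 / (soo * spp)

# Hypothetical observed and predicted log Kp values.
obs = [-2.0, -1.5, -3.0, -2.5]
pred = [-2.1, -1.4, -2.8, -2.7]
print(round(rms_error(obs, pred), 3), round(r_squared(obs, pred), 3))
```

Evaluating both functions separately on the training, test and total sets gives the kind of side-by-side comparison reported for the MLR and SVM models.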
The SVM model was further evaluated with the criteria of Golbraikh and Tropsha [22], in which q²ext is the external correlation coefficient; R₀² and R₀′² are the determination coefficients of the predicted vs. the observed values and of the observed vs. the predicted values, respectively; k and k′ are the slopes of the regression lines of the predicted vs. the observed values and of the observed vs. the predicted values; ȳ_train is the average value of the training set; y_i and ŷ_i are the observed and the predicted activities, respectively; y_i^r0 = k ŷ_i and ŷ_i^r0 = k′ y_i. Our SVM model satisfied the validation criteria [22,23].
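For reference, the external validation conditions commonly attributed to Golbraikh and Tropsha are usually quoted in the form below. The thresholds are the values typically cited in the QSAR literature and are assumed, not confirmed, to be those applied here. The sums run over the test set, with ŷ_i the predicted and y_i the observed values.

```latex
\begin{align}
q^{2}_{\mathrm{ext}} &= 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^{2}}
                                 {\sum_{i} (y_i - \bar{y}_{\mathrm{train}})^{2}} > 0.5 \\
R^{2} &> 0.6 \\
\frac{R^{2} - R_{0}^{2}}{R^{2}} &< 0.1
  \quad \text{or} \quad
  \frac{R^{2} - R_{0}'^{2}}{R^{2}} < 0.1 \\
0.85 \le k &\le 1.15
  \quad \text{or} \quad
  0.85 \le k' \le 1.15 \\
k &= \frac{\sum_{i} y_i \hat{y}_i}{\sum_{i} \hat{y}_i^{2}}, \qquad
k' = \frac{\sum_{i} y_i \hat{y}_i}{\sum_{i} y_i^{2}}
\end{align}
```

The slopes k and k′ are those of the regression lines through the origin, which is why they are expected to lie close to 1 for a model without systematic bias.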
The coefficient of determination R² (= 0.946) in this study is higher than the R² values of 0.90 [7], 0.815 [8], 0.760 [9], 0.936 [10], 0.70 [11], and 0.858 [12], and than the value of 0.93 (reported as R) [13]. In addition, the rms errors of the training set, test set and total set from the ANFIS model of Khajeh and Modarress, which dealt with 283 samples, were 0.318, 0.308, and 0.316, respectively [14], which are greater than the rms errors (0.253, 0.302, and 0.270, respectively) from our SVM model. Compared with other models reported in the literature [9-14], our SVM model shows better statistical performance while handling more samples in the test set.

Conclusions
A three-descriptor SVM model with SVM parameters C of 7.2906 and γ of 1.7200 was successfully built for the skin permeability log Kp of a sufficiently large data set of 274 compounds, by means of a genetic algorithm. The SVM model has rms errors of 0.253 for the training set (139 compounds), 0.302 for the test set (135 compounds), and 0.270 for the total set (274 compounds). Compared with other QSARs for skin permeability log Kp reported in the literature, our SVM model shows better statistical performance while handling more samples in the test set. There were non-linear relationships between the skin permeability log Kp and the molecular descriptors used. It was therefore reasonable to apply an SVM algorithm to develop a nonlinear QSAR model for skin permeability.

Data availability
All data generated or analysed during this study are included in this published article (and its "Supplementary Information" files).