Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources

A generalized toxicity classification model for 7 different oxide nanomaterials is presented in this study. A data set extracted from multiple literature sources and screened by physicochemical property based quality scores were used for model development. Moreover, a few more preprocessing techniques, such as synthetic minority over-sampling technique, were applied to address the imbalanced class problem in the data set. Then, classification models using four different algorithms, such as generalized linear model, support vector machine, random forest, and neural network, were developed and their performances were compared to find the best performing preprocessing methods as well as algorithms. The neural network model built using the balanced data set was identified as the model with best predictive performance, while applicability domain was defined using k-nearest neighbours algorithm. The analysis of relative attribute importance for the built neural network model identified dose, formation enthalpy, exposure time, and hydrodynamic size as the four most important attributes. As the presented model can predict the toxicity of the nanomaterials in consideration of various experimental conditions, it has the advantage of having a broader and more general applicability domain than the existing quantitative structure-activity relationship model.

Developing the generalized toxicity prediction model requires a standardized database containing comprehensive toxicity data of nanomaterials obtained using international protocols and good laboratory practice (GLP). In addtion, the quality and completeness of the toxicity data must be assessed and validated [18][19][20][21] .
Under such circumstances, Safe and Sustainable Nanotechnology (S2NANO), a research group in the Republic of Korea, built a S2NANO database (www.s2nano.org) including various experimental results related to nanomaterials obtained from different sources. The quality and completeness of collected data in the S2NANO database are assessed and validated using a physicochemical(PChem) score screening and nano-specific data gap filling method proposed by S2NANO 22 . The PChem score screening system evaluates the quality of physicochemical data while the nano-specific data gap filling method replaces missing values with manufacturer's specifications and/or estimations.
For development of general models applicable to nanomaterial toxicity prediction, there has been an initial sign for developing QSAR models using quasi-SMILEs representing all conditions related to physicochemical properties and biological profiles such as toxic assay method, cell line, and so on [23][24][25][26] . The values of the physicochemical properties and biological profiles are coded to quasi-SMILES expressed in a simple syntactic sequence (i.e., character string). Quasi-QSAR model developed using the quasi-SMILES presented more generalized QSAR model than existing one. However, as the quasi-SMILES is made up using limited characters, it's difficult efficiently to assign characters when the data contains many records and a wide range of data.
The solution is to use the appropriate pre-processing techniques to transform the comprehensive toxicity data into a numeric input metric suitable for generalized toxicity prediction models.
The object of this study is to develop a generalized toxicity prediction model for oxide nanomaterials using the S2NANO database, which consists of various experimental results for the nanomaterials. The model development methodology and results of the model development and validation were presented. In addition, effects of a few more preprocessing techniques were described in this paper.

Materials and Methods
Model development workflow. Figure 1 shows the workflow for the development of generalized QSAR models that are able to determine toxicity based on different biological conditions. A data set comprised of 574 observations was used in developing the models. Various methods were carried out in the preprocessing step and univariate analysis was conducted to identify the key differences between the toxic class and nonToxic class. After that, the data set was used for model development and validation. In addition, reliability of the developed model was validated using toxicity data obtained from the laboratory experiment conducted by S2NANO group. Four modelling algorithms, including the generalized linear model (GLM), support vector machine (SVM), random forest (RF), and neural network (NNET), were used for building the models, and their performance was evaluated via measures based on the confusion matrix. Analysis of relative optimal descriptors (attributes) was conducted. Finally, the k-nearest neighbors (kNN)-based applicability domain was defined in order to ensure the reliable prediction of the model which showed the best performance. Experimental Data. The   The PChem score representing data quality level is criteria which revised and expanded from existing evaluation criteria 18,19 for assessing the quality of published experimental data on nanomaterials. The existing evaluation criteria include assessment as to whether or not toxicological data were obtained using international protocols (i.e., EU, EPA, FDA, OECD, etc.). In addition, the criteria evaluation method considers GLP. If data was generated in a laboratory that used GLP principles then, the quality of data should be better than that of data from a laboratory that was not working according to GLP principles. Using these reference criteria, the PChem scoring method evaluates the data quality taking into account the data source (experiment, manufacture, and article) and data method (the characterization methods for physicochemical properties: e.g., TEM, DLS, BET, etc.). The detailed criteria of PChem score are listed in Supplementary Table S1. High-PChem score means that the toxicity data was generated in a laboratory that used GLP principles and the physicochemical prosperities were characterized by widely recognized and acknowledged techniques (TEM, DLS, BET, etc. suggested by the OECD). As good quality input always result in the accurate prediction of properties, high quality data thoroughly evaluated by PChem scoring method was used in this paper.
Cell viability (%) was classified as either the toxic class or nonToxic class: if the cell viability was less than 50%, it was classified as the toxic class; otherwise, it was classified as the nonToxic class. The cell viability classified was used as an endpoint. As the data was collected from different sources, various assay methods, cell names, cell species, cell origins, and cell types were involved as nominal attributes in the data. The values of nominal attributes are listed in Supplementary Table S2.

Results and Discussion
Data preprocessing. Most QSAR models aim to predict endpoints related to nanomaterials under particular experimental conditions; however, the model which was developed in this paper, using the data extracted from S2NANO database, is aimed to predict endpoints for various nanomaterials under diverse experimental conditions. It is possible to predict endpoints under different experiment conditions if the value of each attribute is properly preprocessed according to their data characteristic. The used data consists of numeric attributes (PChem and QM) and nominal attributes (Tox); these attributes were normalized and encoded by taking their characteristic, data type (numeric or nominal), and model performance.
Normalization for numeric attributes. As the measurement unit can affect the performance of the model, the data should be normalized or standardized 27,28 . Normalizing data is one step in addressing data that does not fit the model assumptions and is also used in coercing different variables to have similar distributions.
The numeric attributes in the used data were normalized via min-max, z-score, and log. The log transformation is often used for data which have positive skewness [29][30][31] . The normalization method, which is suitable for the data, was chosen by considering the distribution of each attribute. The skewness for each numeric attribute, a measure of symmetry in a distribution, is listed in Table 2. As attributes with the exception of hydrodynamic size, ΔHsf, and Ev have a right (positive) skewed distribution, those attributes were normalized using a log transformation.
The data of hydrodynamic size and ΔHsf were standardized via z-score because their skewed value is close to zero. As the skewness of Ev and χMeO was not improved after a log transformation, they were normalized by a min-max method. Most skewness for numeric attributes got closer to zero with the exception of Ev and χMeO after normalization. A zero value of skewness means that the tails on both sides of the mean balance out overall.
Additionally, the performance of the model according to different normalization methods including one considering the skewness was compared for each method in order to determine the optimal normalization method for each modelling algorithm. 10 data subsets divided randomly were normalized using min-max, z-score, log10, and combination (min-max, z-score, and log10) methods, and were used for building models. The models built were evaluated using 10-fold cross validation and a confusion matrix (the average value of balanced accuracy for 10 subsets was used). The results confirmed that log transformation is applicable to the GLM and the combination normalization method is suitable for the SVM and NNET algorithms as listed in Table 3. In particular, RF showed roughly similar performance regardless of the normalization method used. One-Hot encoding for nominal attributes. The categorical data must be converted to a numerical form because most prediction algorithms cannot operate on label data directly [32][33][34][35][36][37] ; it may be converted using integer encoding and one-hot encoding. Integer encoding is used when the categorical attribute has a natural ordered relationship between each element, while one-hot encoding is used when the categorical attributes does not have an ordinal relationship. As the categorical attributes such as assay method, cell line, and so on do not have an ordinal relationship, they were encoded into dummy variables using the one-hot encoding. This encoding method allows the model to classify toxicity by considering various experimental conditions.

Data division.
A data set with a high PChem score was divided into 10 subsets randomly because the dangers of using the same data to both select and fit the model have been known for many years 38 . As more data subsets are used, the model performance converges in accordance with the law of large numbers 39 . The 10 subsets were divided into a training set and test set with a ratio of 60:40 for internal validation and external validation, respectively.
Handling class imbalance problem. The class imbalance problem occurs when one of the classes has more samples than the other classes 40 . Most traditional classification algorithms can be limited in their performance on highly unbalanced data [41][42][43][44][45] . As the sample size of the two classes in the data used is highly imbalanced (toxic 16%, nonToxic 84%), SMOTE, which generates synthetic minority examples to over-sample the minority class 46 , was used to address the problem of class imbalance. The models for imbalanced data (ID) and balanced data (BD) were developed and compared in order to examine if the class balancing affects the predictive performance.
Univariate analysis. Univariate analysis (Wilcoxon-Mann-Whitney-test (WMW)) was conducted to examine difference of the distribution between the toxic class and nonToxic class. As the data used are not normally distributed, a WMW test was carried out. WMW, which is used to assess not difference of means but the distribution of two independent groups, is the nonparametric alternative test to the independent sample t-test 47   analysis, the high p-value indicates that there is a significant distribution difference between the two independent groups in a particular attribute. It means that the attribute with the high p-value in the WMW analysis could be considered as an important factor for the group discrimination.
The results of WMW are listed in Table 4. The attributes were sorted in descending order by p-value. ΔHsf has a significant mean difference between classes. In addition, it was confirmed that all attributes have a meaningful mean difference between classes (All p-values < 0.05). The important mean difference was identified in order of the QM attribute, Tox attribute, and PChem attribute.
Model development and validation. Four modelling algorithms, including GLM, SVM, RF, and NNET, were used for building the models, while the performance of models was evaluated by measures based on the confusion matrix.
GLM is extensions of traditional regression models that allow the mean to depend on the explanatory variables through a link function, and the response variable to be any member of a set of distributions called the exponential family 48 . GLM covers widely used statistical models, such as linear regression for normally distributed responses, logistic models for binary data, log-linear models for count data, and so on through its very general model formulation.
SVM is one of the most popular machine learning algorithms that can be employed for both classification and regression purposes 49 . In particular, SVM is more commonly used in classification problems. It performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors.
RF is an ensemble learning method, consists of an arbitrary number of simple trees, which are used to determine the final outcome 50 . For classification problems, the ensemble of simple trees votes for the most popular class.
NNET, usually called neural network, is a mathematical model and commonly used for classification in data science 51 . NNET is typically organized in layers such as input layer, hidden layer, and an output layer. An input pattern is applied to the input layer and its effect propagates, layer by layer, through the network until an output is produced. NNET is trained using optimization techniques like gradient descent with consideration for error between target and the output.
Confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one 52 . Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). Various measures, such as accuracy, sensitivity, specificity, and precision, are derived from the confusion matrix. This matrix is often used to describe the performance of a classification model.
The models were developed using R caret packages 53 . Basic measures such as balanced accuracy, sensitivity, and specificity from the confusion matrix, where toxic class is a positive instance and nontoxic class is a negative instance, were used. Accuracy computed from a confusion matrix is not a reliable measure for the real performance of a classifier, because it will yield misleading results if the data set is unbalanced. Therefore, the balanced accuracy considering both sensitivity and specificity were used.
The training set and test set were used for the model development, internal validation, and external validation. The validation results for 10 splits were averaged. In addition, toxicity data generated from an experiment conducted by S2NANO research group was used for the reliability validation of a developed model which showed the best performance.
Internal validation. The commonly used k-fold cross validation technique is used for evaluating predictive models 54,55 . 10-fold cross validation is commonly used 56 . In particular, this validation method is used to avoid the overfitting problem and to estimate the general performance of model 57,58 . 10-fold cross validation was used for internal validation of models built using imbalanced and balanced data from the training data (60% of total data). The models built using the balanced training data showed better accuracy than the models built using the imbalanced training data. The basic measures of the confusion matrix for internal validation are listed in Table 5.  The best results in terms of the basic measures were obtained from the NNET model built using balanced training data. Accuracy was improved on all models developed using the balanced training data.
External validation. Table 6 lists the results of external validation for the models built using the imbalanced and balanced data from the test data (40% of total data). Most models built using the balanced data showed better balanced accuracy, with the exception of the GLM model. The best results in terms of the basic measures were obtained from the NNET model built using balanced training data, as was the case for internal validation. There was a trade-off between sensitivity and specificity. In general, as the minority samples (toxic class) rarely occur but very important, the classification model should be sensitive to the minority samples than majority them.

Reliability validation.
For reliability validation of the developed model, the toxicity data including 144 rows generated from an experiment conducted by S2NANO research group was used as the validation set. A549 and BEAS-2B cell lines were exposed to four oxide nanoparticles (SiO 2 , ZnO, TiO 2 , and Fe 3 O 4 ) with various physicochemical properties with respect to the core size, hydrodynamic size, and surface charge. The exposure time was 24 hours. The concentration of nanomaterials ranged from 0 to 100 ppm. MTS and CCK-8 assays were used to measure the cell viability of the A549 and BEAS-2B cell lines, with the results expressed as a percentage compared with control samples. A data row was labeled "toxic" if the viability percent was less than 50%; otherwise, it was considered "nonToxic". After labeling, it was confirmed that the number of the toxic and nontoxic class was 14 and 130, respectively.
After the preprocessing step for the toxicity data, the data was used for the reliability validation of the developed NNET model. The validation results are listed in Table 7 (see S2NANO reliability validation data for more information).
The model showed good prediction result. It implies that the developed model could be considered as a more generalized predictive model, which is capable of predicting the toxicity label for the nanomaterials with consideration for the diverse experimental conditions. In the developed model, the ZnO nanomaterials with low ΔHsf value were classified to the toxic class as its dose was increased. The model was more sensitive to the dose change of ZnO in the A549 cell line than

with relatively high
ΔHsf value were classified to nontoxic class. The range of attributes changes according to PChem score. If data set with the high PChem score is used, the range of attributes is reduced. In contrast, the range is extended when dataset with the over medium PChem score is used. Because of this characteristic, lesser/bigger range of attributes was not considered in the models.
It may be possible to estimate the toxicity of the lesser size using extrapolation, the process of estimating, beyond the original observation range. If the discrimination threshold of the developed QNTR models was well adjusted, the extrapolation of the lesser or greater size may be valid.
The toxicity data for the reliability validation of the developed model includes the toxicity results of the lesser hydrodynamic size (69.1 nm and 45.6 nm) of Fe 3 O 4 nanomaterials. After the reliability validation with the toxicity data, it confirmed that the developed model correctly classified a toxic class of the toxicity data. However, the extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results. Therefore, the applicability domain of the model was defined to avoid such risks. Important attribute analysis. Variable (attribute) importance can be relatively measured and quantified based on information obtained from the models. The advantage of measuring the importance based on built model information is that it is more closely tied to the model performance and it may be able to incorporate the correlation structure between the predictors into the importance calculation 59 . The analysis of relative attribute importance for the built NNET model was carried out using a varimp function of caret package supported in R software 60 . The importance is measured based on weights between layers in the NNET model.
The results of important attribute analysis are presented in Table 8. Dose, ΔHsf, Exposure time, and Hydrodynamic size were relatively identified as important attributes when compared to the other attributes; this means that they acted as important attributes in classifying materials as either toxic or nonToxic. This result is similar to the previous result of MWM analysis.
Dose and ΔHsf were identified as important attributes in both analyses. That is, they play an important role in determining toxic label. In contrast, the exposure time and hydrodynamic size, that their importance was lower in the Univariate analysis, were considered as an important factor in the important attribute analysis based on information of the developed model. The toxicity data set for various nanomaterials under diverse experimental conditions were used in the model development. The effects and heterogeneities associated with the diverse experimental conditions were not considered in the Univariate analysis. On the other hand, the weights to be used for computing the relative importance for each attribute are adjusted with consideration for the diverse experimental conditions during the model training process. It indicates that the exposure time and hydrodynamic size are considered as an important attribute in the real situation. This result implies that the important attributes analysis provides more reliable results than the Univariate analysis in case that dataset including diverse environment conditions (e.g., cell line, assay method, etc.).
The importance of attributes were relatively measured based on built model information such as weight adjusted during model training. In the real scenario, the cell origin is an important attribute contributing to the change of the toxicity. However, the relative importance of the cell origin was measured low in the analysis. The result does not mean that the cell origin does not affect the toxicity change. The model cannot precisely predict the toxicity without the descriptive attributes such as cell origin, cell species, cell type, cell name, and assay method. The model predicts the toxicity of nanomaterial using all available information from the whole attributes. The descriptive attributes (nominal attributes) were coded to dummy variables. The dummy variables act like 'switches' that turn various parameter on and off in the developed model. The dummy variable which for some observation has a value of 0 will cause that variable's coefficient to have no role in influencing the dependent variable, while when the dummy takes on a value 1 its coefficient acts to alter the intercept. Although the relative  importance of the descriptive attributes was measured low, the descriptive attributes actually play an important role in the toxicity classification of the model.

Definition of the applicability domain. The Organization for Economic Co-operation and Development
(OECD) has recommended that for the application of validated QSAR models to the prediction of new data points, there is a strict requirement of defining the applicability domain (AD) according to Principle 3 61,62 . The AD is widely understood to express the scope and limitations of a model, i.e. the range of chemical structures for which the model is considered to be applicable 63 . Generally, the training set is used to define the AD with a range-based method, geometric methods, distance-based methods and probability density distribution-based method 64 . In the case of ANN-based classification models, the AD can be defined based on Euclidean distance (ED) metrics 65 .
As the best predictive performance was identified in the NNET model built using the balanced data set, kNN-based AD using ED metrics was defined. A new compound will be predicted by the model [66][67][68] if and only if: where <D k > is the average Euclidian distance between each compound of the training set and its k nearest neighbors in the descriptors space, s k is the standard deviation of the distances between each compound of the training set and its k nearest neighbors in the descriptors space, and Z is an empirical parameter (0.5 by default). For each test compound i, the distance D i is calculated as the average of the distances between i and its k nearest neighbors in the training set. The value of k was chosen as the square-root of the number of training patterns. As 10 NNET models for 10 subsets were built, the AD for each NNET model was defined and the average values for <D k >, s k , and Z were calculated in turn; in particular, the preferable Z value was selected by increasing the Z value from 0.1 to 2.4 in increments of 0.1 and identifying the value with the highest accuracy for the test set. The results are listed in Table 9. The average preferable Z value was identified as 0.77; this means that D i for the new compound should be less than a cutoff value, (<D k > + 0.77 × s k ), so that the new compound will be reliably predicted by the model.

Conclusions
We developed a generalized QSAR model using a dataset with a high PChem score in the S2NANO database, which includes various experimental results. Various preprocessing techniques and modelling algorithms were used for model development. In addition, the analysis of relatively important attributes based on model information was performed. Finally, the kNN-based AD region was set up. As the proposed model can predict the toxicity of the nanomaterials in consideration of various experimental conditions, it has the advantage of having a broader and more general AD than the existing QSAR model. The results of this paper also indicate that preprocessing techniques appropriate to the characteristic of data should be applied for the generalized QSAR model.
The existing QSAR models can only predict the toxicity endpoint of nanomaterials under specific experiment condition. In contrast, the developed model can predict the toxicity endpoint (toxicity class) under the various experimental conditions conducted according to the international protocol and GLP. It allows the nanotechnologists as well as nanomaterial manufacturers, and researcher to obtain various toxicity results through one prediction model. Therefore, it enables the efficient utilization of the developed model.
The model development workflow presented in this paper can be considered a new methodology to develop a generalized QSAR model using a database containing various toxicity experimental results. Several databases exist that are relevant for engineered nanomaterials (ENM) toxicity assessment such as eNano-Mapper, NanoMaterialRegistry, and Nanoparticle Information Library. The development of a generalized model for these databases using this methodology is expected to contribute to the application and utilization of QSAR.  Table 9. Applicability domain of the best model.