A universal methodology for reliably predicting the solubility of non-steroidal anti-inflammatory drugs in supercritical carbon dioxide

Understanding drug solubility behavior is arguably the first requirement for designing supercritical technology for pharmaceutical processing. Therefore, this study uses several machine learning approaches to model the solubility of twelve non-steroidal anti-inflammatory drugs (NSAIDs) in supercritical carbon dioxide (SCCO2). The considered NSAIDs are Fenoprofen, Flurbiprofen, Ibuprofen, Ketoprofen, Loxoprofen, Nabumetone, Naproxen, Nimesulide, Phenylbutazone, Piroxicam, Salicylamide, and Tolmetin. Physical characteristics of the drugs (molecular weight and melting temperature), operating conditions (pressure and temperature), and a solvent property (SCCO2 density) are used to estimate the drug solubility. Comparing the prediction accuracy of twelve intelligent paradigms from three categories (artificial neural networks, support vector regression, and hybrid neuro-fuzzy) shows that the adaptive neuro-fuzzy inference system is the best tool for the considered task. The hybrid optimization strategy adjusts the cluster radius of the subtractive clustering membership function to 0.6111. This model estimates 254 laboratory-measured solubility data with AAPRE = 3.13%, MSE = 2.58 × 10⁻⁹, and R² = 0.99919. The leverage technique confirms that outliers may contaminate less than four percent of the experimental data. In addition, the proposed hybrid paradigm is more reliable than the equations of state and the correlations available in the literature. Experimental measurements, model predictions, and relevancy analyses show that drug solubility in SCCO2 increases with increasing temperature and pressure. The results show that Ibuprofen and Naproxen are the most and least soluble drugs in SCCO2, respectively.

www.nature.com/scientificreports/
Application of and interest in supercritical CO2 (SCCO2) for pharmaceutical processing have increased sharply in recent years 15,22-28. Understanding drug solubility in SCCO2 is central to designing supercritical-based pharmaceutical technology 29. The size 26, shape 26, surface structure 22, morphology 22, and crystallization process 26 of synthesized solid drugs are determined by their solubility in the supercritical fluid. In addition, the economic success of supercritical technology depends heavily on reliable insight into solid (drug) solubility in supercritical solvents 23.
Therefore, some researchers have focused on laboratory measurements of solid drug solubility in supercritical CO2 15,22-28. However, experimental determination of pharmaceutical solubility in SCCO2 is complex, expensive, and time-consuming 23,30. In addition, it is not possible to measure equilibrium solubility over the full range of desired operating conditions 26,30.
Hence, several empirical 31,32 and thermodynamic-based 23,33 correlations have been proposed to calculate solid drug solubility in CO2 at the supercritical state. Traditionally, equations of state are the most widely used thermodynamic-based correlations for predicting the phase equilibria of drug/SCCO2 systems 34-36. Unfortunately, these thermodynamic-based methods have at least one temperature-dependent interaction parameter that must be adjusted appropriately 23. Surprisingly, there is no general thermodynamic-based method for effectively describing the solubility of several solid drugs in SCCO2 23. Furthermore, it has been claimed that equations of state often carry high levels of uncertainty 34 and sometimes fail entirely 35. On the other hand, the available empirical correlations have usually been developed for estimating the solubility of a specific solid drug in supercritical CO2, and it is difficult to determine which correlation is preferable 22.
Non-steroidal anti-inflammatory drugs (NSAIDs) are often prescribed to reduce pain, fever, and inflammation and to prevent blood clots 26. The current research proposes a universal intelligent model to predict the solubility of twelve NSAIDs (Fenoprofen, Flurbiprofen, Ibuprofen, Ketoprofen, Loxoprofen, Nabumetone, Naproxen, Nimesulide, Phenylbutazone, Piroxicam, Salicylamide, and Tolmetin) in SCCO2. To do so, 2150 intelligent paradigms from three different categories (i.e., artificial neural networks, hybrid neuro-fuzzy, and support vector regression) have been constructed and their accuracy compared. The ANFIS model with the subtractive clustering membership function and a cluster radius of 0.6111 gives the most reliable predictions. This straightforward model can accurately predict the solubility of the 12 NSAIDs in supercritical CO2 over wide ranges of operating pressure and temperature. To the best of our knowledge, it is the most generalized approach developed to date for phase equilibria modeling of NSAID/SCCO2 systems.

Material and methods
This section reports the collected drug solubility data, their sources, and the ranges of the experimental measurements. It also concisely introduces the applied machine learning methods.
Experimental data for anti-inflammatory drug solubility in SCCO2. The development and validation stages of all machine learning techniques require an experimentally measured databank for the given problem. Therefore, in the current research, 254 experimental data points on anti-inflammatory drug solubility in supercritical CO2 have been gathered from eight literature sources 15,22-28. A complete description of these experiments, including their ranges of operating pressure and temperature, the observed solubility levels, and the number of available data for each anti-inflammatory drug/SCCO2 system, is given in Table 1. Note that, consistent with the symbols used in Eq. (1), subscripts 1 and 2 denote supercritical carbon dioxide and the anti-inflammatory drug, respectively.
Since the solubility of all anti-inflammatory drugs in supercritical CO2 is to be estimated by a single model, the drugs' inherent characteristics must also be included in the modeling stage. Table 2 shows the molecular weight and melting temperature of the considered anti-inflammatory drugs. It is worth noting that each anti-inflammatory drug has unique values for these properties. Therefore, the molecular weight and melting temperature can be included among the model inputs to differentiate among the drugs.
Although it is possible to extract some features from the experimental database 37 and use them as model inputs, the current research aims to relate anti-inflammatory drug solubility in SCCO2 ($y_2$) to the molecular weight ($Mw_2$), melting temperature ($Tm_2$), operating pressure ($P$), temperature ($T$), and SCCO2 density ($\rho_1$). Eq. (1) states this relationship mathematically.
$$y_2 = f\left(Mw_2,\ Tm_2,\ P,\ T,\ \rho_1\right) \tag{1}$$
Three reliable relevancy analysis approaches, namely Spearman, Pearson, and Kendall, have been used to check whether the selected independent variables are appropriate features for model development 38. These techniques express the relevancy between a dependent-independent variable pair as a coefficient in the range of minus one to plus one 39. Negative coefficients indicate an inverse dependency, positive ones show a direct relationship, and a coefficient of zero indicates no relevancy 39. Figure 1 presents the observed Spearman, Pearson, and Kendall coefficients for the relationship between anti-inflammatory drug solubility in SCCO2 and the selected independent variables. This analysis indicates that increasing the molecular weight and melting temperature of anti-inflammatory drugs reduces their dissolution in supercritical CO2. On the other hand, raising the pressure, temperature, and solvent density enhances drug solubility in SCCO2. Furthermore, molecular weight has the weakest inverse influence and pressure has the strongest direct influence on drug solubility in SCCO2. The results of the relevancy analysis support the selection of the independent variables.
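To make the three relevancy coefficients concrete, the following is a minimal pure-Python sketch of the Pearson, Spearman, and Kendall coefficients. It is an illustration under stated assumptions and not the authors' implementation: the function names are hypothetical, the Kendall variant is tau-a (no tie correction), and the pressure and solubility values are made-up numbers, not entries from the collected databank.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Linear correlation coefficient in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Pearson coefficient computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """(concordant - discordant) pairs over all pairs; assumes no tie correction."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

# Illustrative (made-up) values: solubility rising monotonically with pressure.
pressure = [120.0, 160.0, 200.0, 240.0, 280.0]
solubility = [1.0e-5, 3.0e-5, 4.0e-5, 9.0e-5, 2.0e-4]
print(pearson(pressure, solubility), spearman(pressure, solubility),
      kendall_tau_a(pressure, solubility))
```

For a monotonic but nonlinear trend such as this one, the two rank-based coefficients reach their upper limit while the Pearson coefficient stays below it, which is one reason for reporting all three.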
Computational methodologies. Machine learning methods have been extensively used for approximation 40,41, interpretation 42, action recognition 43, and classification 44,45 purposes. This study focuses on five artificial neural networks (ANN), four hybrid neuro-fuzzy types, and three kinds of support vector regression (SVR) to simulate anti-inflammatory drug solubility in supercritical CO2. The considered ANN models are the multilayer perceptron neural network (MLPNN) 46, cascade feedforward neural network (CFNN), recurrent neural network (RNN) 49,50, general regression neural network (GRNN) 48, and radial basis function neural network (RBFNN) 51. The efficiency of support vector regression with the linear kernel (LSSVR-L) 52, polynomial kernel (LSSVR-P) 52, and Gaussian kernel (LSSVR-G) 53 is also evaluated for this purpose. The neuro-fuzzy models with the subtractive clustering membership function trained by the hybrid (ANFIS2-H) and backpropagation (ANFIS2-BP) algorithms have also been applied in the current study 54. The last intelligent tools used in the present research are the neuro-fuzzy models with the C-means clustering membership function trained by the hybrid (ANFIS3-H) and backpropagation (ANFIS3-BP) optimization strategies 55. It should be mentioned that these paradigms can be viewed as advanced regression-based tools. Therefore, they share all the limitations of conventional regression-based methods. In particular, the developed intelligent schemes are valid only within the ranges of experimental data reported in Table 1, and using them for extrapolation is not recommended.
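As background for the ANFIS2 models, the sketch below illustrates how subtractive clustering can turn a cluster radius into a set of cluster centers. It is a simplified illustration, not the implementation used in this study: the squash factor of 1.25 and the reject ratio of 0.15 are assumed defaults, the gray-zone acceptance test of the full algorithm is omitted, the inputs are assumed to be scaled to the unit range beforehand, and the sample points are made up.

```python
import math

def subtractive_clustering(points, radius, squash=1.25, reject_ratio=0.15):
    """Simplified subtractive clustering on points already scaled to [0, 1].

    radius       : cluster radius (neighbourhood size of a candidate center)
    squash       : assumed factor widening the radius used for suppression
    reject_ratio : assumed stop threshold relative to the first peak
    """
    alpha = 4.0 / radius ** 2
    beta = 4.0 / (squash * radius) ** 2

    def dist_sq(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # Potential of each point: high when many neighbours lie within the radius.
    potential = [sum(math.exp(-alpha * dist_sq(p, q)) for q in points)
                 for p in points]
    first_peak = max(potential)
    centers = []
    while True:
        best = max(range(len(points)), key=lambda i: potential[i])
        peak = potential[best]
        if peak < reject_ratio * first_peak:
            break  # remaining points are too weak to form a new cluster
        center = points[best]
        centers.append(center)
        # Suppress the potential of points close to the accepted center.
        potential = [pot - peak * math.exp(-beta * dist_sq(p, center))
                     for pot, p in zip(potential, points)]
    return centers

# Made-up, already normalized two-dimensional points forming two groups.
sample = [(0.1, 0.1), (0.12, 0.1), (0.1, 0.12),
          (0.9, 0.9), (0.88, 0.9), (0.9, 0.88)]
print(subtractive_clustering(sample, radius=0.6111))
```

In a neuro-fuzzy model of this type, each center typically seeds one fuzzy rule, so a smaller radius tends to give more rules and a larger radius fewer.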

Results and discussions
This section describes the construction of numerous candidate models of each intelligent paradigm through trial and error and identifies those with the lowest deviation from the experimental measurements. The most accurate model is then found by a ranking analysis. After this, several visual inspections are used to evaluate how well the selected model estimates anti-inflammatory drugs' solubility in supercritical CO2. The ability of the selected model to reproduce the physical behavior of the anti-inflammatory drugs in the supercritical fluid (the variation of drug solubility with the operating conditions) is also examined in this section.
Smart models' construction. The present research employs five types of artificial neural networks (MLPNN, CFNN, RNN, GRNN, and RBFNN), three kinds of support vector regression (LSSVR-L, LSSVR-P, and LSSVR-G), and four hybrid neuro-fuzzy approaches (ANFIS2-H, ANFIS2-BP, ANFIS3-H, and ANFIS3-BP) to simulate the anti-inflammatory drugs' solubility in supercritical CO2. Each of these intelligent tools has its own features that must be determined appropriately. Table 3 lists both the fixed and the tunable elements of the applied machine learning methods. This table also indicates the range over which each tunable feature was varied during the trial-and-error process. The last column of Table 3 shows the number of models constructed for each category. In total, 2150 intelligent estimators were built during the development stage.
Table 3. Complete information about the 2150 computational techniques constructed by the trial-and-error procedure.
Training process. The procedure followed to adjust the parameters of a machine learning method is known as the training process 56. This process uses historical data of a given phenomenon and an optimization algorithm. The literature has already compared the accuracy and computation time of some well-known training algorithms used in the training stage of machine learning methods 56. The training stage begins with randomly generated parameters. The estimated targets are obtained by entering the independent variables into the intelligent estimator. The deviation between the calculated and actual values of the dependent variable serves as the objective function of the optimization algorithm. The optimization algorithm repeatedly updates the parameters of the machine learning method to minimize the objective function, or at least to reduce it as much as possible. The training stage finishes when the maximum number of iterations is reached or the objective function converges to a prespecified value 57. A trained machine learning method can then be employed to estimate the target variable in unseen situations. All trained intelligent tools require only the independent variables to do so.
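The training loop described above can be sketched as follows. This is a deliberately small illustration rather than any of the training algorithms used in the study: it assumes a one-input linear estimator, plain gradient descent as the optimizer, and the mean squared error as the objective function, and the name `train_linear` and all numeric settings are hypothetical.

```python
import random

def train_linear(xs, ys, lr=0.5, max_iter=5000, tol=1e-10, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error.

    Training starts from random parameters and stops when either the
    maximum number of iterations is reached or the objective falls to `tol`.
    """
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)  # random start
    n = len(xs)
    loss, iteration = float("inf"), 0
    for iteration in range(1, max_iter + 1):
        # Deviation between calculated and actual targets.
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals) / n  # objective function
        if loss <= tol:
            break  # converged to the prespecified value
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        w -= lr * grad_w  # parameter update that reduces the objective
        b -= lr * grad_b
    return w, b, loss, iteration

# Made-up data generated from y = 2x + 1.
inputs = [0.0, 0.25, 0.5, 0.75, 1.0]
targets = [2.0 * x + 1.0 for x in inputs]
print(train_linear(inputs, targets))
```

Once trained, the estimator needs only the independent variable to produce a prediction, which mirrors the last point of the paragraph above.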
It can be seen from Table 3 that the radial basis function and general regression neural networks, as well as support vector regression, make use of the Gaussian function 58. The first two models have a Gaussian-shaped activation function, whereas the last uses the Gaussian as its kernel function.
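To illustrate the role of the Gaussian function, the sketch below shows the prediction step of a general regression neural network in its textbook form: every stored training sample acts as one Gaussian pattern neuron, and the output is the Gaussian-weighted average of the stored targets. It is a sketch under stated assumptions, not the network built in this study: the function name, the nearest-sample fallback for numerical underflow, and the data are all hypothetical.

```python
import math

def grnn_predict(x_train, y_train, x_query, spread):
    """Gaussian-weighted average of the training targets (textbook GRNN).

    x_train : list of input tuples, one pattern neuron per sample
    spread  : width of the Gaussian activation function
    """
    dists = [sum((a - b) ** 2 for a, b in zip(x, x_query)) for x in x_train]
    weights = [math.exp(-d / (2.0 * spread ** 2)) for d in dists]
    total = sum(weights)
    if total == 0.0:
        # Assumed fallback: every Gaussian underflowed to zero, so return
        # the target of the nearest stored sample.
        return y_train[dists.index(min(dists))]
    return sum(w * y for w, y in zip(weights, y_train)) / total

# Made-up one-dimensional data.
stored_inputs = [(0.0,), (1.0,)]
stored_targets = [0.0, 10.0]
print(grnn_predict(stored_inputs, stored_targets, (0.5,), spread=0.5))
print(grnn_predict(stored_inputs, stored_targets, (0.0,), spread=0.01))
```

With a very small spread, the prediction at a stored input collapses onto that sample's own target, so the network reproduces its training data almost exactly while interpolating poorly between samples.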
Smart models' selection. To find the best structure of each smart method, the prediction errors of the constructed models must be quantified using appropriate statistical criteria. The models with the lowest prediction errors are then selected as the best ones. In this way, it is also possible to determine the most appropriate structural features. Table 4 presents the final twelve smart paradigms (one model per category) with the smallest prediction errors. This table also reports the prediction errors of the selected models in terms of six uncertainty criteria (AAPRE%, MAE, RAE%, RRSE%, MSE, and R²). The calculated uncertainties are reported separately for the training and testing categories. Equations (2) to (7) show that only the laboratory-measured ($y_2^{exp}$) and calculated ($y_2^{cal}$) drug solubilities, the number of data ($N$), and the average value of the measured solubilities ($\bar{y}_2^{exp}$) are needed to quantify these accuracy criteria 38,59.
Ranking analysis for finding the most accurate smart model. The previous two sections combined the trial-and-error process with accuracy tracking to find the best topology of each smart machine. In this way, twelve models with the highest accuracy were extracted from the 2150 constructed approaches. The ranking technique is then used to find the most accurate estimator among these twelve smart methods. The outcome of applying the ranking technique to the results reported in Table 4 is plotted in Fig. 2. AAPRE%, MAE, RAE%, RRSE%, and R² have been used with equal weight in this ranking analysis. The GRNN and ANFIS2-H rank first in the training and testing stages, respectively. On the other hand, the worst model is the LSSVR-L, which ranks twelfth in both training and testing.
The GRNN fails to carry its excellent performance in the training step over to the testing phase, where it ranks fifth. This finding may indicate overfitting of the GRNN with 216 hidden neurons and a spread index of 1.3 × 10⁻⁴. The ANFIS2-H performs better in the testing stage than in the training stage (second and first rankings in the training and testing phases, respectively). Figure 2 also shows the performance of the selected intelligent approaches over the combined testing and training datasets.
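The six accuracy criteria and the equal-weight ranking can be sketched as follows. The formulas are the commonly used definitions of these criteria and are assumed here, since they may differ in detail from Eqs. (2) to (7); in particular, R² is taken as the coefficient of determination. The ranking rule (averaging each model's per-criterion position) is likewise an assumption about how an equal-weight ranking might be carried out, and all names and numbers are illustrative.

```python
import math

def accuracy_metrics(y_exp, y_cal):
    """Six accuracy criteria from measured and calculated values (assumed forms)."""
    n = len(y_exp)
    mean_exp = sum(y_exp) / n
    abs_err = [abs(e - c) for e, c in zip(y_exp, y_cal)]
    sq_err = [(e - c) ** 2 for e, c in zip(y_exp, y_cal)]
    abs_dev = sum(abs(e - mean_exp) for e in y_exp)
    sq_dev = sum((e - mean_exp) ** 2 for e in y_exp)
    return {
        "AAPRE%": 100.0 / n * sum(a / abs(e) for a, e in zip(abs_err, y_exp)),
        "MAE": sum(abs_err) / n,
        "RAE%": 100.0 * sum(abs_err) / abs_dev,
        "RRSE%": 100.0 * math.sqrt(sum(sq_err) / sq_dev),
        "MSE": sum(sq_err) / n,
        "R2": 1.0 - sum(sq_err) / sq_dev,
    }

def rank_models(metrics_by_model,
                criteria=("AAPRE%", "MAE", "RAE%", "RRSE%", "R2")):
    """Order models by the sum of their per-criterion positions (1 = best)."""
    names = list(metrics_by_model)
    total = {name: 0 for name in names}
    for crit in criteria:
        higher_is_better = crit == "R2"  # all other criteria are errors
        ordered = sorted(names, key=lambda m: metrics_by_model[m][crit],
                         reverse=higher_is_better)
        for position, name in enumerate(ordered, start=1):
            total[name] += position
    return sorted(names, key=lambda m: total[m])

# Made-up measured values and two hypothetical sets of predictions.
measured = [1.0, 2.0, 3.0, 4.0]
scores = {
    "model_a": accuracy_metrics(measured, [1.1, 1.9, 3.2, 3.8]),
    "model_b": accuracy_metrics(measured, [1.0, 2.1, 3.0, 4.1]),
}
print(rank_models(scores))
```

Because every criterion contributes one position per model, no single criterion dominates the final order, which is the intent of weighting them equally.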
Performance evaluation. This section uses several graphical inspections to visually assess the performance of the proposed ANFIS2-H. Cross-plots of the calculated versus actual drug solubilities in SCCO2 are depicted separately for the development (training) and validation (testing) stages in Fig. 3. In the legend of Fig. 3, the red hexagonal symbols denote the training subdivision, while the blue square symbols denote the testing subdivision.

Figure 2. Ranking orders of the selected intelligent strategies in the learning and testing steps as well as over the whole of the datasets (testing + training). [Axis: ranking order; series: training subdivision, testing subdivision, training + testing.]
Average values of the solubility of the considered anti-inflammatory drugs in supercritical CO2 from the experimental measurements and the ANFIS2-H predictions are illustrated in Fig. 4. This figure confirms a satisfactory agreement between the actual measurements and the model predictions. Moreover, it can be seen that Ibuprofen and Naproxen are the most and least soluble anti-inflammatory drugs in SCCO2, respectively. Nabumetone and Phenylbutazone, with almost equal average solubility levels, are the next most soluble drugs in the considered supercritical fluid.
The capability of the ANFIS2-H with the optimized topology for estimating the phase equilibria of all the drug/SCCO2 systems is depicted in Fig. 5 in terms of AAPRE%. The drug/SCCO2 phase equilibria are simulated with AAPRE values ranging from 1.04% (Phenylbutazone) to 6.05% (Nabumetone). As mentioned earlier, the overall AAPRE of the developed ANFIS2-H for predicting the 254 solubility data is 3.13%. It should be noted that an AAPRE lower than 10% is considered acceptable accuracy from the modeling perspective. Meanwhile, the highest observed uncertainty, for the Nabumetone solubility in supercritical carbon dioxide, may be associated with either measurement error in the experimental data or an inability of the ANFIS2-H to estimate the Nabumetone/SCCO2 equilibrium accurately.
The variation of Fenoprofen solubility in supercritical CO2 with temperature at constant pressure is shown in Fig. 6. This figure indicates that the ANFIS2-H successfully captures and reproduces the physical behavior of the Fenoprofen/SCCO2 system at different operating conditions. Moreover, the Fenoprofen solubility in the supercritical fluid increases with increasing pressure as well as temperature. The positive effect of temperature on the drug solubility becomes stronger as the pressure increases. It can therefore be claimed that the highest solubility in SCCO2 is achievable at the maximum allowable pressure and temperature.
It is worth noting that all the other anti-inflammatory drugs show a similar response to changes in pressure and temperature. These experimental and modeling findings fully agree with the results anticipated earlier by the relevancy analysis ("Experimental data for anti-inflammatory drug solubility in SCCO2" section).
The endothermic dissolution of the drugs in supercritical carbon dioxide may be responsible for the positive effect of temperature. On the other hand, increasing the pressure increases the mass driving force that transfers the drug molecules to the supercritical phase. The increase in the density of the supercritical fluid with pressure may be another factor responsible for this observation.
The influence of pressure at constant temperature on Tolmetin dissolution in supercritical carbon dioxide is shown in Fig. 7. Excellent agreement between the laboratory-measured data points and the ANFIS2-H predictions is observable in this figure. As in the previous analysis, the Tolmetin solubility in SCCO2 increases continuously with rising pressure or temperature. It can also be observed that the effect of pressure on the drug solubility is stronger at high temperatures than at lower ones.
As previously stated, the drug type also affects the magnitude of the solubility in supercritical CO2. The $y_2$-pressure profiles of several anti-inflammatory drugs in supercritical CO2 are presented in Fig. 8. This figure shows excellent agreement between the laboratory-measured data and the results calculated by the ANFIS2-H model. Indeed, the proposed estimator readily distinguishes the solubility of different anti-inflammatory drugs in SCCO2. This figure also demonstrates the gradual increase of the anti-inflammatory drugs' solubility with equilibrium pressure.
Analyzing data validity. Machine learning strategies gain their knowledge from the historical behavior of the phenomenon of interest (here, anti-inflammatory drug solubility in supercritical CO2). Experiments are of the highest importance in providing machine learning strategies with such insight. On the other hand, laboratory-measured or real-field historical data are inevitably contaminated by outliers 60. Measurement error, incorrect instrument calibration, and environmental side effects on the experiments are the primary sources of outliers 52. If outliers heavily contaminate an experimental databank used for model development, the reliability of the constructed approach is in question. Hence, the leverage tactic is suggested for inspecting the validity of the experimental data 56. This tactic plots the standardized residual (SR) against the leverage (hat) value. Equations (8) to (11) define these variables.
Here, $RE_{ave}$ and SD represent the average value of the residual error and the standard deviation, respectively. The result of applying the leverage tactic to the gathered database for the anti-inflammatory drug/SCCO2 systems is shown in Fig. 9. Only one of the six regions of Fig. 9 contains valid data; the other five are suspect regions. This tactic confirms that 244 out of 254 experiments are valid, so outliers may contaminate less than four percent of the historical data. This analysis reveals that the collected database used for model construction is mainly valid. Thus, the proposed ANFIS2-H can be used for estimating anti-inflammatory drug solubility in supercritical CO2 solely from molecular weight, melting temperature, pressure, solvent density, and temperature.
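The leverage analysis can be sketched as follows. This is a generic illustration of the approach and may differ from Eqs. (8) to (11): it assumes that the leverage is the diagonal of the hat matrix X(XᵀX)⁻¹Xᵀ, that the critical leverage is 3(p + 1)/N with p the number of feature columns, that the suspect bounds on the standardized residual are ±3, and that SD is the sample standard deviation. All names and data are hypothetical.

```python
import math

def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def leverage_analysis(features, y_exp, y_cal, sr_bound=3.0):
    """Return hat values, standardized residuals, critical leverage, validity."""
    n, p = len(features), len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(p)]
           for i in range(p)]
    inv = invert(xtx)
    # Diagonal of the hat matrix: h_k = x_k (X^T X)^-1 x_k^T
    hat = [sum(row[i] * inv[i][j] * row[j]
               for i in range(p) for j in range(p)) for row in features]
    residuals = [e - c for e, c in zip(y_exp, y_cal)]
    re_ave = sum(residuals) / n
    sd = math.sqrt(sum((r - re_ave) ** 2 for r in residuals) / (n - 1))
    sr = [(r - re_ave) / sd for r in residuals]
    h_crit = 3.0 * (p + 1) / n  # assumed critical leverage
    valid = [h <= h_crit and abs(s) <= sr_bound for h, s in zip(hat, sr)]
    return hat, sr, h_crit, valid

# Made-up data: an intercept column plus one input, with one large residual.
rows = [(1.0, float(i)) for i in range(20)]
measured = [float(i) for i in range(20)]
predicted = list(measured)
predicted[10] -= 5.0
hat, sr, h_crit, valid = leverage_analysis(rows, measured, predicted)
print(h_crit, valid.count(False))
```

A point is flagged as suspect when it falls outside either bound, so the share of suspect data is simply the fraction of `False` entries in the returned validity list.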

Conclusion
This study systematically compared the prediction accuracy of 2150 intelligent estimators from three different categories (artificial neural networks, hybrid neuro-fuzzy, and support vector regression) for estimating anti-inflammatory drug solubility in supercritical CO2. The comparisons showed that the adaptive neuro-fuzzy inference system with the subtractive clustering membership function (ANFIS2-H) has the highest accuracy for the considered objective. The cluster radius of this ANFIS2-H model, adjusted by the hybrid optimization algorithm, is 0.6111. The ANFIS2-H model estimated 254 laboratory-measured solubility data with AAPRE = 3.13%, MSE = 2.58 × 10⁻⁹, and R² = 0.99919. Furthermore, the AAPRE associated with each NSAID/SCCO2 phase equilibrium ranges from 1.04 to 6.05%. In addition, the LSSVR with the linear kernel function shows the worst predictive performance for estimating NSAID solubility in SCCO2. The relevancy analyses performed with three different techniques indicated that increasing a drug's molecular weight and melting temperature decreases its solubility in supercritical CO2. In addition, experimental observations, modeling findings, and relevancy analyses indicated that increasing pressure, temperature, and SCCO2 density raises the drug solubility in the supercritical solvent. The leverage methodology showed that only ten data points are potential outliers and that all other experiments are valid. Both modeling and experimental observations showed that supercritical CO2 has the highest and lowest dissolving tendency for Ibuprofen and Naproxen, respectively. Coupling the developed intelligent model with an optimization technique to precisely locate the operating conditions that maximize each anti-inflammatory drug's solubility in supercritical carbon dioxide may be considered as a next research step in this field.

Figure 9. Analyzing the laboratory-measured solubility data for identifying valid and suspect information. [Vertical axis: standardized residuals; legend: valid data, suspect data, critical leverage, upper suspect bound, lower suspect bound.]