Introduction

Because many pharmaceutical compounds are inherently hydrophobic, most drugs are poorly soluble or practically insoluble in aqueous solutions, which can make commercialization of some drugs impossible. Low solubility also means that a drug must be taken at a higher dose to achieve its therapeutic effect, so its efficacy is low and needs to be enhanced. Several techniques, based on chemical and physical methods, can improve the solubility of drug substances1,2,3. For instance, ball milling is a facile method for reducing the size of drug particles, and the smaller particle size increases their solubility4. Pharmaceutical cocrystallization is another approach; it relies on molecular interactions between the drug and a coformer to build a combination of species with enhanced solubility in aqueous media5.

For commercialization and application of a wide range of medications, more attractive processes and techniques are needed. For instance, drug nanonization via supercritical fluids has recently been studied as a way to enhance the solubility of medications by producing nanosized drug particles. The method is attractive because it uses supercritical fluids such as supercritical CO2 (SC-CO2), which are green solvents6. Of the steps involved in this process, determining the drug solubility in the supercritical fluid is the most important, because the solubility determines the process efficiency. Furthermore, because solubility varies with pressure and temperature, it must be evaluated as a function of these parameters. Thermodynamic and machine learning techniques have both been proposed for evaluating drug solubility in SC-CO2. Although thermodynamic models have a physical basis7,8, machine learning models have offered higher precision in estimating drug solubility in supercritical fluids9,10,11,12.

Machine learning (ML) models have emerged as effective methods for data analysis and predictive modeling across a wide range of domains. Over the past decade, the field has progressed significantly, as evidenced by the many algorithms and models developed for a wide range of applications, including drug development13,14. Several ML models have already been reported for correlating drug solubility, but models need to be customized for each new drug in order to build a generalized framework for analyzing drug solubility in SC-CO2. Here, for the first time, Polynomial Regression (PR), Extreme Gradient Boosting (XGB), and LASSO models are developed and optimized for estimating niflumic acid solubility in SC-CO2. The Barnacles Mating Optimizer (BMO) is used to tune these models.

In the field of linear regression analysis, LASSO regression is a powerful and widely used technique. It effectively addresses multicollinearity, overfitting, and high-dimensional data, balancing model interpretability and predictive accuracy. LASSO promotes feature selection and regularization, which helps to develop more robust and parsimonious models in a variety of domains15,16. PR is a useful extension of linear regression that allows us to model nonlinear relationships between variables. This technique can capture complex patterns in data and provide valuable insights into the underlying relationships by introducing higher-order polynomial terms17. The XGBoost regression algorithm is a commonly used machine learning algorithm that is well-known for its high predictive accuracy and efficient handling of complex datasets. XGBoost produces robust and interpretable regression models for a variety of applications by combining gradient boosting, regularization, and decision trees. Its ability to handle missing data, identify feature importance, and resist overfitting adds to its appeal among data scientists and practitioners seeking accurate and reliable predictive modeling18,19.

Data of solubility

The dataset used in this work is taken from a previous study20 and contains four variables for drug solubility in supercritical CO2: temperature, pressure, solvent density, and solubility of niflumic acid. Temperature and pressure define the experimental conditions, while solvent density and niflumic acid solubility are the corresponding measurements reported in ref. 20. Accordingly, the machine learning models in this study are built with two inputs (temperature and pressure) and two outputs (solvent density and solubility).

Methodology

Barnacles mating optimizer (BMO)

BMO, which is used for tuning the models in this work, draws inspiration from the mating behavior of barnacles. In the algorithm, each barnacle is a candidate solution (a combination of hyperparameters in this study)21. The BMO process comprises two primary stages, namely selection and reproduction. In the selection phase, two parent barnacles are chosen according to the penis length (pl), which sets the range within which mating can occur.

The algorithm uses the Hardy-Weinberg principle to generate offspring during the reproduction stage. If the selected pair falls within the range set by the father's pl, the offspring inherits a fraction p of its characteristics from the father and (1 - p) from the mother. If the pair falls outside that range, the new generation is produced by modifying only the maternal traits. This approach promotes exploitation when the mate is within the pl range and exploration when it is not22.

The generation of offspring from the parents' mating process is expressed by the following equation21,22:

$$x_{i}^{N\_\text{new}}=p\,x_{\text{barnacle}_{d}}^{N}+q\,x_{\text{barnacle}_{m}}^{N}$$

In this process, the generation of offspring relies on two random numbers, p and q, both falling within the range [0, 1]. Here, \(\text{barnacle}_{d}\) represents the solution for the father and \(\text{barnacle}_{m}\) the solution for the mother. If \(\text{barnacle}_{d}\) selects a mate that lies beyond the limit set by pl (for example, barnacle 8 when the limit is lower), the cap is exceeded and the usual mating process is terminated.

Instead, the algorithm employs a method called “sperm cast,” a term coined in BMO, to generate the offspring. This approach facilitates exploration during the mating process22:

$$x_{i}^{N\_\text{new}}=\text{rand}()\times x_{\text{barnacle}_{m}}^{N}$$

The function rand() generates a random number within the interval [0, 1].
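The two offspring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `bmo_offspring`, the use of index distance between parents as the pl test, and the choice q = 1 - p (taken from the inheritance fractions described above) are all assumptions made for the example.

```python
import random

def bmo_offspring(population, pl, rng=random):
    """Generate one offspring per barnacle (hypothetical helper).

    population: list of candidate solutions (lists of floats), assumed
                to be sorted from best to worst.
    pl:         assumed cap on the index distance between the parents.
    """
    n = len(population)
    offspring = []
    for _ in range(n):
        d = rng.randrange(n)  # index of the father, barnacle_d
        m = rng.randrange(n)  # index of the mother, barnacle_m
        father, mother = population[d], population[m]
        if abs(d - m) <= pl:
            # Normal mating: x_new = p * x_d + q * x_m, assuming q = 1 - p
            p = rng.random()
            q = 1.0 - p
            child = [p * f + q * mo for f, mo in zip(father, mother)]
        else:
            # Sperm cast: x_new = rand() * x_m (exploration)
            r = rng.random()
            child = [r * mo for mo in mother]
        offspring.append(child)
    return offspring
```

In a full optimizer, the offspring would be evaluated with the model's validation error, merged with the parents, and truncated back to the population size before the next iteration; those steps are omitted here.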

LASSO regression

LASSO (Least Absolute Shrinkage and Selection Operator) is an advanced statistical technique used in linear regression applications. It was introduced as a method to handle multicollinearity and perform feature selection by imposing a penalty on the absolute values of the regression coefficients. This model has gained popularity due to its ability to effectively handle high-dimensional datasets and produce interpretable and sparse models16,23.

The primary objective of LASSO regression is to determine the best linear model by minimizing the sum of squared residuals while simultaneously shrinking the less informative coefficients to zero. This encourages the selection of the most relevant features and avoids overfitting, leading to a more robust and generalizable model.

Consider a linear regression problem with n observations and p predictors. The model can be represented as:

$$y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}+\epsilon$$

where:

  • y is the dependent variable,
  • \(\beta_{0}\) is the intercept,
  • the \(x_{i}\) are the predictors,
  • the \(\beta_{i}\) are the coefficients, and
  • \(\epsilon\) is the error term.

The LASSO regression optimizes the following objective function24:

$$\underset{\beta}{\arg\min}\left\{\sum_{i=1}^{n}\left(y_{i}-\left(\beta_{0}+\sum_{j=1}^{p}\beta_{j}x_{ij}\right)\right)^{2}+\lambda\sum_{j=1}^{p}\left|\beta_{j}\right|\right\}$$

The symbol \(\lambda\) represents the regularization factor. As \(\lambda\) increases, the penalty on non-zero coefficients strengthens, leading to more shrinkage and to more coefficients being set exactly to zero (feature selection).
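To make the effect of \(\lambda\) concrete, the objective above can be minimized by cyclic coordinate descent with soft-thresholding. The sketch below is a plain-Python illustration under stated assumptions: the names `soft_threshold` and `lasso_fit` are hypothetical, the intercept is left unpenalized, and the threshold is \(\lambda/2\) because the squared-error term in the objective carries no 1/2 factor. It is not the solver used in this study.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_fit(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Unpenalised intercept: mean of the residual computed without b0
        beta0 = sum(
            y[i] - sum(beta[j] * X[i][j] for j in range(p)) for i in range(n)
        ) / n
        for j in range(p):
            rho = 0.0  # correlation of x_j with the partial residual
            z = 0.0    # sum of squares of x_j
            for i in range(n):
                pred_wo_j = beta0 + sum(
                    beta[k] * X[i][k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_wo_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta0, beta
```

With \(\lambda = 0\) the routine reduces to ordinary least squares; as \(\lambda\) grows, coefficients of weakly informative predictors reach exactly zero first, which is the feature-selection behaviour described above.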

Extreme gradient boosting (XGBoost)

XGBoost has gained widespread acclaim for its high predictive performance and versatility across a wide range of domains. As an ensemble learning technique, XGBoost combines the strengths of gradient boosting and regularization to deliver robust and accurate regression models19,25.

The primary objective of XGBoost regression is to build an optimized regression model that can effectively predict continuous numeric values. By combining weak learners (decision trees in this study), XGBoost progressively improves its predictive capability through iterative boosting. It aims to minimize the overall prediction error and to improve on traditional gradient boosting algorithms. A flowchart of the overall XGBoost process is displayed in Fig. 1 (ref. 26).

Fig. 1 The XGBoost flowchart.
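The iterative boosting idea can be illustrated with a deliberately simplified sketch: plain gradient boosting for squared error with one-split trees (stumps), written in standard-library Python. It is not XGBoost itself, since it omits the regularization terms, second-order gradient information, and deeper trees; the function names and the default `n_rounds` and `learning_rate` values are assumptions chosen for the example.

```python
def fit_stump(X, residuals):
    """Return (feature, threshold, left_mean, right_mean) minimising squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:
        # No feature has two distinct values: fall back to a constant leaf
        mean = sum(residuals) / len(residuals)
        return 0, 0.0, mean, mean
    return best[1:]

def boost_fit(X, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(y) / len(y)  # initial prediction: the mean of y
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        preds = [
            pi + learning_rate * (lm if row[j] <= thr else rm)
            for row, pi in zip(X, preds)
        ]
    return base, stumps, learning_rate

def boost_predict(model, row):
    """Sum the base value and the scaled contribution of every stump."""
    base, stumps, lr = model
    return base + sum(lr * (lm if row[j] <= thr else rm) for j, thr, lm, rm in stumps)
```

Each round fits a weak learner to what the ensemble still gets wrong, and the learning rate controls how much of that correction is applied; hyperparameters of this kind are what an optimizer such as BMO would tune.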

Polynomial regression (PR)

The PR method models nonlinear relationships between inputs and outputs. By introducing polynomial terms, it can capture complex patterns where linear models fail27,28. A polynomial function is fitted to the data to approximate the underlying relationship between the variables, which allows curved and nonlinear trends to be captured29.

The PR of order d is given by12:

$$y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}+\beta_{p+1}x_{1}^{2}+\beta_{p+2}x_{1}x_{2}+\dots+\beta_{p+n}x_{p}^{d}+\epsilon$$

where d is the PR order, \(x_{i}^{d}\) represents the d-th power of the i-th predictor, and \(\beta_{p+1}\) to \(\beta_{p+n}\) are the additional parameters to be estimated.
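The expansion in the equation above can be illustrated by generating the polynomial terms for a single sample. The helper below is a hypothetical standard-library sketch (the name `polynomial_features` is an assumption); it returns the constant term, the linear terms, and then all products up to order d, in the same order as the equation.

```python
from itertools import combinations_with_replacement

def polynomial_features(row, degree):
    """Return [1, x1, ..., xp, x1^2, x1*x2, ..., xp^degree] for one sample."""
    features = [1.0]  # constant term multiplying the intercept
    for d in range(1, degree + 1):
        # Every multiset of d predictor indices gives one monomial of order d
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for idx in combo:
                term *= row[idx]
            features.append(term)
    return features
```

For two inputs, as in this study (temperature and pressure), a second-order expansion of a placeholder sample `[2.0, 3.0]` yields the six terms 1, x1, x2, x1², x1·x2, x2². Fitting the PR model then amounts to ordinary linear regression on these expanded features.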

The ML modeling and optimization tasks were implemented in Python using machine learning, optimization, and plotting libraries.

Results and discussion

This section presents the results of the three models in estimating the solubility of niflumic acid and the corresponding solvent density, using temperature and pressure as inputs. Both responses were modeled, and the predictions of the three optimized models are compared to assess their accuracy. The results for all models and responses are listed in Tables 1 and 2. Three criteria are used for the comparison: \(R^2\) (coefficient of determination), RMSE (root mean square error), and maximum error.
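For reference, the three comparison criteria can be computed as in the following sketch, which uses their standard definitions. The function names are chosen for illustration and are not taken from the code used in this study.

```python
import math

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean square error, in the units of the response."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def max_error(actual, predicted):
    """Largest absolute deviation between actual and predicted values."""
    return max(abs(a - p) for a, p in zip(actual, predicted))
```

Because RMSE and maximum error carry the units of the response, their values for density and for solubility are not directly comparable with each other, whereas \(R^2\) is dimensionless.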

Table 1 SC-CO2 density estimation by ML models.
Table 2 Solubility estimation by ML models.

From the results listed in Tables 1 and 2, it is evident that Polynomial Regression (PR) consistently outperforms the other ML models in predicting both outputs, i.e., SC-CO2 density and niflumic acid solubility. The comparison of actual and predicted values for both outputs is shown in Figs. 2 and 3, which cover the training and testing data. PR achieves high \(R^2\) scores of 0.992 and 0.969 for density and solubility, respectively, with RMSE values of 12.203 and 0.256. XGB also demonstrates good predictive performance, with \(R^2\) scores of 0.927 and 0.930 and RMSE values of 28.623 and 0.286. LASSO, a regularized linear technique, is the least accurate of the three, with \(R^2\) scores of 0.819 and 0.821 and RMSE values of 40.774 and 0.462. These criteria confirm that PR is the most accurate model for describing density and solubility.

Fig. 2 Train and test results of predicted and actual values of solvent density using the PR model.

Fig. 3 Train and test results of predicted and actual values of solubility using the PR model.

Based on the modeling outcomes, the PR model was used to generate the 3D response surfaces of the two outputs, shown in Figs. 4 and 5. The individual effects of the inputs on both outputs are visualized in Figs. 6, 7, 8 and 9. The results reveal that solubility increases with solvent pressure: unlike organic liquid solvents, the supercritical solvent behaves like a gas, so its density varies with pressure. This is an important advantage of supercritical fluids, whose solvating power can be tuned by manipulating the process pressure in addition to the temperature. On the other hand, increasing the temperature reduces the solvent density (see Fig. 7), which has a negative effect on solubility, yet the solubility is enhanced with increasing temperature. This is attributed to the different phenomena that govern solubility below and above the cross-over pressure point of the system30. Determining the optimum operating point requires an economic evaluation of the operating cost at each pressure and temperature.

Fig. 4 Response surface of solvent density generated using the PR model.

Fig. 5 Response surface of niflumic acid solubility generated using the PR model.

Fig. 6 Solvent density as a function of pressure at different constant temperature levels.

Fig. 7 Solvent density as a function of temperature at different constant pressure levels.

Fig. 8 Niflumic acid solubility as a function of pressure at different constant temperature levels.

Fig. 9 Niflumic acid solubility as a function of temperature at different constant pressure levels.

Conclusion

In this study, three models, Polynomial Regression (PR), Extreme Gradient Boosting (XGB), and LASSO, were compared for estimating SC-CO2 density and niflumic acid solubility from temperature and pressure. PR emerged as the most accurate model, with \(R^2\) scores of 0.992 for density and 0.969 for solubility and RMSE values of 12.203 and 0.256, respectively. XGB also performed well, with \(R^2\) scores of 0.927 and 0.930 and RMSE values of 28.623 and 0.286. LASSO was the least accurate, with \(R^2\) scores of 0.819 and 0.821 and RMSE values of 40.774 and 0.462. The BMO algorithm was used to tune the hyperparameters of all three models. Overall, PR is the recommended model for accurate and interpretable predictions in these applications. The findings have practical implications for materials science, chemical engineering, and pharmaceutical research, supporting informed decision-making and process optimization. The developed methodology can serve as a generalized approach for data-driven decision making in pharmaceutical processing.