Modeling of nitrogen solubility in normal alkanes using machine learning methods compared with cubic and PC-SAFT equations of state

Accurate prediction of the solubility of gases in hydrocarbons is crucial for designing enhanced oil recovery (EOR) operations by gas injection, as well as separation and chemical-reaction processes in a petroleum refinery. In this work, nitrogen (N2) solubility in normal alkanes, the major constituents of crude oil, was modeled using five representative machine learning (ML) models, namely gradient boosting with categorical features support (CatBoost), random forest, light gradient boosting machine (LightGBM), k-nearest neighbors (k-NN), and extreme gradient boosting (XGBoost). A large solubility databank containing 1982 data points was used to establish models predicting N2 solubility in normal alkanes as a function of pressure, temperature, and molecular weight of the normal alkanes over broad ranges of operating pressure (0.0212–69.12 MPa) and temperature (91–703 K). The molecular weight of the normal alkanes ranged from 16 to 507 g/mol. In addition, five equations of state (EOSs), namely Redlich–Kwong (RK), Soave–Redlich–Kwong (SRK), Zudkevitch–Joffe (ZJ), Peng–Robinson (PR), and perturbed-chain statistical associating fluid theory (PC-SAFT), were used alongside the ML models to estimate N2 solubility in normal alkanes. The results revealed that the CatBoost model is the most precise model in this work, with a root mean square error of 0.0147 and a coefficient of determination of 0.9943. Among the EOSs, the ZJ EOS provided the best estimates of N2 solubility in normal alkanes. Lastly, relevancy factor analysis indicated that pressure has the greatest influence on N2 solubility in normal alkanes and that N2 solubility increases with increasing molecular weight of the normal alkanes.

www.nature.com/scientificreports/

Their model provided satisfying results, with an overall deviation lower than 10%. They also mentioned that for hydrocarbon + N2 systems (except CH4), k_ij is a decreasing function of temperature 39,40. At low temperatures, Justo-Garcia et al. 41 modeled vapor-liquid-liquid equilibria (VLLE) for N2 and alkanes in three distinct ternary systems. The findings demonstrated that both the SRK and PC-SAFT EOSs estimate the experimentally observed values with reasonable accuracy 41. In another study, Justo-Garcia et al. 42 used the SRK and PC-SAFT EOSs to model three-phase vapor-liquid-liquid equilibria for a natural gas mixture with high N2 content. The results revealed that the PC-SAFT EOS accurately predicts the phase behavior, whereas the SRK EOS suggests a three-phase region larger than that observed experimentally 42. The Krichevsky-Ilinskaya equation was used by Zirrahi et al. 27 to estimate the solubility of light solvents (CO2, N2, CH4, C2H6, and CO) in bitumens from five Alberta reservoirs. The gas phase was analyzed using the PR-EOS, and the suggested model was then validated against experimental data on light-solvent solubility. The results demonstrated that the proposed model accurately reproduces known solubility data in bitumen for light hydrocarbons (CH4 and C2H6) and non-hydrocarbon solvents (N2, CO2, and CO) 27. Haghbakhsh et al. 43 investigated the vapor-liquid equilibria of binary N2-hydrocarbon mixtures across extensive temperature and pressure ranges using the PR and ER EOSs. They introduced a new correlative model for these equations, improving accuracy by up to a factor of three 43. Thermo-physical characteristics of CO2/bitumen and N2/bitumen solutions were studied by Haddadnia et al. 28, and the PR-EOS was used to describe the measured solubility 28. PC-SAFT and SRK EOSs were employed by Wu et al. 44 to estimate gas solubilities in n-alkanes. The PC-SAFT EOS was found to accurately predict an empirically observed linear relation between gas solubilities in n-alkanes and their carbon number. Despite its satisfactory accuracy for gas solubility in lighter n-alkanes, the SRK EOS typically produces significantly poorer results than the PC-SAFT EOS 44. Tsuji et al. 45 investigated N2 and oxygen gas solubilities in benzene, divinylbenzene, and styrene. For a given isotherm, gas solubility in these liquids had a linear pressure dependency and declined with rising temperature. The PR-EOS was then implemented to predict the gas solubilities 45. Aguilar-Cisneros et al. 46 determined the solubility of N2, CO2, and CH4 in petroleum fluids using the PR-EOS in conjunction with various mixing rules for systems including bitumens, heavy oils, refinery cuts, and coal liquids. The universal and van der Waals mixing rules gave satisfactory agreement between experimental data and predicted values, while the modified Huron-Vidal mixing rule of order one produced large discrepancies 46. During the last decade, alongside developments in intelligent methods based on machine learning (ML) techniques, many attempts have been made to predict thermodynamic properties with higher accuracy from reliable experimental data. Abdi-Khanghah et al. 47 studied alkane solubility in supercritical CO2 using two kinds of artificial neural networks (ANNs): the radial basis function (RBF) and the multi-layer perceptron (MLP). The MLP-ANN outperformed the RBF-ANN in predicting n-alkane solubility in supercritical CO2 47. Songolzadeh et al. 48 demonstrated that the PSO-LSSVM model is an effective technique for predicting n-alkane solubility in supercritical CO2 with high accuracy.
The least-squares support vector machine (LSSVM) was employed, tuned using two different optimization algorithms: particle swarm optimization (PSO) and the cross-validation-assisted Simplex algorithm (CV-Simplex) 48. Chakraborty et al. 49 developed a set of data-driven models capable of predicting VLE for the binary systems C10-N2 and C12-N2. In comparison to the VLE modeled with the PR-EOS, both models significantly improved the estimated equilibrium pressure of the binary mixtures 49. Mohammadi et al. 50 implemented different ML models to predict hydrogen solubility in various pure hydrocarbons over wide pressure and temperature ranges and compared them with some of the common EOSs. Their results showed that intelligent models give more precise hydrogen solubility estimates than the commonly used EOSs 50. To predict nitrogen solubility in unsaturated, cyclic, and aromatic hydrocarbons, Mohammadi et al. 51 employed a convolutional neural network (CNN); the results showed that pressure is the most significant factor for nitrogen solubility in unsaturated hydrocarbons. In general, prediction based on semi-analytical EOS methods has been the common way to estimate N2 solubilities in alkanes. However, this approach is case-specific and limited to certain defined hydrocarbons, with specific parameters required for each EOS. Hence, combining suitable ML algorithms with reliable experimental data may yield a model that predicts N2 solubility in normal alkanes with high accuracy while accelerating predictions.
In this study, we use a dataset containing 1982 experimental N2 solubility data points for 19 distinct normal alkanes gathered under various operating conditions. Models for estimating N2 solubility in normal alkanes are constructed using well-known ML algorithms, namely k-nearest neighbors (k-NN) and random forest (RF), as well as more recent ML methods such as extreme gradient boosting (XGBoost), gradient boosting with categorical features support (CatBoost), and light gradient boosting machine (LightGBM). Furthermore, statistical parameters and graphical error assessments are used to verify the validity of the suggested models. Numerous N2 solubility systems are predicted by the methods proposed in this research and by five EOSs, namely perturbed-chain statistical associating fluid theory (PC-SAFT), Redlich-Kwong (RK), Peng-Robinson (PR), Soave-Redlich-Kwong (SRK), and Zudkevitch-Joffe (ZJ). Eventually, the relevancy factor is utilized to assess the relative impact of the input parameters on N2 solubility in normal alkanes.

Data collection
The modeling of N2 solubility in normal alkanes was performed using a large solubility databank containing 1982 data points collected from the literature 29. The properties of the 19 normal alkanes (nC1 to nC36) utilized in this survey are presented in Table 1.
The inputs of the models were chosen to be temperature (K), pressure (MPa), and molecular weight (g/mol) of the normal alkanes, whereas N2 solubility (in terms of mole fraction) was the desired output. The statistical details of the N2 solubility databank used for modeling are tabulated in Table 2.

Models' implementation
Algorithms' selection. Owing to recent advances in computational capacity and the advent of new machine learning algorithms, there are many candidate algorithms for the problem under consideration. Given the moderate size of the dataset, the small number of instances, and the limited number of features, non-parametric ML models, which learn directly from the data and do not suffer from small dataset sizes, were identified as the best choices in this case.

K-nearest neighbors (k-NN).
The k-NN method is an ML technique employed to solve both classification and regression problems. This supervised algorithm is widely used as a non-parametric technique for various applications 92. In this algorithm, k is the number of neighbors assigned to a new sample; the target is predicted from the k training samples closest to the new sample, using either uniform weights or weights derived from a specific distance function 93. The distance function allocates a weight to each of the k samples to quantify its contribution to the final predicted value. The Minkowski distance is the typical choice of distance function. Its general form is given in Eq. (1), where X and Y are the feature sets of two samples:

d(X, Y) = (Σ_i |x_i − y_i|^p)^(1/p)   (1)

This function reduces to the Manhattan or Euclidean distance by setting p = 1 or p = 2, respectively. Selecting the optimal value of the hyperparameter k is the most crucial stage in training this algorithm to achieve satisfactory accuracy. Hence, the algorithm is run over a wide range of k values, and the optimal case is identified by comparing statistical accuracy measurements among the explored cases.

Random forest (RF).
Random forest is a supervised ensemble learning algorithm 94. This algorithm avoids the high prediction variance that is a common issue with single decision trees. A random forest consists of trees that are built in parallel and do not interact with each other during forest construction. It works by training a large number of decision trees and then, in regression cases, taking the mean prediction of the individual trees. At each node, the number of attributes that may be considered for a split is limited to a certain proportion of the total, which is a hyperparameter. This guarantees that the ensemble model does not depend too strongly on any specific attribute and that all potentially predictive variables are considered.
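The neighbor-averaging scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration with uniform weights, not the scikit-learn implementation used in this work, and the (temperature, pressure, molecular weight) data points are hypothetical.

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance between two feature vectors (Eq. 1); p=1 Manhattan, p=2 Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn_predict(X_train, y_train, x_new, k=3, p=2):
    """Predict a target as the mean of the k nearest training samples."""
    order = sorted(range(len(X_train)), key=lambda i: minkowski(X_train[i], x_new, p))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k

# Hypothetical (T, P, Mw) -> solubility toy data, NOT the paper's databank
X = [[300, 10, 100], [300, 20, 100], [300, 30, 100], [350, 20, 200]]
y = [0.05, 0.10, 0.15, 0.12]
print(round(knn_predict(X, y, [300, 21, 100], k=3), 4))  # → 0.1
```

In practice, feature scaling matters for distance-based methods, since otherwise the feature with the largest numeric range dominates the Minkowski distance.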
In training each CART tree, the random forest technique picks a training dataset T_i at random from the complete training set T, with replacement (i.e., bootstrap sampling). The data not included in this random sampling are referred to as "out-of-bag" data. While building each CART tree, the random forest technique also picks N features randomly from the set of M input variables (N < M). Based on the randomly picked T_i and N features, the best splitting for each CART tree is calculated. For classification, the final result is determined via majority voting; for regression, the predictions of the individual trees are averaged, which reduces the mean squared error relative to any individual CART tree. The resulting ensemble prediction is expressed as follows (Eq. 2):

f(x) = (1/n) Σ_{i=1}^{n} T_i(x)   (2)

Extreme gradient boosting (XGBoost).
The fundamental concept behind a tree-based ensemble method is to use an ensemble of classification and regression trees (CARTs) to fit the training data by minimizing a regularized objective function. XGBoost is one such tree-based model and is part of the gradient boosting decision tree (GBDT) framework. Each CART is made up of (I) a root node, (II) internal nodes, and (III) leaf nodes, as illustrated in Fig. 1. The root node, which represents the entire dataset, is split into internal nodes by binary decisions, while the leaf nodes hold the final predictions. In gradient boosting, a sequence of basic CARTs is created sequentially, with the weight of each individual CART being adjusted during the training process 95. An ensemble of trees must be trained to predict y for a dataset with m features and n instances.
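The bootstrap-and-average random forest procedure described above (Eq. 2) can be sketched as follows. One-split regression stumps stand in for full CART trees to keep the example short; the data and configuration are illustrative only, not the paper's random forest setup.

```python
import random

def mean(v):
    return sum(v) / len(v)

def fit_stump(X, y, feat_idx):
    """Fit a one-split regression stump over the candidate features feat_idx."""
    best = None
    for f in feat_idx:
        for t in sorted({row[f] for row in X})[:-1]:
            left = [y[i] for i in range(len(X)) if X[i][f] <= t]
            right = [y[i] for i in range(len(X)) if X[i][f] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, ml, mr)
    if best is None:                          # no valid split (constant features)
        m = mean(y)
        return (feat_idx[0], float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    f, t, ml, mr = stump
    return ml if x[f] <= t else mr

def random_forest(X, y, n_trees=30, n_feats=1, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample T_i and a random feature subset (N < M)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]   # bootstrap sampling of T_i
        Xi, yi = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(len(X[0])), n_feats)          # random feature subset
        forest.append(fit_stump(Xi, yi, feats))
    return forest

def forest_predict(forest, x):
    # Eq. (2): average the individual tree predictions
    return mean([predict_stump(s, x) for s in forest])

# Hypothetical (temperature, pressure) -> solubility toy data
X = [[300.0, 10.0], [300.0, 20.0], [300.0, 30.0], [300.0, 40.0]]
y = [0.05, 0.10, 0.15, 0.20]
model = random_forest(X, y)
print(0.05 <= forest_predict(model, [300.0, 35.0]) <= 0.20)
```

Because each stump only ever outputs means of training targets, the averaged forest prediction is guaranteed to stay within the observed target range.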

The regularized objective minimized by XGBoost can be written as

Obj = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k),   Ω(f) = γT + (1/2) λ ||w||²

where Ω is the regularization term that helps to reduce overfitting by limiting the model's complexity; l is a differentiable, convex loss function; γ is the minimal loss reduction required to split a new leaf; T is the number of leaves; w denotes the leaf weights; and λ is the regularization coefficient. In these expressions, λ and γ help to control the model variance and avoid overfitting. In the gradient boosting technique, the objective function for each individual leaf is reduced, and additional branches are added sequentially.
The t-th iteration of the above training procedure is denoted by t. The XGBoost method greedily expands the space of regression trees to improve the ensemble model, which is why it is sometimes dubbed a "greedy algorithm". As a result, the model output is updated continuously by minimizing the objective function. XGBoost also makes use of a shrinkage technique in which newly added weights are scaled by a learning rate after each stage of boosting. This minimizes the risk of overfitting by reducing the influence of each individual tree and leaving room for future trees to improve the model 96.
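The sequential fitting with learning-rate shrinkage described above can be sketched as follows. This is a generic gradient boosting illustration on squared loss with one-dimensional stumps and hypothetical data; it is not the XGBoost library itself, and the regularization terms γ and λ are omitted for brevity.

```python
def best_split(x, r):
    """One-dimensional regression stump fitted to the current residuals r."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [r[i] for i in range(len(x)) if x[i] <= t]
        right = [r[i] for i in range(len(x)) if x[i] > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]

def boost(x, y, n_rounds=200, lr=0.1):
    """Sequentially add stumps fitted to residuals; each update is shrunk by the learning rate."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(x)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y[i] - pred[i] for i in range(len(x))]
        t, ml, mr = best_split(x, residuals)
        stumps.append((t, ml, mr))
        for i in range(len(x)):
            pred[i] += lr * (ml if x[i] <= t else mr)   # shrinkage step
    return f0, lr, stumps

def boost_predict(model, xq):
    f0, lr, stumps = model
    return f0 + sum(lr * (ml if xq <= t else mr) for t, ml, mr in stumps)

# Hypothetical 1-D data: the boosted model recovers the step function
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 3.0, 3.0]
model = boost(x, y)
print(round(boost_predict(model, 1.5), 3), round(boost_predict(model, 3.5), 3))  # → 1.0 3.0
```

With a learning rate of 0.1, each round removes only 10% of the remaining residual, so many more rounds are needed than with full-step fitting; this trade of iterations for stability is exactly the overfitting control the shrinkage technique provides.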

Light gradient boosting machine (LightGBM).
LightGBM is a novel gradient learning framework based on the decision tree concept. The main advantages of LightGBM over XGBoost are that it uses less memory, employs a leaf-wise growth method with depth constraints, and uses a histogram-based technique to speed up training. LightGBM discretizes continuous floating-point feature values into k bins using this histogram technique, resulting in a k-width histogram. Furthermore, the histogram technique does not require additional storage of pre-sorted results, and values may be stored as 8-bit integers after feature discretization, reducing memory usage to 1/8. The model's accuracy suffers slightly as a result of this coarser partitioning. LightGBM also employs a leaf-wise growth strategy, which is more effective than the usual level-wise strategy. The level-wise approach is inefficient because, at each step, only leaves from the same layer are examined, resulting in unnecessary computation and memory allocation. In contrast, at each stage of the leaf-wise method, the algorithm finds the leaf with the largest branching gain and then proceeds with the branching cycle. Compared with level-wise growth, errors can thus be reduced and greater precision attained with the same number of splits. The leaf-wise tree development technique is illustrated in Fig. 2. The disadvantage of leaf-wise growth is that it can build deeper decision trees, which may lead to overfitting. LightGBM therefore prevents overfitting while maintaining high efficiency by imposing a maximum depth restriction on top of the leaf-wise growth 97,98.
For a specific training dataset, LightGBM searches for an approximation f(x) to the function f*(x) that minimizes the expected value of a specific loss function L(y, f(x)):

f̂ = arg min_f E[L(y, f(x))]

Newton's method is employed to approximate the objective function.
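The histogram idea, discretizing continuous feature values into k bins so the bin indices fit in small integers, can be sketched as below. This is an equal-width illustration with made-up values; LightGBM's actual binning strategy is more sophisticated.

```python
def build_histogram(values, k=8):
    """Map continuous values to k equal-width bins and count occupancy per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against a constant feature
    bins = [min(int((v - lo) / width), k - 1) for v in values]
    counts = [0] * k
    for b in bins:
        counts[b] += 1
    return bins, counts

# 16 evenly spread pressures (MPa, illustrative) -> 8 bins of 2 values each
pressures = [float(p) for p in range(16)]
bins, counts = build_histogram(pressures, k=8)
print(counts)  # → [2, 2, 2, 2, 2, 2, 2, 2]
```

Since every bin index here is below 256, each discretized value fits in a single 8-bit integer, which is the memory saving the text refers to.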

Gradient boosting with categorical features support (CatBoost).
CatBoost handles categorical columns for categorical boosting using the one_hot_max_size (OHMS) parameter together with a permutation-driven, target-based statistics technique. For each new split of the current tree, a greedy approach is utilized, allowing CatBoost to consider combinations of categorical features 99. In CatBoost, for each feature with more categories than OHMS, the following steps are applied:

1. Records are divided into subsets at random.
2. Labels are converted to integers.
3. Categorical features are converted to numeric values as follows:

avgTarget = (countInClass + prior) / (totalCount + 1)

where countInClass is the number of preceding objects with a target value of one for the given category, prior is a starting parameter, and totalCount is the number of preceding objects of that category 100,101.
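The target-statistic conversion in step 3 can be sketched as follows, computing each row's statistic from the preceding rows only, in the spirit of CatBoost's ordered scheme. The random-permutation step is omitted for brevity, and the data and prior value are illustrative.

```python
def ordered_target_encode(categories, targets, prior=0.5):
    """Encode each categorical value as (countInClass + prior) / (totalCount + 1),
    counting only the objects that precede the current row."""
    count_in_class = {}   # per category: preceding rows with target == 1
    total_count = {}      # per category: preceding rows of this category
    encoded = []
    for c, t in zip(categories, targets):
        cic, tc = count_in_class.get(c, 0), total_count.get(c, 0)
        encoded.append((cic + prior) / (tc + 1))
        count_in_class[c] = cic + (1 if t == 1 else 0)
        total_count[c] = tc + 1
    return encoded

print(ordered_target_encode(["a", "a", "a", "b"], [1, 0, 1, 1]))
# → [0.5, 0.75, 0.5, 0.5]
```

Using only preceding rows prevents a row's own target from leaking into its encoded feature, which is the motivation for the ordering (and for the random permutations of step 1).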

Equations of state (EOSs).
An EOS is a mathematical expression for the relation among a substance's volume, temperature, and pressure. Such an equation may be used to describe VLE, volumetric behavior, and thermodynamic properties of mixtures and pure substances, and EOSs are commonly used to estimate the phase behavior of petroleum fluids. As previously stated, however, EOSs are poor predictors of gas solubility in solvents, particularly under complicated working conditions. Five EOSs were used to assess N2 solubility in hydrocarbons in this research, and their reliability in predicting N2 solubility is compared against the ML algorithms. The mathematical forms of the implemented EOSs are shown in Table 3, and Table 4 shows the parameters of the EOSs. The molecular parameters required for each substance investigated with the PC-SAFT EOS are provided in Table 5. In addition, a proper mixing rule is needed to estimate each mixture's parameters; in this study, the van der Waals one-fluid mixing rules have been utilized, and the corresponding mathematical expressions are provided in Table 4.

Evaluation of models
The following statistical parameters, namely root mean square error (RMSE), standard deviation (SD), and coefficient of determination (R²), were used in this survey to evaluate the performance of the models.

Table 3. EOS formulas utilized in this study 103,104. For the PC-SAFT EOS, the reduced Helmholtz energy is ã = A/(kTN) = ã_id + ã_hc + ã_disp + ã_assoc 105.

Table 3 lists each EOS formula with its references. For non-hydrocarbons and hydrocarbons lighter than C7, the parameters a_i and b_i depend on the chain length as given in Gross and Sadowski 105; the expressions for the dispersion and ideal-gas contributions are identical to those of Gross and Sadowski 105,106; and the van der Waals one-fluid mixing rules are applied.

On the other hand, the following graphical tools were utilized simultaneously to evaluate the performance of the ML models:

Cross plot: The most well-known graphical analysis, in which the predicted values are plotted against the measured values and the accuracy of the models is evaluated by examining the proximity of the data points to the unit-slope line.
Trend plot: This plot helps to check the validity of the model by sketching both real data and the model's estimation versus the specific property or data index.
Error distribution plot: The error (measured value − predicted value) is plotted against the real data to assess the scatter of data around the zero-error line and to explore the possible error trend.
Histogram plot of errors: This graph shows how the errors from the model are distributed. This statistical tool indicates the discrepancy between the measured and predicted values, in which a normal distribution centered at zero error is expected for a good model.
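The statistical criteria themselves can be computed as in the following sketch. The RMSE and R² definitions are standard; the SD shown here, the standard deviation of the relative errors, is one common convention and may differ in detail from the paper's exact formula. The example values are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def sd(y_true, y_pred):
    """Standard deviation of relative errors (one common convention; may differ from the paper's)."""
    n = len(y_true)
    return math.sqrt(sum(((a - b) / a) ** 2 for a, b in zip(y_true, y_pred)) / (n - 1))

# Illustrative solubility values (mole fraction)
yt = [0.05, 0.10, 0.15, 0.20]
yp = [0.06, 0.09, 0.15, 0.21]
print(round(rmse(yt, yp), 4), round(r2(yt, yp), 4))  # → 0.0087 0.976
```

A perfect model gives RMSE = 0, SD = 0, and R² = 1, which is the limit against which the values in Table 7 should be read.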

Results and discussion
Model optimization and tuning. To find the best model for each of the aforementioned algorithms, a routine procedure was followed to tune the hyperparameters and other functional settings of each model. Since the models were implemented in Python, different libraries were employed in this study: scikit-learn for k-NN and random forest 110, xgboost for XGBoost, lightgbm for LightGBM 98, and catboost for CatBoost 99. Each of these libraries involves parameters that can be set by the user or left at their default values. To find the best configuration of each algorithm, a wide range of parameter values was explored, and the best model was chosen based on the RMSE of the training and test data. The search space and the final model settings are provided in Table 6.
Statistics and performance metrics of the models. The models' precision in predicting N2 solubility in normal alkanes was assessed statistically using several criteria, including RMSE, R², and SD. Table 7 reports the calculated values of these statistical factors for the training subset, testing subset, and the entire dataset for all ML models. The possibility of overtraining can be rejected given that no meaningful difference was seen between the testing and training subsets for any model. Based on Table 7, the CatBoost model has the lowest prediction errors among the developed ML models, with RMSE values of 0.0125, 0.0213, and 0.0147 for the training subset, testing subset, and the entire dataset, respectively. The overall R² of 0.9943 for the CatBoost model is also higher than those of the other models, and its SD is lower, indicating a better fit of this model to the experimental data. The random forest, XGBoost, LightGBM, and k-NN models rank after the CatBoost model in terms of performance, in that order. As mentioned earlier, several EOSs were used alongside the ML models to estimate N2 solubility in normal alkanes. Hence, the solubilities of N2 in several normal alkanes, namely hexadecane, eicosane, octacosane, and hexatriacontane, for which experimental values have been reported in the literature 29,90, were estimated using the ML models and EOSs. Tables 8, 9, 10 and 11 present the N2 solubility data and the predictions of the EOSs and ML models, along with the RMSE values for each. As can be seen, the CatBoost model provides the most accurate estimates.

Graphical analysis of the models. In the next step, the evaluation of the ML models is performed via graphical analysis. First, cross plots of the experimental N2 solubility data versus the values predicted by the ML models for the training and testing stages are presented in Fig. 3.
All five ML models performed well in both the training and testing stages, and most of the data points accumulate around the X = Y line; however, the scatter of points is much smaller for the CatBoost model and is more concentrated around the X = Y line, indicating the excellent performance of this model in estimating N2 solubility in normal alkanes. Next, the distributions of the N2 solubility prediction errors (measured − predicted) for the ML models versus the experimental data are shown in Fig. 4. A high concentration of near-zero error points indicates a better performance of a predictive tool for N2 solubility in normal alkanes. Again, the CatBoost model resulted in near-zero errors, verifying its accuracy and reliability. The other ML models, especially random forest, also show good predictions with low errors for N2 solubility in normal alkanes.
The next step of the graphical assessment of the introduced ML models for predicting N2 solubility in normal alkanes concerns the frequency of errors. Figure 5 depicts the histograms of errors corresponding to the ML models proposed in this work. Clearly, symmetric distributions are seen in the histograms of all ML models. Also, the sharp peaks at the zero-error value for all developed models confirm the excellent match between the estimated and experimental N2 solubility data. Notably, the percentage frequency of errors at the zero-error value is about 85% for the CatBoost model, much higher than for the other ML models, indicating the high credibility of this model in estimating N2 solubility in normal alkanes.
Overall, all the models used in this study show satisfactory performance, but as is obvious from the statistical and graphical analyses, the CatBoost model performs best among the implemented ML models. The performance of a model depends on many factors, such as the case of study and the structure of the dataset, and the superiority of this model stems from two main reasons. The first is the structure of the dataset used in this work: many instances have equal values in n − 1 features and differ only in one feature. This enables the tree-based models to perform better splitting operations and ultimately yields higher accuracy. Secondly, CatBoost uses symmetric trees, which allow faster inference, and its boosting scheme is the main reason it avoids overfitting and increases model quality during training. Finally, it should be noted that these advantages of CatBoost depend strongly on the dataset and cannot be generalized to all problems.
Pressure and temperature trend analysis. As the final assessment step, various visual evaluations were executed to appraise the CatBoost model's capability across various N2-in-hydrocarbon solubility systems. Figure 6 represents the effect of pressure on N2 solubility for the n-decane system at a temperature of 503 K. Figure 6 shows the N2 solubilities estimated by the CatBoost model for this case, as well as the values determined by the EOSs, along with the experimental results from the literature 87. The mismatch between the standard EOS estimations and the actual experimental data is quite significant at high temperatures, whereas the CatBoost model predicts the experimental data quite well. As expected, the solubility of N2 in n-decane rises as the pressure increases. The EOSs overestimate or underestimate the growth of N2 solubility as pressure rises, while the CatBoost model closely traces the trend.
The predictions of CatBoost and the other proposed ML models for N2 solubility in a light hydrocarbon (methane) 61 under various operating conditions at a constant temperature of 180 K are provided in Fig. 7. All the intelligent models follow the trend well and show a positive trend in N2 solubility as pressure increases. The CatBoost model, as shown in this figure, accurately recognizes the data patterns and provides excellent estimations at all pressures. As shown in Fig. 8, similar to the previous case, satisfactory trend capturing is observed for all the intelligent models, although the CatBoost model provides more accurate predictions. The figure also indicates an increase in N2 solubility as temperature rises.

Sensitivity analysis. Utilizing the CatBoost model as the best-developed model in the current study, a sensitivity analysis was performed. To this end, the relevancy factor (r) 113 was calculated for each input parameter using the following equation, with the knowledge that the higher the r-value, the greater the impact on the model's output. It should also be noted that a positive r-value for a parameter indicates its direct effect on the output of the model, and vice versa 114:

r = Σ_j (I_{i,j} − I_{m,i}) (NS_j − NS_m) / sqrt( Σ_j (I_{i,j} − I_{m,i})² · Σ_j (NS_j − NS_m)² )

where I_{i,j} represents the jth value of the ith input variable (i being the molecular weight of the normal alkanes, pressure, or temperature); I_{m,i} is the mean value of the ith input; and NS_m and NS_j denote the mean value and the jth value of the predicted N2 solubility in normal alkanes, respectively. The outcomes of the relevancy factor analysis are depicted in Fig. 9. According to Fig. 9, all input parameters, namely temperature, pressure, and molecular weight of the normal alkanes, have a positive effect on N2 solubility in normal alkanes. The results reveal that pressure has the greatest impact on N2 solubility in normal alkanes and that N2 solubility increases with increasing molecular weight of the normal alkanes.
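The relevancy factor is a Pearson-type correlation between one input variable and the predicted output, and can be computed with a short sketch (the input and prediction values below are illustrative, not the paper's databank):

```python
import math

def relevancy_factor(inp, out):
    """Relevancy factor r between one input variable and the predicted solubility."""
    im = sum(inp) / len(inp)          # I_(m,i): mean of the input variable
    nsm = sum(out) / len(out)         # NS_m: mean predicted solubility
    num = sum((i - im) * (o - nsm) for i, o in zip(inp, out))
    den = math.sqrt(sum((i - im) ** 2 for i in inp) * sum((o - nsm) ** 2 for o in out))
    return num / den

pressures = [10.0, 20.0, 30.0, 40.0]       # hypothetical input values (MPa)
solubility = [0.05, 0.11, 0.14, 0.20]      # hypothetical model predictions
print(round(relevancy_factor(pressures, solubility), 3))  # → 0.992
```

An r near +1, as here, indicates a strong direct effect of the input on the output; a negative r would indicate an inverse effect.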
Based on Henry's law, the amount of dissolved gas in a liquid is proportional to its partial pressure in equilibrium with that liquid. When the gas is at a higher pressure, its molecules collide more often with each other and with the liquid's surface. As the molecules collide more with the surface of the liquid, they can squeeze between the liquid molecules and thus become part of the solution 115,116. On the other hand, the sensitivity analysis shows overall that the solubility of N2 in normal alkanes increases with temperature. This reflects the reverse-order solubility phenomenon, the opposite of what commonly happens for a binary mixture of a supercritical component and a subcritical component 73,81. The reason may be the repulsive nature of the N2-N2 interaction: the N2-N2 repulsive force decreases with increasing temperature, which results in increased solubility of N2 at higher temperatures. However, increasing N2 solubility with temperature may not hold for all normal alkanes; a literature survey shows that N2 solubility in methane and ethane decreases with increasing temperature 117. Normal alkanes are nonpolar, as they contain nothing but C-C and C-H bonds. N2 is also a nonpolar molecule, and nonpolar substances tend to dissolve in nonpolar solvents such as normal alkanes. The molecular weight of the normal alkanes is increased mainly by adding C-C and C-H bonds; the obvious consequence is that N2 solubility increases as the number or length of the nonpolar chains increases.

Conclusions
In the present work, N2 solubility in normal alkanes (nC1 to nC36) was modeled using five representative ML models, namely CatBoost, k-NN, LightGBM, random forest, and XGBoost, by utilizing a large N2 solubility databank spanning wide ranges of operating temperature (91.21-703.4 K) and pressure (0.0212-69.12 MPa). Also, five EOSs, namely RK, SRK, ZJ, PR, and PC-SAFT, were used alongside the ML models to estimate N2 solubility in normal alkanes. The developed CatBoost model was superior to all the other ML models and the EOSs, with an overall RMSE of 0.0147 and R² of 0.9943. The random forest, XGBoost, LightGBM, and k-NN models ranked after the CatBoost model in terms of performance, in that order. Furthermore, the ZJ EOS showed the best performance among the EOSs. Finally, the results of the relevancy factor analysis indicated that all input variables of the models, namely temperature, pressure, and molecular weight of the normal alkanes, have a positive effect on N2 solubility in normal alkanes, with pressure having the greatest effect among these input variables. The solubility of N2 increases with increasing molecular weight of the normal alkanes.