Modeling of CO2 adsorption capacity by porous metal organic frameworks using advanced decision tree-based models

In recent years, metal organic frameworks (MOFs) have been distinguished as a very promising and efficient group of materials which can be used in carbon capture and storage (CCS) projects. In the present study, the potential ability of modern and powerful decision tree-based methods such as Categorical Boosting (CatBoost), Light Gradient Boosting Machine (LightGBM), Extreme Gradient Boosting (XGBoost), and Random Forest (RF) was investigated to predict carbon dioxide adsorption by 19 different MOFs. Reviewing the literature, a comprehensive databank was gathered including 1191 data points related to the adsorption capacity of different MOFs in various conditions. The inputs of the implemented models were selected as temperature (K), pressure (bar), specific surface area (m2/g) and pore volume (cm3/g) of the MOFs and the output was CO2 uptake capacity (mmol/g). Root mean square error (RMSE) values of 0.5682, 1.5712, 1.0853, and 1.9667 were obtained for XGBoost, CatBoost, LightGBM, and RF models, respectively. The sensitivity analysis showed that among all investigated parameters, only the temperature negatively impacts the CO2 adsorption capacity and the pressure and specific surface area of the MOFs had the most significant effects. Among all implemented models, the XGBoost was found to be the most trustable model. Moreover, this model showed well-fitting with experimental data in comparison with different isotherm models. The accurate prediction of CO2 adsorption capacity by MOFs using the XGBoost approach confirmed that it is capable of handling a wide range of data, cost-efficient and straightforward to apply in environmental applications.

www.nature.com/scientificreports/ regeneration step. Furthermore, the effectiveness of the capturing process could be maximized provided that the structure of the material could be modified using various functional groups and molecular tuning approaches. Metal organic frameworks (MOFs) have been distinguished as a very promising and efficient group of materials which can be used in CCS projects because of their unique properties of being modifiable, stable in high temperatures, and having a chemical structure which can be easily adjusted [16][17][18][19][20] . Therefore, different advantages and disadvantages of MOFs have been investigated by several researchers such as an investigation done by Le et al., which reported how the architecture and active functional groups of MOFs could be controlled 17 . Moreover, having conducted various investigations on the thermal stability of MOFs, scientists have reported the impact of morphology and crystalline shape of these materials on their thermal stability [21][22][23][24] . Taking MOFs synthesized from strontium and calcium by the way of example, Yeh et al. 25 found that, as a result of their micro-porosity, these materials were thermally stable in all temperatures lower than 450 °C. Some other scientific investigations have found that MOFs could be modified for various applications just by changing the functional groups, which are located on their pore walls 26,27 . It is wieldy known that MOFs are comprised of metal ions clusters and linkers which are organic molecules. These linkers have a 3D-structure of pores and connecting channels. While 3D-structurevoids adsorb molecules as their host, the primary structure of these materials provides reversible channels and pores as soon as desorption step take place [26][27][28] . Properties of MOFs is determined by the selected linker and the metal. For example, zinc in the structure of IRMOF-1 is the metal which is located at the center of structures and is connected to the terephthalic molecules. Having this structure, IRMOF-1 benefits from existence of pores with high capacity of adsorption. Furthermore, there are various groups of MOFs with their unique structures, properties and application. Some MOFs like UMCs have unsaturated metallic centers 29 , which provides carbon dioxide molecules with more active sites, hence facilitates formation of strong bonds between CO 2 and the structure. Other advantage of MOFs in comparison to other materials like zeolite is that the MOFs have typically wider pores, which boost diffusion rate of molecules not only in a single structure, but also between different crystals [30][31][32][33] . That is why scientists believe that MOFs are promising for adsorption of CO 2 , emphasizing on their adjustable porosities and having a modifiable surface chemistry 5,[34][35][36] . Regarding developing and testing MOF, there are some important challenges and obstacles to overcome or be solved using alternative methods. Firstly, experimental investigation of adsorption capacity of MOFs not only is time consuming, but also is not cost-effective. Secondly, data cannot be matched with the developed isotherms because typically these isotherms are proposed for a specific range of data 31 . So as to address the problem, many scientists have been trying to use soft computing methods by which not only the time and money could be saved, but also there is no need for simplifying assumptions and the data could be modeled more precisely. Artificial intelligence (AI) approaches are useful tools which enable us to estimate and develop representative models in various disciplines [37][38][39][40][41][42] . These powerful algorithms are able to model non-linear relationship which exist between influencing parameters. Up to the date, a plethora of models have been developed such as fuzzy logic, radial basis networks, support vector machine, and colony optimization [43][44][45][46][47] .
In the current investigation, authors tried utilizing smart models for the prediction of nonlinear adsorption of CO 2 by MOFs. According to the literature, scientists have done fewer researches on the modeling of CO 2 adsorption by MOFs using AI methods. To model the CO 2 uptake capacity of different MOFs, new and powerful methods of CatBoost, LightGBM, Random Forest (RF), and XGBoost were employed. CatBoost (which is the abbreviated form of categorical boosting) is an open-source and modern gradient boosting library and it can deal with problems which are intrinsically heterogeneous thought handling categorical features. XGboost and LightGBM belong to GBDTs (Gradient Boosted Decision Trees) and in these methods a tree structure includes two separate steps. Firstly, the appropriate structure for the tree must be found. Secondly, leaf values must be set as soon as the tree structure is finalized 48 . Another approach which was used in this investigation was Random Forest. Since it was introduced, the accuracy of classifications improved significantly because not only growth of various trees were allowed, but also the program is able to vote for best and most distinguished class 49

Implementation of models
Extreme gradient boosting model (XGBoost). In a tree-based ensemble method, a group of various classification and regression trees (CARTs) are utilized to minimize a set of objective functions applied to a training dataset. The XGBoost approach could be contemplated as a tree-based model which basically belongs to a gradient boosting decision tree (GBDT). So as to explain the CART's basic structure, it is comprised of three various nodes namely (a) the main node (root node), (b) internal nodes, and (c) leaf nodes like as illustrated in Fig. 1. The binary decision-making processes will split the root node into internal nodes. Doing so, the dataset which is located in the root will be classified into various nodes in the internal nodes and the final classification will take place in the leaf nodes as the final classes. Aiming for developing a powerful set according to the gradient booting model, an ensemble of CATRs are introduced and developed using determination of their influence by giving them a specific weight during the training process 54 .
In a dataset where m dimension features and n examples exist, the modeling output (y) would be trained according to the following expression to form n tree nodes 55  www.nature.com/scientificreports/ where a binary leaf index will be formed by mapping an example X using a defined decision rule q(x). In Eqs. (1) and (2), the corresponding space of each regression tree is depicted by 'f '. Accordingly, f k represents the kth tree, tree leaves are denoted by T, and their corresponding weight is determined by ω.
As the next step in the modeling, tree sets will be determined by minimizing an objective function denoted by L 55 : In the given formulation, Ω represents the regularization function and limits the model complexity by reducing the overfitting issues; loss function is shown by l and intrinsically is a differentiable convex; the minimum loss is denoted by γ and it is necessary in division of a new ultimate class as a leaf, and λ stands for the regulation coefficient. γ and λ facilitates the growth of the variance of the model and resultantly plummet the overfitting issue 55 . Every leaf in the boosting model has its own objective function, which should be minimized iteratively as follows 55 : In the presented formula, t is the iteration number for the minimization of a leaf objective function in the training step. So as to improve the model, an algorithm known as the greedy algorithm is utilized, which is designed to provide enough space for regression trees. Doing so, XGBoost model can continuously update its final results through improving the preciseness of the objective functions 55 : www.nature.com/scientificreports/ Shrinkage strategy is another strategy that XGBoost method utilizes properly. In this strategy, a learning factor is defined and its learning rate is regulated in every gradient boosting step through definition of additional weights. Shrinkage strategy prevents overfitting problem by restricting future trees from affecting previously formed trees 56 .

Light gradient boosting machine (LightGBM).
On the basis of gradient learning theories, a novel learning machine was developed which is known as LightGBM 57 . The LightGBM approach, in comparison to XGBoost, needs lower memory spaces and speeds up the training step by using a histogram 58 . LightGBM can form a histogram with a width of 'k' by discretizing eigenvalues into 'k' different bins. Furthermore, the aforementioned approach diminishes the need for a set of pre-sorted results and values will be saved in an integer with a size of eight bits, which results in a drastic reduction of memory consumption. That said, such as kind of approach unfortunately results in dipping the model's preciseness. Leaf-wise approach has also been utilized in LightGBM. Drawing comparison between traditional growth strategies and the leaf-wise strategy, it must be admitted that this approach is considerably more efficient than the others. What makes the leaf-wise strategy more efficient than the alternative level-wise strategy is this fact that the leaves existing in the same layer are properly taken into consideration, which diminish the need for unnecessary allocation of memory. Therefore, by finding leaves with the maximum branching gain, errors could be minimized and a better accurateness could be achieved. In Fig. S1, the leaf-wise strategy of tree development is illustrated. An important drawback of leaf orientation is that as the decision trees grow deeper, unfavorable overfitting gets exacerbated. That said, simultaneously when LightGBM is resulting in overfitting, by definition of an upper limit on depth of the leaf top, a high efficiency will be achieved 57,58 .
Regarding the formulation of a LightGBM model, parameters and calculations could be introduced as follows 59 : Being given a training dataset of , the LightGBM approach approximates a f (x) according to a f * (x) aiming for the minimization of the desirable values of a loss function shown by L y, f (x) : A wide variety of T regression trees with the formulation of T t=1 f t (x) will be formed as LightGBM sets which can be used for the model's approximation. In a defined regression tree ( W q(x) , q ∈ {1, 2, . . . , N} ), w stands for a vector which represents the weight of each leaf node, N depicts how many leaves exist in a tree, and the decision rules applied to trees are shown by q. The training step of the model development, at step t, is formulated as follows 59 : The objective functions are determined by utilizing Newton's method.

Gradient boosting with categorical features support (CatBoost).
In the CatBoost approach aiming for successful application of categorical boosting technique, categorical columns must be utilized. The aforementioned column benefit from a range of processing techniques. The target-based statistics and the one_hot_ max_size (OHMS) are the most important ones. In the growth of every branch of the running tree, a greedy approach will be applied to facilitate finding changes in the combination of features of CatBoost method 60 . In the CatBoost method, the following steps must progress properly for every feature of various categories 48,61 : 1. To form a random subgroup of the available records 2. Convert labels to integer 3. According to Eq. (7), features related to categories must be transformed into numeric form: In the given formulation, targets are counted by countInClass. Each target is assigned with some categorical features, each having a value of one, and all the previous objects will be counted in totalCount (the prior ones which are needed in counting the objects are determined by the initiating parameters) 48,61 . Random forest (RF). An ensemble of various decision trees forms a random forest, in which trees are being trained in parallel. The superiority and the importance of each decision tree is determined by the algorithm 62 . Additionally, a built-in property of RF classifier which is utilized to select various features makes the RF able to manage different inputted features without need for removing a number of parameters for reduction of dimension 63 . In the modeling, in order to enhance the diversity of trees of the forest, the RF technique employs a method known as Bagging (which stands for bootstrap aggregating). Typically, the population of trees are given as an input to the model, and accordingly the model will divide data points into various sets. Being a type of random sampling methods, bagging employs just a third of data points for the training step of a subtree development process and the remaining data points are referred as the out-of-bag (OOB). Furthermore, in the model  49 . In Fig. S2, the strategy of RF method is illustrated. A successful training process will happen if a training sample is given to the model as a requirement. If there is a training data set in the form of D = x 1 .y 1 . x 2 .y 2 . · · · x n .y n , the defined training dataset for tree h t will be denoted by D t , and the resulting prediction of the out-of-bag dataset of sample x will be H oob , as it is given as follows 49 : For the purpose of the modeling, the error of the OOB dataset is generalized as follows 49 : The RF's operation should be random and this feature is controlled by a parameter formulated as k = log 2 d 49 . The significance of a characteristic of a variable X i could be measured using the following expression 49 : Accordingly, in the X vector, the i th factor is denoted by X i , B depicts how many trees exist in the current RF, the predicted error of the OOB samples is defined by OOBerr t i , which stands for the feature X i of tree t , and finally the initial OOB data samples are given as the OOBerr t , which includes the permuted variables.
The significance of the character permutation process illustrates how much a feature is salient for the estimation. Resultantly characteristic permutation severely changes the model's estimation and it can be conclusively observed that an insignificant feature possess a very limited power for changing the prediction of the model 64 . Figure 2 shows the feature selection and classification using different algorithm for predicting CO 2 adsorption on MOFs.  Table 1 represents the details of the data used in this study. Moreover, Table 2 lists down the statistical details of the whole dataset which was collected for the purpose of this investigation. Trying www.nature.com/scientificreports/ to conceptualize and perceive the effect of various parameters on the MOF's CO 2 uptake capacity, the authors have incorporated temperature (K), pressure (bar), surface area (m 2 /g) and pore volume (cm 3 /g). To provide a precise and reliable set of models, deliberately about 80% of data points were devoted to model establishment and training phase and just about 20% were considered for testing phase. Therefore, the models' preciseness and trustworthiness were ensured by utilizing two statistical factors, namely root mean-square error (RMSE) and coefficient of determination (R 2 ) were used 65-67 : Outlier detection. Having a range of uncertainties and being associated with outliers, experimental data can introduce errors in the modeling process. To prevent any undesirable and not reliable outcome, the experimental models must undergo data evaluation and outlier detection. For the case of CO 2 adsorption on various MOFs, the faulty data points will enormously impact the preciseness of the predicting models. In the outlier detection process, as soon as a notable deviation of a data point from the others is detected, it will be recognized as an outlier. This study seeks benefits from the well-known leverage value procedure for spotting an outlier 68 .  Table 2. Statistical details of the dataset gathered in this paper. In the abovementioned formula, X represents a matrix with a size of N × P, where the total number of data points is depicted by N and the number of inputs' features is denoted by P. Additionally, an alarming leverage value was determined as follows 37 :

Results and discussion
Model development. For predicting the amount of CO 2 absorbed on the surface of various MOFs, we developed various models using XGboost, LightGBM, Catboost and RF methods. In order to avoid overfitting, the grid search was used to optimize hyperparameters of models. Each model's grid search hyperparameters were different, and the importance of the hyperparameters was determined by theoretical and practical considerations. For each model, the following hyperparameters were employed. Table S1 shows the optimal values of the major hyperparameters, as well as the search intervals for the hyperparameters set for the machine learning models used in this study.
In Table S1, n_estimators represents the number of trees, subsample shows subsample ratio of columns when constructing a tree, C denotes a degree of importance that is given to misclassifications, max_depth is maximum depth of a tree, feature_fraction is parameters randomly selected in each iteration for building trees, learning_rate controls the impact of each tree on the final outcome.

Model implementation and accuracy evaluation.
The most salient feature of each model is its accuracy and trustworthiness. In Fig. 3 and Fig. S3, the predicted CO 2 adsorption capacity is illustrated versus the experimental data for the implemented models, in which the closer predicted data to the experimentally obtained data (forming a y = x line), the more accurate the model is. In the given figure, almost all models illustrate a slope very close to unity which prove that all of them can be reliable when anticipation of real data is needed. As an important statistical criterion, the calculated error corresponding to every data point is depicted in Fig. 4 and Fig. S4, which provide the readers with comprehensive information about the preciseness of the proposed models. Delineated in the given illustrations, data points fluctuate closely around the zero-error line which indicates that the models were developed well. According to this figure, the calculated error for RF, Catboost, LightGBM, and XGBoost are limited to ± 10, ± 8, ± 7, and ± 12.5, respectively. The cumulative frequency of errors of each developed model is illustrated in Fig. 5. As it can be seen, 95% of the data predicted by XGBoost model has an error less than 1%. Moreover, Catboost model could predict 89% data with errors less than 1%.
In Table 3, the accuracy of each model is reported based on statistical criteria. As it can be perceived from this table, having an R 2 of 0.9992 and 0.9733 in the training and testing steps of modeling, the XGboost is the most trustworthy model followed closely by the Catboost model. Moreover, the calculated RSME for every model is given in Fig. S5 from which it is obvious that the XGBoost model is the most aureate model having an RSME of 0.568. Following closely, the LightGBM approach had the second lowest RSME value. According to both of RSME and R 2 criteria, RF was determined as the poorest predicting model. www.nature.com/scientificreports/ With increasing pressure, the saturation process of the MOFs' pores will be facilitated according to the room-temperature isotherms. From these isotherms, it can be perceived that the uptake capacity of each MOF is qualitatively related to its surface area and among all types of assessed MOFs in the current study, the MOF-177, BeBTB, MOF-5 and PCN-11 showed outstanding and considerable CO 2 adsorption capacity. As the pore sizes are greater and more efficient, the contribution of the pressure regime appearance gets more influence on the CO 2 uptake capacity 69,70 . Comparison among all mentioned porous materials, the MOF-177 had a maximum capacity of 33.5 mmol/g, because of its high surface area (4508 m 2 /g) nearly twice of IRMOF-11. Figure 6a illustrates a detailed representation of uptake capacities in different pressures at 298 K for the investigated MOFs. Also, Fig. 6b    It has also been discovered that the amount of CO 2 adsorption on the MOF structure is directly related to the surface area and the polarity of the surface. As these features are higher, more CO 2 will be absorbed and even in low pressures a high adsorption will be achieved using some MOFs like Cu-BTTri and Mg 2 (dobdc). Moreover, operating at a high pressure significantly affects the CO 2 adsorption capacity. Regarding Co-BDP, a step-like appearance for the adsorption isotherm can be seen, which is potentially resulted from a phenomenon known as the gate opening. This event takes place due to the notable flexibility of the MOF's framework 71,72 . This claim must not be interpreted as if the other MOFs are not efficient and practical in CO 2 removal endeavors, but it means that MOF-177 has an excellent CO 2 removal capacity at 35 bar 51 . Furthermore, as presented in Fig. 7, dependency of adsorption capacity on the operational temperature is considerable, www.nature.com/scientificreports/ and generally for MOF-5, the CO 2 uptake capacity was higher at room temperature than its capacity at 313 K. Thus, temperature negatively affects the carbon dioxide adsorption efficiency. In other words, with increasing temperature, the adsorption capacity decreases. Ability of various models in prediction of CO 2 adsorption capacity at low pressure ranges is shown in Fig. S6, in which the models were utilized to estimate CO 2 removal efficiency of Mg 2 (dobdc) at 313 K. Among all models, the XGBoost and CatBoost were found to be the most trustable models. Moreover, this model can be successfully compared with different isotherm models. As illustrated in Fig. 8  Microporous MOFs are new emerging generation of promising adsorbents which can drastically improve the CO 2 removal efficiency due to their high selectivity ranges. Using these materials, a higher CO 2 uptake capacity could be achieved 73 specially between 5 and 40 bar which is appropriate for the separation of CO 2 and H 2 74,75 . In power plants in which coal is mainly consumed for generation of power, a wide range of pollutants will be emitted to the environment including carbon dioxide. As a highly efficient process, pre-combustion capture of CO 2 can take place in the integrated gasification and combined cycle method. In this system, carbon dioxide could be separated from H 2 and removed from the process 5,65,76 . Additionally, the intrinsic feature of the surface of these materials will result in a better and more powerful interaction between CO 2 and the surface of the MOFs 77 . Due to their promising removal efficiency, many scientists have paid attention to integration of these materials into various industries including gas separation and purification processes in the petroleum industry specially when separation of CO 2 /N 2 78 , CO 2 /CH 4 71 , and O 2 /N 2 79 are desired. To successfully investigate the adsorption process, a profusion of experimental endeavors has been done and lot of isotherm models have been developed. Nevertheless, a range of drawbacks have prevented both of experimental works and theoretical approaches to successfully be used. Firstly, the designing and conducting experiments are time consuming and costly. Additionally, a wide range of various assumptions must be made to simplify the problem when the currently available adsorption isotherms are going to be utilized. Resultantly, not only a big inventory of data is needed, but also modeling process encounters errors from first the steps of the investigation. These errors will be exacerbated if a set of needed data is not accessible 31 . Therefore, application of cost-effective, time-saving, robust, and simple model is vital in the investigation and prediction of CO 2 adsorption capacity by MOFs. The presented investigation has dealt with the adsorption of CO 2 , as one of the main causes of global warming issue 5 , on MOFs, which are known as very promising materials for CO 2 removal, by applying some well-known soft computing techniques. The most outstanding feature of these models was this fact that no limitation for the development of these comprehensive models were needed to be applied. Therefore, being capable of handling a wide range of data, the implemented models have estimated the adsorption capacity at various pressures and temperatures accurately. However, the models' performance and reliability are highly dependent on the selection of appropriate input parameters 1 . Additionally, successful application of smart models requires a large number of data points and excellent skills of programming.
Based on the calculated Hat values, an area limited by standardized residuals of ± 3 and 0 ≤ H ≤ H * is detected as the acceptable region. H* is determined for every model. The model is valid if majority of the data points are being located in the determined region. Figure 9 shows the Williams plot relating to the proposed XGBoost model as the best approach in this study. According to this figure, almost all data points were inside the acceptable region

Sensitivity analysis.
To investigate the impact of various parameters on the adsorption capacity of MOFs, a relevancy factor was defined as follows 80-82 : in which N, X k,i , Y i , X̅ k , Y̅ stand for the number of assembled data points, the k th parameter of i th input value, i th output data, mean of the kth input parameter, and the average of the outputs, respectively. Laying between ± 1, the relevancy factor depicts how much an input parameter affects the CO 2 removal capacity of the MOFs. As the absolute value of the relevancy factor corresponding to a specified parameter increases, the carbon dioxide adsorption capacity will change more dramatically with any change in that parameter. The calculated relevancy factor for each affecting parameter on the adsorption of CO 2 on various MOFs is illustrated in Fig. 10. According to the calculated r values, among all investigated parameters, only temperature negatively impacts the adsorption capacity having an r = − 0.114. Furthermore, it was found that pressure and surface area of MOFs have the most and the second most significant effects on the CO 2 adsorption capacities possessing "r" factors of 0.478 and 0.337, respectively. Pore volume also was found to have a positive impact on the CO 2 adsorption capacity of MOFs as provides more adoption sites for CO 2 molecules to be captured by them.

Conclusions
The adsorption capacity of various MOFs for CO 2 capture was modeled by development of robust models using 1191 data points. The data was used to train and test the XGBoost, LightGBM, CatBoost, and RF approaches and it was found that the XGBoost as the best fitting model and the Catboost as the second most trustworthy model (15)   www.nature.com/scientificreports/ predicted the CO 2 adoptions more precisely than other methods. The R 2 and RMSE values of 0.9955 and 0.5682, 0.9659 and 1.5712, 0.9837 and 1.0853, and 0.9466 and 1.9667 were obtained for XGBoost, CatBoost, LightGBM, and RF approaches, respectively. Additionally, it was found that all parameters except temperature have positive impact on the CO 2 capture by MOFs and temperature had the smallest influence on the CO 2 uptake capacity. It has also been discovered that the amount of CO 2 adsorption on the MOF structure is directly related to the surface area and the polarity of the surface. As these features are higher, more CO 2 will be absorbed and even at low pressures, a high adsorption could be achieved using some MOFs like Cu-BTTri and Mg 2 (dobdc). Langmuir isotherm model showed well-fitting with correlation coefficient of 0.999 and 0.980 for PCN-11 and Mg-MOF-174, respectively, but totally the XGBoost models represented better fitting correlation with R 2 values of 0.998 for both MOFs. The obtained results of the current work could be beneficially utilized in the environmental studies such as carbon capture and gas separation and purification.