Abstract
We focus on gas sorption within metal-organic frameworks (MOFs) for energy applications and identify the minimal set of crystallographic descriptors underpinning the most important properties of MOFs for CO_{2} and H_{2}O. A comprehensive comparison of several sequential learning algorithms for the optimization of MOF properties is performed and the role played by those descriptors is clarified. In energy transformations, thermodynamic limits of important figures of merit crucially depend on equilibrium properties in a wide range of sorbate coverage values, which is often only partially accessible, hence possibly preventing the computation of desired objective functions. We propose a fast procedure for optimizing specific energy in a closed sorption energy storage system with access only to a single water Henry coefficient value and to the specific surface area. We are thus able to identify hypothetical candidate MOFs that are predicted to outperform state-of-the-art water-sorbent pairs for thermal energy storage applications.
Introduction
Metal-organic frameworks (MOFs) are crystalline compounds consisting of metal ions and organic linkers, characterized by tunable porosity and exceptionally high surface area^{1}. Due to their properties, MOFs have recently attracted remarkable attention in a wide range of different fields, including gas/vapor separation^{2}, reaction catalysis^{3}, drug delivery^{4}, energy storage, and heat transformations^{5,6}. Given their nature as porous adsorbent materials, intensive research is focused on the use of MOFs for CO_{2} capture, towards the development of effective technologies for mitigating greenhouse gas emissions^{7}. Recently, MOFs have also been employed in adsorption-based atmospheric water harvesting driven by solar thermal energy^{8,9}. In general, when dealing with applications of engineering relevance, different inlet gas streams, variable operating conditions, and target properties tailored to each specific case make it challenging to identify an ideal MOF crystal for all applications^{10}, thus leading to a fragmented case-by-case optimization problem.
Hence, MOFs require proper and efficient methods for tailoring their features according to target properties of interest in each specific application. The latter is anything but an easy task. In fact, due to the myriad of degrees of freedom for MOF structure and composition, more than 100 trillion compounds have been hypothesized^{11}, while almost 100,000 have been synthesized so far^{12}. High-throughput computational screening and machine learning have been recently adopted to analyse large MOF datasets^{13}. Such computational tools allow us to identify significant correlations between nanoscale features and observable macroscale properties^{14,15}, and to select the most suitable crystal for a given application case. A few representative examples are provided by gas-gas separation (D_{2}/H_{2}^{16}, O_{2}/N_{2}^{17}, CO/N_{2}^{18}, CO_{2}/H_{2}^{19}, ethane/ethylene^{20}, and other gas mixtures^{21}), the enantioselectivity of chemical compounds^{22}, gas adsorption (CO_{2}^{23}, CH_{4}^{24}, H_{2}^{25}, thiol^{26}, organosulfurs^{27}, and acetylene^{28}), and combinations thereof^{29,30}. Several computational explorations of MOF datasets have also been carried out for biomedical (drug delivery^{31}), mechanical (CO_{2} Brayton cycle^{32} and osmotic heat engine^{33}), and energy applications (heat pumps/chillers^{34,35} and thermal energy storage^{36}).
In this context, modern sequential learning (SL) algorithms are emerging as particularly efficient tools for exploring the high-dimensional (crystallographic) feature space of materials. In particular, while evaluating an objective black-box function through demanding physical or numerical experiments, SL tools can provide a well-orchestrated procedure to rationally navigate the high-dimensional parameter (feature) space^{37,38}. Thus, given an initial pool of evaluation points, one can sequentially choose the next experiment to carry out^{39,40}, without relying on naive random guessing. Rather general techniques have been proposed in the area of materials science holding the promise to accelerate materials discovery and research^{41}, with a number of authors reporting successful use of SL approaches in this field. Aggarwal et al.^{42}, by means of optimal experimental design, successfully characterized a substrate under a thin film. Seko et al.^{43} found the compound with the highest melting temperature in a given ensemble of candidate materials with fewer attempts than a naive random choice. Kiyohara et al.^{44} accelerated the search for a stable interface structure with respect to a traditional brute-force approach. Dehghannasiri et al.^{45} efficiently guided experiments to design the shape memory alloy with the lowest energy dissipation at a given temperature. Needless to say, the identification of the parameter space is a very important preliminary step when implementing SL algorithms. Here, we choose to specifically focus on MOF properties as gas/vapor sorbent materials, since those are particularly relevant for energy applications.
The first important objective of this work is the identification of the minimal set of MOF features (or descriptors) ruling critical adsorption properties in the low-coverage regime, i.e., the Henry solubility coefficients for both CO_{2}-MOF and, importantly, H_{2}O-MOF working pairs. The above minimal set represents the important crystallographic features underpinning a given adsorption property of interest. In this sense, each minimal set of descriptors can be regarded as the genetic code for a given property, and it is identified as described below.
First, we curate and enhance MOF data from a recently developed library made of 8206 compounds generated computationally^{46}. Each Crystallographic Information File (CIF) representing a given material is first featurized by means of 1557 Classical Force-field Inspired Descriptors (CFID)^{47} taking into account both chemical and structural parameters. Subsequently, we train and validate regression models of target properties involved in heat storage applications, such as Henry coefficients and working capacity. These models are obtained by means of a Random Forest-based pipeline, with hyperparameter tuning in five-fold cross-validation. The final ranking and selection of the minimal set of descriptors is performed by evaluating the importance of each feature on the model outputs by means of the Tree SHAP interpretation algorithm^{48}.
Upon identification of the above crystallographic genetic code of sorption properties in MOFs, we investigate its role when using SL algorithms. Therefore, we compare the performance of three different SL methodologies aiming at maximizing H_{2}O and CO_{2} Henry coefficients, and CO_{2} working capacity: (a) Random Forests with Uncertainty Estimates for Learning Sequentially (FUELS^{49}); (b) kriging algorithm^{50}; (c) COMmon Bayesian Optimization Library (COMBO)^{51}. For each SL methodology, we compare several strategies for choosing the next material to test, combining the exploration of high-uncertainty regions with the exploitation of high-performing candidates. Importantly, we analyse the SL performance using both the minimal subset of features (from the pipeline and SHAP analysis) and a larger set of variables, to highlight how the identification of descriptors affects the minimum number of experiments needed to pick out a MOF with the highest value of the desired property. In Fig. 1 the above procedure is schematically represented.
We highlight that sorption-based engineering applications rely upon sorbent material characterization in a wide coverage range. However, when a large number of hypothetical sorbents (here MOFs, but in principle also zeolites^{52,53} or other materials^{54}) have to be evaluated as potential candidates, only low-coverage characterization (i.e., the Henry coefficient) is often accessible, thus making any optimization of crucial figures of merit of engineering relevance challenging. We thus formulate a procedure aiming at a fast evaluation of one of the most important figures of merit in closed water-sorption seasonal thermal energy storage applications, namely the material-based specific (stored) energy. Unlike traditional sensible or latent systems^{55}, the above sorption-based energy storage technologies have the advantage of being loss-free. Our procedure can thus be used in SL-based (or other) optimization/screening processes of MOFs even under incomplete knowledge of the entire isosteric field of the candidate working pairs. Applied to the database of over 5000 computationally generated (hypothetical) compounds by ref. ^{46} (developed for different purposes), our procedure identifies MOFs that can possibly outperform state-of-the-art sorbent materials for thermal energy storage.
Results
In this work, a crucial source of data on MOFs stems from the dataset of Boyd et al.^{46}, where important sorption properties (e.g., the Henry coefficients for CO_{2} and H_{2}O, the working capacity for CO_{2}, and the specific surface area) have been computed by DFT-based simulations for over 8000 potential MOFs. Capitalizing on the above comprehensive study, we construct machine learning (ML) models capable of accurately predicting MOF solubility of both CO_{2} and H_{2}O as well as CO_{2} working capacity and surface area. The above models enable us to achieve the first key result of this work, namely the identification of the minimal set of crystallographic-based descriptors^{56} ruling these sorption and geometric properties in MOFs.
Moreover, a systematic comparison of SL approaches on the above MOF database reveals important conclusions on the performance of the different regression schemes adopted and, most importantly, the role played by the selection of the feature space to be explored. Those conclusions are also supported by results obtained on a highly controllable synthetic dataset, as discussed in detail in Supplementary Note 1.
It is worth stressing that properties reported in ref. ^{46} only characterize MOFs in the Henry regime and are not sufficient to describe the equilibrium sorption properties in the high-coverage regime. However, when targeting important engineering applications such as seasonal thermal energy storage, key figures of merit of the storage plant (e.g., the material-based specific energy) critically rely upon access to the entire isosteric field of the chosen sorbent-sorbate pair or, equivalently, upon the knowledge of equilibrium adsorption isotherms at several temperature values^{57,58}. The latter isotherms describe the adsorption properties (at equilibrium) of sorbents in a wide range of coverage values, from the Henry regime up to the saturation pressure of the sorbate fluid.
Therefore, in this work, we also propose an approach enabling us to optimize one of the most important engineering figures of merit of MOFs for seasonal thermal energy storage applications (i.e., material-based specific energy in an ideal closed sorption cycle, see Methods, subsection Water-sorption thermal energy storage, and Results, subsection Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs), even if only an incomplete set of sorption properties is (experimentally or numerically) accessible. Based on the latter optimization procedure, we are finally able to identify potential MOF candidates for seasonal thermal energy storage that can possibly outperform most of the current state-of-the-art sorbent materials.
Descriptors of sorption properties in MOFs and their use in SL algorithms
We constructed four MOF datasets, each one with the same 1557 features and a different target property among Henry coefficient for CO_{2} (8194 data entries), working capacity for CO_{2} (8202 data entries), Henry coefficient for H_{2}O (8202 data entries), and surface area (5028 data entries). The different numbers of data entries are due to missing values for some of the chosen properties in the available database by ref. ^{46}.
The above database also reports, for all the compounds, both the crystallographic file (used to extract the 1557 CFID^{47} by means of Matminer^{56}) and a list of DFT-computed properties, among which we have only considered the above-mentioned adsorption properties of interest. More specifically, the computed 1557 explanatory variables (also referred to as descriptors) proposed by Choudhary et al.^{47} consist of a set of both chemical (e.g., average chemical properties over the elements in the cell, average atomic radial charge) and structural (e.g., distribution functions) quantities. More details on descriptor subcategories are reported in Table 1.
We have trained four different ML models by means of a Random Forest-based pipeline with hyperparameter tuning in five-fold cross-validation to predict the Henry coefficient for CO_{2}, the working capacity for CO_{2}, the Henry coefficient for H_{2}O, and the surface area, achieving coefficients of determination of R^{2} = 0.785, R^{2} = 0.590, R^{2} = 0.874, and R^{2} = 0.924, respectively. We have used 80% of each dataset to train the models, and the remaining 20% to validate them. Since the Henry coefficient values span a few orders of magnitude, the corresponding ML models have been developed in terms of the natural logarithm of those properties. During the data preprocessing routines, each of the four trained pipelines (i.e., feature reduction and ML with hyperparameter tuning, see Supplementary Note 10 for details) already drops a significant number of the 1557 features, thus confirming that many of the initially selected descriptors do not significantly affect the chosen adsorption properties. More specifically, the final models include 237 descriptors for the CO_{2} Henry coefficient, 236 descriptors for CO_{2} working capacity, 177 descriptors for the H_{2}O Henry coefficient, and 234 descriptors for surface area.
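As a minimal sketch of how a coefficient of determination can be evaluated on a log-transformed target, the snippet below computes R^{2} on ln(H). It assumes that the score is taken on the logarithmic scale, which the text suggests but does not state explicitly; the function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def log_r_squared(henry_true, henry_pred):
    """R^2 evaluated on ln(H), because H spans orders of magnitude."""
    return r_squared(
        [math.log(h) for h in henry_true],
        [math.log(h) for h in henry_pred],
    )


# Hypothetical Henry coefficients (arbitrary units), for illustration only.
true_values = [0.5, 5.0, 50.0, 500.0]
predicted = [0.6, 4.0, 55.0, 450.0]
score = log_r_squared(true_values, predicted)
```

Working on the logarithm prevents the few largest Henry coefficients from dominating the sum of squared residuals.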
Then, the Tree SHAP routine^{48,59} allows us to identify the most meaningful descriptors as those accounting for 75% of the cumulative curve over the coefficients of importance. The SHAP routine identifies the most meaningful features for the trained models among the reduced set of descriptors retained by the trained pipelines after the preprocessing. In particular, the impact of a descriptor depends on the comparison between the output of a model trained with that feature and the output of another model trained without that feature (see Methods, subsection Model training and choice of descriptors). The coefficients of importance are thus computed over the testing set, i.e., over samples the model has never encountered during the training.
Overall, starting from the original 1557 Classical Force-field Inspired Descriptors, 33 items for the CO_{2} Henry coefficient, 62 for the CO_{2} working capacity, 15 for the H_{2}O Henry coefficient, and 14 for the surface area are found to explain 75% of the corresponding regression models. Furthermore, we have repeated an analogous procedure with AutoMatminer^{60}, which allows us to automatically train and validate a complete pipeline (feature reduction, data cleaning, and machine learning) with automatic hyperparameter tuning, without cross-validation. Results are reported in Supplementary Note 2. Model performances with the corresponding cumulative importance curves of the ruling descriptors are reported in Fig. 2. The H_{2}O Henry coefficient is key for the computation of the specific material energy in a MOF-based thermal energy storage plant (as discussed below in Results, Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs). Hence, for the H_{2}O Henry coefficient only, we also report a few experimental values corresponding to real MOFs to compare with numerical predictions of hypothetical MOFs. In particular, numerical predictions are in good agreement with the experimental values. This aspect is key for material screening purposes and is further highlighted in the Discussion section below.
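The 75% cumulative-importance cut can be sketched as follows. This is an illustrative reading only: it assumes that each descriptor carries a single non-negative importance score (for instance, a mean absolute SHAP value), and both the descriptor names and the scores below are placeholders rather than values from this work.

```python
def select_by_cumulative_importance(importances, threshold=0.75):
    """Return the top-ranked names whose normalized importances
    cumulatively reach `threshold`."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        if cumulative >= threshold:
            break  # the retained descriptors already explain enough
        selected.append(name)
        cumulative += value / total
    return selected


# Hypothetical importance scores; names are placeholders.
mean_abs_shap = {
    "volume_per_atom": 0.40,
    "nn_distribution": 0.25,
    "elastic_constant": 0.15,
    "packing_fraction": 0.12,
    "other": 0.08,
}
minimal_set = select_by_cumulative_importance(mean_abs_shap)
```

With these placeholder scores the first three descriptors are retained, since the first two alone cover only 65% of the total importance.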
Importantly, Fig. 3 shows the SHAP rankings of the five most meaningful descriptors for each of the properties of interest. Table 2 summarizes the physicochemical meaning of the identified descriptors, based on the complete list by Choudhary et al.^{47}. The entire list of variables, together with their cumulative importance, the trained models, and the datasets on which they have been trained are publicly available online (see Data availability and Code availability).
As far as the four properties of MOFs are concerned, the identification of those important descriptors represents per se an advancement of knowledge on sorption mechanisms in MOFs and an important contribution of this work.
Focusing our attention on the top three descriptors in terms of importance, we notice that:

As far as the Henry coefficient for CO_{2} (see Fig. 3a) is concerned, such a property happens to be mainly ruled by the volume per atom, the nearest neighbor distribution, and an elastic constant. The first descriptor is the volume of a cell divided by the number of atoms in the cell, thus mainly representing a geometrical feature of the MOF crystal. The nearest neighbor distribution can also be regarded mainly as a geometrical feature strictly related to the crystallographic structure. The elastic constant is one of the elements of an averaged 6 × 6 elastic tensor of the cell. In particular, for each of the chemical elements of a cell, the elastic tensor of the solid ground state structure at temperature 0 K is known. The weighted averages of those tensor entries over the chemical elements of the cell are the elastic constants found by the featurizer. To our best interpretation, the latter two quantities effectively describe the chemical environment of the crystal, thus underpinning the sorbent-sorbate interaction potential.

As far as the working capacity for CO_{2} (see Fig. 3b) is concerned, this quantity is mainly ruled by nearest neighbor distribution and volume per atom. Therefore, as expected, geometric features are playing a key role in determining the maximum CO_{2} uptake into the crystal.

As far as the Henry coefficient for H_{2}O (see Fig. 3c) is concerned, we found that this quantity is mainly ruled by an elastic constant, the radial distribution function, and nearest-neighbor distributions. As such, as opposed to the CO_{2} case above, we can observe that descriptors more related to the chemical environment are needed this time, since a polar molecule is interacting with the MOF structure.

Finally, for the surface area (see Fig. 3d), we found that it is mainly ruled by volume per atom, packing fraction, and nearest neighbor distribution, thus showing the prominence of geometrical features similar to the case of the above working capacity.
Upon the identification of the above lists of descriptors, we have compared the performance of SL algorithms for the sorption properties of interest using both the reduced set of important descriptors (i.e., those explaining 75% of the models predicting CO_{2} Henry coefficient, CO_{2} working capacity, and H_{2}O Henry coefficient, respectively) and a larger set of 100 descriptors composed of the previous and some additional (non-meaningful) ones.
In particular, for the aforementioned sorption properties, we have chosen the non-relevant features as the ones obtaining the lowest scores in the SHAP ranking. SL was adopted to find the maximum property value among a random subset of 500 samples from the original datasets (over 8000 MOFs), starting from a pool of 100 points with the lowest target property. Unexpectedly, SL optimization in the space of relevant descriptors does not ensure, in general, faster convergence of the procedure to the optimum property value (this is also confirmed by results in the synthetic case reported in Supplementary Note 1). Those six spaces (two domains for each of the three properties) are reported in Supplementary Note 9, as t-SNE projections over two components^{61}. Furthermore, among the three methodologies examined, both Random Forest- and COMBO-based methods always provided faster convergence to the optimum value than the random choice strategy. Results are shown in Fig. 4. As Ling et al. point out^{49}, a purely exploitative strategy (RF-MEI, K-MEI, COMBO-PI) performs better when, already in the very first steps, the model is able to predict with high accuracy the value of the property of interest; conversely, a purely explorative strategy (RF-MU, K-MU) is more appropriate if the optimum is very different from all the other candidates, while the remaining strategies (RF-MLI, K-MLI, and COMBO-EI) are a trade-off between those two extremes.
We notice that, across different regression methodologies and even with the same acquisition function, the performance ranking in general changes. For instance, in the case of the Henry coefficient for H_{2}O in Fig. 4c, among the Random Forest-based strategies, RF-MLI is the top performer, followed by RF-MEI and RF-MU, for both the 15- and 100-dimensional input spaces; the very same ranking applies to the Kriging-based acquisition functions, K-MLI, K-MEI, and K-MU. In the case of the working capacity for CO_{2} in Fig. 4b, instead, among the Random Forest-based strategies, RF-MEI is the top performer, followed by RF-MU and RF-MLI; on the contrary, the same ranking does not apply to the Kriging-based strategies, since K-MEI is the top performer, followed by K-MLI and K-MU, for both the 62- and 100-dimensional input spaces. An intermediate case is represented by the Henry coefficient for CO_{2} in Fig. 4a, where the ranking of the Random Forest-based strategies (RF-MLI, RF-MEI, RF-MU, for both the 33- and 100-dimensional input spaces) is preserved by the Kriging methodology only for the 100-dimensional space (K-MLI, K-MEI, and K-MU), but not for the 33-dimensional one (K-MEI, K-MLI, and K-MU). Among the COMBO-based acquisition functions, COMBO-PI and COMBO-EI always require a similar number of evaluations, except in the 100-dimensional input space of the Henry coefficient for CO_{2}, and the 100-dimensional input space consistently performs better than the corresponding low-dimensional one.
This brief comparison shows that no general rule is valid a priori for SL, neither in terms of regression methodology, nor in terms of acquisition function, nor in terms of dimensionality of the problem, even when relevant descriptors are selected.
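The difference between exploitative, explorative, and trade-off strategies can be illustrated with a small sketch. It assumes that the surrogate model returns a predicted mean and a standard deviation for each untested candidate, and that MEI, MU, and MLI select, respectively, the highest predicted value, the highest uncertainty, and the highest probability of exceeding the best value measured so far under a Gaussian predictive distribution. These definitions, the function names, and the candidate values are assumptions made for illustration and are not taken from the implementations compared in this work.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def improvement_probability(mean, std, best_so_far):
    """Probability that a Gaussian prediction exceeds the current best."""
    if std <= 0.0:
        return 1.0 if mean > best_so_far else 0.0
    return normal_cdf((mean - best_so_far) / std)


def pick_next(candidates, best_so_far, strategy):
    """candidates: list of (name, predicted_mean, predicted_std)."""
    if strategy == "MEI":    # exploitation: highest predicted value
        score = lambda c: c[1]
    elif strategy == "MU":   # exploration: highest predictive uncertainty
        score = lambda c: c[2]
    elif strategy == "MLI":  # trade-off: most likely to beat the best
        score = lambda c: improvement_probability(c[1], c[2], best_so_far)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return max(candidates, key=score)[0]


# Hypothetical surrogate predictions for three untested candidates.
pool = [("mof_a", 1.00, 0.05), ("mof_b", 0.20, 0.90), ("mof_c", 0.95, 0.30)]
choices = {s: pick_next(pool, best_so_far=1.2, strategy=s)
           for s in ("MEI", "MU", "MLI")}
```

In this toy pool each strategy selects a different candidate, which mirrors the observation that the ranking of acquisition functions depends on how accurate and how uncertain the surrogate model is.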
Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs
Without losing generality, here we aim at investigating the expected performance of hypothetical (i.e., computationally generated) MOFs for an important energy engineering application, namely water-sorption seasonal thermal energy storage (see also the Methods section, subsection Water-sorption thermal energy storage). The most challenging aspect of this task consists in the access to the entire isosteric field of each candidate MOF-water pair for estimating the engineering figure of merit of interest. Clearly, for a large number of MOF candidates, this is challenging and time-consuming both computationally (typically, only the Henry low-coverage regime is reported in literature works^{62,63}) and experimentally^{64}. In this section, we specifically focus on such a challenging aspect. We envision an efficient optimization procedure that is capable of searching for MOFs with the largest expected figure of merit of engineering relevance (either by SL, if materials are sequentially synthesized/computed, or by accessing readily available databases^{65}). As far as seasonal thermal energy storage applications are concerned, here we focus on the highest specific energy of MOF-water working pairs among the compounds reported in ref. ^{46}. An overview of the proposed methodology is schematically reported in Fig. 5.
As detailed in the Methods section (subsection Water-sorption thermal energy storage), the ideal thermodynamic cycle of a closed sorption thermal energy storage system is completely defined by four operating temperatures. In this study, we assume T_{A} = 308 K (the minimum temperature on the user side), T_{C} = 353 K (the maximum temperature on the source side), T_{E} = 278 K (the average winter temperature), T_{F} = 303 K (the average summer temperature). Those temperature values are reasonable for space heating applications in temperate climates^{64}. Given the Antoine equation, the equilibrium water vapor pressures p_{E} = 866.2 Pa at the evaporator and p_{F} = 4231.6 Pa at the condenser are also uniquely defined considering the average winter and summer temperatures, respectively. We decided to evaluate and maximize over the database one of the most important engineering quantities in a thermal energy storage plant, namely the cycled heat per unit of material weight. While full details on the adopted models are given in the Methods section (subsection Water-sorption thermal energy storage), in the following we report and discuss the main simplifying assumptions in our approach:

A key quantity to be estimated is the H_{2}O working capacity. That quantity is related to the available adsorption sites n_{TOT} per unit of dry sorbent mass. Boyd et al.^{46} reported only the CO_{2} working capacity, while no data are available on the maximum H_{2}O uptake. Nonetheless, we can rely on other related properties, such as the specific surface area of MOFs. To this end, we notice that Chaemchuen et al. have reported H_{2}O working capacity for a pool of 66 MOFs^{5}. A good correlation between the water uptake and the surface area (i.e., the available internal surface per gram of dry adsorbent) can be observed for typical MOFs used in the energy engineering field, and this is also in line with results found by Xu et al.^{66}. In this work, on the basis of that correlation, we impose a linear regression for finding the constant of proportionality between water uptake and surface area (water uptake = η × surface area). This yields \(\eta =3.875\times 10^{-4}\,{\rm{g}}_{{\rm{H}}_{2}{\rm{O}}}\,{\rm{m}}^{-2}\). More details can be found in Supplementary Note 4. We also notice that, without loss of generality, if more accurate values of the water uptake are available (e.g., from numerical simulations of adsorption experiments) the above assumption on the uptake-internal surface correlation can be fully relaxed.

Henry coefficients \(\widetilde{H}({T}_{0})\) for H_{2}O are listed at the reference temperature T_{0} = 298 K with units of \({\rm{mol}}_{{\rm{H}}_{2}{\rm{O}}}\,{\rm{kg}}_{\rm{MOF}}^{-1}\,{\rm{bar}}^{-1}\), thus representing the moles of adsorbed H_{2}O per kilogram of dry MOF per bar of H_{2}O vapor. In our approach, we adopt the Frumkin–Fowler–Guggenheim (FFG) model to estimate the adsorption isotherm over the entire range of coverages only relying upon such Henry coefficient. However, as discussed in the Methods section (subsection Water-sorption thermal energy storage), the FFG equation requires H(T_{0}) in units of Pa^{−1}: we have thus converted \(\widetilde{H}({T}_{0})\) (readily available from ref. ^{46}) to H(T_{0}). Letting n_{s}, m_{MOF} and \({p}_{{\rm{H}}_{2}{\rm{O}}}\) be the number of adsorbed water moles, the mass of the hypothetical MOF, and the pressure of water in the vapor phase, respectively, it holds:
$$\widetilde{H}({T}_{0})=\frac{{n}_{\rm{s}}}{{m}_{\rm{MOF}}\,{p}_{{\rm{H}}_{2}{\rm{O}}}}.$$ (1)
The approximation of the low-coverage regime yields the linear relationship between the coverage and pressure, namely \(\theta =H({T}_{0})\,{p}_{{\rm{H}}_{2}{\rm{O}}}\). Since the number of adsorbed water moles is related to the molar-based total number of adsorption sites as n_{s} = θn_{TOT}, it follows:
$$H({T}_{0})=\widetilde{H}({T}_{0})\frac{{\mathcal{M}}_{\rm{MOF}}}{{n}_{\rm{TOT}}/{n}_{\rm{MOF}}},$$ (2)
with \({m}_{\rm{MOF}}={\mathcal{M}}_{\rm{MOF}}\,{n}_{\rm{MOF}}\), \({\mathcal{M}}_{\rm{MOF}}\) the molecular weight of the MOF, and n_{MOF} the total number of moles. Furthermore, the molar-based total number of adsorption sites n_{TOT} corresponds to the maximum number of water moles \({n}_{{\rm{MAX}},{\rm{H}}_{2}{\rm{O}}}\) that can be adsorbed, and the following relationship holds:
$$\frac{{n}_{{\rm{MAX}},{\rm{H}}_{2}{\rm{O}}}}{{n}_{\rm{MOF}}}=\frac{{m}_{{\rm{MAX}},{\rm{H}}_{2}{\rm{O}}}}{{m}_{\rm{MOF}}}\frac{{\mathcal{M}}_{\rm{MOF}}}{{\mathcal{M}}_{{\rm{H}}_{2}{\rm{O}}}},$$ (3)
where \({\mathcal{M}}_{{\rm{H}}_{2}{\rm{O}}}\) is the molecular weight of water and \({m}_{{\rm{MAX}},{\rm{H}}_{2}{\rm{O}}}\) denotes the maximum mass of water that can be adsorbed. The ratio \({m}_{{\rm{MAX}},{\rm{H}}_{2}{\rm{O}}}/{m}_{\rm{MOF}}\) is related to the H_{2}O working capacity of the MOF and it is equal to ηS, where η is the constant of proportionality between the uptake and the surface area S. A comparison of Equations (3) and (2) yields:
$$H({T}_{0})=\widetilde{H}({T}_{0})\frac{{\mathcal{M}}_{{\rm{H}}_{2}{\rm{O}}}}{\eta S}\times 10^{-8},$$ (4)
which is the final conversion formula of the Henry coefficient for H_{2}O from units of \({\rm{mol}}_{{\rm{H}}_{2}{\rm{O}}}\,{\rm{kg}}_{\rm{MOF}}^{-1}\,{\rm{bar}}^{-1}\) into Pa^{−1}. Here, the factor 10^{−8} appears because \([{\mathcal{M}}_{{\rm{H}}_{2}{\rm{O}}}]={\rm{g}}_{{\rm{H}}_{2}{\rm{O}}}\,{\rm{mol}}_{{\rm{H}}_{2}{\rm{O}}}^{-1}\), \([\eta ]={\rm{g}}_{{\rm{H}}_{2}{\rm{O}}}\,{\rm{m}}^{-2}\), \([S]={\rm{m}}^{2}\,{\rm{g}}_{\rm{MOF}}^{-1}\), and so \([\widetilde{H}({T}_{0}){\mathcal{M}}_{{\rm{H}}_{2}{\rm{O}}}/(\eta S)]={\rm{g}}_{\rm{MOF}}\,{\rm{kg}}_{\rm{MOF}}^{-1}\,{\rm{bar}}^{-1}\).
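The unit conversion of Equation (4) can be checked numerically with a short sketch. The constants η and the molar mass of water are the values quoted in the text; the function name is hypothetical, and the input pair is the one reported in this section for the best-performing candidate.

```python
ETA = 3.875e-4   # g_H2O per m^2, proportionality constant quoted in the text
M_H2O = 18.02    # g_H2O per mol_H2O


def henry_to_per_pascal(h_tilde, surface_area):
    """Convert a Henry coefficient from mol/(kg bar) to 1/Pa via Eq. (4).

    h_tilde      : Henry coefficient in mol_H2O / (kg_MOF bar)
    surface_area : specific surface area in m^2 / g_MOF
    """
    max_uptake = ETA * surface_area  # g_H2O per g_MOF
    # 1 g/kg = 1e-3 and 1 bar = 1e5 Pa, hence the overall factor 1e-8.
    return h_tilde * M_H2O / max_uptake * 1e-8


# Values reported in this section for the top-ranked hypothetical MOF.
henry_per_pa = henry_to_per_pascal(6110.54, 4118.79)
```

The result is close to the 6.89 × 10^{−4} Pa^{−1} quoted for that compound, which supports the negative exponent in the conversion factor.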

A crucial quantity for heat transformation is the isosteric heat of adsorption q_{st}. Due to the Clausius–Clapeyron relationship (see Equation (7) in the Methods section, subsection Water-sorption thermal energy storage), at least two adsorption isotherm curves (at T_{A} and at T_{C}) are needed to estimate the corresponding q_{st}. In our database, though, the Henry coefficients are only available at T_{0}. In order to reconstruct a second adsorption isotherm for the same MOF-water working pair, we decided to resort to the potential theory of Polanyi, thus exploiting the basic notion that all adsorption isotherms are self-similar when rescaled with respect to the Polanyi potential function. Details are provided in Supplementary Note 5. More specifically, the Polanyi potential is defined as:
$${\mathcal{A}}=RT\ln \left(\frac{{p}_{\rm{s}}(T)}{p}\right),$$ (5)
where p_{s}(T) is the saturation pressure of water at temperature T, while p is the pressure of the vapor phase on the adsorbent surface^{67}. Since, at a given pressure p, the Polanyi potential is a constant of the sorption pair, we have computed \({\mathcal{A}}\) at T_{0} in a range from 10^{−4} Pa up to p_{s}(T_{0}) = 3157 Pa (Antoine equation for water); then, in the θ − p chart, we have rescaled the abscissa p of the isotherm obtained at the temperature T_{0} according to \(p={p}_{\rm{s}}(T)\exp \left(-{\mathcal{A}}/(RT)\right)\), for getting the new curves at T_{A} and T_{C}. We have finally computed the isosteric heat by means of the Clausius–Clapeyron relationship:
$${q}_{\rm{st}}=\frac{R}{3}\frac{{T}_{\rm{C}}{T}_{\rm{A}}}{{T}_{\rm{C}}-{T}_{\rm{A}}}\mathop{\sum }\limits_{i=1}^{3}\ln \frac{{p}_{2}({\theta }_{i})}{{p}_{1}({\theta }_{i})},$$ (6)
where the points 1 and 2 represent the intersections of an isosteric transformation with the two isotherms at T_{A} and T_{C}, respectively. We have repeated the procedure for three coverage values (i.e., θ_{1} = 0.4, θ_{2} = 0.5, θ_{3} = 0.6) and averaged the results.

Finally, upon determination of the low- and high-temperature adsorption isotherm curves at T_{A} and T_{C}, the coverage span Δθ during the discharge phase can be determined as detailed in the Methods section, subsection Water-sorption thermal energy storage. As for the estimate of the isosteric heat, this requires for all the compounds the rescaling of the horizontal axis in the θ − p chart of the isotherm obtained at temperature T_{0} according to the Polanyi potential theory.
We have thus computed the objective function S q_{st} Δθ, which recovers the cycled heat up to a constant, over the entire list of 5028 potential MOFs with positive surface area in ref. ^{46}. The MOF with the highest predicted specific energy turned out to be the compound with the chemical formula C_{96}H_{48}O_{28}N_{8}V_{4}, referred to as “str_m5_o18_o28_sra_sym.72” in the database (see also the molecular rendering in Fig. 6b), with a Henry coefficient at 298 K of \(6110.54\,{\rm mol}_{{\rm H}_{2}{\rm O}}\,{\rm kg}_{\rm MOF}^{-1}\,{\rm bar}^{-1}\) (or equivalently, 6.89 × 10^{−4} Pa^{−1}) and surface area \(S=4118.79\,{\rm m}^{2}\,{\rm g}_{\rm MOF}^{-1}\). Figure 7 also shows the ideal expected thermodynamic cycle related to this optimal potential MOF.
We observe a coverage span Δθ = 0.653 and an isosteric heat \(q_{\rm st}=48.95\,{\rm kJ}\,{\rm mol}_{{\rm H}_{2}{\rm O}}^{-1}\), with an objective function value of \(1.32\times 10^{5}\,{\rm kJ}\,{\rm m}^{2}\,{\rm mol}_{{\rm H}_{2}{\rm O}}^{-1}\,{\rm g}_{\rm MOF}^{-1}\). That quantity can be directly related to specific energy: upon multiplication by the constant \(\eta /\mathcal{M}_{{\rm H}_{2}{\rm O}}\) (\(\eta =3.875\times 10^{-4}\,{\rm g}_{{\rm H}_{2}{\rm O}}\,{\rm m}^{-2}\), \(\mathcal{M}_{{\rm H}_{2}{\rm O}}=18.02\,{\rm g}_{{\rm H}_{2}{\rm O}}\,{\rm mol}_{{\rm H}_{2}{\rm O}}^{-1}\)), we obtain a value of \(2.83\times 10^{-3}\,{\rm GJ}\,{\rm kg}_{\rm MOF}^{-1}\). Furthermore, we can compute the theoretical density ρ_{MOF} of the crystal knowing the mass of the cell (1965.21 u = 3.263 × 10^{−21} g_{MOF}, as from the database) and its volume (5.335 × 10^{−21} cm^{3}, as from the CIF file), leading to ρ_{MOF} = 0.612 g_{MOF} cm^{−3}. As a result, the volume-based energy density turns out to be 1.73 GJ m^{−3}. For the sake of comparison, Fig. 6a shows a 2D map where the two axes represent the two most important descriptors according to the SHAP ranking for the Henry coefficient of H_{2}O: the four top-performing potential MOFs are highlighted and the corresponding crystallographic cells are depicted in Fig. 6b. Moreover, Table 3 shows the ten best-performing potential MOFs ranked in terms of specific energy. Interestingly, those compounds are all vanadium-based, mostly due to values of the Henry coefficient for H_{2}O in the optimal range, which produces a good coverage span Δθ over the thermodynamic cycle.
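As a quick consistency check on the figures quoted for the top-ranked compound, the unit conversions can be reproduced with a short script. The sketch below only re-does the arithmetic; it takes η = 3.875 × 10^{−4} g_{H2O} m^{−2}, the value consistent with the reported specific energy, and the variable names are illustrative.

```python
# Values quoted in the text for the top-ranked hypothetical MOF
delta_theta = 0.653          # coverage span, dimensionless
q_st = 48.95                 # isosteric heat, kJ per mol of water
S = 4118.79                  # specific surface area, m^2 per g of MOF
eta = 3.875e-4               # g of water per m^2 (proportionality constant)
M_H2O = 18.02                # molar mass of water, g/mol

# Objective function S * q_st * delta_theta, in kJ m^2 mol^-1 g_MOF^-1
objective = S * q_st * delta_theta

# Specific energy: kJ per g of MOF, then GJ per kg (1 kJ/g = 1e-3 GJ/kg)
specific_energy_kj_per_g = objective * eta / M_H2O
specific_energy_gj_per_kg = specific_energy_kj_per_g * 1e-3

# Crystal density from the unit-cell mass and volume
ATOMIC_MASS_UNIT_G = 1.66053907e-24
cell_mass_g = 1965.21 * ATOMIC_MASS_UNIT_G
cell_volume_cm3 = 5.335e-21
rho_g_per_cm3 = cell_mass_g / cell_volume_cm3
rho_kg_per_m3 = rho_g_per_cm3 * 1000.0

# Volume-based energy density in GJ per m^3
energy_density_gj_per_m3 = specific_energy_gj_per_kg * rho_kg_per_m3

print(f"objective       = {objective:.3e} kJ m^2 mol^-1 g^-1")
print(f"specific energy = {specific_energy_gj_per_kg:.3e} GJ/kg")
print(f"density         = {rho_g_per_cm3:.3f} g/cm^3")
print(f"energy density  = {energy_density_gj_per_m3:.2f} GJ/m^3")
```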
As depicted in Fig. 8, the four top-performing hypothetical MOFs are predicted to have (material-based) specific energy values among the highest available in the literature for sorption thermal energy storage under similar operating conditions. A more extensive comparison is shown in Supplementary Fig. 11, where estimates of the specific energy for real MOFs by means of the FFG model are also reported.
Discussion
Sequential learning (SL) algorithms can in principle dramatically reduce the number of evaluations needed for finding the optimum of an unknown function as compared with a naive random choice and, as such, they are also emerging as effective tools for material optimization and discovery. In this work, focusing on metal-organic frameworks (MOFs) and some of their crucial adsorption properties (both with H_{2}O and CO_{2} as sorbate fluids), we have addressed a number of critical aspects related to the discovery of the minimal set of important crystallographic descriptors for SL-based optimization algorithms. We have shown that the general protocol for sorting out the minimal set of ruling descriptors for a given adsorption property is based on two steps: (i) construction and training of the machine learning (ML) model which identifies the number of ruling descriptors; (ii) evaluation of the relative importance of each explanatory variable on the chosen output by the SHAP analysis. We found that, as long as the set of such ruling descriptors (for a given property of interest) is included among the exploration space features, convergence performance is not affected. However, the computational burden of an SL algorithm also depends on the dimension of the parameter space to be explored, so taking into account only the most relevant features may be beneficial in that respect. Furthermore, based on the several examples provided here (i.e., Henry coefficient for CO_{2}, working capacity for CO_{2}, and Henry coefficient for H_{2}O, as well as the synthetic example discussed in Supplementary Note 1), we have found that the COMBO algorithm performs better than random guessing in every case.
Furthermore, we recognize that full access to the adsorption properties of hypothetical MOFs in the entire coverage regime (as requested in important applications of engineering relevance) is very challenging both experimentally and computationally. This holds particularly for water-MOF working pairs, which are promising for a number of energy applications. Hence, we formulate a general and efficient computational screening procedure of hypothetical MOFs which, relying only upon the adsorption properties reported in Fig. 2, is capable of estimating important figures of merit for sorption-based seasonal thermal energy storage. Remarkably, our procedure suggests that some of the MOFs hypothesized in the database by ref. ^{46} (developed for completely different purposes) can possibly outperform most of the state-of-the-art water-sorbent compounds. Interestingly, those compounds are all vanadium-based, mostly due to values of the Henry coefficient for H_{2}O in the optimal range, causing a good coverage span over the thermodynamic cycle. It is worth noticing that the above MOF screening for thermal energy storage applications critically relies upon the prediction of the Henry coefficient for water.
Therefore, for the latter property only, we have also considered a number of real MOFs, and their reported experimental values have been compared with the corresponding numerical values. It is worth stressing that finding both the experimentally evaluated property values and the CIF file from the same publication is a nontrivial task, which may sometimes lead to inconsistent data. Furthermore, even for the three MOFs (i.e., MOF-801-SC, MOF-808, and MOF-841) considered in this study with experimental and computational data extracted from the same reference source, there is no guarantee that the tested material corresponds perfectly to the related CIF file, and discrepancies can always occur, as demonstrated by the authors of ref. ^{68} themselves. Moreover, the CIF files needed to extract the descriptors are always highly idealized compared with the experimentally tested crystals. Possible defects in real compounds can be related to any chemical changes, leading to variation in the hydrophilic nature of the material^{69,70,71}. Nonetheless, we observed that, while discrepancies can be found, the ML-based predictor of the Henry coefficient consistently shows good agreement with the experimental values. Importantly, the numerical predictions based on the identified descriptors can selectively distinguish among MOFs with higher or lower values of the Henry coefficient. We believe that the above results represent an important step toward efficient MOF screening and optimization, not only with respect to intrinsic material properties but also (and importantly) with respect to figures of merit of engineering relevance for applications such as thermally driven water harvesting from the air, water-sorption thermal energy storage, and solar cooling.
Clearly, we are also aware that our approach is based on a number of approximations and still requires additional research activities. First, we notice that a large set of hypothetical MOFs may be characterized by properties (e.g., Henry coefficients) whose values span several orders of magnitude. Hence a single ML model, as used in this pipeline, may achieve a high coefficient of determination when the logarithm of the property is considered. Nonetheless, the computation of the coverage span Δθ depends directly on the Henry coefficient itself, and may thus be affected by a relatively high error. Furthermore, additional simplifying assumptions that have been used in our approach include fixed parameters such as the constant of proportionality between the specific surface area and the water working capacity, as well as the steepness coefficient β in the FFG model. Without loss of generality, those assumptions could be relaxed in the near future by relying on more sophisticated models. One possible way to cope with those challenges (not necessarily the only possible strategy) may be a preliminary classification of the hypothetical MOFs based on properly trained ML classifiers, with the purpose of assigning a given compound of interest to a specific category (e.g., the set of MOFs with Henry coefficient of similar magnitude, similar β, etc.). Afterward, property predictions on each MOF category could be performed with higher accuracy.
Methods
Water-sorption thermal energy storage
In this work, we focus on the use of potential MOFs for water-sorption-based thermal storage applications.
Physical adsorption processes are based on weak and reversible interactions between the (solid) sorbent material and the corresponding adsorbate, i.e., the fluid^{72}. Those phenomena are relevant to thermal energy engineering as sorption/desorption in solid sorbents can be accompanied by a significant amount of energy exchange. In the following, the solid sorbents are MOFs, while water is the adsorbate.
To allow desorption of an infinitesimal number (dn) of moles of adsorbate from the adsorbent surface, a given amount of heat dQ = q_{st} dn has to be provided to the system, where q_{st} (with units of kJ/mol) denotes the isosteric heat. Since the process is reversible, the same amount of heat dQ is released by the dry sorbent when dn moles of fluid at a pressure p, initially in the vapor phase, are adsorbed. Furthermore, we define the load X as the ratio between the mass of adsorbate and the mass of dry sorbent. A process characterized by constant load X is referred to as an isosteric transformation, and the popular Clausius–Clapeyron relationship yields:
$$\left.\frac{\partial \ln p}{\partial \left(-1/T\right)}\right|_{X}=\frac{q_{\rm st}}{R},$$ (7)
where T is the absolute temperature and R = 8.314 J mol^{−1} K^{−1} denotes the gas constant^{73}. Therefore, an isosteric transformation in the Clapeyron chart (\(\ln p\) vs −1/T) is a curve with local slope q_{st}/R. Similarly, the adsorbate isosteric curve has a slope ΔH_{(vap)}/R, where ΔH_{(vap)} is the molar enthalpy for the liquid-vapor phase change of the adsorbate.
Closely related to the load X, the coverage θ is defined as the ratio between the number of already occupied adsorption sites n_{s} and the total available number of sites n_{TOT}. At equilibrium and at a given temperature T, the coverage θ depends on the pressure p of the vapor phase according to an adsorption isotherm, whose shape depends on the sorbent/adsorbate pair. MOF/water pairs are known to show typical S-shaped isotherms in the θ − p chart (i.e., type V of the IUPAC classification^{74}). Thus, in this work, we make the assumption that the Frumkin–Fowler–Guggenheim (FFG) model can be used conveniently for describing analytically the MOF-water adsorption isotherms:
$$H(T)\,p=\frac{\theta }{1-\theta }\exp \left(-\beta \theta \right),$$ (8)
where \(\beta =\frac{\overline{n}{E}_{\rm p}}{RT}\) rules the steepness of the S-shape, \(\overline{n}\) denotes the number of neighboring binding sites and E_{p} represents the additional binding energy due to lateral interactions^{67}. We have used the FFG model to interpret eight experimental isotherms of real MOF-water pairs and achieve a proper choice of β. In particular, for each curve, we have identified the best value of β in the least-squares sense; then, we have taken the mean over those eight values, ending up with β = 3.4. More details can be found in Supplementary Note 6. Finally, H(T) is the Henry coefficient for the specific sorbent/adsorbate pair (with units Pa^{−1}) at a given absolute temperature T.
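To make the role of β and H(T) concrete, the following minimal Python sketch evaluates an FFG isotherm in both directions. It assumes the standard FFG form H p = θ/(1 − θ) exp(−βθ); the function names are illustrative and the bisection solver is just one simple way to invert the relation.

```python
import math

def ffg_pressure(theta, H, beta=3.4):
    """Pressure (Pa) at coverage theta from H*p = theta/(1-theta)*exp(-beta*theta)."""
    return theta / ((1.0 - theta) * H) * math.exp(-beta * theta)

def ffg_coverage(p, H, beta=3.4, n_iter=200):
    """Coverage at pressure p (Pa), found by bisection on theta in (0, 1).

    For beta < 4 the pressure is a strictly increasing function of theta,
    so the root is unique and bisection is safe.
    """
    lo, hi = 0.0, 1.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if ffg_pressure(mid, H, beta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

H = 6.89e-4  # Henry coefficient in Pa^-1 (value quoted in the Results)
for p in (1.0, 100.0, 300.0, 1000.0):
    print(f"p = {p:7.1f} Pa -> theta = {ffg_coverage(p, H):.4f}")
```

At very low pressure the computed coverage approaches H p, which is the Henry-law limit the coefficient H(T) is meant to capture.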
A schematic of a closed water-sorption thermal energy storage system is shown in Fig. 9. These systems are based on a reactor, containing the solid sorbent, connected with a condenser/evaporator by means of a valve^{57}. Such a chemical apparatus follows a seasonal closed cycle completely defined by four temperatures: T_{A} (the minimum temperature on the user side), T_{C} (the maximum temperature on the source side), T_{E} (the average winter temperature), and T_{F} (the average summer temperature). The two pressures p_{E} (evaporator) and p_{F} (condenser) are related to the absolute temperatures T_{E} and T_{F}, respectively, through the Antoine equation for water saturation \(p_{\rm E,F}=133.2\times 10^{A-B/(C+T_{\rm E,F}-273)}\), where A = 8.07131, B = 1730.63, C = 233.426^{75}. Hence, the ideal thermodynamic cycle of a closed thermal energy storage process (see Fig. 10) is based on the following four steps:

1. The sorbent/adsorbate is heated isosterically up to a temperature T_{B}, corresponding to a pressure p_{F} in the condenser (line AB).
2. Heating of the pair continues at constant pressure p_{F} and desorbed vapor flows to the condenser through the opened valve. In the condenser, the adsorbate rejects the condensation heat into the environment while condensing until the maximum temperature of the heat source T_{C} is reached (line BC). The condition of the minimum load is reached and the valve gets closed.
3. Keeping the valve closed, the system in contact with the environmental temperature cools isosterically during the storage period (line CD) down to temperature T_{D}, corresponding to the evaporator pressure p_{E}.
4. During the discharge phase of the heat storage system, the valve is opened to let the adsorbate evaporate and reach the reactor. During this isobaric transformation (line DA), the heat of adsorption Q_{DA}, also known as cycled heat, is released.
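The four steps above fix the two extreme loads of the cycle: the maximum coverage is reached at point A (temperature T_{A}, evaporator pressure p_{E}) and the minimum at point C (temperature T_{C}, condenser pressure p_{F}). A hedged Python sketch of how the coverage span could be evaluated from a single Henry coefficient follows. The four temperatures are hypothetical placeholders, the isotherm is assumed to follow the standard FFG form with β = 3.4, and the temperature dependence is obtained by Polanyi rescaling of the isotherm known at T_{0}; none of the helper names come from the published code.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1

def p_sat(T):
    """Antoine saturation pressure of water in Pa (T in K)."""
    A, B, C = 8.07131, 1730.63, 233.426
    return 133.2 * 10 ** (A - B / (C + T - 273.0))

def ffg_coverage(p, H, beta=3.4, n_iter=200):
    """Invert H*p = theta/(1-theta)*exp(-beta*theta) by bisection (beta < 4)."""
    lo, hi = 0.0, 1.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mid / (1.0 - mid) * math.exp(-beta * mid) < H * p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def coverage(T, p, H0, T0=298.0):
    """Coverage at (T, p), mapped back to the T0 isotherm at equal Polanyi potential."""
    a_pol = R * T * math.log(p_sat(T) / p)
    p_equiv = p_sat(T0) * math.exp(-a_pol / (R * T0))
    return ffg_coverage(p_equiv, H0)

# Hypothetical cycle temperatures in K (user, source, winter, summer)
T_A, T_C, T_E, T_F = 303.15, 363.15, 278.15, 303.15
H0 = 6.89e-4  # Henry coefficient at T0 = 298 K, in Pa^-1

p_E, p_F = p_sat(T_E), p_sat(T_F)   # evaporator and condenser pressures
theta_A = coverage(T_A, p_E, H0)    # maximum load, end of discharge (point A)
theta_C = coverage(T_C, p_F, H0)    # minimum load, end of charging (point C)
delta_theta = theta_A - theta_C

print(f"p_E = {p_E:.0f} Pa, p_F = {p_F:.0f} Pa")
print(f"theta_A = {theta_A:.3f}, theta_C = {theta_C:.3f}, span = {delta_theta:.3f}")
```

Because the isosteric steps AB and CD keep the load constant, only points A and C need to be located on the isotherms.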
One of the most important figures of merit for energy storage systems is the specific stored energy, namely the maximum energy that can be stored per unit of mass of the plant or of the material^{76}. Clearly, at fixed material (or plant) mass, the higher the cycled heat, the higher the specific energy of the storage system. In this view, the choice of the solid sorbent material for a given adsorbate is key for maximizing the cycled heat in the ideal thermodynamic cycle. We thus perform below a material screening aiming at the maximum value of the following cycled heat (i.e., the released heat during the DA process in Fig. 10):
$$Q_{\rm DA}=\int_{\rm D}^{\rm A}q_{\rm st}\,{\rm d}n_{\rm s}=n_{\rm TOT}\int_{\theta_{\rm D}}^{\theta_{\rm A}}q_{\rm st}\,{\rm d}\theta \approx n_{\rm TOT}\,q_{\rm st}\,\Delta \theta ,$$ (9)
where we have used the definition of the heat of adsorption dQ = q_{st} dn_{s}, the coverage θ = n_{s}/n_{TOT} and the approximation \(q_{\rm st}\approx {\rm const}\).
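The link between the cycled heat and the screening objective S q_{st} Δθ used in the Results can be made explicit with one extra step. The relation below is a sketch that assumes, as in the Results, that the number of adsorption sites per unit mass of sorbent is proportional to the specific surface area S through the constant η (mass of water per unit area) and the molar mass of water; the symbol m_{MOF} for the sorbent mass is introduced here only for illustration.

$$\frac{Q_{\rm DA}}{m_{\rm MOF}}\approx \frac{n_{\rm TOT}}{m_{\rm MOF}}\,q_{\rm st}\,\Delta \theta =\frac{\eta }{\mathcal{M}_{{\rm H}_{2}{\rm O}}}\,S\,q_{\rm st}\,\Delta \theta ,\qquad \frac{n_{\rm TOT}}{m_{\rm MOF}}=\frac{\eta \,S}{\mathcal{M}_{{\rm H}_{2}{\rm O}}}.$$

Under this assumption, ranking materials by S q_{st} Δθ is equivalent to ranking them by specific energy, since the prefactor is the same for all compounds.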
Sequential learning
Typical steps in any SL algorithm consist in (i) constructing a regression model over known data, (ii) using a strategy to suggest the best unmeasured point to test, (iii) enlarging the known dataset with this tested point, and (iv) iterating until the tested candidate meets the needed specification. Let \(\mathcal{D}=\{({\bf x}_{1},y_{1}),\ldots ,({\bf x}_{n},y_{n})\}\) denote a set of n training data, where \({\bf x}_{i}\in {\mathbb{R}}^{d}\) and \(y_{i}\in {\mathbb{R}}\) represent the i-th vector of descriptors and its known response, respectively. Let {x_{n+1}, …, x_{m}} denote the (m − n) d-dimensional arrays of descriptors with unknown responses {y_{n+1}, …, y_{m}}. To find the location x^{*} of the maximum y^{*}, we would need the exact model y = f(x). However, given the restricted set \(\mathcal{D}\) of training data, only a surrogate model \(y=\hat{f}({\bf x}\,|\,\mathcal{D})\) can be constructed. Hence, for each unmeasured point i = n + 1, …, m, different regression methodologies (here FUELS-Random Forest, kriging and COMBO-Gaussian processes) can be used to estimate the response \(\hat{f}({\bf x}_{i})\) in terms of a mean value μ(x_{i}) and the corresponding uncertainty σ(x_{i}), indicating the robustness of the prediction. To measure the performance of any regression model/query strategy combination, we put ourselves in the perspective of the practitioner, who is interested in a unique sequence of points to be tested, and not in an average over multiple paths (as, for instance, shown in ref. ^{49}). To achieve this, for those regression models not allowing a deterministic prediction (i.e., Random Forest and COMBO), at each step we have repeated 100 times the choice of the next point to query, picking the most preferred one.
A comprehensive comparison of the above methodologies is reported in the Results (subsection Descriptors of sorption properties in MOFs and their use in SL algorithms), and the complete details on the adopted algorithms can be found in the Supplementary Notes 7 and 8.
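The generic SL loop described in this subsection can be illustrated with a deliberately simple, dependency-free Python sketch. It is not FUELS, kriging, or COMBO: the surrogate is a toy inverse-distance-weighted mean whose uncertainty is just the distance to the nearest measured point, the query strategy is an upper-confidence-bound-style rule, and `true_property`, `kappa`, and `target` are hypothetical stand-ins for the expensive evaluation, the exploration weight, and the needed specification.

```python
import math
import random

def true_property(x):
    """Hypothetical expensive evaluation (stand-in for a simulation or experiment)."""
    return math.exp(-(x - 0.7) ** 2 / 0.02) + 0.5 * math.exp(-(x - 0.2) ** 2 / 0.01)

def surrogate(x, known):
    """Toy surrogate: returns (mean, uncertainty) at x from the measured points."""
    dists = [abs(x - xk) for xk, _ in known]
    weights = [1.0 / (d + 1e-9) ** 2 for d in dists]
    mu = sum(w * yk for w, (_, yk) in zip(weights, known)) / sum(weights)
    sigma = min(dists)  # crude uncertainty: distance to the nearest known point
    return mu, sigma

def acquisition(x, known, kappa):
    mu, sigma = surrogate(x, known)
    return mu + kappa * sigma  # favors high predictions and unexplored regions

def sequential_learning(candidates, n_init=3, kappa=2.0, target=0.99, seed=0):
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    known = [(x, true_property(x)) for x in pool[:n_init]]  # initial training set
    pool = pool[n_init:]
    while pool and max(y for _, y in known) < target:       # (iv) iterate
        # (i) surrogate from known data, (ii) suggest the best unmeasured point
        best = max(pool, key=lambda x: acquisition(x, known, kappa))
        pool.remove(best)
        known.append((best, true_property(best)))            # (iii) enlarge dataset
    return known

candidates = [i / 100 for i in range(101)]
history = sequential_learning(candidates)
x_best, y_best = max(history, key=lambda pair: pair[1])
print(f"evaluations: {len(history)} of {len(candidates)}; best x = {x_best}, y = {y_best:.3f}")
```

Swapping `surrogate` for a Random Forest or Gaussian-process regressor, and `acquisition` for another query strategy, would recover the structure of the methods compared in this work without changing the loop itself.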
Model training and choice of the descriptors
The first issue to be addressed when applying SL to material optimization is the computation and selection of relevant features (or descriptors). The descriptor issue is critical in materials science^{77,78} as well as in other computational fields^{79}. In this work, we first investigate to what extent the choice of a minimal set of relevant descriptors is critical for the fast convergence of SL algorithms.
To this end, before even implementing SL procedures, we performed a preliminary feature pruning to discover the most meaningful descriptors in terms of the target property. We use the entire dataset (both descriptors and target property) to train and validate a Random Forest-based pipeline (feature reduction and machine learning) for regression, with hyperparameter tuning in five-fold cross-validation. More details can be found in Supplementary Note 10.
Upon model training and validation, we detect the most important features thanks to the Tree SHAP algorithm, which is optimized for tree-based models such as Random Forest^{48,59}, thus quantifying to what extent a given feature impacts the output. The latter methodology is based on the classical Shapley value, which has its original field of application in game theory. There, the problem of assigning, in a cooperative game, a proportional reward to each player is addressed based on the real contribution provided to the common objective of the coalition. In a model, given the set F of all features and a generic subset S ⊆ F not containing the i-th feature, the importance of the i-th descriptor depends on the comparison between the model f_{S∪{i}} trained with that explanatory variable, and another model f_{S} trained without that feature; then, the difference between the predictions f_{S∪{i}}(x_{S∪{i}}) − f_{S}(x_{S}) is computed, where x_{S∪{i}} and x_{S} represent respectively the values of the input features over the subsets S ∪ {i} and S. This difference is weighted over all possible subsets S and the importance value of the i-th feature turns out to be
$$\phi_{i}=\sum_{S\subseteq F\setminus \{i\}}\frac{|S|!\,\left(|F|-|S|-1\right)!}{|F|!}\left[f_{S\cup \{i\}}\left(x_{S\cup \{i\}}\right)-f_{S}\left(x_{S}\right)\right],$$ (10)
where ∣⋅∣ denotes the number of elements. Because of the huge number of possible descriptor subsets S of a set F, the classical Shapley values of Equation (10) are, in general, computationally challenging. Nonetheless, the Tree SHAP realization we have employed in this work is able to efficiently explain tree-based models, such as Random Forest.
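To illustrate why the classical Shapley values become expensive, the sketch below computes them by brute-force enumeration of all subsets for a three-feature toy problem. It is not the Tree SHAP algorithm: `toy_value` is a hypothetical stand-in for the prediction of a model restricted to a feature subset, and the feature names are arbitrary.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating every subset S of F without feature i."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(n):                       # k = |S|
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

def toy_value(subset):
    """Hypothetical model output using only the features in `subset`."""
    v = 0.0
    if "a" in subset:
        v += 3.0                                 # main effect of feature a
    if "b" in subset:
        v += 1.0                                 # main effect of feature b
    if "a" in subset and "c" in subset:
        v += 2.0                                 # interaction between a and c
    return v

phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)  # the a-c interaction is split equally between a and c
```

The number of evaluated subsets doubles with every added feature, which is the computational burden that tree-specific algorithms are designed to avoid.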
Data availability
Datasets and trained models of this study are available in Zenodo (DOI:10.5281/zenodo.6351366)^{80}.
Code availability
The codes used to obtain the results of this study are publicly available at https://github.com/giotre/MOFs.
References
Kitagawa, S. et al. Metal–organic frameworks (mofs). Chem. Soc. Rev. 43, 5415–5418 (2014).
Adil, K. et al. Gas/vapour separation using ultramicroporous metal–organic frameworks: insights into the structure/separation relationship. Chem. Soc. Rev. 46, 3402–3430 (2017).
Rogge, S. M. et al. Metal–organic and covalent organic frameworks as singlesite catalysts. Chem. Soc. Rev. 46, 3134–3184 (2017).
Wuttke, S., Lismont, M., Escudero, A., Rungtaweevoranit, B. & Parak, W. J. Positioning metalorganic framework nanoparticles within the context of drug delivery–a comparison with mesoporous silica nanoparticles and dendrimers. Biomaterials 123, 172–183 (2017).
Chaemchuen, S., Xiao, X., Klomkliang, N., Yusubov, M. S. & Verpoort, F. Tunable metal–organic frameworks for heat transformation applications. Nanomaterials 8, 661 (2018).
de Lange, M. F., Verouden, K. J., Vlugt, T. J., Gascon, J. & Kapteijn, F. Adsorptiondriven heat pumps: the potential of metal–organic frameworks. Chem. Rev. 115, 12205–12250 (2015).
Chen, S., Lucier, B. E., Boyle, P. D. & Huang, Y. Understanding the fascinating origins of co2 adsorption and dynamics in mofs. Chem. Mater. 28, 5829–5846 (2016).
Kim, H. et al. Adsorptionbased atmospheric water harvesting device for arid climates. Nat. Commun. 9, 1–8 (2018).
Kalmutzki, M. J., Diercks, C. S. & Yaghi, O. M. Metal–organic frameworks for water harvesting from air. Adv. Mater. 30, 1704304 (2018).
Ejeian, M. & Wang, R. Adsorptionbased atmospheric water harvesting. Joule 5, 1678–1703 (2021).
Lee, S. et al. Computational screening of trillions of metal–organic frameworks for highperformance methane storage. ACS Appl. Mater. Interfaces 13, 23647–23654 (2021).
Li, A., BuenoPerez, R., Wiggin, S. & FairenJimenez, D. Enabling efficient exploration of metal–organic frameworks in the cambridge structural database. CrystEngComm 22, 7152–7161 (2020).
Borboudakis, G. et al. Chemically intuited, largescale screening of mofs by machine learning techniques. Npj Comput. Mater. 3, 1–7 (2017).
Anderson, R., Rodgers, J., Argueta, E., Biong, A. & GómezGualdrón, D. A. Role of pore chemistry and topology in the co2 capture capabilities of mofs: from molecular simulation to machine learning. Chem. Mater. 30, 6325–6337 (2018).
Moghadam, P. Z. et al. Structuremechanical stability relations of metalorganic frameworks via machine learning. Matter 1, 219–234 (2019).
Zhou, M., Vassallo, A. & Wu, J. Toward the inverse design of mof membranes for efficient d2/h2 separation by combination of physicsbased and datadriven modeling. J. Membr. Sci. 598, 117675 (2020).
Yan, Y. et al. Machine learning and insilico screening of metal–organic frameworks for o2/n2 dynamic adsorption and separation. Chem. Eng. J. 427, 131604 (2022).
Rampal, N. et al. The development of a comprehensive toolbox based on multilevel, highthroughput screening of mofs for co/n 2 separations. Chem. Sci. 12, 12068–12081 (2021).
Avci, G., Erucar, I. & Keskin, S. Do new mofs perform better for co2 capture and h2 purification? computational screening of the updated mof database. ACS Appl. Mater. Interfaces 12, 41567–41579 (2020).
Halder, P. & Singh, J. K. Highthroughput screening of metal–organic frameworks for ethane–ethylene separation using the machine learning technique. Energy Fuels 34, 14591–14597 (2020).
Yang, W. et al. Computational screening of metal–organic framework membranes for the separation of 15 gas mixtures. Nanomaterials 9, 467 (2019).
Qiao, Z. et al. Molecular fingerprint and machine learning to accelerate design of highperformance homochiral metal–organic frameworks. AIChE J. 67, e17352 (2021).
Li, S., Chung, Y. G. & Snurr, R. Q. Highthroughput screening of metal–organic frameworks for co2 capture in the presence of water. Langmuir 32, 10368–10376 (2016).
Pardakhti, M., Moharreri, E., Wanik, D., Suib, S. L. & Srivastava, R. Machine learning using combined structural and chemical descriptors for prediction of methane adsorption performance of metal organic frameworks (mofs). ACS Comb. Sci. 19, 640–645 (2017).
Bobbitt, N. S. & Snurr, R. Q. Molecular modelling and machine learning for highthroughput screening of metalorganic frameworks for hydrogen storage. Mol. Simul. 45, 1069–1081 (2019).
Qiao, Z., Xu, Q., Cheetham, A. K. & Jiang, J. Highthroughput computational screening of metal–organic frameworks for thiol capture. J. Phys. Chem. C. 121, 22208–22215 (2017).
Liang, H. et al. Combining largescale screening and machine learning to predict the metalorganic frameworks for organosulfurs removal from highsour natural gas. APL Mater. 7, 091101 (2019).
Yang, P. et al. Analyzing acetylene adsorption of metal–organic frameworks based on machine learning. Green Energy Environ. https://doi.org/10.1016/j.gee.2021.01.006 (2021).
Dureckova, H., Krykunov, M., Aghaji, M. Z. & Woo, T. K. Robust machine learning models for predicting high co2 working capacity and co2/h2 selectivity of gas adsorption in metal organic frameworks for precombustion carbon capture. J. Phys. Chem. C. 123, 4133–4139 (2019).
Liu, Z. et al. Predicting adsorption and separation performance indicators of xe/kr in metalorganic frameworks via a precursorbased neural network model. Chem. Eng. Sci. 243, 116772 (2021).
Ma, P. et al. Computerassisted design for stable and porous metalorganic framework (mof) as a carrier for curcumin delivery. LWT 120, 108949 (2020).
Du, Z. et al. A highthroughput computational screening of potential adsorbents for a thermal compression co2 brayton cycle. Sustain. Energy Fuels 5, 1415–1428 (2021).
Long, R. et al. Screening metalorganic frameworks for adsorptiondriven osmotic heat engines via grand canonical monte carlo simulations and machine learning. iScience 24, 101914 (2021).
Shi, Z. et al. Machine learning and in silico discovery of metalorganic frameworks: Methanol as a working fluid in adsorptiondriven heat pumps and chillers. Chem. Eng. Sci. 214, 115430 (2020).
Shi, Z. et al. Technoeconomic analysis of metal–organic frameworks for adsorption heat pumps/chillers: from directional computational screening, machine learning to experiment. J. Mater. Chem. A 9, 7656–7666 (2021).
García, E. J., Bahamon, D. & Vega, L. F. Systematic search of suitable metal–organic frameworks for thermal energystorage applications with low global warming potential refrigerants. ACS Sustain. Chem. Eng. 9, 3157–3171 (2021).
Liang, Q. et al. Benchmarking the performance of bayesian optimization across multiple experimental materials science domains. Npj Comput. Mater. 7, 1–10 (2021).
Kim, Y., Kim, E., Antono, E., Meredig, B. & Ling, J. Machinelearned metrics for predicting the likelihood of success in materials discovery. Npj Comput. Mater. 6, 1–9 (2020).
Brochu, E., Cora, V. M. & De Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Preprint at https://arxiv.org/abs/1012.2599 (2010).
Ahmadi, M., Vogt, M., Iyer, P., Bajorath, J. & Fröhlich, H. Predicting potent compounds via modelbased global optimization. J. Chem. Inf. Model. 53, 553–559 (2013).
Rohr, B. et al. Benchmarking the acceleration of materials discovery by sequential learning. Chem. Sci. 11, 2696–2706 (2020).
Aggarwal, R., Demkowicz, M. & Marzouk, Y. in Information Science for Materials Discovery and Design (eds Alexander, F. J., Rajan, K. & Lookman, T.) Ch. 2 (Springer, 2016).
Seko, A., Maekawa, T., Tsuda, K. & Tanaka, I. Machine learning with systematic densityfunctional theory calculations: Application to melting temperatures of singleand binarycomponent solids. Phys. Rev. B 89, 054303 (2014).
Kiyohara, S., Oda, H., Tsuda, K. & Mizoguchi, T. Acceleration of stable interface structure searching using a kriging approach. Jpn. J. Appl. Phys. 55, 045502 (2016).
Dehghannasiri, R. et al. Optimal experimental design for materials discovery. Comput. Mater. Sci. 129, 311–322 (2017).
Boyd, P. G. et al. Datadriven design of metal–organic frameworks for wet flue gas co 2 capture. Nature 576, 253–256 (2019).
Choudhary, K., DeCost, B. & Tavazza, F. Machine learning with forcefieldinspired descriptors for materials: Fast screening and mapping energy landscape. Phys. Rev. Mater. 2, 083801 (2018).
Lundberg, S. M. et al. From local explanations to global understanding with explainable ai for trees. Nat. Mach. Intell. 2, 2522–5839 (2020).
Ling, J., Hutchinson, M., Antono, E., Paradiso, S. & Meredig, B. Highdimensional materials and process optimization using datadriven experimental design with wellcalibrated uncertainty estimates. Integr. Mater. Manuf. Innov. 6, 207–217 (2017).
Lophaven, S. N., Nielsen, H. B., Sondergaard, J. & Dace, A. A Matlab Kriging Toolbox. Report No. IMMTR2002 12 (Technical University of Denmark, 2002).
Ueno, T., Rhone, T. D., Hou, Z., Mizoguchi, T. & Tsuda, K. Combo: an efficient bayesian optimization library for materials science. Mater. Discov. 4, 18–21 (2016).
Fasano, M. et al. Water/ethanol and 13x zeolite pairs for longterm thermal energy storage at ambient pressure. Front. Energy Res. 7, 148 (2019).
Fasano, M., Bevilacqua, A., Chiavazzo, E., Humplik, T. & Asinari, P. Mechanistic correlation between water infiltration and framework hydrophilicity in mfi zeolites. Sci. Rep. 9, 1–12 (2019).
Anstine, D. M., Tang, D., Sholl, D. S. & Colina, C. M. Adsorption space for microporous polymers with diverse adsorbate species. Npj Comput. Mater. 7, 1–9 (2021).
Neri, M., Chiavazzo, E. & Mongibello, L. Numerical simulation and validation of commercial hot water tanks integrated with phase change materialbased storage units. J. Energy Storage 32, 101938 (2020).
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
Fasano, M., Borri, D., Chiavazzo, E. & Asinari, P. Protocols for atomistic modeling of water uptake into zeolite crystals for thermal storage and other applications. Appl. Therm. Eng. 101, 762–769 (2016).
Fasano, M. et al. Atomistic modelling of water transport and adsorption mechanisms in silicoaluminophosphate for thermal energy storage. Appl. Therm. Eng. 160, 114075 (2019).
Lundberg, S. M. & Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4768–4777 (2017).
Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm. Npj Comput. Mater. 6, 1–10 (2020).
Van der Maaten, L. & Hinton, G. Visualizing data using tsne.J. Mach. Learn. Res. 9, 2579−2605 (2008).
Wu, X., Xiang, S., Su, J. & Cai, W. Understanding quantitative relationship between methane storage capacities and characteristic properties of metal–organic frameworks based on machine learning. J. Phys. Chem. C. 123, 8550–8559 (2019).
Yu, X., Choi, S., Tang, D., Medford, A. J. & Sholl, D. S. Efficient models for predicting temperaturedependent henry’s constants and adsorption selectivities for diverse collections of molecules in metal–organic frameworks. J. Phys. Chem. C. 125, 18046–18057 (2021).
Lavagna, L. et al. Cementitious composite materials for thermal energy storage applications: a preliminary characterization and theoretical analysis. Sci. Rep. 10, 1–13 (2020).
Talirz, L. et al. Materials cloud, a platform for open computational science. Sci. Data 7, 1–12 (2020).
Xu, M., Liu, Z., Huai, X., Lou, L. & Guo, J. Screening of metal–organic frameworks for water adsorption heat transformation using structure–property relationships. RSC Adv. 10, 34621–34631 (2020).
Butt, H.J., Graf, K. & Kappl, M. Physics and Chemistry of Interfaces (John Wiley & Sons, 2013).
Furukawa, H. et al. Water adsorption in porous metal–organic frameworks and related materials. J. Am. Chem. Soc. 136, 4369–4381 (2014).
Ni, L. et al. Defectengineered uio66nh 2 modified thin film nanocomposite membrane with enhanced nanofiltration performance. Chem. Commun. 56, 8372–8375 (2020).
Huang, Y. et al. Tuning the wettability of metal–organic frameworks via defect engineering for efficient oil/water separation. ACS Appl. Mater. Interfaces 12, 34413–34422 (2020).
Xiang, W., Zhang, Y., Chen, Y., Liu, C.-J. & Tu, X. Synthesis, characterization and application of defective metal–organic frameworks: current status and perspectives. J. Mater. Chem. A 8, 21526–21546 (2020).
Ruthven, D. M. Principles of Adsorption and Adsorption Processes (John Wiley & Sons, 1984).
Schmidt, F. P. Optimizing Adsorbents for Heat Storage Applications: Estimation of Thermodynamic Limits and Monte Carlo Simulations of Water Adsorption in Nanopores. Ph.D. thesis, Fakultät für Mathematik und Physik, Universität Freiburg (2004).
Sing, K. S. Reporting physisorption data for gas/solid systems with special reference to the determination of surface area and porosity (recommendations 1984). Pure Appl. Chem. 57, 603–619 (1985).
Speight, J. G. et al. Lange’s Handbook of Chemistry Vol. 1 (McGraw-Hill, 2005).
Dincer, I. & Rosen, M. A. Thermal Energy Storage Systems and Applications (John Wiley & Sons, 2021).
Ghiringhelli, L. M., Vybiral, J., Levchenko, S. V., Draxl, C. & Scheffler, M. Big data of materials science: critical role of the descriptor. Phys. Rev. Lett. 114, 105503 (2015).
Gomes, S. I. et al. Machine learning and materials modelling interpretation of in vivo toxicological response to TiO_{2} nanoparticles library (UV and non-UV exposure). Nanoscale 13, 14666–14678 (2021).
Chiavazzo, E. et al. Intrinsic map dynamics exploration for uncharted effective free-energy landscapes. Proc. Natl Acad. Sci. USA 114, E5494–E5503 (2017).
Trezza, G., Bergamasco, L., Fasano, M. & Chiavazzo, E. Models and datasets for “Minimal set of crystallographic descriptors for sorption properties in hypothetical Metal Organic Frameworks and their role in sequential learning optimization”. Zenodo https://doi.org/10.5281/zenodo.6351366 (2022).
Küsgens, P. et al. Characterization of metal-organic frameworks by water adsorption. Microporous Mesoporous Mater. 120, 325–330 (2009).
Horcajada, P. et al. Synthesis and catalytic properties of MIL-100(Fe), an iron(III) carboxylate with large pores. Chem. Commun. 27, 2820–2822 (2007).
Lebedev, O., Millange, F., Serre, C., Van Tendeloo, G. & Férey, G. First direct imaging of giant pores of the metal-organic framework MIL-101. Chem. Mater. 17, 6525–6527 (2005).
Yang, D.-A., Cho, H.-Y., Kim, J., Yang, S.-T. & Ahn, W.-S. CO_{2} capture and conversion using Mg-MOF-74 prepared by a sonochemical method. Energy Environ. Sci. 5, 6465–6473 (2012).
Henkelis, S. E. et al. A single crystal study of CPO-27 and UTSA-74 for nitric oxide storage and release. CrystEngComm 21, 1857–1861 (2019).
Permyakova, A. et al. Synthesis optimization, shaping, and heat reallocation evaluation of the hydrophilic metal–organic framework MIL-160(Al). ChemSusChem 10, 1419–1426 (2017).
Permyakova, A. et al. Design of salt–metal organic framework composites for seasonal heat storage applications. J. Mater. Chem. A 5, 12889–12898 (2017).
Acknowledgements
E.C. acknowledges financial support of the Italian National Project PRIN Heat transfer and Thermal Energy Storage Enhancement by Foams and Nanoparticles (2017F7KZWS) and of the research contract PTR 2019/21 ENEA (Sviluppo di modelli per la caratterizzazione delle proprietà di scambio termico di PCM in presenza di additivi per il miglioramento dello scambio termico) funded by the Italian Ministry of Economic Development (MiSE).
Author information
Contributions
E.C. conceived the idea and secured financial support. G.T. performed all computations and wrote the first draft of the paper. E.C., M.F., and L.B. supervised the research activities. L.B. and M.F. helped with the presentation and interpretation of the results. M.F. suggested the synthetic dataset. All authors contributed to writing the final paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Trezza, G., Bergamasco, L., Fasano, M. et al. Minimal crystallographic descriptors of sorption properties in hypothetical MOFs and role in sequential learning optimization. npj Comput Mater 8, 123 (2022). https://doi.org/10.1038/s41524-022-00806-7
DOI: https://doi.org/10.1038/s41524-022-00806-7