Minimal crystallographic descriptors of sorption properties in hypothetical MOFs and role in sequential learning optimization

Trezza, Giovanni; Bergamasco, Luca; Fasano, Matteo; Chiavazzo, Eliodoro

doi:10.1038/s41524-022-00806-7

Download PDF

Article
Open access
Published: 03 June 2022

Minimal crystallographic descriptors of sorption properties in hypothetical MOFs and role in sequential learning optimization

npj Computational Materials volume 8, Article number: 123 (2022) Cite this article

2776 Accesses
12 Citations
17 Altmetric
Metrics details

Subjects

Abstract

We focus on gas sorption within metal-organic frameworks (MOFs) for energy applications and identify the minimal set of crystallographic descriptors underpinning the most important properties of MOFs for CO₂ and H₂O. A comprehensive comparison of several sequential learning algorithms for MOFs properties optimization is performed and the role played by those descriptors is clarified. In energy transformations, thermodynamic limits of important figures of merit crucially depend on equilibrium properties in a wide range of sorbate coverage values, which is often only partially accessible, hence possibly preventing the computation of desired objective functions. We propose a fast procedure for optimizing specific energy in a closed sorption energy storage system with only access to a single water Henry coefficient value and to the specific surface area. We are thus able to identify hypothetical candidate MOFs that are predicted to outperform state-of-the-art water-sorbent pairs for thermal energy storage applications.

A comprehensive transformer-based approach for high-accuracy gas adsorption predictions in metal-organic frameworks

Article Open access 01 March 2024

High-throughput screening of hypothetical metal-organic frameworks for thermal conductivity

Article Open access 20 January 2023

On the role of history-dependent adsorbate distribution and metastable states in switchable mesoporous metal-organic frameworks

Article Open access 03 June 2023

Introduction

Metal-organic frameworks (MOFs) are crystalline compounds consisting of metal ions and organic linkers, characterized by tunable porosity and incredibly high surface area¹. Due to their properties, MOFs have recently attracted remarkable attention in a wide range of different fields, including gas/vapor separation², reaction catalysis³, drug delivery⁴, energy storage, and heat transformations^5,6. Given their nature as porous adsorbent materials, intensively active research is focused on the use of MOFs for CO₂ capture, towards the development of effective technologies for mitigating greenhouse gas emissions⁷. Recently, MOFs have been also employed in adsorption-based atmospheric water harvesting driven by solar thermal energy^8,9. In general, when dealing with applications of engineering relevance, different inlet gas streams, variable operating conditions, and target properties tailored per each specific case make it challenging to identify an ideal MOF crystal for all applications¹⁰, thus leading to a fragmented case-by-case optimization problem.

Hence, MOFs require proper and efficient methods for tailoring their features according to target properties of interest in each specific application. The latter is everything but an easy task. In fact, due to the myriad of degrees of freedom for MOFs structure and composition, more than 100 trillion compounds have been hypothesized¹¹, while almost 100,000 have been synthesized so far¹². High-throughput computational screening and machine learning have been recently adopted to analyse large MOF datasets¹³. Such computational tools allow us to identify significant correlations between nanoscale features and observable macroscale properties^14,15, and to select the most suitable crystal for a given application case. A few representative examples are provided by gas-gas separation (D₂/H₂¹⁶, O₂/N₂¹⁷, CO/N₂¹⁸, CO₂/H₂¹⁹, ethane/ethylene²⁰, and other gas mixtures²¹), the enantioselectivity of chemical compounds²², gas adsorption (CO₂²³, CH₄²⁴, H₂²⁵, thiol²⁶, organosulfurs²⁷, and acetylene²⁸), and combinations thereof^29,30. Several computational explorations of MOFs datasets have been carried out also for biomedical (drug delivery³¹), mechanical (CO₂ Brayton cycle³² and osmotic heat engine³³), and energy applications (heat pumps/chillers^34,35 and thermal energy storage³⁶).

In this context, modern sequential learning (SL) algorithms are emerging as particularly efficient tools for exploring the material high-dimensional (crystallographic) feature space. In particular, while evaluating an objective black-box function through demanding physical or numerical experiments, SL tools can provide a well-orchestrated procedure to rationally navigate the high-dimensional parameter (feature) space^37,38. Thus, given an initial pool of evaluation points, one can sequentially choose the next experiment to carry out^39,40, without relying on naive random guessing. Rather general techniques have been proposed in the area of material science holding the promise to accelerate materials discovery and research⁴¹, with a number of Authors reporting successful use of SL approaches in this field. Aggarwal et al.⁴², by means of optimal experimental design, successfully characterized a substrate under a thin film. Seko et al.⁴³ found the compound with the highest melting temperature in a given ensemble of candidate materials with less attempts than a naive random choice. Kiyohara et al.⁴⁴ accelerated the search of a stable interface structure with respect to a traditional brute force approach. Dehghannasiri et al.⁴⁵ efficiently guided experiments to design the shape memory alloy with the lowest energy dissipation at a given temperature. Needless to say that the identification of the parameter space is a very important preliminary step when implementing SL algorithms. Here, we choose to specifically focus on MOFs properties as gas/vapor sorbent materials, since those are particularly relevant for energy applications.

The first important objective of this work is the identification of the minimal set of MOFs features (or descriptors) ruling critical adsorption properties in the low-coverage regime, i.e., the Henry solubility coefficients for both CO₂-MOFs and, importantly, H₂O-MOFs working pairs. The above minimal set represents the important crystallographic features underpinning a given adsorption property of interest. In this sense, each minimal set of descriptors can be regarded as the genetic code for a given property, and it is identified as described below.

First, we curate and enhance MOF data from a recently developed library made of 8206 compounds generated computationally⁴⁶. Each Crystallographic Information File (CIF) representing a given material is first featurized by means of 1557 Classical Force-field Inspired Descriptors (CFID)⁴⁷ taking into account both chemical and structural parameters. Subsequently, we train and validate regression models of target properties involved in heat storage applications, such as Henry coefficients and working capacity. These models are obtained by means of a Random Forest-based pipeline, with hyperparameter tuning in five fold cross-validation. The final ranking and selection of the minimal set of descriptors is performed by evaluating the importance of each feature on the models outputs by means of the Tree SHAP interpretation algorithm⁴⁸.

Upon identification of the above crystallographic genetic code of sorption properties in MOFs, we investigate its role when using SL algorithms. Therefore, we compare the performance of three different SL methodologies aiming at maximizing H₂O and CO₂ Henry coefficients, and CO₂ working capacity: (a) random Forests with Uncertainty Estimates for Learning Sequentially (FUELS⁴⁹); (b) kriging algorithm⁵⁰; (c) COMmon Bayesian Optimization Library (COMBO)⁵¹. For each SL methodology, we compare several strategies for choosing the next material to test, combining the exploration of high-uncertainty regions with the exploitation of high-performing candidates. Importantly, we analyse the SL performance using both the minimal subset of features (from the pipeline and SHAP analysis) and a larger set of variables, to highlight how the identification of descriptors affects the minimum number of experiments needed to pick out a MOF with the highest value of the desired property. In Fig. 1 the above procedure is schematically represented.

**Fig. 1: Overview of the protocol to identify and test the minimal set of ruling crystallographic descriptors of sorption properties in several sequential learning algorithms.**

We highlight that sorption-based engineering applications rely upon sorbent material characterization in a wide coverage range. However, when a large number of hypothetical sorbents (here MOFs, but in principle also zeolites^52,53 or other materials⁵⁴) have to be evaluated as potential candidates, only low-coverage characterization (i.e. Henry coefficient) is often accessible thus making challenging any optimization of crucial figures of merits of engineering relevance. We thus formulate a procedure aiming at a fast evaluation of one of the most important figures of merit in closed water-sorption seasonal thermal energy storage applications, namely the material-based specific (stored) energy. Unlike traditional sensible or latent systems⁵⁵, the above sorption-based energy storage technologies have the advantage to be loss-free. Our procedure can thus be used in SL-based (or other) optimization/screening processes of MOFs even under incomplete knowledge of the entire isosteric field of the candidate working pairs. Applied to the database of over 5000 computationally generated (hypothetical) compounds by ref. ⁴⁶ (developed for different purposes), our procedure identifies MOFs that can possibly outperform state-of-the-art sorbent materials for thermal energy storage.

Results

In this work, a crucial source of data on MOFs stems from the dataset of Boyd et al.⁴⁶, where important sorption properties (e.g., the Henry coefficients for CO₂ and H₂O, the working capacity for CO₂, and the specific surface area) have been computed by DFT-based simulations for over 8000 potential MOFs. Capitalizing on the above comprehensive study, we construct machine learning (ML) models capable of accurately predicting MOFs solubility of both CO₂ and H₂O as well as CO₂ working capacity and surface area. The above models enable us to achieve the first key result of this work, namely the identification of the minimal set of crystallographic-based descriptors⁵⁶ ruling these sorption and geometric properties in MOFs.

Moreover, a systematic comparison of SL approaches on the above MOFs database reveals important conclusions on the performance of the different regression schemes adopted and, most importantly, the role played by the selection of the feature space to be explored. Those conclusions are also supported by results obtained on a highly-controllable synthetic dataset, as discussed in detail in Supplementary Note 1.

It is worth stressing that properties reported in ref. ⁴⁶ only characterize MOFs in the Henry regime and are not sufficient to describe the equilibrium sorption properties in the high coverage regime. However, when targeting important engineering applications such as seasonal thermal energy storage, key figures of merits of the storage plant (e.g., the material-based specific energy) critically rely upon the access to the entire isosteric field of the chosen sorbent-sorbate pair or, equivalently, to the knowledge of equilibrium adsorption isotherms at several temperature values^57,58. The latter isotherms describe the adsorption properties (at equilibrium) of sorbents in a wide range of coverage values, from the Henry regime up to the saturation pressure of the sorbate fluid.

Therefore, in this work, we also propose an approach enabling us to optimize one of the most important engineering figures of merit of MOFs for seasonal thermal energy storage applications (i.e., material-based specific energy in an ideal closed sorption cycle, see Methods, subsection Water-sorption thermal energy storage, and Results, subsection Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs), even if only an incomplete set of sorption properties are (experimentally or numerically) accessible. Based on the latter optimization procedure, we are finally able to identify potential MOFs candidates for seasonal thermal energy storage that can possibly outperform most of the current state-of-the-art sorbent materials.

Descriptors of sorption properties in MOFs and their use in SL algorithms

We constructed four MOFs datasets, each one with the same 1557 features and a different target property among Henry coefficient for CO₂ (8194 data entries), working capacity for CO₂ (8202 data entries), Henry coefficient for H₂O (8202 data entries), and surface area (5028 data entries). The different number of data entries are due to missing values for some of the chosen properties in the available database by ref. ⁴⁶.

The above database also reported, for all the compounds, both the crystallographic file (used to extract the 1557 CFID⁴⁷ by means of Matminer⁵⁶) and a list of DFT-computed properties, among which we have only considered the abovementioned adsorption properties of interest. More specifically, the computed 1557 explanatory variables (also referred to as descriptors) proposed by Choudhary et al.⁴⁷ consist of a set of both chemical (e.g., average chemical properties over the elements in the cell, average atomic radial charge) and structural (e.g., distribution functions) quantities. More details on descriptor sub-categories are reported in Table 1.

Table 1 Components of classical force-field inspired descriptors (CFID)⁴⁷.

Full size table

We have trained four different ML models by means of a Random Forest-based pipeline with hyperparameter tuning in five fold cross-validation to predict the Henry coefficient for CO₂, the working capacity for CO₂, the Henry coefficient for H₂O, and the surface area, achieving coefficients of determination of R² = 0.785, R² = 0.590, R² = 0.874, and R² = 0.924, respectively. We have used 80% of each dataset to train the models, and the remaining 20% to validate them. Since the Henry coefficient values span a few orders of magnitude, the corresponding ML models have been developed in terms of the natural logarithm of those properties. During the data preprocessing routines, each of the four trained pipelines (i.e., feature reduction and ML with hyperparameter tuning, see Supplementary Note 10 for details) already drops a significant number of the 1557 features, thus confirming that many of the initially selected descriptors do not significantly affect the chosen adsorption properties. More specifically, the final models include 237 descriptors for the CO₂ Henry coefficient, 236 descriptors for CO₂ working capacity, 177 descriptors for the H₂O Henry coefficient, and 234 descriptors for surface area.

Then, the Tree SHAP routine^48,59 allows to identify the most meaningful descriptors as those accounting for the 75% of the cumulative curve over the coefficients of importance. The SHAP routine identifies the most meaningful features for the trained models among the reduced set of descriptors retained by the trained pipelines after the preprocessing. In particular, the impact of a descriptor depends on the comparison between the output of a model trained with that feature and another model output, trained without that feature (see Methods, subsection Model training and choice of descriptors). The coefficients of importance are thus computed over the testing set, i.e., over samples the model has never encountered during the training.

Overall, starting from the original 1557 Classical Force-field Inspired Descriptors, 33 items for the CO₂ Henry coefficient, 62 for the CO₂ working capacity, 15 for the H₂O Henry coefficient, and 14 for the surface area are found to explain 75% of the corresponding regression models. Furthermore, we have repeated an analogous procedure with AutoMatminer⁶⁰, which allows us to automatically train and validate a complete pipeline—feature reduction, data cleaning, and machine learning—with automatic hyperparameter tuning, without cross-validation. Results are reported in Supplementary Note 2. Models performances with the corresponding cumulative importance curves of the ruling descriptors are reported in Fig. 2. The H₂O Henry coefficient is key for the computation of the specific material energy in a MOF-based thermal energy storage plant (as discussed below in Results, Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs). Hence, for the H₂O Henry coefficient only, we also report a few experimental values corresponding to real MOFs to compare with numerical predictions of hypothetical MOFs. In particular, numerical predictions happen to follow a good agreement with the experimental values. This aspect is key for material screening purposes and is further highlighted in the Discussion section below.

**Fig. 2: Predictions and corresponding normalized cumulative curves for the coefficients of importance of the four Random-forest regression models.**

Importantly, Fig. 3 shows the SHAP rankings of the five most meaningful descriptors for each of the properties of interest. Table 2 summarizes the physicochemical meaning of the identified descriptors, based on the complete list by Choudhary et al.⁴⁷. The entire list of variables, together with their cumulative importance, the trained models, and the datasets on which they have been trained are publicly available online (see Data availability and Code availability).

**Fig. 3: The five most important features according to SHAP ranking for each of the properties of interest.**

Table 2 Physicochemical meaning of the most relevant CFID descriptors⁴⁷.

Full size table

As far as the four properties of MOF are concerned, the identification of those important descriptors represents per se an advancement of knowledge on sorption mechanism in MOFs and an important contribution to this work.

Focusing our attention on the top three descriptors in terms of importance, we notice that:

As far as the Henry coefficient for CO₂ (see Fig. 3a) is concerned, such a property happens to be mainly ruled by the volume per atom, the nearest neighbor distribution, and an elastic constant. The former descriptor is the volume of a cell divided by the number of atoms in the cell, thus mainly representing a geometrical feature of the MOF crystal. The nearest neighbor distribution can also be regarded mainly as a geometry feature strictly related to the crystallographic structure. The elastic constant is one of the elements of an averaged 6 × 6 elastic tensor of the cell. In particular, for each of the chemical elements of a cell, the elastic tensor of the solid ground state structure at temperature 0 K is known. The weighted average of those tensor entries over the chemical elements of the cell are the elastic constants found by the featurizer. To our best interpretation, the latter two quantities effectively describe the chemical environment of the crystal thus underpinning the sorbent-sorbate interaction potential.
As far as the working capacity for CO₂ (see Fig. 3b) is concerned, this quantity is mainly ruled by nearest neighbor distribution and volume per atom. Therefore, as expected, geometric features are playing a key role in determining the maximum CO₂ uptake into the crystal.
As far as the Henry coefficient for H₂O (see Fig. 3c) is concerned, we found that that quantity is mainly ruled by an elastic constant, radial distribution function, and nearest-neighbor distributions. As such, as opposed to the CO₂ case above, we can observe that descriptors more related to the chemical environment are needed this time since a polar molecule is interacting with the MOFs structure.
Finally, for the surface area (see Fig. 3d), we found that it is mainly ruled by volume per atom, packing fraction, and nearest neighbor distribution, thus showing the prominence of geometrical features similar to the case of the above working capacity.

Upon the identification of the above lists of descriptors, we have compared the performance of SL algorithms for the sorption properties of interest using both the reduced set of important descriptors (i.e., those explaining the 75% of the models predicting CO₂ Henry coefficient, CO₂ working capacity, and H₂O Henry coefficient, respectively) and a larger set of 100 descriptors composed by the previous and some additional (non-meaningful) ones.

In particular, for the aforementioned sorption properties, we have chosen the non-relevant features as the ones obtaining the least scores in the SHAP ranking. SL was adopted to find the maximum property value among a random subset of 500 samples from the original datasets (over 8000 MOFs), starting from a pool of 100 points with the lowest target property. Unexpectedly, SL optimization in the space of relevant descriptors does not ensure, in general, faster convergence of the procedure to the optimum property value (this is also confirmed by results in the synthetic case reported in Supplementary Note 1). Those six spaces (two domains for each of the three properties) are reported in Supplementary Note 9, as t-SNE projections over two components⁶¹. Furthermore, among the three methodologies examined, both Random Forest- and COMBO-based methods were able to provide always a faster convergence to the optimum value as compared to the random choice strategy. Results are shown in Fig. 4. As Ling et al. point out⁴⁹, a pure exploitative strategy—RF-MEI, K-MEI, COMBO-PI—performs better when, already in the very first steps, the model is able to predict with high accuracy the value of the property of interest; conversely, a pure explorative strategy—RF-MU, K-MU—is more proper if the optimum is somehow very different with respect to all the other candidates, while the remaining strategies are a trade-off between those two extremes—RF-MLI, K-MLI, and COMBO-EI.

Fig. 4: Number of evaluations before converging to the maximum for the SL algorithms, normalized with respect to the random choice (corresponding to 200 experiments), for three sorption properties of MOFs.

We notice that, over different regression methodologies—even with the same acquisition function—the performance ranking, in general, changes. For instance, in the case of the Henry coefficient for H₂O in Fig. 4c, among the Random Forest-based strategies, RF-MLI is the top performer, followed by RF-MEI and RF-MU, for both 15- and 100-dimensional input spaces; the very same ranking applies for Kriging based acquisition functions, K-MLI, K-MEI, and K-MU. In the case of working capacity for CO₂ in Fig. 4b, instead, among the Random Forest-based strategies, RF-MEI is the top performer, followed by RF-MU and RF-MLI; on the contrary, the same ranking does not apply for Kriging based strategies, since K-MEI is the top performer, followed by K-MLI and K-MU, for both 62- and 100-dimensional input spaces. An intermediate case is represented by the Henry coefficient for CO₂ in Fig. 4a, where the ranking of Random Forest-based strategies (RF-MLI, RF-MEI, RF-MU, both 33- and 100-dimensional input spaces), is preserved over the Kriging methodology only for the 100-dimensional space (K-MLI, K-MEI, and K-MU), but not for the 33-dimensional (K-MEI, K-MLI, and K-MU). Among COMBO-based acquisition functions, COMBO-PI and COMBO-EI always require a similar number of evaluations, except in the 100-dimensional input space of Henry coefficient for CO₂, and the 100-dimensional input space is consistently better performing than the corresponding low-dimensional one.

This brief comparison shows that no general rule is valid a priori for SL, neither in terms of regression methodology, nor in terms of acquisition function, nor in terms of dimensionality of the problem, albeit selecting relevant descriptors.

Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs

Without losing generality, here we aim at investigating the expected performance of hypothetical (i.e., computationally generated) MOFs for an important energy engineering application, namely water-sorption seasonal thermal energy storage (see also the Methods section, subsection Water-sorption thermal energy storage). The most challenging aspect of this task consists in the access to the entire isosteric field of each candidate MOF-water pair for estimating the engineering figure of merit of interest. Clearly, for a large number of MOFs candidates, this is challenging and time-consuming both computationally (typically, only the Henry low-coverage regime is reported in literature works^62,63) and experimentally⁶⁴. In this section, we specifically focus on such a challenging aspect. We envision an efficient optimization procedure that is capable of searching MOFs with the largest expected figure of merit of engineering relevance (either by SL, if materials are sequentially synthesized/computed, or by accessing readily available databases⁶⁵). As far as seasonal thermal energy storage applications are concerned, here we focus on the highest specific energy of MOF-water working pairs among the compounds reported in ref. ⁴⁶. An overview of the proposed methodology is schematically reported in Fig. 5.

Fig. 5: Suggested procedure for estimating the specific energy of hypothetical MOF-water working pairs when only an incomplete knowledge of the isosteric field is experimentally or numerically accessible.

As detailed in the Methods section (subsection Water-sorption thermal energy storage), the ideal thermodynamic cycle of a closed sorption thermal energy storage system is completely defined by four operating temperatures. In this study, we assume T_A = 308 K (the minimum temperature on the user side), T_C = 353 K (the maximum temperature on the source side), T_E = 278 K (the average winter temperature), T_F = 303 K (the average summer temperature). Those temperature values are reasonable for space heating applications in temperate climates⁶⁴. Given the Antoine equation, the equilibrium water vapor pressures p_E = 866.2 Pa at the evaporator and p_F = 4231.6 Pa at the condenser are also uniquely defined considering the average winter and summer temperatures, respectively. We decided to evaluate and maximize over the database one of the most important engineering quantities in a thermal energy storage plant, namely the cycled heat per unit of material weight. While full details on the adopted models are given in the Methods section (subsection Water-sorption thermal energy storage), in the following we report and discuss the main simplifying assumptions in our approach:

A key quantity to be estimated is the H₂O working capacity. That quantity is related to the available adsorption sites n_TOT per unit of dry sorbent mass. Boyd et al.⁴⁶ reported only the CO₂ working capacity, while no data are available on the maximum H₂O uptake. Nonetheless, we can rely on other related properties, such as the specific surface area of MOFs. To this end, we notice that Chaemchuen et al. have reported H₂O working capacity for a pool of 66 MOFs⁵. A good correlation between the water uptake and the surface area (i.e., the available internal surface per gram of dry adsorbent) can be observed for typical MOFs used in the energy engineering field, and this is also in line with results found by Xu et al.⁶⁶. In this work, on the basis of that correlation, we impose a linear regression for finding the constant of proportionality between water uptake and surface area (water uptake = η × surface area). This yields $\eta =3.875\times 10^{-4}\,\rm{g}_{\rm{H}_{2}\rm{O}}\,\rm{m}^{-2}$. More details can be found in Supplementary Note 4. We also notice that, without a loss of generality, if more accurate values of the water uptake are available (e.g., from numerical simulations of adsorption experiments) the above assumption on the uptake-internal surface correlation can be fully relaxed.
Henry coefficients $\widetilde{H}({T}_{0})$ for H₂O are listed at the reference temperature T₀ = 298 K with units of ${\rm{mol}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}$, thus representing the moles of adsorbed H₂O per kilogram of dry MOF per bar of H₂O vapor. In our approach, we adopt the Frumkin–Fawler–Guggenheim (FFG) model to estimate the adsorption isotherm over the entire range of coverages only relying upon such Henry coefficient. However, as discussed in the Methods sections (subsection Water-sorption thermal energy storage), the FFG equation requires H(T₀) in units of Pa⁻¹: we have thus converted $\widetilde{H}({T}_{0})$ (readily available from ref. ⁴⁶) to H(T₀). Let n_s, m_MOF and ${p}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}$ be the number of adsorbed water moles, the mass of the hypothetical MOF, and the pressure of water in the vapor phase, respectively, it holds:
$$\widetilde{H}({T}_{0})=\frac{{n}_{{{{\rm{s}}}}}}{{m}_{{{{\rm{MOF}}}}}{p}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}}.$$
(1)

The approximation of the low-coverage regime yields the linear relationship between the coverage and pressure, namely $\theta =H({T}_{0}){p}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}$. Since the number of adsorbed water moles is related to the molar based total number of adsorption sites as n_s = θn_TOT, it follows:
$$H({T}_{0})=\widetilde{H}({T}_{0})\frac{{{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}}{{n}_{{{{\rm{TOT}}}}}/{n}_{{{{\rm{MOF}}}}}},$$
(2)

with ${m}_{{{{\rm{MOF}}}}}={{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}{n}_{{{{\rm{MOF}}}}}$, ${{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}$ the molecular weight of the MOF, and n_MOF the total number of moles. Furthermore, the molar based total number of adsorption sites n_TOT corresponds to the maximum number of water moles ${n}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}$ that can be adsorbed, and the following relationship holds:
$$\frac{{n}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}}{{n}_{{{{\rm{MOF}}}}}}=\frac{{m}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}}{{m}_{{{{\rm{MOF}}}}}}\frac{{{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}}{{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}},$$
(3)

where ${{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}$ is the molecular weight of water and ${m}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}$ denotes the maximum mass of water that can be adsorbed. The ratio ${m}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}/{m}_{{{{\rm{MOF}}}}}$ is related to the H₂O working capacity of the MOF and it is equal to ηS, where η is the constant of proportionality between the uptake and the surface area S. A comparison of Equations (3) and (2) yields:
$$H({T}_{0})=\widetilde{H}({T}_{0})\frac{{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}}{\eta S}\times 1{0}^{-8},$$
(4)

which is the final conversion formula of the Henry coefficient for H₂O from measure units of ${{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}$ into Pa⁻¹. Here, the factor 10⁻⁸ appears because $[{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}]=\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{mol}}}}}_{{{{\rm{H2O}}}}}^{-1}$, $[\eta ]=\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{m}}}}}^{-2}$, $[S]=\,{{{{\rm{m}}}}}^{2}{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}^{-1}$, and so $[\widetilde{H}({T}_{0}){{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}/(\eta S)]=\,{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}$.
A crucial quantity for heat transformation is the isosteric heat of adsorption q_st. Due to the Clausius–Clapeyron relationship (see Equation (7) in the Methods section, subsection Water-sorption thermal energy storage), at least two adsorption isotherm curves (at T_A and at T_C) are needed to estimate the corresponding q_st. In our database, though, the Henry coefficients are only available at T₀. In order to reconstruct a second adsorption isotherm for the same MOF-water working pair, we decided to resort to the potential theory of Polanyi, thus exploiting the basic notion that all adsorption isotherms are self-similar when rescaled with respect to the Polanyi potential function. Details are provided in the Supplementary Note 5. More specifically, the Polanyi potential is defined as:
$${{{\mathcal{A}}}}=-RT\ln \left(\frac{{p}_{{{{\rm{s}}}}}(T)}{p}\right),$$
(5)

where p_s(T) is the saturation pressure of water at temperature T, while p is the pressure of the vapor phase on the adsorbent surface⁶⁷. Since, at a given pressure p, the Polanyi potential is a constant of the sorption pair, we have computed ${{{\mathcal{A}}}}$ at T₀ in a range from 10⁻⁴ Pa up to p_s(T₀) = 3157 Pa (Antoine equation for water); then, in the θ − p chart, we have rescaled the abscissa p of the isotherm obtained at the temperature T₀ according to $p={p}_{{{{\rm{s}}}}}(T)\exp \left({{{\mathcal{A}}}}/(RT)\right)$, for getting the new curves at T_A and T_C. We have finally computed the isosteric heat by means of the Clausius–Clapeyron relationship:
$${q}_{{{{\rm{st}}}}}=\frac{R}{3}\frac{{T}_{{{{\rm{C}}}}}{T}_{{{{\rm{A}}}}}}{{T}_{{{{\rm{C}}}}}-{T}_{{{{\rm{A}}}}}}\mathop{\sum }\limits_{i=1}^{3}\ln \frac{{p}_{2}({\theta }_{i})}{{p}_{1}({\theta }_{i})},$$
(6)

where the points 1 and 2 represent the intersections of an isosteric transformation with the two isotherms respectively at T_A and T_C. We have repeated the procedure for three coverage values (i.e., θ₁ = 0.4, θ₂ = 0.5, θ₃ = 0.6) and averaged them.
Finally, upon determination of the low- and high-temperature adsorption isotherms curves at T_A and T_C, the coverage span Δθ during the discharge phase can be determined as detailed in the Methods section, subsection Water-sorption thermal energy storage. As for the estimate of the isosteric heat, this requires for all the compounds the rescaling of the horizontal axis in the θ − p chart of the isotherm obtained at temperature T₀ according to the Polanyi potential theory.

We have thus computed the following objective function Sq_stΔθ, which recovers the cycled heat up to a constant, over the entire list of 5028 potential MOFs with positive surface area in ref. ⁴⁶. The MOF with the highest predicted specific energy turned out to be the compound with the chemical formula C₉₆H₄₈O₂₈N₈V₄ and referred to as “str_m5_o18_o28_sra_sym.72” in the database (see also the molecular rendering in Fig. 6b), with Henry coefficient at 298 K of $6110.54\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}$ (or equivalently, 6.89 × 10⁻⁴ Pa⁻¹) and surface area $S=4118.79\,{{{{\rm{m}}}}}^{2}\,{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}^{-1}$. Figure 7 shows also the ideal expected thermodynamic cycle related to this optimal potential MOF.

**Fig. 6: Hypothetical MOFs ranked with respect to the specific energy.**

**Fig. 7: Adsorption/desorption based thermal energy storage cycle for the potential MOF ‘str_m5_o18_o28_sra_sym.72’ with water, in the coverage-pressure plane.**

We observe a coverage span Δθ = 0.653, an isosteric heat ${q}_{{{{\rm{st}}}}}=48.95\,{{{\rm{kJ}}}}\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}^{-1}$ with an objective function value of $1.32\times 1{0}^{5}\,{{{\rm{kJ}}}}\,{{{{\rm{m}}}}}^{2}\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}^{-1}\,{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}^{-1}$. That quantity can be directly related to specific energy: upon multiplication by the constant $\eta /{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}$ ($\eta =3.875\times 1{0}^{-4}\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{m}}}}}^{-2}$, ${{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}=18.02\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}^{-1}$), we obtain a value of $2.83\times 1{0}^{-3}\,{{{\rm{GJ}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}$. Furthermore, we can compute the theoretical density ρ_MOF of the crystal knowing the mass of the cell (1965.21 u = 3.263 × 10⁻²¹ g_MOF, as from the database) and its volume (5.335 × 10⁻²¹ cm³, as from the CIF file), leading to ρ_MOF = 0.612 g_MOFcm⁻³. As a result, the volume-based energy density turns out to be 1.73 GJm⁻³. For the sake of comparison, Fig. 6a shows a 2D map where the two axes represent the two most important descriptors according to the SHAP ranking for the Henry coefficient of H₂O: the four top-performing potential MOFs are highlighted and the corresponding crystallographic cells are depicted in Fig. 6b. Moreover, Table 3 shows the ten most performing potential MOFs ranked in terms of specific energy. Interestingly, those compounds are all Vanadium-based, mostly due to values of the Henry coefficient for H₂O in the optimal range, which produces a good coverage span Δθ over the thermodynamic cycle.

Table 3 Top ten potential MOFs in terms of specific energy from the database by Boyd et al.⁴⁶.

Full size table

As depicted in Fig. 8 the four top-performing hypothetical MOFs are predicted to have (material-based) specific energy values among the highest available in the literature for sorption thermal energy storage under similar operating conditions. A more extensive comparison is shown in Supplementary Fig. 11, where estimates of the specific energy for real MOFs by means of the FFG model are also reported.

**Fig. 8: Comparison between the expected specific energy for several materials.**

Discussion

Sequential learning (SL) algorithms can in principle dramatically reduce the number of evaluations needed for finding the optimum of an unknown function as compared with a naive random choice and, as such, they are emerging also as effective tools for material optimization and discovery. In this work, focusing on metal-organic frameworks (MOFs) and some of their crucial adsorption properties (both with H₂O and CO₂ as sorbate fluids), we have addressed a number of critical aspects related to the discovery of the minimal set of important crystallographic descriptors for SL-based optimization algorithms. We have shown that the general protocol for sorting out the minimal set of ruling descriptors for a given adsorption property is based on two steps: (i) construction and training of the machine learning (ML) model which identifies the number of ruling descriptors; (ii) evaluation of the relative importance of each explanatory variable on the chosen output by the SHAP analysis. We found that, as long as the set of such ruling descriptors (for a given property of interest) is included among the exploration space features, convergence performance is not affected, although the computational burden of an SL algorithm also depends on the dimension of the parameter space to be explored: taking into account only the most relevant features may be in fact beneficial in that respect. Furthermore, based on the several examples provided here (i.e., Henry coefficient for CO₂, working capacity for CO₂, and Henry coefficient for H₂O as well as the synthetic example discussed in the Supplementary Note 1), we have consistently noticed that the COMBO algorithm always performs better than random guessing.

Furthermore, we recognize that full access to the adsorption properties of hypothetical MOFs in the entire coverage regime (as requested in important applications of engineering relevance) is very challenging both experimentally and computationally. This holds particularly for water-MOFs working pairs, that are promising for a number of energy applications. Hence, we formulate a general and efficient computational screening procedure of hypothetical MOFs which, only relying upon the adsorption properties reported in Fig. 2, is capable to estimate important figures of merit for sorption-based seasonal thermal energy storage. Remarkably, our procedure suggests that some of the MOFs hypothesized in the database by ref. ⁴⁶ (developed for completely different purposes) can possibly outperform most of the state-of-the-art water-sorbent compounds. Interestingly, those compounds are all Vanadium-based, mostly due to values of the Henry coefficient for H₂O in the optimal range, causing a good coverage span over the thermodynamic cycle. It is worth noticing that the above MOFs screening for thermal energy storage applications critically relies upon the prediction of the Henry coefficient for water.

Therefore, for the latter property only, we have also considered a number of real MOFs and their reported experimental values have been compared with the corresponding numerical values. It is worth stressing that finding both the experimentally evaluated property values and CIF file from the same publication is a non-trivial task, sometimes possibly leading to not necessarily consistent data. Furthermore, even for the three MOFs (i.e., MOF-801-SC, MOF-808, and MOF-841) considered in this study with experimental and computational data extracted from the same reference source, there is no guarantee that the tested material corresponds perfectly to the related CIF file, and discrepancies can always occur as demonstrated by the same Authors of ref. ⁶⁸. Moreover, the CIF files needed to extract the descriptors are always very ideal if compared with the experimentally tested crystals. Possible defects in real compounds can be related to any chemical changes, leading to variation in the hydrophilic nature of the material^69,70,71. Nonetheless, we observed that, while discrepancies can be found, the ML-based predictor of the Henry coefficient consistently shows a good agreement with the experimental values. Importantly, the numerical predictions based on the identified descriptors can selectively distinguish among MOFs with higher or lower values of the Henry coefficient. We believe that the above results represent an important step toward efficient MOFs screening and optimization, not only with respect to intrinsic materials properties but also (and importantly) with respect to figures of merit of engineering relevance for applications such as thermally driven water harvesting from the air, water-sorption thermal energy storage, and solar cooling.

Clearly, we are also aware that our approach is based on a number of approximations and it still requires additional research activities. First, we notice that a large set of hypothetical MOFs may be characterized by properties (e.g., Henry coefficients) whose values span several orders of magnitude. Hence a unique ML model, as used in this pipeline, may achieve a high coefficient of determination if its logarithm is considered. Nonetheless, the computation of the coverage span Δθ depends directly on the Henry coefficient, which may thus be affected by a relatively high error. Furthermore, additional simplifying assumptions that have been used in our approach include fixed parameters such as the constant of proportionality between the specific surface area and the water working capacity, as well as the steepness coefficient β in the FFG model. Without losing generality, those assumptions could be relaxed in the near future relying on more sophisticated models. One possible way to cope with those challenges (not necessarily the only possible strategy) may be a preliminary classification of the hypothetical MOFs based on properly trained ML classifiers, with the purpose of assigning a given compound of interest to a specific category (e.g., the set of MOFs with Henry coefficient of similar magnitude, similar β, etc.). Afterward, property predictions on each MOF category can be possibly performed with higher accuracy.

Methods

Water-sorption thermal energy storage

In this work, we focus on the use of potential MOFs for water-sorption-based thermal storage applications.

Physical adsorption processes are based on weak and reversible interactions between the (solid) sorbent material and the corresponding adsorbate, i.e., the fluid⁷². Those phenomena are relevant to thermal energy engineering as sorption/desorption in solid sorbents can be accompanied by a significant amount of energy exchange. In the following, the solid sorbents are MOFs, while water is the adsorbate.

To allow desorption of an infinitesimal number (dn) of moles of adsorbate from the adsorbent surface, a given amount of heat dQ = q_st dn has to be provided to the system, where q_st (with units of kJ/mol) denotes the isosteric heat. Since the process is reversible, the same amount of heat dQ is released by the dry sorbent when dn moles of fluid at a pressure p, initially in the vapor phase, are adsorbed. Furthermore, we define load X as the ratio between the mass of adsorbate and the mass of dry sorbent. A process characterized by constant load X is referred to as an isosteric transformation and the popular Clausius–Clapeyron relationship yields:

$${\left(\frac{\partial \ln p}{\partial \left(-\frac{1}{T}\right)}\right)}_{X}=\frac{{q}_{{{{\rm{st}}}}}}{R},$$

(7)

where T is the absolute temperature and R = 8.314 J mol⁻¹K⁻¹ denotes the gas constant⁷³. Therefore, an isosteric transformation in the Clapeyron chart ($\ln p$ vs − 1/T) is a curve with local slope q_st/R. Similarly, the adsorbate isosteric curve has a slope ΔH_(vap)/R, where ΔH_(vap) is the molar enthalpy for liquid-vapor phase change of the adsorbate.

Closely related to load X, the coverage θ is defined as the ratio between the number of already occupied adsorption sites n_s and the total available number of sites n_TOT. At equilibrium and at a given temperature T, the coverage θ depends on the pressure p of the vapor phase according to an adsorption isotherm, whose shape depends on the sorbent/adsorbate pair. MOFs/water pairs are known to show typical S-shaped isotherms in the θ − p chart (i.e., type V of the IUPAC classification⁷⁴). Thus, in this work, we make the assumption that the Frumkin–Fawler–Guggenheim (FFG) model can be used conveniently for describing analytically the MOFs-water adsorption isotherms:

$$\theta =\frac{H(T)p\exp (\beta \theta )}{1+H(T)p\exp (\beta \theta )},$$

(8)

where $\beta =\frac{\overline{n}{E}_{{{{\rm{p}}}}}}{RT}$ rules the steepness of the S-shape, $\overline{n}$ denotes the neighboring binding sites and E_p represents the additional binding energy due to lateral interactions⁶⁷. We have used the FFG model to interpret eight experimental isotherms of real MOF-water pairs and achieve a proper choice of β. In particular, for each curve, we have identified the best value of β in terms of a least-squares approach; then, we have taken the mean over those eight values, ending up with β = 3.4. More details can be found in Supplementary Note 6. Finally, H(T) is the Henry coefficient for the specific sorbent/adsorbate pair (with units Pa⁻¹) at a given absolute temperature T.

A schematic of a closed water-sorption thermal energy storage system is shown in Fig. 9. These systems are based on a reactor, containing the solid sorbent, connected with a condenser/evaporator by means of a valve⁵⁷. Such chemical apparatus follows a seasonal closed-cycle completely defined by four temperatures: T_A (the minimum temperature on the user side), T_C (the maximum temperature on the source side), T_E (the average winter temperature), and T_F (the average summer temperature). The two pressures p_E (evaporator) and p_F (condenser) are related to the absolute temperatures T_E and T_F, respectively, through the Antoine equation for water saturation ${p}_{{{{\rm{E}}}},{{{\rm{F}}}}}=133.2\times 1{0}^{A-B/(C+{T}_{{{{\rm{E}}}},{{{\rm{F}}}}}-273)}$, where A = 8.07131, B = 1730.63, C = 233.426⁷⁵. Hence, the ideal thermodynamic cycle of a closed thermal energy storage process (see Fig. 10) is based on the following four steps:

1.
The sorbent/adsorbate is heated isosterically up to a temperature T_B, corresponding to a pressure p_F in the condenser (line AB).
2.
Heating of the pair continues at constant pressure p_F and desorbed vapor flows to the condenser through the opened valve. In the condenser, the adsorbate rejects the condensation heat into the environment while condensing until the maximum temperature of the heat source T_C is reached (line BC). The condition of the minimum load is reached and the valve gets closed.
3.
Keeping the valve closed, the system in contact with the environmental temperature cools isosterically during the storage period (line CD) down to temperature T_D, corresponding to the evaporator pressure p_E.
4.
During the discharge phase of the heat storage system, the valve is opened to let the adsorbate evaporate and reach the reactor. During this isobaric transformation (line DA), the heat of adsorption Q_DA, also known as cycled heat, is released.

**Fig. 9: Schematic of a closed water-sorption thermal energy storage system.**

**Fig. 10: Schematics of ideal adsorption/desorption thermal energy storage cycle.**

One of the most important figures of merit for energy storage systems is the specific stored energy, namely the maximum energy that can be stored per unit of mass of the plant or of the material⁷⁶. Clearly, at fixed material (or plant) mass, the higher the cycled heat the higher the specific energy of the storage system. In this view, the choice of the solid sorbent material for a given adsorbate is key for maximizing the cycled heat in the ideal thermodynamic cycle. We thus perform below a material screening aiming at the maximum value of the following cycled heat (i.e., the released heat during the DA process in Fig. 10):

$${Q}_{{{{\rm{DA}}}}}=\int\nolimits_{{{{\rm{D}}}}}^{{{{\rm{A}}}}}{q}_{{{{\rm{st}}}}}\,{{{\rm{d}}}}{n}_{{{{\rm{s}}}}}=\int\nolimits_{{{{\rm{D}}}}}^{{{{\rm{A}}}}}{n}_{{{{\rm{TOT}}}}}{q}_{{{{\rm{st}}}}}\,{{{\rm{d}}}}\theta \approx {n}_{{{{\rm{TOT}}}}}{q}_{{{{\rm{st}}}}}{{\Delta }}\theta ,$$

(9)

where we have used the definition of the heat of adsorption dQ = q_st dn_s, coverage θ = n_s/n_TOT and the approximation ${q}_{{{{\rm{st}}}}}\approx {{{\rm{const}}}}$.

Sequential learning

Typical steps in any SL algorithm consist in (i) constructing a regression model over known data, (ii) using a strategy to suggest the best-unmeasured point to test, (iii) enlarging the known dataset with this tested point, and (iv) iterating up to the tested candidate meets the needed specification. Let ${{{\mathcal{D}}}}=\{({{{{\bf{x}}}}}_{1},{y}_{1}),\ldots ,({{{{\bf{x}}}}}_{n},{y}_{n})\}$ denote a set of n training data, where ${{{{\bf{x}}}}}_{i}\in {{\mathbb{R}}}^{d}$ and ${y}_{i}\in {\mathbb{R}}$ represent the i-th vector of descriptors and its known response, respectively. Let {x_n+1, …, x_m} denote the (m − n) d-dimensional arrays of descriptors with unknown responses {y_n+1, …, y_m}. To find the location x^* of the maximum y^*, we would need the exact model y = f(x). However, given the restricted set ${{{\mathcal{D}}}}$ of training data, only a surrogate model $y=\hat{f}({{{\bf{x}}}}| {{{\mathcal{D}}}})$ can be constructed. Hence, for each unmeasured point i = n + 1, …, m, different regression methodologies (here FUELS-Random Forest, kriging and COMBO-Gaussian processes) can be used to estimate the response $\hat{f}({{{{\bf{x}}}}}_{i})$ in terms of a mean value μ(x_i) and the corresponding uncertainty σ(x_i), indicating the robustness of the prediction. To measure the performance of any combination regression model/query strategy, we put ourselves in the practitioner’s perspective, who is interested in a unique sequence of points to be tested, and not in an average over more paths (as, for instance, shown in ref. ⁴⁹). To achieve this, for those regression models not allowing a deterministic prediction (i.e., Random Forest and COMBO), at each step we have repeated 100 times the choice of the next point to query, picking the most preferred one. A comprehensive comparison of the above methodologies is reported in the Results (subsection Descriptors of sorption properties in MOFs and their use in SL algorithms), and the complete details on the adopted algorithms can be found in the Supplementary Notes 7 and 8.

Model training and choice of the descriptors

The first issue to be addressed when applying SL to material optimization is computation and selection of relevant features (or descriptors). The descriptor issue is critical in materials science^77,78 as well as in other computational fields⁷⁹. In this work, we first investigate to which extent the choice of a minimal set of relevant descriptors is critical for the fast convergence of SL algorithms.

To this end, before even implementing SL procedures, we decided to perform a preliminary feature pruning for discovering the most meaningful ones in terms of the target property. We use the entire dataset (both descriptors and target property) to train and validate a Random Forest-based pipeline—feature reduction and machine learning—for regression, with hyperparameter tuning in five fold cross-validation. More details can be found in Supplementary Note 10.

Upon model training and validation, we detect the most important features thanks to the Tree SHAP algorithm, which is optimized for tree-based models such as Random Forest^48,59, thus quantifying to which extent a given feature impacts the output. The latter methodology is based on the classical Shapley value, which has in game theory its original field of application. There, the problem of assigning, in a cooperative game, a proportional reward to each player is addressed based on the real contribution provided to the common objective of the coalition. In a model, given F the set of all features and its generic subset S ⊆ F, the importance of the i-th descriptor depends on the comparison between the model f_S∪{i} trained with that explanatory variable, and another model f_S trained without that feature; then, the difference between the predictions f_S∪{i}(x_S∪{i}) − f_S(x_S) is computed, where x_S∪{i} and x_S represent respectively the values of the input feature over the subsets S ∪ {i} and S. This difference is weighted over all possible subsets S and the importance value of the i-th feature turns out to be

$${\phi }_{i}=\mathop{\sum}\limits_{S\subseteq F\setminus \{i\}}\frac{| S| !(| F| -| S| -1)!}{| F| !}\left({f}_{S\cup \{i\}}({{{{\bf{x}}}}}_{S\cup \{i\}})-{f}_{S}({{{{\bf{x}}}}}_{S})\right),$$

(10)

where ∣⋅∣ denotes the number of elements. Because of the huge number of possible descriptor subsets S of a set F, the classical Shapley values of Equation (10) are, in general, computationally challenging. Nonetheless, the Tree SHAP realization we have employed in this work is able to explain efficiently tree-based models, such as Random Forest.

Data availability

Datasets and trained models of this study are available in Zenodo (DOI:10.5281/zenodo.6351366)⁸⁰.

Code availability

The codes used to obtain the results of this study are publicly available at https://github.com/giotre/MOFs.

References

Kitagawa, S. et al. Metal–organic frameworks (mofs). Chem. Soc. Rev. 43, 5415–5418 (2014).
Article Google Scholar
Adil, K. et al. Gas/vapour separation using ultra-microporous metal–organic frameworks: insights into the structure/separation relationship. Chem. Soc. Rev. 46, 3402–3430 (2017).
Article CAS Google Scholar
Rogge, S. M. et al. Metal–organic and covalent organic frameworks as single-site catalysts. Chem. Soc. Rev. 46, 3134–3184 (2017).
Article CAS Google Scholar
Wuttke, S., Lismont, M., Escudero, A., Rungtaweevoranit, B. & Parak, W. J. Positioning metal-organic framework nanoparticles within the context of drug delivery–a comparison with mesoporous silica nanoparticles and dendrimers. Biomaterials 123, 172–183 (2017).
Article CAS Google Scholar
Chaemchuen, S., Xiao, X., Klomkliang, N., Yusubov, M. S. & Verpoort, F. Tunable metal–organic frameworks for heat transformation applications. Nanomaterials 8, 661 (2018).
Article CAS Google Scholar
de Lange, M. F., Verouden, K. J., Vlugt, T. J., Gascon, J. & Kapteijn, F. Adsorption-driven heat pumps: the potential of metal–organic frameworks. Chem. Rev. 115, 12205–12250 (2015).
Article CAS Google Scholar
Chen, S., Lucier, B. E., Boyle, P. D. & Huang, Y. Understanding the fascinating origins of co2 adsorption and dynamics in mofs. Chem. Mater. 28, 5829–5846 (2016).
Article CAS Google Scholar
Kim, H. et al. Adsorption-based atmospheric water harvesting device for arid climates. Nat. Commun. 9, 1–8 (2018).
CAS Google Scholar
Kalmutzki, M. J., Diercks, C. S. & Yaghi, O. M. Metal–organic frameworks for water harvesting from air. Adv. Mater. 30, 1704304 (2018).
Article CAS Google Scholar
Ejeian, M. & Wang, R. Adsorption-based atmospheric water harvesting. Joule 5, 1678–1703 (2021).
Article Google Scholar
Lee, S. et al. Computational screening of trillions of metal–organic frameworks for high-performance methane storage. ACS Appl. Mater. Interfaces 13, 23647–23654 (2021).
Li, A., Bueno-Perez, R., Wiggin, S. & Fairen-Jimenez, D. Enabling efficient exploration of metal–organic frameworks in the cambridge structural database. CrystEngComm 22, 7152–7161 (2020).
Article CAS Google Scholar
Borboudakis, G. et al. Chemically intuited, large-scale screening of mofs by machine learning techniques. Npj Comput. Mater. 3, 1–7 (2017).
CAS Google Scholar
Anderson, R., Rodgers, J., Argueta, E., Biong, A. & Gómez-Gualdrón, D. A. Role of pore chemistry and topology in the co2 capture capabilities of mofs: from molecular simulation to machine learning. Chem. Mater. 30, 6325–6337 (2018).
Article CAS Google Scholar
Moghadam, P. Z. et al. Structure-mechanical stability relations of metal-organic frameworks via machine learning. Matter 1, 219–234 (2019).
Article Google Scholar
Zhou, M., Vassallo, A. & Wu, J. Toward the inverse design of mof membranes for efficient d2/h2 separation by combination of physics-based and data-driven modeling. J. Membr. Sci. 598, 117675 (2020).
Article CAS Google Scholar
Yan, Y. et al. Machine learning and in-silico screening of metal–organic frameworks for o2/n2 dynamic adsorption and separation. Chem. Eng. J. 427, 131604 (2022).
Article CAS Google Scholar
Rampal, N. et al. The development of a comprehensive toolbox based on multi-level, high-throughput screening of mofs for co/n 2 separations. Chem. Sci. 12, 12068–12081 (2021).
Article CAS Google Scholar
Avci, G., Erucar, I. & Keskin, S. Do new mofs perform better for co2 capture and h2 purification? computational screening of the updated mof database. ACS Appl. Mater. Interfaces 12, 41567–41579 (2020).
Article CAS Google Scholar
Halder, P. & Singh, J. K. High-throughput screening of metal–organic frameworks for ethane–ethylene separation using the machine learning technique. Energy Fuels 34, 14591–14597 (2020).
Article CAS Google Scholar
Yang, W. et al. Computational screening of metal–organic framework membranes for the separation of 15 gas mixtures. Nanomaterials 9, 467 (2019).
Article CAS Google Scholar
Qiao, Z. et al. Molecular fingerprint and machine learning to accelerate design of high-performance homochiral metal–organic frameworks. AIChE J. 67, e17352 (2021).
Article CAS Google Scholar
Li, S., Chung, Y. G. & Snurr, R. Q. High-throughput screening of metal–organic frameworks for co2 capture in the presence of water. Langmuir 32, 10368–10376 (2016).
Article CAS Google Scholar
Pardakhti, M., Moharreri, E., Wanik, D., Suib, S. L. & Srivastava, R. Machine learning using combined structural and chemical descriptors for prediction of methane adsorption performance of metal organic frameworks (mofs). ACS Comb. Sci. 19, 640–645 (2017).
Article CAS Google Scholar
Bobbitt, N. S. & Snurr, R. Q. Molecular modelling and machine learning for high-throughput screening of metal-organic frameworks for hydrogen storage. Mol. Simul. 45, 1069–1081 (2019).
Article CAS Google Scholar
Qiao, Z., Xu, Q., Cheetham, A. K. & Jiang, J. High-throughput computational screening of metal–organic frameworks for thiol capture. J. Phys. Chem. C. 121, 22208–22215 (2017).
Article CAS Google Scholar
Liang, H. et al. Combining large-scale screening and machine learning to predict the metal-organic frameworks for organosulfurs removal from high-sour natural gas. APL Mater. 7, 091101 (2019).
Article CAS Google Scholar
Yang, P. et al. Analyzing acetylene adsorption of metal–organic frameworks based on machine learning. Green Energy Environ. https://doi.org/10.1016/j.gee.2021.01.006 (2021).
Dureckova, H., Krykunov, M., Aghaji, M. Z. & Woo, T. K. Robust machine learning models for predicting high co2 working capacity and co2/h2 selectivity of gas adsorption in metal organic frameworks for precombustion carbon capture. J. Phys. Chem. C. 123, 4133–4139 (2019).
Article CAS Google Scholar
Liu, Z. et al. Predicting adsorption and separation performance indicators of xe/kr in metal-organic frameworks via a precursor-based neural network model. Chem. Eng. Sci. 243, 116772 (2021).
Article CAS Google Scholar
Ma, P. et al. Computer-assisted design for stable and porous metal-organic framework (mof) as a carrier for curcumin delivery. LWT 120, 108949 (2020).
Article CAS Google Scholar
Du, Z. et al. A high-throughput computational screening of potential adsorbents for a thermal compression co2 brayton cycle. Sustain. Energy Fuels 5, 1415–1428 (2021).
Article CAS Google Scholar
Long, R. et al. Screening metal-organic frameworks for adsorption-driven osmotic heat engines via grand canonical monte carlo simulations and machine learning. iScience 24, 101914 (2021).
Article CAS Google Scholar
Shi, Z. et al. Machine learning and in silico discovery of metal-organic frameworks: Methanol as a working fluid in adsorption-driven heat pumps and chillers. Chem. Eng. Sci. 214, 115430 (2020).
Article CAS Google Scholar
Shi, Z. et al. Techno-economic analysis of metal–organic frameworks for adsorption heat pumps/chillers: from directional computational screening, machine learning to experiment. J. Mater. Chem. A 9, 7656–7666 (2021).
Article CAS Google Scholar
García, E. J., Bahamon, D. & Vega, L. F. Systematic search of suitable metal–organic frameworks for thermal energy-storage applications with low global warming potential refrigerants. ACS Sustain. Chem. Eng. 9, 3157–3171 (2021).
Article CAS Google Scholar
Liang, Q. et al. Benchmarking the performance of bayesian optimization across multiple experimental materials science domains. Npj Comput. Mater. 7, 1–10 (2021).
Article CAS Google Scholar
Kim, Y., Kim, E., Antono, E., Meredig, B. & Ling, J. Machine-learned metrics for predicting the likelihood of success in materials discovery. Npj Comput. Mater. 6, 1–9 (2020).
Article CAS Google Scholar
Brochu, E., Cora, V. M. & De Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Preprint at https://arxiv.org/abs/1012.2599 (2010).
Ahmadi, M., Vogt, M., Iyer, P., Bajorath, J. & Fröhlich, H. Predicting potent compounds via model-based global optimization. J. Chem. Inf. Model. 53, 553–559 (2013).
Article CAS Google Scholar
Rohr, B. et al. Benchmarking the acceleration of materials discovery by sequential learning. Chem. Sci. 11, 2696–2706 (2020).
Article Google Scholar
Aggarwal, R., Demkowicz, M. & Marzouk, Y. in Information Science for Materials Discovery and Design (eds Alexander, F. J., Rajan, K. & Lookman, T.) Ch. 2 (Springer, 2016).
Seko, A., Maekawa, T., Tsuda, K. & Tanaka, I. Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single-and binary-component solids. Phys. Rev. B 89, 054303 (2014).
Article CAS Google Scholar
Kiyohara, S., Oda, H., Tsuda, K. & Mizoguchi, T. Acceleration of stable interface structure searching using a kriging approach. Jpn. J. Appl. Phys. 55, 045502 (2016).
Article CAS Google Scholar
Dehghannasiri, R. et al. Optimal experimental design for materials discovery. Comput. Mater. Sci. 129, 311–322 (2017).
Article CAS Google Scholar
Boyd, P. G. et al. Data-driven design of metal–organic frameworks for wet flue gas co 2 capture. Nature 576, 253–256 (2019).
Article CAS Google Scholar
Choudhary, K., DeCost, B. & Tavazza, F. Machine learning with force-field-inspired descriptors for materials: Fast screening and mapping energy landscape. Phys. Rev. Mater. 2, 083801 (2018).
Article CAS Google Scholar
Lundberg, S. M. et al. From local explanations to global understanding with explainable ai for trees. Nat. Mach. Intell. 2, 2522–5839 (2020).
Article Google Scholar
Ling, J., Hutchinson, M., Antono, E., Paradiso, S. & Meredig, B. High-dimensional materials and process optimization using data-driven experimental design with well-calibrated uncertainty estimates. Integr. Mater. Manuf. Innov. 6, 207–217 (2017).
Article Google Scholar
Lophaven, S. N., Nielsen, H. B., Sondergaard, J. & Dace, A. A Matlab Kriging Toolbox. Report No. IMMTR-2002 12 (Technical University of Denmark, 2002).
Ueno, T., Rhone, T. D., Hou, Z., Mizoguchi, T. & Tsuda, K. Combo: an efficient bayesian optimization library for materials science. Mater. Discov. 4, 18–21 (2016).
Article Google Scholar
Fasano, M. et al. Water/ethanol and 13x zeolite pairs for long-term thermal energy storage at ambient pressure. Front. Energy Res. 7, 148 (2019).
Article Google Scholar
Fasano, M., Bevilacqua, A., Chiavazzo, E., Humplik, T. & Asinari, P. Mechanistic correlation between water infiltration and framework hydrophilicity in mfi zeolites. Sci. Rep. 9, 1–12 (2019).
Article CAS Google Scholar
Anstine, D. M., Tang, D., Sholl, D. S. & Colina, C. M. Adsorption space for microporous polymers with diverse adsorbate species. Npj Comput. Mater. 7, 1–9 (2021).
CAS Google Scholar
Neri, M., Chiavazzo, E. & Mongibello, L. Numerical simulation and validation of commercial hot water tanks integrated with phase change material-based storage units. J. Energy Storage 32, 101938 (2020).
Article Google Scholar
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
Article Google Scholar
Fasano, M., Borri, D., Chiavazzo, E. & Asinari, P. Protocols for atomistic modeling of water uptake into zeolite crystals for thermal storage and other applications. Appl. Therm. Eng. 101, 762–769 (2016).
Article CAS Google Scholar
Fasano, M. et al. Atomistic modelling of water transport and adsorption mechanisms in silicoaluminophosphate for thermal energy storage. Appl. Therm. Eng. 160, 114075 (2019).
Article CAS Google Scholar
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4768–4777 (2017).
Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm. Npj Comput. Mater. 6, 1–10 (2020).
Google Scholar
Van der Maaten, L. & Hinton, G. Visualizing data using t-sne.J. Mach. Learn. Res. 9, 2579−2605 (2008).
Wu, X., Xiang, S., Su, J. & Cai, W. Understanding quantitative relationship between methane storage capacities and characteristic properties of metal–organic frameworks based on machine learning. J. Phys. Chem. C. 123, 8550–8559 (2019).
Article CAS Google Scholar
Yu, X., Choi, S., Tang, D., Medford, A. J. & Sholl, D. S. Efficient models for predicting temperature-dependent henry’s constants and adsorption selectivities for diverse collections of molecules in metal–organic frameworks. J. Phys. Chem. C. 125, 18046–18057 (2021).
Article CAS Google Scholar
Lavagna, L. et al. Cementitious composite materials for thermal energy storage applications: a preliminary characterization and theoretical analysis. Sci. Rep. 10, 1–13 (2020).
Article CAS Google Scholar
Talirz, L. et al. Materials cloud, a platform for open computational science. Sci. Data 7, 1–12 (2020).
Article Google Scholar
Xu, M., Liu, Z., Huai, X., Lou, L. & Guo, J. Screening of metal–organic frameworks for water adsorption heat transformation using structure–property relationships. RSC Adv. 10, 34621–34631 (2020).
Article CAS Google Scholar
Butt, H.-J., Graf, K. & Kappl, M. Physics and Chemistry of Interfaces (John Wiley & Sons, 2013).
Furukawa, H. et al. Water adsorption in porous metal–organic frameworks and related materials. J. Am. Chem. Soc. 136, 4369–4381 (2014).
Article CAS Google Scholar
Ni, L. et al. Defect-engineered uio-66-nh 2 modified thin film nanocomposite membrane with enhanced nanofiltration performance. Chem. Commun. 56, 8372–8375 (2020).
Article CAS Google Scholar
Huang, Y. et al. Tuning the wettability of metal–organic frameworks via defect engineering for efficient oil/water separation. ACS Appl. Mater. Interfaces 12, 34413–34422 (2020).
Article CAS Google Scholar
Xiang, W., Zhang, Y., Chen, Y., Liu, C.-j. & Tu, X. Synthesis, characterization and application of defective metal–organic frameworks: current status and perspectives. J. Mater. Chem. A 8, 21526–21546 (2020).
Article CAS Google Scholar
Ruthven, D. M. Principles of Adsorption and Adsorption Processes (John Wiley & Sons, 1984).
Schmidt, F. P. Optimizing Adsorbents for Heat Storage Applications: Estimation of Thermodynamic Limits and Monte Carlo Simulations of Water Adsorption in Nanopores. Ph.D. thesis, Fakultät für Mathematik und Physik, Universität Freiburg (2004).
Sing, K. S. Reporting physisorption data for gas/solid systems with special reference to the determination of surface area and porosity (recommendations 1984). Pure Appl. Chem. 57, 603–619 (1985).
Article CAS Google Scholar
Speight, J. G. et al. Lange’s Handbook of Chemistry Vol. 1 (McGraw-Hill, 2005).
Dincer, I. & Rosen, M. A.Thermal Energy Storage Systems and Applications (John Wiley & Sons, 2021).
Ghiringhelli, L. M., Vybiral, J., Levchenko, S. V., Draxl, C. & Scheffler, M. Big data of materials science: critical role of the descriptor. Phys. Rev. Lett. 114, 105503 (2015).
Article CAS Google Scholar
Gomes, S. I. et al. Machine learning and materials modelling interpretation of in vivo toxicological response to tio 2 nanoparticles library (uv and non-uv exposure). Nanoscale 13, 14666–14678 (2021).
Article CAS Google Scholar
Chiavazzo, E. et al. Intrinsic map dynamics exploration for uncharted effective free-energy landscapes. Proc. Natl Acad. Sci. USA 114, E5494–E5503 (2017).
Article CAS Google Scholar
Trezza, G., Bergamasco, L., Fasano, M. & Chiavazzo, E. Models and datasets for “Minimal set of crystallographic descriptors for sorption properties in hypothetical Metal Organic Frameworks and their role in sequential learning optimization". Zenodo https://doi.org/10.5281/zenodo.6351366 (2022).
Küsgens, P. et al. Characterization of metal-organic frameworks by water adsorption. Microporous Mesoporous Mater. 120, 325–330 (2009).
Article CAS Google Scholar
Horcajada, P. et al. Synthesis and catalytic properties of mil-100 (fe), an iron (iii) carboxylate with large pores. Chem. Commun. 27, 2820–2822 (2007).
Lebedev, O., Millange, F., Serre, C., Van Tendeloo, G. & Férey, G. First direct imaging of giant pores of the metal- organic framework mil-101. Chem. Mater. 17, 6525–6527 (2005).
Article CAS Google Scholar
Yang, D.-A., Cho, H.-Y., Kim, J., Yang, S.-T. & Ahn, W.-S. Co2 capture and conversion using mg-mof-74 prepared by a sonochemical method. Energy Environ. Sci. 5, 6465–6473 (2012).
Article CAS Google Scholar
Henkelis, S. E. et al. A single crystal study of cpo-27 and utsa-74 for nitric oxide storage and release. CrystEngComm 21, 1857–1861 (2019).
Article CAS Google Scholar
Permyakova, A. et al. Synthesis optimization, shaping, and heat reallocation evaluation of the hydrophilic metal–organic framework mil-160 (al). ChemSusChem 10, 1419–1426 (2017).
Article CAS Google Scholar
Permyakova, A. et al. Design of salt–metal organic framework composites for seasonal heat storage applications. J. Mater. Chem. A 5, 12889–12898 (2017).
Article CAS Google Scholar

Download references

Acknowledgements

E.C. acknowledges financial support of the Italian National Project PRIN Heat transfer and Thermal Energy Storage Enhancement by Foams and Nanoparticles (2017F7KZWS) and of the research contract PTR 2019/21 ENEA (Sviluppo di modelli per la caratterizzazione delle proprietà di scambio termico di PCM in presenza di additivi per il miglioramento dello scambio termico) funded by the Italian Ministry of Economic Development (MiSE).

Author information

Authors and Affiliations

Department of Energy, Politecnico di Torino, C.so Duca degli Abruzzi 24, Torino, 10129, Italy
Giovanni Trezza, Luca Bergamasco, Matteo Fasano & Eliodoro Chiavazzo

Authors

Giovanni Trezza
View author publications
You can also search for this author in PubMed Google Scholar
Luca Bergamasco
View author publications
You can also search for this author in PubMed Google Scholar
Matteo Fasano
View author publications
You can also search for this author in PubMed Google Scholar
Eliodoro Chiavazzo
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

E.C. conceived the idea and found financial support. G.T. performed all computations and wrote the first paper draft. E.C., M.F., and L.B. supervised the research activities. L.B. and M.F. helped with the result presentation and interpretation. M.F. suggested the synthetic dataset. All authors contributed to the final paper writing.

Corresponding author

Correspondence to Eliodoro Chiavazzo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Trezza, G., Bergamasco, L., Fasano, M. et al. Minimal crystallographic descriptors of sorption properties in hypothetical MOFs and role in sequential learning optimization. npj Comput Mater 8, 123 (2022). https://doi.org/10.1038/s41524-022-00806-7

Download citation

Received: 22 November 2021
Accepted: 08 May 2022
Published: 03 June 2022
DOI: https://doi.org/10.1038/s41524-022-00806-7

This article is cited by

Enhancing ReaxFF for molecular dynamics simulations of lithium-ion batteries: an interactive reparameterization protocol
- Paolo De Angelis
- Roberta Cappabianca
- Eliodoro Chiavazzo
Scientific Reports (2024)
The impact of physicochemical features of carbon electrodes on the capacitive performance of supercapacitors: a machine learning approach
- Sachit Mishra
- Rajat Srivastava
- Pietro Asinari
Scientific Reports (2023)