Introduction

Metal-organic frameworks (MOFs) are crystalline compounds consisting of metal ions and organic linkers, characterized by tunable porosity and incredibly high surface area1. Due to their properties, MOFs have recently attracted remarkable attention in a wide range of different fields, including gas/vapor separation2, reaction catalysis3, drug delivery4, energy storage, and heat transformations5,6. Given their nature as porous adsorbent materials, intensively active research is focused on the use of MOFs for CO2 capture, towards the development of effective technologies for mitigating greenhouse gas emissions7. Recently, MOFs have been also employed in adsorption-based atmospheric water harvesting driven by solar thermal energy8,9. In general, when dealing with applications of engineering relevance, different inlet gas streams, variable operating conditions, and target properties tailored per each specific case make it challenging to identify an ideal MOF crystal for all applications10, thus leading to a fragmented case-by-case optimization problem.

Hence, MOFs require proper and efficient methods for tailoring their features according to target properties of interest in each specific application. The latter is everything but an easy task. In fact, due to the myriad of degrees of freedom for MOFs structure and composition, more than 100 trillion compounds have been hypothesized11, while almost 100,000 have been synthesized so far12. High-throughput computational screening and machine learning have been recently adopted to analyse large MOF datasets13. Such computational tools allow us to identify significant correlations between nanoscale features and observable macroscale properties14,15, and to select the most suitable crystal for a given application case. A few representative examples are provided by gas-gas separation (D2/H216, O2/N217, CO/N218, CO2/H219, ethane/ethylene20, and other gas mixtures21), the enantioselectivity of chemical compounds22, gas adsorption (CO223, CH424, H225, thiol26, organosulfurs27, and acetylene28), and combinations thereof29,30. Several computational explorations of MOFs datasets have been carried out also for biomedical (drug delivery31), mechanical (CO2 Brayton cycle32 and osmotic heat engine33), and energy applications (heat pumps/chillers34,35 and thermal energy storage36).

In this context, modern sequential learning (SL) algorithms are emerging as particularly efficient tools for exploring the material high-dimensional (crystallographic) feature space. In particular, while evaluating an objective black-box function through demanding physical or numerical experiments, SL tools can provide a well-orchestrated procedure to rationally navigate the high-dimensional parameter (feature) space37,38. Thus, given an initial pool of evaluation points, one can sequentially choose the next experiment to carry out39,40, without relying on naive random guessing. Rather general techniques have been proposed in the area of material science holding the promise to accelerate materials discovery and research41, with a number of Authors reporting successful use of SL approaches in this field. Aggarwal et al.42, by means of optimal experimental design, successfully characterized a substrate under a thin film. Seko et al.43 found the compound with the highest melting temperature in a given ensemble of candidate materials with less attempts than a naive random choice. Kiyohara et al.44 accelerated the search of a stable interface structure with respect to a traditional brute force approach. Dehghannasiri et al.45 efficiently guided experiments to design the shape memory alloy with the lowest energy dissipation at a given temperature. Needless to say that the identification of the parameter space is a very important preliminary step when implementing SL algorithms. Here, we choose to specifically focus on MOFs properties as gas/vapor sorbent materials, since those are particularly relevant for energy applications.

The first important objective of this work is the identification of the minimal set of MOFs features (or descriptors) ruling critical adsorption properties in the low-coverage regime, i.e., the Henry solubility coefficients for both CO2-MOFs and, importantly, H2O-MOFs working pairs. The above minimal set represents the important crystallographic features underpinning a given adsorption property of interest. In this sense, each minimal set of descriptors can be regarded as the genetic code for a given property, and it is identified as described below.

First, we curate and enhance MOF data from a recently developed library made of 8206 compounds generated computationally46. Each Crystallographic Information File (CIF) representing a given material is first featurized by means of 1557 Classical Force-field Inspired Descriptors (CFID)47 taking into account both chemical and structural parameters. Subsequently, we train and validate regression models of target properties involved in heat storage applications, such as Henry coefficients and working capacity. These models are obtained by means of a Random Forest-based pipeline, with hyperparameter tuning in five fold cross-validation. The final ranking and selection of the minimal set of descriptors is performed by evaluating the importance of each feature on the models outputs by means of the Tree SHAP interpretation algorithm48.

Upon identification of the above crystallographic genetic code of sorption properties in MOFs, we investigate its role when using SL algorithms. Therefore, we compare the performance of three different SL methodologies aiming at maximizing H2O and CO2 Henry coefficients, and CO2 working capacity: (a) random Forests with Uncertainty Estimates for Learning Sequentially (FUELS49); (b) kriging algorithm50; (c) COMmon Bayesian Optimization Library (COMBO)51. For each SL methodology, we compare several strategies for choosing the next material to test, combining the exploration of high-uncertainty regions with the exploitation of high-performing candidates. Importantly, we analyse the SL performance using both the minimal subset of features (from the pipeline and SHAP analysis) and a larger set of variables, to highlight how the identification of descriptors affects the minimum number of experiments needed to pick out a MOF with the highest value of the desired property. In Fig. 1 the above procedure is schematically represented.

Fig. 1: Overview of the protocol to identify and test the minimal set of ruling crystallographic descriptors of sorption properties in several sequential learning algorithms.
figure 1

Over 5000 hypothetical MOFs from ref. 46 are first featurized by CFID, with the corresponding full set of descriptors provided to a ML regression pipeline for a preliminary descriptor reduction and ML model training of sorption properties of interest. The Tree SHAP interpretation algorithm is thus used to finalize the identification and ranking of a reduced subset of ruling descriptors of the chosen property). Several sequential learning schemes are tested using both the full set of descriptors and the reduced one for a comprehensive comparison.

We highlight that sorption-based engineering applications rely upon sorbent material characterization in a wide coverage range. However, when a large number of hypothetical sorbents (here MOFs, but in principle also zeolites52,53 or other materials54) have to be evaluated as potential candidates, only low-coverage characterization (i.e. Henry coefficient) is often accessible thus making challenging any optimization of crucial figures of merits of engineering relevance. We thus formulate a procedure aiming at a fast evaluation of one of the most important figures of merit in closed water-sorption seasonal thermal energy storage applications, namely the material-based specific (stored) energy. Unlike traditional sensible or latent systems55, the above sorption-based energy storage technologies have the advantage to be loss-free. Our procedure can thus be used in SL-based (or other) optimization/screening processes of MOFs even under incomplete knowledge of the entire isosteric field of the candidate working pairs. Applied to the database of over 5000 computationally generated (hypothetical) compounds by ref. 46 (developed for different purposes), our procedure identifies MOFs that can possibly outperform state-of-the-art sorbent materials for thermal energy storage.

Results

In this work, a crucial source of data on MOFs stems from the dataset of Boyd et al.46, where important sorption properties (e.g., the Henry coefficients for CO2 and H2O, the working capacity for CO2, and the specific surface area) have been computed by DFT-based simulations for over 8000 potential MOFs. Capitalizing on the above comprehensive study, we construct machine learning (ML) models capable of accurately predicting MOFs solubility of both CO2 and H2O as well as CO2 working capacity and surface area. The above models enable us to achieve the first key result of this work, namely the identification of the minimal set of crystallographic-based descriptors56 ruling these sorption and geometric properties in MOFs.

Moreover, a systematic comparison of SL approaches on the above MOFs database reveals important conclusions on the performance of the different regression schemes adopted and, most importantly, the role played by the selection of the feature space to be explored. Those conclusions are also supported by results obtained on a highly-controllable synthetic dataset, as discussed in detail in Supplementary Note 1.

It is worth stressing that properties reported in ref. 46 only characterize MOFs in the Henry regime and are not sufficient to describe the equilibrium sorption properties in the high coverage regime. However, when targeting important engineering applications such as seasonal thermal energy storage, key figures of merits of the storage plant (e.g., the material-based specific energy) critically rely upon the access to the entire isosteric field of the chosen sorbent-sorbate pair or, equivalently, to the knowledge of equilibrium adsorption isotherms at several temperature values57,58. The latter isotherms describe the adsorption properties (at equilibrium) of sorbents in a wide range of coverage values, from the Henry regime up to the saturation pressure of the sorbate fluid.

Therefore, in this work, we also propose an approach enabling us to optimize one of the most important engineering figures of merit of MOFs for seasonal thermal energy storage applications (i.e., material-based specific energy in an ideal closed sorption cycle, see Methods, subsection Water-sorption thermal energy storage, and Results, subsection Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs), even if only an incomplete set of sorption properties are (experimentally or numerically) accessible. Based on the latter optimization procedure, we are finally able to identify potential MOFs candidates for seasonal thermal energy storage that can possibly outperform most of the current state-of-the-art sorbent materials.

Descriptors of sorption properties in MOFs and their use in SL algorithms

We constructed four MOFs datasets, each one with the same 1557 features and a different target property among Henry coefficient for CO2 (8194 data entries), working capacity for CO2 (8202 data entries), Henry coefficient for H2O (8202 data entries), and surface area (5028 data entries). The different number of data entries are due to missing values for some of the chosen properties in the available database by ref. 46.

The above database also reported, for all the compounds, both the crystallographic file (used to extract the 1557 CFID47 by means of Matminer56) and a list of DFT-computed properties, among which we have only considered the abovementioned adsorption properties of interest. More specifically, the computed 1557 explanatory variables (also referred to as descriptors) proposed by Choudhary et al.47 consist of a set of both chemical (e.g., average chemical properties over the elements in the cell, average atomic radial charge) and structural (e.g., distribution functions) quantities. More details on descriptor sub-categories are reported in Table 1.

Table 1 Components of classical force-field inspired descriptors (CFID)47.

We have trained four different ML models by means of a Random Forest-based pipeline with hyperparameter tuning in five fold cross-validation to predict the Henry coefficient for CO2, the working capacity for CO2, the Henry coefficient for H2O, and the surface area, achieving coefficients of determination of R2 = 0.785, R2 = 0.590, R2 = 0.874, and R2 = 0.924, respectively. We have used 80% of each dataset to train the models, and the remaining 20% to validate them. Since the Henry coefficient values span a few orders of magnitude, the corresponding ML models have been developed in terms of the natural logarithm of those properties. During the data preprocessing routines, each of the four trained pipelines (i.e., feature reduction and ML with hyperparameter tuning, see Supplementary Note 10 for details) already drops a significant number of the 1557 features, thus confirming that many of the initially selected descriptors do not significantly affect the chosen adsorption properties. More specifically, the final models include 237 descriptors for the CO2 Henry coefficient, 236 descriptors for CO2 working capacity, 177 descriptors for the H2O Henry coefficient, and 234 descriptors for surface area.

Then, the Tree SHAP routine48,59 allows to identify the most meaningful descriptors as those accounting for the 75% of the cumulative curve over the coefficients of importance. The SHAP routine identifies the most meaningful features for the trained models among the reduced set of descriptors retained by the trained pipelines after the preprocessing. In particular, the impact of a descriptor depends on the comparison between the output of a model trained with that feature and another model output, trained without that feature (see Methods, subsection Model training and choice of descriptors). The coefficients of importance are thus computed over the testing set, i.e., over samples the model has never encountered during the training.

Overall, starting from the original 1557 Classical Force-field Inspired Descriptors, 33 items for the CO2 Henry coefficient, 62 for the CO2 working capacity, 15 for the H2O Henry coefficient, and 14 for the surface area are found to explain 75% of the corresponding regression models. Furthermore, we have repeated an analogous procedure with AutoMatminer60, which allows us to automatically train and validate a complete pipeline—feature reduction, data cleaning, and machine learning—with automatic hyperparameter tuning, without cross-validation. Results are reported in Supplementary Note 2. Models performances with the corresponding cumulative importance curves of the ruling descriptors are reported in Fig. 2. The H2O Henry coefficient is key for the computation of the specific material energy in a MOF-based thermal energy storage plant (as discussed below in Results, Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs). Hence, for the H2O Henry coefficient only, we also report a few experimental values corresponding to real MOFs to compare with numerical predictions of hypothetical MOFs. In particular, numerical predictions happen to follow a good agreement with the experimental values. This aspect is key for material screening purposes and is further highlighted in the Discussion section below.

Fig. 2: Predictions and corresponding normalized cumulative curves for the coefficients of importance of the four Random-forest regression models.
figure 2

Results are reported for a Henry coefficient for CO2, b working capacity for CO2, c Henry coefficient for H2O (with experimental values for MIL-100(Fe)81,82, MIL-10181,83, MOF-801-SC68, MOF-80868, MOF-84168, Mg-MOF-7484,85), d surface area; model performances are shown in terms of coefficient of determination R2, mean absolute error (MAE), and root mean squared error (RMSE), with the size of training and testing sets Ntrain and Ntest, respectively.

Importantly, Fig. 3 shows the SHAP rankings of the five most meaningful descriptors for each of the properties of interest. Table 2 summarizes the physicochemical meaning of the identified descriptors, based on the complete list by Choudhary et al.47. The entire list of variables, together with their cumulative importance, the trained models, and the datasets on which they have been trained are publicly available online (see Data availability and Code availability).

Fig. 3: The five most important features according to SHAP ranking for each of the properties of interest.
figure 3

Results are reported for a Henry coefficient for CO2, b working capacity for CO2, c Henry coefficient for H2O, d surface area. In each panel, for each feature (i.e., each line), 1639, 1641, 1641, and 1006 dots are shown respectively, representing the entire testing sets used for computing the related SHAP values (impacts on the model output, horizontal axes); the color represents the corresponding feature value, the features are sorted according to the mean over the absolute SHAP values.

Table 2 Physicochemical meaning of the most relevant CFID descriptors47.

As far as the four properties of MOF are concerned, the identification of those important descriptors represents per se an advancement of knowledge on sorption mechanism in MOFs and an important contribution to this work.

Focusing our attention on the top three descriptors in terms of importance, we notice that:

  • As far as the Henry coefficient for CO2 (see Fig. 3a) is concerned, such a property happens to be mainly ruled by the volume per atom, the nearest neighbor distribution, and an elastic constant. The former descriptor is the volume of a cell divided by the number of atoms in the cell, thus mainly representing a geometrical feature of the MOF crystal. The nearest neighbor distribution can also be regarded mainly as a geometry feature strictly related to the crystallographic structure. The elastic constant is one of the elements of an averaged 6 × 6 elastic tensor of the cell. In particular, for each of the chemical elements of a cell, the elastic tensor of the solid ground state structure at temperature 0 K is known. The weighted average of those tensor entries over the chemical elements of the cell are the elastic constants found by the featurizer. To our best interpretation, the latter two quantities effectively describe the chemical environment of the crystal thus underpinning the sorbent-sorbate interaction potential.

  • As far as the working capacity for CO2 (see Fig. 3b) is concerned, this quantity is mainly ruled by nearest neighbor distribution and volume per atom. Therefore, as expected, geometric features are playing a key role in determining the maximum CO2 uptake into the crystal.

  • As far as the Henry coefficient for H2O (see Fig. 3c) is concerned, we found that that quantity is mainly ruled by an elastic constant, radial distribution function, and nearest-neighbor distributions. As such, as opposed to the CO2 case above, we can observe that descriptors more related to the chemical environment are needed this time since a polar molecule is interacting with the MOFs structure.

  • Finally, for the surface area (see Fig. 3d), we found that it is mainly ruled by volume per atom, packing fraction, and nearest neighbor distribution, thus showing the prominence of geometrical features similar to the case of the above working capacity.

Upon the identification of the above lists of descriptors, we have compared the performance of SL algorithms for the sorption properties of interest using both the reduced set of important descriptors (i.e., those explaining the 75% of the models predicting CO2 Henry coefficient, CO2 working capacity, and H2O Henry coefficient, respectively) and a larger set of 100 descriptors composed by the previous and some additional (non-meaningful) ones.

In particular, for the aforementioned sorption properties, we have chosen the non-relevant features as the ones obtaining the least scores in the SHAP ranking. SL was adopted to find the maximum property value among a random subset of 500 samples from the original datasets (over 8000 MOFs), starting from a pool of 100 points with the lowest target property. Unexpectedly, SL optimization in the space of relevant descriptors does not ensure, in general, faster convergence of the procedure to the optimum property value (this is also confirmed by results in the synthetic case reported in Supplementary Note 1). Those six spaces (two domains for each of the three properties) are reported in Supplementary Note 9, as t-SNE projections over two components61. Furthermore, among the three methodologies examined, both Random Forest- and COMBO-based methods were able to provide always a faster convergence to the optimum value as compared to the random choice strategy. Results are shown in Fig. 4. As Ling et al. point out49, a pure exploitative strategy—RF-MEI, K-MEI, COMBO-PI—performs better when, already in the very first steps, the model is able to predict with high accuracy the value of the property of interest; conversely, a pure explorative strategy—RF-MU, K-MU—is more proper if the optimum is somehow very different with respect to all the other candidates, while the remaining strategies are a trade-off between those two extremes—RF-MLI, K-MLI, and COMBO-EI.

Fig. 4: Number of evaluations before converging to the maximum for the SL algorithms, normalized with respect to the random choice (corresponding to 200 experiments), for three sorption properties of MOFs.
figure 4

Results are reported for a Henry coefficient for CO2, b working capacity for CO2, c Henry coefficient for H2O; the initial set consists of the same worst 100 candidates (in terms of the target property) from a random subset of 500 samples of the original dataset.

We notice that, over different regression methodologies—even with the same acquisition function—the performance ranking, in general, changes. For instance, in the case of the Henry coefficient for H2O in Fig. 4c, among the Random Forest-based strategies, RF-MLI is the top performer, followed by RF-MEI and RF-MU, for both 15- and 100-dimensional input spaces; the very same ranking applies for Kriging based acquisition functions, K-MLI, K-MEI, and K-MU. In the case of working capacity for CO2 in Fig. 4b, instead, among the Random Forest-based strategies, RF-MEI is the top performer, followed by RF-MU and RF-MLI; on the contrary, the same ranking does not apply for Kriging based strategies, since K-MEI is the top performer, followed by K-MLI and K-MU, for both 62- and 100-dimensional input spaces. An intermediate case is represented by the Henry coefficient for CO2 in Fig. 4a, where the ranking of Random Forest-based strategies (RF-MLI, RF-MEI, RF-MU, both 33- and 100-dimensional input spaces), is preserved over the Kriging methodology only for the 100-dimensional space (K-MLI, K-MEI, and K-MU), but not for the 33-dimensional (K-MEI, K-MLI, and K-MU). Among COMBO-based acquisition functions, COMBO-PI and COMBO-EI always require a similar number of evaluations, except in the 100-dimensional input space of Henry coefficient for CO2, and the 100-dimensional input space is consistently better performing than the corresponding low-dimensional one.

This brief comparison shows that no general rule is valid a priori for SL, neither in terms of regression methodology, nor in terms of acquisition function, nor in terms of dimensionality of the problem, albeit selecting relevant descriptors.

Optimization under incomplete access to the isosteric field of candidate MOFs-water working pairs

Without losing generality, here we aim at investigating the expected performance of hypothetical (i.e., computationally generated) MOFs for an important energy engineering application, namely water-sorption seasonal thermal energy storage (see also the Methods section, subsection Water-sorption thermal energy storage). The most challenging aspect of this task consists in the access to the entire isosteric field of each candidate MOF-water pair for estimating the engineering figure of merit of interest. Clearly, for a large number of MOFs candidates, this is challenging and time-consuming both computationally (typically, only the Henry low-coverage regime is reported in literature works62,63) and experimentally64. In this section, we specifically focus on such a challenging aspect. We envision an efficient optimization procedure that is capable of searching MOFs with the largest expected figure of merit of engineering relevance (either by SL, if materials are sequentially synthesized/computed, or by accessing readily available databases65). As far as seasonal thermal energy storage applications are concerned, here we focus on the highest specific energy of MOF-water working pairs among the compounds reported in ref. 46. An overview of the proposed methodology is schematically reported in Fig. 5.

Fig. 5: Suggested procedure for estimating the specific energy of hypothetical MOF-water working pairs when only an incomplete knowledge of the isosteric field is experimentally or numerically accessible.
figure 5

By the knowledge of Henry coefficient for H2O at a certain temperature from literature data, an isotherm is obtained by Frumkin–Fawler–Guggenheim; the Polanyi potential is used for scaling to different temperatures. When two isotherms of interest are identified, upon the definition of the necessary environmental conditions and the use of an incomplete set of properties (e.g., solubility, uptake), the corresponding specific stored energy can be computed. SL algorithms can be employed for optimizing the desired figure of merit/properties in a discretized descriptor space.

As detailed in the Methods section (subsection Water-sorption thermal energy storage), the ideal thermodynamic cycle of a closed sorption thermal energy storage system is completely defined by four operating temperatures. In this study, we assume TA = 308 K (the minimum temperature on the user side), TC = 353 K (the maximum temperature on the source side), TE = 278 K (the average winter temperature), TF = 303 K (the average summer temperature). Those temperature values are reasonable for space heating applications in temperate climates64. Given the Antoine equation, the equilibrium water vapor pressures pE = 866.2 Pa at the evaporator and pF = 4231.6 Pa at the condenser are also uniquely defined considering the average winter and summer temperatures, respectively. We decided to evaluate and maximize over the database one of the most important engineering quantities in a thermal energy storage plant, namely the cycled heat per unit of material weight. While full details on the adopted models are given in the Methods section (subsection Water-sorption thermal energy storage), in the following we report and discuss the main simplifying assumptions in our approach:

  • A key quantity to be estimated is the H2O working capacity. That quantity is related to the available adsorption sites nTOT per unit of dry sorbent mass. Boyd et al.46 reported only the CO2 working capacity, while no data are available on the maximum H2O uptake. Nonetheless, we can rely on other related properties, such as the specific surface area of MOFs. To this end, we notice that Chaemchuen et al. have reported H2O working capacity for a pool of 66 MOFs5. A good correlation between the water uptake and the surface area (i.e., the available internal surface per gram of dry adsorbent) can be observed for typical MOFs used in the energy engineering field, and this is also in line with results found by Xu et al.66. In this work, on the basis of that correlation, we impose a linear regression for finding the constant of proportionality between water uptake and surface area (water uptake = η × surface area). This yields \(\eta =3.875\times 10^{-4}\,\rm{g}_{\rm{H}_{2}\rm{O}}\,\rm{m}^{-2}\). More details can be found in Supplementary Note 4. We also notice that, without a loss of generality, if more accurate values of the water uptake are available (e.g., from numerical simulations of adsorption experiments) the above assumption on the uptake-internal surface correlation can be fully relaxed.

  • Henry coefficients \(\widetilde{H}({T}_{0})\) for H2O are listed at the reference temperature T0 = 298 K with units of \({\rm{mol}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}\), thus representing the moles of adsorbed H2O per kilogram of dry MOF per bar of H2O vapor. In our approach, we adopt the Frumkin–Fawler–Guggenheim (FFG) model to estimate the adsorption isotherm over the entire range of coverages only relying upon such Henry coefficient. However, as discussed in the Methods sections (subsection Water-sorption thermal energy storage), the FFG equation requires H(T0) in units of Pa−1: we have thus converted \(\widetilde{H}({T}_{0})\) (readily available from ref. 46) to H(T0). Let ns, mMOF and \({p}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\) be the number of adsorbed water moles, the mass of the hypothetical MOF, and the pressure of water in the vapor phase, respectively, it holds:

    $$\widetilde{H}({T}_{0})=\frac{{n}_{{{{\rm{s}}}}}}{{m}_{{{{\rm{MOF}}}}}{p}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}}.$$
    (1)

    The approximation of the low-coverage regime yields the linear relationship between the coverage and pressure, namely \(\theta =H({T}_{0}){p}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\). Since the number of adsorbed water moles is related to the molar based total number of adsorption sites as ns = θnTOT, it follows:

    $$H({T}_{0})=\widetilde{H}({T}_{0})\frac{{{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}}{{n}_{{{{\rm{TOT}}}}}/{n}_{{{{\rm{MOF}}}}}},$$
    (2)

    with \({m}_{{{{\rm{MOF}}}}}={{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}{n}_{{{{\rm{MOF}}}}}\), \({{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}\) the molecular weight of the MOF, and nMOF the total number of moles. Furthermore, the molar based total number of adsorption sites nTOT corresponds to the maximum number of water moles \({n}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}\) that can be adsorbed, and the following relationship holds:

    $$\frac{{n}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}}{{n}_{{{{\rm{MOF}}}}}}=\frac{{m}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}}{{m}_{{{{\rm{MOF}}}}}}\frac{{{{{\mathcal{M}}}}}_{{{{\rm{MOF}}}}}}{{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}},$$
    (3)

    where \({{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\) is the molecular weight of water and \({m}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}\) denotes the maximum mass of water that can be adsorbed. The ratio \({m}_{{{{\rm{MAX}}}},{{{{\rm{H}}}}}_{{{{\rm{2}}}}}{{{\rm{O}}}}}/{m}_{{{{\rm{MOF}}}}}\) is related to the H2O working capacity of the MOF and it is equal to ηS, where η is the constant of proportionality between the uptake and the surface area S. A comparison of Equations (3) and (2) yields:

    $$H({T}_{0})=\widetilde{H}({T}_{0})\frac{{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}}{\eta S}\times 1{0}^{-8},$$
    (4)

    which is the final conversion formula of the Henry coefficient for H2O from measure units of \({{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}\) into Pa−1. Here, the factor 10−8 appears because \([{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}]=\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{mol}}}}}_{{{{\rm{H2O}}}}}^{-1}\), \([\eta ]=\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{m}}}}}^{-2}\), \([S]=\,{{{{\rm{m}}}}}^{2}{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}^{-1}\), and so \([\widetilde{H}({T}_{0}){{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}/(\eta S)]=\,{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}\).

  • A crucial quantity for heat transformation is the isosteric heat of adsorption qst. Due to the Clausius–Clapeyron relationship (see Equation (7) in the Methods section, subsection Water-sorption thermal energy storage), at least two adsorption isotherm curves (at TA and at TC) are needed to estimate the corresponding qst. In our database, though, the Henry coefficients are only available at T0. In order to reconstruct a second adsorption isotherm for the same MOF-water working pair, we decided to resort to the potential theory of Polanyi, thus exploiting the basic notion that all adsorption isotherms are self-similar when rescaled with respect to the Polanyi potential function. Details are provided in the Supplementary Note 5. More specifically, the Polanyi potential is defined as:

    $${{{\mathcal{A}}}}=-RT\ln \left(\frac{{p}_{{{{\rm{s}}}}}(T)}{p}\right),$$
    (5)

    where ps(T) is the saturation pressure of water at temperature T, while p is the pressure of the vapor phase on the adsorbent surface67. Since, at a given pressure p, the Polanyi potential is a constant of the sorption pair, we have computed \({{{\mathcal{A}}}}\) at T0 in a range from 10−4 Pa up to ps(T0) = 3157 Pa (Antoine equation for water); then, in the θ − p chart, we have rescaled the abscissa p of the isotherm obtained at the temperature T0 according to \(p={p}_{{{{\rm{s}}}}}(T)\exp \left({{{\mathcal{A}}}}/(RT)\right)\), for getting the new curves at TA and TC. We have finally computed the isosteric heat by means of the Clausius–Clapeyron relationship:

    $${q}_{{{{\rm{st}}}}}=\frac{R}{3}\frac{{T}_{{{{\rm{C}}}}}{T}_{{{{\rm{A}}}}}}{{T}_{{{{\rm{C}}}}}-{T}_{{{{\rm{A}}}}}}\mathop{\sum }\limits_{i=1}^{3}\ln \frac{{p}_{2}({\theta }_{i})}{{p}_{1}({\theta }_{i})},$$
    (6)

    where the points 1 and 2 represent the intersections of an isosteric transformation with the two isotherms respectively at TA and TC. We have repeated the procedure for three coverage values (i.e., θ1 = 0.4, θ2 = 0.5, θ3 = 0.6) and averaged them.

  • Finally, upon determination of the low- and high-temperature adsorption isotherms curves at TA and TC, the coverage span Δθ during the discharge phase can be determined as detailed in the Methods section, subsection Water-sorption thermal energy storage. As for the estimate of the isosteric heat, this requires for all the compounds the rescaling of the horizontal axis in the θ − p chart of the isotherm obtained at temperature T0 according to the Polanyi potential theory.

We have thus computed the following objective function SqstΔθ, which recovers the cycled heat up to a constant, over the entire list of 5028 potential MOFs with positive surface area in ref. 46. The MOF with the highest predicted specific energy turned out to be the compound with the chemical formula C96H48O28N8V4 and referred to as “str_m5_o18_o28_sra_sym.72” in the database (see also the molecular rendering in Fig. 6b), with Henry coefficient at 298 K of \(6110.54\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\,{{{{\rm{bar}}}}}^{-1}\) (or equivalently, 6.89 × 10−4 Pa−1) and surface area \(S=4118.79\,{{{{\rm{m}}}}}^{2}\,{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}^{-1}\). Figure 7 shows also the ideal expected thermodynamic cycle related to this optimal potential MOF.

Fig. 6: Hypothetical MOFs ranked with respect to the specific energy.
figure 6

a 2D chart with the entire set of 5028 hypothetical MOFs in the database by Boyd et al.46, where the first two SHAP ranked descriptors for the H2O Henry coefficient are represented and the best four MOFs sorbents for the adsorption/desorption based thermal storage application are highlighted. b 3 × 3 × 13 replications of the respective crystallographic cells (C atoms: gray; H atoms: white; O atoms: red; N atoms: blue; V atoms: green).

Fig. 7: Adsorption/desorption based thermal energy storage cycle for the potential MOF ‘str_m5_o18_o28_sra_sym.72’ with water, in the coverage-pressure plane.
figure 7

The isotherms TA = 308 K and TC = 353 K are shown. Surface S, isosteric heat qst, and coverage span Δθ over the cycle are reported, giving an objective function \(S{q}_{{{{\rm{st}}}}}{{\Delta }}\theta =1.32\times 1{0}^{5}\,{{{\rm{kJ}}}}\,{{{{\rm{m}}}}}^{2}\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}^{-1}\,{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}^{-1}\) or equivalently \(2.83\times 1{0}^{-3}\,{{{\rm{GJ}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\), which corresponds to 1.73 GJm−3.

We observe a coverage span Δθ = 0.653, an isosteric heat \({q}_{{{{\rm{st}}}}}=48.95\,{{{\rm{kJ}}}}\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}^{-1}\) with an objective function value of \(1.32\times 1{0}^{5}\,{{{\rm{kJ}}}}\,{{{{\rm{m}}}}}^{2}\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}^{-1}\,{{{{\rm{g}}}}}_{{{{\rm{MOF}}}}}^{-1}\). That quantity can be directly related to specific energy: upon multiplication by the constant \(\eta /{{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\) (\(\eta =3.875\times 1{0}^{-4}\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{m}}}}}^{-2}\), \({{{{\mathcal{M}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}=18.02\,{{{{\rm{g}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}\,{{{{\rm{mol}}}}}_{{{{{\rm{H}}}}}_{2}{{{\rm{O}}}}}^{-1}\)), we obtain a value of \(2.83\times 1{0}^{-3}\,{{{\rm{GJ}}}}\,{{{{\rm{kg}}}}}_{{{{\rm{MOF}}}}}^{-1}\). Furthermore, we can compute the theoretical density ρMOF of the crystal knowing the mass of the cell (1965.21 u = 3.263 × 10−21 gMOF, as from the database) and its volume (5.335 × 10−21 cm3, as from the CIF file), leading to ρMOF = 0.612 gMOFcm−3. As a result, the volume-based energy density turns out to be 1.73 GJm3. For the sake of comparison, Fig. 6a shows a 2D map where the two axes represent the two most important descriptors according to the SHAP ranking for the Henry coefficient of H2O: the four top-performing potential MOFs are highlighted and the corresponding crystallographic cells are depicted in Fig. 6b. Moreover, Table 3 shows the ten most performing potential MOFs ranked in terms of specific energy. Interestingly, those compounds are all Vanadium-based, mostly due to values of the Henry coefficient for H2O in the optimal range, which produces a good coverage span Δθ over the thermodynamic cycle.

Table 3 Top ten potential MOFs in terms of specific energy from the database by Boyd et al.46.

As depicted in Fig. 8 the four top-performing hypothetical MOFs are predicted to have (material-based) specific energy values among the highest available in the literature for sorption thermal energy storage under similar operating conditions. A more extensive comparison is shown in Supplementary Fig. 11, where estimates of the specific energy for real MOFs by means of the FFG model are also reported.

Fig. 8: Comparison between the expected specific energy for several materials.
figure 8

Specific energies are shown for different desorption temperatures TC both for the optimum MOFs identified in this work (either standard environmental conditions, i.e., evaporation temperature TE = 278 K and condensation temperature TF = 303 K, or with conditions of TE = 283 K and TF = 293 K, adsorption temperature TA = 308 K always) and for several water-sorbent materials in the literature6,64,86,87.

Discussion

Sequential learning (SL) algorithms can in principle dramatically reduce the number of evaluations needed for finding the optimum of an unknown function as compared with a naive random choice and, as such, they are emerging also as effective tools for material optimization and discovery. In this work, focusing on metal-organic frameworks (MOFs) and some of their crucial adsorption properties (both with H2O and CO2 as sorbate fluids), we have addressed a number of critical aspects related to the discovery of the minimal set of important crystallographic descriptors for SL-based optimization algorithms. We have shown that the general protocol for sorting out the minimal set of ruling descriptors for a given adsorption property is based on two steps: (i) construction and training of the machine learning (ML) model which identifies the number of ruling descriptors; (ii) evaluation of the relative importance of each explanatory variable on the chosen output by the SHAP analysis. We found that, as long as the set of such ruling descriptors (for a given property of interest) is included among the exploration space features, convergence performance is not affected, although the computational burden of an SL algorithm also depends on the dimension of the parameter space to be explored: taking into account only the most relevant features may be in fact beneficial in that respect. Furthermore, based on the several examples provided here (i.e., Henry coefficient for CO2, working capacity for CO2, and Henry coefficient for H2O as well as the synthetic example discussed in the Supplementary Note 1), we have consistently noticed that the COMBO algorithm always performs better than random guessing.

Furthermore, we recognize that full access to the adsorption properties of hypothetical MOFs in the entire coverage regime (as requested in important applications of engineering relevance) is very challenging both experimentally and computationally. This holds particularly for water-MOFs working pairs, that are promising for a number of energy applications. Hence, we formulate a general and efficient computational screening procedure of hypothetical MOFs which, only relying upon the adsorption properties reported in Fig. 2, is capable to estimate important figures of merit for sorption-based seasonal thermal energy storage. Remarkably, our procedure suggests that some of the MOFs hypothesized in the database by ref. 46 (developed for completely different purposes) can possibly outperform most of the state-of-the-art water-sorbent compounds. Interestingly, those compounds are all Vanadium-based, mostly due to values of the Henry coefficient for H2O in the optimal range, causing a good coverage span over the thermodynamic cycle. It is worth noticing that the above MOFs screening for thermal energy storage applications critically relies upon the prediction of the Henry coefficient for water.

Therefore, for the latter property only, we have also considered a number of real MOFs and their reported experimental values have been compared with the corresponding numerical values. It is worth stressing that finding both the experimentally evaluated property values and CIF file from the same publication is a non-trivial task, sometimes possibly leading to not necessarily consistent data. Furthermore, even for the three MOFs (i.e., MOF-801-SC, MOF-808, and MOF-841) considered in this study with experimental and computational data extracted from the same reference source, there is no guarantee that the tested material corresponds perfectly to the related CIF file, and discrepancies can always occur as demonstrated by the same Authors of ref. 68. Moreover, the CIF files needed to extract the descriptors are always very ideal if compared with the experimentally tested crystals. Possible defects in real compounds can be related to any chemical changes, leading to variation in the hydrophilic nature of the material69,70,71. Nonetheless, we observed that, while discrepancies can be found, the ML-based predictor of the Henry coefficient consistently shows a good agreement with the experimental values. Importantly, the numerical predictions based on the identified descriptors can selectively distinguish among MOFs with higher or lower values of the Henry coefficient. We believe that the above results represent an important step toward efficient MOFs screening and optimization, not only with respect to intrinsic materials properties but also (and importantly) with respect to figures of merit of engineering relevance for applications such as thermally driven water harvesting from the air, water-sorption thermal energy storage, and solar cooling.

Clearly, we are also aware that our approach is based on a number of approximations and it still requires additional research activities. First, we notice that a large set of hypothetical MOFs may be characterized by properties (e.g., Henry coefficients) whose values span several orders of magnitude. Hence a unique ML model, as used in this pipeline, may achieve a high coefficient of determination if its logarithm is considered. Nonetheless, the computation of the coverage span Δθ depends directly on the Henry coefficient, which may thus be affected by a relatively high error. Furthermore, additional simplifying assumptions that have been used in our approach include fixed parameters such as the constant of proportionality between the specific surface area and the water working capacity, as well as the steepness coefficient β in the FFG model. Without losing generality, those assumptions could be relaxed in the near future relying on more sophisticated models. One possible way to cope with those challenges (not necessarily the only possible strategy) may be a preliminary classification of the hypothetical MOFs based on properly trained ML classifiers, with the purpose of assigning a given compound of interest to a specific category (e.g., the set of MOFs with Henry coefficient of similar magnitude, similar β, etc.). Afterward, property predictions on each MOF category can be possibly performed with higher accuracy.

Methods

Water-sorption thermal energy storage

In this work, we focus on the use of potential MOFs for water-sorption-based thermal storage applications.

Physical adsorption processes are based on weak and reversible interactions between the (solid) sorbent material and the corresponding adsorbate, i.e., the fluid72. Those phenomena are relevant to thermal energy engineering as sorption/desorption in solid sorbents can be accompanied by a significant amount of energy exchange. In the following, the solid sorbents are MOFs, while water is the adsorbate.

To allow desorption of an infinitesimal number (dn) of moles of adsorbate from the adsorbent surface, a given amount of heat dQ = qst dn has to be provided to the system, where qst (with units of kJ/mol) denotes the isosteric heat. Since the process is reversible, the same amount of heat dQ is released by the dry sorbent when dn moles of fluid at a pressure p, initially in the vapor phase, are adsorbed. Furthermore, we define load X as the ratio between the mass of adsorbate and the mass of dry sorbent. A process characterized by constant load X is referred to as an isosteric transformation and the popular Clausius–Clapeyron relationship yields:

$${\left(\frac{\partial \ln p}{\partial \left(-\frac{1}{T}\right)}\right)}_{X}=\frac{{q}_{{{{\rm{st}}}}}}{R},$$
(7)

where T is the absolute temperature and R = 8.314 J mol−1K−1 denotes the gas constant73. Therefore, an isosteric transformation in the Clapeyron chart (\(\ln p\) vs − 1/T) is a curve with local slope qst/R. Similarly, the adsorbate isosteric curve has a slope ΔH(vap)/R, where ΔH(vap) is the molar enthalpy for liquid-vapor phase change of the adsorbate.

Closely related to load X, the coverage θ is defined as the ratio between the number of already occupied adsorption sites ns and the total available number of sites nTOT. At equilibrium and at a given temperature T, the coverage θ depends on the pressure p of the vapor phase according to an adsorption isotherm, whose shape depends on the sorbent/adsorbate pair. MOFs/water pairs are known to show typical S-shaped isotherms in the θ − p chart (i.e., type V of the IUPAC classification74). Thus, in this work, we make the assumption that the Frumkin–Fawler–Guggenheim (FFG) model can be used conveniently for describing analytically the MOFs-water adsorption isotherms:

$$\theta =\frac{H(T)p\exp (\beta \theta )}{1+H(T)p\exp (\beta \theta )},$$
(8)

where \(\beta =\frac{\overline{n}{E}_{{{{\rm{p}}}}}}{RT}\) rules the steepness of the S-shape, \(\overline{n}\) denotes the neighboring binding sites and Ep represents the additional binding energy due to lateral interactions67. We have used the FFG model to interpret eight experimental isotherms of real MOF-water pairs and achieve a proper choice of β. In particular, for each curve, we have identified the best value of β in terms of a least-squares approach; then, we have taken the mean over those eight values, ending up with β = 3.4. More details can be found in Supplementary Note 6. Finally, H(T) is the Henry coefficient for the specific sorbent/adsorbate pair (with units Pa−1) at a given absolute temperature T.

A schematic of a closed water-sorption thermal energy storage system is shown in Fig. 9. These systems are based on a reactor, containing the solid sorbent, connected with a condenser/evaporator by means of a valve57. Such chemical apparatus follows a seasonal closed-cycle completely defined by four temperatures: TA (the minimum temperature on the user side), TC (the maximum temperature on the source side), TE (the average winter temperature), and TF (the average summer temperature). The two pressures pE (evaporator) and pF (condenser) are related to the absolute temperatures TE and TF, respectively, through the Antoine equation for water saturation \({p}_{{{{\rm{E}}}},{{{\rm{F}}}}}=133.2\times 1{0}^{A-B/(C+{T}_{{{{\rm{E}}}},{{{\rm{F}}}}}-273)}\), where A = 8.07131, B = 1730.63, C = 233.42675. Hence, the ideal thermodynamic cycle of a closed thermal energy storage process (see Fig. 10) is based on the following four steps:

  1. 1.

    The sorbent/adsorbate is heated isosterically up to a temperature TB, corresponding to a pressure pF in the condenser (line AB).

  2. 2.

    Heating of the pair continues at constant pressure pF and desorbed vapor flows to the condenser through the opened valve. In the condenser, the adsorbate rejects the condensation heat into the environment while condensing until the maximum temperature of the heat source TC is reached (line BC). The condition of the minimum load is reached and the valve gets closed.

  3. 3.

    Keeping the valve closed, the system in contact with the environmental temperature cools isosterically during the storage period (line CD) down to temperature TD, corresponding to the evaporator pressure pE.

  4. 4.

    During the discharge phase of the heat storage system, the valve is opened to let the adsorbate evaporate and reach the reactor. During this isobaric transformation (line DA), the heat of adsorption QDA, also known as cycled heat, is released.

Fig. 9: Schematic of a closed water-sorption thermal energy storage system.
figure 9

The cycle underlying the heat accumulation and successive release is represented.

Fig. 10: Schematics of ideal adsorption/desorption thermal energy storage cycle.
figure 10

a Ideal cycle in the Clapeyron chart. b Same ideal cycle in the coverage-pressure chart, between the two limiting isotherms passing by θ(p, TA) and θ(p, TC).

One of the most important figures of merit for energy storage systems is the specific stored energy, namely the maximum energy that can be stored per unit of mass of the plant or of the material76. Clearly, at fixed material (or plant) mass, the higher the cycled heat the higher the specific energy of the storage system. In this view, the choice of the solid sorbent material for a given adsorbate is key for maximizing the cycled heat in the ideal thermodynamic cycle. We thus perform below a material screening aiming at the maximum value of the following cycled heat (i.e., the released heat during the DA process in Fig. 10):

$${Q}_{{{{\rm{DA}}}}}=\int\nolimits_{{{{\rm{D}}}}}^{{{{\rm{A}}}}}{q}_{{{{\rm{st}}}}}\,{{{\rm{d}}}}{n}_{{{{\rm{s}}}}}=\int\nolimits_{{{{\rm{D}}}}}^{{{{\rm{A}}}}}{n}_{{{{\rm{TOT}}}}}{q}_{{{{\rm{st}}}}}\,{{{\rm{d}}}}\theta \approx {n}_{{{{\rm{TOT}}}}}{q}_{{{{\rm{st}}}}}{{\Delta }}\theta ,$$
(9)

where we have used the definition of the heat of adsorption dQ = qst dns, coverage θ = ns/nTOT and the approximation \({q}_{{{{\rm{st}}}}}\approx {{{\rm{const}}}}\).

Sequential learning

Typical steps in any SL algorithm consist in (i) constructing a regression model over known data, (ii) using a strategy to suggest the best-unmeasured point to test, (iii) enlarging the known dataset with this tested point, and (iv) iterating up to the tested candidate meets the needed specification. Let \({{{\mathcal{D}}}}=\{({{{{\bf{x}}}}}_{1},{y}_{1}),\ldots ,({{{{\bf{x}}}}}_{n},{y}_{n})\}\) denote a set of n training data, where \({{{{\bf{x}}}}}_{i}\in {{\mathbb{R}}}^{d}\) and \({y}_{i}\in {\mathbb{R}}\) represent the i-th vector of descriptors and its known response, respectively. Let {xn+1, …, xm} denote the (m − n) d-dimensional arrays of descriptors with unknown responses {yn+1, …, ym}. To find the location x* of the maximum y*, we would need the exact model y = f(x). However, given the restricted set \({{{\mathcal{D}}}}\) of training data, only a surrogate model \(y=\hat{f}({{{\bf{x}}}}| {{{\mathcal{D}}}})\) can be constructed. Hence, for each unmeasured point i = n + 1, …, m, different regression methodologies (here FUELS-Random Forest, kriging and COMBO-Gaussian processes) can be used to estimate the response \(\hat{f}({{{{\bf{x}}}}}_{i})\) in terms of a mean value μ(xi) and the corresponding uncertainty σ(xi), indicating the robustness of the prediction. To measure the performance of any combination regression model/query strategy, we put ourselves in the practitioner’s perspective, who is interested in a unique sequence of points to be tested, and not in an average over more paths (as, for instance, shown in ref. 49). To achieve this, for those regression models not allowing a deterministic prediction (i.e., Random Forest and COMBO), at each step we have repeated 100 times the choice of the next point to query, picking the most preferred one. A comprehensive comparison of the above methodologies is reported in the Results (subsection Descriptors of sorption properties in MOFs and their use in SL algorithms), and the complete details on the adopted algorithms can be found in the Supplementary Notes 7 and 8.

Model training and choice of the descriptors

The first issue to be addressed when applying SL to material optimization is computation and selection of relevant features (or descriptors). The descriptor issue is critical in materials science77,78 as well as in other computational fields79. In this work, we first investigate to which extent the choice of a minimal set of relevant descriptors is critical for the fast convergence of SL algorithms.

To this end, before even implementing SL procedures, we decided to perform a preliminary feature pruning for discovering the most meaningful ones in terms of the target property. We use the entire dataset (both descriptors and target property) to train and validate a Random Forest-based pipeline—feature reduction and machine learning—for regression, with hyperparameter tuning in five fold cross-validation. More details can be found in Supplementary Note 10.

Upon model training and validation, we detect the most important features thanks to the Tree SHAP algorithm, which is optimized for tree-based models such as Random Forest48,59, thus quantifying to which extent a given feature impacts the output. The latter methodology is based on the classical Shapley value, which has in game theory its original field of application. There, the problem of assigning, in a cooperative game, a proportional reward to each player is addressed based on the real contribution provided to the common objective of the coalition. In a model, given F the set of all features and its generic subset SF, the importance of the i-th descriptor depends on the comparison between the model fS{i} trained with that explanatory variable, and another model fS trained without that feature; then, the difference between the predictions fS{i}(xS{i}) − fS(xS) is computed, where xS{i} and xS represent respectively the values of the input feature over the subsets S {i} and S. This difference is weighted over all possible subsets S and the importance value of the i-th feature turns out to be

$${\phi }_{i}=\mathop{\sum}\limits_{S\subseteq F\setminus \{i\}}\frac{| S| !(| F| -| S| -1)!}{| F| !}\left({f}_{S\cup \{i\}}({{{{\bf{x}}}}}_{S\cup \{i\}})-{f}_{S}({{{{\bf{x}}}}}_{S})\right),$$
(10)

where denotes the number of elements. Because of the huge number of possible descriptor subsets S of a set F, the classical Shapley values of Equation (10) are, in general, computationally challenging. Nonetheless, the Tree SHAP realization we have employed in this work is able to explain efficiently tree-based models, such as Random Forest.