CRAFTED: An exploratory database of simulated adsorption isotherms of metal-organic frameworks

Grand Canonical Monte Carlo is an important method for performing molecular-level simulations and assisting the study and development of nanoporous materials for gas capture applications. These simulations are based on the use of force fields and partial charges to model the interaction between the adsorbent molecules and the solid framework. The choice of the force field parameters and partial charges can significantly impact the results obtained, however, there are very few databases available to support a comprehensive impact evaluation. Here, we present a database of simulations of CO2 and N2 adsorption isotherms on 690 metal-organic frameworks taken from the CoRE MOF 2014 database. We performed simulations with two force fields (UFF and DREIDING), six partial charge schemes (no charges, Qeq, EQeq, MPNN, PACMOF, and DDEC), and three temperatures (273, 298, 323 K). The resulting isotherms compose the Charge-dependent, Reproducible, Accessible, Forcefield-dependent, and Temperature-dependent Exploratory Database (CRAFTED) of adsorption isotherms.


Background & Summary
Carbon capture, storage, and utilization is considered as one of the key strategies required to reduce anthropogenic carbon dioxide emissions and their impacts on climate change 1 .The most viable option for this approach is to focus on CO 2 capturing from the point sources, such as fossil fuel power plants, fuel processing plants and other industrial plants where carbon capture technology can be applied to streams with the industrial scale flow rates 2 .However, despite several decades of intensive research, carbon capture in an economically viable way remains an enormous challenge 3 .
Adsorption processes are considered to be a promising alternative to the conventional absorption processes due the their low regeneration energy, high selectivity, and high capture capacity 4 .Combined, these characteristics may lead to energy-efficient processes for industrial scale capture and utilization of greenhouse gases (GHG).At the heart of a typical adsorption process for gas separation, such as the Pressure Swing Adsorption process, is the active adsorbent material; and the efficiency of the process crucially depends on the properties of this material.Within the different adsorbent materials that are potentially available for this process, crystalline nanoporous materials such as metal-organic frameworks (MOF) [5][6][7][8] , covalent organic frameworks (COF) [9][10][11] , zeolitic imidazolate frameworks (ZIF) [12][13][14] , and zeolites 15 feature many of the necessary characteristics for a solid sorbent for efficient gas separation arXiv:2210.09456v1[cond-mat.mtrl-sci]17 Oct 2022 under the conditions of interest.
These families of materials contain hundreds of thousands synthesized structures and countless more hypothetical ones, featuring pores of different size, shape, and chemical characteristics.This creates large exploration space for studies that seek to identify the best candidates for a given gas capture application.This endeavour, however, is not possible via a brute-force experimental campaign.The number of large databases built upon experimental [16][17][18][19][20] and hypothetical 21,22 structures, combined with the continuous growth of diversity and scope of new materials due to the advancements in digital reticular chemistry, 23,24 make high-throughput computational screening (HTCS) approaches an an imperative strategy for efficient exploration of the vast chemical landscape of crystalline nanoporous adsorbents 25,26 .
Most of the HTCS studies for carbon capture and related problems are based on using Grand Canonical Monte Carlo (GCMC) simulations to generate adsorption data.This data is then used to form some simple material performance metrics or is passed on to the process level modelling to explore performance of candidate materials under the realistic process conditions.
To perform molecular simulations, such as GCMC, one needs a set of parameters that describe the interactions among the adsorbate molecules, and between the adsorbate molecules and the atoms of the adsorbent material; this set of parameters is called a force field.
Over the years of the development, many force fields have been developed for various purposes and several options are available to describe adsorption of gases such as carbon dioxide in materials such as MOFs.Invariably, the predicted equilibrium adsorption data and, consequently ranking of the materials and the recommendations of the screening study will depend on the choice of the force field.
This poses several fundamental questions.How sensitive is the adsorption data to the choice of the force field parameters?How does this sensitivity vary across different categories or classes of materials?And ultimately, is a ranking of porous materials for a particular application a robust result or it is contingent on using a particular force field?
To start to explore these questions one needs a representative mass of adsorption data covering typical choices of the force field parameters, materials, gases and conditions.This defines the remit of the current article where we tasked ourselves with building such a database (or at least, the first block of conditions).
To explain the contents of the database and our approach, let us delve first into components of the classical force fields and the typical options available for the studies of adsorption of gases in MOFs and related materials.In the classical force fields, the non-bonded interactions are modeled as a sum of van der Waals and Coulomb potentials 27 .The van der Waals interactions between the adsorbed molecules and the framework are usually modeled by the Lennard-Jones (LJ) potential, which is an effective potential with two fitted parameters that can capture most of the intermolecular effects relevant to physisorption.The parameters for the atoms can be taken from the generic force fields such as the Universal Force Field (UFF) 28 , DREIDING 29 and TraPPE 30 , with the interactions between different atom species computed using mixing rules such as Lorentz-Berthelot 31 or Jorgensen 32 .
The Coulombic interactions are modeled by partial atomic charges assigned to the atoms which need to be calculated for each material.There are several charge assignment methods available, and they can be divided into two main groups: i. methods derived from quantum chemistry calculations (e.g.RESP 33 , CHELP 34 , REPEAT 35 , and DDEC [36][37][38] ) and ii.methods based on charge equilibration (e.g.Qeq 39 , PQeq 40 , EQeq 41 , and FC-Qeq 42 ).Although there is some consensus that the approaches such as DDEC (based on electronic structure calculations) are more accurate, methods such as EQeq can present sufficiently good results that, combined with their low computational cost, makes them attractive choices for HTCS studies 43,44 .
Lately, there have been several studies evaluating the accuracy of different methods for calculating partial atomic charges [44][45][46][47][48] , however, little is known about the combined impact of force field and partial charge selection on material-level analysis and its implication on process-level performance metrics.Furthermore, the parameters of force fields such as UFF and DREIDING were fitted employing specific partial charge schemes (Gasteiger 49 for DREIDING and Qeq 28 for UFF), thus the combination of these parameters with different charge assignment methodologies, even if more accurate, may not necessarily generate better results.
These considerations guide us on the choices of the parameters of the force fields to consider in the database.
The database contains simulated adsorption isotherms for 726 MOFs selected from the CoRE MOF 2014 16 database.The simulations were performed for the adsorption of CO 2 and N 2 with two force fields (UFF and DREIDING), four partial charge schemes (no charge, Qeq, EQeq, and DDEC), at three temperatures (273, 298, 323 K).The resulting isotherms compose the Charge-dependent, Reproducible, Accessible, Forcefield-dependent, and Temperature-dependent Exploratory Database (CRAFTED) of adsorption isotherms.CRAFTED provides a convenient platform to explore the sensitivity of simulation outcomes to molecular modeling choices at the material (structure-property relationship) and process levels (structure-property-performance relationship).

Structure selection
The 2932 structures present in the CoRE MOF 2014 16 database were analyzed and 726 structures were retained in our analysis.This subset of 726 comprises all materials from the CoRE MOF 2014 database for which all atom types are present in both DREIDING and UFF force fields.Throughout this work, this subset of structures will be referred to as "CRAFTED structures".

Partial charges calculation
The DDEC partial charges [36][37][38] were taken without modification from the CoRE MOF 2014 16 database.The EQeq partial charges 41 were calculated using the the extended charge equilibration method as implemented in the EQeq software 50 v1.1.0.The Qeq partial charges 39 were calculated using the default implementation available in RASPA 51 .

Grand Canonical Monte Carlo simulations
Atomistic Grand Canonical Monte Carlo (GCMC) simulations were performed using a force field-based algorithm as implemented in RASPA 51,52 v2.0.45.Interaction energies between non-bonded atoms were computed through a combination of Lennard-Jones (LJ) and Coulomb potentials where i and j are interacting atom indexes and r i j is their interatomic distance.ε i j and σ i j are the well depth and diameter, respectively.The LJ parameters between atoms of different types were calculated using the Lorentz-Berthelot mixing rules LJ parameters for framework atoms were taken from Universal Force Field (UFF) 28 or DREIDING 29 (see Table 1).The parameters for the adsorbed molecules were taken from the TraPPE 30 force field (see Table 2).All simulations were performed with 10,000 Monte Carlo cycles.Swap (insertion or deletion with with a probability of 50% for each), translations, rotations, and re-insertions moves were tried with probabilities 0.5, 0.3, 0.1, and 0.1, respectively.To avoid the use of long initialization cycles, each isotherm was calculated in a single simulation, with each pressure point of the simulation starting from the result of the previous one.The uptake values for each pressure were obtained by averaging over the GCMC equilibrium phase, determined using the Marginal Standard Error Rule.For more information, please refer to section Automatic transient regime detection and truncation.
All atoms in the MOF were held fixed at their crystallographic positions.The number of unit cells used was different for each MOF to ensure that the perpendicular lengths of the supercell were greater than twice the cutoff used.The cutoff for Lennard-Jones and charge-charge short-range interactions was 12.8 Å and the Ewald sum technique was applied to compute the long-range electrostatic interactions with a relative precision of 10 -6 .The Lennard-Jones potential was shifted to zero at the cutoff.Fugacities needed to impose equilibrium between the system and the external ideal gas reservoir at each pressure were calculated using the Peng-Robinson equation of state 53 with the critical parameters for each gas taken from Table 3.All GCMC uptake data report the absolute adsorption value in mol/kg units.
The enthalpy of adsorption was computed as where N is the number of adsorbates on the simulation box and U is the potential energy 54 .All the adsorption enthalpy values are reported in kJ/mol and are the values as calculated by RASPA without further modification.

Lennard-Jones parameters
The Lennard-Jones parameters for DREIDING and UFF force fields used in the calculations for the framework atoms are shown in Table 1.For simplicity, only the atoms that are present in both UFF and DREIDING are shown.The TraPPE parameters used for the gas molecules are present in Table 2.The critical parameters used in the Peng-Robinson equation to calculate the fugacity are present in Table 3.

Automatic equilibration detection and truncation
To eliminate the use of long initialization cycles, the Marginal Standard Error Rule (MSER) 55 was applied to automatically detect the ideal truncation point using the pyMSER package v1.0.12 56 , so that the averages were taken only over the equilibrated phase of the simulation.The output of this method is the equilibrated average of the observable, alongside an uncertainty metric.Here we used the uncorrelated standard deviation, as explained in the next sub-section.
The MSER defines the start of the equilibrated region d(n) by solving the minimization problem: The Left-most Local Mininum (LLM) version of MSER was used in a batched data with batch size of 5.

Uncorrelated Standard Deviation (uSD)
To use an uncertainty metric that reflects, at the same time, the real dispersion of the simulated values and the number of cycles used for this simulation, the uncorrelated standard deviation (uSD) was used as an uncertainty metric.To calculate this quantity, first the number of uncorrelated states in the simulation is estimated by calculating the autocorrelation time.The equilibrated data is divided into chunks so that each chunk has a number of data points equivalent to the autocorrelation time.Then, the average value of each chunk is calculated and the standard deviation over this list of uncorrelated average values is calculated as the uSD.
The autocorrelation time is estimated by calculating the autocorrelation function of the equilibrated data using the method implemented in the numpy library. 57An exponential decay function is fitted over the values of the autocorrelation function and the autocorrelation time is calculated as the half-life of this exponential decay.

Automatic simulations workflow
A set of scripts composed of three stages were created to automate the isotherm generation process.First, a pre-processing step is performed where partial charges are calculated for all structures.Next, a set of calculation scheduler scripts are executed, where steps such as copying the force field and CIF files, creating a supercell with P1 symmetry, writing the RASPA input file, running the RASPA simulation, parsing the RASPA output, and performing the MSER analysis of the results for averaging over the equilibrated phase of the simulation, are run in sequence.Finally, a post-processing script is executed to analyze the results, resubmit incomplete calculations and create the isotherm database.A simplified scheme containing the main steps of this workflow is present on Figure 1.

Revised Autocorrelations (RACs) descriptors
To understand the diversity of our subset of CRAFTED MOFs, and understand how representative they are with respect to the MOF material class, revised autocorrelations (RACs) descriptors were calculated using the molSimplify software v1.7.1 58 .RACs are built by generating a crystal graph derived from the adjacency matrix computed for the primitive cell of the crystal structure and calculating the discrete correlations between atomic properties (Pauling electronegativity, nuclear charge, etc.) over the atoms.The correlations are composed of the products (Equation 5) and the differences (Equation 6) of an atomic property P for atom i, which is selected from the start atom list and is correlated to atom j selected from the scope atom list when they are separated by d number of bonds.
Six atomic properties were used to compute RACs: atom identity (I), connectivity (T), Pauling electronegativity (χ), covalent radii (S), nuclear charge (Z) and polarisability (α).These properties are used to generate metal-centred, linker and functional-group descriptors.To generate a fixed length descriptor, the averages of these descriptors were used, thus generating 156 features (40 for metal chemistry, 68 for linker chemistry and 48 for functional group chemistry) for each MOF structure.
For the t-SNE projections of the geometric properties, 14 features were used: largest included sphere (D is ), the largest free sphere (D fs ), largest included sphere along a free path (D isfs ), volumetric accessible area (ASA m2/cm3 ), gravimetric accessible area (ASA m2/g ), volumetric non-accessible area (NASA m2/cm3 ), gravimetric non-accessible area (NASA m2/g ), unit cell volume, crystal density, accessible volume fraction (AVF), non-accessible volume fraction (NAVF), accessible volume (AV cm3/g ), non-accessible volume (NAV cm3/g ), and the number of pockets (n pockets ).All these properties were calculated using Zeo++ v0.3. 60he chemical descriptors used for the t-SNE projections of the MOF structures were described in the previous section.All structures that could not have their descriptors calculated were removed from the list.
For the unsupervised cluster analysis, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) 61 method was employed.The DBSCAN analysis was performed using the standardized and scaled set of descriptors with eps = 0.09 and minimum number of samples per cluster of 25.

Data Records
CRAFTED provides 34,848 isotherm files and 34,848 adsorption enthalpy files resulting from the GCMC simulation of two gases (CO 2 and N 2 ) on 726 MOF structures at three temperatures (273, 298, and 323 K) using two force fields (UFF and DREIDING) and 4 partial charge methods (no charges, Qeq, EQeq, and DDEC).Alongside the isotherm data, the charge-assigned CIF files, force field and molecule definition files are provided, to ensure reproducibility and facilitate a future database expansion.
Each isotherm file corresponds to a comma-separated value (CSV) file containing three labeled columns corresponding to pressure (in Pa), uptake volume and its uncertainty (in mol/kg).The file names follow the pattern Q_MOF_FF_GAS_T.csv,therefore the isotherm file named DDEC_ABUWOJ_UFF_CO2_273.csv contains the data corresponding to the CO 2 adsorption isotherm at 273K on the ABUWOJ MOF with DDEC partial charges using the UFF forcefield.The adsorption enthalpy file names follows the same pattern.
The adsorption enthalpy files correspond to a CSV file ontaining three labeled columns corresponding to pressure (in Pa), adsorption enthalpy and its uncertainty (in kJ/mol), following the same naming pattern as the isotherm files.
The RASPA input file names follows the same pattern as the isotherm files (Q_MOF_FF_GAS_T.input), the cif files are separated into folders according to their partial charge (Qeq, EQeq, DDEC, and NEUTRAL) type and the force field files are separated into folder according to their type (UFF and DREIDING).

Chemical and geometrical diversity of CRAFTED subset of structures
The selection of a subset of structures that can be modeled simultaneously by both UFF and DREIDING force fields may impose a limitation on the structural and chemical representativeness of the CRAFTED database and, consequently, on the results obtained with this data.To ensure that CRAFTED MOFs form a group that represents the great diversity of experimentally realised MOFs, both the geometrical and chemical diversities must be represented.
To evaluate the geometrical diversity, the pore size (such as the largest included sphere, largest free sphere and largest included sphere along a free path), void fraction, density, unit cell volume, specific area (both gravimetric and volumetric), pore volume (both gravimetric and volumetric) and the number of pockets (non accessible pores) were used as the descriptors for t-SNE projection.Both the accessible and non-accessible specific area and pore volume was used.
For the chemical diversity, the revised autocorrelations (RACs) descriptors 62 were used.This approach has been successfully applied to study transition metal chemistry 63 and the chemical diversity of MOF datasets 64 .The chemical characteristics of the MOFs were divided into three categories: metal node chemistry, organic linker chemistry and functional groups chemistry.
Figure 2 shows the t-SNE projection onto 2D maps of the four selected groups of descriptors for the structures in the CoRE MOF 2019 database (colored circles) and the selected structures for CRAFTED database (red hexagons).Although CRAFTED structures were taken from the first version of CoRE MOF from 2014, here the comparison is made with the second version of this database, from 2019, as it has a greater number and diversity of structures.
To numerically evaluate the overlap of the databases in t-SNE projected space, the DBSCAN method was used.This method is an unsupervised machine learning technique used to identify clusters of varying shape and size, grouping points that are close to each other.Since the t-SNE method reduces the feature space by modelling structures with similar features as nearby points and distinct features as distant points, the groups found by the DBSCAN method will share similarities within the original feature space.
The limitation imposed by the DREIDING force field reduces the diversity of metal chemistry observed in the CRAFTED database compared to the structures presented in CoRE MOF 2019, as shown in Figure 2(a).However, 59 of the 95 clusters found ( 62%) have some structure present in CRAFTED, indicating that even with a limited metal cluster composition, the CRAFTED structures show a good representation of the chemical diversity present in CoRE MOF 2019.
The chemical diversity of both organic linker and functional groups is much better represented within the CRAFTED structures, as can be seen in Figure 2(b) and Figure 2(c).In both cases, one can see that the points from both CRAFTED and CoRE MOF 2019 structures are equally dispersed across 2D space.Additionally, 36 of the 40 clusters identified for linker chemistry (90%) and 31 of the 35 clusters identified for functional group chemistry ( 89%) contain structures present in CRAFTED.
The geometric properties are also well represented by the CRAFTED database, as shown in Figure 2(d).From the 28 clusters identified, 21 (75%) present structures from CRAFTED.Additionally, Figure 3 shows the distribution density of the values for the main geometric properties presented by the structures on CRAFTED and CoRE MOF 2019.One sees that both databases contain similar distributions, showing that even with a smaller number of structures, the CRAFTED database is exemplary of synthesized MOF structural properties.

General impact of force field and partial charge selection
To illustrate the impact of molecular-level simulation parameters (force field and partial charge) on the outcome of the GCMC simulations, we show in Figure 4 the absolute uptake and enthalpy of adsorption of CO 2 on the BONWID MOF at 237K.At 0.1 bar, a typical pressure used in adsorption-based pressure swing adsorption (PSA) processes for CO 2 capture 65 , the uptake values range from 0.12 to 0.83 mol/kg and every combination of partial charge and force field yields a different value, while at 10 bar almost all conditions resulted in similar uptake.This dispersion of results is also reflected on the enthalpy of adsorption that ranges from -31 to -43 kJ/mol and are fairly different for every combination of parameters.
Among the CRAFTED materials, one finds a diversity of responses to the choices of force field and partial charge method.Four representative cases are highlighted in Figure 5.For some materials, the uptake is highly dependent on both the force field and partial charge, others are only sensitive to one of these parameters, and some are not sensitive to any.Therefore it is possible to anticipate that most studies that depend directly on the results of these molecular simulations, such as high-throughput computational screening or multiscale processes modelling, may also present different degrees of dependence on the choice of parameters.

Usage Notes
The CRAFTED database can be found on Zenodo 66 repository.The database contains 34,848 CSV files with the isotherm adsorption data (pressure, uptake, and uncertainty), definition files for UFF and DREIDING force fields, 2,904 CIF files for all 726 structures with all 4 partial charges considered (NEUTRAL, DDEC, EQeq, and Qeq).The database also contains 34,848 files containing the adsorption enthalpy.All 34,848 RASPA input files to facilitate the reproduction of the isotherm simulations and 2 CSV files containing the set of RAC and geometrical descriptors calculated with molSimplify and Zeo++, respectively.
To facilitate the exploration and visualization of the isotherms present on CRAFTED, we also developed an interactive visualization interface based on a Jupyter notebook and panel.This interface allows the user to select a set of specific conditions -e.g.partial charge, temperature, force field, adsorbed gas, and material name -for each isotherm, thus facilitating a quick and easy visual comparison of the data.In addition, it is possible to download the CIF files, the inputs for the GCMC simulation with RASPA, and the data of the selected isotherms, thus facilitating the reproduction of the data present in CRAFTED.An example of the interface is shown in Figure 6.We recommend the user to set up a Python environment using the environment.ymlor requirements.txtprovided therein.
A Jupyter notebook with the code to perform the t-NSE dimensionality reduction and the DBSCAN unsupervised clusterization analysis with the necessary files containing the RAC and geometric descriptors for the CoRE MOF 2019 is also provided alongside the CRAFTED data, providing an easy reproduction of the results presented in Figure 2.
Finally, we would like to highlight some points to show why this database benefits the scientific community.Machine learning (ML) and data-driven methods have become useful tools to aid in the discovery of new materials for CO 2 capture.For example, surrogate models can be constructed from GCMC-simulated adsorption data to map the structure-property relationship of nanoporous materials, which may then be used to accelerate the HTCS of previously unexplored databases of adsorbents 67,68 .Deep generative models can be trained with simulated adsorption data to develop property-orientated generative algorithms to discover new materials on the latent chemical space optimized for gas capture applications by an inverse design approach [69][70][71] .
As ML model prediction accuracy and data quality are intrinsically related, there is an apparent requirement to assess the impact of uncertainty from molecular-level simulations on the confidence of machine learning model predictions 72 .The efficacy of the surrogate models, for example, depends on the quality of information provided by the material feature vector, and so concerted efforts have been made to develop useful representations of MOFs. 73However, little is known about the impact of force field selection on the interpretability of surrogate model predictions.Particularly in the case of MOFs with coordinatively unsaturated metals, different generic force fields and charge assignment schemes can deliver dramatically different results.
Therefore, the importance of material features -learned either explicitly by the surrogate model as in decision trees, or extracted through feature permutations / SHAP values 74 -may be subject to considerable discrepancies.Feature importance analyses are useful to guide the design of new functional MOFs, and so it is desirable to understand the differences (if any) that arise from different levels of molecular theory.13/18 Atom type UFF DREIDING

Figure 1 .Figure 2 .
Figure 1.Schematic representation of the main steps in the automated simulation workflow.

Figure 3 .
Figure 3.Comparison between the main geometric properties of the structures present in CRAFTED and CoRE MOF 2019.

Figure 4 .Figure 5 .
Figure 4. Example of the impact of force field and partial charge choice on the simulated adsorption isotherms of CO 2 on BONWIL MOF at 273K.

Figure 6 .
Figure 6.Screenshot of the interactive interface for isotherm visualization.

Table 1 .
Lennard-Jones parameters for UFF and DREIDING force fields.

Table 2 .
Lennard-Jones parameters for TraPPE force field.

Table 3 .
Critical parameters for CO 2 and N 2 .