A generative artificial intelligence framework based on a molecular diffusion model for the design of metal-organic frameworks for carbon capture

Park, Hyun; Yan, Xiaoli; Zhu, Ruijie; Huerta, Eliu A.; Chaudhuri, Santanu; Cooper, Donny; Foster, Ian; Tajkhorshid, Emad

doi:10.1038/s42004-023-01090-2

Download PDF

Article
Open access
Published: 14 February 2024

A generative artificial intelligence framework based on a molecular diffusion model for the design of metal-organic frameworks for carbon capture

Communications Chemistry volume 7, Article number: 21 (2024) Cite this article

4982 Accesses
193 Altmetric
Metrics details

Subjects

Abstract

Metal-organic frameworks (MOFs) exhibit great promise for CO₂ capture. However, finding the best performing materials poses computational and experimental grand challenges in view of the vast chemical space of potential building blocks. Here, we introduce GHP-MOFassemble, a generative artificial intelligence (AI), high performance framework for the rational and accelerated design of MOFs with high CO₂ adsorption capacity and synthesizable linkers. GHP-MOFassemble generates novel linkers, assembled with one of three pre-selected metal nodes (Cu paddlewheel, Zn paddlewheel, Zn tetramer) into MOFs in a primitive cubic topology. GHP-MOFassemble screens and validates AI-generated MOFs for uniqueness, synthesizability, structural validity, uses molecular dynamics simulations to study their stability and chemical consistency, and crystal graph neural networks and Grand Canonical Monte Carlo simulations to quantify their CO₂ adsorption capacities. We present the top six AI-generated MOFs with CO₂ capacities greater than 2m mol g⁻¹, i.e., higher than 96.9% of structures in the hypothetical MOF dataset.

Understanding the diversity of the metal-organic framework ecosystem

Article Open access 13 August 2020

Integrating stability metrics with high-throughput computational screening of metal–organic frameworks for CO2 capture

Article Open access 03 October 2023

Progress toward the computational discovery of new metal–organic framework adsorbents for energy applications

Article 09 January 2024

Introduction

Metal-organic frameworks (MOFs) have garnered much research interest in recent years due to their diverse industrial applications, including gas adsorption and storage¹, catalysis², and drug delivery³. As nanocrystalline porous materials, MOFs are modular in nature⁴, with a specific MOF defined by its three types of constituent building blocks (inorganic nodes, organic nodes, and organic linkers) and topology (the relative positions and orientations of building blocks). MOFs with different properties can be produced by varying these building blocks and their spatial arrangements. Nodes and linkers differ in terms of their numbers of connection points: inorganic nodes are typically metal oxide complexes whereas organic nodes are molecules, both with three or more connection points; organic linkers are molecules with only two connection points.

MOFs have been shown to exhibit superior chemical and physical CO₂ adsorption properties. In appropriate operating environments, they can be recycled, for varying numbers of times, before undergoing significant structural degradation. However, their industrial applications have not yet reached full potential due to stability issues, such as poor long-term recyclability⁵, and high moisture sensitivity⁶. For instance, the presence of moisture in adsorption gas impairs a MOF’s CO₂ capture performance, which may be attributed to MOFs having stronger affinity toward water molecules than to CO₂ molecules⁷. Other investigations of MOF stability issues^8,9,10,11,12 have shown that the gas adsorption properties of MOFs can be enhanced by tuning the building blocks used. Yet progress has been difficult due to the enormous chemical space of building blocks, which makes exhaustive search with traditional experimental methods impractical¹³.

Related work

Previous efforts in the search for new MOF structures with exceptional gas adsorption properties include:

Database search methods. These methods apply filters to identify optimal candidates in large databases of MOF structures plus calculated properties obtained with molecular simulations. For example, Ref. ¹⁴ shows how to search for MOFs with high CO₂ uptake when moisture is present using the experimental MOF database (CoRE DB), while Ref. ¹⁵ shows how to search for MOFs with high CH₄/H₂ selectivity in both CoRE DB and the Cambridge Structural Database non-disordered MOF subset.

Machine Learning (ML)-assisted screening. Building large MOF databases require expensive molecular simulation calculations for every target structure. ML-assisted screening avoids this difficulty by using a regression model trained on a smaller training set to predict target properties of a large number of new test structures. This approach has been widely applied to gas adsorption and separation property predictions of MOFs, including CO₂/H₂ separation¹⁶ and CH₄ adsorption¹⁷. ML-assisted methods use of geometrical features (e.g., dominant pore size, void fraction¹⁸) and chemical features (e.g., atom type, electronegativity¹⁹). A compound feature, the atomic property weighted radial distribution function²⁰ has been shown to improve regression model performance in finding MOF structures with high CO₂ capacity²¹. This feature is constructed by using both local geometry and chemistry of atomic sites in MOF structures. As an alternative to the use of hand-engineered ML features, neural networks have been applied to MOF research. In Ref. ²², the Atomistic Line Graph Neural Network, trained on the hMOF dataset, was applied to search CoRE DB for high-performing MOFs for carbon capture²³.

Generative modeling. Candidates are not drawn from a database but are generated de novo via methods that produce novel compounds that have a desired set of chemical features. For instance, the Supramolecular Variational Autoencoder²⁴ uses a semantically constrained graph-based canonical code to encode MOF building blocks. The variational autoencoder framework structure allows this model to interpolate between existing MOF structures and enables isoreticular optimization of MOF structures toward higher CO₂ capacity and CO₂/N₂ selectivity.

This work. Proposed generative models include variational autoencoders, generative adversarial networks, normalizing flows, autoregressive models, and diffusion models²⁵. Here, we adopt a diffusion model named DiffLinker to generate novel MOF linkers. Diffusion models use a probability distribution and Markovian properties to generate new data via forward diffusion and backward denoising steps. First, Gaussian noise is added to the input samples to yield noisy data. A neural network is then trained to learn what noise was added. The trained network is then used to reversely transform (i.e., denoise) the noisy data back into target samples that resemble molecules from the training data distribution. This method has been widely applied to drug discovery to speed up the design of new ligands and ligand-protein complexes^26,27,28,29. Two broad categories of molecular representation schemes have been used in diffusion model-based ligand generators: molecular graphs and 3D coordinates. In the former case (Digress/Congress³⁰ is an example), atoms are represented as nodes and bonds as edges. In the latter case, the 3D atomic coordinates of molecules are generated directly, as in DiffLinker³¹ and E(3) equivariant diffusion model (EDM)³². Given the success of diffusion models in drug discovery^{26,28,29,33,34}, we demonstrate how to transfer the idea to the iso-reticular design of MOF structures by varying MOF linkers while fixing node and topology. The diffusion model is used specifically for MOF linker design.

Our proposed approach, GHP-MOFassemble, is a novel high-throughput computational framework to accelerate the discovery of MOF structures with high CO₂ capacities and synthesizable linkers. While previous attempts to discover useful MOFs via computational methods have proceeded via high-throughput screening of existing datasets^14,35,36, an approach that necessarily limits the search to known MOFs. In contrast, GHP-MOFassemble probes the MOF design space by employing a molecular generative diffusion model, DiffLinker³¹, to generate chemically diverse and unique MOF linkers from a set of molecular fragments, which it assembles with pre-selected metal nodes to form novel MOF structures. It then screens those structures with a pre-trained regression model, a modified version of Crystal Graph Convolutional Neural Network (CGCNN)³⁷, to identify high-performing MOF candidates for carbon capture. In addition, we apply molecular dynamics (MD) and grand canonical Monte Carlo (GCMC) to further down-select stable AI-generated MOFs and compute more credible CO₂ adsorption capabilities. We demonstrate the utility of our framework by applying it to the rational design of MOFs with pcu topology and three types of inorganic nodes: Cu paddlewheel, Zn paddlewheel, and Zn tetramer.

Results

Here we describe the key components of GHP-MOFassemble, and present a new set of AI-generated MOF structures with high CO₂ capacities. A detailed description of the approaches used to obtain these findings is provided in Methods.

Analysis of the hMOF dataset

The three most frequent node-topology pairs in the hMOF dataset, accounting for around 74% of its MOFs, are the Cu paddlewheel-pcu, Zn paddlewheel-pcu, and Zn tetramer-pcu, which amount to 102,117 hMOF structures (29,714 + 28,529 + 43,874 from first column of Table 1). Out of these 102,117 structures, we only used those with correctly parsed MOFids and valid Simplified Molecular-Input Line-Entry System (SMILES), i.e., 78,238 MOFs.

Table 1 Most frequent node-topology pairs in the hMOF dataset.

Full size table

Linker generation and evaluation

For all three MOF types, we start with 540 (i.e., = 180 + 162 + 198 from Table 1 last column) unique molecular fragments extracted from high-performing hMOF structures, and use DiffLinker to generate new MOF linkers, with the number of sampled connection atoms ranging from five to 10 (total six connection atom types). For each linker, sampling is performed 20 times. Therefore, the resulting candidate pool contains a total of 64,800 linkers, i.e., 540 fragments × 20 independent sampling steps × six unique number of connection atoms. Since these linkers only contain heavy atoms, to generate all-atom linker molecules, we apply openbabel to add hydrogen atoms, which results in 56,257 linkers after removing linkers with erroneous hydrogen assignments. Next, dummy atom identification is performed to generate information that enables assembly with metal nodes. A total of 16,162 linkers with dummy atoms are generated. These linkers are then passed through the element filter, which removes linkers that contain S, Br, and I, further reducing the number to 12,305 linkers. The number of linkers in each step are summarized in the Total column of Table 2.

Table 2 Number of linkers at each step of the MOF assembly process.

Full size table

MOF assembly

We generate new MOF structures by assembling three randomly selected DiffLinker-generated linkers with one of the three most frequent nodes in the hMOF dataset. Random sampling of 12,305 linkers ensures that the selected linkers cover a large design space. For each node-linker-topology combination, we considered four levels of catenation (cat0, cat1, cat2, cat3). We generated a total of 120,000 MOFs as follows: we have four catenation levels and three node candidates. The random sampling of 10,000 linkers for each catenation level-node candidate pair generates 120,000 total MOFs of different catenation-node-linker combinations.

MOF geometry examination

Inter-atomic distance check

We used Pymatgen to read each MOF structure’s CIF file and to extract the pairwise distance matrix. Diagonal entries of the distance matrix are discarded as they signify the distance between an atom and itself. The off-diagonal values are tested for minimum inter-atomic distance threshold. The inter-atomic distance threshold is predetermined by an experimental database, OChemDb³⁸. If a MOF has at least one entry of lower than threshold inter-atomic distance, the MOF is discarded. 78,796 of the 120,000 assembled MOFs passed this test.

Pre-simulation check

We used the open source library, cif2lammps³⁹, to automatically assign the Universal Force Field for Metal-Organic Frameworks (UFF4MOF)^40,41 to MOFs and generate LAMMPS⁴² input files. This step ensures that all atomic structures and bonding appearing in each MOF structure are chemically valid within the scope of UFF4MOF. For example, if a Zn atom is bonded to a C atom, or if an O atom is bonded to four other atoms, etc., these structures will be identified as invalid and will be discarded. Of the 78,796 MOFs that passed geometry and inter-atomic distance checks, 18,770 passed this pre-simulation check, and thus had LAMMPS input files generated.

Regression model for MOF CO₂ capacity prediction

To reduce the number of LAMMPS simulations, a screening step of the MOFs’ adsorption performance is conducted on the 18,770 MOFs described above using a modified version of the CGCNN model that we introduced in Ref. ³⁷. Here, we used MOF structures from the hMOF dataset as well as their CO₂ capacities at 0.1 bar as input data. We split the hMOF dataset into three independent sets: 80% for training, 10% for validation, and 10% for testing. Using this data split, we trained three CGCNN models using random initialization of weights for each of them. When we use this model ensemble to infer the CO₂ capacities of newly AI-generated MOF structures, we take the average of the predictions made by the three independent models as the predicted CO₂ adsorption capacity. Out of the 18,770 AI-generated MOF structures, a total of 364 were predicted to be high-performing MOFs with CO₂ capacity higher than 2 m mol g⁻¹ at 0.1 bar. Specifically, we used ensemble model prediction mean plus standard deviation as CO₂ adsorption capacity value to be higher than 2 m mol g⁻¹ as the threshold. The standard deviation consideration is to assure we account for statistical errors of adsorption predictions by our ensemble model. Categorization of the predicted high-performing MOFs by node-topology pair and catenation level indicates that for all three node types most of the high-performing MOFs are cat2 and cat3.

Structural validation of AI-generated MOFs with molecular dynamics simulations

We now examine the stability and porous properties of the 364 AI-generated MOFs described in the previous section using MD simulations with LAMMPS⁴². For each MOF, a 2 × 2 × 2 supercell structure is equilibrated under a triclinic isothermal-isobaric ensemble (i.e., NPT) at 〈p〉 = 1 atm and 〈T〉 = 300K, such that the cell lengths and angles of the MOF structure can be equilibrated. The NPT simulations are run for 400,000 steps with step size of 0.5 femtoseconds. The relative changes of the lattice parameters before and after the simulation are inspected to evaluate how well the MOF structure is maintained throughout the equilibrium MD simulation. The MOFs with higher than 5% changes in any of the lattice parameters (a, b, c, α, β, γ) are discarded (first three are lattice vector lengths and the latter three are lattice angles) between the initial state and the equilibrated state. The MOFs with higher than 5% changes in any of the lattice parameters (a, b, c, α, β, γ) are discarded. This process produces 102 MOFs that satisfied the previously defined criteria, i.e., < 5% in lattice parameter change⁴⁰.

Property validation of AI-generated MOFs with grand canonical Monte Carlo (GCMC) simulations

These 102 MOFs are further examined with GCMC simulations to calculate their CO₂ adsorption capacity at a pressure and temperature of 0.1 bar and 300 K, respectively. RASPA⁴³ is used to conduct these GCMC CO₂ adsorption simulations. As demonstrated in Fig. 1, 6 MOFs candidates were found to have CO₂ adsorption capacity higher than 2 m mol g⁻¹ by the GCMC simulations. The 3D structure of these six high performing, AI-generated MOFs is presented in Fig. 2.

**Fig. 1: CO₂ adsorption values at 0.1 bar and 300K.**

**Fig. 2: Visualization of the crystal structure of AI-generated MOFs.**

Discussion

GHP-MOFassemble combines novel generative and graph AI applications, as well as a comprehensive screening workflow that combines various modeling methods with increasing levels of chemistry awareness and increasing levels of computational cost. When we deployed and optimized GHP-MOFassemble on the Delta supercomputer, unless indicated otherwise, we found that:

We AI-assembled 120,000 MOFs within 33 minutes using multiprocessing on 28 cores in the ThetaKNL supercomputer at the Argonne Leadership Computing Facility (ALCF).
We screened these 120,000 MOFs in 40 minutes using multiprocessing on 128 cores, and identified 78,796 MOFs with valid bond lengths.
We further screened these 78,796 MOFs and identified 18,770 MOFs with valid chemistry, using UFF4MOF as a reference, within 205 minutes using multiprocessing on 128 cores.
We then used our AI ensemble of CGCNN models to estimate the CO₂ capacity of these 18,770 MOFs, and identified 364 high-performing, AI-generated MOFs with CO₂ capacity higher than 2 m mol g⁻¹ at 0.1 bar. AI inference was completed in 50 minutes using one NVIDIA A40 GPU.

In brief, from assembly to selection of high-performing MOFs, GHP-MOFassemble completes the analysis within 5 hours and 7 minutes. Once we have selected 364 AI-generated, high-performing MOFs, we carry out the most compute-intensive part of the analysis, namely:

We used the LAMMPS code to equilibrate 364 MOFs with 200-picosecond NPT MD simulations with the UFF4MOF force field. Each of these simulations is completed within 11 minutes, on average, using between 6 to 14 MPI processes, depending on the number of atoms in the MOF. We identified 102 stable MOFs whose lattice parameters changed less than 5% throughout the MD simulations.
Finally, each of these 102 MOFs were further examined with GCMC simulations, and a total of six MOF candidates were found to have CO₂ capacity higher than 2 m mol g⁻¹, which corresponds to the top 5% of the hMOF dataset. Each of these 102 GCMC simulations is completed within six hours, on average, using one CPU core.

Recent studies^44,45,46,47 indicate that several functional groups play an important role in determining MOF CO₂ capacity, including carboxylic acid (-COOH), primary amine (-NH₂), hydroxyl (-OH), and nitrile (-CN). These functional groups affect MOF CO₂ capacity due to their interaction with CO₂ molecules through charge redistribution⁴⁶, hydrogen bond interaction⁴⁵, and electrostatic interaction⁴⁵. Thus, we analyzed the functional groups of linkers in the 364 AI-generated high-performing MOF structures and compared them with high-performing hMOF structures. The results of this analysis are presented in Fig. 3. We observe that linkers in high-performing hMOF structures have a higher proportion of carboxylic groups, whereas hydroxyl groups appear more often in the predicted high-performing MOFs generated by our framework. Moreover, the percentages of primary amine and nitrile in linkers are similar in both cases. Note that some substructures appear more often than others. For example, ring structures appear in most of the linkers in the top candidates, and many of the DiffLinker-generated molecular substructures (in between the terminal fragments) also contain ring structures. The frequent occurrence of rings in linkers of high-performing MOFs may be due to the presence of CO₂-ring interaction, as reported in Ref. ⁴⁸. In addition, rings in the linkers may increase MOFs’ structural rigidity due to less rotatable bonds.

**Fig. 3: Functional groups in high-performing MOFs.**

In summary, the entire discovery process from generative AI MOF assembly to detailed GCMC simulations of top AI-predicted MOFs can be completed within 12 hours using distributed computing in modern supercomputing platforms. The entire workflow may be further accelerated by scaling up any of the subprocesses using more cores or by distributing AI inference over more GPUs. Furthermore, the various pieces of GHP-MOFassemble may be assembled to create a standalone workflow that combines MD, density functional theory, and GCMC simulations to expose our generative AI model to a larger set of chemically diverse, high performing MOFs. Through online learning methods one may continually guide generative AI until it consistently assembles high performing MOFs. A study of this nature will be pursued in future work.

Methods

Our GHP-MOFassemble framework has three components:

Decompose. We use a molecular fragmentation algorithm to decompose the MOF linkers found in high-performing MOF structures within the hMOF dataset²²—an open source dataset that provides, for each of 137,652 hypothetical MOF structures, and corresponding MOFid, MOFkey, geometric features, and isotherm data for six adsorption gases (CO₂, N₂, CH₄, H₂, Kr, Xe) at 0.01, 0.05, 0.1, 0.5, and 2.5 bar—covering the pressure ranges of cyclic adsorption gas separation processes in industrial applications. The adsorption properties of gases provided in this dataset were calculated using GCMC calculations²², as described in Ref. ⁴⁹.
Generate. We use a diffusion model to generate new MOF linkers. We then screen the AI-generated linkers by removing linkers with S, Br and I elements (we call this step “element filter”), and evaluate their quality using five different scores to quantify their synthesizability (SAScore and SCScore), validity, uniqueness, and internal diversity. Linkers that passed the element filter are then assembled with one of three pre-selected nodes into new MOF structures in the pcu topology.
Screen and Predict. We check for inter-atomic distances with pre-simulation checks to ensure we filter poor AI-generated MOF structures from previous steps. We then use a pre-trained regression model to predict the CO₂ capacities of the newly generated MOF structures. Lastly, we perform MD and GCMC simulations to obtain stable and high-performing MOF structures.

In the following sections we describe the datasets we have used, and each of the GHP-MOFassemble components in detail.

The hMOF dataset

We selected the most common topologies of the hMOF dataset for this work, i.e., Cu paddlewheel-pcu, Zn paddlewheel-pcu, and Zn tetramer-pcu, i.e., 78,238 MOFs with correctly parsed MOFids and valid SMILES. We gather the numbers of molecular fragments in Table 1, produced through the Fragment step of our GHP-MOFassemble framework, described below. In Table 1 high-performing MOFs (second column) with three parsed linkers are selected (third column), and their unique linkers are parsed by using the MMPA (Matched Molecular Pairs Algorithm) algorithm. The cumulative distribution functions of the CO₂ capacities of these three types of MOFs are shown in the right panel of Fig. 4.

In Table 1, the number of output molecular fragment conformers (last column) is much less than the number of unique linker SMILES (second to last column) for two reasons. First, around 56% of linkers did not pass the valency check, which may be due to the intrinsic limitations in the parsing of SMILES of MOF linkers. Second, around 90% of linkers that passed the valency check share similar molecular fragments, and thus many duplicated molecular fragments exist. The successfully parsed unique molecular fragment conformers are subsequently used for linker generation.

Figure 5 shows the empirical cumulative distribution functions of the CO₂ capacities of hMOF structures at different catenation levels (i.e., MOFs with interpenetrated lattices). Therein, we observe that a higher percentage of catenated MOFs (cat1, cat2 and cat3) are high-performing, as compared to the uncatenated MOFs (cat0). This result confirms that catenation is an important factor when designing new MOF structures with high CO₂ capacity. This observation is consistent with other studies in the literature^50,51, which indicate that even though catenation reduces pore size and surface areas, catenated MOFs generally have higher CO₂/H₂ selectivities because MOF-CO₂ interactions are enhanced as a result of the strong confinement of CO₂ with a much lower adsorption surface. Thus, the results presented in Fig. 5 using the hMOF dataset indicate that CO₂ working capacities of catenated MOFs are higher than their noncatenated counterparts. As we discuss below, we found a similar pattern in AI-generated MOFs. Catenated MOFs were generated by using site translation method, which is achieved by displacing the reference lattice along the diagonal line of the unit cell. For all four catenation levels, the amount of relative lattice displacement is given in Table 3. The numbers are fractional displacements relative to the unit cell diagonal.

**Fig. 5: CO₂ capacities of hMOF structures at different catenation levels.**

Table 3 Relative lattice displacement at each catenation level.

Full size table

Figure 6 shows the pairwise relationships of the CO₂ capacities of hMOF structures at five pressures. We observe a strong correlation of CO₂ capacities between 0.05 bar and 0.1 bar, with a Pearson’s correlation coefficient of 0.86. For other pairs of pressures, the CO₂ capacities are only weakly correlated, with a decreasing correlation for larger pressure differences. Moreover, the distributions of CO₂ capacities at all pressures exhibit long tails at high value ranges, which indicate that the majority of MOFs are low-performing and high-performing MOFs are uncommon.

**Fig. 6: CO₂ capacities of hMOF structures at different adsorption pressures.**

Decomposing MOF linkers into molecular fragments

The first component of the GHP-MOFassemble framework, Decompose, decomposes linkers from high-performing MOF structures in the hMOF dataset into their molecular fragments. As illustrated in Fig. 7, this process applies the following three steps to a specified node-topology pair, which in the figure is the most frequent node-topology pair in the hMOF dataset, Zn tetramer-pcu. Note that for the pcu topology, one linker is orientated along each of the x, y, and z directions, and thus at most three unique linker types are possible.

1.
Select: We select high-performing MOFs with the given node-topology pair from hMOF. We define here a high-performing MOF structure as one with CO₂ capacity higher than 2 m mol g⁻¹ at 0.1 bar, which corresponds to the top 5% of CO₂ capacity of the hMOF dataset.
2.
Extract: We extract the linker Simplified Molecular-Input Line-Entry System (SMILES) strings for each high-performing MOF identified in Select step, eliminate those with more than three unique linker types, and assemble the remaining into a linker dataset.
3.
Fragment: We fragment the linkers produced in Extract step to obtain their chemically relevant fragment-connection atom pairs, which we assemble into a molecular fragment dataset.

**Fig. 7: First component of the GHP-MOFassemble framework.**

The extraction of linkers in Extract step is straightforward because the MOFid of each hMOF structure specifies the SMILES strings of its constituent metal nodes, linkers, as well as a format signature and a topology code⁵². Together, these elements uniquely define the topology and building blocks of a given MOF structure.

Fragment step use MMPA⁵³ as implemented in DeLinker⁵⁴ to generate molecular fragments of a given molecule by breaking chemical bonds between atom pairs. We set the minimum number of connection atoms, the minimum fragment size, and the minimum path length to 3, 5, and 2, respectively. Moreover, we only consider the case where the fragments are at least two atoms away from each other. The chemically relevant fragment-connection atoms pairs are then used to form the molecular fragment dataset.

Generating new MOF structures

The Generate component employs the pre-trained diffusion model DiffLinker³¹ to generate new MOF linkers, and then assembles those new linkers with one of three pre-selected nodes into new MOF structures in the pcu topology. It comprises three steps: Diffuse and Denoise; Screen and Evaluate; and Assemble. An example of these three steps for Zn tetramer-pcu MOFs is shown in Fig. 8.

**Fig. 8: Second component of the GHP-MOFassemble framework.**

Diffuse and Denoise

We apply the pre-trained diffusion model DiffLinker³¹ to generate new linkers based on the molecular fragments outputted by the Decompose component. This model connects the molecular fragments supplied as input (also known as context) with a specified number of sampled atoms. With less sampled atoms, straight chains or branched chains may be obtained, while with more sampled atoms, ring structures may be present. Moreover, during the denoising process, the species of the sampled atoms may change. In this work, we vary the number of sampled atoms from 5 to 10 to ensure that the generated linkers have a diverse set of chemistry and substructures.

DiffLinker applies a denoising process to determine the atomic species and Cartesian coordinates of the sampled atoms by using a decoder neural network architecture named E(3)-Equivariant Graph Neural Network (EGNN)⁵⁵. This graph neural network predicts zero-mean centered Cartesian coordinates noise and one-hot encoding noise of atomic species, and accounts for equivariance due to translation, rotation, and reflection of molecules, i.e., Euclidean Group 3 (E(3)). In this work, we use the pre-trained DiffLinker model that was trained on the GEOM dataset⁵⁶, which contains 37 million molecular conformations for over 450,000 molecules. Such diverse training data enables DiffLinker to sample linker molecules with high chemical and structural diversity.

The outputs of DiffLinker are the 3D coordinates and the atomic species of heavy atoms in the linker molecules, rather than SMILES strings or 2D graphs. Since hydrogen atoms are implicit in the DiffLinker model, their 3D coordinates are not outputted by the model. To generate the spatial configurations of all-atom molecules, we employ openbabel to convert the DiffLinker generated molecules to SMILES strings, which in turn are used to generate 3D configurations of linkers with explicit hydrogen atoms. Openbabel uses distance-based heuristics to determine bond connectivity in a given molecule⁵⁷, which is critical to the assignment of the number of hydrogen atoms. Finally, after the hydrogen addition step, we identify the dummy atoms which contains information about how the linkers connect with metal nodes.

The diffusion process involves two consecutive steps. The first step adds Gaussian noise to the original data (i.e., x), yielding noisy input (i.e., z_t at time t). Next, the denoising step applies neural network-based noise removal operation. Mathematically, the following Markovian properties and equations are satisfied:

$$q({z}_{t}| {z}_{t-1})={{{{{{{\mathcal{N}}}}}}}}({z}_{t};{\bar{\alpha }}_{t}{z}_{t-1},{\bar{\sigma }}_{t}^{2}{{{{{{{\boldsymbol{I}}}}}}}}),$$

(1)

$$q({z}_{t}| {{{{{{{\bf{x}}}}}}}})={{{{{{{\mathcal{N}}}}}}}}({z}_{t};{\alpha }_{t}{{{{{{{\bf{x}}}}}}}},{\sigma }_{t}^{2}{{{{{{{\boldsymbol{I}}}}}}}}),$$

(2)

$$p({z}_{t-1}| {z}_{t})=q({z}_{t-1}| {{{{{{{\bf{x}}}}}}}},{z}_{t}),$$

(3)

$$q({z}_{t-1}| {{{{{{{\bf{x}}}}}}}},{z}_{t})={{{{{{{\mathcal{N}}}}}}}}({z}_{t-1};{\mu }_{t}^{\theta }({{{{{{{\bf{x}}}}}}}},{z}_{t};{\alpha }_{t},{\sigma }_{t}),{\xi }_{t}{{{{{{{\boldsymbol{I}}}}}}}}),$$

(4)

$${\mu }_{t}^{\theta }({{{{{{{\bf{x}}}}}}}},{z}_{t};{\alpha }_{t},{\sigma }_{t})={A}_{t}{z}_{t}+{B}_{t}{{{{{{{\bf{x}}}}}}}},$$

(5)

$$\hat{{{{{{{{\bf{x}}}}}}}}}={{{{{{{\bf{x}}}}}}}}={C}_{t}{z}_{t}-{D}_{t}{\epsilon }_{t}^{\theta }({z}_{t},t),$$

(6)

where q and p are the probability density functions during forward (diffusion) and reverse (denoising) process, respectively. ${{{{{{{\mathcal{N}}}}}}}}$ stands for normal distribution. ${\bar{\alpha }}_{t}={\alpha }_{t}/{\alpha }_{t-1}$, and ${\bar{\sigma }}_{t}^{2}={\sigma }_{t}^{2}-{\bar{\alpha }}_{t}^{2}{\sigma }_{t-1}^{2}$. ${{{{{{{{\boldsymbol{I}}}}}}}}}_{}^{}$ is an identity matrix that computes an isotropic Gaussian upon multiplying with a constant (e.g., ${\sigma }_{t}^{2}$); α_t and ${\bar{\alpha }}_{t}$ are signal controls, i.e., meaningful information during training and generative steps; and σ_t and ${\bar{\sigma }}_{t}$ are noise controls, i.e., noisy information used for diversity during training and generative steps. The subscript t = 0, 1,...,T is the time step at which a molecule is generated (a.k.a. denoised or diffused). ξ_t, A_t, B_t, C_t, and D_t are constants consisting of ${\alpha }_{t},{\bar{\alpha }}_{t},{\sigma }_{t}$, and ${\bar{\sigma }}_{t}$³¹.

Equations (1) and (2) represent a diffusion process that is conditioned on the previous value, z_t−1, and initial value, x, to predict a current value, z_t. The denoising processes are described by Equations (3) and (4), which predict a previous value conditioned on (either real or predicted) initial value and current value. Equations (5) and (6) provide a detailed evolution for ${\mu }_{t}^{\theta }$ which represents the learned denoising mean parameterized by θ, as well as x, a real initial value, approximated by $\hat{{{{{{{{\bf{x}}}}}}}}}$. The training objective is to learn a neural network, i.e., EGNN, to predict a denoising value so that random noise can be denoised to a physically and chemically valid molecule during the generative process. In practice, we predict ${\epsilon }_{t}^{\theta }$ (learned noise value) rather than $\hat{{{{{{{{\bf{x}}}}}}}}}$ directly (i.e., Equations (5) and (6)) for better prediction⁵⁸. Since DiffLinker generates chemically valid molecules, EGNN needs to take more regularization into account, such as E(3) and O(3) (i.e., orthogonal group in dimension 3: rotations and reflections). Thus, invariant features such as one-hot and equivariant features such as atomic coordinates updates need to be considered³¹.

Screen and Evaluate

To ensure consistency of elements in hMOF linkers and AI-generated MOF linkers, we manually filter out generated linkers that contain elements not present in the hMOF dataset. Since the pre-trained DiffLinker model we used in this work was trained on the GEOM dataset⁵⁶, a total of nine heavy elements may be sampled, including C, N, O, F, P, S, Cl, Br, and I. Among these elements, S, Br, and I do not appear in the hMOF dataset, therefore we remove generated linkers that contain these three elements. For each molecular fragment, we then perform sampling 20 times. Each time the model samples a different molecule may be obtained because of the random and probabilistic nature of denoising process. The probabilistic nature of diffusion model enables it to generate linkers from an extensive linker design space beyond that of the hMOF dataset.

We then use five metrics to evaluate the quality of the remaining linkers. The first two metrics are commonly used heuristic measures of synthesizability: the synthetic accessibility score (SAscore), and the synthetic complexity score (SCscore)⁵⁹.

The SAscore, as defined by Ertl and Schuffenhauer⁶⁰, is based on analysis of one million PubChem molecules, and combines fragment contributions from molecule substructures with a complexity penalty that accounts for molecular size and for structural features of molecules such as the presence of rings. The SCscore is computed by using a neural network trained on 12 million chemical reactions from the Reaxys database to estimate the number of reaction steps required to produce a molecule⁵⁹. For both SAscore and SCscore, the higher the values are, the more difficult it is to synthesize the linker, hence less desirable. Specifically, the SAscore⁶⁰ is defined as the difference of fragment score and complexity penalty:

$${{{{{{{\rm{SAscore}}}}}}}}={{{{{{{\rm{fragmentScore}}}}}}}}-{{{{{{{\rm{complexityPenalty}}}}}}}},$$

where fragment score is calculated by averaging over the contributions of non-zero elements of the Morgan fingerprint for all molecular fragments. The complexity score is calculated based on five components:

$$ {{{{{{{\rm{complexityPenalty}}}}}}}}=-{{{{{{{\rm{sizePenalty}}}}}}}}-{{{{{{{\rm{stereoPenalty}}}}}}}}\\ -{{{{{{{\rm{spiroPenalty}}}}}}}}-{{{{{{{\rm{bridgePenalty}}}}}}}}-{{{{{{{\rm{macrocyclePenalty}}}}}}}}.$$

The size penalty increases with the number of atoms:

$${{{{{{{\rm{sizePenalty}}}}}}}}={{{{{{{{\rm{nAtoms}}}}}}}}}^{1.005}-{{{{{{{\rm{nAtoms}}}}}}}},$$

The stereo penalty is calculated based on the number of chiral centers:

$${{{{{{{\rm{stereoPenalty}}}}}}}}=\log ({{{{{{{\rm{nChiralCenters}}}}}}}}+1),$$

The spiro penalty is calculated based on the number of spiro centers, or atoms that connect two rings together:

$${{{{{{{\rm{spiroPenalty}}}}}}}}=\log ({{{{{{{\rm{nSpiro}}}}}}}}+1),$$

The bridge penalty term is calculated based on the number of bridgehead atoms, or atoms that belong to two or more rings:

$${{{{{{{\rm{bridgePenalty}}}}}}}}=\log ({{{{{{{\rm{nBridgeheads}}}}}}}}+1),$$

Lastly, if macrocycles are present, the macroPenalty is:

$${{{{{{{\rm{macrocyclePenalty}}}}}}}}={{{{{{{\rm{math}}}}}}}}.\log 10(2),$$

otherwise, the macroPenalty is zero. The SCscore⁶⁰ is calculated by using a neural network trained on 12 million reactions from the Reaxys database. The synthetic complexity of each molecule is rated from 1 to 5, with 1 being the easiest and 5 the most difficult to synthesize. To evaluate the capability of GHP-MOFassemble to generate a valid, novel, and diverse set of linkers, we leverage the MOSES framework⁶¹ to compute three additional metrics: the fraction of valid linker SMILES strings (validity), the fraction of unique linker SMILES strings (uniqueness), and the dissimilarity of linkers (internal diversity). Validity measures whether atoms in generated molecules have the correct valency, whereas uniqueness quantifies the percentage of any molecule that is different from the rest of molecules.

We show in Fig. 9 the synthetic accessibility score (SAscore) and synthetic complexity score (SCscore) values for the remaining linkers. We observe that as the number of sampled atoms increases, both distributions generally shift to the right, with the exception of the SAscore distribution with six sampled atoms, which is to the left of that with five atoms. This general trend indicates that linkers become harder to synthesize as the molecules become bigger. This result is expected because as more atoms are sampled, more complex substructures may be present. We note that no linker has zero or very large SAscore or SCscore, values that would indicate unsynthesizability.

**Fig. 9: Synthesizability of DiffLinker-generated MOF linkers with dummy atoms.**

We show in Table 4 the validity, uniqueness, and internal diversity metrics, grouped by the number of sampled atoms and node-topology pairs of the corresponding MOFs. The validity column confirms that all generated linkers are valid. The last three columns, when reviewed from top to bottom, reveal a considerable increase in linker uniqueness and a more modest increase in internal diversity as the number of sampled atoms increases. High uniqueness values indicate that our model is generating non-duplicate molecules, whereas high internal diversity values indicate that the generated molecules are chemically diverse. Internal diversity (which indicates how dissimilar a specific linker is to the rest of the population) is computed using two internal diversity scores: IntDiv₁ and IntDiv₂⁶¹, also shown in Table 4. The increase in these metrics as more atoms are sampled is expected because the degrees of freedom of atomic species and their spatial coordinates increase with molecular size.

Table 4 Statistical properties of AI-generated linkers.

Full size table

The internal diversity scores were calculated based on the Tanimoto distances⁶² among all pairs of molecules, which are obtained by calculating the normalized Jaccard score of Morgan fingerprint bit vector between all pairs of AI-generated linkers. This yields similarities between a specific linker and the rest of the linkers in the linker pool, which are averaged out. Therefore, we end up with a metric showing each linker’s similarity compared with the population of generated linkers. We used the MOSES⁶¹ framework to compute the internal diversity scores IntDiv₁ and IntDiv₂⁶¹ by using the relations:

$${{{{{{{{\rm{IntDiv}}}}}}}}}_{1}(G)=1-\frac{1}{| G{| }^{2}}\mathop{\sum}\limits_{{m}_{1},{m}_{2}\in G}{T}_{d}({m}_{1},{m}_{2}),$$

(7)

$${{{{{{{{\rm{IntDiv}}}}}}}}}_{2}(G)=1-\sqrt{\frac{1}{| G{| }^{2}}\mathop{\sum}\limits_{{m}_{1},{m}_{2}\in G}{T}_{d}{({m}_{1},{m}_{2})}^{2}}\,,$$

(8)

where T_d is the Tanimoto distance, which relates to the Tanimoto similarity (T_s) by:

$$\begin{array}{r}{T}_{d}({m}_{1},{m}_{2})=1-{T}_{s}({m}_{1},{m}_{2}),\end{array}$$

(9)

where G is the generated set of molecules, ∣G∣ is the size of that set, and (m₁, m₂) is a pair of molecules in the set.

To measure the similarity between the AI-generated linkers and hMOF linkers in high-performing MOF structures, we show in Fig. 10 the distribution of maximum Tanimoto similarity between the two linker sets. For each unique linker in the 364 AI-generated high-performing MOF structures we calculated its Tanimono similarity with all of the unique linkers in the hMOF structures. The maximum value of the Tanimoto similarities gives a quantitative measure of how different the generated linkers are from the hMOF linkers. Results in Fig. 10 indicate that most linkers in the predicted high-performing MOF structures are vastly different from those in hMOF, i.e., our GHP-MOFassemble framework generates novel MOF structures with chemically unique linkers.

**Fig. 10: Similarity between AI-generated and hMOF linkers in high-performing MOF structures.**

MOF Assembly

Once we have generated linkers, we can then assemble them with metal nodes. To construct MOFs, we need to guide the assembly process. We do this by using dummy atoms, which indicate the points at which the building blocks are to be connected. In practice, this is done as follows. Our parsing of MOFids generates, for MOFs with Zn tetramer nodes, linkers that carry two carboxyl groups, which by definition are part of metal nodes instead of linkers. Such wrong assignment of carboxyl groups is due to how the MOFid algorithm parses MOF structures. To reflect the correct molecular structure of linkers, the two carbon atoms in the carboxyl groups are identified as dummy atoms, and the redundant four oxygen atoms and two hydrogen atoms are removed. An illustration of dummy atom identification for linkers containing carboxyl groups is shown in the left panel of Fig. 11.

**Fig. 11: Assembly of AI-generated linkers with metal nodes.**

For MOFs with Cu paddlewheel and Zn paddlewheel node, however, another type of linker containing heterocyclic rings exists. For this type of linker, the two atoms that are nitrogen-metal bond distance away from the terminal nitrogen atoms on the heterocyclic rings are identified as dummy atoms. After the dummy atoms are correctly identified, three randomly selected linkers (duplicates allowed) are assembled with one of the three pre-selected nodes into MOFs in the pcu topology. An illustration of dummy atom identification of linkers containing heterocyclic groups is shown in the right panel of Fig. 11.

More than half of the hMOF structures are catenated MOFs, i.e., MOFs with interpenetrated lattices. By varying the level of interpenetration, it is possible to generate MOFs with different pore sizes, with a higher catenation level generally corresponding to smaller pores. We denote the four catenation levels in hMOF, with increasing number of interpenetrated lattices, as cat0, cat1, cat2 and cat3. To generate MOFs with high structural diversity, we applied site translation method as implemented in Pymatgen⁶³ to generate MOFs with different catenation levels. Figure 12 presents the building block combinations of the final six MOF candidates that passed all screening processes. For linkers, the corresponding molecules (without dummy atoms) are shown for ease of visualization. For catenated MOF structures, we keep the spacing between interpenetrated lattices the same, so as to ensure equal pore sizes.

**Fig. 12: Optimal linkers and catenation levels for MOF assembly.**

Screening and predicting properties of AI-generated assembled MOFs

Here we describe the final Screen and Predict component of GHP-MOFassemble. After the MOF structures are assembled, a comprehensive screening workflow is developed to sift through the candidates with increasing levels of computational cost and confidence level:

Inter-atomic Distance Check (screen)
Pre-simulation Check (screen)
Predicting CO₂ Capacity of MOF via Pre-trained Regression Model (screen and predict)
Structure Validation of the Assembled MOFs (screen)
Grand Canonical Monte Carlo Simulation (screen and predict)

Inter-atomic Distance Check

A pairwise distance matrix M_n×n is extracted from a MOF’s CIF file. The matrix element M_i,j denoting the Euclidean distance between atom A_i and atom A_j is:

$${M}_{i,j}=\sqrt{{({x}_{i}-{x}_{j})}^{2}+{({y}_{i}-{y}_{j})}^{2}+{({z}_{i}-{z}_{j})}^{2}}\,,$$

(10)

where 1 ≤ i ≤ n and 1 ≤ j ≤ n, and (x_i, y_i, z_i) and (x_j, y_j, z_j) are the Cartesian coordinates of atoms A_i and A_j. For an arbitrary pair of atoms A_i and A_j, their chemical symbols can be denoted as E_i and E_j, respectively. The bond length between atoms A_i and A_j is δ(E_i, E_j). The minimum allowed inter-atomic distance between the these two atoms can be estimated by calculating the infimum of all experimental bond lengths between these two element types, i.e.: $\inf (\delta ({E}_{i},{E}_{j}))$, which can be queried from the OChemDb³⁸ database. All values of M_i,j (i ≠ j) are compared against $\inf (\delta ({E}_{i},{E}_{j}))$, and any occurrence of ${M}_{i,j} < \inf (\delta ({E}_{i},{E}_{j}))\;(i\ne j)$ will result in the MOF candidate being discarded. Entries with i = j (i.e., the diagonal entries of the distance matrix M_i,i) are not examined as they should always be zero, because the distance between an atom and itself is zero. We used Python 3.10’s built-in multiprocessing library to help launch the inter-atomic distance check on different MOF structures in parallel.

Pre-simulation Check

For each atom in the MOF structure, the pre-simulation check workflow can be described as the following:

1.
The atom’s element should be recognized as one of the supported elements in the UFF4MOF parameter set,
2.
The atom’s neighboring atoms that are within the allowed bonding distance are examined for chemistry validity.

We conducted this process with the help of the cif2lammps library. Python’s multiprocessing library is also used to speed up this process over many structures in parallel.

Predicting CO₂ capacity of MOF via pre-trained regression model

The GHP-MOFassemble framework uses an ensemble of regression models to estimate the CO₂ capacity of newly generated pcu MOF structures. We use for this purpose a modified version of the CGCNN model, developed in our previous work³⁷, which adopts an adjacency list to format node and edge embeddings, rather than the adjacency matrix format of the original CGCNN, a change that enhances model training speed, training stability, and prediction accuracy. To fine-tune this AI model, we used MOF structures in the hMOF dataset, along with their CO₂ capacities at 0.1 bar as input. We split the hMOF dataset into 80% training set, 10% validation set, and 10% test test. We independently trained three modified versions of the CGCNN model for 5000 epochs with a batch size of 160, using Adam as the optimizer, at a learning rate of 10⁻⁴, and a weight decay rate of 2 × 10⁻⁵.

Table 5 shows the R² score, mean absolute error (MAE), and root mean squared error (RMSE) of the three AI models ensemble on the 10% test set. Figure 13 shows the distribution of the standard deviations of ensemble model predictions. We treat the standard deviation of predictions from the three models as a measure of model uncertainty, which we view as arising from the model’s difficulty in learning certain data points. This uncertainty is also known as epistemic uncertainty. By employing an ensemble of models and averaging their prediction results, we can increase prediction accuracy, which is thoroughly and extensively explored in the machine learning community^64,65. We chose the threshold for the standard deviation filter to be 0.2 m mol g⁻¹ because it is sufficiently small (with 96% of data points below the threshold) that for low CO₂ capacity predictions, the predicted values across the ensemble are similar. On the other hand, for high CO₂ capacity predictions, it also ensures that the outliers (i.e., predicted values across the ensemble are vastly different) are filtered out, therefore minimizing the overall error. In Table 6 we also notice that most of the high-performing MOFs are cat2 and cat3. This is consistent with our preliminary analysis on catenation levels and distributions of CO₂ adsorption capacity in Fig. 5.

Table 5 Statistical properties of AI ensemble to characterize MOF’s performance.

Full size table

**Fig. 13: Standard deviation of AI model ensemble to estimate MOFs’ CO₂ capacity.**

Table 6 Catenation level of high-performing MOFs.

Full size table

We show in the left panel of Fig. 14 the scatter plot of the ensemble model predictions for the 10% test set (from hMOF), which has a MAE of 0.093 m mol g⁻¹. Using CO₂ capacity of 2 m mol g⁻¹ as a threshold, we repurposed our predictive model as a classifier to categorize MOFs into low and high performers, with predicted CO₂ capacities below and above the threshold, respectively. Using this scheme, the confusion matrix for identifying low and high performers is shown in the right panel of Fig. 14. We conclude from the confusion matrix that the pre-trained model classifies both low and high performers with high accuracy, with 98.4% (13,551 out of 13,765) of test samples (of hMOF) correctly classified. As the test set is heavily imbalanced, with many more low performers than high performers, we also calculated the balanced accuracy of classification⁶⁶ (the sum of true positive rate and true negative rate divided by two for binary classification or average of recalls for multi-class classification), obtaining a value of 90.7%. Since the majority of the 214 misclassified samples lie close to the decision boundaries, as shown in the red and purple regions of the left panel of Fig. 14, we conclude that the ensemble model is capable of differentiating low and high performers.

**Fig. 14: Performance of AI ensemble to measure CO₂ capacity.**

We show in Fig. 15 the empirical cumulative distribution functions of the CO₂ capacities of hMOF (dashed lines) structures and AI-generated MOF structures (solid lines) at different catenation levels. We observe that there are more high-performing cat2 and cat3 MOFs among the generated structures compared to the hMOF dataset, as indicated by lower cumulative density values at 2 m mol g⁻¹ (black dashed line). On the other hand, there are fewer high-performing cat0 and cat1 MOFs, as shown by higher cumulative density values at 2 m mol g⁻¹. This result demonstrates that within the AI-generated MOF design space, there are more high-performing candidates with higher catenation levels compared to lower catenation levels.

**Fig. 15: MOFs’ CO₂ capacities in terms of catenation levels.**

In summary, we use an ensemble of 3 AI models to identify pcu AI-generated MOFs produced by the Generate step (in addition to inter-atomic distance and pre-simulation checks) that have predicted CO₂ capacities higher than 2 m mol g⁻¹. We use the ensemble mean plus standard deviation as the final prediction. These MOFs are then selected for further investigation with MD simulations.

Structure Validation of the Assembled MOFs

Pre-screened MOFs are checked for stability using MD simulations with LAMMPS⁴², version stable release 2 August 2023 compiled with gcc 11.2.0 and OpenMPI 4.1.2. For each MOF, a 2x2x2 supercell of triclinic periodic boundary condition is created. A triclinic isothermal-isobaric ensemble (i.e., NPT) at 300 K and 1 atm is applied to the supercell structure. All lattice parameters: a, b, c, α, β, γ are allowed to equilibrate freely. A relatively short simulation with 400,000 steps is conducted for each MOF to allow for structural equilibrium with a step size of 0.5 femtoseconds. The potential energy E_pot of the MOF structure can be expressed as:

$${E}_{{{{{{{{\rm{pot}}}}}}}}}={E}_{{{{{{{{\rm{bond}}}}}}}}}+{E}_{{{{{{{{\rm{angle}}}}}}}}}+{E}_{{{{{{{{\rm{dihedral}}}}}}}}}+{E}_{{{{{{{{\rm{improper}}}}}}}}}+{E}_{{{{{{{{\rm{Vdw}}}}}}}}}+{E}_{{{{{{{{\rm{Coul}}}}}}}}},$$

(11)

where E_bond is the bond stretch energy, E_angle is the angle between bonds energy, E_dihedral is the dihedral angle energy, E_improper is the improper dihedral angle energy, E_vdw is the nonbonded interaction energy, and E_coul is the Coulombic interaction energy.

Coupry et al. (2016)⁴⁰ stated that 76.5% of the CoRE structures¹⁴ lattice parameters are within 5% change when they are simulated with the UFF4MOF parameters. A similar criterion is used here such that if any of MOF’s lattice parameters of triclinic cell changes are more than 5%, the structure is discarded in the screening process. This screening process provides a new level of confidence in terms of structural stability of the AI-generated MOFs. Figure 16 demonstrates the relative error of all lattice parameters of the top six MOF candidates during the MD simulation.

**Fig. 16: Lattice parameter relative errors of the top 6 AI-generated MOFs.**

Grand Canonical Monte Carlo (GCMC) Simulations

The void fraction of each candidate MOF is calculated using the RASPA⁴³’s helium void fraction function. It helps us understand the porosity contribution toward MOF CO₂ adsorption capacity. Next, PACMOF⁶⁷ is used to assign partial charges to MOFs. Using the pre-trained partial atomic charge model on the density-derived electrostatic and chemical (DDEC) charge⁶⁸ data, the partial charges are assigned to atoms in MOF structures. After determining the partial charges, the MOF structures are assigned with UFF4MOF force field parameters. The GCMC simulations are conducted at 0.1 bar and 300 K with a step size of 0.5 fs/step. The MOF structures are under the rigid assumption, therefore only Van der Waals interactions and Coulombic interactions (i.e., non-bonded) are considered:

$${E}_{{{{{{{{\rm{pot}}}}}}}}}={E}_{{{{{{{{\rm{Vdw}}}}}}}}}+{E}_{{{{{{{{\rm{Coul}}}}}}}}},$$

(12)

which can be expressed as⁶⁹

$${E}_{{{{{{{{\rm{pot}}}}}}}}}=\mathop{\sum}\limits_{i,j}4{\epsilon }_{ij}\left[{\left(\frac{{\sigma }_{ij}}{{r}_{ij}}\right)}^{12}+{\left(\frac{{\sigma }_{ij}}{{r}_{ij}}\right)}^{6}\right]+\frac{{q}_{i}{q}_{j}}{4\pi {\epsilon }_{0}{r}_{ij}},$$

(13)

where ϵ_ij, σ_ij, and r_ij are the well depth, zero position, and inter-atomic distance of a Lennard-Jones 12-6 interaction between atom i and atom j; q_i and q_j are the electrostatic charges of atom i and atom j; ϵ₀ is the vacuum permittivity.

Computational resources and performance

We have deployed and extensively tested GHP-MOFassemble on computers at the ALCF and at the National Center for Supercomputing Applications (NCSA), with the intent of providing scalable and computationally efficient AI tools to accelerate the modeling and discovery of novel MOF structures. The tools introduced in this work may be readily fine-tuned and adapted to other available datasets beyond hMOF to enable accelerated design and discovery of novel MOF structures for carbon capture at industrial scale. The training and inference of the generative model and the training of the regression model was conducted on 8-way NVIDIA A100 GPUs with FP32 mode. The inference of the regression model is conducted on a single NVIDIA A40 GPU with FP32 mode. CPU-based screening processes, including distance check, pre-simulation check, MD simulations, and GCMC simulations, are run on two-way AMD Epyc 7763 CPUs. We now present a breakdown of the average computational cost of each step of the GHP-MOFassemble workflow per MOF:

MOF assembly: 0.46 seconds per core. We scaled up this analysis by multiprocessing on 28 cores.
Distance check to ensure structural validity: 2.56 seconds per core. We scaled up this analysis by multiprocessing on 128 cores.
Pre-simulation check to ensure chemical consistency: 19.98 seconds per core. We scaled up this analysis by multiprocessing on 128 cores.
AI ensemble inference of CO₂ adsorption capacity: 20.46 seconds. We used one NVIDIA A40 GPU.
MD simulations to validate stability and chemical consistency: 658.40 seconds, or about ~ 10 minutes. We scaled up this analysis using between six to 14 MPI processes based on the number of atoms in the MOF.
Detailed GCMC simulations: 21214.04 seconds, or about ~6 hours. This averaged number is based on simulations on one CPU core.

A schematic representation of this benchmark analysis is presented in Fig. 17.

**Fig. 17: Process time cost breakdown of GHP-MOFassemble.**

Data availability

The datasets generated during and/or analysed during the current study are available in the GitHub repository, https://github.com/hyunp2/ghp_mof/tree/main/utils. We also used the open source hMOF dataset⁷⁰, and the GEOM dataset⁵⁶.

Code availability

The scientific software and data used in this article are readily available in GitHub at https://github.com/hyunp2/ghp_mof/tree/main.

References

Li, H. et al. Recent advances in gas storage and separation using metal–organic frameworks. Mater. Today 21, 108–121 (2018).
Article ADS CAS Google Scholar
Hao, M., Qiu, M., Yang, H., Hu, B. & Wang, X. Recent advances on preparation and environmental applications of MOF-derived carbons in catalysis. Sci. Total Environ. 760, 143333 (2021).
Article ADS CAS PubMed Google Scholar
Lawson, H. D., Walton, S. P. & Chan, C. Metal–organic frameworks for drug delivery: A design perspective. ACS Appl. Mater. Interfaces 13, 7004–7020 (2021).
Article CAS PubMed Google Scholar
Kalmutzki, M. J., Hanikel, N. & Yaghi, O. M. Secondary building units as the turning point in the development of the reticular chemistry of MOFs. Sci. Adv. 4, eaat9180 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Chatterjee, A., Hu, X. & Lam, F. L.-Y. Towards a recyclable MOF catalyst for efficient production of furfural. Catal. Today 314, 129–136 (2018).
Article CAS Google Scholar
Tan, K. et al. Water interactions in metal organic frameworks. CrystEngComm 17, 247–260 (2015).
Article CAS Google Scholar
Erucar, I. & Keskin, S. Unlocking the effect of H₂O on CO₂ separation performance of promising MOFs using atomically detailed simulations. Ind. Eng. Chem. Res. 59, 3141–3152 (2020).
Article CAS PubMed PubMed Central Google Scholar
Zhang, Y., Zhang, Y., Wang, X., Yu, J. & Ding, B. Ultrahigh metal–organic framework loading and flexible nanofibrous membranes for efficient CO₂ capture with long-term, ultrastable recyclability. ACS Appl. Mater. Interfaces 10, 34802–34810 (2018).
Article CAS PubMed Google Scholar
Zuluaga, S. et al. Understanding and controlling water stability of MOF-74. J. Mater. Chem. A 4, 5176–5183 (2016).
Article CAS Google Scholar
Jiao, Y. et al. Tuning the kinetic water stability and adsorption interactions of Mg-MOF-74 by partial substitution with Co or Ni. Ind. Eng. Chem. Res. 54, 12408–12414 (2015).
Article CAS Google Scholar
Cmarik, G. E., Kim, M., Cohen, S. M. & Walton, K. S. Tuning the adsorption properties of UiO-66 via ligand functionalization. Langmuir 28, 15606–15613 (2012).
Article CAS PubMed Google Scholar
Huang, H. et al. Enhancing CO₂ adsorption and separation ability of Zr (IV)-based metal–organic frameworks through ligand functionalization under the guidance of the quantitative structure–property relationship model. Chem. Eng. J. 289, 247–253 (2016).
Article CAS Google Scholar
Moosavi, S. M. et al. Understanding the diversity of the metal-organic framework ecosystem. Nat. Commun. 11, 1–10 (2020).
Article Google Scholar
Li, S., Chung, Y. G. & Snurr, R. Q. High-throughput screening of metal–organic frameworks for CO₂ capture in the presence of water. Langmuir 32, 10368–10376 (2016).
Article CAS PubMed Google Scholar
Altintas, C. et al. An extensive comparative analysis of two MOF databases: High-throughput screening of computation-ready MOFs for CH₄ and H₂ adsorption. J. Mater. Chem. A 7, 9593–9608 (2019).
Article CAS Google Scholar
Dureckova, H., Krykunov, M., Aghaji, M. Z. & Woo, T. K. Robust machine learning models for predicting high CO₂ working capacity and CO₂/H₂ selectivity of gas adsorption in metal organic frameworks for precombustion carbon capture. J. Phys. Chem. C. 123, 4133–4139 (2019).
Article CAS Google Scholar
Pardakhti, M., Moharreri, E., Wanik, D., Suib, S. L. & Srivastava, R. Machine learning using combined structural and chemical descriptors for prediction of methane adsorption performance of metal organic frameworks (MOFs). ACS Comb. Sci. 19, 640–645 (2017).
Article CAS PubMed Google Scholar
Fanourgakis, G. S., Gkagkas, K., Tylianakis, E. & Froudakis, G. E. A universal machine learning algorithm for large-scale screening of materials. J. Am. Chem. Soc. 142, 3814–3822 (2020).
Article CAS PubMed Google Scholar
Altintas, C., Altundal, O. F., Keskin, S. & Yildirim, R. Machine learning meets with metal organic frameworks for gas storage and separation. J. Chem. Inf. Model. 61, 2131–2146 (2021).
Article CAS PubMed PubMed Central Google Scholar
Fernandez, M., Trefiak, N. R. & Woo, T. K. Atomic property weighted radial distribution functions descriptors of metal–organic frameworks for the prediction of gas uptake capacity. J. Phys. Chem. C. 117, 14095–14105 (2013).
Article CAS Google Scholar
Fernandez, M., Boyd, P. G., Daff, T. D., Aghaji, M. Z. & Woo, T. K. Rapid and accurate machine learning recognition of high performing metal organic frameworks for CO2 capture. J. Phys. Chem. Lett. 5, 3056–3060 (2014).
Article CAS PubMed Google Scholar
Bobbitt, N. S. et al. MOFX-DB: An online database of computational adsorption data for nanoporous materials. Journal of Chemical and Engineering Datahttps://doi.org/10.1021/acs.jced.2c00583 (2023).
Choudhary, K. & DeCost, B. Atomistic Line Graph Neural Network for improved materials property predictions. npj Comput. Mater. 7, 1–8 (2021).
Article Google Scholar
Yao, Z. et al. Inverse design of nanoporous crystalline reticular materials with deep generative models. Nat. Mach. Intell. 3, 76–86 (2021).
Article Google Scholar
Bond-Taylor, S., Leach, A., Long, Y. & Willcocks, C. G. Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7327–7347 (2021).
Article Google Scholar
Schneuing, A. et al. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.13695 (2022).
Huang, L. A dual diffusion model enables 3D binding bioactive molecule generation and lead optimization given target pockets. bioRxiv 2023–01 (2023).
Corso, G., Stärk, H., Jing, B., Barzilay, R. & Jaakkola, T. DiffDock: Diffusion steps, twists, and turns for molecular docking. Eleventh International Conference on Learning Representations (Kigali Rwanda, ICLR, 2023).
Xu, M. et al. GeoDiff: A geometric diffusion model for molecular conformation generation. Tenth International Conference on Learning Representationsar (Virtual, ICLR, 2022).
Vignac, C. et al. DiGress: Discrete denoising diffusion for graph generation. Eleventh International Conference on Learning Representations (Kigali Rwanda, ICLR, 2023).
Igashov, I. et al. Equivariant 3D-conditional diffusion models for molecular linker design. arXiv preprint arXiv:2210.05274 (2022).
Hoogeboom, E., Satorras, V. G., Vignac, C. & Welling, M. Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning, Journal of Machine Learning Research, 162, 8867–8887 (2022).
Qiao, Z., Nie, W., Vahdat, A., Miller III, T. F. & Anandkumar, A. Dynamic-backbone protein-ligand structure prediction with multiscale generative diffusion models. Machine Learning in Structural Biology Workshop at the 37th Conference on Neural Information Processing Systems (2023).
Thomas, M., Bender, A. & de Graaf, C. Integrating structure-based approaches in generative molecular design. Curr. Opin. Struct. Biol. 79, 102559 (2023).
Article CAS PubMed Google Scholar
Han, S. et al. High-throughput screening of metal–organic frameworks for CO2 separation. ACS Comb. Sci. 14, 263–267 (2012).
Article CAS PubMed Google Scholar
Rogacka, J. et al. High-throughput screening of metal–organic frameworks for CO2 and CH4 separation in the presence of water. Chem. Eng. J. 403, 126392 (2021).
Article CAS Google Scholar
Park, H. et al. End-to-end AI framework for interpretable prediction of molecular and crystal properties. Mach. Learn.: Sci. Technol. 4, 025036 (2023).
Altomare, A. et al. OChemDb: The free on-line Open Chemistry Database portal for searching and analysing crystal structure information. J. Appl. Crystallogr. 51, 1229–1236 (2018).
Article ADS CAS Google Scholar
Anderson, R. cif2lammps https://github.com/rytheranderson/cif2lammps.
Addicoat, M. A., Vankova, N., Akter, I. F. & Heine, T. Extension of the universal force field to metal-organic frameworks. J. Chem. Theory Comput. 10, 880–891 (2014).
Article CAS PubMed Google Scholar
Coupry, D. E., Addicoat, M. A. & Heine, T. Extension of the universal force field for metal-organic frameworks. J. Chem. Theory Comput. 12, 5215–5225 (2016).
Article CAS PubMed Google Scholar
Thompson, A. P. et al. LAMMPS - A flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Comput. Phys. Commun. 271, 108171 (2022).
Article CAS Google Scholar
Dubbeldam, D., Calero, S., Ellis, D. E. & Snurr, R. Q. RASPA: Molecular simulation software for adsorption and diffusion in flexible nanoporous materials. Mol. Simul. 42, 81–101 (2016).
Article CAS Google Scholar
Li, S., Chung, Y. G., Simon, C. M. & Snurr, R. Q. High-throughput computational screening of multivariate metal–organic frameworks (MTV-MOFs) for CO2 capture. J. Phys. Chem. Lett. 8, 6135–6141 (2017).
Article CAS PubMed Google Scholar
Gu, C., Liu, Y., Wang, W., Liu, J. & Hu, J. Effects of functional groups for CO₂ capture using metal organic frameworks. Front. Chem. Sci. Eng. 15, 437–449 (2021).
Article CAS Google Scholar
Liu, Y., Liu, J., Chang, M. & Zheng, C. Theoretical studies of Co₂ adsorption mechanism on linkers of metal–organic frameworks. Fuel 95, 521–527 (2012).
Article CAS Google Scholar
An, J., Geib, S. J. & Rosi, N. L. High and selective CO₂ uptake in a cobalt adeninate metal- organic framework exhibiting pyrimidine-and amino-decorated pores. J. Am. Chem. Soc. 132, 38–39 (2010).
Article CAS PubMed Google Scholar
Plonka, A. M. et al. Mechanism of carbon dioxide adsorption in a highly selective coordination network supported by direct structural evidence. Angew. Chem. Int. Ed. 52, 1692–1695 (2013).
Article CAS Google Scholar
Wilmer, C. E. & Snurr, R. Q. Towards rapid computational screening of metal-organic frameworks for carbon dioxide capture: Calculation of framework charges via charge equilibration. Chem. Eng. J. 171, 775–781 (2011).
Article CAS Google Scholar
Avci, G., Erucar, I. & Keskin, S. Do new MOFs perform better for CO₂ capture and H₂ purification? Computational screening of the updated MOF database. ACS Appl. Mater. Interfaces 12, 41567–41579 (2020).
Article CAS PubMed PubMed Central Google Scholar
Yang, Q., XU, Q., Liu, B., ZHONG, C. & Smit, B. Molecular simulation of CO₂/H₂ mixture separation in metalorganic frameworks: Effect of catenation and electrostatic interactions. Chin. J. Chem. Eng. - Chin. J. Chem Eng 17, 781–790 (2009).
Article CAS Google Scholar
Bucior, B. J. et al. Identification schemes for metal–organic frameworks to enable rapid search and cheminformatics analysis. Cryst. Growth Des. 19, 6682–6697 (2019).
Article CAS Google Scholar
Hussain, J. & Rea, C. Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J. Chem. Inf. Model. 50, 339–348 (2010).
Article CAS PubMed Google Scholar
Imrie, F., Bradley, A. R., van der Schaar, M. & Deane, C. M. Deep generative models for 3D linker design. J. Chem. Inf. Model. 60, 1983–1995 (2020).
Article CAS PubMed PubMed Central Google Scholar
Satorras, V. G., Hoogeboom, E. & Welling, M. E(n) equivariant graph neural networks. In International Conference on Machine Learning, 9323–9332 (2021).
Axelrod, S. & Gómez-Bombarelli, R. Geom, energy-annotated molecular conformations for property prediction and molecular generation. Sci. Data 9, 185 (2022).
Article CAS PubMed PubMed Central Google Scholar
O’Boyle, N. M. et al. Open Babel: An open chemical toolbox. J. Cheminform. 3, 1–14 (2011).
Google Scholar
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Google Scholar
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. SCScore: Synthetic complexity learned from a reaction corpus. J. Chem. Inf. Model. 58, 252–261 (2018).
Article CAS PubMed Google Scholar
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 1–11 (2009).
Article Google Scholar
Polykovskiy, D. et al. Molecular sets (MOSES): A benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).
Article CAS PubMed PubMed Central Google Scholar
Landrum, G. et al. rdkit/rdkit: 2020_03_1 (q1 2020) release, https://doi.org/10.5281/zenodo.3732262 (2020).
Ong, S. P. et al. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
Article CAS Google Scholar
Sagi, O. & Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 8, e1249 (2018).
Google Scholar
Ganaie, M. A., Hu, M., Malik, A., Tanveer, M. & Suganthan, P. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 115, 105151 (2022).
Article Google Scholar
Brodersen, K. H., Ong, C. S., Stephan, K. E. & Buhmann, J. M. The balanced accuracy and its posterior distribution. In 20th International Conference on Pattern Recognition, 3121–3124 (IEEE, 2010).
Kancharlapalli, S., Gopalan, A., Haranczyk, M. & Snurr, R. Q. Fast and accurate machine learning strategy for calculating partial atomic charges in metal-organic frameworks. J. Chem. Theory Comput. 17, 3052–3064 (2021).
Article CAS PubMed Google Scholar
Manz, T. & Sholl, D. Chemically meaningful atomic charges that reproduce the electrostatic potential in periodic and nonperiodic materials. J. Chem. Theory Comput. 6, 2455—2468 (2010).
Article PubMed Google Scholar
Du, Z. et al. Comparative analysis of calculation method of adsorption isosteric heat: Case study of CO2 capture using MOFs. Microporous Mesoporous Mater. 298, 110053 (2020).
Article CAS Google Scholar
Wilmer, C. E., Farha, O. K., Bae, Y.-S., Hupp, J. T. & Snurr, R. Q. Structure-property relationships of porous materials for carbon dioxide separation and capture. Energy Environ. Sci. 5, 9849–9856 (2012).
Article CAS Google Scholar

Download references

Acknowledgements

This work was supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357, and by the Braid project of the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under contract number DE-AC02-06CH11357. The work used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. EAH and IF acknowledge support from National Science Foundation (NSF) award OAC-2209892. SC and XY acknowledge partial support from NSF Future of Manufacturing Research Grant 2037026. This research also used the Delta advanced computing and data resources, which is supported by the National Science Foundation (award OAC 2005572) and the State of Illinois. Delta is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

Author information

These authors contributed equally: Hyun Park and Xiaoli Yan

Authors and Affiliations

Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, 60439, USA
Hyun Park, Xiaoli Yan, Ruijie Zhu, Eliu A. Huerta, Santanu Chaudhuri & Ian Foster
Theoretical and Computational Biophysics Group, NIH Resource Center for Macromolecular Modeling and Visualization, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Hyun Park & Emad Tajkhorshid
Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Hyun Park & Emad Tajkhorshid
Multiscale Materials and Manufacturing Lab, University of Illinois Chicago, Chicago, IL, 60607, USA
Xiaoli Yan & Santanu Chaudhuri
Department of Materials Science and Engineering, Northwestern University, Evanston, IL, 60208, USA
Ruijie Zhu
Department of Computer Science, University of Chicago, Chicago, IL, 60637, USA
Eliu A. Huerta & Ian Foster
Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Eliu A. Huerta
Computational Science and Engineering, Data Science and AI Department, TotalEnergies EP Research & Technology USA, LLC, Houston, TX, 77002, USA
Donny Cooper
Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Emad Tajkhorshid

Authors

Hyun Park
View author publications
You can also search for this author in PubMed Google Scholar
Xiaoli Yan
View author publications
You can also search for this author in PubMed Google Scholar
Ruijie Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Eliu A. Huerta
View author publications
You can also search for this author in PubMed Google Scholar
Santanu Chaudhuri
View author publications
You can also search for this author in PubMed Google Scholar
Donny Cooper
View author publications
You can also search for this author in PubMed Google Scholar
Ian Foster
View author publications
You can also search for this author in PubMed Google Scholar
Emad Tajkhorshid
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

E.A.H. envisioned and led this work, and guided the connection between generative AI with MD and GCMC simulations to create novel, chemically consistent MOFs at scale in high performance computing platforms. H.P. adapted techniques from drug design into MOF discovery with generative AI, producing a pool of candidates of MOF components, and also trained and inferred assembled MOFs’ CO₂ adsorption capabilities using a graph neural network model. X.Y. developed codes to decompose MOF into linkers and nodes, as well as to assemble predicted linkers and nodes to make hypothetical MOFs, and also conducted large-scale MD simulations, with LAMMPS and GCMC to identify stable and synthesizable AI-predicted MOFs. R.Z. analyzed hMOF database to understand MOF composition statistics, tested combined generative AI and graph models to predict novel AI-generated MOFs and quantify their chemical properties. S.C. provided expertise on metrics to be used to study the chemical properties and stability of MOFs with AI tools developed in this manuscript. D.C. provided expertise on expected features of MOFs to be used for carbon capture at industrial scale. I.F. guided the development of our ready-to-use AI framework in modern computing environments, and provided guidance on how to use and interpret metrics for MOF synthesizability. E.T. provided expertise on MD, chemistry and synthesis pathways of MOFs as well as how to analyze MD results. All authors reviewed and contributed to the manuscript.

Corresponding author

Correspondence to Eliu A. Huerta.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Chemistry thanks Emmanuel Ren and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Park, H., Yan, X., Zhu, R. et al. A generative artificial intelligence framework based on a molecular diffusion model for the design of metal-organic frameworks for carbon capture. Commun Chem 7, 21 (2024). https://doi.org/10.1038/s42004-023-01090-2

Download citation

Received: 19 June 2023
Accepted: 18 December 2023
Published: 14 February 2024
DOI: https://doi.org/10.1038/s42004-023-01090-2

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Understanding the diversity of the metal-organic framework ecosystem

Integrating stability metrics with high-throughput computational screening of metal–organic frameworks for CO2 capture

Progress toward the computational discovery of new metal–organic framework adsorbents for energy applications

Introduction

Related work

Results

Analysis of the hMOF dataset

Linker generation and evaluation

MOF assembly

MOF geometry examination

Inter-atomic distance check

Pre-simulation check

Regression model for MOF CO2 capacity prediction

Structural validation of AI-generated MOFs with molecular dynamics simulations

Property validation of AI-generated MOFs with grand canonical Monte Carlo (GCMC) simulations

Discussion

Methods

The hMOF dataset

Decomposing MOF linkers into molecular fragments

Generating new MOF structures

Diffuse and Denoise

Screen and Evaluate

MOF Assembly

Screening and predicting properties of AI-generated assembled MOFs

Inter-atomic Distance Check

Pre-simulation Check

Predicting CO2 capacity of MOF via pre-trained regression model

Structure Validation of the Assembled MOFs

Grand Canonical Monte Carlo (GCMC) Simulations

Computational resources and performance

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Peer Review File

Rights and permissions

About this article

Cite this article

Share this article

Comments

Search

Quick links

Regression model for MOF CO₂ capacity prediction

Predicting CO₂ capacity of MOF via pre-trained regression model