A generative artificial intelligence framework based on a molecular diffusion model for the design of metal-organic frameworks for carbon capture

Metal-organic frameworks (MOFs) exhibit great promise for CO2 capture. However, finding the best performing materials poses computational and experimental grand challenges in view of the vast chemical space of potential building blocks. Here, we introduce GHP-MOFassemble, a generative artificial intelligence (AI), high performance framework for the rational and accelerated design of MOFs with high CO2 adsorption capacity and synthesizable linkers. GHP-MOFassemble generates novel linkers, assembled with one of three pre-selected metal nodes (Cu paddlewheel, Zn paddlewheel, Zn tetramer) into MOFs in a primitive cubic topology. GHP-MOFassemble screens and validates AI-generated MOFs for uniqueness, synthesizability, structural validity, uses molecular dynamics simulations to study their stability and chemical consistency, and crystal graph neural networks and Grand Canonical Monte Carlo simulations to quantify their CO2 adsorption capacities. We present the top six AI-generated MOFs with CO2 capacities greater than 2m mol g−1, i.e., higher than 96.9% of structures in the hypothetical MOF dataset.


Introduction
Metal-organic frameworks (MOFs) have garnered much research interest in recent years due to their diverse industrial applications, including gas adsorption and storage 1 , catalysis 2 , and drug delivery 3 .As nanocrystalline porous materials, MOFs are modular in nature 4 , with a specific MOF defined by its three types of constituent building blocks (inorganic nodes, organic nodes, and organic linkers) and topology (the relative positions and orientations of building blocks).MOFs with different properties can be produced by varying these building blocks and their spatial arrangements.Nodes and linkers differ in terms of their numbers of connection points: inorganic nodes are typically metal oxide complexes whereas organic nodes are molecules, both with three or more connection points; organic linkers are molecules with only two connection points.
MOFs have been shown to exhibit superior chemical and physical CO 2 adsorption properties.In appropriate operating environments, they can be recycled, for varying numbers of times, before undergoing significant structural degradation.However, their industrial applications have not yet reached full potential due to stability issues, such as poor long-term recyclability 5 , and high moisture sensitivity 6 .For instance, the presence of moisture in adsorption gas impairs a MOF's CO 2 capture performance, which may be attributed to MOFs having stronger affinity toward water molecules than to CO 2 molecules 7 .Other investigations of MOF stability issues [8][9][10][11][12] have shown that the gas adsorption properties of MOFs can be enhanced by tuning the building

Analysis of the hMOF dataset
The three most frequent node-topology pairs in the hMOF dataset, accounting for around 74% of its MOFs, are the Cu paddlewheel-pcu, Zn paddlewheel-pcu, and Zn tetramer-pcu, which amount to 102,117 hMOF structures (29,714 + 28,529 +  43,874 from first column of Table 1).Out of these 102,117 structures, we only used those with correctly parsed MOFids and valid Simplified Molecular-Input Line-Entry System (SMILES), i.e., 78,238 MOFs.Therefore, the resulting candidate pool contains a total of 64,800 linkers, i.e., 540 fragments × 20 independent sampling steps × six unique number of connection atoms.Since these linkers only contain heavy atoms, to generate all-atom linker molecules, we apply openbabel to add hydrogen atoms, which results in 56,257 linkers after removing linkers with erroneous hydrogen assignments.Next, dummy atom identification is performed to generate information that enables assembly with metal nodes.A total of 16,162 linkers with dummy atoms are generated.These linkers are then passed through the element filter, which removes linkers that contain S, Br, and I, further reducing the number to 12,305 linkers.The number of linkers in each step are summarized in the Total column of Table 2.

MOF assembly
We generate new MOF structures by assembling three randomly selected DiffLinker-generated linkers with one of the three most frequent nodes in the hMOF dataset.Random sampling of 12,305 linkers ensures that the selected linkers cover a large design space.For each node-linker-topology combination, we considered four levels of catenation (cat0, cat1, cat2, cat3).We generated a total of 120,000 MOFs as follows: we have four catenation levels and three node candidates.The random sampling of 10,000 linkers for each catenation level-node candidate pair generates 120,000 total MOFs of different catenation-node-linker combinations.

Inter-atomic distance check
We used Pymatgen to read each MOF structure's CIF file and to extract the pairwise distance matrix.Diagonal entries of the distance matrix are discarded as they signify the distance between an atom and itself.The off-diagonal values are tested for minimum inter-atomic distance threshold.The inter-atomic distance threshold is predetermined by an experimental database, OChemDb 38 .If a MOF has at least one entry of lower than threshold inter-atomic distance, the MOF is discarded.78,796 of the 120,000 assembled MOFs passed this test.

Pre-simulation Check
We used the open source library, cif2lammps 39 , to automatically assign the Universal Force Field for Metal-Organic Frameworks (UFF4MOF) 40,41 to MOFs and generate LAMMPS 42 input files.This step ensures that all atomic structures and bonding appearing in each MOF structure are chemically valid within the scope of UFF4MOF.For example, if a Zn atom is bonded to a C atom, or if an O atom is bonded to four other atoms, etc., these structures will be identified as invalid and will be discarded.
Of the 78,796 MOFs that passed geometry and inter-atomic distance checks, 18,770 passed this pre-simulation check, and thus had LAMMPS input files generated.

Regression model for MOF CO 2 capacity prediction
To reduce the number of LAMMPS simulations, a screening step of the MOFs' adsorption performance is conducted on the 18,770 MOFs described above using a modified version of the CGCNN model that we introduced in Ref. 37 .Here, we used MOF structures from the hMOF dataset as well as their CO 2 capacities at 0.1 bar as input data.We split the hMOF dataset into three independent sets: 80% for training, 10% for validation, and 10% for testing.Using this data split, we trained three CGCNN models using random initialization of weights for each of them.When we use this model ensemble to infer the CO 2 capacities of newly AI-generated MOF structures, we take the average of the predictions made by the three independent models as the predicted CO 2 adsorption capacity.Out of the 18,770 AI-generated MOF structures, a total of 364 were predicted to be high-performing MOFs with CO 2 capacity higher than 2 m mol g −1 at 0.1 bar.Specifically, we used ensemble model prediction mean plus standard deviation as CO 2 adsorption capacity value to be higher than 2 m mol g −1 as the threshold.The standard deviation consideration is to assure we account for statistical errors of adsorption predictions by our ensemble model.Categorization of the predicted high-performing MOFs by node-topology pair and catenation level indicates that for all three node types most of the high-performing MOFs are cat2 and cat3.

Structural validation of AI-generated MOFs with molecular dynamics simulations
We now examine the stability and porous properties of the 364 AI-generated MOFs described in the previous section using MD simulations with LAMMPS 42 .For each MOF, a 2×2×2 supercell structure is equilibrated under a triclinic isothermal-isobaric ensemble (i.e., NPT) at ⟨p⟩ = 1 atm and ⟨T ⟩ = 300K, such that the cell lengths and angles of the MOF structure can be equilibrated.The NPT simulations are run for 400,000 steps with step size of 0.5 femtoseconds.The relative changes of the lattice parameters before and after the simulation are inspected to evaluate how well the MOF structure is maintained throughout the equilibrium MD simulation.The MOFs with higher than 5% changes in any of the lattice parameters (a, b, c, α, β , γ) are discarded (first three are lattice vector lengths and the latter three are lattice angles) between the initial state and the equilibrated state.The MOFs with higher than 5% changes in any of the lattice parameters (a, b, c, α, β , γ) are discarded.This process produces 102 MOFs that satisfied the previously defined criteria, i.e., < 5% in lattice parameter change 40 .

Property validation of AI-generated MOFs with grand canonical Monte Carlo (GCMC) simulations
These 102 MOFs are further examined with GCMC simulations to calculate their CO 2 adsorption capacity at a pressure and temperature of 0.1 bar and 300 K, respectively.RASPA 43 is used to conduct these GCMC CO 2 adsorption simulations.As demonstrated in Figure 1, 6 MOFs candidates were found to have CO 2 adsorption capacity higher than 2 m mol g −1 by the GCMC simulations.The 3D structure of these six high performing, AI-generated MOFs is presented in Figure 2.

Discussion
GHP-MOFassemble combines novel generative and graph AI applications, as well as a comprehensive screening workflow that combines various modeling methods with increasing levels of chemistry awareness and increasing levels of computational cost.When we deployed and optimized GHP-MOFassemble on the Delta supercomputer, unless indicated otherwise, we found that: • We AI-assembled 120,000 MOFs within 33 minutes using multiprocessing on 28 cores in the ThetaKNL supercomputer at the Argonne Leadership Computing Facility (ALCF).• We screened these 120,000 MOFs in 40 minutes using multiprocessing on 128 cores, and identified 78,796 MOFs with valid bond lengths.• We further screened these 78,796 MOFs and identified 18,770 MOFs with valid chemistry, using UFF4MOF as a reference, within 205 minutes using multiprocessing on 128 cores.• We then used our AI ensemble of CGCNN models to estimate the CO 2 capacity of these 18,770 MOFs, and identified 364 high-performing, AI-generated MOFs with CO 2 capacity higher than 2 m mol g −1 at 0.1 bar.AI inference was completed in 50 minutes using one NVIDIA A40 GPU.In brief, from assembly to selection of high-performing MOFs, GHP-MOFassemble completes the analysis within 5 hours and 7 minutes.Once we have selected 364 AI-generated, high-performing MOFs, we carry out the most compute-intensive part of the analysis, namely: 0.0 0.5  • We used the LAMMPS code to equilibrate 364 MOFs with 200-picosecond NPT MD simulations with the UFF4MOF force field.Each of these simulations is completed within 11 minutes, on average, using between 6 to 14 MPI processes, depending on the number of atoms in the MOF.We identified 102 stable MOFs whose lattice parameters changed less than 5% throughout the MD simulations.
• Finally, each of these 102 MOFs were further examined with GCMC simulations, and a total of six MOF candidates were found to have CO 2 capacity higher than 2 m mol g −1 , which corresponds to the top 5% of the hMOF dataset.Each of these 102 GCMC simulations is completed within six hours, on average, using one CPU core.
Recent studies [44][45][46][47] indicate that several functional groups play an important role in determining MOF CO 2 capacity, including carboxylic acid (-COOH), primary amine (-NH 2 ), hydroxyl (-OH), and nitrile (-CN).These functional groups affect MOF CO 2 capacity due to their interaction with CO 2 molecules through charge redistribution 46 , hydrogen bond interaction 45 , and electrostatic interaction 45 .Thus, we analyzed the functional groups of linkers in the 364 AI-generated high-performing MOF structures and compared them with high-performing hMOF structures.The results of this analysis are presented in Figure 3.
We observe that linkers in high-performing hMOF structures have a higher proportion of carboxylic groups, whereas hydroxyl groups appear more often in the predicted high-performing MOFs generated by our framework.Moreover, the percentages of primary amine and nitrile in linkers are similar in both cases.Note that some substructures appear more often than others.For example, ring structures appear in most of the linkers in the top candidates, and many of the DiffLinker-generated molecular substructures (in between the terminal fragments) also contain ring structures.The frequent occurrence of rings in linkers of high-performing MOFs may be due to the presence of CO 2 -ring interaction, as reported in Ref. 48.In addition, rings in the linkers may increase MOFs' structural rigidity due to less rotatable bonds.In summary, the entire discovery process from generative AI MOF assembly to detailed GCMC simulations of top AIpredicted MOFs can be completed within 12 hours using distributed computing in modern supercomputing platforms.The entire workflow may be further accelerated by scaling up any of the subprocesses using more cores or by distributing AI inference over more GPUs.Furthermore, the various pieces of GHP-MOFassemble may be assembled to create a standalone workflow that combines MD, density functional theory, and GCMC simulations to expose our generative AI model to a larger set of chemically diverse, high performing MOFs.Through online learning methods one may continually guide generative AI until it consistently assembles high performing MOFs.A study of this nature will be pursued in future work.

Methods
Our GHP-MOFassemble framework has three components: • Decompose.We use a molecular fragmentation algorithm to decompose the MOF linkers found in high-performing MOF structures within the hMOF dataset 22 -an open source dataset that provides, for each of 137,652 hypothetical MOF structures, and corresponding MOFid, MOFkey, geometric features, and isotherm data for six adsorption gases (CO 2 , N 2 , CH 4 , H 2 , Kr, Xe) at 0.01, 0.05, 0.1, 0.5, and 2.5 bar-covering the pressure ranges of cyclic adsorption gas separation processes in industrial applications.The adsorption properties of gases provided in this dataset were calculated using GCMC calculations 22 , as described in Ref. 49 .
• Generate.We use a diffusion model to generate new MOF linkers.We then screen the AI-generated linkers by removing linkers with S, Br and I elements (we call this step "element filter"), and evaluate their quality using five different scores to quantify their synthesizability (SAScore and SCScore), validity, uniqueness, and internal diversity.Linkers that passed the element filter are then assembled with one of three pre-selected nodes into new MOF structures in the pcu topology.
• Screen and Predict.We check for inter-atomic distances with pre-simulation checks to ensure we filter poor AI-generated MOF structures from previous steps.We then use a pre-trained regression model to predict the CO 2 capacities of the newly generated MOF structures.Lastly, we perform MD and GCMC simulations to obtain stable and high-performing MOF structures.
In the following sections we describe the datasets we have used, and each of the GHP-MOFassemble components in detail.

The hMOF dataset
We selected the most common topologies of the hMOF dataset for this work, i.e., Cu paddlewheel-pcu, Zn paddlewheel-pcu, and Zn tetramer-pcu, i.e., 78,238 MOFs with correctly parsed MOFids and valid SMILES.We gather the numbers of molecular fragments in Table 1, produced through the Fragment step of our GHP-MOFassemble framework, described below.In Table 1 high-performing MOFs (second column) with three parsed linkers are selected (third column), and their unique linkers are parsed by using the MMPA (Matched Molecular Pairs Algorithm) algorithm.The cumulative distribution functions of the CO 2 capacities of these three types of MOFs are shown in the right panel of Figure 4.In Table 1, the number of output molecular fragment conformers (last column) is much less than the number of unique linker SMILES (second to last column) for two reasons.First, around 56% of linkers did not pass the valency check, which may be due to the intrinsic limitations in the parsing of SMILES of MOF linkers.Second, around 90% of linkers that passed the valency check share similar molecular fragments, and thus many duplicated molecular fragments exist.The successfully parsed unique molecular fragment conformers are subsequently used for linker generation.
Figure 5 shows the empirical cumulative distribution functions of the CO 2 capacities of hMOF structures at different catenation levels (i.e., MOFs with interpenetrated lattices).Therein, we observe that a higher percentage of catenated MOFs (cat1, cat2 and cat3) are high-performing, as compared to the uncatenated MOFs (cat0).This result confirms that catenation is an important factor when designing new MOF structures with high CO 2 capacity.This observation is consistent with other studies in the literature 50,51 , which indicate that even though catenation reduces pore size and surface areas, catenated MOFs generally have higher CO 2 /H 2 selectivities because MOF-CO 2 interactions are enhanced as a result of the strong confinement of CO 2 with a much lower adsorption surface.Thus, the results presented in Figure 5 using the hMOF dataset indicate that CO 2 working capacities of catenated MOFs are higher than their noncatenated counterparts.As we discuss below, we found a similar pattern in AI-generated MOFs.Catenated MOFs were generated by using site translation method, which is achieved by displacing the reference lattice along the diagonal line of the unit cell.For all four catenation levels, the amount of relative lattice displacement is given in Table 3.The numbers are fractional displacements relative to the unit cell diagonal.
(0,0,0) ( Figure 6 shows the pairwise relationships of the CO 2 capacities of hMOF structures at five pressures.We observe a strong correlation of CO 2 capacities between 0.05 bar and 0.1 bar, with a Pearson's correlation coefficient of 0.86.For other pairs of pressures, the CO 2 capacities are only weakly correlated, with a decreasing correlation for larger pressure differences.Moreover, the distributions of CO 2 capacities at all pressures exhibit long tails at high value ranges, which indicate that the majority of MOFs are low-performing and high-performing MOFs are uncommon.

Decomposing MOF linkers into molecular fragments
The first component of the GHP-MOFassemble framework, Decompose, decomposes linkers from high-performing MOF structures in the hMOF dataset into their molecular fragments.As illustrated in Figure 7, this process applies the following three steps to a specified node-topology pair, which in the figure is the most frequent node-topology pair in the hMOF dataset, Zn tetramer-pcu.Note that for the pcu topology, one linker is orientated along each of the x, y, and z directions, and thus at most three unique linker types are possible.
1. Select: We select high-performing MOFs with the given node-topology pair from hMOF.We define here a highperforming MOF structure as one with CO 2 capacity higher than 2 m mol g −1 at 0.1 bar, which corresponds to the top 5% of CO 2 capacity of the hMOF dataset.2. Extract: We extract the linker Simplified Molecular-Input Line-Entry System (SMILES) strings for each high-performing MOF identified in Select step, eliminate those with more than three unique linker types, and assemble the remaining into a linker dataset.3. Fragment: We fragment the linkers produced in Extract step to obtain their chemically relevant fragment-connection atom pairs, which we assemble into a molecular fragment dataset.The extraction of linkers in Extract step is straightforward because the MOFid of each hMOF structure specifies the SMILES strings of its constituent metal nodes, linkers, as well as a format signature and a topology code 52 .Together, these elements uniquely define the topology and building blocks of a given MOF structure.
Fragment step use MMPA 53 as implemented in DeLinker 54 to generate molecular fragments of a given molecule by breaking chemical bonds between atom pairs.We set the minimum number of connection atoms, the minimum fragment size, and the minimum path length to 3, 5, and 2, respectively.Moreover, we only consider the case where the fragments are at least two atoms away from each other.The chemically relevant fragment-connection atoms pairs are then used to form the molecular fragment dataset.

Generating new MOF structures
The Generate component employs the pre-trained diffusion model DiffLinker 31 to generate new MOF linkers, and then assembles those new linkers with one of three pre-selected nodes into new MOF structures in the pcu topology.It comprises three steps: Diffuse and Denoise; Screen and Evaluate; and Assemble.An example of these three steps for Zn tetramer-pcu MOFs is shown in Figure 8.

Diffuse and Denoise
We apply the pre-trained diffusion model DiffLinker 31 to generate new linkers based on the molecular fragments outputted by the Decompose component.This model connects the molecular fragments supplied as input (also known as context) with a specified number of sampled atoms.With less sampled atoms, straight chains or branched chains may be obtained, while with more sampled atoms, ring structures may be present.Moreover, during the denoising process, the species of the sampled atoms may change.In this work, we vary the number of sampled atoms from 5 to 10 to ensure that the generated linkers have a diverse set of chemistry and substructures.
DiffLinker applies a denoising process to determine the atomic species and Cartesian coordinates of the sampled atoms by using a decoder neural network architecture named E(3)-Equivariant Graph Neural Network (EGNN) 55 .This graph neural network predicts zero-mean centered Cartesian coordinates noise and one-hot encoding noise of atomic species, and accounts for equivariance due to translation, rotation, and reflection of molecules, i.e., Euclidean Group 3 (E(3)).In this work, we use the pre-trained DiffLinker model that was trained on the GEOM dataset 56 , which contains 37 million molecular conformations for over 450,000 molecules.Such diverse training data enables DiffLinker to sample linker molecules with high chemical and structural diversity.
The outputs of DiffLinker are the 3D coordinates and the atomic species of heavy atoms in the linker molecules, rather than SMILES strings or 2D graphs.Since hydrogen atoms are implicit in the DiffLinker model, their 3D coordinates are not outputted by the model.To generate the spatial configurations of all-atom molecules, we employ openbabel to convert the DiffLinker generated molecules to SMILES strings, which in turn are used to generate 3D configurations of linkers with explicit hydrogen atoms.Openbabel uses distance-based heuristics to determine bond connectivity in a given molecule 57 , which is critical to the assignment of the number of hydrogen atoms.Finally, after the hydrogen addition step, we identify the dummy  1 for unique linker fragments of Zn tetramer node).First, we generate new linkers via the Diffuse and Denoise step.
We then add hydrogen atoms to ensure correct valency of the generated linkers.Next, we identify dummy atoms by replacing the two carbon atoms in the carboxyl groups of each generated linker with dummy atoms.Linkers with dummy atoms then undergo the Screen and Evaluate step, where we remove those with S, Br, and I elements, since these three elements are present in the GEOM dataset but not in the hMOF dataset.As a result, this step reduces the number of potential linkers to 5,162.The generated linkers' molecular statistics are quantified using five metrics, including SAscore, SCscore, validity, uniqueness, and internal diversity.Finally, in the Assemble step, we build 40,000 new MOFs with Zn tetramer node in pcu topology.As before, the color scheme of elements is: carbon in grey, oxygen in red, nitrogen in blue, and hydrogen in white.See last column of Table 2 regarding this component.

11/25
atoms which contains information about how the linkers connect with metal nodes.The diffusion process involves two consecutive steps.The first step adds Gaussian noise to the original data (i.e., x), yielding noisy input (i.e., z t at time t).Next, the denoising step applies neural network-based noise removal operation.Mathematically, the following Markovian properties and equations are satisfied: where q and p are the probability density functions during forward (diffusion) and reverse (denoising) process, respectively.N stands for normal distribution.ᾱt = α t /α t−1 , and σt 2 = σ 2 t − ᾱt 2 σ 2 t−1 .I is an identity matrix that computes an isotropic Gaussian upon multiplying with a constant (e.g., σ 2 t ); α t and ᾱt are signal controls, i.e., meaningful information during training and generative steps; and σ t and σt are noise controls, i.e., noisy information used for diversity during training and generative steps.The subscript t = 0, 1, ..., T is the time step at which a molecule is generated (a.k.a.denoised or diffused).ξ t , A t , B t ,C t , and D t are constants consisting of α t , ᾱt , σ t , and σt 31 .Equations ( 1) and ( 2) represent a diffusion process that is conditioned on the previous value, z t−1 , and initial value, x, to predict a current value, z t .The denoising processes are described by Equations ( 3) and ( 4), which predict a previous value conditioned on (either real or predicted) initial value and current value.Equations ( 5) and ( 6) provide a detailed evolution for µ θ t which represents the learned denoising mean parameterized by θ , as well as x, a real initial value, approximated by x.The training objective is to learn a neural network, i.e., EGNN, to predict a denoising value so that random noise can be denoised to a physically and chemically valid molecule during the generative process.In practice, we predict ε θ t (learned noise value) rather than x directly (i.e., Equations ( 5) and ( 6)) for better prediction 58 .Since DiffLinker generates chemically valid molecules, EGNN needs to take more regularization into account, such as E(3) and O(3) (i.e., orthogonal group in dimension 3: rotations and reflections).Thus, invariant features such as one-hot and equivariant features such as atomic coordinates updates need to be considered 31 .

Screen and Evaluate
To ensure consistency of elements in hMOF linkers and AI-generated MOF linkers, we manually filter out generated linkers that contain elements not present in the hMOF dataset.Since the pre-trained DiffLinker model we used in this work was trained on the GEOM dataset 56 , a total of nine heavy elements may be sampled, including C, N, O, F, P, S, Cl, Br, and I.Among these elements, S, Br, and I do not appear in the hMOF dataset, therefore we remove generated linkers that contain these three elements.For each molecular fragment, we then perform sampling 20 times.Each time the model samples a different molecule may be obtained because of the random and probabilistic nature of denoising process.The probabilistic nature of diffusion model enables it to generate linkers from an extensive linker design space beyond that of the hMOF dataset.
We then use five metrics to evaluate the quality of the remaining linkers.The first two metrics are commonly used heuristic measures of synthesizability: the synthetic accessibility score (SAscore), and the synthetic complexity score (SCscore) 59 .
The SAscore, as defined by Ertl and Schuffenhauer 60 , is based on analysis of one million PubChem molecules, and combines fragment contributions from molecule substructures with a complexity penalty that accounts for molecular size and for structural features of molecules such as the presence of rings.The SCscore is computed by using a neural network trained on 12 million chemical reactions from the Reaxys database to estimate the number of reaction steps required to produce a molecule 59 .For both SAscore and SCscore, the higher the values are, the more difficult it is to synthesize the linker, hence less desirable.Specifically, the SAscore 60 is defined as the difference of fragment score and complexity penalty: where fragment score is calculated by averaging over the contributions of non-zero elements of the Morgan fingerprint for all molecular fragments.The complexity score is calculated based on five components: complexityPenalty = −sizePenalty − stereoPenalty − spiroPenalty − bridgePenalty − macrocyclePenalty.
The size penalty increases with the number of atoms: sizePenalty = nAtoms 1.005 − nAtoms,

12/25
The stereo penalty is calculated based on the number of chiral centers: stereoPenalty = log(nChiralCenters + 1), The spiro penalty is calculated based on the number of spiro centers, or atoms that connect two rings together: spiroPenalty = log(nSpiro + 1), The bridge penalty term is calculated based on the number of bridgehead atoms, or atoms that belong to two or more rings: bridgePenalty = log(nBridgeheads + 1), Lastly, if macrocycles are present, the macroPenalty is: macrocyclePenalty = math.log10(2),otherwise, the macroPenalty is zero.The SCscore 60 is calculated by using a neural network trained on 12 million reactions from the Reaxys database.The synthetic complexity of each molecule is rated from 1 to 5, with 1 being the easiest and 5 the most difficult to synthesize.To evaluate the capability of GHP-MOFassemble to generate a valid, novel, and diverse set of linkers, we leverage the MOSES framework 61 to compute three additional metrics: the fraction of valid linker SMILES strings (validity), the fraction of unique linker SMILES strings (uniqueness), and the dissimilarity of linkers (internal diversity).Validity measures whether atoms in generated molecules have the correct valency, whereas uniqueness quantifies the percentage of any molecule that is different from the rest of molecules.
We show in Figure 9 the synthetic accessibility score (SAscore) and synthetic complexity score (SCscore) values for the remaining linkers.We observe that as the number of sampled atoms increases, both distributions generally shift to the right, with the exception of the SAscore distribution with six sampled atoms, which is to the left of that with five atoms.This general trend indicates that linkers become harder to synthesize as the molecules become bigger.This result is expected because as more atoms are sampled, more complex substructures may be present.We note that no linker has zero or very large SAscore or SCscore, values that would indicate unsynthesizability.We show in Table 4 the validity, uniqueness, and internal diversity metrics, grouped by the number of sampled atoms and node-topology pairs of the corresponding MOFs.The validity column confirms that all generated linkers are valid.The last 13/25 three columns, when reviewed from top to bottom, reveal a considerable increase in linker uniqueness and a more modest increase in internal diversity as the number of sampled atoms increases.High uniqueness values indicate that our model is generating non-duplicate molecules, whereas high internal diversity values indicate that the generated molecules are chemically diverse.Internal diversity (which indicates how dissimilar a specific linker is to the rest of the population) is computed using two internal diversity scores: IntDiv 1 and IntDiv 2 61 , also shown in Table 4.The increase in these metrics as more atoms are sampled is expected because the degrees of freedom of atomic species and their spatial coordinates increase with molecular size.The internal diversity scores were calculated based on the Tanimoto distances 62 among all pairs of molecules, which are obtained by calculating the normalized Jaccard score of Morgan fingerprint bit vector between all pairs of AI-generated linkers.This yields similarities between a specific linker and the rest of the linkers in the linker pool, which are averaged out.Therefore, we end up with a metric showing each linker's similarity compared with the population of generated linkers.We used the MOSES 61 framework to compute the internal diversity scores IntDiv 1 and IntDiv 2 61 by using the relations: where T d is the Tanimoto distance, which relates to the Tanimoto similarity (T s ) by: where G is the generated set of molecules, |G| is the size of that set, and (m 1 , m 2 ) is a pair of molecules in the set.
To measure the similarity between the AI-generated linkers and hMOF linkers in high-performing MOF structures, we show in Figure 10 the distribution of maximum Tanimoto similarity between the two linker sets.For each unique linker in the 364 AI-generated high-performing MOF structures we calculated its Tanimono similarity with all of the unique linkers in the hMOF structures.The maximum value of the Tanimoto similarities gives a quantitative measure of how different the generated linkers are from the hMOF linkers.Results in Figure 10 indicate that most linkers in the predicted high-performing MOF structures are vastly different from those in hMOF, i.e., our GHP-MOFassemble framework generates novel MOF structures with chemically unique linkers. .Similarity between AI-generated and hMOF linkers in high-performing MOF structures.Distribution of maximum Tanimono similarity, which measures the uniqueness of AI-generated linkers as compared to hMOF linkers for high-performing MOFs.The peak around 0.3 to 0.4 indicates that most generated linkers are just 30-40% similar to those in hMOF, i.e., our AI framework generates novel linkers not present in hMOF structures.On the other hand, the trailing heavy right tail above 0.4 Tanimoto similarity indicates that we are also able to generate linkers that are structurally similar to those present in hMOF, showing that GHP-MOFassemble enables generation of a diverse set of novel linkers.

MOF Assembly
Once we have generated linkers, we can then assemble them with metal nodes.To construct MOFs, we need to guide the assembly process.We do this by using dummy atoms, which indicate the points at which the building blocks are to be connected.In practice, this is done as follows.Our parsing of MOFids generates, for MOFs with Zn tetramer nodes, linkers that carry two carboxyl groups, which by definition are part of metal nodes instead of linkers.Such wrong assignment of carboxyl groups is due to how the MOFid algorithm parses MOF structures.To reflect the correct molecular structure of linkers, the two carbon atoms in the carboxyl groups are identified as dummy atoms, and the redundant four oxygen atoms and two hydrogen atoms are removed.An illustration of dummy atom identification for linkers containing carboxyl groups is shown in the left panel of Figure 11.
For MOFs with Cu paddlewheel and Zn paddlewheel node, however, another type of linker containing heterocyclic rings exists.For this type of linker, the two atoms that are nitrogen-metal bond distance away from the terminal nitrogen atoms on the heterocyclic rings are identified as dummy atoms.After the dummy atoms are correctly identified, three randomly selected linkers (duplicates allowed) are assembled with one of the three pre-selected nodes into MOFs in the pcu topology.An illustration of dummy atom identification of linkers containing heterocyclic groups is shown in the right panel of Figure 11.
More than half of the hMOF structures are catenated MOFs, i.e., MOFs with interpenetrated lattices.By varying the level of interpenetration, it is possible to generate MOFs with different pore sizes, with a higher catenation level generally corresponding to smaller pores.We denote the four catenation levels in hMOF, with increasing number of interpenetrated lattices, as cat0, cat1, cat2 and cat3.To generate MOFs with high structural diversity, we applied site translation method as implemented in Pymatgen 63 to generate MOFs with different catenation levels.Figure 12 presents the building block combinations of the final six MOF candidates that passed all screening processes.For linkers, the corresponding molecules (without dummy atoms) are

15/25
Linker containing heterocyclic rings shown for ease of visualization.For catenated MOF structures, we keep the spacing between interpenetrated lattices the same, so as to ensure equal pore sizes.

Screening and predicting properties of AI-generated assembled MOFs
Here we describe the final Screen and Predict component of GHP-MOFassemble.After the MOF structures are assembled, a comprehensive screening workflow is developed to sift through the candidates with increasing levels of computational cost and confidence level: • Inter-atomic Distance Check (screen) • Pre-simulation Check (screen) • Predicting CO 2 Capacity of MOF via Pre-trained Regression Model (screen and predict) • Structure Validation of the Assembled MOFs (screen) • Grand Canonical Monte Carlo Simulation (screen and predict)

Inter-atomic Distance Check
A pairwise distance matrix M n×n is extracted from a MOF's CIF file.The matrix element M i, j denoting the Euclidean distance between atom A i and atom A j is: where 1 ≤ i ≤ n and 1 ≤ j ≤ n, and (x i , y i , z i ) and (x j , y j , z j ) are the Cartesian coordinates of atoms A i and A j .For an arbitrary pair of atoms A i and A j , their chemical symbols can be denoted as E i and E j , respectively.The bond length between atoms A i and A j is δ (E i , E j ).The minimum allowed inter-atomic distance between the these two atoms can be estimated by calculating the infimum of all experimental bond lengths between these two element types, i.e.: inf (δ (E i , E j )), which can be queried from the OChemDb 64 database.All values of M i, j (i ̸ = j) are compared against inf (δ (E i , E j )), and any occurrence of M i, j < inf (δ (E i , E j )) (i ̸ = j) will result in the MOF candidate being discarded.Entries with i = j (i.e., the diagonal entries of the distance matrix M i,i ) are not examined as they should always be zero, because the distance between an atom and itself is zero.We used Python 3.10's built-in multiprocessing library to help launch the inter-atomic distance check on different MOF structures in parallel.

Pre-simulation Check
For each atom in the MOF structure, the pre-simulation check workflow can be described as the following: 1.The atom's element should be recognized as one of the supported elements in the UFF4MOF parameter set, 2. The atom's neighboring atoms that are within the allowed bonding distance are examined for chemistry validity.
We conducted this process with the help of the cif2lammps library.Python's multiprocessing library is also used to speed up this process over many structures in parallel.

Predicting CO 2 capacity of MOF via pre-trained regression model
The GHP-MOFassemble framework uses an ensemble of regression models to estimate the CO 2 capacity of newly generated pcu MOF structures.We use for this purpose a modified version of the CGCNN model, developed in our previous work 37 , which adopts an adjacency list to format node and edge embeddings, rather than the adjacency matrix format of the original CGCNN, a change that enhances model training speed, training stability, and prediction accuracy.To fine-tune this AI model, we used MOF structures in the hMOF dataset, along with their CO 2 capacities at 0.1 bar as input.We split the hMOF dataset into 80% training set, 10% validation set, and 10% test test.We independently trained three modified versions of the CGCNN model for 5,000 epochs with a batch size of 160, using Adam as the optimizer, at a learning rate of 10 −4 , and a weight decay rate of 2 × 10 −5 .Table 5 shows the R 2 score, mean absolute error (MAE), and root mean squared error (RMSE) of the three AI models ensemble on the 10% test set.Figure 13 shows the distribution of the standard deviations of ensemble model predictions.We treat the standard deviation of predictions from the three models as a measure of model uncertainty, which we view as arising from the model's difficulty in learning certain data points.This uncertainty is also known as epistemic uncertainty.By employing an ensemble of models and averaging their prediction results, we can increase prediction accuracy, which is thoroughly and extensively explored in the machine learning community 65,66 .We chose the threshold for the standard deviation filter to be 0.2 m mol g −1 because it is sufficiently small (with 96% of data points below the threshold) that for low CO 2 capacity predictions, the predicted values across the ensemble are similar.On the other hand, for high CO 2 capacity predictions, it also ensures that the outliers (i.e., predicted values across the ensemble are vastly different) are filtered out, therefore minimizing the overall error.In Table 6 we also notice that most of the high-performing MOFs are cat2 and cat3.This is consistent with our preliminary analysis on catenation levels and distributions of CO 2 adsorption capacity in Figure 5.
We show in the left panel of Figure 14 the scatter plot of the ensemble model predictions for the 10% test set (from hMOF), which has a MAE of 0.093 m mol g −1 .Using CO 2 capacity of 2 m mol g −1 as a threshold, we repurposed our predictive model as a classifier to categorize MOFs into low and high performers, with predicted CO 2 capacities below and above the threshold, respectively.Using this scheme, the confusion matrix for identifying low and high performers is shown in the right panel of Figure 14.We conclude from the confusion matrix that the pre-trained model classifies both low and high performers with high accuracy, with 98.4% (13,551 out of 13,765) of test samples (of hMOF) correctly classified.As the test set is heavily imbalanced, with many more low performers than high performers, we also calculated the balanced accuracy of classification 67 (the sum of true positive rate and true negative rate divided by two for binary classification or average of recalls for multi-class classification), obtaining a value of 90.7%.Since the majority of the 214 misclassified samples lie close to the decision boundaries, as shown in the red and purple regions of the left panel of Figure 14, we conclude that the ensemble model is capable of differentiating low and high performers.We show in Figure 15 the empirical cumulative distribution functions of the CO 2 capacities of hMOF (dashed lines) structures and AI-generated MOF structures (solid lines) at different catenation levels.We observe that there are more high-performing cat2 and cat3 MOFs among the generated structures compared to the hMOF dataset, as indicated by lower cumulative density values at 2 m mol g −1 (black dashed line).On the other hand, there are fewer high-performing cat0 and cat1 MOFs, as shown by higher cumulative density values at 2 m mol g −1 .This result demonstrates that within the AI-generated MOF design space, there are more high-performing candidates with higher catenation levels compared to lower catenation levels.
In summary, we use an ensemble of 3 AI models to identify pcu AI-generated MOFs produced by the Generate step (in addition to inter-atomic distance and pre-simulation checks) that have predicted CO 2 capacities higher than 2 m mol g −1 .We use the ensemble mean plus standard deviation as the final prediction.These MOFs are then selected for further investigation with MD simulations.created.A triclinic isothermal-isobaric ensemble (i.e., NPT) at 300 K and 1 atm is applied to the supercell structure.All lattice parameters: a, b, c, α, β , γ are allowed to equilibrate freely.A relatively short simulation with 400,000 steps is conducted for each MOF to allow for structural equilibrium with a step size of 0.5 femtoseconds.The potential energy E pot of the MOF structure can be expressed as: where E bond is the bond stretch energy, E angle is the angle between bonds energy, E dihedral is the dihedral angle energy, E improper is the improper dihedral angle energy, E vdw is the nonbonded interaction energy, and E coul is the Coulombic interaction energy.Coupry et al. (2016) 40 stated that 76.5% of the CoRE structures 14 lattice parameters are within 5% change when they are simulated with the UFF4MOF parameters.A similar criterion is used here such that if any of MOF's lattice parameters of triclinic cell changes are more than 5%, the structure is discarded in the screening process.This screening process provides a new level of confidence in terms of structural stability of the AI-generated MOFs. Figure 16

Grand Canonical Monte Carlo (GCMC) Simulations
The void fraction of each candidate MOF is calculated using the RASPA 43 's helium void fraction function.It helps us understand the porosity contribution toward MOF CO 2 adsorption capacity.Next, PACMOF 68 is used to assign partial charges to MOFs.Using the pre-trained partial atomic charge model on the density derived electrostatic and chemical (DDEC) charge 69 data, the partial charges are assigned to atoms in MOF structures.After determining the partial charges, the MOF structures are assigned with UFF4MOF force field parameters.The GCMC simulations are conducted at 0.1 bar and 300 K with a step size of 0.5 fs/step.The MOF structures are under the rigid assumption, therefore only Van der Waals interactions and Coulombic interactions (i.e., non-bonded) are considered: which can be expressed as 70 E pot = ∑ i, j 4ε i j σ i j r i j 12 + σ i j r i j 6 + q i q j 4πε 0 r i j , where ε i j , σ i j , and r i j are the well depth, zero position, and inter-atomic distance of a Lennard-Jones 12-6 interaction between atom i and atom j; q i and q j are the electrostatic charges of atom i and atom j; ε 0 is the vacuum permittivity.

Figure 1 .Figure 2 .
Figure 1.CO 2 adsorption values at 0.1 bar and 300K.CO 2 adsorption capacity values at a pressure and temperature of 0.1 bar and 300 K, respectively, for the top six AI-generated MOFs' according to grand canonical Monte Carlo (GCMC) simulations and our modified crystal graph convolutional neural network (CGCNN) model.

Figure 3 .
Figure 3. Functional groups in high-performing MOFs.Comparison of the proportion of selected functional group that appear in high-performing AI-generated MOFs and MOFs in the hMOF dataset.

Figure 4 .
Figure 4. Properties of the hMOF dataset.a Depictions of the most frequent node-topology pairs in hMOF structures.b Cumulative distribution functions of their 0.1 bar CO 2 capacities.

Figure 5 .
Figure 5. CO 2 capacities of hMOF structures at different catenation levels.Empirical cumulative distribution functions of MOFs in the hMOF dataset at 0.1 bar at different catenation levels.The x axis is capped at 4 m mol g −1 to preserve details.

Figure 6 .
Figure 6.CO 2 capacities of hMOF structures at different adsorption pressures.Pairplot of CO 2 capacities of hMOF structures.The (x, y) axes represent adsorption pressures.Different colors indicate MOFs with different node-topology pairs.

1 Figure 7 .Figure 8 .
Figure 7. First component of the GHP-MOFassemble framework.Schematic representation of the Decompose component, which consists of three steps, which we showcase for Zn tetramer-pcu MOFs.First, the Select step selects the Zn tetramer-pcu MOFs in high-performing hMOF structures.Next, the Extract step extracts unique linkers from those MOFs.Finally, the Fragment step generates unique linker fragments.The color scheme of elements is: carbon in grey, oxygen in red, nitrogen in blue, and hydrogen in white.See Table1for unique linker fragments of Zn tetramer node.

Figure 9 .
Figure 9. Synthesizability of DiffLinker-generated MOF linkers with dummy atoms.a-c Distributions of synthetic accessibility score (SAscore); and d-f synthetic complexity score (SCscore) of DiffLinker-generated MOF linkers with dummy atoms.We indicate both node-topology pairs, and the number of sampled atoms, from 5 to 10 inclusive.

Figure 10
Figure10.Similarity between AI-generated and hMOF linkers in high-performing MOF structures.Distribution of maximum Tanimono similarity, which measures the uniqueness of AI-generated linkers as compared to hMOF linkers for high-performing MOFs.The peak around 0.3 to 0.4 indicates that most generated linkers are just 30-40% similar to those in hMOF, i.e., our AI framework generates novel linkers not present in hMOF structures.On the other hand, the trailing heavy right tail above 0.4 Tanimoto similarity indicates that we are also able to generate linkers that are structurally similar to those present in hMOF, showing that GHP-MOFassemble enables generation of a diverse set of novel linkers.

FindFigure 11 .
Figure 11.Assembly of AI-generated linkers with metal nodes.a Identification of dummy atoms for linkers containing carboxyl groups.The dummy atoms are found by substituting the carbon atoms in the carboxyl groups.The remaining oxygen and hydrogen atoms in the carboxyl groups are removed.b Identification of dummy atoms for linkers containing heterocyclic rings.The dummy atoms are found at nitrogen-metal bond distance from the terminal nitrogen atoms along the vectors pointing from the opposing carbon atoms to nitrogen atoms.

Figure 12 .
Figure 12.Optimal linkers and catenation levels for MOF assembly.Building blocks and catenation levels of the top six AI-generated MOF candidates.

Figure 14 .
Figure 14.Performance of AI ensemble to measure CO 2 capacity. a Predictive performance of the ensemble model on the 10% test set of hMOF.b Confusion matrix of the ensemble model in classifying low and high performers in the test set.MOFs with real/predicted CO 2 capacity lower than 2 m mol g −1 are identified as low performers, and MOFs with real/predicted CO 2 capacity higher than 2 m mol g −1 are identified as high performers.

Figure 15 .
Figure 15.MOFs' CO 2 capacities in terms of catenation levels.Comparison of empirical cumulative distribution functions of the predicted CO 2 capacities of generated and hMOF structures for a cat0, b cat 1, c cat2, and d cat 3.

Figure 16 .
Figure 16.Lattice parameter relative errors of the top 6 AI-generated MOFs.The top six AI-generated MOF candidates have change less than 5% in all lattice parameter during MD simulation.The first three lattice parameters are cell axes length and the latter three lattice parameters are cell plane angles.Their CO 2 capacities are higher than 96.9% of MOF structures in the hMOF database.

Table 1 .
Most frequent node-topology pairs in the hMOF dataset.Properties of the hMOF dataset.PW, TM, and HP-MOF stand for paddlewheel, tetramer, and high-performing MOF, respectively.For all three MOF types, we start with 540 (i.e., = 180 + 162 + 198 from Table1last column) unique molecular fragments extracted from high-performing hMOF structures, and use DiffLinker to generate new MOF linkers, with the number of sampled connection atoms ranging from five to 10 (total six connection atom types).For each linker, sampling is performed 20 times.

Table 2 .
Number of linkers at each step of the MOF assembly process.Statistical summary of the number of linkers after each step categorized by the corresponding MOF types.PW and TM stand for paddlewheel and tetramer, respectively.Element filter means that linkers with S, Br and I elements are removed.

Table 3 .
Relative lattice displacement at each catenation level.Amount of lattice displacement for catenated MOFs.

Table 4 .
Statistical properties of AI-generated linkers.Statistics of AI-generated linkers that correspond to different MOF types (first column), number of sampled atoms (N in the second column), number of carboxyl linkers (C , fourth column), and number of heterocyclic linkers (H , fifth column).In total, there are 12,305 linkers (see Table2's Total column, Element filter row) with identified dummy atoms.

Table 5 .
Statistical properties of AI ensemble to characterize MOF's performance.Statistics of AI ensemble, including R 2 score, mean absolute error (MAE), and root mean squared error (RMSE).Standard deviation of AI model ensemble to estimate MOFs' CO 2 capacity.Distribution of the standard deviation of AI ensemble predictions.Predictions with standard deviation of less than 0.2 m mol g −1 consist of around 96% of the test set.This shows that our three independently trained AI models are in great agreement.The remaining 4% with larger than 0.2 m mol g −1 standard deviation implies that these are the data points which may be difficult to predict due to errors in target property or extra information other than atomic species and periodic neighbor being necessary for accurate predictions.

Table 6 .
Catenation level of high-performing MOFs.Numbers of predicted high-performing MOFs found in the hMOF dataset.