Rapid Design of Top-Performing Metal-Organic Frameworks with Qualitative Representations of Building Blocks

Data-driven materials design often encounters challenges where systems require or possess qualitative (categorical) information. Metal-organic frameworks (MOFs) are an example of such material systems. The representation of MOFs through different building blocks makes it a challenge for designers to incorporate qualitative information into design optimization. Furthermore, the large number of potential building blocks leads to a combinatorial challenge, with millions of possible MOFs that could be explored through time consuming physics-based approaches. In this work, we integrated Latent Variable Gaussian Process (LVGP) and Multi-Objective Batch-Bayesian Optimization (MOBBO) to identify top-performing MOFs adaptively, autonomously, and efficiently without any human intervention. Our approach provides three main advantages: (i) no specific physical descriptors are required and only building blocks that construct the MOFs are used in global optimization through qualitative representations, (ii) the method is application and property independent, and (iii) the latent variable approach provides an interpretable model of qualitative building blocks with physical justification. To demonstrate the effectiveness of our method, we considered a design space with more than 47,000 MOF candidates. By searching only ~1% of the design space, LVGP-MOBBO was able to identify all MOFs on the Pareto front and more than 97% of the 50 top-performing designs for the CO$_2$ working capacity and CO$_2$/N$_2$ selectivity properties. Finally, we compared our approach with the Random Forest algorithm and demonstrated its efficiency, interpretability, and robustness.


Introduction
With recent advances in machine learning (ML), material system design and development has undergone rapid acceleration 1,2 .However, one of the major challenges in applying ML to material system design lies in finding the appropriate design representations 3 .Most material design applications take advantage of quantitative (or numerical) design variables to represent material systems.In most cases, these quantitative descriptors (features) require either expert knowledge or data analysis to find the most appropriate ones.
On the other hand, although most qualitative (or categorical) variables (e.g. chemical elements, chemical compositions) are more accessible than quantitative variables, it is challenging to directly include qualitative variables as a part of the design variables in automated materials design.Metal-organic frameworks (MOFs) are an example of such materials systems.MOFs are a class of porous crystalline materials that have been used extensively for gas storage 4,5 , gas separation 6,7,8,9 , and catalysis 10,11,12 .Because of their highly tunable nature, MOFs have been looked at as a potential solution for different applications such as carbon dioxide (CO2) capture and separation 13,14 .Using a vector notation in which each element corresponds to a qualitative design variable such as topology, node, and edge, MOFs can be represented with the sole usage of qualitative variables as shown in Figure 1.However, the versatility and different possible combinations of the MOF building blocks lead to millions of candidates.To demonstrate a simple example, consider a MOF system with a topology that requires 2 nodes and 3 edges for construction.
Selecting only 20 different building blocks for each node and edge leads to a combinatorial design space of more than 10 6 MOF candidates.Due to the high experimental cost, both in time and resources, computational approaches have been increasingly used to replace experimental exploration 3 .
Figure 1.The qualitative representation and construction of metal-organic framework materials.Each MOF can be represented with a "vector" where each element (letter) in the vector corresponds to a choice of building block or topology.With the combination of different building blocks and topologies explored in this work, there are more than 10 4 hypothetical MOFs to be explored.
While high-throughput screening approaches 15,16,17,18,19,20 and ML techniques 21,22,23,24,25,26,27,28 have been utilized to computationally search or design for top-performing MOF structures in different applications, existing approaches usually rely on large data sets and high-dimensional physical descriptors to represent the material design space.These processes can be both time consuming and property specific, meaning that the ML models and descriptors are often not transferable to different design objectives.Finally, many ML models are viewed as 'black boxes' that are not easily interpretable for understanding how and why the model performs the way it is 29,30,31 .Therefore, a novel and a generic computational approach that (i) employs a simple but descriptor-free design representation, (ii) requires substantially smaller amount of data, and (iii) is easily interpretable would be highly useful for the design of novel MOFs.
Bayesian optimization (BO) has been shown to be effective for identifying the optimum candidates for materials systems with large design spaces and local optimums in different applications such as drug discovery, additive manufacturing, and genetics 32,33 .BO has also been used to identify high performance MOFs 34,35 .However, previous works on MOFs require expert knowledge for the choice of appropriate physical descriptors (e.g., gravimetric surface area, largest included sphere diameter) as inputs for surrogate model training.Gaussian Process (GP) is a popular surrogate model choice for BO as it provides both predictions and uncertainty quantification, which are the two main components of the acquisition function for choosing samples when applying BO.However, GP models fall short when there are qualitative design variables.This bottleneck has been recently bypassed by the novel Latent Variable Gaussian Process (LVGP) 36 approach, which can incorporate qualitative variables into GPs by implicitly mapping each qualitative variable into a quantitative space through low-dimensional latent variables.LVGP still possesses the qualities of a GP model in terms of providing fast surrogate modeling, capturing nonlinear responses, providing predictions, and quantifying uncertainties.Furthermore, the latent variable approach provides physics-based dimension reduction.Specifically, the latent variables and their locations in the latent space provide physically meaningful information on how the qualitative variables influence the responses.Thus, LVGP bridges the gap for incorporating qualitative information into engineering design applications and has been already employed in data-driven materials design research 37,38 .Although LVGP and BO have been applied to materials design and development, its application has been limited to qualitative design variables with small number levels, i.e., the design options per variable.
Here we present the Latent Variable Gaussian Process Multi-Objective Batch Bayesian Optimization (LVGP-MOBBO) framework to perform rapid design of superior MOFs directly from the building blocks that construct the material.Specifically, we are interested in identifying the Pareto front for a multiobjective optimization and top-performing MOFs without any human intervention.We are particularly interested in examining the performance of the approach under both small and large numbers of levels for qualitative variables.We take advantage of the readily available qualitative building block information that is used to construct the MOFs and build an interpretable LVGP surrogate model that cooperates with MOBBO to adaptively lead towards promising MOF candidates for CO2 capture and separation.With the integration of batch BO, this work shows that descriptor-free LVGP can also be effectively extended to applications with substantial number of levels.

Design Spaces
To show the effectiveness of LVGP-MOBBO, we demonstrated our framework on a design space using the fof topology, which consists of 4 types of building blocks (BB).We used 7 organic nodes (Nodular BB1) and 4 inorganic nodes (Nodular BB2).There are also two types of edges in the fof topology, and we used 41 edges for Connecting BB1 and 42 edges for Connecting BB2, as shown in Supplementary Figure 1.For the use of MOFs in carbon capture, two of the most important metrics are the CO2 working capacity and the CO2/N2 selectivity.Since we focused on method development, we calculated these properties for all MOFs in our design space in advance to aid in testing different variations of the search methods.The first design space, which we denoted as the Reduced Design Space (RDS) for validation purpose, consists of 1001 MOF designs that were specifically selected to demonstrate the applicability of the LVGP and BO onto MOF design problems.The second design space, which we denoted as Entire Design Space (EDS) contains 47,740 MOF design candidates that were constructed by combining all available building blocks (7, 4, 41, 42) for the organic node, inorganic node, and the two edges.Our framework, LVGP-MOBBO, was demonstrated on this design space to show the effectiveness of LVGP when a large number of building blocks (levels) are present.
The two main goals of our design optimization are to identify (i) the Pareto front of MOF designs between the CO2 working capacity and CO2/N2 selectivity, and (ii) the top-performing MOFs.The Pareto front represents set of MOF designs that possess properties that are superior to the rest of design space but cannot be improved without sacrificing the other properties of interest 39 .Furthermore, the top-performing MOFs represent the MOF designs that are closest to the Utopian MOF design in the Euclidian space.The Utopian MOF corresponds to a hypothetical MOF design that possesses the maximum available property of all objectives, which is often not achievable and therefore considered to be "utopian".As a result, the "topperforming" MOFs are identified based on the distance of a design point to the Utopian MOF in the objective space.

Framework
We would like to explore a given design space with as few resources as possible.Thus, we implemented the LVGP-MOBBO framework to perform descriptor-free MOF design optimization with only qualitative representations of building blocks.Our proposed LVGP-MOBBO design exploration framework, consists of 5 major parts: Initial Design of Experiments (DOE), Property Evaluation, Latent Variable Gaussian Process, Multi-Objective Batch Bayesian Optimization, and Design Solution (Figure 2).

Initial Design of Experiments
For a large design space optimization, the initial selection of design candidates plays a key role.Ideally, they should span the design space as much as possible, for which we employed the optimal sliced Latin hypercube sampling (OSLHS) 40 .The generated MOF designs were then passed into the next task for property evaluation.The detailed generation of the DOE is explained in the Methods sections.

Property Evaluation
Hypothetical MOFs were created using the ToBaCCo 3.0 package 41 , and the geometry optimization was carried out using the LAMMPS code 42 with the UFF4MOF force fields 43 .Grand canonical Monte Carlo (GCMC) simulations were performed using the RASPA package 44 to evaluate the CO2 working capacity and CO2/N2 selectivity properties.Further details of the property evaluation can be found in the Methods section.

Latent Variable Gaussian Process (LVGP)
Using the available MOF designs and their associated properties from the GCMC simulations, one LVGP model for each property was trained.Next, the properties of unexplored MOFs in the design space were predicted along with their quantified uncertainty in predictions, which are utilized by the MOBBO.The details of the LVGP modeling are provided in the Methods section.

Multi-Objective Batch Bayesian Optimization (MOBBO)
Utilizing both the predictions and the uncertainty estimates on the remaining candidates in the design space from the LVGP model, the MOBBO selects a batch of MOFs that has the highest Expected Maximin Improvement (EMI) values.The EMI is formulated in a way that both objectives have equal importance.A batch of  number of MOFs designs with the highest EMI values is selected and passed on to the Property Evaluation task once again.The framework then continues in this cycle until the stopping criterion (e.g.number of MOFs identified) is reached.Further details and formulation of the MOBBO are provided under the Methods Section.

Design Solution
Once the optimization stopping criterion is reached, the identified design candidates are analyzed further to distinguish the Pareto front and the top-performing MOF designs.Finally, the latent space of each building block is visualized to make inferences about their influence on each property of interest.

Performance Validation using a Reduced Design Space (RDS)
Before applying our proposed methodology to a large design space, we validated the effectiveness of LVGP and BO on MOFs by implementing the optimization campaign on a relatively small design space.This space contains Connecting BBs that were handpicked to show the novelty of the methodology by validating the latent variables obtained at the end of the optimization campaign and assessing the efficiency of the methodology for designing MOFs that possess superior properties.All the Nodular BBs (7 and 4 levels for Nodular BB1 & BB2) and 6 building blocks from the Connecting BB1 & BB2 were selected for RDS.Specifically, we selected Connecting BBs labeled as {5, 7, 8, 28, 29, 41}.The blocks {5, 7, 8} have similar molecular structures with different functional groups (-CN, -F, -NH2).Blocks {28} and {29} are extended structures of blocks {5} and {7}, respectively.Finally, block {41} is an empty building block, which facilitates a direct connection between the Nodular BB1 and BB2.The building blocks and labels can be found in the Supplementary Figure 1.
The RDS contains 1001 MOF design candidates and three Pareto front MOFs designs.The property space with the known Pareto front and the Utopian designs is shown in Figure 3a.In addition to demonstrating the effectiveness of LVGP for MOFs, the design optimization goal of this study was to identify both the Pareto and other top-performing designs.To account for all the possible Nodular BB1s, 7 MOFs were chosen for the initial DOE using OSLHS.Each of the 7 MOFs corresponds to one level of the Nodular BB1.Furthermore, due to the small number of candidates, we chose to add one MOF design,  = 1, during each iteration of BO.Batch BO was implemented on a larger design space, as discussed later in the paper.
The LVGP-BO design optimization campaign ran until the stopping criteria of identifying both the Pareto front and the 10 top-performing MOFs were satisfied.Starting with 7 initial MOFs in the DOE, this stopping criterion led to 44 iterations, which in return shows that a total of 51 (7 + 44) MOFs (5.1% of the design space) are explored as the next design candidates.Specifically, the LVGP-BO framework found all three Pareto front MOF designs in 42 iterations, which corresponds to 4.9% of the design space.The optimization history of identified MOF designs can be seen in Figure 3b.The fast design exploration of the LVGP-BO demonstrated its capability of finding top-performing MOFs.To verify the interpretability of LVGP models, we examined the latent spaces obtained from training the LVGP surrogate model for both properties after the 44-iteration optimization campaign.In addition, to validate the correctness of the latent variables obtained from the optimization campaign, we also trained the LVGP models on the entire RDS for both properties.We then compared the latent spaces obtained from these two instances.The comparison of the latent spaces for both objectives is shown in Figure 4.The four large colored boxes represent the latent space obtained for each BB after training the LVGP, the two columns represent different properties, and the two rows represent the different training instances.By comparing boxes in each column, we observed that the latent space representations obtained at the end of the design optimization show differences with the latent spaces obtained from the LVGP model trained on the entire design space.This was an expected result since LVGP-BO is optimization driven.However, independent of orientation and the scale of z1 and z2 (the 2D latent space axis), the relative distances between latent variables, which reflect the relationships between design choices (building blocks) and their influence on the properties, are preserved after LVGP-BO.For example, for Connecting BB1, level {41}is far from the other levels for both properties in both training instances in Figure 4.This similarity shows that even though the LVGP used in the design optimization framework was trained on a very small portion of the entire design space that is biased towards promising building block candidates, it can capture the underlying latent variables and the relationships between building blocks very well.This can be very advantageous for designers to understand and extract true meanings from the design decisions that our framework makes.The next question then becomes, what do these latent variables represent?Figure 5 shows the importance of the input space, in terms of the textural characteristics, on the property space.For both the RDS and the EDS that will be demonstrated later, we found that most top-performing MOFs for CO2/N2 separation often have small pores, characterized by low values of the largest cavity diameter (LCD) and small gravimetric surface area (GSA).MOFs with smaller pores could result in stronger van der Waals interaction and thus favor CO2 adsorption over N2 adsorption.Knowing the importance of the input space on the latent space, we further investigated how different building blocks affect the pore size, and ultimately the latent space (Figure 6).For the latent plot of Nodular BB1 (as shown in Figure 4), we found that the distance among the blocks {1, 3, 4, 5} and {6, 7} are small, and block {2} is always far from the rest of the variables.Building blocks {2, 6, 7} are smaller blocks than {1, 3, 4, 5}, resulting in MOFs with smaller LCD (Figure 6a).This could explain why blocks {1, 3, 4, 5} are always closer in the latent space than {2,6,7}.Moreover, block {2} is bulkier, with a branching -CH3 group, than block {6,7}, resulting in MOFs with slightly smaller pores, and thus far away from the other building blocks.In the Connecting BB1 latent variable plots, we observed that the 5 blocks {5, 7, 8, 28, 29} formed a cluster and are located far away from the block {41}.
Because {41} is an "empty" building block (Supplementary Figure 1c & d), using block {41} resulted in MOFs with significantly smaller pores than other building blocks (Figure 6b), and thus different in the property space.For Nodular BB2 and Connecting BB2, we found that the building blocks lead to minimal differences in the pore sizes (Supplementary Figure 2), and thus LCD could not be used to explain the latent space.For Nodular BB2, the building blocks have the same shape and differ only in their metal elements {1: Co, 2: Cu, 3: Ni and 4: Zn}.A potential explanation for the latent space is the difference in Lennard-Jones parameters (Supplementary Table 1), in which Zn has an  value of about one order of magnitude larger than the other elements, suggesting a stronger van der Waals interaction for Zn, which could favor CO2 adsorption over N2.As a result, block {4} (or Zn) is far apart from the other designs.Although the chemical identities of the building blocks in Connecting BB2 are the same as in Connecting BB1, Connecting BB2 has small effect on the pore size, and thus the property space.As a result, the points are evenly spread out in the latent space of Connecting BB2.Both the optimization performance and the physically interpretable model obtained from this design study demonstrate the effectiveness of LVGP and BO for further design applications.

Entire Design Space (EDS)
After confirming the effectiveness of our methodology on the RDS, we applied our framework to the entire design space (EDS) that contains 47,740 MOF candidates through combination of 7, 4, 41 and 42 building blocks for Nodular BB1, Nodular BB2, Connecting BB1, and Connecting BB2, respectively.The MOF candidates and their respective properties can be seen in Figure 7a.The design space contains 7 Pareto front MOF designs of interest.Incorporating our knowledge from previous LVGP implementations 37,45 and considering the large number of available building blocks in the design space, we decided to match each edge (Connecting BB2, 42 options) with each metal node (Nodular BB2, 4 options), resulting in a total of 168 MOFs to be selected for the initial DOE.To create this DOE, we used OSLHS once again.After starting the LVGP-MOBBO framework with 168 MOFs, we proceeded by adding batches of  = 5 new MOFs with highest Expected-Maximin Improvement (EMI) values until the design space search campaign reached the stopping criterion.The design optimization campaign stopped when the mean EMI values of the 5 MOFs that are selected for property evaluation in each iteration is less than a constant, , taken as  = 10 −5 in our study.With the aforementioned stopping criterion, the LVGP-MOBBO design optimization campaign continued for 66 iterations, identifying 498 MOF designs in total, including the initial 168 MOFs.Our results show that by scanning only 1.04% of the entire design space, LVGP-MOBBO identified all MOF designs that lie on the Pareto front.Specifically, as seen in Figure 7c, all the Pareto front designs are identified within 45 iterations, which corresponds to exploration of only 0.82% of the entire design space.This shows that our methodology is very effective and efficient.Although the initial DOE covers the MOF input design space as evenly as possible, the MOFs in the DOE are not distributed evenly in the property space (Figure 7c).
For some machine learning and optimization methods, this can be problematic, as we show later with the Random Forest approach.However, our methodology was swift in guiding the design decisions towards MOFs with high properties.Figure 7b shows the result of exploring the different number of top-performing MOF designs that are closest to the Utopian MOF design.The LVGP-MOBBO found all of the 25 topperforming MOFs.Furthermore, our methodology identified more than 97%, 87%, 80% of the 50, 100, 200 top-performing MOF designs, respectively.Finally, out of all 330 MOFs explored, 206 MOF designs (63.3%) belong to the 330 top-performing MOFs.The high efficiency in identifying a large number of solutions is advantageous due to two potential main reasons.First, it is possible that not every proposed MOF can be synthesized in the laboratory.Second, there are other criteria that must be addressed in practice beyond the CO2 working capacity and selectivity, such as cost and stability.Thus, it is useful to have alternative promising candidates at hand, so that a practical solution can be found.
By looking further into the histogram of selected building blocks at the end of the optimization campaign (Supplementary Figure 3), we observed a bias towards particular building blocks.Specifically, for Nodular BB2 and Connecting BB1, the blocks {2} and {41} are favored because all the Pareto front MOFs possess these building blocks.Therefore, LVGP-MOBBO can identify the promising building blocks effectively and choose them as the next MOF designs at the very early steps of the optimization campaign.
The interpretability of the LVGP approach can be demonstrated using the results for the entire design space.
At the end of the 66 iterations, we observed that the latent spaces of the Nodular BB1 (Figure 8a & c) and BB2 (Supplementary Figure 4a & c) converged to a final state.This means that after each iteration, the latent spaces obtained for these design variables did not change.On the other hand, we observed nonconvergent latent spaces for Connecting BB1 & BB2 that contain 41 and 42 different design choices, respectively.This is because the LVGP model is trained with a very small percentage of the design space (~1%).The optimization campaign still works well although the latent spaces are not stable.Specifically, block {41} is always separate from the rest in the latent space plots, meaning that its superior effect on the properties is identified clearly.Furthermore, the non-converging behavior is observed for the blocks that have minimal effect on the performance properties.Therefore, our framework can identify the specific building blocks that are superior with a physical justification using the physics-aware LVGP approach.
The latent variable plots of the Nodular BB1 (Figure 8a & c) and Connecting BB1 (Figure 8b & d) can also be explained using the MOF textural properties.For Nodular BB1, blocks {1, 3, 4, 5} form a cluster in the latent space of the CO2 working capacity, while blocks {2, 6, 7} are spread out (Figure 8a).A similar trend was observed in the RDS (Figure 4), which we ascribed to the size of building blocks that determine the LCD of the MOFs.However, the latent space for the CO2/N2 selectivity changed slightly compared to the RDS.Specifically, blocks {3, 4} are away from blocks {1, 5} and become closer to block {6}, while the positions of the other building blocks remain similar.For Connecting BB1 (Figure 8b & d), block {41} is distant from the other building blocks, which was also observed in the small dataset.Although some building blocks are also further apart from the clusters, their locations change from one iteration to another.
The latent variables for the Nodular BB2 and the Connecting BB2 can be found in Supplementary Figure 4.The Nodular BB2 plots can be explained by a similar reasoning as for the RDS plots, whereas the Connecting BB2 is non-convergent due to low training percentage of the LVGP.

Comparison with Random Forest and Robustness of LVGP-MOBBO
We compared our LVGP approach with another ML approach, Random Forest (RF), which is also used for optimization problems with qualitative variables.Both approaches employed the same MOBBO method defined previously.To conduct the study, we ran both frameworks 15 times on the EDS for 60 MOBBO iterations with 15 different initial DOEs.The Pareto front and top-performing MOF design exploration performance of the study can be seen in Figure 9.We observed that the LVGP can identify all the Pareto front MOF designs whereas the RF approach fails to do so in most cases (Figure 9a).The small confidence interval in the performance shows that the LVGP approach is robust and reliable in identifying the Pareto front candidates.On the other hand, the confidence interval of RF is large since some of the RF-MOBBO instances fail to identify any Pareto Front MOF designs.This is because RF-MOBBO is stuck in local optimum designs since the algorithm cannot predict beyond the training data, which usually contained initial DOEs with low properties.In contrast, LVGP was able to expand beyond the low property region towards the high property region by its capabilities of extrapolating beyond training data.Moreover, the Bayesian prediction of uncertainty provided by the LVGP compared to frequentist prediction of RF leads to better and more effective design space exploration 46 .More importantly, the LVGP approach makes the correct design decisions at a faster rate compared to the RF approach, which is crucial if the cost of conducting simulations or experiments is very high.Similarly, for all number of top-performing MOF identification categories, the LVGP approach resulted in a better and more robust performance (Figure 9b).The advantages provided by the LVGP approach are not limited to design optimization.The physical justification provided by the latent variables make the LVGP an interpretable and a favorable ML model for MOF design.The latent variables obtained at the end of the optimization campaign enabled us to gain physical insights behind the design decisions.Although RF is an explainable ML model, model agnostic methodologies are required to draw conclusions or understand its performance.Thus, together with the better performance and accuracy results, the interpretability of LVGP makes our approach more desirable and meaningful for materials design applications.

Discussion
Due to their versatile and tunable nature, MOFs have very large design spaces, and it is impossible to simulate or perform experiments for every MOF to find the novel candidates for an application of interest.
Although numerous ML and high-throughput screening approaches exist, they require either large databases or property-specific descriptors.To tackle these challenges, we presented the LVGP-MOBBO framework to design superior MOFs by only employing qualitative representations of building blocks.The framework presented here provides three main advantages compared to current similar efforts: (i) the framework requires no specific descriptors and only uses the MOF building blocks to perform the adaptive design space search, (ii) the framework is application independent, meaning that it can be applied to any property without the need to select important descriptors for the application of interest, and (iii) the physically justifiable latent variable approach provides interpretability on how each building block influences the resulting performance properties.We demonstrated our framework on a design space with 47,740 MOF candidates.The LVGP-MOBBO successfully identified all Pareto front designs and more than 97% of the 50 top-performing MOF candidate designs by scanning only ~1% of the design space.Compared to Random Forest, LVGP has better performance and robustness, and provides interpretability regarding the design through physically justifiable latent variables.Finally, although we demonstrated our framework on a MOF design space with adsorption properties, LVGP-MOBBO can be applied to any property that requires time consuming simulations such as quantum mechanical calculations.
A key challenge in the presented framework lies in the high number of building blocks.When a large number of blocks are present, although the design optimization campaign works efficiently to identify the top building blocks and MOF designs, the LVGP model struggles to converge to a final latent space due to high number of parameter (latent variables) estimations during model fitting.We expect that by incorporating prior knowledge, when available, into the framework such as assigning prior known distributions to latent variables, the latent variable realizations can be more accurate 47 .We can also incorporate additional descriptors that can further differentiate building blocks from each other to build more accurate LVGP models 48 An interesting application of our framework would involve performing materials design and development through autonomous experimentation studies.As there is no human intervention in LVGP-MOBBO, and the experimental inputs can be both qualitative and quantitative, we envision that the method we presented here can help researchers guide their experiments efficiently.

MOF Representation and Database Construction
For the fof topology, we constructed each MOF using an inorganic node, an organic node, and two edge blocks.Thus, to represent each MOF we use a 5-element 'vector' representation, [ −  −  −  − ], where each letter represents Nodular BB1, Nodular BB2, Connecting BB1, Connecting BB2, and Topology, respectively.A visualization of this representation is shown in Figure 1.
We eliminated 104 MOFs that had poor initial geometries and missing bonds.We performed geometry optimization on the remaining 48,112 MOFs, in which we found and eliminated 372 MOFs that collapsed after the geometry optimization.Therefore, 47,740 MOFs were considered for this study.
For the initial set of materials, also known as design of experiments (DOE), that initialize the optimization framework, first an optimal Latin hypercube sample with specified number of experiments and variables was created.Then, for each qualitative variable, the design space was sliced into   sections, where   represents the number of unique options for each qualitative variable.Each DOE design is assigned to a qualitative variable that falls under the sliced section.An example DOE with two qualitative variables (Nodular BB1 & Nodular BB2) that each have 7 and 4 levels is shown in Figure 2

MOF Construction and Geometry Optimization
MOFs were created using the topologically based crystal constructor (ToBaCCo 3.0) 41 software.Geometry optimization was carried out to optimize the unit cell parameters and atomic position using LAMMPS 42 with the UFF4MOF 43 force field.For each structure, the geometry optimization was performed in a cycle that consisted of two steps, as recommended by Anderson et al. 41 .The unit cell parameters and atomic positions were first relaxed using a conjugate gradient (CG) algorithm, followed by atomic position relaxation using the FIRE algorithm (we chose a timestep of 0.1 fs).Each minimization converged only when the change in energy from the previous step to the current step divided by the current energy magnitude was less than 10 -8 and the forces on atoms were less than 10 -8 kcal/mol Å.The cycle stopped when the change in energy between the previous cycle and the current cycle was less than 10 -8 kcal/mol.

GCMC Simulations
Grand canonical Monte Carlo (GCMC) simulation was carried out using the RASPA package 44 .Each simulation consisted of 10,000 equilibration cycles and 10,000 production cycles.The Monte Carlo moves used were translation, rotation, insertion, deletion, and random reinsertion.Lennard-Jones (LJ) and Coulombic interactions were used to calculate the energies between non-bonded atoms.LJ parameters between different atom types were computed using the Lorentz-Berthelot mixing rules.CO2 and N2 were modeled as three-site rigid molecules with charges on each site, using the LJ parameters and partial charges from the TraPPE force field 49 .LJ parameters for the framework atoms were from the Universal Force Field (UFF) 50 .The framework atom partial charges were calculated using the PACMOF (Partial Atomic Charges in Metal-Organic Frameworks) software 51 .For each MOF, we carried out two GCMC simulations; the first was at the adsorption condition of 1 bar, 313 K, and a bulk molar composition of CO2 : N2 = 0.15 : 0.85, and the second was at the desorption condition of 0.1 bar, 313 K, and a bulk molar composition of CO2 : N2 = 0.9 : 0.1.We used the CO2 working capacity (∆  2 ) and the CO2/N2 selectivity at adsorption (  2 / 2  ) as the criteria to determine top-performing MOFs for CO2/N2 separation.The two properties are defined as follows:

Latent Variable Gaussian Process (LVGP)
One of the main contributions of this paper lies in the design optimization of MOFs using only the readily available qualitative representations of building blocks.On the other hand, due to the nature of the correlation functions, it is not possible to directly implement the building blocks into the Gaussian Process (GP) models as the difference between variables becomes unclear.Therefore, in this paper, we implemented the novel Latent Variable Gaussian Process (LVGP) to account for the qualitative variables in the GP model 36 The Gaussian correlation function shown in equation ( 3) evaluates the correlation between points  and ′ based on 2-norm distance.The main reason behind choosing this correlation function is because we assume that points that are close in the spatial input space should also reflect a similar behavior in the output space as well.Along with the 2D mapped latent variables  = (  1 (),   2 ()) for level  of each qualitative variable , the parameters,  and  are estimated through Maximum Likelihood Estimation (MLE) of the log-likelihood function where  is the number of samples,  is the  ×  correlation matrix with   = (  ,   ) for ,  = 1,2,3, … , ,  is a vector of ones with dimensions of  × 1, and  is the observed response with dimensions of  × 1.Finally, the 2D quantitative latent variables are then used to construct a GP model that provides both prediction and statistical representation of uncertainty in the design space for Bayesian optimization.

Multi-Objective Batch Bayesian Optimization (MOBBO)
Bayesian Optimization is a well-known efficient, fast, and easy-to-implement optimization technique that has been used in numerous materials design applications.For single objective optimization, BO makes the decision on which design in the design space should be sampled next based on the choice of an acquisition function.Three well-known acquisition functions are Lower Confidence Bound 52 , Probability of Improvement 53 , and Expected Improvement (EI) 54 .With its ability to balance exploration of the design space and exploitation of the objective, EI has been a popular choice for most materials design applications.
Considering the large MOF design space, we have also chosen EI as our base acquisition function.For a given candidate design , with its predicted objective value   ′ and quantified uncertainty   from the LVGP model, EI for single objective optimization can be calculated using, ), but also the uncertainties associated with the design space,   .
Often there are tradeoffs between objectives, meaning that one objective cannot be optimized without sacrificing the other one.This type of problem is also known as Pareto front optimization and is frequently observed in material systems 39 .Thus, for multi-objective optimization problems, the goal becomes discovering the Pareto front of the property space.Therefore, we have expanded the EI formulation by implementing the Expected-Maximin Improvement (EMI) acquisition function to serve as the balancer of the exploration and exploitation for multi-objective optimization.For the case of optimizing two objectives, the formulation of EMI is (  ) = min(max( 1 ,  2 ), 0) (6)   where   corresponds to the Expected Improvement value of each objective .The EI formula was used to compare the candidates with respect to the observed number of  Pareto front designs so far in the optimization campaign.Therefore, each   is a vector of  × 1 that contains the EI values of a candidate design on the observed Pareto front designs for each objective.Lastly, we first take the maximum of EI's for both objectives to observe the dominance of the candidate on the current Pareto frontier and then select the minimum of the maximum EIs to balance the multi-objective search.As a result, the EMI is formulated in a way that both objectives have equal importance.Equation (6) selects the single best candidate in each multi-objective BO iteration.
Due to the large number of candidate designs and the cost of training GP models, it is not ideal to train the LVGP with a single design candidate at each iteration.Therefore, to extend single candidate BO to select a batch of promising candidates, we select  candidates that possess the highest EMI values in each iteration and use them as the next design candidates.A demonstration of a single MOBBO iteration is demonstrated in Figure 2 under the Multi-Objective Batch Bayesian Optimization box.

Figure 2 .
Figure 2. The Latent Variable Gaussian Process-Multi Objective Batch Bayesian Optimization (LVGP-MOBBO) framework.The initial set of materials, also known as DOE, is generated by optimal sliced Latin hypercube sampling.Property Evaluation includes MOF construction and prediction of their adsorption properties using Grand canonical Monte Carlo simulations.The LVGP builds the surrogate model that captures the relationship between the design and property space.MOBBO makes the next batch of MOF designs for property optimization.Design Solution analyzes the MOF designs and the latent spaces.The details of each box are explained thoroughly in the Methods section.

Figure 3 .
Figure 3.The LVGP-BO results for the Reduced Design Space (RDS) exploration.(a) The property space of the available MOF candidates.The known Utopian and Pareto front MOF designs are highlighted with black and red points, respectively.(b) Design optimization history for 10 top-performing and Pareto front MOF designs.The blue color represents the Pareto front search, and the red color represents the 10 top-performing design search.

Figure 4 .
Figure 4.The latent variables obtained from the RDS study.Each colored box shows the 4 building block design variables and red dots show their respective latent variable.The numbers represent the design choice for the specific building block.The axes z1 and z2 represent the 2D latent space obtained from the LVGP model.The 1 st row represents the latent variables obtained by training LVGP on the entire design space and the 2 nd row represents the

Figure 5 .
Figure 5. Structure -property relationship of the Entire Design Space (EDS) and Reduced Design Space (RDS) datasets.The CO2/N2 selectivity versus the CO2 working capacity for the EDS (a and b) and for the RDS (c and d), colored by the largest cavity diameter (LCD) (a and c) and the gravimetric surface area (GSA) (b and d).

Figure 6 .
Figure 6.The distribution of the largest cavity diameters of 1001 MOFs in the RDS for different building blocks.Largest Cavity Diameter (LCD) distribution for (a) Nodular BB1 and (b) Connecting BB1 on RDS.

Figure 7 .
Figure 7. Performance of the LVGP-MOBBO on the EDS.(a) The property space of MOF design candidates along with Pareto front and Utopian MOF designs.(b) Percentage of top-performing MOFs identified after 66 iterations.Numbers on top of bars indicate the amount of identified top-performing MOF designs.(c) The initial DOE and the identified MOF designs after different numbers of iterations.(d) The building block representations of Pareto front MOFs and their crystal structures.Each MOF is represented as a vector [A-B-C-D], where each letter represents Nodular BB1, Nodular BB2, Connecting BB1, Connecting BB2, respectively.

Figure 8 .
Figure 8. Latent variable plots after the LVGP-MOBBO campaign on the EDS.(a) Latent spaces of the Nodular BB1 for (a) the CO2 working capacity and (c) the CO2/N2 selectivity.Latent spaces of Connecting BB1 for (b) the CO2 working capacity and (d) the CO2/N2 selectivity.

Figure 9 .
Figure 9. Comparative Study with Random Forest and LVGP-MOBBO.(a) Pareto front and (b) top-performing MOF designs identified by the two methods after 15 different runs.
under the Initial Design of Experiments box.
where  * is the best observed objective so far in the optimization campaign and ,  represent the cumulative distribution function (CDF) and probability distribution function (PDF), respectively.As equation(5) shows, the EI function suggests a new design by not only considering the exploitation of the objective function, ( * −   ′ . It is known that for every qualitative variable, there are underlying, possibly high-dimensional, quantitative variables that explain its effect on properties.The latent variable approach helps us to map the qualitative variables to a quantitative latent space.Consider a GP model input with  = [ 1  ,  2  , … ,    ] ∈  × with  qualitative variables and  number of points.Each variable,   , has   unique levels (design choices) { 1 (  ),  2 (  ), …    (  )} for  = 1: , (e.g., Cu, Co, Ni options for nodular building block (  = 3)).Then, each qualitative variable can be represented with a latent variable vector   (   ) = {   ( 1 ), … ,    (   )} for      , where k represents the dimension of the   .The developers of the algorithm have stated that users are free to choose the dimensions of   but also demonstrated that  = 2 is enough to represent the underlying high-dimensional quantitative space.Consequently, we chose  = 2. Thus, each level within a qualitive variable can be represented with a latent vector of   (   ) = {  1 (   ),   2 (   )}, and the input to the GP model becomes () = [  1 (   ),   2 (   ), … ,   1 (   ),   2 (   )] ∈  ×2 .An illustration of the latent variable representation of qualitative building blocks is shown in Figure 2 under the Latent Variable Gaussian Process box.Consider a typical single response Gaussian Process model, which consists of prior constant mean  and   (), describing the mean response at any given point in the input space, and a zero-mean Gaussian Process with a covariance function (,  ′ ), respectively.The covariance function (,  ′ ) determines the relationship or the correlation between variables in the model.The covariance function can be further extended to (,  ′ ) =  2 ⋅ (,  ′ ) where the  2 represents the prior variance of the GP model and (,  ′ ) describes the correlation between each point in the model through the specified correlation function.To explain the relationship between each design candidate for this application, we have implemented the Gaussian correlation function: