Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks

Machine learning has emerged as a powerful approach in materials discovery. Its major challenge is selecting features that create interpretable representations of materials, useful across multiple prediction tasks. We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material’s structure and chemistry. This approach builds on computational topology techniques (namely, persistent homology) and word embeddings from natural language processing. It automatically encapsulates geometric and chemical information directly from the material system. We demonstrate our approach on multiple nanoporous metal–organic framework datasets by predicting methane and carbon dioxide adsorption across different conditions. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from the commonly-used, manually-curated features, consistently achieving an average 25–30% decrease in root-mean-squared-deviation and an average increase of 40–50% in R2 scores. A key advantage of our approach is interpretability: Our model identifies the pores that correlate best to adsorption at different pressures, which contributes to understanding atomic-level structure–property relationships for materials design.

www.nature.com/scientificreports/ desirable to have an automatic framework to generate expressive features that work across multiple applications, enabling more transferability and less "handcrafting. " Creating a universal representation from the input material structure, suitable for all different prediction tasks, is incredibly complicated. Typically, domain experts select specific features as the model input, usually tailored to making predictions about a particular property of interest. Often, this approach requires a large amount of manual processing to extract the necessary features 15 . For example, in the case of gas adsorption at high pressure, guest molecules tend to occupy the entire void space in a material, so void fraction can be used in predictive models. In contrast, for gas adsorption at low pressures, the guest molecules aggregate in the strongly binding regions of the material's pore-standard structural descriptors are not able to capture this information as well. Additionally, chemical interactions of the system, in particular local strong adsorption sites, are important in determining some gas adsorption properties; this information also needs to be encoded in material descriptors.
Besides geometry and topology, chemical makeup of the internal surfaces is key for predicting MOF properties. Chemistry is especially important for predicting adsorption capacities at low pressures. Previous approaches have constructed chemical descriptors by incorporating information from MOF building blocks, such as functional groups 13,16,17 . These approaches have resulted in some improvements in predictive capabilities, but they still require manual feature curation to inspect all of the building blocks in the dataset. Moreover, the prediction accuracy of these descriptors often does not transfer across structures and properties.
In this paper, we describe how to overcome the above challenges and present an end-to-end ML framework that automatically generates a material representation, while only requiring the basic material structure (atomic coordinates and elemental composition) as input. As a consequence, this approach avoids handcrafting representations that do not transfer across property predictions. We use a topological descriptor, called persistent homology 18 , to compute multi-scale signatures of the channels and voids in the pores of the material. There have been previous approaches applying topological data analysis to materials 19,20 ; however, in this work, we show that descriptors can be constructed from topological data analysis for downstream machine learning tasks for materials.
Additionally, we use features built using word embedding techniques 21 to describe chemical information. As we demonstrate, this automated ML framework beats the standard structural descriptors in predicting a variety of materials properties. We also show that the overall methodology-coupling these features with ML algorithms that assign importances-opens the proverbial ML black box and allows us to interpret the predictions by identifying geometric and chemical properties relevant to different tasks.

Methods
Datasets. We demonstrate our approaches on three datasets corresponding to MOFs of various diversity, and across a range of CH 4 and CO 2 uptake pressures predicted using grand cannonical Monte Carlo simulations [22][23][24] . The first dataset is the hypothetical MOFs (hMOFs) database generated by Wilmer et al. 22 . The hMOF structures were taken from MOFDB (http:// hmofs. north weste rn. edu), which also has adsorption uptakes for carbon dioxide at five different pressures ranging from 0.05 bar to 2.5 bar.
The second dataset is the Boyd-Woo predicted MOF database 23 with the predicted methane and carbon dioxide adsorption capacities at low and high pressure, and methane and carbon dioxide Henry's coefficients. The Henry's coefficients are expressed in terms of their logarithms.
Finally, we also included the 2019 CoREMOF dataset of the experimentally synthesized MOFs 24 . For each structure in our dataset, as in our previous work 25 , we have determined the values of the following commonly-used geometric descriptors. We call these structural descriptors, and use them as a baseline to compare against topological descriptors: (a) pore limiting diameter (PLD), in (Å), the diameter of the largest sphere to percolate through a material; (b) largest cavity diameter (LCD), in (Å), the diameter of the largest sphere than can fit inside the material's pore system; (c) crystal density ( ρ ), in (kg/m 3 ); (d) accessible volume (AV), in (cm 3 /g); (e) accessible surface area (ASA), in (m 2 /cm 3 ).
The values for these descriptors were computed using the Zeo++ software package 26 . Automated topology-processing pipeline. We construct an automated pipeline to process an input MOF. We describe the topological structure of the MOFs using persistent homology 18 . To normalize the size of each MOF, expressed as (periodic) base cells of different sizes, we fill a (100 Å) 3 cell with the atoms of the MOF. The size is chosen to be large enough to capture the statistics of the distribution of the topological features in every structure.
We represent a MOF as a union of hard spheres centered on its atoms. We increase the radii of these spheres and keep track of the changes in the topology of their union. The changes come in two types: a topological feature, such as a loop or a void, either appears or disappears. An important consequence of the algebraic formulation of this process is that these events can be paired uniquely, resulting in a set of birth-death pairs of radii, called a persistence diagram; see Fig. 1. There are two persistence diagrams relevant to us: a diagram that tracks births and deaths of loops that we interpret as tunnels in the MOF (we call these 1-dimensional features), and a diagram that tracks voids that we think of as pockets in the MOF (2-dimensional features). The difference in birth-death values is called persistence of the pair. Pairs of larger persistence capture more prominent pores in the MOF. We compute persistence diagrams using the Dionysus library (https:// github. com/ mrzv/ diony sus).  27 . The birth-death pairs (b, d) are transformed into birth-persistence pairs (b, d − b) . They are then convolved with Gaussians and discretized onto a grid of a fixed size, by integrating the resulting mixture of Gaussians in the cells of the grid. For this, we use the resolution of 50 × 50 and a Gaussian spread of σ = 0.15.

Word embeddings.
We incorporate word embeddings of the chemical elements to represent a given MOF's stoichiometric formula into our automated pipeline. We use this to capture the MOF's chemical information. The chosen embeddings were constructed from a large corpus of abstracts with the word2vec algorithm 21 . The only input required is the elemental composition of the MOFs. Using word embeddings maintains the automated nature of our machine learning pipeline. While the use of word embeddings to featurize composition do represent an implicit knowledge that the chemical elements are distinct, they use no explicit element-specific properties and are themselves derived from an unsupervised learning procedure on raw text. From an input MOF structure, we construct features based on the composition of each MOF structure that represent word embeddings for the different elements in the MOF using the "matscholar_el" preset ElementProperty featurizer in matminer 28 . The features correspond to 200 embedding dimensions, with the minimum, maximum, range, mean, and standard deviation for each dimension, for a total of 1000 values. We note that the different datasets have different numbers of unique elements. For example, the hMOF dataset has eight, while the BW dataset has 16.
Machine learning. We use random forest 29 regression to predict carbon dioxide and methane adsorption uptakes at different pressures including infinite dilution (the Henry's coefficients). One of our motivations for using the random forest is the ability to determine the feature importances in the model. The random forest algorithm builds an ensemble of decision trees and chooses a random subset of features for each one. The frequency with which a particular feature is chosen for a split is an estimate for the importance of the said feature.
We build trees for different groups of features: topological features, standard structural features, word embeddings, a combined model of topological features and word embeddings, a combined model of topological and structural features, and a combined model of topological features, structural features, and word embeddings. The topological features consist of both the 1D and 2D persistence images. We train the random forest on the specific target prediction of each material. Each of the forests consists of 500 trees, and the final prediction is the average of the prediction of all trees in the forest. After training the random forest on a training set, predictions are made on an unseen test set. For most of the predictions, we use an 80%/20% training-test split. The quality of the prediction is evaluated by comparing the predicted adsorption values and the correct adsorption values. We quantify our predictions by computing the root-mean-square deviation, (ŷ i − y i ) 2 /n , and the coefficient of determination (R 2 ), 1 − (y i −ŷ i ) 2 )/ (y i −ȳ) 2 . We also note that there are other approaches to utilize persistence diagrams in machine learning algorithms, such as by directly processing the diagrams through an input persistence layer in a neural network 30 . Interpretability and representative cycles. The algorithm used to compute persistence 31 tracks cycles that represent the topological features summarized in the persistence diagram. The cycles are not unique, but they reveal the atomic structures responsible for particular birth-death pairs. In a crystal structure, representative cycles correspond to channels or voids in the material. We visualize the cycles to better understand the topological features that appear in the MOFs. We choose which cycle to visualize using the feature importances found

Results
We evaluate the accuracy of the automatically generated descriptors for our machine learning models by predicting a number of different targets across the different datasets. For each target, we calculate the root-mean-square deviation (RMSD) and coefficient of determination (R 2 score). For each target and each dataset, we include results from models trained on only the topological features, only the word embeddings, and both the topological features and the word embeddings (T + WE). We also include results from the structural descriptors, described in Section "Datasets", as a baseline. Finally, we incorporate the standard structural descriptors by including models combining topological and structural descriptors (T + S), as well as topological descriptors, structural descriptors, and word embeddings (T + S + WE). hMOF dataset. For the hMOF dataset, we predict carbon dioxide adsorption capacities at different pressures, as shown in Fig. 2. The RMSD is low at lower pressures because the distribution of carbon dioxide adsorption capacity has low variance in this regime. While the topology-based model outperforms the word embeddings, the model combining the two performs even better. We also see that the topological features always outperform the structural features, often significantly. The word embeddings do not perform as well here. This is likely due to the hMOF dataset lacking compositional diversity: the hMOF data set contains only eight unique elements. Nevertheless, word embeddings help boost the overall model performance when combined with the topological features.
We achieve the best performance by combining all three features together, but the accuracy achieved by subsets of the features is revealing. Adding structural to topological features slightly improves the performance, but doesn't match that of all three features combined. On the other hand, the T + WE model performs only slightly worse than the T + S + WE model, indicating that the topological features capture most of the information that the structural features provide.
We compare our results to Fanourgakis et al. 11 , who used standard structural features and a customized featurization based on atom types to predict CO 2 adsorption capacity in the hMOF dataset. Table 1 shows results for each of our models at different pressures, along with the best model from Fanourgakis et al 11 .
Our model does particularly well at low pressures, achieving an R 2 score of 0.86 at 0.05 bar, compared to 0.65 from 11 . Carbon dioxide adsorption at low pressure has an important application: carbon capture from flue gases. Thus, it is particularly promising to have a generalized framework for accurate prediction of these targets. In general, our model transfers well across different pressures, as demonstrated by consistently high performance. BW dataset. We evaluate the accuracy of the automated machine learning pipeline on the BW dataset. We predict six targets grouped into three categories: the Henry's coefficient (log(K H )) for CO 2 and CH 4 , the gas uptakes for CO 2 at 0.15 and 16 bar, and the gas uptakes for CH 4 at 5.8 and 65 bar.  www.nature.com/scientificreports/ Table 2 shows the results of these predictions for the BW dataset for the baseline structural features (S), topological features (T), and topological features + word embeddings (T + WE). As a general trend, the T + WE model outperforms the structural features by a large amount, with an average (across all targets) decrease of 25.2% in RMSD and an average increase of 26.6% in R 2 score. This is especially apparent for the Henry's coefficient predictions and the CO 2 and CH 4 gas uptakes at low pressure. For these low pressure and infinite dilution gas adsorption predictions, to our knowledge, these topological descriptors are currently the best-performing descriptors that only take into account geometric information about the MOF. Supplementary Fig. 1 shows further visualization of the results with different sets of features.
CoREMOF dataset. Finally, we evaluate the accuracy of the automated ML pipeline on the CoREMOF dataset. To narrow the dataset in a principled manner, we only include MOFs with a known topology net 32 , with each topology net appearing at least 15 times in the dataset for a total of approximately 50 topology nets in the whole dataset. We predict four targets here: the Henry's coefficient (log(K H )) for CO 2 and CH 4 and the gas uptakes for CH 4 at 5.8 and 65 bar.
The improvement in using our ML framework in contrast to the commonly used structural features is particularly apparent in prediction improvement for the Henry's coefficient's of both CO 2 and CH 4 as well as low pressure CH 4 . This improvement is especially noticeable in R 2 scores. For example, as seen in Table 3, our ML framework results in a 165% improvement over the structural features when predicting the Henry's coefficient for CO 2 . The implications here are vast as adsorption in the infinite dilution regime, such as is commonly seen at low partial pressures, is very important for carbon capture applications. Moreover, the same model provides additional improvement over RMSD and R 2 scores across all the targets, with an average decrease of 27.8% in RMSD and an average increase of 68% in R 2 score. While the same structural features cannot be used for accurate predictions across many different targets, in contrast, our model shows far greater transferability. Supplementary  Fig. 2 shows further visualization of the results with different sets of features.
Notably, across all the datasets, the model combining topological and structural features only performs marginally better than the topological features alone. This indicates that the topological features are capturing almost everything the structural features capture, as well as much more.

Interpretability
We also show the utility of our approach from an interpretability point of view. The feature importances extracted from the ML models contain important information to enhance our understanding of the material design process, and we explore multiple facets of this in the next sections. www.nature.com/scientificreports/ Feature analysis. The random forest algorithm infers the importance of individual features by measuring how frequently they are used by the decision trees to make a prediction about a MOF. In our methodology, there are three distinct types of features: topological, structural, and word embeddings. Further, topological features come in two types, 1-dimensional features that capture the distribution of channels in the MOF and 2-dimensional features that describe the voids. Each of those consists of 2500 individual features (pixels in the persistence image), but we combine them to infer the aggregate importance of the different feature types. In this section, we analyze contributions from the topological and word embedding features, since the structural features contribute little extra information. Figure 3 shows the relative importance of topological descriptors and word embeddings. For the BW dataset, 2D features are most important for the prediction, with word embeddings playing a larger role in the predictions of the Henry's coefficient. For the CoREMOF dataset, word embeddings are more important, especially for the CO 2 Henry's coefficient where they account for 50-60% of the decisions, with topological features dominating the importance of predictions for both low and high pressure methane adsorption (albeit, 1D features play a larger role in low pressure methane adsorption, while 2D features play a larger role in high pressure methane adsorption). For the hMOF dataset, 1D topological features are most important at low pressures, with 2D being more important at higher pressure, and word embeddings used in ∼ 30% of the decisions.
As Fig. 3 shows, topological features play a major role in predicting gas adsorption, with the 1-dimensional channels being especially important for adsorption at low pressures in the CoREMOF and hMOF datasets, and 2-dimensions voids being important for the predictions with the BW dataset. The differences in feature importances can also be linked back to the data: for example, the CoREMOF MOFs tend to have smaller pores than the BW MOFs.
These results reveal the importance of different properties for different tasks. They support the claim that chemical information is more important for infinite dilution and low-pressure CO 2 adsorption. In these conditions, the specific interactions between the gas and the MOF framework, e.g. manifested as strong binding sites, play an important role in adsorption capacity-the word embeddings capture this non-structural information. Table 3. Model performance on CoREMOF dataset. Root-mean-square-deviation (RMSD) and coefficient of determination (R 2 score) results in predicting the Henry's coefficient (log k H ) for CO 2 and CH 4 and gas uptakes for CH 4 for the CoREMOF dataset. Different sets of features (S = baseline structural, T = topological, T + WE = topological and word embeddings) are shown. For each target, the units are mol kg −1 Pa −1 and V STP /V respectively. The best model is in bold. As the improvement from the topology + word embeddings is always greater than the structural features, the percentage of improvement (decrease in the case of RMSD and increase in the case of R 2 score) is also shown ( ).   www.nature.com/scientificreports/ On the other hand, methane adsorption at higher pressure is mostly described by 2D topology features, which can described voids at large, a trend that we also observed in zeolites 25 .
Our results also suggest why the conventional structural descriptors perform especially poorly when predicting CO 2 adsorption in hMOFs at low pressure or in the infinite dilution region. The standard structural features describe the pore geometry by the largest sphere to percolate through the materials and the largest sphere that can fit inside its pore system. At low pressures and/or in the infinite dilution region, the standard structural features are not able to capture the nuance of the gas molecules aggregating closer to the binding regions of the porous framework. In contrast, topological features record the widths of the channels that criss-cross the MOF as well as the sizes of different cavities. They also distinguishing between the distribution of channels and voids, by separating 1D and 2D topological features, and record other finer information about their shape.
Topological features and representative cycles. Nanoporous materials, and especially MOFs, are known for how tunable they are: experimentalists can synthesize materials with precisely sized pores. Understanding how structure features influence a particular material property helps guide this process. Our approach incorporating persistent homology is especially helpful in this task.
The points in a persistence diagram correspond to voids and channels of specific sizes. A point (b, d) in a 2-dimensional diagram is generated by a cavity that can fit the largest sphere of radius d; the largest sphere that can access the cavity has radius b. A point (b, d) in a 1-dimensional diagram is produced by a channel in the material, specifically, by its narrowest "bottleneck. " The death value, d, records the radius of the largest sphere that can pass through this bottleneck. The birth value, b, records how close the atoms of the bottleneck are to each other.
For each dataset and each target property, the most important 1D and 2D birth-death points, as identified by the random forest algorithm, are listed in Table 4. We note a few patterns. In the case of methane adsorption in all three regimes (infinite dilution, low pressure, and high pressure), the 2D birth and death values are similar for both the BW and CoREMOF datasets-in fact, almost identical for the infinite dilution and high pressure cases. Specifically, birth values are around 2.3-2.4 Å for high pressure methane adsorption, and 3.4-3.8 Å for low pressure and infinite dilution methane adsorption. Death values are 3.2 Å for high pressure methane adsorption, and 4-4.6 Å for low pressure and infinite dilution methane adsorption. The radius of a methane molecule is assumed to be 3.8 Å. These results suggest that pores somewhat larger than this radius adsorb well at low pressures and partial pressures, while at high pressures slightly smaller pores influence the overall adsorption capacity of the MOF.
Another pattern to note in the hMOF dataset is that 1D death values get larger as pressure increases, meaning the size of the largest sphere able to pass through the channel increases. The radius of a CO 2 molecule is assumed to be 3.3 Å. For high pressure targets, the model picks out the channels that can accommodate the molecule of this size. The 1D birth/death values for lower pressures correspond to smaller pores, such as the porous surface, which is related to the binding regions of the material's pore.
We can dissect topological representations further and extract representative cycles for each point. Although these cycles are not unique-we are at the mercy of certain choices persistent homology calculation makesthey are helpful in visualizing the cavities and channel bottlenecks represented by the points in the persistence diagram. Table 4. Most important 1D/2D birth-death points for the different datasets (in Angstroms). These values correspond to the porous framework sizes most important for a given adsorption task. www.nature.com/scientificreports/ Since we train our machine learning algorithm on vectorized persistence images, we have to take an extra step to identify the points in a persistence diagrams with relevant representative cycles. We illustrate our steps for this approach in Supplementary Fig. 3.
We extracted the representative cycles from the high gas adsorption MOFs from different databases. Two examples, including both 1D topology (channels) and 2D topology (voids), appear in Fig. 4. One notable trend is that the loop in Figure 4a is present in many of the materials in the hMOF dataset that have high CO 2 adsorption at low pressure. Similarly, the void size seen in Fig. 4b is present in many of the MOFs with high Henry's coefficients for CO 2 adsorption.
We expand on the latter by showing, as an example, in Fig. 5 the extracted voids that appear in a number of the top MOFs with a high CO 2 Henry's coefficient. As noted in 33 , the process of identifying the void structure that appears in top performing MOFs can be extremely time-consuming via manually detected features. Thus, we hope that our approach will allow for further study in pinpointing the channel and void shapes and bonding structures that correlate best to important material's properties, thereby encouraging the targeted design of structures to maximize desirable properties.

Word embeddings and material properties.
We explore the interpretability of the word embeddings by relating their importances in predicting MOF properties and in predicting chemical properties of individual elements. The former we obtain from the random forests just as the importances of the topological features. To calculate the importances for individual elements, we retrieve word embeddings for all the elements in the matscholar database 21 and use these as features to train models to predict various chemical properties-electronegativity, atomic radius, electrical resistivity, melting point, etc.-of the pure elements contained in pymatgen's 'periodic_table' module 34 . We extract the feature importances for each of these models. Because each MOF has 1000 features, summarizing the distribution of 200 features over its elements, as described in Section "Word embeddings", we sum up the MOF feature importances corresponding to the same elemental feature.
We take the subset of feature importances that account for 90% of the random forest decisions. By definition, these features describe the subspace of our input where most of the decisions are made to make a prediction about the given target property. Given a MOF target property and a chemical target property, we compute the Jaccard similarity between the two subsets of features. This metric measures the relative size of the subspace, important for the random forest decisions for both targets. Table 5 lists the top three materials properties by similarity to each MOF target property; all of them have a Jaccard similarity greater than 0.4. Following this procedure, we identify the chemical property with the strongest informational relevance to a given MOF target property.
We focus on interpreting the results from a MOF design perspective. The word-embedding features play a bigger role than topology in predicting log(K H ) CO 2 . For this target, the machine learning model trained on electronegativity was the most similar to the model trained on the word embeddings for each MOF. This suggests that local interactions are more significant in carbon dioxide adsorption in the infinite dilution regime, which is consistent with qualitative descriptions of low pressure or dilute-limit profiles of absorptivity in porous materials from literature 12 .
Thermal conductivity also appears multiple times, and is the most relevant elemental property for high pressure CH 4 adsorption. The relevance of thermal conductivity at higher pressures is more difficult to interpret, given that thermal conductivity contains an electronic and vibrational component. However, a relationship between thermal conductivity and MOF geometry has been suggested previously. Specifically, thermal conductivity correlates with pore size and porosity 35,36 , which in turn affects adsorption. Thus, when designing a MOF, including or substituting metal atoms which have low thermal conductivity in their phase pure form may improve adsorption in MOF structures. The coordination environment and identity of the coordinating linkers also likely plays a role in determining the trend for a given site. For reference, we have included the compositions of the high adsorption MOFs for each prediction task in the Supplementary Material.
Another materials property that appeared multiple times for multiple MOF targets was the Poisson's ratio, which reflects elasticity of a material. This is another property that fits in the existing paradigm of MOF design:  www.nature.com/scientificreports/ namely, flexibility. MOFs with flexible frameworks often are better adsorbents 37 , since they can accommodate a larger space to fit a gas molecule with less stress.
In summary, the latent information contained in the word embeddings overlaps with known descriptors for MOF gas adsorption, pointing to important chemical features for designing high adsorption MOFs 38 .

Conclusions
We have developed an automated end-to-end machine learning framework for MOFs, and nanoporous materials in general, by using persistent homology and word embeddings. Our approach builds a complex and holistic representation of the materials using only the basic input material structure, requiring less handcrafting and domain expert guidance than the currently widely-used porosity and chemical descriptors. Our topological representation is a vectorized persistence diagram, obtained from the atomic coordinates of the normalized Figure 5. Correlating void structure to MOF property. (a) str-m4-o14-acs-sym-8 (b) str-m4-o1-o22-acssym-94 (c) str-m4-o1-o24-acs-sym-96 (d) str-m4-o1-o24-acs-sym-165. The representative cycles of voids corresponding to the void most correlated with the CO 2 Henry's coefficient in example MOFs with high CO 2 Henry's coefficients. The voids are all composed of a similar bonding structure, with each different atom type represented by a different color. As noted in 33 , the process of identifying the void structure that appears in top performing MOFs can be extremely time-consuming via manually detected features. Thus, we hope that our much faster and topologically-grounded approach will allow for further study in pinpointing the channel and void shapes and bonding structures that correlate best to important material's properties, thereby encouraging the targeted design of structures to maximize desirable properties. Figure created with VisIt 3.1.4 (https:// wci. llnl. gov/ simul ation/ compu ter-codes/ visit). Table 5. Material properties sharing overlap with word embedding feature importances. Machine learning models trained with elemental word embeddings and materials properties are compared to the models trained with MOF composition word embeddings and MOF target properties for the CoREMOF dataset. The feature importances of each model are analyzed, and compared by Jaccard similarity. The top three materials properties most similar to the model trained to MOF target properties are listed. www.nature.com/scientificreports/ supercell representation of a materials' crystal structure. It can be used in any machine learning algorithm. We augment the topological information with element embeddings, constructed from a large set of scientific abstracts via the word2vec algorithm 21 . They provide a generalized representation of the MOF composition. We have tested this approach on three different datasets, predicting several important methane and carbon capture adsorption targets at various pressures. These experiments show a significantly improved performance compared to standard structural descriptors. The topological features we compute are generic and transferable across different property targets. As the topological descriptors consistently outperform standard structural descriptors, they provide a simple way to boost the performance of any machine learning algorithm. Additionally, to our knowledge, these descriptors are the best purely geometric descriptors for predicting gas adsorption at low pressures and in the infinite dilution regime. Moreover, these descriptors are interpretable: their components can be traced to specific channels and voids in the crystal structure, which contributes to a greater understanding of structure-property relationships in MOFs. We conclude by highlighting the key strengths of our approach.
(1) It is an ML pipeline that can automatically generate descriptors for a particular material's prediction task without the need to handcraft specific features. We make large gains in performance (ranging from an average 25-30% in root-mean-square-deviation and an average 45-50% increase in R 2 scores) across numerous different gas adsorption targets. (2) The generalizability and transferability of our ML model provides a way to quickly screen any dataset to find the top MOFs for a particular task without the need to handcraft specific features, speeding up highthroughput screening of materials for adsorption applications. As our results show, topological descriptors should be used for any porous materials adsorption prediction task and bring us closer to having a universal predictor for adsorption in porous materials. (3) Our model helps guide materials design by directly connecting property predictions to the crystal structure, thereby encouraging the targeted design of structures to maximize desirable properties.

Code availability
The code for generation of the material representations is available at: http:// www. github. com/ a1k12/ molec ule-tda.