Quantifying similarity of pore-geometry in nanoporous materials

In most applications of nanoporous materials the pore structure is as important as the chemical composition as a determinant of performance. For example, one can alter performance in applications like carbon capture or methane storage by orders of magnitude by only modifying the pore structure. For these applications it is therefore important to identify the optimal pore geometry and use this information to find similar materials. However, the mathematical language and tools to identify materials with similar pore structures, but different composition, has been lacking. We develop a pore recognition approach to quantify similarity of pore structures and classify them using topological data analysis. This allows us to identify materials with similar pore geometries, and to screen for materials that are similar to given top-performing structures. Using methane storage as a case study, we also show that materials can be divided into topologically distinct classes requiring different optimization strategies.

enough geometric information to enable detection of materials that have similar overall pore shapes.There are computational techniques to quantify the similarity between crystal structures.(16,17) However, these algorithms are limited to identifying identical crystal structures, while we are interested in finding materials that may have different crystal structures or chemical compositions but similar pore geometries.Martin et al. (18) developed Voronoi network representations of pore geometries, which are useful as fingerprints but do not capture details of the local pore structure.In this Letter, we develop a mathematical quantification of geometric similarity by using topological data analysis (TDA).TDA is a field of big-data analysis that builds on techniques from algebraic topology, most noticeably persistent homology.(3) Its guiding philosophy is that the 'shape' of the data reveals important information about the data.(4) To assign a geometric descriptor to a given material, we sample points on the pore surface.
By growing balls stepwise around each sample point and monitoring their pairwise overlaps, we compute the associated filtered Vietoris-Rips complex (see SI section 4), which is then characterized by its 0-, 1-and 2-dimensional homology classes.We store the lifetime of each homology class in the corresponding persistence barcode.Combining the 0-, 1-, and 2-dimensional barcodes yields a fingerprint that characterizes the overall shape of the pore structure.
For analyzing pore shapes we are in the unusually fortunate situation that, unlike most other big-data applications of persistent homology, our data have actual geometric meaning.In almost all known big-data applications only the 0-D and 1-D barcodes are of relevance, while here the 2-D barcodes also carry essential information.For example, figure 1 shows the fingerprints of two different zeolite structures, IZA zeolite DON and hypothetical zeolite PCOD8331112.
DON contains eight identical cylindrical pores that run parallel to each other.The pore structure of PCOD8331112 is a 3-dimensional network that is formed of two types of connected spherical cavities.The 0-D barcodes of both structures start with as many intervals as there are points sampled on the pore surfaces.More information is contained in the long intervals describing robust shape features: the existence of the single long interval in its 0-D barcode implies that the pore system of PCOD8331112 is connected.In contrast, the pore system of DON consists of eight unconnected components, encoded by the eight long intervals in its 0-D barcode.The 1-D and 2-D barcodes contain information on the shape of the cavities (see SI section 4 for details).
The most elementary, but highly non-trivial, application of our approach is to identify porous materials with similar pore structures.As we have a database of over 3,000,000 nanoporous structures, (19) visual inspection is out of question.Suppose we would like to know whether the library of hypothetical zeolites contains structures whose pore geometry is similar to a given material, e.g., a synthesized zeolite.To see the effectiveness of our approach, it is instructive to take a structure and find the four structures that are most similar to the chosen one, selected once by conventional descriptors (ConD) and once using persistent homology (PerH).To compare these two sets, we compute their average distances to the reference material, measured by the metric D CS of the conventional space as well as by the metric D T S of the barcode space (see SI sections 1 and 4 for details).Figure 2a shows the average distances of the two sets for each of the 146 experimentally known zeolite structures accessible to methane taken.The distances are normalized by the largest pairwise distance in the database.The TDA approach provides what one would expect: when persistent homology is used to identify similar pore structures, small D T S correlates well with small D CS .I.e., similar persistence diagrams describing the pore shapes correlate with similar conventional geometric measures.Figure 2a shows that the relatively few zeolites for which there are no four structures very similar to a given one with respect to PerH (large D T S ), the first four structures chosen by PerH might or might not have similar conventional geometric descriptors (small or large D CS ).The conventional approach, however, gives a different result: for each reference structure we can find structures with sim-ilar conventional descriptors (small D CS ) but the shapes of their pores can differ enormously (large D T S ). Figure 2b shows two cases where the conventional approach identifies structures with very similar conventional descriptors (see SI Table SI-3) but very different pore structures.
In contrast, if we use our topology-based fingerprint, we indeed retrieve structures that look strikingly similar.In the SI (section 2) we show that one can also use this similarity search to compare structures from different classes of nanoporous materials.These findings are guaranteed by a stability theorem that is a key result in persistent homology: (20) materials with similar shapes are described by similar barcodes.
For the traditional descriptors with geometric meaning, one expects to find correlations with information encoded in the persistent homological fingerprint (see SI Table SI-1).For example, the radius of the maximum included sphere is correlated with the 2-D barcode as the radius of a cavity determines the death time of an interval in the 2-D-barcode.Further geometric information, like the connectivity of the pore structure (0-D) or the number of independent tunnels (1-D), is also encoded in the persistence barcodes.Therefore, only the combination of the barcodes of all three dimensions captures the global geometric features of the pore shapes we are interested in.
One of the characteristics of MOFs is their chemical tunability.Indeed, over the last 5 years over 10,000 structures have been synthesised.(8) Such a large number of materials makes it simply impossible to know all corresponding pore structures by heart.Therefore, an important application of our methodology is that we can now readily identify similar pore structures.In Figure 3 we show some examples of materials from the CoRE-MOF database that have similar pore geometries.Our list of similar structures in much longer but what is specific to these examples is that the authors of the corresponding manuscripts did not report the similarities in the original references.Of course, this does not imply that the authors of these articles were not aware of these similarities, but given that there are over 10,000 experimental MOF structures, such similarities are easily overlooked.
An important practical application of nanoporous materials is methane storage.The performance property of this application is deliverable capacity, which is defined as the difference between the amount of methane that is adsorbed at the (high) pressure at which the material is charged and the amount that remains in the material at the de-charging (low) pressure; the higher this deliverable capacity, the better the material.One of the interesting features of nanoporous materials is that one can optimize the pore geometry for a given application.The idea is that if one identifies a material with a high deliverable capacity, materials with similar pore geometries should also have an excellent performance.Let us illustrate this idea using all-silica zeolites.
For this class of nanoporous materials the chemical composition (Si/O) is the same, hence the determining factor is the pore shape.From molecular simulations we have determined the 13 best performing out of the 180 know-zeolite structures, each having a deliverable capacity larger that 90 (v STP/v).We subsequently identified for each of these top-performing materials the 10 most similar structures in our database of 139,407 predicted zeolites.Figure 4 (a) shows that indeed 80% of these 130 new structures have a deliverable capacity that is similar to the 13 top-performing knowns-zeolites.In Figure 4 (b) we show a similar results for MOFs where we used the 20 top-performing structures from the CoRE-MOF database and identified similar structure in the databases of 41498 predicted MOF structures: 85% of these materials show high performance with a deliverable capacity larger than 150 (v STP/v).It is interesting that even for MOFs which have different chemical compositions (unlike zeolites), our method of identifying similar pore shapes illustrates the importance of pore geometry, and hence, of our methodology to quantify similarity for these types of materials.
We can also use our approach to study the topological diversity of the top-performing materials for methane storage.Bathia and Myers (21) analyzed a small number of porous materials and concluded that top-performing materials should all have very similar heats of adsorption for given loading and de-charging pressures.Their work has had significant impact, as it provides a straightforward experimental recipe for optimizing the deliverable capacity of a material.(22) I.e., if all top-performing materials shared a similar heat of adsorption, having a heat of adsorption similar to this value was a necessary condition for all good performing materials.Given this impact it is surprising that the conclusion of Bathia and Myers stands in sharp contrast with observations of Simon et al. (1) Simon et al. computed the deliverable capacity for over 200,000 zeolite structures, and their data (Figure 5a) provide no evidence for a single optimal heat of adsorption, pointing to an interesting paradox: if one randomly selects a set of materials from Figure 5a, one finds no experimental indication that an optimal heat of adsorption even exists.Yet, the approach of Bathia and Myers has indeed been shown to be useful in optimizing performance.
To shed some light on this paradox, we applied topological data analysis to the data in Figure 5a.Analyzing the heat of adsorption for sets of geometrically similar structures, we obtain the desired 'volcano plots' shown in Figure 5b, which allow us to systematically search for the optimal heat of adsorption within a class of geometrically similar structures, and hence the best-performing materials.Interestingly, this optimal heat of adsorption depends on the geometric type of a material (23,24) (see Figure 5) and is not, as suggested by Bathia and Myers, a universal constant.In fact, Bathia and Myers assume implicitly that the entropy of adsorption is the same for all materials; for a set of similar materials as often chosen this assumption is more likely to hold.
The results above suggest that there is not a single class of optimal materials.For this particular example Simon et al. (1) have used brute-force simulations to compute the performance of all materials.This allows us to use the TDA approach to analyze the geometric diversity of the top-performing structures, and to visualize the topography of the zeolite library by generating the mapper plot (4,25) shown in Figure 6, encoding the topological structure of the set of the best 1% performing zeolites with respect to methane storage.
The shape of the diagram shows seven topologically different classes of top-performing materials.For example, group C consists of materials that have one-dimensional small cylinders, while group E has two-dimensional channels (see Table SI-2 for all different groups).The color coding of the mapper plot nicely illustrates that materials in classes of different pore shapes have very different optimal heats of adsorption.
In this article we have developed a topology-based methodology to quantify similarity of the chemical environment of adsorbed molecules, focusing on applications in which the pores play a passive role in providing adsorption sites.For applications in which the pores play a more active role, such as catalysis, a logical step would be to extend the methodology to include chemical specificity and charge distribution.From a topology viewpoint this application is of particular significance because it is one of the first applications of topological data analysis that requires persistent homology in three different dimensions.

Figure legends
Figure 6 1 Methods and computational details

Methods
In this section we briefly describe how we assign a descriptor to a porous material using persistent homology.
In order to assign the persistent homological descriptor to a material, we perform the following steps.We start by preparing a supercell of the material by expanding each unit cell to approximately the size of the largest unit cell of all considered materials, in order to compare materials that have unit cells of very different sizes.The pore system accessible to a gas molecule of interest is determined using the software package Zeo++.(1) The surface of this pore system is sampled with a fixed number of points per unit surface area.From these sampled points, filtered Vietoris-Rips complexes are constructed and their 0-, 1-, and 2-dimensional persistence barcodes computed using the software package Perseus.(2) We measure the distance between two barcodes by a combination of the L 2 -landscape distances of the barcodes from the dimensions 0,1, and 2, using the Persistence Landscape Toolbox.(3)

Computational details
The program Zeo++ (1) detects the accessible void space inside a porous material using a periodic Voronoi network, modeling the framework atoms as hard spheres with radii taken from the Cambridge Structural Database.(4,5) The space accessible to a gas depends on the gas molecule size and is determined in terms of a probe gas molecule, where the size of the probe has to be chosen according to the specific application.We treat a probe gas molecule as a sphere with radius 1.625, 1.5, 1.83, or 1.98 Å for methane, carbon dioxide, krypton, or xenon, respectively.
These values are chosen smaller than usual to mimic by geometric constraints the accessibility of pore space as determined by energy barriers.Zeo++ encodes the pore structure as a large set of points situated on the pore surface which is defined as the boundary of the space where a probe can be placed.For example, a cylinder-shaped pore whose radius equals the probe radius will be represented by points along the central line of the pore.To analyze this point cloud with persistent homology tools, it is necessary to decrease the number of points by performing a secondary sampling, since the raw output is too large to be handled; the raw output is hundreds of thousands of points for each supercell.On the one hand, it is important to have a fine enough resolution to capture details of the pore structures using only finitely many points and to ensure that the barcode assignment is stable with respect to the choice of the point cloud.On the other hand, high resolutions increase computational costs for the persistence computation.
We use a combination of random sampling and grid sampling.The grid sampling guarantees that different samplings of a structure give comparable barcodes, in particular by ensuring that points on narrow parts of the pore system are sampled while still maintaining its connectivity.
On the other hand, the random sampling prevents picking up the grid structure in the barcodes.
For the random sampling we choose one point per 2 Å2 surface area while respecting a minimal distance r min between two sampled points where we decrease r min in steps of 0.1 Å starting with r min = 1.3 Å until the given number of points has been selected.The grid size is 0.5 Å and for each cube of the grid the point of the original point cloud is chosen that is closest to the midpoint of the cube.A point of the grid sampling is added to the random sampling whenever its distance to the randomly sampled points is greater than the final value of r min .
The second step towards the persistent homological descriptor consists of calculating the persistence barcodes for a filtration of Vietoris-Rips complexes, obtained from the sets of points computed in the first step using the software package Perseus.(2) We restrict ourselves to constructing 3-dimensional Vietoris-Rips complexes, where the filtration parameter (corresponding to the radius of the balls grown around each point) increases in 164 steps of 0.025 Å increments, starting from the initial value of 0. The resulting 4.1 Å maximal filtration parameter is due to the fact that the memory cost needed by Perseus grows extremely fast as the radius increases in our calculations.While the relatively small maximal filtration parameter does not allow us to build a complete complex, it prevents geodesically distant points of the surface that are close in Euclidean metric to be connected unless the pore structure is very densely packed in the material.This is important since our construction does not distinguish homology classes that are formed in the solid part of the material from those formed in the pore regions.Technically, this makes the descriptor an overall descriptor of the geometry of the embedding of the pore-surface in the ambient space and not strictly describing the pore surface with respect to the pore space only.Fortunately, the technique does not tend to misidentify structures since the material part is typically much larger than the pore part.However, our maximal filtration parameter is not sufficiently large for all homology classes to die -these correspond to essential intervals in the barcodes -especially for zeolites having large pores.Therefore, to take account of these homology classes in computing distances between two barcodes, a maximal value for the death time has to be assigned, which is especially important in dimension 2 because of the small cardinality of barcodes.For 2-dimensional barcodes, we assign a death times to essential intervals based on the relation between the diameter D i of the largest included sphere and the death time for small and medium pores which is linearly fitted.An example for zeolites with methane is shown in Figure SI-1.The 1-dimensional barcodes contain sufficiently many intervals to distinguish different structures and we discard essential intervals.
To quantify the similarity between two materials in the barcode space, we combine as follows the L 2 -distances between the persistence landcapes (see section 4) corresponding to their barcodes in the different dimensions.After testing landscape distances of different orders (i.e. L ∞ , L 0 , L 1 , L 2 , and so on), L 2 -distances were chosen because they gave the smallest errors in predicting global structural properties and performance properties for a test set of materials.

2-dimensional) persistence barcodes, and let
where n i is the number of points sampled on the pore surface of the i th material, and V i is the volume of the supercell.The distance between two materials in the barcode space is then with coefficients α 0 = 0.1, α 1 = 0.45, and α 2 = 0.45, the values of which were chosen to minimize the error in predicting global structural properties and performance properties for a test set of materials.In dimension 0 the essential intervals are effectively discarded and instead the 0-dimensional barcode the number of sampled points per unit volume is used.This is a simplification that corresponds to discarding the essential intervals in all cases where different connected components of the pore system stay separated during the entire filtration; the 0dimensional barcodes of connected components are determined by the sampling procedure by construction.
The distance D CS between two materials in the conventional descriptor space is estimated with a normalized euclidean distance of five conventional structural properties with an equal weigh for each: D i (the diameter of largest included sphere), D f (the diameter of largest free sphere), ρ (density of a framework), ASA (accessible surface area), and AV (accessible volume).

Similarity in different classes of nanoporous materials
To illustrate the application of our method to finding similar pore geometries across different classes of nanoporous materials, we consider the following questions.
1. Are there zeolites that have the same pore geometry as a given MOF?
2. Are there hypothetical MOFs that are similar to MOFs that have already been synthesized?
3. Are there materials among the ca.5000 MOF structures that are deposited in the Cambridge Structural Database (6) that have similar pore structures but different chemical compositions?
The common theme behind these questions is to illustrate how the methodology developed here allows researchers to identify materials that have similar pore geometries.

MOFs and zeolites
In Figure SI-2, we identify the structures in the IZA+ hypothetical zeolite database (7,8) that are most topologically similar to some of the best known MOFs (e.g., MOF-5 and Cu-BTC).
The figure shows that we can indeed find hypothetical zeolite structures that look very similar to these MOFs.

Hypothetical and experimental MOFs
For hypothetical MOFs we have a database of over 140,000 materials.( 9) An interesting question is whether pore geometries similar to those occurring in hypothetical MOFs have already been synthesized.This question is difficult to answer with traditional methods, since the materials might differ in their chemical composition.We have compared the similarity of structures from the database of hypothetical MOFs (hMOFs) with the experimental structures in the is the 0-dimensional descriptor (P D 0 ), while the maximum included sphere (D i ) is best predicted with a 2-dimensional descriptor (P D 2 ).In section 4 we explain that this corresponds to the geometric interpretation of the persistent homology in the different dimensions.In addition Table SI-1: TDA analysis of the conventional descriptors K H (Henry coefficient), Q ad (heat of adsorption), ρ (density), D i (maximum included sphere), D f (maximum free sphere), ASA (accessible surface area), and AV (accessible volume).The data show the mean absolute percentage error, expressing how well these descriptors can be predicted on the basis of a training set using only the 0-D, 1-D or 2-D barcode as fingerprint, (P D 0 , P D 1 , and P D 2 , respectively), the combined 1-D and 2-D barcodes (P D 1,2 ), and the combination of barcodes from all 3 dimensions (P D 0,1,2 ).The mean absolute percentage error is calculated as where n is the number of zeolites, and P i or P i , P i,P erH is a property of a zeolite or that of the most similar zeolite selected by P D, respectively.

Theoretical background
Topological data analysis (TDA) is a mathematical technique for assigning various topological invariants to data.The guiding philosophy of TDA is that the 'shape' of the data, encoded by a mathematical 'signature', should reveal important relations among the data points.One of the best-known TDA techniques is persistent homology 11,12 , which we describe very briefly here.
Let us denote that persistent homology have already been used in material analysis to detect various phase states of glass. 13  Each material is encoded as a point cloud obtained by sampling points on the pore surface, giving rise to a description of the material in terms of the coordinates of the sampled points in 3-space.From the points, we construct a filtration of Vietoris-Rips complexes, which is a sequence of nested triangulated objects.
For a fixed non-negative real number r, the Vietoris-Rips complex of a point cloud is constructed as follows from the collection of balls of radius r centered at the points of the point cloud.Starting with the elements of the point cloud, add a line segment between a pair of points when the balls centered at the two points overlap.Similarly, a solid triangle is added when each pair of balls centered at its corners intersect and a solid tetrahedron when four balls all intersect pairwise.This procedure can be extended to all higher dimensions, but we stopped at solid tetrahedra both for computational reasons and because our point cloud represented a real three-dimensional structure.Since the Vietoris-Rips complex for a small radius is contained in the complex for a bigger radius, we obtain a filtration of complexes (Figure SI-6 top).
The shape of a complex is partly captured by its homology groups H n , where n is a nonnegative integer.The nonzero elements of H n are the homology classes in dimension n, which correspond to the n-dimensional 'holes' in the complex.More precisely, the 0-dimensional homology classes correspond to the connected components, while a 1-dimensional homology class is represented by a closed curve that does not bound a surface and a 2-dimensional homology class by a bounded cavity.
For example, a hollow tube has one 0-dimensional homology class since it is connected, one  The homology classes of a point cloud (such as that obtained by sampling a pore surface) are not very informative, since each point forms its own connected component, while H n = 0 for all n > 0. In contrast, the homology groups of its Vietoris-Rips complexes strongly depend on the position of the points in space.This information is stored in persistence barcodes that track the non-trivial homology classes through the radius-dependent filtration.A persistence barcode is a set of intervals where each nontrivial homology class is represented by a bar.The starting point of an interval denotes the smallest radius for which the homology class represented by the interval (e.g., a circle around a hole in dimension 1) appears in homology of the associated Vietoris-Rips complex, while the endpoint is given by the radius where the homology class disappears (e.g., the smallest radius for which the balls close the hole) (

Fig. 1 .Fig. 2 .
Fig. 1.Examples of fingerprints obtained by persistent homology for two different zeolite structures DON (top) and PCOD8331112 (bottom).The figures on the left show the structures, the middle the fingerprints, and the right magnifies details of the 1-D fingerprints.The red lines in the figures on the left show the zeolite structures and the navy dots are the set of randomly sampled points on the pore surfaces.The SI contains animations of growing these fingerprints.

Fig. 3 .
Fig. 3. Materials from the CoRE-MOF database that have a similar pore geometry.Each row gives examples of materials that are very similar.There are many more similar structures in the CoRE-MOF data base than we have listed here.The ones that are listed are those in which there are no cross references in the original articles of the corresponding similar structures.

Fig. 4 .
Fig. 4. Deliverable capacity of the 10 materials that are most similar to the best performing 13 zeolites (a) respectively 20 MOFs (b) with respect to PerH.

Fig. 5 .
Fig. 5.The deliverable capacity and heat of adsorption of zeolites.(a) The deliverable capacity and heat of adsorption of all zeolites (data from Simon et al. (1)).(b) Four reference structures IFR, LEV, VSV and BIK were chosen and for each of them we show the 500 geometrically most similar materials (with respect to our topological descriptor) highlighted on the plot from (a).The optimal heats of adsorptions for these subsets are depicted with the vertical lines in (a).

Fig. 6 .
Fig. 6.Mapper plot generated by performing Topological Data Analysis (TDA) on the subgroups of top-performing zeolites (top 1%) for methane storage application.Nodes in the network represent clusters of materials with similar pore shapes and edges connect nodes that contain structures in common.Each material is represented by its persistent barcodes and the metric in this space is D T S .Examples of the different groups are shown in SI.The inset distinguishes the top-performing materials in the training set (white circles) among all topperforming materials.These figures were obtained with the Ayasdi Core software platform (ayasdi.com).Lens: Neighborhood lens (Resolution 30, Gain 3.0x), see Ayasdi manual.Nodes are colored by the average value of the heats of adsorption of the materials in a cluster (Red: high value, Blue: low value). FIGURES Figure 1

Figure SI- 1 :
Figure SI-1: Correlation of the death time of 2-dimensional homology classes and the diameters of the largest included sphere D i when using methane CH 4 as a probe molecule.The red line indicates the least squares regression line; Death time = 35.6×Di -100.8.

Figure SI- 4 :
Figure The persistence barcodes of a torus as obtained from a channel by implying periodic boundary conditions.

Figure SI- 5 : 4 Figure SI- 6 :
Figure.Classes that have a short lifetime can be considered as noise, while classes that persist through long intervals reveal actual structural features of the point cloud.To compare two materials in terms of their persistence barcodes, we use a combination of the L 2 -distances between the persistence landcapes corresponding to the persistence barcodes of same dimensions.Informally, a persistence landscape is a family of functions

Table SI
Mean absolute percentage errors of K H and AV are obtained with log K H and log AV . * 1-dimensional homology class corresponding to the circle running around the axis of the tube, which does not bound a disk, and no 2-dimensional homology class, as the tube does not bound a 3-dimensional cavity.In contrast, if the ends of the tube are glued together, for example by applying periodic boundary conditions, a torus is formed (see FigureSI-4).Being connected, it still has one 0-dimensional homology class, but two independent 1-dimensional homology classes, which are represented by the circle around the axis (blue) and the newly formed circle that runs parallel to the axis (red).The torus is hollow and thus bounds a cavity that represents a 2-dimensional homology class.If a solid torus is considered, the circle around the axis bounds a disk and therefore does not contribute to the 1-dimensional homology, and there is no nonzero 2-dimensional homology class.