Zeo-1, a computational data set of zeolite structures

Fast, empirical potentials are gaining popularity in computational materials science, physics and chemistry. With this comes a rising demand for high-quality reference data for the training and validation of such models. In contrast to research that focuses mainly on small organic molecules, this work presents a data set of geometry-optimized bulk-phase zeolite structures. Covering a majority of framework types from the Database of Zeolite Structures, the set includes over thirty thousand geometries. Calculated properties include system energies, nuclear gradients and stress tensors at each point, making the data suitable for model development, validation or referencing applications focused on periodic silica systems.

Measurement(s): potential energy • Technology Type(s): Computational Chemistry • Factor Type(s): Crystal structure, composition and topology

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.17313236


Background & Summary
Atomistic models are an essential tool for the prediction of thermodynamic, mechanical or biochemical properties of a substance. More recently, pre-trained models have become increasingly popular because of their comparatively low complexity and high accuracy on modern hardware [1][2][3][4][5][6]. For such models to perform well, their empirical parameters must be fitted to high-quality reference data. Depending on the application, reference data are either experimental or come from computationally more expensive ab initio calculations. Although there are already a handful of large computational data sets covering small organic molecules [7][8][9], such data are still scarce for larger periodic systems (cf. Materials Cloud Archive [10,11] or the NOMAD database [12,13]). Motivated by this, we present a quantum-chemical data set for zeolites. Zeolites are porous materials composed of interconnected SiO4 or AlO4 tetrahedra. Their properties can be fine-tuned through the synthesis of materials with specific pore sizes, or through the inclusion of additional metal cation sites [14][15][16][17]. Because of their topology and synthetic flexibility, zeolites have various applications as adsorbents [18][19][20] and catalysts [17][21][22][23]. To this day, a large number of zeolite framework types have been realized experimentally, and many more hypothetical structures can be derived [24][25][26]. The documentation of fundamental zeolite framework types and derived materials has led to the publication of the well-known Atlas of Zeolite Structures [27] in several editions. The atlas lists each unique framework type by its three-letter code, as assigned by the Structure Commission of the International Zeolite Association (IZA). Today, its contents are available online at the Database of Zeolite Structures [28], which we use as a source of initial structures for our data set.
In this first installment, we include properties for 204 out of the currently available 256 zeolite framework types in the database (a total of 226 unique geometries when also considering derived materials). Our descriptor provides the complete optimization trajectories for each system with atomic positions, lattice vectors, atomic gradients and stress tensors at each step. We envision future extensions of the data set to focus on derived geometries, covering structural defects and host-guest interactions.

Methods
Initial zeolite structures are collected from the public Database of Zeolite Structures [28] in the Crystallographic Information File (CIF) format and converted to the XYZ format with the Atomic Simulation Environment [29] (ASE) package. After selecting all systems with fewer than 301 atoms, each structure is manually curated by removing redundant atom positions in cases of fractional occupancy and adding missing hydrogen atoms where needed. Each structure's coordinates and cell parameters are energy-minimized with the periodic density functional code BAND [30] […] processes [37][38][39]. For the optimization of the initial structures, geometry convergence criteria are left at their default values, namely 0.001 Hartree/Å, 0.00001 Hartree/atom and 0.1 Å for atomic gradients, energy and atomic displacements, respectively. We use a quasi-Newton optimizer [40] in delocalized coordinates for the initial optimizations. Cases of problematic convergence are restarted with the FIRE [41] optimizer.
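The convergence test described above can be sketched as follows. This is a simplified illustration and not the BAND implementation: the function name `is_converged` and its arguments are hypothetical, the thresholds are the values quoted in the text, and the production code may apply further criteria (for example on root-mean-square values). Gradients are assumed to be supplied in Hartree/Å; values held in other units would need to be converted first.

```python
import numpy as np

# Thresholds quoted in the text.
GRAD_TOL = 1.0e-3    # Hartree/Å, largest gradient component
ENERGY_TOL = 1.0e-5  # Hartree per atom, energy change between steps
STEP_TOL = 0.1       # Å, largest atomic displacement between steps


def is_converged(e_prev, e_curr, grad, pos_prev, pos_curr):
    """Simplified convergence check for one optimization step (hypothetical helper).

    e_prev, e_curr     : total energies of two consecutive steps (Hartree)
    grad               : (R, 3) nuclear gradients of the current step (Hartree/Å, assumed)
    pos_prev, pos_curr : (R, 3) Cartesian coordinates of both steps (Å)
    """
    grad = np.asarray(grad, dtype=float)
    shift = np.asarray(pos_curr, dtype=float) - np.asarray(pos_prev, dtype=float)
    n_atoms = len(grad)

    energy_ok = abs(e_curr - e_prev) / n_atoms < ENERGY_TOL
    grad_ok = np.max(np.abs(grad)) < GRAD_TOL
    step_ok = np.max(np.linalg.norm(shift, axis=1)) < STEP_TOL
    return bool(energy_ok and grad_ok and step_ok)


# Toy two-atom example with made-up numbers.
positions = np.zeros((2, 3))
small_grad = np.full((2, 3), 5.0e-4)
print(is_converged(-10.0, -10.00001, small_grad, positions, positions + 0.01))
```

All three conditions must hold at once; a step with small gradients but a large energy change is still treated as unconverged in this sketch.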

Data records
The data are made available at the Materials Cloud Archive [42]. Each system's trajectory is stored in an individual NumPy [43] .npz file. Table 1 describes the data types held in each file. Every file stores the complete geometry optimization trajectory, including atomic coordinates, system energies, nuclear gradients, lattice vectors and stress tensors for each geometry optimization step. Entries at the first position correspond to the input structure; the last position holds the data for the final, optimized structure. Hirshfeld partial charges [44] are provided for the final (optimized) geometries. Atomic coordinates and lattice vectors are stored in ångström; all other properties are stored in atomic units.

Table 1. Overview of the data structures stored in a .npz file. Each array can be accessed through the respective key. The variables N and R denote the number of geometry optimization steps and the system size, respectively. Partial charges are only computed for the last geometry.
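Reading one trajectory file might look like the sketch below. The actual key names are defined in Table 1 and are not reproduced here, so the keys `energies`, `coordinates` and `lattice` are hypothetical placeholders. The example writes a small synthetic file so that it runs without the data set; for a downloaded file, `data.files` lists the keys that are really present.

```python
import os
import tempfile

import numpy as np

# Build a tiny synthetic stand-in for one trajectory file.
# Key names, shapes and values are assumptions for illustration only.
n_steps, n_atoms = 4, 3
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "EXAMPLE.npz")
np.savez(
    path,
    energies=np.linspace(-1.0, -1.3, n_steps),            # shape (N,)
    coordinates=rng.random((n_steps, n_atoms, 3)),        # shape (N, R, 3)
    lattice=np.tile(10.0 * np.eye(3), (n_steps, 1, 1)),   # shape (N, 3, 3)
)

with np.load(path) as data:
    # Map every stored key to the shape of its array.
    shapes = {key: data[key].shape for key in data.files}
    initial_xyz = data["coordinates"][0]    # first entry: input structure
    final_xyz = data["coordinates"][-1]     # last entry: optimized structure
    final_energy = float(data["energies"][-1])

for key, shape in shapes.items():
    print(key, shape)
```

Indexing with `0` and `-1` follows the convention stated above, where the first entry is the input structure and the last entry is the optimized one.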

Technical Validation
The complete data set includes geometry optimizations of 226 systems, resulting in a total of 32550 geometries. System sizes range between 15 and 334 atoms (mean: 126). We illustrate the convergence of all reference calculations in Fig. 1, showing that all optimized systems are well within the defined convergence criteria. Elemental occurrences in the data set are listed in Table 2. […] Fig. 2 as the most prominent geometrical descriptors. As most of the initial structures from the IZA database are idealized geometries [45], the Si-O bond distances are sharply distributed around a mean of roughly 161 pm (Fig. 2a, blue histogram). For the geometry-optimized structures, the long tails of the distribution vanish and the mean shifts to approximately 164 pm (Fig. 2a, orange histogram). For the Si-O-Si angles, a slight shift towards smaller values is observed (mean of 149 vs. 142 degrees, Fig. 2c). Both effects have been previously reported by Fischer et al. [35,36] and are inherent to the selected level of theory. Distributions of the Si-Si distances in the second coordination sphere do not shift significantly between initial and optimized geometries (Fig. 2b). Relative changes in the cell volumes are presented in Fig. 3 as the ratio of each system's optimized-to-initial volume. Values below 1 correspond to a unit cell that shrinks as the optimization progresses. Overall, the geometrical descriptors are in good agreement with experimental data [46][47][48][49][50][51]. Additional averages for bond distances and angles are summarized in Tables 3 and 4, respectively. Distributions of energies, atomic gradients, cell volumes and stress tensors are depicted in Fig. 4. As expected for geometry optimization trajectories, all properties except the relative cell volumes have a distinct mean close to zero.
Structures close to the initial input geometries contribute to the relatively high standard deviations. Evaluation of the relative cell volumes shows a shifted distribution, with roughly 76% of all structures having a larger volume than their respective optimized geometry. A detailed overview of all calculated structures, sorted by their IZA three-letter-code, the system size and number of iterations is provided in Online Table 1.
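Two of the geometrical descriptors discussed in this section, interatomic distances in a periodic cell and the optimized-to-initial volume ratio, can be computed along the following lines. This is a minimal sketch and not the analysis code used for the figures: the function names are hypothetical, lattice vectors are assumed to be stored as rows in ångström, and the fractional-coordinate wrap is only a reliable minimum-image search for orthorhombic cells or for distances that are short compared with the cell dimensions.

```python
import numpy as np


def minimum_image_distance(r_i, r_j, lattice):
    """Distance between two atoms under periodic boundary conditions (Å)."""
    cell = np.asarray(lattice, dtype=float)        # rows are the cell vectors
    delta = np.asarray(r_j, dtype=float) - np.asarray(r_i, dtype=float)
    frac = np.linalg.solve(cell.T, delta)          # Cartesian -> fractional
    frac -= np.round(frac)                         # wrap into [-0.5, 0.5]
    return float(np.linalg.norm(cell.T @ frac))    # fractional -> Cartesian


def cell_volume(lattice):
    """Unit cell volume from the three lattice vectors (Å^3)."""
    return float(abs(np.linalg.det(np.asarray(lattice, dtype=float))))


def volume_ratio(lattice_initial, lattice_final):
    """Optimized-to-initial volume ratio; values below 1 mean the cell shrank."""
    return cell_volume(lattice_final) / cell_volume(lattice_initial)


# Toy example: two atoms that are neighbours only across the cell boundary.
cubic = 10.0 * np.eye(3)
distance = minimum_image_distance([0.5, 0.0, 0.0], [9.5, 0.0, 0.0], cubic)
ratio = volume_ratio(cubic, 9.9 * np.eye(3))
print(distance, ratio)
```

For strongly skewed cells or larger cutoffs, a neighbour search over adjacent periodic images would be the safer choice.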

Usage Notes
No data points were filtered as outliers with regard to the distributions of chemical properties (see Fig. 4). Consecutive structures from the same optimization trajectory will be autocorrelated. The data repository provides an interactive plotting script that displays the system energy, the maximum absolute component of the nuclear gradients and the cell volume at every iteration step for each structure. The script requires the Bokeh [52] (v. 2.3.1) package for Python. SHA-1 hash sums are provided for each file to verify data integrity, and an example input script for a calculation with BAND is included. Naming conventions: derived materials are referred to by their IZA three-letter code, e.g. H-EU-12 is tabulated as ETL_0. Leading non-alphabetical characters have been removed, e.g. *-ITN is tabulated as ITN.
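A downloaded file can be checked against its published SHA-1 hash sum roughly as follows. This is a sketch under assumptions: the helper name `sha1_of_file` is hypothetical, and the format in which the repository lists its hash sums is not specified here, so the expected digest is simply passed in as a hexadecimal string. The demonstration uses a temporary file rather than a file from the data set.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file containing the bytes b"abc".
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as handle:
    handle.write(b"abc")

expected = "a9993e364706816aba3e25717850c26c9cd0d89d"  # SHA-1 of b"abc"
matches = sha1_of_file(path) == expected
print("checksum ok" if matches else "checksum mismatch")
```

Reading in chunks keeps memory use constant, which matters little for small trajectory files but costs nothing.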

Code availability
The Atomic Simulation Environment [29] (v. 3.21.1) and NumPy [43] (v. 1.20.1) packages for Python are freely available for download. The Amsterdam Modeling Suite [31] (v. 2020.203, r92091) is commercial software, for which a free trial may be requested at www.scm.com.