Abstract
Discovering new materials is a challenging task in materials science crucial to the progress of human society. Conventional approaches based on experiments and simulations are laborintensive or costly with success heavily depending on experts’ heuristic knowledge. Here, we propose a deep learning based Physics Guided Crystal Generative Model (PGCGM) for efficient crystal material design with high structural diversity and symmetry. Our model increases the generation validity by more than 700% compared to FTCP, one of the latest structure generators and by more than 45% compared to our previous CubicGAN model. Density Functional Theory (DFT) calculations are used to validate the generated structures with 1869 materials out of 2000 are successfully optimized and deposited into the Carolina Materials Database www.carolinamatdb.org, of which 39.6% have negative formation energy and 5.3% have energyabovehull less than 0.25 eV/atom, indicating their thermodynamic stability and potential synthesizability.
Similar content being viewed by others
Introduction
Understanding the relationship of structures and functions is one of the most important questions in many disciplines such as chemistry, materials science, biology, which is critical for rational design of structures for achieving specific e.g. molecule, protein, or materials functions. However, the sophistication of the physical, chemical, and geometric atomic interactions makes it challenging to exhaustively enumerate the “design rules” for heuristic design approaches. On the other hand, the lack of sufficient data, the diversity of the hypothetical structure types, along with the strong physicochemical constraints makes it hard even for data driven approaches to structure design.
Here, we propose a physics guided deep learning model for generative design of crystal materials. Solid crystals such as ionic conductors, perovskites, photovoltaics, and piezoelectrics, play an important role in modern industries. Over centuries, humanity has dedicated significant amount of efforts to discovering highperformance functional materials. However, to date, only around 250,000 inorganic materials have been experimentally determined as collected by the ICSD database^{1}, which only covers a tiny portion of the almost infinite material design space considering the combinatorial space with the number of elements cross the periodic table and the total 230 possible symmetries of crystal structures. Traditional trialanderror tinkering methods for materials discovery are mainly reliant on domain experts’ knowledge^{2}, which is timeconsuming and laborintensive. To meet the high demand for new functional materials, we need more efficient strategies to explore the vast chemical space to accelerate the materials discovery process.
Currently, a popular approach to generating materials is based on element substitution of existing materials combined with highthroughout virtual screening (HTVS)^{3}. The whole process contains three steps: (1) combinatorially substituting elements in known crystal structures, (2) optimizing the candidate structures using density functional theory (DFT) calculations, and (3) experimental verification. Typical large computational materials databases created by HTVS are Materials Project (MP)^{4} and Open Quantum Materials Database (OQMD)^{5}. Despite its promising usage in material design, a fundamental drawback of HTVS is that it cannot generate materials beyond the structural prototypes of existing materials. It is also extremely computationally intensive and its success rate heavily depends on experts’ intuitions.
One way to overcome the drawbacks of HTVS for discovering materials is to perform crystal structure prediction for candidate material compositions using global optimization techniques, which are used to identify their stable structural phases. Simulated annealing has been used to predict the structures of alloys^{6} and boron nitride^{7}. The minima hopping^{8} is another algorithm for finding unknown crystalline structures^{9}. Two widely used crystal structure prediction (CSP) algorithms are USPEX^{10} and CALYPSO^{11}, which use evolutionary algorithms and particle swarm optimization for finding crystal structures. Despite their success in a variety of cases, these CSP based approaches for materials discovery suffer from their limited applicability to only relative simple structures usually with small number of atoms in unit cell.
Another promising approach to design solid materials beyond known crystal structure prototypes is generative deep learning models^{12,13,14}, which can learn data distribution (knowledge of forming stable crystal structures) from known materials and then sample from it to generate materials. Variational Autoencoder (VAE)^{15} and Generative Adversarial Network (GAN)^{16} are two popular generative models used to generate materials. A VAE^{15} model consists of two deep neural networks, an encoder and a decoder. The encoder is trained to encode materials into latent vectors and the decoder reconstructs the materials from the latent vectors. After training, different strategies can be used to sample the latent space and then use the decoder to generate materials. iMatGen^{2} is believed to be the first work that uses VAE to realize the inverse design of solid materials. It encodes unit cells into 3D grid based representations, and spherical linear interpolation and Gaussian random sampling are used to sample from the latent space to generate materials. Hoffmann et al.^{17} extend iMatGen by combining a UNet module to segment reconstructed 3D voxel images into atoms. Based on iMatGen and Hoffmann et al., ICSG3D^{18} integrates formation energy per atom into 3D voxelized solid crystals, which enables the VAE to encode materials and energy simultaneously. This makes it possible to generate materials subject to userdefined formation condition. Another approach to represent 3D crystals is to encode 2D crystallographic representations as the combination of the real space and the reciprocalspace Fouriertransformed features^{19}. In CDVAE^{20}, a diffusion network is trained to generate material structures^{21}, in which a diffusion process within their diffusion variational autoencoder moves atoms into positions in the lower energy space to generate stable crystals. All these models have difficulty in generation of high quality structures with high symmetry (e.g. space group number >=62) due to their neglecting the structure symmetry in their generation models, a major special characteristic of periodic crystal structures. A GAN model^{16} consist of two deep neural networks, a generator and a discriminator (critic). The generator creates fake materials with inputs of random vectors with or without conditioning on elements and space groups while the discriminator tries to tell real materials from generated ones. With learnt knowledge of forming crystals, the generator can directly create materials. The first method to generate materials using GAN is CrystalGAN^{22}, which leverages a CycleGAN^{23} to generate ternary materials from existing binaries. However, it remains uncertain whether CrystalGAN can be extended to produce more complex crystals. GANCSP^{24} and CubicGAN^{12} are two GAN based generation models that directly encode crystal structures as matrices containing information of fractional coordinates, element properties, and lattice parameters, which are fed as inputs to build models that generate crystals conditioned on composition or both composition and space group. The major difference between them is that GANCSP can only generate structures of a specific chemical system (e.g. MgMnO system) while CubicGAN can generate structures of diverse systems of three cubic space groups. In CCDCGAN^{25}, Long et al. use 3D voxelized crystals as inputs for their autoencoder model, which then converts them to 2D crystal graphs, which is used as the inputs to the GAN model. A formation energy based constraint module is trained with the discriminator, which automatically guides the search for local minima in the latent space. More recently, modern generative models such as normalizing flow^{26,27} and diffusion models have also been^{20} (CDVAE) or planned to be^{28} applied to crystal structure generation. Less related works include MatGAN^{29} and CondGAN(x^{bp})^{30} developed for generating only chemical compositions.
Despite the success of VAEs and GANs in material generation^{2,12,20}, all current generative models have several major drawbacks. For example, the iMatGen algorithm^{2} can only generate structures of a specific chemical system such as vanadium oxides and only several metastable VxOy materials were discovered out of 20,000 generated hypothetical materials. Similarly, GANCSP^{24} and CrystalGAN^{22} only generate for a given chemical system (e.g. MgMnO system and hydride systems). VAEUNet pipeline developed in^{18} expands the diversity of generated materials and can reconstruct the atom coordinates more accurately by incorporating UNet segmentation and conditioning on properties. However, VAEUNet still confines itself to cubic crystal system generation and the number of atoms in a unit cell is restricted to no more than 40. All above discussed works do not realize highthroughout generation of crystal materials. CubicGAN^{12} is an early public example of a highthroughput generative deep learning model for (cubic) crystal structures, which has discovered four prototypes with 506 materials confirmed to be stable by DFT calculations. Although CubicGAN has generated millions of crystal structures with hundreds of stable ones confirmed, the generated structures are limited to three space groups in the cubic crystal system, of which the atom coordinates are assumed to be multiples of 1/4: it is not capable of generating generic atom coordinates. While these works open the door to generative design of materials, several unique challenges still remain that prevents effective generative design: (1) how to learn the physical atomic constraints of stable materials to enable efficient sampling; (2) how to achieve precise generation of atom fractional coordinates and lattice parameters; (3) how to handle the extreme bias of the distribution of materials in the 230 space groups; (4) how to exploit the high symmetry of crystal structures in the generation process.
In this work. we introduce a physics guided crystal generative model (PGCGM) to exploit the physical rules for addressing aforementioned challenges. Our contributions are summarized as follows:

1.
We present a physics guided deep generative model for crystal generation that combines the space group affine transformation and an efficient selfaugmentation method.

2.
We propose two physicsoriented losses based on atomic pairwise distance constraints and structural symmetry to fuse the physical laws into deep learning model training.

3.
We evaluate our model against two baselines to show its superiority and perform DFT calculations to validate our generated structures with high success rate (93.5% can be optimized successfully).
Results
Dataset
We collect our material data from MP^{4}, ICSD^{1} and OQMD (v1.4)^{5}. In total, 42,072 ternary materials with 20 space groups (5 crystal systems) are curated when we start this project (The choice of 20 space groups is due to the limited structure samples for other remaining space groups. Our model is applicable to any space group given sufficient training samples). We use a 80–20 random training/validation split for all of our experiments. We term the dataset with 42,072 materials as MIO. When conducting this project, the latest version of OQMD is just yet released. There are 9441 ternary materials that are filtered by the same criteria and are brandnew materials in the latest OQMD (v1.5). We use these 9441 ternary materials as our test dataset TST to compare our method with two baselines. Details regarding dataset collection are in Dataset Curation section of supplementary materials.
Generation performance
We compare PGCGM with two latest algorithms that can generate crystals with multiple chemical systems instead of only a special group of materials (e.g. VxOy and MgMnO systems)^{2,24}. FTCP^{19} combines real space properties (e.g., atom coordinates) and momentumspace properties (e.g. diffraction pattern) to represent crystal structures. Then a CNN based VAE is trained for materials generation. CubicGAN^{12} trains a WGANGP^{31} to generate cubic structures in three space groups and here we expand the original method to 20 space groups.
The performance is shown in Table 1. For each method, we sample 500,000 structures and for PGCGM and CubicGAN, we perform atom clustering and merging. However, our atom clustering and merging cannot proceed with materials generated by FTCP and then we did not perform atom clustering and merging on those materials. The percentage of Crystallographic Information Files (CIFs) that are readable by pymatgen^{32} are shown in the CIFs column. Here readable means it can be proceeded by pymatgen.core.structure.Structure.from_file. We can find that PGCGM+dist has the largest percentage of materials left and PGCGM+dist+coor comes next. It tells us that distance and coordinates losses play a big part in generating readable materials. For later percentage related metrics, we use the number of CIFs left of each method as denominator. Our model significantly outperforms FTCP by 36.4% in terms of distance validity and is slightly better than CubicGAN. In terms of distance validty, our model outperforms FTCP and CubicGAN by 6.5% and 27.0%, respectively. Since validity are relatively weak metrics, property distribution is further used to provide a stronger metric to evaluate whether the generated materials are realistic. Our model significantly outperforms both two baselines. In terms of minimum atom distance, PGCGM decreases wasserstein distance (WD) by 1.461 compared to FTCP and by 0.402 compared to CubicGAN. In terms of maximum atom distance, PGCGM+dist+coor decreases WD by 0.264 compared to FTCP and by 2.986 compared to CubicGAN. Although CubicGAN has a close minimum atom distance distribution to PGCGM, the much bigger gap of maximum atom distance distribution between CubicGAN and PGCGM+dist+coor indicates that CubicGAN tends to generate larger crystal structures. In terms of density, PGCGM+dist decreases WD by 2.130 compared to FTCP and by 3.106 compared to CubicGAN. PGCGM also achieves the best diversity score even though it generates more readable CIFs than FTCP, which further shows that FTCP is not able to generate not only physically realistic materials but also materials with restricted diversity of formulas. We choose PGCGM+dist+coor as our finalized model to generate materials for late analysis since PGCGM+dist+coor has better properties distribution performance than PGCGM and PGCGM+dist on average.
Analysis of materials optimized by Bayesian optimization with symmetry relaxation
Bayesian optimization with Symmetry Relaxation (BOWSR) algorithm^{33} is an approach that uses Bayesian optimization to iteratively search lower energy surface to optimize the crystal structures based on the properties predicted by deep learning methods, such as CGCNN^{34} and MEGNet^{35}. Instead of directly using expensive DFT for relaxing generated materials, we first use BOWSR to optimize structures generated by our model and two baseline models and then use DFT calculation to further relax them. We randomly select 2,000 generated materials with less than or equal to 32 atoms for FTCP, CubicGAN and PGCGM. We select 100 materials for 20 space groups equally generated by PGCGM. Note that we also use the same 2,000 materials of PGCGM for further DFT analysis. Because some space groups are underrepresented (with less than 100 materials) in CubicGANgenerated materials, we select all materials under these space groups and then we select materials for the rest of space groups proportionally to obtain 2000 materials. For FTCP, materials that can be successfully analyzed to have space groups by pymatgen get_space_group_info with symprec=0.1^{32} surprisingly all belong to space group P1, which means FTCP loses the significant symmetric constraints when generating materials. Our methods PGCGM and CubicGAN are much better than FTCP in terms of space groups retention. Moreover, it takes more than 10 times time to optimize materials generated by FTCP than by PGCGM and CubicGAN using BOWSR. We use StructureMatcher from pymatgen^{32} to match the generated materials with the corresponding optimized materials by BOWSR.
Table 2 shows the match rate and RMS displacement. The match rate is the percentage of materials satisfying the criteria ltol=0.2, stol=0.2, angle_tol=0.5 and then we calculate RMS displacement for the matched materials. Firstly we find that our method has a slightly higher number of successfully optimized materials by BOWSR. However, our method significantly outperforms FTCP and CubicGAN by 4200% and 34.23% in terms of match rate, respectively. It seems that FTCP has the best RMS displace but the extremely low match rate might tell us that BOWSR is hard to optimize materials with low symmetry, such as space group P1. CubicGAN comes next in terms of RMS displacement.
Analysis of rediscovering materials in training and test datasets
It would be helpful to show how fast our model can rediscover materials in training datase MIO and test dataset TST. To do this, we sample different number of materials and then calculate the percentage of materials rediscovered in generated materials. “Reduced Formula  Space Group ID  # of Atoms” is defined as prototype to identify unique materials in the existing and generated materials. Figure 1a shows the change of unique crystals and rediscovery rate over size of sampling materials. We start to sample materials from half million and the number ends at 60 million eventually. It is found that the percentage of unique materials (cyan line) are decreasing and gradually tend to grow flat as number of sampling materials increases. The rediscover rate of MIO (orange bars) increases consistently over the sampling process and it soars quickly to 42.6% when 35 million materials are sampled. Starting from 42.6%, the percentage of rediscovered materials in MIO grows smoothly and it reaches 52.0% when the sampling size is 60 million. Similar growing patterns can be observed for the rediscover rate for the test dataset, as shown by blue line in Fig. 1a. The rediscover rate reaches 43% at the end of 60 million sampling size. This percentage is lower than that of training dataset because of the different proportions over the 20 space groups in training and test datasets as shown in Supplementary Table 1 and Supplementary Table 2.
After rediscovering the materials in MIO and TST from the generated materials when sampling size is 60 million, we utilize StructureMatcher from pymatgen^{32} to test whether the generated materials match the rediscovered materials and to calculate RMS displacement between two matched structures considering all invariances of materials. Because one prototype might correspond to multiple structures in existing and generated materials, we only show the least RMS displacement by exhausting each pair of existing and generated materials for this prototype. The match rate is the percentage of materials satisfying the criteria ltol=0.2, stol=0.3, angle_tol=5.0. The match rate and RMS are 25.4% and 0.05 for training dataset and are 17.7% and 0.085 for test dataset, respectively. Figure 1b shows parity plot that compares generated lattice lengths against DFT calculated lattice lengths. Surprisingly, the corelation between the discovered materials in test dataset and generated materials is better than in training dataset in terms of R^{2}. The R^{2} for lattice a, b, and c in test dataset are 0.606, 0.616, and 0.606, respectively as in Fig. 1b, which increases R^{2} as in training dataset by a factor of six except for lattice c. The rediscovered materials in training dataset have larger lattice a and b and we find that these materials mostly are with cubic space groups. It seems that our approach tends to generate more realistic lattice for noncubic space groups than cubic space groups in rediscovered materials.
DFT validation
We use the same 2000 materials as in Bayesian optimization for DFT verification. Out of 2000 generated crystals, 93.5% (1869) are successfully optimized, which is significantly better than 33.8% of CubicGAN as reported in^{12}. Figure 2a demonstrates the distribution of formation energy of successfully optimized materials after removing 6 materials with formation energy larger than 10 eV. It is observed that most of the materials have formation energy around 0 eV and 39.6% of them have negative formation energy. Negative formation energy indicates potentially stable materials. Figure 2b shows the distribution of 1579 materials that have energyabovehull after removing one material with super large energyabovehull (1160.8 eV). The energyabovehulls are calculated using the Pymatgen’s Phase diagram analyzer^{32}. Energy above hull is a stronger indicator whether the materials are stable or not. Overall, 3 materials with energy above hull of 0 eV/atom and 106 (5.3%) ones with energy above hull less than 0.25 eV/atom, which further indicates our model can generate reliably stable materials. All the optimized materials are included in the supplementary materials.
Pairwise atom distance based loss not only constrains the two atoms in a reasonable range, but also helps generate lattice lengths close to DFTcalculated ones. To demonstrate this, we calculate relative error, R^{2}, RMSE, and O (outliers percentage) for lattice lengths for 1869 materials as shown in left panel of Fig. 3 and for only 293 cubic materials by PGCGM and 14,432 cubic materials by CubicGAN as shown in right panel of Fig. 3. In terms of relative error, we can find that the mean relative error of lattice lengths is much more close to zero regardless of when comparing 1869 materials or just cubic materials by PGCGM with cubic materials by CubicGAN, which indicates that PGCGM tends to generated precise lattice lengths. In addition, the outliers of lattice lengths in 1869 materials by PGCGM scatter across 100% and cubic materials from 1869 ones only have two outliers compared to CubicGAN whose outliers cluster near to 150% even though CubicGAN overall has a lower outliers percentage. We also evaluate the lattice lengths generation performance between PGCGM and CubicGAN with R^{2} and RMSE. For 1869 materials, the generated lattice lengths by PGCGM better fit to the DFT calculated lattice lengths than CubicGAN in terms of R^{2}. In terms of RMSE, PGCGM is generally slight better CubicGAN. When only comparing cubic materials, PGCGM significantly outperforms CubicGAN in terms of both R^{2} and RMSE. Although it is not a direct comparison between PGCGM and CubicGAN, all this findings indicate that our model can generate high quality materials with reasonable lattice lengths.
In Table 3, 20 structures with lowest formation energy are selected for 20 space groups as in the dataset. Before any postprocessing, both materials has a large number of atoms. It is easily found that the atoms of the same elements are crowded together in column of GEN. After clustering and merging the atoms of the same elements, the number of atoms drop rapidly, such as space group 164 (from 36 to 14) and space group 227 (from 432 to 32) as shown in column of MER. Column OPT shows the crystal structures after DFT optimization. Columns of Formula (GEN) and Fromula (MER) show the formulas before and after clustering and merging, respectively. We can find that the formulas may be changed after clustering and merging even though we only keep materials that do not change space group when conducting clustering and merging. The reason behind this is because we employ base atom sites in generated materials. Note that line 10 of Algorithm 1 tends to fail because of the decimal mantissa of base atom sites which in return can easily lead to the large number of atoms when converting from base atom sites to full atom sites via Algorithm 1, particularly space groups with large affine matrix, such as 227 and 225 as shown in Table 3.
Case study of three example stable materials
We discovered three compounds with Mg_{2}GaIr, SrYO_{6} and ZnTe_{2}S_{6} chemical formulas, which are thermodynamically stable with negative formation energies and zero energy above hull. The structures we found have P6_3/mc [194] (hexagonal), Pm3 [200] (cubic) and [148] (trigonal) space group symmetries [space group number] (crystal systems) for Mg_{2}GaIr, SrYO_{6} and ZnTe_{2}S_{6} compounds, respectively. Figure 4a shows the structures of those three materials. The lattice parameters for Mg_{2}GaIr material are a = b = 4.38 Å, c = 8.54 Å, α = β = 90^{0} and γ = 120^{0}, while that for SrYO_{6} material are a = b = c = 4.61 Å and α = β = γ = 90^{0}. Moreover, the lattice parameters for ZnTe_{2}S_{6} material are a = b = c = 7.57 Å and α = β = γ = 49.38^{0}. As shown in Table 4, our spinpolarized DFT calculations show that the both Mg_{2}GaIr and ZnTe_{2}S_{6} compounds have the nonmagnetic ground states, whereas SrYO_{6} material has a ferromagnetic ground state with a total magnetic moment of 1 μ_{B}. Figure 4b contains the electronic band structures for each stable material. It is clear that both Mg_{2}GaIr and ZnTe_{2}S_{6} compounds are metals. However, we can see spinsplitting in SrYO_{6} ferromagnetic material. In this compound only spindown electrons cross the Fermi level, while spinup electrons have a band gap of 3.09 eV. Thus, this is a halfmetal where spindown electrons show metallic character, while spinup electrons are insulating. Half metallicity is widely investigated for spintronics and it is vital for developing memory devices and computer processors^{36}.
Table 5 contains the elastic constants and the mechanical properties of the stable materials. The elastic stability criteria (Born criteria) for Mg_{2}GaIr with P6_3/mm space group symmetry are C_{11} > ∣C_{12}∣, \(2* {C}_{13}^{2} < {C}_{33}({C}_{11}+{C}_{12})\), and C_{44} > 0. The Born criteria for SrYO_{6} with Pm3 space group symmetry are C_{11} − C_{12} > 0, C_{11} + 2C_{12} > 0, C_{44} > 0, and that for ZnTe_{2}S_{6} with R3 space group symmetry are C_{11} > ∣C_{12}∣, \({C}_{13}^{2} < 0.5* {C}_{33}({C}_{11}+{C}_{12})\) and \({C}_{14}^{2}+{C}_{15}^{2} < 0.5* {C}_{44}* ({C}_{11}{C}_{12})\), and C_{44} > 0^{37}. It is clear that all those three materials comply with their elastic stability criteria implying they are mechanically stable. Table 5 also has the bulk (B) modulus, shear (G) modulus and Young’s (Y) modulus and Poisson’s ratio (ν) for Mg_{2}GaIr, SrYO_{6}, and ZnTe_{2}S_{6} compounds. We used Hill approximation as implemented in vaspkit code^{38,39}. It is clear that Mg_{2}GaIr and SrYO_{6} materials have approximately same B, G, and Y values, while those values for ZnTe_{2}S_{6} compound is considerably lower. In Table 5, ν of Mg_{2}GaIr has the highest value, while lowest ν can be obtained from ZnTe_{2}S_{6}. Furthermore, we used Phonopy code^{40} to calculate the phonon dispersion relations for the above materials. As shown in Fig. 4c, there are no imaginary phonon modes (negative frequencies) indicating those three materials are dynamically stable at 0 K.
In this work, we propose a physics guided deep crystal generative model (PGCGM), in which two kinds of physics based losses are invented in the generator to improve the quality of generated materials. The atom distance based losses constrain the atom distance in a certain range in the generated materials and thus the generated lattice parameters fall into reasonable range too. To fulfill the symmetry requirements, the model transforms the implicit rules between base atoms sites and full atom sites into explicit cost functions. Two baseline methods are compared and PGCGM achieves the best performance across all evaluation metrics. In particular, PGCGM significantly outperforms the two baseline models in terms of property distribution metric which is a much stronger indicator to show the reality of the generated materials^{20}. In addition, we use BOWSR to optimize 2000 randomly selected materials in each method. Our approach has the best match rate calculated between the Generative modelgenerated materials and BOWSRoptimized materials, which further demonstrate our method can generate realistic materials.
In order to see how our approach can rediscover materials in existing databases, we sample different size of materials and calculate rediscover rate for training and test datasets. We can observe a clear trend of increased rediscover rate over sampling size. There is no clear saturation point of rediscover rate at the end of 60 million sampled materials as in CubicGAN^{12}. The reasons are: (1) the possible design space of 20 space groups (5 crystal systems) in this work are much bigger than 3 space groups (only cubic crystal systems) in CubicGAN; (2) CubicGAN uses special fractional coordinates while PGCGM generates fractional coordinates in full space, which means PGCGM has a significantly broader space to explore new materials. Furthermore, 1869 out of 2000 materials are successfully optimized by DFT calculation. Among 1869 materials, 39.6% possess negative formation energy and 5.3% with energy above hull less than 0.25 eV/atom, indicating that invented physics guided losses help generate stable crystal structures effectively. This research gives a deep insight into how physics losses help generate realistic materials and offers an approach to expand the diversity of generated materials.
Due to the difficulty to generate 500,000 structures by the diffusion based generative model CDVAE, we conduct a smallscale evaluation and comparison. We generated 1100 structures using CDVAE, among which 78.2% (860) are pymatgen readable. We then analyzed the space groups of these 860 structures and find out of the top 10 space groups with the most structures (a total of 790), 97.8% (773) structures have low symmetry with space group number less or equal to 25 and 582 (73.7%) of them have no symmetry. However, it is shown^{41} that highsymmetry materials tend to form stable structures and have larger potential to have good functional properties. It seems that like other VAE based generator models, CDVAE also has difficulty in generating highsymmetry structures as PGCGM does.
Methods
Problem statement and notations
Our datadriven generative models of crystal structures are first trained with known crystal structures in materials databases. Since crystal materials are periodic structures, instead of representing the infinite structures, the structure of an inorganic material is represented by a unit cell in material science, which is the smallest unit that completely reflects the arrangement of atoms in the 3D space. Given a generated unit cell, it can be used to build the periodic structure of an inorganic material by repeating it multiple or infinite times along three directions to form a super cell. A material \({{{\mathcal{M}}}}\) can then be denoted as following:
where

(a)
\({{{\bf{E}}}}=({e}_{0},{e}_{1},{e}_{2})\in {\mathbb{E}}\) denotes elements in materials, where \({\mathbb{E}}\) is the element set in periodic table. In this work, we only deal with ternary materials so that there are only three unique elements in the unit cell;

(b)
\({{{\bf{B}}}}=({{{{\bf{b}}}}}_{0},{{{{\bf{b}}}}}_{1},{{{{\bf{b}}}}}_{2})\in {{\mathbb{R}}}^{3\times 3}\) denotes the base atom sites, which are the symmetry equivalent positions. termed . b_{i} is fractional coordinates of an atom denoted by \({\left[u,v,w\right]}^{T}\). We choose materials that one element only has one base atom site so that three atom sites can be used to represent the atom positions. Moreover, any one atom site of each element can be considered as the base atom site for that element;

(c)
\({{{\bf{P}}}}=(a,b,c,{{{\boldsymbol{\alpha }}}},{{{\boldsymbol{\beta }}}},{{{\boldsymbol{\gamma }}}})\in {\mathbb{R}}\) are six lattice parameters that define three lengths and three angles of the unit cell;

(d)
\({{{\bf{O}}}}=({{{{\bf{t}}}}}_{0},{{{{\bf{t}}}}}_{1},\ldots ,{{{{\bf{t}}}}}_{n})\in {{\mathbb{R}}}^{n\times 4\times 4}\) denotes affine matrix that represents the symmetry operations defined by space groups sgp. t_{j} is one affine operator containing the rotation and translation matrices. n is determined by space groups. Generally the higher symmetry of a space group, the larger n. n can be as small as 1 or as large as 192.
Now we can model the generation of materials as follows:
where f_{θ} is the generative model that learns the knowledge of forming crystal structures given inputs of random noise Z, element set E, and space group sgp.
Overall architecture of physics guided crystal generative model
Our Physics Guided Crystal Generative Model (PGCGM) is shown in Fig. 5. The PGCGM mainly consists of four major components: (1) discriminator, (2) generator, (3) selfaugmentation, and (4) atom distance matrix/loss calculation module. In the generator and discriminator, affine matrix is integrated into the training to generate fake materials and tell fake materials from real ones, respectively. Affine matrix is related to symmetry information for space groups. The implicit combination of affine matrix and base atom sites can help keep the symmetry when generating materials. Selfaugmentation increases the training materials for underrepresented space groups by randomly forming base atom sites. With three sets of base atom sites, we can not only have a fixed size of input to the discriminator, but also deduce more physical information for crystals to help the discriminator better distinguish real materials from fake ones. Furthermore, we design two kinds of physics guided losses. Any set of base atom sites can be converted to full set of unique atom sites. When generating three sets of base atom sites, it implicitly can be stated that the three sets of base atom sites should be different but the full atom sites converted from them separately ought to be same. Hence a specific loss is invented to explicitly incorporate this rule into training of the generator. In order to restrict the two atoms in the 3D space to be not too close or not too distant, inter and intraatom distance losses are proposed. With distance loss, the generator further can generate reasonable lattice parameters in order to push any pair of atoms to fall into a certain range.
Discriminator
There are two input branches for crystal representation and affine matrix in Discriminator as in subfigure (b) of Fig. 5. Each branch is forwarded to a 2D convolutional block and the learnt features are concatenated together. The concatenated vector is sent to a couple of fully connected layers to get the discriminative score. We have three different sets of base atom sites in our inputs and with the affine matrix branch, it helps to implicitly learn the knowledge of how affine matrix transforms base atom sites into full atom sites. The detailed architectures of two convolutional blocks can be found in Table S3 in the supplementary materials.
Generator
The architecture of generator is shown in subfigure (a) of Fig. 5, which contains three branches. Conditioning on element constituents and space group, the generator outputs three sets of base atom sites \(\left({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2}\right)\) and unit cell length P^{*}. Then we reformalize Eq. (2) as follow:
Taking random noise Z, space group sgp, and element properties matrix E as inputs, the generator can generate a material with the same lattice parameters and space group but different representations of the base atom sites when merely sampling one material. Our goal here is that the generated three sets of base atom sites belong to the same material. Random noise Z is mapped to a dense vector a fully connected layer. The space group branch is the same as in discriminator. Element matrix E is forwarded to a 1D convolutional layer (Conv1D). The outputs of random noise and space group branches are combined together as the inputs to a multilayer perceptron (MLP) block to generate unit cell length \({{{{\bf{P}}}}}_{fake}^{* }\). The outputs of random noise and element branches are combined together as the inputs to 2D deconvolutional layers (ConvTran2D) to generate three sets of base atom sites \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\). The detailed descriptions for MLP, Conv1D, and ConvTran2D can be found in Supplementary Table 4.
Physics guided loss function
The original GAN^{16} is notoriously hard to train because of saturation and mode collapse in discriminator. We take advantage of WGANGP^{31} with gradient penalty to enhance the training stability in our network. WGANGP changes the Sigmoid function of the discriminator to a 1Lipschitz function while introducing a gradients penalty term to enforce the norm of gradients to be close to 1. The loss function is described in Eq. (4):
where \({\hat{{{{\mathcal{M}}}}}}^{* }\) is linearly interpolated between real materials \({{{{\mathcal{M}}}}}_{real}^{* }\) and fake materials \({{{{\mathcal{M}}}}}_{fake}^{* }\) and ϵ is uniformly sampled from 0 and 1. \({{{{\mathcal{L}}}}}_{dis}\) and \({{{{\mathcal{L}}}}}_{adv}\) represent the loss function of the discriminator and adversarial loss for generator respectively. The third term in \({{{{\mathcal{L}}}}}_{dis}\) is the gradient penalty and λ_{d} is set to 10. D(.) means the score result from the discriminator.
Atom Distance Losses. To ensure that the atoms in generated crystal structures are not crowded or not too far apart from each other, we introduce the inter and intraatom distance based losses as following:
where \({{{{\mathcal{L}}}}}_{inter}\) constrains the distance in H_{inter} which describes interatom distance matrices. \({\left[\max ({{{{\bf{H}}}}}_{inter},{\phi }_{inter}^{upper}{{{{\bf{S}}}}}_{inter}){\phi }_{inter}^{upper}{{{{\bf{S}}}}}_{inter}\right]}^{2}\) enforces the atom distance to be smaller than \({\phi }_{inter}^{upper}{{{{\bf{S}}}}}_{inter}\) and \({\left[\min ({{{{\bf{H}}}}}_{inter},{\phi }_{inter}^{lower}{{{{\bf{S}}}}}_{inter}){\phi }_{inter}^{lower}{{{{\bf{S}}}}}_{inter}\right]}^{2}\) enforces the atom distance to be bigger than \({\phi }_{inter}^{lower}{{{{\bf{S}}}}}_{inter}\). S_{inter} are atom radius sum corresponding to each pair of atoms in H_{inter} and \({\phi }_{inter}^{upper}\) and \({\phi }_{inter}^{lower}\) are control weights for upper and lower bound of interatom distance, respectively. In this way, the distance of two atoms is constrained to be in the grey area indicated by two circles in subfigure (d) of Fig. 5. Similarly, \({{{{\mathcal{L}}}}}_{intra}\) constrains the distance in a range in H_{intra} which describes intraatom distance matrices. S_{intra} are atom radius sum corresponding to each pair of atoms in H_{intra} and \({\phi }_{intra}^{upper}\) and \({\phi }_{intra}^{lower}\) are control weights for upper and lower bound of interatom distance, respectively. N is batch size and 9 is the number of distance value in H_{inter} and H_{intra}.
Symmetrycompliant Base and Average Full Coordinates Losses. The generator generates three sets of base atom sites \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\) which are used to generate the full site coordinates using the symmetric operation defined by the space group. The averaged transformation to \(({{{{\bf{F}}}}}_{fake}^{0},{{{{\bf{F}}}}}_{fake}^{1},{{{{\bf{F}}}}}_{fake}^{2})\) from base atom sites should be exactly same. With these implicit rules, we design two losses to explicitly enforce them in the generator as expressed below:
where cos is cosine similarity function. We normalize each coordinate value across the minibatch of size N. 9 is the number of coordinates.
Full Generator Loss. By combining above losses, we can achieve our full loss for the generator:
where λ_{1}, λ_{2}, λ_{3}, and λ_{4} are balancing parameters.
Crystal symmetry based selfaugmentation
Data augmentation is commonly used for images in which operations such as rotation of an image does not change its label. Similarly, selfaugmentation as we define here is used to do data augmentation based on the symmetryoriented Wyckoff position representation of CIF files. In the representation of symmetric crystals, the coordinates of the nonequivalent positions (Wyckoff positions) are just one of a set of possible positions as defined by the symmetric operations of the space group. So, for each structure file, we can use the set of symmetric operations of the space group to transform the Wyckoff position coordinates without changing the structure, which can then generate more equivalent structure samples. The generation of atom coordinates that meet the symmetry constraints is one of the most challenging tasks in crystal generation. In order to make the fixed size of representation for crystals (details in Table 7), we use base atom sites. As shown in subfigure (c) of Fig. 5, we can use any atom site of each element to form a set of base atom sites. Instead of randomly selecting them, we choose three atoms for three elements individually using steps as shown in below:

1.
Randomly select the first element e_{0} and one atom position b_{0} for it;

2.
Randomly select the second element e_{1} from the rest two elements and find the closest atom b_{1} to atom b_{0} in the first step;

3.
Calculate the atom distance from the atoms of the last element e_{2} to the atom b_{0} and the atom b_{1} respectively, then sum the atom distance elementwise and the atom of the last element with the smallest sum is considered as the closest atom b_{2} to the selected atoms in the first and second steps;

4.
Repeat Steps 2, 3, and 4 three times to obtain three sets of base atom sites \(({{{{\bf{B}}}}}_{real}^{0},{{{{\bf{B}}}}}_{real}^{1},{{{{\bf{B}}}}}_{real}^{2})\);

5.
Repeat last five steps 31 times.
In Step 4, we use three sets of base atom sites as part of inputs to the discriminator so that we can obtain more information from crystal structures. In this work, we obtain three sets of base atom sites 32 times repeatedly as in Step 5.
Atom clustering and merging
For crystals with high symmetry, the number of atoms in the unit cell tends to be very large after conversion by Algorithm 1. We propose a postprocessing method to reduce the number of atoms by clustering and merging. Firstly, we cluster the nearby atoms of the same elements by forming flat clusters from hierarchical clustering^{42,43}. The maximum atom distance allowed in our research is 1.2 times the atom radius sum. Secondly, we merge the atoms in the same clusters considering periodic attributes of crystal structures.
Implementation details
Materials representation
We use \({{{\mathcal{M}}}}=({{{\bf{E}}}},{{{\bf{B}}}},{{{\bf{P}}}},{{{\bf{O}}}})\) to completely describe a crystal material. As shown in mainframe of PGCGM, however, we use three sets of base atom sites (B^{0}, B^{1}, B^{2}). Thus here we reformulate a material as \({{{{\mathcal{M}}}}}^{* }=({{{{\bf{B}}}}}^{0},{{{{\bf{B}}}}}^{1},{{{{\bf{B}}}}}^{2},{{{\bf{P}}}},{{{\bf{E}}}},{{{\rm{sgp}}}})\). The space group sgp is used to link to the affine matrix O. We can use (B^{0}, B^{1}, B^{2}) in \({{{{\mathcal{M}}}}}^{* }\) to calculate physical properties as inputs to the discriminator and to design physicsbased losses. Three sets of base atom sites are useful for two reasons: (1) we want to add more crystal information for the discriminator and let the discriminator have enough information to tell real materials from fake ones; (2) With more base atom sites, we can calculate more atom distances as the physical constraints in the generator and the inputs to the discriminator.
All Fractional Coordinates We use affine matrix O to acquire the whole atom sites in the unit cell as shown in Algorithm 1. Since the number of affine operators in O varies in space groups, we zeropad the affine matrices as large as 192 × 4 × 4. We then transform each base atom site by the affine matrix and get a coordinates matrix F_{all} with shape of 192 × 3 × 3. Affine transformation leads to duplicate fractional coordinates. In material science, practitioners usually remove the duplicates. However, uniqueness calculation is not differentiable and it requires lots of time to do it. We choose to average along with the first dimension of F_{all} to get three sets of averaged full fractional coordinates (F^{0}, F^{1}, F^{2}), each of which is with shape of 3 × 3.
For a real material, base atom sites \(({{{{\bf{B}}}}}_{real}^{0},{{{{\bf{B}}}}}_{real}^{1},{{{{\bf{B}}}}}_{real}^{2})\) can be transformed into the same average full fractional coordinates, which means \({{{{\bf{F}}}}}_{real}^{0}={{{{\bf{F}}}}}_{real}^{1}={{{{\bf{F}}}}}_{real}^{2}\). When generating a fake material, base atom sites \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\) are supposed to belong to the same fake material, which hopefully results in \({{{{\bf{F}}}}}_{fake}^{0}={{{{\bf{F}}}}}_{fake}^{1}={{{{\bf{F}}}}}_{fake}^{2}\). However, the transformation of \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\) might slightly deviate from the goal. Thus using (F^{0}, F^{1}, F^{2}) in real and fake materials implicitly adds physical constraints, which somehow pushes the generator to generate different sets of base atom sites for a same material, which increases chances to generate good materials in return.
Base Cartesian Coordinates Three sets of Cartesian coordinates can be calculated for each set of base atom sites by Eq. (8) and we denote them by (C^{0}, C^{1}, C^{2}).
Atom Distance Matrices Given three sets of base atom sites (B^{0}, B^{1}, B^{2}), we calculate the atom distance matrices H_{inter} and H_{intra} as shown in subfigure (d) of Fig. 5. We firstly calculate pairwise different atom distance matrix for each base atom site B^{j}, j = 0, 1, 2 and return only values in upper triangle of corresponding distance matrix termed by H_{inter}. Then we select three atoms belonging to the same element to form a set of three atom sites for three elements and calculate pairwise same atom distance matrix and again return only values in upper triangle of corresponding distance matrix termed by H_{intra}. The final shape of H_{intra} and H_{inter} both is 3 × 3.
Lattice Parameters The volume of the unit cell can be calculated by lattice parameters P. We repeat the scalar volume three times to get the volume vector V. We also use the lattice matrix A in Eq. (9) as part of the inputs to the discriminator.
Element Properties We use 23 properties as shown in Table 6 to formalize element matrix E.
Now we list all parts of inputs to the discriminator in Table 7. P^{*} only contains the lengths because the angles are either (90°, 90°, 90°) or (90°, 90°, 120°) in the training materials. Thus instead of generating three angles in P for fake materials, we build a map between angles and the space group sgp. Then we concatenate all parts and a zero matrix of shape 3 × 3 into a matrix of shape of 3 × 64. The matrix is finally reshaped into 3 × 8 × 8 as the inputs to the discriminator.
Mathematical conversion in crystal representations
Fractional coordinates can be converted to Cartesian coordinates \({\left[x,y,z\right]}^{T}\) using^{44}:
where A is a lattice matrix calculated by lattice parameters P using:
where \(V=abc\sqrt{1{\cos }^{2}{{{\boldsymbol{\alpha }}}}{\cos }^{2}{{{\boldsymbol{\beta }}}}{\cos }^{2}{{{\boldsymbol{\gamma }}}}+2\cos {{{\boldsymbol{\alpha }}}}\cos {{{\boldsymbol{\beta }}}}\cos {{{\boldsymbol{\gamma }}}}}\) is the volume of the unit cell.
In order to acquire all atom positions in the unit cell, each base atom site can be converted by affine matrix O. The conversion procedure is summarized in Algorithm 1. Different materials vary from the number of atoms and the number of elements. In order to make a fixed size of inputs, we only use ternary materials in this research. After conversion shown in Algorithm 1, the number of atom (sites) also differs from materials. That is the reason why base atom sites (one element one base site) are used to represent atom positions. In addition, it should be noted that the calculation of the uniqueness at line 10 of Algorithm 1 is not differentiable and timeconsuming.
Algorithm 1
Generate unique coordinates using base sites () and affine matrix
Require: The space group sgp, the base atom sites B
1: O ← GetAffineMatrices(sgp)
2: n ← len(O)
3: coords ← an empty list
4: for i ← 1 to 3 do
5: add 0 to b_{i}
6: uniq ← an empty list
7: for j ← 1 to n do
8: c ← b_{i} ⋅ t_{j} − ⌊b_{i} ⋅ t_{j}⌋
9: pop last element from c
10: if c not in uniq then
11: add c to uniq
12: end if
13: end for
14: add uniq to coords
15: end for
16: return coords
Evaluation metrics
Past studies in crystal generation used different evaluation metrics, making it hard to compare different methods. Here, we create a set of metrics to evaluate our method and two baselines. (1) Validity^{20}. Following^{18}, we consider a crystal structure as valid when the shortest distance between any two atoms is bigger than 0.5 Å. Following CubicGAN, we calculate the overall charge of a crystal structure using pymatgen^{32} and if it is neutral, then it is valid. Also, we count the number of structures after postprocessing in our method and we apply the same postprocessing onto the CubicGAN. (2) Property distribution^{20} We calculate wasserstein distance (WD) between the property distribution of generated materials and materials in test dataset TST. The properties we used are minimum atom distance, maximum atom distance, and density. (3) Diversity. We calculate the diversity of compositions, which means the ratio of unique number of compositions in generated structures. (4) DFT validation of generated structures. We find that both the pairwise atomic distance and property distribution based performance metrics are indirect weak criteria in crystal structure generation as the major challenge in such models is to generate stable structures. It is thus critical to check the success rate of DFTbased structural relaxation, the thermodynamic stability (e.g. evaluated by DFT phonon dispersion calculation), and synthesizability (e.g. based on energyabovehull calculation).
Data availability
The raw crystal dataset is downloaded from http://www.materialsproject.org.
Code availability
The source code to generate crystals can be obtained from github at https://github.com/MilesZhao/PGCGM.
Change history
17 June 2023
A Correction to this paper has been published: https://doi.org/10.1038/s41524023010598
References
Belsky, A., Hellenbrandt, M., Karen, V. L. & Luksch, P. New developments in the inorganic crystal structure database (icsd): accessibility in support of materials research and design. Acta. Crystallogr. B Struct. Sci. Cryst. Eng. Mater. 58, 364–369 (2002).
Noh, J. et al. Inverse design of solidstate materials via a continuous representation. Matter 1, 1370–1384 (2019).
PyzerKnapp, E. O., Suh, C., GómezBombarelli, R., AguileraIparraguirre, J. & AspuruGuzik, A. What is highthroughput virtual screening? A perspective from organic materials discovery. Annu. Rev. Mater. Res. 45, 195–216 (2015).
Jain, A. et al. Commentary: The materials project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
Kirklin, S. et al. The open quantum materials database (oqmd): assessing the accuracy of dft formation energies. NPJ Comput. Mater. 1, 1–15 (2015).
Franceschetti, A. & Zunger, A. The inverse bandstructure problem of finding an atomic configuration with given electronic properties. Nature 402, 60–63 (1999).
Doll, K., Schön, J. & Jansen, M. Structure prediction based on ab initio simulated annealing for boron nitride. Phys. Rev. B 78, 144110 (2008).
Amsler, M. & Goedecker, S. Crystal structure prediction using the minima hopping method. J. Chem. Phys. 133, 224104 (2010).
FloresLivas, J. A. Crystal structure prediction of magnetic materials. J. Phys. Condens. Matter 32, 294002 (2020).
Glass, C. W., Oganov, A. R. & Hansen, N. Uspex—evolutionary crystal structure prediction. Comput. Phys. Commun. 175, 713–720 (2006).
Wang, Y., Lv, J., Zhu, L. & Ma, Y. Calypso: a method for crystal structure prediction. Comput. Phys. Commun. 183, 2063–2070 (2012).
Zhao, Y. et al. Highthroughput discovery of novel cubic crystal materials using deep generative neural networks. Advanced Science 8, 2100566 (2021).
Fuhr, A. S. & Sumpter, B. G. Deep generative models for materials discovery and machine learningaccelerated innovation. Front. Mater. 182 (2022).
SchwalbeKoda, D. & GómezBombarelli, R. Generative models for automatic chemical design. In Machine Learning Meets Quantum Physics, 445–467 (Springer, 2020).
Kingma, D. P. & Welling, M. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
Ian, G. et al. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27 (2014).
Hoffmann, J. et al. Datadriven approach to encoding and decoding 3d crystal structures. arXiv preprint arXiv:1909.00949 (2019).
Court, C. J., Yildirim, B., Jain, A. & Cole, J. M. 3d inorganic crystal structure generation and property prediction via representation learning. J. Chem. Inf. Model. 60, 4518–4535 (2020).
Ren, Z. et al. An invertible crystallographic representation for general inverse design of inorganic crystals with targeted properties. Matter 5, 314335.
Xie, T., Fu, X., Ganea, O.E., Barzilay, R. & Jaakkola, T. Crystal diffusion variational autoencoder for periodic material generation. arXiv preprint arXiv:2110.06197 (2021).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Nouira, A., Sokolovska, N. & Crivello, J.C. Crystalgan: learning to discover crystallographic structures with generative adversarial networks. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering (2019).
Zhu, J.Y., Park, T., Isola, P. & Efros, A. A. Unpaired imagetoimage translation using cycleconsistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, 2223–2232 (2017).
Kim, S., Noh, J., Gu, G. H., AspuruGuzik, A. & Jung, Y. Generative adversarial networks for crystal structure prediction. ACS Cent. Sci. 6, 1412–1420 (2020).
Long, T. et al. Constrained crystals deep convolutional generative adversarial network for the inverse design of crystal structures. NPJ Comput. Mater. 7, 1–7 (2021).
Ahmad, R. & Cai, W. Free energy calculation of crystalline solids using normalizing flows. Model. Simul. Mat. Sci. Eng. 30, 065007 (2022).
Wirnsberger, P. et al. Python materials genomics, P. Normalizing flows for atomic solids. Mach. Lear.: Sci. Tech. 3, 025009 (2022).
Baird, S. G. et al. xtal2png: a python package for representing crystal structure as png files. J. Open Source Softw. 7, 4528 (2022).
Dan, Y. et al. Generative adversarial networks (gan) based efficient sampling of chemical composition space for inverse design of inorganic materials. NPJ Comput. Mater. 6, 1–7 (2020).
Sawada, Y., Morikawa, K. & Fujii, M. Study of deep generative models for inorganic chemical compositions. arXiv preprint arXiv:1910.11499 (2019).
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028 (2017).
Ong, S. P. et al. Python materials genomics (pymatgen): a robust, opensource python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
Zuo, Y. et al. Accelerating materials discovery with bayesian optimization and graph deep learning. Mater. Today 51, 126–135 (2021).
Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120, 145301 (2018).
Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. Graph networks as a universal machine learning framework for molecules and crystals. Chem. Mater. 31, 3564–3572 (2019).
Pickett, W. E. & Moodera, J. S. Half metallic magnets. Phys Today 54, 39–44 (2001).
Mouhat, F. & Coudert, F.X. Necessary and sufficient elastic stability conditions in various crystal systems. Phys. Rev. B 90, 224104 (2014).
Wang, V., Xu, N., Liu, J.C., Tang, G. & Geng, W.T. Vaspkit: a userfriendly interface facilitating highthroughput computing and analysis using vasp code. Comput. Phys. Commun. 267, 108033 (2021).
Hill, R. The elastic behaviour of a crystalline aggregate. Proc. Phys. Soc. A 65, 349 (1952).
Togo, A. & Tanaka, I. First principles phonon calculations in materials science. Scr. Mater. 108, 1–5 (2015).
Shao, X. et al. A symmetryorientated divideandconquer method for crystal structure prediction. J. Chem. Phys. 156, 014105 (2022).
Müllner, D. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378 (2011).
BarJoseph, Z., Gifford, D. K. & Jaakkola, T. S. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17, S22–S29 (2001).
Fractional coordinates. Fractional coordinates — Wikipedia, the free encyclopedia (2021). https://en.wikipedia.org/wiki/Fractional_coordinates#cite_note3. [Online; accessed 11November2021].
Acknowledgements
The research reported in this work was supported in part by National Science Foundation under the grant and 2110033, 1940099 and 1905775. The views, perspectives, and content do not necessarily represent the official views of the NSF.
Author information
Authors and Affiliations
Contributions
Conceptualization, J.H., M.H., Y.Z.; methodology, Y.Z., J.H., E.S., Z.W., M.H., N.F., M.A.; software, Y.Z., Z.W.; resources, J.H.; writing–original draft preparation, Y.Z., J.H., E.S., N.F.; writing–review and editing, J.H., E.S., M.H.; visualization, Y.Z., E.S., J.H.; supervision, J.H., M.H.; funding acquisition, J.H. and M.H.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhao, Y., Siriwardane, E.M.D., Wu, Z. et al. Physics guided deep learning for generative design of crystal materials with symmetry constraints. npj Comput Mater 9, 38 (2023). https://doi.org/10.1038/s41524023009879
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41524023009879
This article is cited by

Designing semiconductor materials and devices in the postMoore era by tackling computational challenges with datadriven strategies
Nature Computational Science (2024)

Structurebased outofdistribution (OOD) materials property prediction: a benchmark study
npj Computational Materials (2024)

Comparative study of crystal structure prediction approaches based on a graph network and an optimization algorithm
Science China Materials (2024)

Methods and applications of machine learning in computational design of optoelectronic semiconductors
Science China Materials (2024)

Accelerating the prediction of stable materials with machine learning
Nature Computational Science (2023)