CCDCGAN: Inverse design of crystal structures

Autonomous discovery of materials with desired properties is one of the ultimate goals of modern materials science. Applying deep learning techniques, we have developed a generative model which can predict distinct stable crystal structures by optimizing the formation energy in the latent space. It is demonstrated that the optimization of physical properties can be integrated into the generative model either as on-top screening or as a backward propagator, each with its own advantages. Applying the generative models to the binary Bi-Se system reveals that distinct crystal structures can be obtained covering the whole composition range, and that the phases on the convex hull can be reproduced after the generated structures are fully relaxed to equilibrium. The method can be extended to multicomponent systems for multi-objective optimization, paving the way toward the inverse design of materials with optimal properties.


Introduction
The discovery and exploitation of new materials have enormous benefits for the welfare of society and technological revolutions 1 , which motivated the launch of the Materials Genome Initiative in 2011 2,3 . To date, high-throughput (HTP) workflows based on density functional theory (DFT) enable massive calculations on existing and potentially novel compounds, accelerating materials discovery dramatically 4 . For instance, crystal structure predictions can be performed based on brute-force substitution of the known prototypes or on evolutionary algorithms as implemented in CALYPSO 5 and USPEX 6 . Nevertheless, the soaring computational cost prevents exhaustive screening over an immense phase space, limiting the application of such methods. On the other hand, the emergent machine learning techniques have become the fourth paradigm for materials science 7 , as exemplified by the rapid developments and applications in the last decades 8 . Particularly, machine learning methods have been utilized to discover novel materials. However, there are typically only tens of known crystalline phases for a given binary system; e.g., the Materials Project (MP) 27 provides only 17 already known Bi x Se y materials, which are not sufficient for deep learning. Following Ref. 21 , we constructed 10981 artificial Bi x Se y structures by substituting Bi and Se into all binary materials in MP, eliminating large unit cells (i.e., keeping the maximum number of atoms per unit cell below 20 and the maximum lattice vector length below 10 Angstrom), followed by HTP DFT calculations. We use the high-throughput environment 28 to optimize the structures and determine the thermodynamical stability of both the Bi-Se database and the generated structures, where thermodynamical stability is evaluated via the formation energy. All data in the Bi-Se database are considered when calculating the convex hull.
For the VASP settings 29 , the Perdew−Burke−Ernzerhof (PBE) functional is adopted, the energy cutoff is 500 eV, and the k-point density is 40 along each direction of the Brillouin zone.

Representation
Although crystallographic and chemical information are stored in the crystallographic information file (CIF), its discontinuous and heterogeneous format is not a suitable representation for a generative model; thus, a continuous and homogeneous representation including both the chemical and structural information is required. Following Ref. 18 , the lattice constants and atomic positions are translated into voxel space and then encoded into a 2D crystal graph through an autoencoder 21 , as demonstrated in Fig. 1(A). In this way, a continuous latent space is constructed. The whole process is reversible, i.e., a random 2D crystal graph can be reconstructed into a crystal structure in real space, which is essential for a generative model. Specifically, three 3D voxel grids are used for a typical binary compound: two grids record the atomic positions of the two elements separately, and the third stores the lattice constants, i.e., the lengths of the lattice vectors and the angles between them. Hereafter such grids are referred to as the sites voxels and the lattice voxel, respectively. Using a probability density based on Gaussian functions, the lattice voxel is defined as

$$p_{x,y,z} = \exp\left(-\frac{r_{x,y,z}^{2}}{2\sigma^{2}}\right),$$

where x, y, z denote the grid indices, p is the probability density at the denoted grid point, r is the real-space distance between this grid point and the center of the unit cell, and σ denotes the standard deviation of the Gaussian function. Similarly, the sites voxel is defined as

$$p_{x,y,z} = \sum_{i=1}^{n} \exp\left(-\frac{r_{i,x,y,z}^{2}}{2\sigma^{2}}\right),$$

where n is the number of atoms and r_{i,x,y,z} is the real-space distance between this grid point and the i-th atom. In this way, the inverse transformation is trivial for the lattice voxel, while that for the sites voxel relies on the image filter technique 21 .
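As a concrete illustration, the sites-voxel definition above can be sketched in numpy as follows. This is a minimal sketch: the function name and the per-atom loop are ours, not the paper's code, and the defaults follow the grid size and σ quoted later in the text.

```python
import numpy as np

def sites_voxel(frac_coords, lattice, grid=64, sigma=0.15):
    """Sites voxel: every grid point stores sum_i exp(-r_i^2 / (2 sigma^2)),
    with r_i the real-space distance from the grid point to atom i.

    frac_coords: (n, 3) fractional coordinates of the atoms of one element.
    lattice:     (3, 3) row-wise lattice vectors in Angstrom.
    """
    axis = (np.arange(grid) + 0.5) / grid            # fractional grid coordinates
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    pts = np.stack([gx, gy, gz], axis=-1) @ lattice  # Cartesian grid points
    vox = np.zeros((grid, grid, grid))
    for f in np.asarray(frac_coords):
        r2 = np.sum((pts - f @ lattice) ** 2, axis=-1)  # squared distance to atom
        vox += np.exp(-r2 / (2.0 * sigma ** 2))
    return vox
```

The lattice voxel follows the same pattern with a single Gaussian centered at the middle of the unit cell.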
An autoencoder is a type of artificial neural network which can encode inputs into low-dimensional vectors, as well as decode the vectors back 30 . In this work, the autoencoder is applied to encode the 3D voxel data into one-dimensional (1D) vectors in the latent space, realized with a 3D convolutional neural network (CNN) autoencoder. The loss function of the autoencoder consists of two parts: the first is the loss of information, i.e., the difference between inputs and outputs, and the second is a regularization term to prevent overfitting, which yields

$$L = \lVert x - y \rVert^{2} + \lambda \sum \omega^{2},$$

where x and y are the input and output of the autoencoder, λ is the regularization coefficient, and ω are the weights in the autoencoder 11 . The constructions of the autoencoders for the sites and lattice voxels are the same, and both are trained using a 90%/10% training/test-set split. Both autoencoders achieve high reconstruction ratios, where a reconstructed structure is considered identical to its original by the crystal-structure comparison subroutine available in pymatgen 31 , with the fractional length tolerance being 0.2, the site tolerance 0.3, and the angle tolerance 5 degrees.
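The two-term loss above can be written out explicitly. This is a minimal numpy sketch; the L2 (weight-decay) form of the regularization term is our assumption, since the text does not fully specify it.

```python
import numpy as np

def autoencoder_loss(x, y, weights, lam=1e-6):
    """Reconstruction error plus weight regularization.

    x, y:    input and output voxel arrays of the autoencoder.
    weights: list of weight arrays of the network.
    lam:     regularization coefficient (lambda = 0.000001 in the text).
    """
    reconstruction = np.mean((x - y) ** 2)                       # information loss
    regularization = lam * sum(np.sum(w ** 2) for w in weights)  # assumed L2 form
    return reconstruction + regularization
```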
For the data transformation between real space and voxel space, the voxel grid used for the atomic positions is 64×64×64, while that for the lattice parameters is 32×32×32; the σ used in the Gaussian function is always 0.15, and the function "gaussian_filter" from the scipy package is used 21 . The autoencoders are constructed with tensorflow in python. The sites autoencoder consists of 10 3D convolutional layers, i.e., 5 for the encoder and 5 for the decoder; the strides, except for the first and last layers, are 2×2×2; the activation function is leakyRELU; and the Adam optimizer is used with a learning rate of 0.003, λ of 0.000001, β 1 of 0.9, and β 2 of 0.99 21 . Similarly, the lattice autoencoder consists of only 8 3D convolutional layers (4 for the encoder and 4 for the decoder), with the other parameters the same as above 21 . The detailed designs of these two models can be found in Table S1 and Table S2.
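The smearing step via scipy's "gaussian_filter" named above can be illustrated as follows. This is purely illustrative: the single "atom" is placed by hand, and σ is given here in grid units rather than the paper's units.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Place a point atom on the 64^3 sites grid, then smear it into a Gaussian
# density with scipy's gaussian_filter (sigma in grid units, for illustration).
vox = np.zeros((64, 64, 64))
vox[32, 32, 32] = 1.0                       # one atom at the cell center
smeared = gaussian_filter(vox, sigma=2.0)   # normalized kernel: total mass kept
```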

DCGAN
Turning now to the realization of generative models, VAE 11 and GAN 12 are the two most popular algorithms. VAE is a variant of the autoencoder discussed above, which assumes a specific (such as Gaussian) distribution of the data (in our case 2D crystal graphs) in the latent space. To be generative, such a distribution function should be defined properly, i.e., consistent with the distribution of the 2D crystal graphs of the known crystal structures in the latent space. In addition, the distribution function has to be specified prior to the training process, and its form determines the performance of the generative model, which demands domain expertise in statistics and a profound understanding of the input data. For instance, a recent work by Ren et al. used a sigmoid function in a VAE to design inorganic crystals and found novel structures that do not exist in the training set 32 . In contrast, the GAN model trains two mutually competitive neural networks (i.e., a generator and a discriminator) to generate new data statistically the same as the training data without assuming a distribution function 12 . That is, the discriminator tries to distinguish the generated data from the training data, while the generator attempts to fool the discriminator by generating data similar to the training data. Mathematically, the objective of GAN is defined by the following equation:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{x}}\left[\log D(x)\right] + \mathbb{E}_{\tilde{x} \sim p_{g}}\left[\log\left(1 - D(\tilde{x})\right)\right],$$

where D is the discriminator, G is the generator, E denotes the expectation value, x is the original data, D(x) is the output of the discriminator with x as input, p x is the probability density function of the original data, and p g is the probability density function of the generated data. Through the competition between the generator and discriminator, the distribution of the generated data becomes hardly distinguishable from the distribution of the training data.
Compared with VAE, GAN does not require specification of the distribution function in the latent space (p g ), which makes the generation process more robust, and thus is adopted in this work 13 .
The DCGAN model consists of a generator and a discriminator, both using the Adam optimizer with a learning rate of 0.002, β 1 of 0.5, and β 2 of 0.999. 2D convolutional layers, dropout, and batch normalization are used in both models, while RELU, leakyRELU, tanh, and sigmoid are used as activation functions. The model is trained for 1,000,000 steps, taking 16 hours on a Quadro P2000 GPU. The designs of the generator and discriminator are listed in Table S3 and Table S4, respectively. The latent space of the GAN model has 200 dimensions.
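The GAN value function above can be estimated over a batch of discriminator outputs; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs in (0, 1) on real (training) samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
```

At the ideal equilibrium the discriminator outputs 0.5 everywhere, giving V = -2 log 2, i.e., the two distributions are no longer distinguishable.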

Constraint
As mentioned above, the objective of inverse design is to design compounds with desired properties, including thermodynamical, mechanical, and functional properties. In this work, we take the formation energy (i.e., thermodynamic stability) as the target property. In order to be able to optimize the formation energy in the latent space, another convolutional neural network (CNN) model is trained taking 2D crystal graphs in the latent space as inputs and formation energies as the output physical property. The training is carried out with a 90%/10% training/test-set split over the Bi-Se database. The constraint model consists of 4 2D convolutional layers and 6 fully connected layers, implemented in keras 33 . It also uses the Adam optimizer with a learning rate of 0.002, β 1 of 0.5, and β 2 of 0.999; the optimization loss is the mean squared error and the activation function is leakyRELU. To prevent overfitting, dropout and batch normalization are applied after each convolutional layer 34 . The detailed design of this model is described in Table S5.

DCGAN + constraint
A straightforward way to implement the optimization of physical properties, e.g., the formation energy in this work, is to apply an add-on constraint on the structures generated by DCGAN, as sketched in Fig. 2(A). The advantage of doing so is that there is no need to train another model, which saves training time. Additionally, all existing machine learning models can be transplanted, regardless of whether such forward-prediction models take 2D crystal graphs in the latent space or vector-based chemical and structural information as descriptors. This allows the optimization of a wide spectrum of physical properties 14 . Nevertheless, this method is essentially a selection among the structures generated by DCGAN; thus, it cannot automatically search for a specific region in the latent space to reach locally optimal values. More details are described in Supplementary Section S5.

CCDCGAN
The constraint can also be integrated into DCGAN as a back propagator, as illustrated for CCDCGAN in Fig. 3(A), to realize automated optimization in the latent space so that inverse design can be achieved. In this way, the constraint gives rise to an additional optimization objective for the generator, which can be mathematically described as

$$\mathcal{L}_{G} = \mathcal{L}_{D} + \omega E_{f}(z),$$

where E f is the formation energy predicted by the constraint model, z is the generated 2D crystal graph, L D is the adversarial (discriminator) loss of the generator, and ω is defined as the weight of the formation energy loss. Note that such an additional optimization objective cannot outweigh the primary objective, leading to a lower weight for the formation energy loss (0.1 in this work) than for the discriminator loss. Unlike the DCGAN + constraint model, CCDCGAN can accomplish automated searching for the local minima in the latent space and thus improve the efficiency of discovering new stable structures. It is noted that CCDCGAN requires training from scratch, which takes 18 hours on a Quadro P2000 GPU machine, i.e., 2 hours longer than the DCGAN + constraint model. Please refer to Supplementary Section S5 for more details.
The generator, discriminator, and constraint have exactly the same designs as in the models above, with all parameters the same as in DCGAN. The only difference is that the weight of the discriminator loss remains 1 while the weight of the formation energy loss is 0.1. The training time is about 18 hours under the same conditions as for DCGAN.
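The relative weighting of the two generator objectives described above can be sketched as follows. This is a toy scalar version for illustration only; in real training the tensor-valued losses are combined inside the deep learning framework and back-propagated through the generator.

```python
def ccdcgan_generator_loss(adversarial_loss, predicted_ef, w_ef=0.1):
    """Total generator objective: the adversarial (discriminator) loss keeps
    weight 1, while the constraint-predicted formation energy enters with the
    smaller weight 0.1, so stability guidance cannot outweigh realism."""
    return adversarial_loss + w_ef * predicted_ef
```

Lowering the predicted formation energy of a generated graph lowers the total loss, which is what steers the generator toward stable regions of the latent space.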

Data
The HTP DFT calculations lead to 9810 compounds with successful crystal structure optimization 35 . The corresponding formation energies are shown in Fig. S1. Obviously, the whole composition range is covered. However, only 155 (1.56%) of all the structures are stable (formation energy lower than 0 eV/atom and distance to the convex hull smaller than 100 meV/atom) and 707 (7.2%) cases are meta-stable (formation energy lower than 0 eV/atom and convex hull distance smaller than 150 meV/atom). Interestingly, there is only one phase, i.e., Bi 2 Se 3 , on the convex hull. Such a database is referred to as the Bi-Se database in the rest of the manuscript.
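The stability measure used above, the distance to the convex hull, can be sketched for a binary system as follows. This is a minimal pure-python lower-hull construction with names of our own choosing; the paper itself evaluates hull distances from the DFT formation energies of the whole database.

```python
def _cross(o, a, b):
    """Cross product of vectors OA and OB; > 0 means a left (counterclockwise) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_distance(x, ef, phases):
    """Distance above the binary convex hull (eV/atom) for a phase at Se
    fraction x with formation energy ef.

    phases: list of (x_Se, Ef) pairs for all competing phases; the elemental
    endpoints (0, 0) and (1, 0) are added automatically.
    """
    pts = sorted(set(phases) | {(0.0, 0.0), (1.0, 0.0)})
    hull = []                                  # lower hull via monotone chain
    for p in pts:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:                      # interpolate the bracketing segment
            return ef - (e0 + (e1 - e0) * (x - x0) / (x1 - x0))
```

A phase is on the hull when its distance is zero; the stability thresholds quoted above (100 and 150 meV/atom) are cuts on this quantity.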

Representation
When applied to the Bi-Se database, 9420 of the 9810 crystal structures are successfully reconstructed. Such a high ratio is not caused by overfitting, as demonstrated by the learning curves of both autoencoders shown in Fig. 1(B) and (C). The differences between the training-set and test-set losses are negligible, suggesting that the 2D crystal graphs in the latent space contain adequate information to reconstruct the crystal structures. This is confirmed by the diversity of twelve typical 2D crystal graphs in the latent space (cf. Fig. 1(D) for four cases). Please refer to Supplementary Section S2 for more details.

DCGAN
Such an implementation of DCGAN can generate crystal structures with a high success rate (the number of generated crystals over the number of generated 2D crystal graphs); e.g., 2397 crystal structures are transformed from 13000 generated 2D crystal graphs and relaxed by DFT calculations. Most (~90%) of the discarded cases fail due to a rigid constraint on the interatomic distance, where the minimum interatomic distance should be larger than 2.0 Angstrom, which is set based on the known structures in the Bi-Se database. The generated structures cover a large composition range, as shown in Fig. 2(B), where the red bullets denote the original data in the Bi-Se database and the blue triangles mark the structures newly generated by DCGAN. One interesting observation is that the generated structures mostly have negative formation energies, on average lower than those of the original database (Fig. S2(A)).
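The rigid interatomic-distance filter described above can be sketched in numpy as follows; a minimal sketch (the function name and the brute-force check over the 26 neighboring periodic images are ours):

```python
import numpy as np

def min_distance_ok(frac, lattice, dmin=2.0):
    """Accept a structure only if its minimum interatomic distance, including
    periodic images in the 26 neighboring cells, is at least dmin Angstrom."""
    cart = np.asarray(frac) @ lattice
    best = np.inf
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            for k in (-1, 0, 1):
                shift = np.array([i, j, k], dtype=float) @ lattice
                d = np.linalg.norm(cart[:, None, :] - (cart + shift)[None, :, :], axis=-1)
                if i == j == k == 0:
                    np.fill_diagonal(d, np.inf)   # ignore self-distances
                best = min(best, d.min())
    return best >= dmin
```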

Constraint
The resulting R 2 score and mean absolute error (MAE) are about 85% and 0.019 eV/atom (Fig. S3(A)), respectively. Such a MAE is even smaller than that (0.021 eV/atom) obtained using the state-of-the-art CGCNN model 36 . That is, the 2D crystal graphs in the latent space can be considered as effective descriptors for the forward prediction of physical properties. We note that overfitting is a marginal issue for such a high accuracy, as seen in the learning curve of the training process in Fig. 2(D). Details of the CNN model are described in Supplementary Section S4.
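The two metrics quoted above can be computed as follows (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def r2_and_mae(y_true, y_pred):
    """Coefficient of determination (R^2) and mean absolute error between
    DFT formation energies and constraint-model predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot, mae
```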

DCGAN + constraint
Applying the constraint on the formation energy (with a tolerance of 0.3 eV/atom) to the 2397 structures generated by DCGAN, 1998 crystal structures are selected and optimized by further DFT calculations. Such a screening procedure does not affect the diversity of the composition, as shown in Fig. S2(D). The application of such a constraint has an evident effect in the latent space, as demonstrated in Movie S1, where the constraint screens out the unwanted points, leading to a shrunken region in the latent space. With such a constraint applied, the ratio of structures with negative formation energy increases from 91.1% for DCGAN to 94.4% (1887/1998). The ratio of meta-stable structures also becomes higher compared to the DCGAN case, i.e., 1028 meta-stable structures are obtained out of the 1887 structures with negative formation energy, reaching a ratio of 54.4%, better than the 48.4% of the DCGAN model. However, the number of generated stable structures reduces to 8 due to the selection process.
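The screening step above can be sketched as follows. This is a minimal sketch; the exact acceptance criterion (constraint-predicted formation energy below the tolerance) is our assumption based on the text.

```python
def screen_by_predicted_ef(graphs, predicted_ef, tol=0.3):
    """Keep generated 2D crystal graphs whose constraint-predicted formation
    energy (eV/atom) lies below the tolerance (0.3 eV/atom in the text)."""
    return [g for g, e in zip(graphs, predicted_ef) if e < tol]
```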
An unexpected observation is that applying the constraint jeopardizes the generation of distinct structures. After performing DFT calculations, only 151 distinct structures remain, significantly reduced compared with the 313 cases obtained by DCGAN. The reason is two-fold. On the one hand, the crystal structures generated by DCGAN are not guaranteed to be at mechanical and dynamical equilibrium, i.e., the lattice constants and atomic positions change during the DFT relaxation. On the other hand, the CNN model applied as the constraint is not accurate enough in predicting the formation energy, even though its MAE is only about 0.019 eV/atom. Thus, 344 candidates have been screened out by considering a tolerance of 0.3 eV/atom. After comparison with the known crystal structures in the Bi-Se database, 84 distinct structures with negative formation energy remain, among which 35 distinct meta-stable and 2 distinct stable structures are generated. The generated structures are still of high quality; e.g., two of them, Bi 3 Se and Bi 2 Se 2 , are shown in Fig. S2(E) and Fig. S2(F), with formation energies (distances to the convex hull) of -0.065 (0.099) eV/atom and -0.202 (0.126) eV/atom, respectively. Overall, the structures generated via DCGAN + constraint have a higher ratio of meta-stable structures than those from DCGAN but a lower ratio of distinct structures.

CCDCGAN
In comparison with DCGAN, CCDCGAN has a higher success rate of generating crystal structures, i.e., 3106 crystal structures are transformed successfully from 13000 generated 2D crystal graphs. As for the DCGAN model, the generated structures cover a large composition range, especially in the Bi-rich part centered at Bi 4 Se (Fig. 3(B)). The average formation energy (-0.0895 eV/atom) of the CCDCGAN-generated structures (Fig. S2(G)) is noticeably lower than that from DCGAN (-0.0788 eV/atom, Fig. S2(A)) and DCGAN + constraint (-0.0868 eV/atom, Fig. S2(D)). This suggests that the back propagation indeed drives optimization in the latent space. Correspondingly, the numbers of structures with negative formation energy, meta-stable structures, and stable structures are 2918, 1417, and 21, respectively. The most intriguing result is that the number of distinct structures is larger than that from DCGAN or DCGAN + constraint, where 373 distinct structures have been identified after performing DFT relaxations on the 3106 structures. Correspondingly, the numbers of distinct meta-stable and stable structures are 90 and 5, respectively. Three particularly promising structures with formation energies lower than their counterparts in the Bi-Se database are Bi 2 Se 4 (Fig. 3(E)), Bi 4 Se 2 (Fig. 3(F)), and Bi 6 Se 2 (Fig. 3(G)), whose formation energies (distances to the convex hull) are -0.287 (0.041) eV/atom, -0.139 (0.069) eV/atom, and -0.063 (0.101) eV/atom, respectively. In this regard, CCDCGAN performs better than both the DCGAN + constraint and DCGAN models in generating distinct structures.
As seen from Fig. 2(B) and Fig. 3(B), the generated distinct phases all appear to lie above the convex hull, and there is no new phase on or below the convex hull defined by the Bi 2 Se 3 phase. Particularly, around the Bi 2 Se 3 composition there is a forbidden region with no distinct phase generated. To find out the reason and to explore the full competence of the CCDCGAN model, we generated one million crystal structures and performed DFT calculations on 1278 generated structures with the Bi 2 Se 3 composition, resulting in 236 distinct structures. As shown in Fig. 3(C), the whole range of formation energies can be covered, with the smallest distance to the convex hull among the generated structures being only 0.0151 eV/atom, which confirms the ability to generate structures on the convex hull. In this regard, the forbidden region close to the convex hull originates from the limited number of generated structures and from the fact that the generated structures are not at their mechanical and dynamical equilibria. Furthermore, we trained another CCDCGAN model with all phases of the Bi 2 Se 3 composition excluded from the training set; 233 of the 1612 generated Bi 2 Se 3 structures are the same as those in the Bi-Se database (Fig. 3(D)). More importantly, the case with the smallest distance to the convex hull (only 0.0005 eV/atom, with the corresponding formation energy being -0.3935 eV/atom) has the same structure as the only stable Bi 2 Se 3 phase. This proves the capability of the CCDCGAN model to generate experimentally synthesizable structures in unknown composition ranges, and thus to accelerate the discovery of new crystal phases. Nevertheless, additional DFT calculations are mandatory to relax the crystal structures, so that distinct structures on or below the current convex hull can be identified.
Particularly for the Bi-Se system, a complete evaluation of the convex hull by performing explicit DFT calculations on one million generated structures covering the whole composition range is still numerically expensive. We suspect that the recently developed machine learning interatomic potentials will be helpful for optimizing the generated crystal structures with chemical accuracy 37 .

Conclusion
We have developed an inverse design framework, CCDCGAN, consisting of a generator, a discriminator, and a constraint, and successfully applied it to design unreported crystal structures with low formation energy for the binary Bi-Se system. It is demonstrated that 2D crystal graphs can be used to construct a latent space with a continuous representation of the known crystal structures; these graphs serve as effective descriptors for modeling physical properties (e.g., the formation energy) and can be decoded into real-space crystal structures, enabling the generation of new crystal structures. Such an inverse design model can be easily generalized to other compositions and to multicomponent cases. Importantly, we elucidate that the optimization of physical properties (e.g., the formation energy) can be integrated into the generative deep learning model either as an explicit constraint or as a back propagator. This allows further development of a multi-objective inverse design framework to optimize other physical properties. One apparent challenge is how to bring the generated structures to their mechanical and dynamical equilibria, entailing further exploration such as relaxation using machine learning interatomic potentials.