Introduction

Crystallography is the experimental science of determining the structure of crystals by analyzing x-ray, neutron, or electron diffraction patterns1,2,3. Powder crystallography is a sub-branch of crystallography that solves this problem when the measured sample consists of a large number of small, randomly oriented grains of the material4,5,6,7. The powder problem is mathematically harder because of the loss of orientational information, which must be recovered through inference during structure reconstruction. It is useful when single crystals are difficult to obtain experimentally, and it is also a good starting point for developing methods to determine the structure of nanomaterials and molecules in solution8, problems that currently have no robust solution.

The field of structure determination from powder diffraction9 has grown by adapting conventional crystallographic methods to the powder case. As with all crystallographic methods, these use inference and an iterative design approach to obtain structure candidates. The approach is human-intensive, requiring hands-on guidance by skilled experts: first the crystallographic coordinate system is identified, a process called indexing, and then the fractional coordinates of the atoms in the unit cell are found from the Bragg peak intensities1,9. For powder x-ray diffraction (PXRD) data, this process succeeds or fails depending on the quality of the data and the complexity of the structure, and it demands considerable expertise.

Recent work suggests that deep learning methods hold great potential to simplify the solution of complex inference problems with a straightforward end-to-end process. For instance, the protein-folding problem has recently been “solved” by end-to-end deep learning approaches like AlphaFold10,11 and RoseTTAFold12. This is highly relevant because protein folding is a sister problem to powder crystallography: both problems involve recovering the enigmatic shape of complex molecules from sparse, low-dimensional (i.e., 1-dimensional) inputs (amino acid sequences in the case of proteins and PXRD patterns in the powder crystallography case)13. Other examples of problems that have yielded to end-to-end learning are image classification14, autonomous vehicle driving15, and speech recognition16.

Machine and deep learning methods have been proposed to accelerate various stages of the powder crystallographic process. However, most of these works operate in a classification or feature-regression paradigm: given an observation such as the XRD pattern, predict a property of the structure, such as space group symmetry, phase, unit cell parameters, or magnetism17,18,19,20,21,22,23,24,25,26,27,28,29,30. Some works do generate crystal structures, but their methodologies are not readily applicable to our problem because they (1) largely focus on unconditional (with respect to the XRD pattern) generation, in which there is no ground truth structure to reconstruct31,32,33; (2) solve the easier single-crystal diffraction problem34,35; or (3) were designed only for specific classes of materials, such as proteins36 and monometallic nanoparticles37. Furthermore, the source code for many works in the deep learning for crystallography paradigm has not been released, limiting their reproducibility23,32,34.

Here, we propose an approach toward an end-to-end deep neural network that determines a transformed three-dimensional electron density field directly from a 1-dimensional diffraction pattern. The actual electron density distribution may then be recovered with the inverse transform, as we describe below.

The model, which we call CrystalNet, is a variational38 query-based multi-branch deep neural network (DNN) architecture (also known as a conditional implicit neural representation39,40,41,42,43) that takes powder x-ray diffraction patterns and chemical composition information as input, and outputs a continuous function that is related to the 3D electron density distribution. We call this function the Cartesian mapped electron density (CMED) because we map the electron density from the crystallographic coordinate system of the structure to a Cartesian coordinate system. This distorts the resulting electron density but places it on a universal basis that allows the model to be seamlessly trained on structures from different crystal systems and with different unit cell parameters. The advantage of this representation for material structure is that it frees us from traditionally predefined properties such as the number of atoms and the crystallographic coordinate system. The actual electron density distribution may be recovered from the CMED through the inverse mapping, and if required, the discrete molecular structure can be straightforwardly decoded from this electron distribution44. After training, given a previously unseen diffraction pattern (and corresponding chemical composition information), CrystalNet can be queried to produce a 3D CMED map at any desired resolution. Due to our variational approach38,45, CrystalNet can also be queried multiple times to produce different predictions, should the first guess be unsatisfactory. The design, training, and testing protocols are described in the Methods section.

The performance of the model is described here. We report preliminary results from the cubic and trigonal crystal systems using theoretically simulated data from the Materials Project46. CrystalNet was able to reconstruct atomic structures from the cubic system almost perfectly. For the trigonal system, CrystalNet achieves success in most cases, with the infrequent failure modes providing insights for future work. We chose the cubic and trigonal systems for the initial tests as representative systems that are close to, and far from, respectively, the Cartesian coordinate system. Both have the property a = b = c, but in the trigonal case one of the lattice angles is 120°. As such, these systems might be representative of best-case and not-as-good scenarios. Although other crystal systems were not explored fully in this study, the results on these two crystal systems give us hope that our approach can be highly effective for the remaining five systems. We note that the model does not make use of any symmetry or chemical property information beyond composition and yet still shows success. This means that such information may be added as priors in future iterations when there is even greater information loss in the input signal, for example, due to very low symmetry structures or the broad diffraction signals characteristic of nanomaterials.

We also conduct ablation studies by systematically reducing the input chemical composition information to gain insight into which information is most important for AI-enabled powder crystallography going forward. We find that while this information helps our model, for these high symmetry structures, crystal reconstruction is generally successful with only the XRD data and no compositional information at all.

Results

We evaluate CrystalNet by feeding in the XRD pattern, chemical composition, and queried coordinates as input. CrystalNet then processes this information with multiple branches and fuses it into one shared representation. Finally, via the charge density regressor, it outputs the predicted charge densities at each of the queried coordinates. See Fig. 1 for a schematic overview of how CrystalNet works.

Fig. 1: CrystalNet System Overview.
figure 1

As input, CrystalNet takes in 1D powder x-ray diffraction (PXRD) patterns, which may be obtained by azimuthally integrating a 2D diffraction pattern such as the one shown in the black square. It also takes in chemical composition ratios, and the atomic coordinates of known structures. In the Solver, it processes each input item with a specialized branch (pink, purple, and black). It then fuses them into one unified latent vector of length 512, which is passed through the charge density regressor to produce a voxelized 3D charge density map at arbitrary resolution. See “Methods” for more details.

Reconstruction

Table 1 shows reconstruction success metrics (SSIM, PSNR) on the cubic and trigonal crystal systems from powder XRD and chemical formula information. SSIM stands for structural similarity index, which measures the patchwise correspondence between two signals on a scale of 0 (worst) to 1 (best)47. PSNR stands for peak signal-to-noise ratio, which measures the magnitude of the predicted charge density signal relative to the size of the errors in the prediction; higher values are better, with an infinite value indicating perfect reconstruction48. Typically, values of PSNR above 30 are considered high-fidelity reconstructions49. More details are available in Methods.

Table 1 Reconstruction Performance from combined powder XRD and chemical formula

To demonstrate the functionality of our methodology, Fig. 2 shows sampled reconstructions of two testing crystals viewed from various angles, given only chemical composition and powder XRD as input. Inspired by variational approaches, we achieve multiple reconstructions by sampling from the conditional latent distribution38. We see that this stochasticity in output can be helpful if the initial guess is incorrect; in principle, we can resample to obtain a more reasonable prediction that matches the given XRD, as measured by the analytically solved forward process. Even for failure cases like Ge7Ir3, we still see that sampling multiple times allows us to get a prediction that is closer to the ground truth. On average, over five latent space samples for the same given crystal (i.e., XRD and formula input), the standard deviation of SSIM is 0.017 in the cubic system and 0.018 in the trigonal system, while the standard deviation of PSNR is 2.78 in the cubic system and 0.68 in the trigonal system.

Fig. 2: Multi-view variational reconstructions.
figure 2

We display CMED reconstructions of previously unseen crystals at multiple viewing angles. Given chemical composition and powder XRD as input, we generate a distribution over latent codes, which we then sample from to obtain multiple crystal reconstructions. This allows us to obtain better guesses if the first prediction is suboptimal.

See Fig. 3a for success cases of cubic reconstruction, and Fig. 3b for failure cases of cubic reconstruction. Overall, reconstruction is very successful over a diverse range of crystal structures, judging from both visual and quantitative metrics.

Fig. 3: Cubic system reconstructions.
figure 3

a shows the CMED plots for the success cases. b shows the failure cases. Ground truth and CrystalNet prediction alternate left-to-right. Formulas are under ground truth images, and the corresponding powder XRD peak inputs are under predictions. Powder XRD peak inputs are visualized as relative intensity maps, with the diffraction angle (horizontal direction) increasing from 0° → 180°.

Quantitatively (Table 1), we achieve great success, as evidenced by the 0.934 mean SSIM on the testing set. Indeed, a value of 1 (perfect reconstruction) lies within one standard deviation (0.149) of this mean testing SSIM, indicating that many crystals had near-perfect reconstructions. The PSNR is also very high, remaining above the typical success threshold49 of 30 even one standard deviation (12.7) below the mean testing PSNR (43.0).

From a qualitative perspective, we also see many good results (Fig. 3a). Encouragingly, our method succeeds for structures (such as \({{\rm{V}}}_{3}{({{\rm{Co}}}_{10}{{\rm{B}}}_{3})}_{2}\)) with a high number of atoms in the unit cell, despite not knowing a priori how many atoms are contained. Its success also appears consistent across a variety of chemical compositions; e.g., it succeeds on both Gd2Hf2O7 and ThCd11, which share no common elements. We also observe, as expected, that crystals containing similar elements, such as Zr3Sb4Pt3 and Ce3Sb4Pt3, have similar structures, albeit with different average charge densities. Even the cubic failure modes (Fig. 3b) still give good guesses for rough structural outlines, even if the details are slightly incorrect. For instance, Cs3H12N4F3 has a predicted general structure close to the ground truth, but the charge density peaks are not as sharp, and the atomic boundaries are slightly blurred. Another example is Cr4GaCuSe8, whose predicted structure is reasonably close to the ground truth, except that it is oriented upside-down and has some extraneous medium-charge locations. Indeed, the upside-down prediction is not a significant error, since material identity is invariant under rotation.

See Fig. 4a for success cases of trigonal reconstruction, and Fig. 4b for failure cases of trigonal reconstruction. Quantitatively and qualitatively, reconstruction on this system is also successful, although not as successful as the cubic system.

Fig. 4: Trigonal system reconstructions.
figure 4

a shows the CMED plots for the success cases. b shows the failure cases. Ground truth and CrystalNet prediction alternate left-to-right. Formulas are under ground truth images, and the corresponding powder XRD peak inputs are under predictions. Powder XRD peak inputs are visualized as relative intensity maps, with the diffraction angle (horizontal direction) increasing from 0° → 180°.

Looking at Table 1, we see that both the SSIM and PSNR are lower than those achieved on the cubic system. This is expected, as the trigonal system is less symmetric. That being said, SSIM levels are still decent, with an average value of 0.741 out of 1. Average PSNR is only slightly below the threshold for high-fidelity reconstruction (27.8 vs. 30)49.

Moving to qualitative analysis, similar to the cubic system outcomes, we are able to solve crystals with a high number of atoms in the unit cell (e.g., \({{\rm{CrP}}}_{6}{({{\rm{WO}}}_{8})}_{3}\)), and crystals from diverse chemical makeups (e.g., LaZnCuP2, \({\rm{Ba}}{({{\rm{B}}}_{2}{{\rm{Pt}}}_{3})}_{2}\)). Additionally, in the trigonal success cases (Fig. 4a), we see that the model is able to successfully solve crystal structures with considerably lower symmetry than the examples in the cubic system. For instance, \({{\rm{Rb}}}_{3}{\rm{Na}}{({{\rm{RuO}}}_{4})}_{2}\) and Mn8Nb3Al are considerably less symmetric than any of the examples displayed for the cubic system, yet our method was still able to achieve high-fidelity reconstructions of both.

Furthermore, due to the CMED representation placing all crystals in a unit cell with orthogonal inter-axial angles (as opposed to the non-orthogonal inter-axial angles of the trigonal system), we observe slight atomic distortion in both the ground truth and predicted structures, e.g., \({\rm{Ba}}{({{\rm{B}}}_{2}{{\rm{Pt}}}_{3})}_{2}\) has ellipsoid rather than spherical site shapes. This is expected, and more detail about the CMED representation is available in Methods.

We see that the failure cases (Fig. 4b) are somewhat more apparent for the trigonal system than for the cubic system. Indeed, some of the predictions do not contain useful information, e.g., Si5P6O25. Noticeably, the model appears to have difficulty predicting the high charge density regions, such as in Pr6Mn\({({{\rm{SiS}}}_{7})}_{2}\). That being said, some failures (such as NaBiF6) still contain reasonable information about the structure, which can be used as a first step in an iterative structural refinement process. It is also notable that many of the failure cases exhibit difficulty with orientation. For instance, NaBiF6, Rb2PtC2, and \({\rm{Rb}}{({{\rm{V}}}_{3}{{\rm{S}}}_{4})}_{2}\) have reconstructions that would be considered more reasonable, were they rotated differently.

Data ablation

We conduct ablation studies on the chemical formula information since, in reality, this data is known to varying degrees during the crystallographic process. We try three ablations: (1) eliminate elemental ratio information, placing a 1 in the composition vector if the element is contained in the material and 0 otherwise; (2) randomly drop one element from the ratio-free composition information, i.e., flip 1 to 0 for a single randomly selected element (at least one element must be known, so we do not drop elements if the material contains a single element); (3) no elemental information at all, leaving only XRD. Full XRD information was retained in all of these ablation studies. See Table 2 for the results.

Table 2 Ablation Performance at various levels of input information: (1) Baseline: Powder XRD + full chemical composition information; (2) No Ratios: Powder XRD + elements contained, without any information about their ratios; (3) Drop Element: Powder XRD + elements contained, with one element randomly dropped; (4) No Formula: Powder XRD information only

See Fig. 5a for visualizations. As expected, as we ablate information about the chemical composition, the quantitative reconstruction performance, as measured by SSIM and PSNR, declines on the cubic system (Table 2). That being said, the visual and quantitative results indicate that even with heavy degradation of the inputted elemental composition information, we still achieve very reasonable reconstructions. Indeed, there is virtually no difference between the Baseline (powder XRD + full composition information) and No Ratios versions of the model, as measured by SSIM and PSNR. And even though there is a ten-point PSNR gap between Baseline and No Formula, No Formula still has a mean PSNR of 30.0, which is higher than that of any of the trigonal versions.

Fig. 5: Ablation studies.
figure 5

a shows cubic system ablations. b shows trigonal system ablations. Left-to-right in each panel: ground truth, full information (XRD + formula) prediction, excluding elemental ratios (XRD + elements contained) prediction, randomly dropped element (XRD + all but one element contained) prediction, XRD-only prediction.

See Fig. 5b for visualizations. The trend of decreasing performance with decreasing degrees of chemical composition information still generally holds for the trigonal system (Table 2). Yet, similar to the cubic model, the trigonal model still works even under this heavy degradation in information.

Surprisingly, and in contrast to the cubic system, removing the formula altogether from the trigonal reconstruction model’s input actually performs slightly better (as measured by SSIM and PSNR) than randomly dropping one element from the composition information. For instance, the No Formula reconstruction for ErNi3 is more successful than the Drop One (Er) reconstruction in Fig. 5b. It is also interesting to note that even the No Formula version of the cubic model performs better than the Baseline (full information) version of the trigonal model, indicating that (at least with our model design) the cubic system is easier to solve than the trigonal system.

Discussion

This work demonstrates successful large-scale reconstruction of crystals in the cubic and trigonal systems. This is significant because it can pave the way for fully automated solutions to crystal structures from powder XRD data, potentially speeding up materials discovery and analysis by orders of magnitude. Furthermore, even if the structure initially predicted by our method is not correct, it can still be used as a first guess in an iterative refinement process, or we can re-sample from the latent space to generate another candidate (since we use a variational approach).

Of particular interest is our CMED representation (described further in Methods). By mapping all structures onto a universal coordinate system, we are able to train the same model architecture on structures from different crystal systems and unit cell parameters. This is advantageous (especially in comparison to approaches that predict coordinates of discrete atoms), because this representation does not require a priori knowledge of properties that are required by other methods, such as the number of atoms or lattice vectors. However, because it re-maps structures onto another coordinate system, the CMED inherently distorts atoms, in size and shape.

All the experiments conducted were on simulated powder x-ray diffraction patterns. Furthermore, many of the materials in the Materials Project are theoretical materials that have never been synthesized46. These still provide valid data pairs for training and evaluating our model, since generating an XRD pattern from a crystal structure is an analytically solved problem1. However, this also means that much of the data is free from defects we would find in experimental data, e.g., peak broadening and missing peaks4,8. Thus, while we have shown that deep learning methods can, in principle, solve the structure problem, future work will be needed to overcome this simulation-to-real gap.

Furthermore, we solved only two of the seven crystal systems1. Based on our preliminary explorations of the other five systems, we hope that this method, with appropriate tweaks, could be applicable to them. Indeed, due to our CMED representation, the data format would be exactly the same: empirical chemical formula and PXRD pattern as input, voxelized electron density grid as target. Yet future work is needed to adapt our approach to these other systems, for example to address unequal lattice vector lengths and different symmetry operations.

Additionally, solving crystal structures can be a one-to-many problem, in the cases of degraded XRD and/or chemical composition data. Although the variational approach allows us to have variation in the output via re-sampling from the latent space (see “Methods”), we seek more principled ways to model the uncertainty in our predictions.

Also, our representation of chemical composition tells the model only which elements are contained (and their ratios); it does not encode their chemical properties. In future work, we could incorporate prior chemical knowledge, e.g., atomic mass, period, and group.

Finally, as seen by some outputs that were reasonable but oriented incorrectly, future work should either propose a reliable method for enforcing canonical poses or design a model that can learn on and output multiple orientations of the same structure.

Methods

Dataset

We obtain our data from the Materials Project46, which provides publicly available, standardized data on over 150,000 inorganic compounds, largely materials in the Inorganic Crystal Structure Database (ICSD)50. Some of the material properties are experimentally observed, while others are calculated with Density Functional Theory (DFT)51,52.

We ensure there is no train-test leakage in the dataset as follows. Our criterion for whether two structures are “duplicates” is that they have the same (1) chemical formula and (2) space group. We go through our datasets, find all structures that share a formula-space group combination, and remove all but one of them from the dataset.
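For concreteness, a minimal sketch of this deduplication step is shown below (in Python; the field names are illustrative, not those of our actual pipeline):

```python
def deduplicate(entries):
    """Keep one representative per formula-space group combination.

    `entries` is assumed to be a list of dicts with illustrative keys
    'formula' and 'spacegroup'; only the first occurrence of each
    combination is retained.
    """
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["formula"], entry["spacegroup"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```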

We use data from the cubic and trigonal crystal systems, which constitute two out of the seven total crystal systems53. We only experiment on these two systems in this preliminary study because the intra-crystal axial lengths are equal (i.e., a = b = c), which eliminates the need to predict the axial lengths (whether implicitly as an intermediate calculation, or explicitly as the model’s output), and allows us to focus on predicting charge densities. See Table 3 for the numbers of crystals used in our experiments.

Table 3 Number of crystals

We run separate experiments for the cubic and trigonal systems, i.e., we train and test one version of our model only on cubic crystals, and we train and test another version of our model only on trigonal crystals. In practice, to determine the structure of a material, we would run each version of the model (where each version is trained to solve one specific crystal system) on the XRD and partial chemistry information, then take the most plausible structure from the given outputs. This does not add significant burden to the end user of our method, since there are only seven total crystal systems, and inference time for our model is less than a minute per structure.

We use the theoretically calculated powder x-ray diffraction patterns from the Materials Project API. The diffraction angle range used was 0° to 180°. More detail is available in the references54,55.

The simulated patterns are generated using the Mo Kα wavelength of 0.711 Å. Depending on the atom types present in the compounds, the amplitude of the powder XRD patterns may vary drastically. This variation can be inherently problematic for most machine learning algorithms56. To address this issue, we normalize the peak intensities so that the highest peak intensity is set to 1. While this normalization removes some of the information related to specific atom species, it retains the relative differences between them. Consequently, when the chemical formula is provided, or even if only partial information about the atom species is available, we can still reconstruct the structure with the correct atom types.
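This normalization is a one-line operation; the sketch below (with a hypothetical `intensities` array of peak heights) illustrates it:

```python
import numpy as np

def normalize_pattern(intensities: np.ndarray) -> np.ndarray:
    """Scale a powder XRD pattern so its highest peak has intensity 1."""
    return intensities / intensities.max()
```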

We reiterate that the simulated patterns we use are of higher quality than those collected in experimental settings, due to the lack of noise, e.g., peak broadening, missing/extraneous peaks. Thus, we note that performance is expected to fall off for real data compared to the simulated diffraction patterns. Tests on real data will be the subject of a future study.

We also incorporate the chemical composition, that is, the molar ratios of the elements contained in the material. We include this because chemical composition is often known, at least to some degree. We also test the robustness of the model by ablating this information to various degrees in our experiments.

For the training, validation, and testing data, we use electron density maps from Materials Project DFT calculations57,58,59. These are in a crystallographic basis, which depends discontinuously on the crystal system and details of the unit cell size and shape as we move from one material to another. We resample the electron densities within the unit cell onto a grid that has 50 voxels along each axis, with the locations of the voxels expressed in fractional coordinates. We use PyRho (a library from the Materials Project)59 to do this via Fourier interpolation. The charge densities are further normalized to be expressed in e Å−3. This will give different spatial resolutions for different structures, but has the advantage that it gives a representation that is a uniformly shaped array for all materials.

We call this quantity the Cartesian mapped electron density (CMED). The result of the normalization and resampling is a grid of 50 × 50 × 50 voxels. For visualization we can project this onto a Cartesian coordinate system with orthonormal basis vectors. The CMED is distorted from the real electron density by the procedure, but it allows us to visualize all structures, from all crystal systems and unit cells, on the same coordinate system. However, more importantly, it allows in principle a single ML model to be trained on structures from all the different space groups and crystal systems.

We stress the importance of CMED’s uniformly shaped array for all materials. In earlier experiments, we attempted to predict electron densities in raw Cartesian space (e.g., electron density queries were at exact ångström positions). The issue was that the output domain was not well-bounded, so we needed to train with large maximum (x, y, z) coordinates. While this was reasonable for crystals with very large unit cells, it did not work well on crystals with small unit cells, as the training objective was too sparse. In contrast, the CMED representation maps every unit cell to a space where the coordinates are well-bounded, which makes training much more tractable.
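To illustrate the resampling step described above, the sketch below performs plain FFT-based Fourier interpolation of a periodic density onto a 50 × 50 × 50 fractional-coordinate grid. In practice we use PyRho for this; the NumPy version here is illustrative only:

```python
import numpy as np

def fourier_resample(rho: np.ndarray, out_shape=(50, 50, 50)) -> np.ndarray:
    """Resample a periodic 3D density onto a new grid via Fourier
    interpolation: zero-pad (or truncate) the centered spectrum, then
    inverse-transform, rescaling so function values are preserved."""
    spectrum = np.fft.fftshift(np.fft.fftn(rho))
    out = np.zeros(out_shape, dtype=complex)
    slices_in, slices_out = [], []
    for n_in, n_out in zip(rho.shape, out_shape):
        n = min(n_in, n_out)  # overlapping (low-frequency) region
        slices_in.append(slice((n_in - n) // 2, (n_in - n) // 2 + n))
        slices_out.append(slice((n_out - n) // 2, (n_out - n) // 2 + n))
    out[tuple(slices_out)] = spectrum[tuple(slices_in)]
    scale = np.prod(out_shape) / np.prod(rho.shape)
    return np.fft.ifftn(np.fft.ifftshift(out)).real * scale
```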

To get from the CMED predicted by our model to an undistorted electron density, the inverse mapping must be carried out. If the unit cell of the unknown structure is indexed and the lattice parameters are known, this is straightforwardly done by plotting the voxels in the same order in the other basis.
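As a sketch, if A is the 3 × 3 lattice matrix of the indexed cell (rows are the lattice vectors in Å), each fractional-coordinate voxel maps back to an undistorted Cartesian position via r = f · A (illustrative NumPy; the voxel centers are assumed here to lie at (i + 0.5)/n):

```python
import numpy as np

def cmed_voxels_to_cartesian(lattice: np.ndarray, n: int = 50) -> np.ndarray:
    """Map fractional voxel centers of an n^3 CMED grid to Cartesian
    positions (in Å) for a cell with 3x3 lattice matrix `lattice`
    (rows = lattice vectors). Returns an (n, n, n, 3) array."""
    frac = (np.arange(n) + 0.5) / n                 # fractional centers
    fx, fy, fz = np.meshgrid(frac, frac, frac, indexing="ij")
    frac_coords = np.stack([fx, fy, fz], axis=-1)   # (n, n, n, 3)
    return frac_coords @ lattice                    # r = f · A
```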

In practice, we seek an end-to-end procedure that can discover the unit cell parameters as part of the automated process. This has not been done in the current paper, but we believe it will be straightforward. Indeed, there is already evidence that such information can be obtained straightforwardly by ML25,60,61.

Neural network design

See Fig. 6 for the layer-by-layer neural network architecture. See Fig. 7 for a mid-level system diagram that shows how the components interact. The XRD, chemical composition, and spatial positions are inputted into the model and processed by separate branches. The XRD and chemical composition embeddings are fused with each other via concatenation. Then, they are fused with the spatial position embedding via FiLM62. That fused representation is then passed to the charge regressor, which predicts the charge density at the queried spatial positions. In total, our model has 14,775,187 parameters.
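The overall data flow can be summarized in a few lines of illustrative pseudocode (PyTorch-style; the submodule names are placeholders for the branches shown in Fig. 6, not identifiers from our code):

```python
import torch

def crystalnet_forward(model, xrd, formula, xyz):
    """Illustrative forward pass: encode XRD and formula, fuse them,
    FiLM-condition the positional encoding, and regress charge density."""
    e_d = model.xrd_encoder(xrd)          # (batch, 512), variational sample
    e_f = model.formula_encoder(formula)  # (batch, 512), variational sample
    gamma, beta = model.fusion(torch.cat([e_d, e_f], dim=-1))
    p = model.positional_encoder(xyz)     # (batch, 512)
    cond = gamma * p + beta               # FiLM conditioning, Eq. (4)
    return model.regressor(cond)          # charge density at (x, y, z)
```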

Fig. 6: CrystalNet architecture.
figure 6

a shows the architecture of the x-ray diffraction encoder, b shows the architecture of the elemental composition encoder, c shows the architecture of the feature fusion network, d shows the architecture of the positional encoder, e shows the architecture of the conditioning network, and f shows the architecture of the final charge density regressor.

Fig. 7: CrystalNet system diagram.
figure 7

XRD and chemical input are passed through encoders, which output latent feature distributions. Sampling from those latent distributions, we fuse the features with the positional (x, y, z) query, and pass the fused information through the charge density regressor to get our final charge density prediction.

We adopt a variational approach38,45 for powder XRD and formula embedding prediction. In particular, rather than deterministically predicting the embeddings, we predict the means and standard deviations of the embedding distributions, which are modeled as multivariate Gaussian distributions. Thus, we have

$$E({\bf{x}}) \sim {N}(\mu ({\bf{x}}),{\sigma }^{2}({\bf{x}}))$$
(1)

where E is a sample from the distribution of formula- or XRD-conditioned embeddings, x is the corresponding formula or XRD input, μ is the neural network function that regresses the mean, and σ is the neural network function that regresses the standard deviation. We use the reparameterization

$$E({\bf{x}})=\mu ({\bf{x}})+\epsilon * \sigma ({\bf{x}})$$
(2)

where ϵ is unit Gaussian noise, to make the process differentiable38.
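In code, Eq. (2) amounts to a few lines (a PyTorch sketch; following common practice we assume the encoder regresses log σ² rather than σ, which is numerically convenient but otherwise equivalent):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample E(x) = mu + eps * sigma with eps ~ N(0, I), keeping the
    sampling step differentiable with respect to mu and sigma."""
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)
    return mu + eps * sigma
```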

We justify this variational approach with the following reasons: (1) Crystallographic inference, i.e., predicting molecular structure given XRD and formula, can be a one-to-many problem, so a non-deterministic approach is appropriate for modeling these multiple outputs. (2) Crystallography is an iterative design process. The variational approach allows us to resample candidate structures, if the first prediction is not appropriate. (3) Variational approaches allow the model to learn a smoother latent space, which may generalize better to out-of-training-distribution inputs38.

We note that although Fig. 6 depicts the powder XRD (Panel a) and formula (Panel b) encoders as deterministic networks, this is only for the sake of simplicity in the illustration. In reality, we have two versions of each network, one for regressing μ, and the other for regressing σ, which are then combined to produce the actual embedding, according to Eq. (2).

The powder XRD encoder is shown in Fig. 6, Panel a; in short, the goal of this branch is to extract relevant information from the sparse XRD pattern. The inputs are the extracted peaks xd from the x-ray diffraction patterns, which are normalized such that the highest peak is at intensity 1. They are represented as vectors with 1024 pixels of resolution, where the value at each pixel represents the intensity of the diffraction pattern at that location. The outputs are 512-dimensional embeddings Ed = Ed(xd).

The architecture is an adaptation of the DenseNet architecture for vector (rather than image matrix) inputs, with the most important design characteristic being the densely connected concatenations between convolutional feature maps63. The convolutional feature maps provide the important inductive bias of translational invariance, since (at least in early stages of processing) we wish to extract low-level features from all XRD peaks, regardless of where they are located, in essentially the same way. The dense connections promote integration of low-level and high-level features that may both be important to solving the task. Every convolutional layer (except the last one) is followed by LayerNorm64 and ReLU; the final linear layer is followed by BatchNorm56. We re-emphasize that technically, we have two versions of the XRD encoder under our variational framework: one for regressing μd(xd), and one for regressing σd(xd), to construct Ed(xd) as defined in Eq. (2).

The formula encoder is shown in Fig. 6, Panel b; this branch is intended to extract relevant information about the chemistry that complements the information contained in the XRD peaks. The input is the empirical formula, represented as a 118-dimensional vector xf, where each index of the vector refers to the normalized amount (as defined by number of atoms) of the element with that atomic number that is contained in the formula. (For instance, if the formula were H2O, we would first normalize it to H0.67O0.33. The resultant vector would contain 0.67 at index 1, the atomic number of hydrogen; 0.33 at index 8, the atomic number of oxygen; and 0 everywhere else.) The output is a 512-dimensional embedding Ef = Ef(xf).
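Constructing this 118-dimensional input is straightforward; the sketch below (with a deliberately truncated element table) illustrates the encoding:

```python
import numpy as np

# Hypothetical lookup from element symbol to atomic number; a real
# implementation would cover all 118 elements.
ATOMIC_NUMBER = {"H": 1, "O": 8}

def composition_vector(formula_counts: dict) -> np.ndarray:
    """Encode, e.g., {'H': 2, 'O': 1} as a normalized 118-dim vector:
    0.67 at the hydrogen entry, 0.33 at the oxygen entry, 0 elsewhere."""
    x = np.zeros(118)
    total = sum(formula_counts.values())
    for element, count in formula_counts.items():
        x[ATOMIC_NUMBER[element] - 1] = count / total  # 0-based index
    return x
```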

The architecture is a simple MLP, in which every linear layer is followed by BatchNorm (which improves stability and convergence speed)56 and ReLU. The only exception is that the last layer does not use ReLU. We reiterate that we use a variational framework for regressing Ef(xf), which technically necessitates two versions of the encoder, one for μf(xf), and one for σf(xf).

The feature fusion network is shown in Fig. 6, Panel c; this network is designed to integrate the XRD and chemical information into one unified representation. The inputs are the concatenated embeddings from the XRD encoders and formula encoders, such that we have a 1024-dimensional combined embedding. This combined embedding then gets passed through two MLPs with four linear layers each, and BatchNorm56 and ReLU following every linear layer. The outputs are two 512-dimensional embeddings, one for multiplicative interactions (labeled γ(Ed, Ef)), the other for additive interactions (labeled β(Ed, Ef)) with the positional encoding (described in next section).

The positional encoder is shown in Fig. 6, Panel d; its function is to convert the positional information into a format that can meaningfully interact with the aforementioned XRD and chemical information. It takes the (x, y, z) coordinates as input. The inputted coordinates are normalized and centered, such that −0.5 ≤ x, y, z ≤ +0.5. The output is a 512-dimensional positional embedding.

This approach of querying specific coordinates, as compared to directly predicting a voxel grid, is advantageous because, in principle, it allows us to represent electron density maps with arbitrary precision. (That being said, in our work, the maximum resolution is the 50³ grid.)

To process the input, we use modified random Fourier features41, according to the formula:

$$p([x,y,z])=\,\text{MLP}\,\left(\left[\sin \left({\bf{B}}\cdot {[x,y,z]}^{{\rm{T}}}\right),\cos \left({\bf{B}}\cdot {[x,y,z]}^{{\rm{T}}}\right)\right]\right)$$
(3)

We generate the frequency matrix \({\bf{B}}\in {{\mathbb{R}}}^{m\times 3}\), where each \({{\bf{B}}}_{ij} \sim {N}(0,{\sigma }^{2})\). (We set m = 128, σ = 3.) Then, we calculate \({x}^{{\prime} }={\bf{B}}\cdot {[x,y,z]}^{{\rm{T}}}\), which represents linear combinations of the coordinates. Next, we calculate \(\sin ({x}^{{\prime} })\) and \(\cos ({x}^{{\prime} })\) and concatenate them to get a 2m-dimensional pseudo-Fourier series representation. We employ this coordinate transformation for two reasons: (1) it approximates a high-dimensional Fourier series of the charge density map, which allows the model to capture high-frequency features (shown via Neural Tangent Kernel theory41,65); (2) the cosine (periodic even function) and sine (periodic odd function) parameterizations allow us to encode the many inherent symmetries1 of crystals. Finally, we pass the 2m-dimensional pseudo-Fourier series representation through two linear layers, with a BatchNorm56 and ReLU in between, to get our positional encoding p([x, y, z]).
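A sketch of this positional encoder in PyTorch follows (consistent with the description above, though not our exact implementation):

```python
import torch
import torch.nn as nn

class FourierPositionalEncoder(nn.Module):
    """Random Fourier features followed by a small MLP, as in Eq. (3)."""

    def __init__(self, m: int = 128, sigma: float = 3.0, out_dim: int = 512):
        super().__init__()
        # Fixed random frequency matrix B with entries ~ N(0, sigma^2).
        self.register_buffer("B", torch.randn(m, 3) * sigma)
        self.mlp = nn.Sequential(
            nn.Linear(2 * m, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (batch, 3) coordinates normalized to [-0.5, 0.5].
        proj = xyz @ self.B.T                                  # (batch, m)
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.mlp(feats)                                 # (batch, 512)
```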

The feature conditioner is shown in Fig. 6, Panel e. This part of the architecture combines the information from all the previous branches (XRD, chemical, and positional) so that it can be fed into the final charge density regressor. It takes as input the multiplicative embeddings γ(Ed, Ef), additive embeddings β(Ed, Ef), and positional encoding p([x, y, z]). It outputs P, the 512-dimensional feature-conditioned positional encoding.

The feature-conditioned positional encoding P is calculated as:

$${\bf{P}}=\gamma ({{\bf{E}}}_{d},{{\bf{E}}}_{f})* p([x,y,z])+\beta ({{\bf{E}}}_{d},{{\bf{E}}}_{f})$$
(4)

This is known as feature-wise linear modulation (FiLM)62. It is effective because it allows both multiplicative and additive interactions during feature conditioning, which increases expressivity. (In contrast, traditional concatenation-based approaches to feature conditioning have been shown to capture only additive interactions66.)

The charge density regressor is shown in Fig. 6, Panel f; it is responsible for predicting the final structure of the crystal. The input is P, the feature-conditioned positional encoding. The output is the charge density at the corresponding (x, y, z) coordinates that P was generated from. One major advantage of designing the network to continuously output electron density at arbitrary query points (as opposed to outputting a set of discrete atomic coordinates, for instance) is that we can predict structures without needing to know a priori how many atoms are contained in the material.

The architecture is an MLP with BatchNorm56 and ReLU after every layer, except for the final layer. It also uses skip connections to encourage feature reuse, inspired by DeepSDF43 and NeRF40.

Training process

We minimize L1 Loss on the predicted charge densities, averaged over the entire batch:

$${L}_{1}({\rho }_{{\rm{pred}}},{\rho }_{{\rm{gt}}})=| {\rho }_{{\rm{pred}}}-{\rho }_{{\rm{gt}}}|$$
(5)

Minimizing this loss encourages the predicted output to match the ground truth output. We call this the reconstruction loss.

We also simultaneously minimize a KL-Divergence Loss on the predicted mean μd(xd), μf(xf) and standard deviation σd(xd), σf(xf) of the distribution of embedding vectors Ed(xd), Ef(xf)38,45,67, similar to that in β-VAE45:

$$\beta {L}_{KL}(\mu ({\bf{x}}),\sigma ({\bf{x}}))=-\frac{\beta }{2N}\mathop{\sum }\limits_{i=1}^{N}\left(1+\log \left({\sigma }_{i}{({\bf{x}})}^{2}\right)-{\mu }_{i}{({\bf{x}})}^{2}-{\sigma }_{i}{({\bf{x}})}^{2}\right)\propto {D}_{KL}({q}_{\phi }(E({\bf{x}})| {\bf{x}})\,| | \,p(E({\bf{x}})))$$
(6)

where E(x) = μ(x) + ϵ ⋅ σ(x), N = dim E(x) = dim μ(x) = dim σ(x) = 512 is the dimensionality of the embedding vector, qϕ(E(x) ∣ x) is the conditional distribution of the embedding vectors given the inputted XRD pattern or chemical formula, and β is a weighting parameter applied to the loss. This closed form is possible because we parameterize p(E(x)) as \({N}(0,{\bf{I}})\), following Kingma and Welling, who include a derivation in their paper38. Intuitively, minimizing this loss encourages the XRD and formula embedding vectors to match multivariate Gaussian distributions, which not only smooths the latent space but also encourages variation in the outputs, so that we can conduct an iterative refinement process in this one-to-many problem.

Thus, the total loss to be minimized is the sum of Equations (5) and (6):

$$L={L}_{1}({\rho }_{{\rm{pred}}},{\rho }_{{\rm{gt}}})+\beta {L}_{KL}({\mu }_{d}({{\bf{x}}}_{d}),{\sigma }_{d}({{\bf{x}}}_{d}))+\beta {L}_{KL}({\mu }_{f}({{\bf{x}}}_{f}),{\sigma }_{f}({{\bf{x}}}_{f}))$$
(7)

The β term controls the balance between the reconstruction and KL terms. Empirically, we set β = 0.05.
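For concreteness, the combined objective of Eq. (7) reduces to a few lines (a PyTorch sketch, again assuming the encoders regress log σ²):

```python
import torch

def kl_loss(mu: torch.Tensor, logvar: torch.Tensor,
            beta: float = 0.05) -> torch.Tensor:
    """beta-weighted KL divergence of N(mu, sigma^2) from N(0, I),
    averaged over embedding dimensions as in Eq. (6)."""
    return -0.5 * beta * torch.mean(1 + logvar - mu**2 - logvar.exp())

def total_loss(rho_pred, rho_gt, mu_d, logvar_d, mu_f, logvar_f):
    """Reconstruction plus the two KL terms, as in Eq. (7)."""
    recon = torch.mean(torch.abs(rho_pred - rho_gt))  # L1 loss, Eq. (5)
    return recon + kl_loss(mu_d, logvar_d) + kl_loss(mu_f, logvar_f)
```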

We considered incorporating XRD adherence into our loss function68, but we ultimately did not. This choice was made because it is not straightforward to compute the diffraction pattern from the CMED representation directly without carrying out an inverse transform, and we wanted to use a more direct objective for reconstruction performance, like L1 Loss.

We train our model to minimize the total loss from Equation (7) for 1500 epochs, with 128 crystals per batch at a resolution of 10³ sampled charge densities per crystal. The charge densities are sampled via stratified bin sampling, where \(x,y,z \sim {\text{Uniform}}[\frac{i}{S},\frac{i+1}{S}]\) for each bin index i ∈ {0, …, S − 1} (we set S = 10); over the course of the optimization procedure, this probabilistically allows us to capture fine-resolution details of the electron density field despite processor memory limits for individual batches40.
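A sketch of this stratified sampling scheme follows (in NumPy; the fractional points lie in [0, 1)³ and can then be shifted to the [−0.5, 0.5] range used by the positional encoder):

```python
import numpy as np

def stratified_samples(S: int = 10) -> np.ndarray:
    """Draw one fractional coordinate per bin and axis:
    x, y, z ~ Uniform[i/S, (i+1)/S] for each bin index i, giving an
    S^3 set of jittered query points in [0, 1)^3."""
    edges = np.arange(S) / S                        # left bin edges
    x, y, z = np.meshgrid(edges, edges, edges, indexing="ij")
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts + np.random.uniform(0, 1 / S, size=pts.shape)
```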

We use the Adam69 optimizer. We follow a cosine annealing schedule with warm restarts70, in which the learning rate decays from 10−3 to 10−6, then increases back to 10−3 and decays again to 10−6 over another cycle that has double the number of epochs: this helps the optimization procedure break out of local minima. The initial cycle length is 100 epochs, and increases to 200, 400, and 800 on the subsequent cycles, to constitute the 1500 total epochs.
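This schedule corresponds to PyTorch’s built-in cosine annealing with warm restarts; a sketch (where `model` stands in for the CrystalNet module and the training step is elided):

```python
import torch

# Cycle lengths of 100, 200, 400, and 800 epochs follow from
# T_0=100 and T_mult=2, totaling the 1500 training epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=100, T_mult=2, eta_min=1e-6)

for epoch in range(1500):
    # ... one epoch of training with the loss from Equation (7) ...
    scheduler.step()
```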

As data augmentation, we randomly add small Gaussian perturbations from \({N}(0,0.00{1}^{2}{\bf{I}})\) to the inputted XRDs and chemical formula ratios (the perturbed input undergoes a ReLU, since we cannot have negative peaks or ratios). We also randomly shift the XRD patterns by less than 0.6°.

At the end of each epoch, we save the version of the model that has the highest SSIM47 score on the validation set, where the model is given two guesses for each structure and the rotation (out of 24) of the predicted structure that gives the highest SSIM score with the ground truth is used.

Evaluation setup

We run through the testing dataset, and give the model 5 tries (via sampling from the latent space in the variational framework) to predict each crystal structure. (We give the model multiple tries because crystallography is typically an iterative refinement process, so we consider our model successful if it can give a good guess.) For each guess, we rotate the predicted crystal 24 ways (in multiples of 90° about the x, y, z unit cell axes) and take the best SSIM47 and PSNR48 over all these rotations, as compared to the ground truth crystal. Finally, we report the best results over all rotations of all guesses of each crystal structure.
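The 24 orientations can be enumerated as the proper (determinant +1) signed axis permutations of the voxel grid; a NumPy sketch:

```python
import numpy as np
from itertools import permutations, product

def rotations_24(volume: np.ndarray):
    """Yield the 24 proper rotations of a cubic voxel grid, i.e., all
    axis permutations with sign flips whose matrix has determinant +1
    (the other 24 of the 48 combinations are reflections)."""
    for perm in permutations(range(3)):
        for flips in product([1, -1], repeat=3):
            mat = np.zeros((3, 3))
            for i, (axis, sign) in enumerate(zip(perm, flips)):
                mat[i, axis] = sign
            if np.linalg.det(mat) > 0:        # keep proper rotations
                v = np.transpose(volume, perm)
                for axis, sign in enumerate(flips):
                    if sign < 0:
                        v = np.flip(v, axis=axis)
                yield v
```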

At evaluation time, we sample evenly to get a 50 × 50 × 50 charge density map (i.e., 3D grid) for each crystal. We then use 3D SSIM47 and PSNR48 as our evaluation metrics on the resultant 3D grid.

SSIM ranges from 0 to 1, where higher is better. It compares the structural similarity of the ground truth charge density map with the predicted charge density map. SSIM is calculated over patches of the 3D structure with a sliding cubic window of side length 7, and then averaged over all such patches. The patch-wise formula is:

$$SSIM({\bf{x}},{\bf{y}})=\frac{(2{\mu }_{x}{\mu }_{y}+{C}_{1})(2{\sigma }_{xy}+{C}_{2})}{\left({\mu }_{x}^{2}+{\mu }_{y}^{2}+{C}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2}\right)}$$
(8)

where x, y are the spatially corresponding patches of the ground truth and predicted electron density maps, μx, μy are the mean charge densities (i.e., intensity) in those patches, σx, σy are the standard deviations of the charge densities (i.e., contrast) in those patches, σxy is the covariance between the position-wise charge densities in those patches, and C1, C2 are small constants for numerical stability.

PSNR stands for peak signal-to-noise ratio, where higher is better. Its value is unbounded above, with a perfect reconstruction giving infinite PSNR. Typically, values above 30 are considered good49. PSNR is calculated as follows71:

$$PSNR({{\bf{X}}}_{gt},{{\bf{X}}}_{pred})=10\,{\log }_{10}\left(\frac{{\left(\max ({{\bf{X}}}_{gt})-\min ({{\bf{X}}}_{gt})\right)}^{2}}{MSE\left({{\bf{X}}}_{gt},{{\bf{X}}}_{pred}\right)}\right)$$
(9)

where Xgt, Xpred are the ground truth and predicted charge density maps, respectively.
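Both metrics are available in scikit-image, which accepts volumetric inputs; a sketch under the assumption that both maps are 50 × 50 × 50 float arrays:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(rho_gt: np.ndarray, rho_pred: np.ndarray):
    """3D SSIM (sliding cubic window of side 7) and PSNR between the
    ground truth and predicted charge density maps, using the ground
    truth's dynamic range as in Eq. (9)."""
    data_range = rho_gt.max() - rho_gt.min()
    ssim = structural_similarity(rho_gt, rho_pred,
                                 win_size=7, data_range=data_range)
    psnr = peak_signal_noise_ratio(rho_gt, rho_pred, data_range=data_range)
    return ssim, psnr
```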

Formula ablation experiment

To reduce the computational burden of these formula ablation studies, we make a few modifications. In the optimization loop, we train for only 700 total epochs, use 8³ samples per crystal, and draw only 1 sample from the latent space per validation crystal. Additionally, in testing, we give the model 3 tries (instead of 5) to predict each crystal via variational sampling. We can make these modifications because the purpose of these ablation experiments is to compare the predictive ability of the model at varying degrees of chemical composition information, rather than to optimize the model to perfect predictive ability. For fair comparison, we also recalculate the baseline model performance according to these pared-down protocols.