Identification of crystal symmetry from noisy diffraction patterns by a shape analysis and deep learning

The robust and automated determination of crystal symmetry is of utmost importance in material characterization and analysis. Recent studies have shown that deep learning (DL) methods can effectively reveal the correlations between X-ray or electron-beam diffraction patterns and crystal symmetry. Despite their promise, most of these studies have been limited to identifying relatively few classes into which a target material may be grouped. On the other hand, the DL-based identification of crystal symmetry suffers from a drastic drop in accuracy for problems involving classification into tens or hundreds of symmetry classes (e.g., up to 230 space groups), severely limiting its practical usage. Here, we demonstrate that a combined approach of shaping diffraction patterns and implementing them in a multistream DenseNet (MSDN) substantially improves the accuracy of classification. Even with an imbalanced dataset of 108,658 individual crystals sampled from 72 space groups, our model achieves 80.12 ± 0.09% space group classification accuracy, outperforming conventional benchmark models by 17–27 percentage points (%p). The enhancement can be largely attributed to the pattern shaping strategy, through which the subtle changes in patterns between symmetrically close crystal systems (e.g., monoclinic vs. orthorhombic or trigonal vs. hexagonal) are well differentiated. We additionally find that the MSDN architecture is advantageous for capturing patterns in a richer but less redundant manner relative to conventional convolutional neural networks. The proposed protocols in regard to both input descriptor processing and DL architecture enable accurate space group classification and thus improve the practical usage of the DL approach in crystal symmetry identification.


Introduction
High-throughput material synthesis and characterization have been popular topics of research during the last few decades and have accelerated the discovery of novel materials [1][2][3][4][5]. Although various characterization methods exist, identifying the crystal symmetry, i.e., the way the atoms are arranged in space, is inarguably the first and most important process in material characterization. This is because the crystallographic structure of a material plays an important role in determining the material properties (structure-property relationship) 6,7. For a concrete example, consider the magnetism of iron: bcc Fe is ferromagnetic, while fcc Fe shows paramagnetic behavior 8. The most effective way to classify crystal symmetries is to find the group representing all transformations under which a system is invariant, namely, its space group. In three dimensions, there are 230 distinct types of space groups when chiral copies are considered [9][10][11]; these space groups are formed from the combinations of the 32 point groups with the 14 Bravais lattices 12. Manually determining the space group to which a target material belongs is a tedious and highly inefficient task due to the brute-force nature of the search algorithms, which are based on matching diffraction patterns to those in a database, such as the Crystallography Open Database or the Inorganic Crystal Structure Database 6,[13][14][15][16][17]. Thus, there is a strong and timely need for robust and automated assessment tools for crystal symmetry determination.
Techniques based on X-ray and electron-beam diffraction are the most relevant to the identification of crystal symmetries. The latest generation of tools for diffraction experiments allows the simultaneous collection of large volumes of data 18,19, the handling of which calls for big data techniques and machine-learning-based approaches. Several recent works have introduced regression models or deep learning (DL) models for material characterization. Liu et al. 20 refined atomic pair distribution functions in a convolutional neural network (CNN) to classify space groups. For similar purposes, Park et al. 21, Vecsei et al. 22, Wang et al. 23 and Oviedo et al. 24 used powder X-ray diffraction (XRD) 1D curves, for which information such as peak positions, intensities, and full widths at half maximum (FWHM) is mainly treated as the key input descriptors. In addition, Ziletti et al. 25 (in a parent work of this study), Aguiar et al. 26, Kaufmann et al. 27, and Ziatdinov et al. 28 developed DL models by extracting features from electron-beam-based 2D diffraction patterns. These studies clearly show that DL methods can effectively reveal correlations between diffraction data and crystal symmetry. Despite their promise, however, most of these studies have been limited to identifying relatively few classes or crystal systems into which a material can be grouped. DL-based methods of crystal structure determination work very well for problems with a small number of symmetry classes (fewer than 10); however, they suffer from a drastic drop in accuracy for more difficult problems involving classification into tens or hundreds of symmetry classes (e.g., up to 230 space groups), severely limiting their practical usage. A DL model that is capable of identifying hundreds of classes with a sufficiently high accuracy will be needed to realize a robust, automated, and ultimately self-driving microscopy system or laboratory [29][30][31].
In this work, considering the limitations imposed by the spotty and noisy distributions of raw diffraction patterns (DPs), we propose a solution, namely, shaped DPs in a multistream DenseNet (MSDN). Our new method greatly enhances the accuracy of space group classification. Even for an imbalanced dataset of 108,658 crystals sampled from 72 space groups, the model achieves an accuracy of 80.2%, exceeding the performance of benchmark methods by 17-27 percentage points (%p). We find that the shaping strategy enhances the uniqueness of the raw DPs; hence, even small observable differences between raw images of symmetrically close crystal systems (e.g., monoclinic vs. orthorhombic or trigonal vs. hexagonal) become pronounced. In addition, the introduction of the MSDN allows the patterns to be captured in a richer but less redundant manner than is possible in a standard CNN. Owing to their substantial performance enhancements, our proposed methodological protocols show promise for improving the practical usage of DL approaches in crystal symmetry determination.

Shaped diffraction patterns in a multistream DenseNet
Raw DPs are spotty and noisy and, thus, difficult to learn from. To enhance the capabilities of DL, we propose two ideas: one is to shape the DPs, and the other is to implement them in a multistream DL network (Figure 1). The former strategy is to refine the raw DPs by selectively connecting nodes, which transforms them into shaped DPs. One can expect three possible benefits from shaped DPs: (1) the learning objective becomes more solid; (2) by controlling the shaping criteria, it is possible to maximize the uniqueness of each diffraction pattern; and (3) the added lines may amplify critical information such as lattice parameters (length, angles, etc.). We hypothesize that these benefits will result in improved deep learning of crystal symmetries. Shaped DPs are produced as follows. First, raw DPs are collected from three orthogonal zone axes (the x-, y-, and z-axes) in the Condor software with an incident beam wavelength λ of 3.5×10⁻¹² m 32. In [...] Euclidean distance function. We draw interpolated lines only for node pairs with a distance smaller than a certain threshold, i.e., 1.7×min(distN*). The prefactor 1.7 was determined after extensive tests: the shapes become too complex with a larger threshold value, whereas the shapes are not clearly formed with a smaller threshold value. The colors R, G, and B are used for lines in images of the x-, y-, and z-axes, respectively. Thus, the shaped DP, or S*, is calculated as R* + ∑lineplot(N*,i, N*,j), where the sum ∑ is taken over the selected node pairs and lineplot(•) is the interpolation function. As shown in the scheme of the DP shaping process (Figure 1b), the lineplot(•) function is dependent on the node sizes; as a result, the line thickness will differ for different node pairs. Additional information related to the DP shaping protocols is provided in the Methods and in Supplementary Figure 1. As seen in the examples from several space groups presented in Figure 1b and Supplementary Figure 2, the shaped DPs are more solid and much less noisy than the raw versions. The resulting shapes comprise composition information that describes the particular regions of interest that are useful for representing DPs in more unique manners.
For the further processing of multiple inputs (DPs collected from the three zone axes), we propose a novel multistream network, namely, an MSDN, as shown in Figure 1c. In the MSDN, three substream DenseNets are applied in parallel to each shaped DP; these DenseNets share all of their parameters (weights W and biases b). The idea of sharing parameters is warranted by the consistent learning process for all three shaped DPs (SR, SG, and SB). This imposes prior knowledge that the inputs to each substream are processed concurrently by the network, which substantially reduces the number of parameters in the MSDN. In addition, the MSDN utilizes the design concept of DenseNet 33, in which all layers are densely connected (Figure 1c); in contrast, in a standard CNN, the features in each conv layer are used as input to the next layer without communication. The superior performance of DenseNets over standard CNNs has been previously reported in the field of image learning and classification [33][34][35]. Likewise, in the present study on the processing of DP images, the proposed MSDN is expected to create rich patterns while maintaining a low complexity of information, thus enabling better classification performance.
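The node-pair selection rule in the DP shaping step (connect only pairs closer than 1.7 times the minimum distance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_node_pairs`, the example coordinates, and the reading of min(distN*) as the smallest distance over all node pairs are assumptions, and the size-dependent line drawing performed by lineplot(•) is omitted.

```python
import numpy as np


def select_node_pairs(nodes, prefactor=1.7):
    """Return index pairs (i, j), i < j, closer than prefactor * min distance."""
    nodes = np.asarray(nodes, dtype=float)
    if len(nodes) < 2:
        return []
    # Full matrix of pairwise Euclidean distances between node centers.
    diff = nodes[:, None, :] - nodes[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle, diagonal excluded).
    i_idx, j_idx = np.triu_indices(len(nodes), k=1)
    pair_dist = dist[i_idx, j_idx]
    # Assumption: min(dist) is the smallest distance over all node pairs.
    threshold = prefactor * pair_dist.min()
    keep = pair_dist < threshold
    return [(int(i), int(j)) for i, j in zip(i_idx[keep], j_idx[keep])]


# Hypothetical node centers on a 1 x 2 rectangle (arbitrary units).
nodes = [(0, 0), (1, 0), (0, 2), (1, 2)]
pairs = select_node_pairs(nodes)
print(pairs)
```

For this rectangle only the two short edges (length 1) are connected, because the long edges (length 2) and the diagonals (length √5) exceed the threshold of 1.7.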
The MSDN concurrently accepts and processes shaped DPs, i.e., SR, SG, and SB, to extract a better feature representation from each substream for space group classification. Specifically, each layer in each DenseNet receives the inputs from all preceding layers and passes its features to all subsequent layers, meaning that the final output layer has direct supervision over every single layer. As a result, the network offers stronger feature propagation for the extraction of collective knowledge in the inference process.
Regarding the network configuration, the MSDN used in this study consists of four dense-block (DB) layers and three transition layers in each substream network, as shown in Figure 1c and Supplementary Table 1.
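The two architectural ideas above, dense connectivity within a block and one parameter set shared by the three substreams, can be illustrated with a toy NumPy sketch. It is not the MSDN itself: fully connected layers stand in for conv layers, and the names `dense_block` and `make_weights` as well as the sizes `in_dim`, `growth`, and `n_layers` are hypothetical choices made for illustration.

```python
import numpy as np


def dense_block(x0, weights):
    """Each layer sees the concatenation of all earlier feature vectors."""
    features = [x0]
    for W in weights:
        stacked = np.concatenate(features)            # [x_0, ..., x_{l-1}]
        features.append(np.maximum(W @ stacked, 0.0))  # linear map + ReLU
    return np.concatenate(features)


def make_weights(in_dim, growth, n_layers, rng):
    """Layer l maps (in_dim + l * growth) inputs to `growth` new features."""
    return [0.1 * rng.standard_normal((growth, in_dim + l * growth))
            for l in range(n_layers)]


rng = np.random.default_rng(0)
in_dim, growth, n_layers = 8, 4, 3

# One parameter set, reused by the R, G, and B substreams.
shared = make_weights(in_dim, growth, n_layers, rng)

# Hypothetical flattened inputs standing in for the shaped DPs S_R, S_G, S_B.
streams = {c: rng.standard_normal(in_dim) for c in "RGB"}
outputs = {c: dense_block(x, shared) for c, x in streams.items()}
print({c: out.shape for c, out in outputs.items()})
```

Because the weights are shared, adding substreams does not add parameters; only the number of forward passes grows.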

Dataset
A large-scale collection of diffraction patterns for 108,658 materials sampled from 72 space groups was acquired. These 72 space groups (out of a total of 230) were selected based on the criterion that each group should be represented by at least 295 materials in the Materials Project (MP) library 36, as shown in Figure 2a. There are too few materials (mostly <100) available for the remaining space groups in the MP library, which were therefore excluded from DL training and testing. The selected space groups include 2 triclinic, 12 monoclinic, 22 orthorhombic, 13 tetragonal, 6 trigonal, 8 hexagonal, and 9 cubic crystal systems. Because we downloaded the full list of materials for each space group, the dataset is highly imbalanced, ranging from 295 materials for space group #223 to 8,700 materials for space group #14. For the following DL experiments on space group classification, we constructed datasets consisting of 8, 20, 49, and 72 space groups (SGs), as shown in Figure 2b. The number of materials in each space group is tabulated in Supplementary Table 2.
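The selection criterion (keep a space group only if it has at least 295 materials) and the resulting imbalance can be expressed in a few lines. In this sketch the counts for space groups #14 and #223 are taken from the text, whereas the other two entries and the function name `select_space_groups` are hypothetical.

```python
def select_space_groups(counts, min_materials=295):
    """Return the space group numbers with at least `min_materials` entries."""
    return sorted(sg for sg, n in counts.items() if n >= min_materials)


# Counts for #14 and #223 come from the text; #3 and #100 are made up.
counts = {14: 8700, 223: 295, 3: 80, 100: 294}

selected = select_space_groups(counts)
# Ratio between the most and least populated selected classes.
imbalance = max(counts[sg] for sg in selected) / min(counts[sg] for sg in selected)
print(selected, round(imbalance, 1))
```

With the two counts quoted in the text, the largest class is roughly 29.5 times the size of the smallest.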

Classification experiments with varying numbers of space groups
We conducted DL experiments to study the classification of space groups (Figure 3). To evaluate the impact of our strategy (shaped DPs in an MSDN), we performed comparisons with other benchmark models, i.e., spot DPs in AlexNet 37, DenseNet 33, ResNet 38, and VGGNet 39. Spot DPs, which were originally proposed in the work of Ziletti et al. 25, are the superimposed version of the raw DPs from R/G/B color channels. See the scheme in Supplementary Figure 3 for an exemplary illustration of spot DPs. The key parameter in our experiments was the number of space groups into which materials could be classified; we considered 8, 20, 49, and 72 (Figure 2b). In each case, the dataset was divided into 80% of the data for learning (training and validation) and 20% of the data for testing, with no overlap. In Figure 3a, beginning with the smallest-scale dataset (with 8 SGs), both our approach and the other benchmark models work excellently: ours shows 99.5% accuracy, while the others also achieve accuracies of above 94.5%. Notably, we closely reproduced the results of the state-of-the-art work of Ziletti et al. (over 99% for 8 SGs) 25, which indicates that our experiments are reliable.
Proceeding to more difficult problems, i.e., larger-scale datasets (20, 49, and 72 SGs), we observe that our strategy of shaped DPs in an MSDN performs substantially better than the benchmark models. In Figure 3a, our method achieves excellent top-1 classification accuracies of 99.5%, 93.0%, 84.4% and 80.2% for the 8 SG, 20 SG, 49 SG and 72 SG datasets, respectively. On the other hand, the other models based on spot DPs considerably underperform: even the leading model among the benchmarks (spot DPs in Ziletti et al.'s network) exhibits an accuracy of below 63% for the 72 SG dataset. This result demonstrates the relatively high tolerance of our model to an increasing number of space groups for classification, which is a critical requirement for its practical usage. We additionally measured the performance achieved with shaped DPs in a multistream VGGNet (MSVGG) in order to distinguish the contributions from the "shaped DP" and "MSDN" aspects of the proposed strategy. For the case of the 72 SG dataset, the total enhancement of 17 %p can be divided into a 10 %p contribution from the shaped DPs and the remaining 7 %p contribution from the MSDN, confirming that both strategies play critical roles.
Unlike in Figure 3a, in which only the top-1 classification performance is considered, the top-k (k=1−5) ranking accuracy is presented in Figure 3b-3e (for the 8, 20, 49, and 72 SG datasets, respectively).
We observe that for all cases, our strategy of shaped DPs in an MSDN performs the best regardless of the k value, followed by shaped DPs in an MSVGG. This once again confirms the superiority of shaped DPs over the conventional spot DPs as the descriptors used for crystal symmetry determination. For the smaller datasets (8 and 20 SGs), the classification is almost perfect (accuracy >99%) even at the top-2 ranking. For the larger datasets (49 and 72 SGs), the accuracy remains above 95% at the top-4 ranking (49 SG dataset) or the top-5 ranking (72 SG dataset).
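The top-k ranking accuracy used here counts a prediction as correct when the true class is among the k classes with the highest predicted probability. A minimal sketch of that metric is given below; the function name `top_k_accuracy` and the small probability table are hypothetical and serve only to show the computation.

```python
import numpy as np


def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Column indices of the k largest probabilities in each row.
    top_k = np.argsort(-probs, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Hypothetical class probabilities for four samples and three classes.
probs = [[0.60, 0.30, 0.10],
         [0.20, 0.50, 0.30],
         [0.10, 0.30, 0.60],
         [0.40, 0.35, 0.25]]
labels = [0, 2, 2, 1]

print(top_k_accuracy(probs, labels, k=1))
print(top_k_accuracy(probs, labels, k=2))
```

In this toy table two of the four samples are correct at top-1, and all four are recovered at top-2, which mirrors how the accuracy curves in Figure 3b-3e rise with k.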
The more challenging task of classification on an untrained space was also addressed in testing (Figure 3f). This task arises when the sample being tested does not fall into any of the space groups on which the classifier was previously trained. In this experiment, for testing purposes, we randomly sampled 3,052 materials from 30 additional space groups, which had no overlap with the aforementioned 72 SGs. A list of these 30 SGs is provided in Supplementary Table 3. These 3,052 material samples were divided into a reference set (50%) and a test set (50%). Next, classification was performed by measuring the cosine similarity distance between the reference set and each tested material. Details of the similarity distance calculation can be found in the Methods. Surprisingly, our network achieved a top-1 classification accuracy of 70.2% and reached 87.5% at the top-5 ranking. These accuracy values are impressively high, given that the tested materials belonged to SGs that were never considered in training.
The observed generalizability of our model is likely to be beneficial in real situations in which the tested materials are not part of the training space.
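A similarity-based classification of this kind can be sketched as a ranked nearest-reference search under cosine similarity. The sketch below rests on assumptions that are not taken from the Methods: feature vectors are assumed to be fixed-length embeddings produced by the trained network, each reference sample votes only through its own similarity, and the names `cosine_similarity` and `rank_space_groups` are hypothetical.

```python
import numpy as np


def cosine_similarity(vec, mat):
    """Cosine similarity between one vector and every row of a matrix."""
    vec = vec / np.linalg.norm(vec)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return mat @ vec


def rank_space_groups(test_feat, ref_feats, ref_labels, k=5):
    """Return up to k distinct reference labels, most similar first."""
    sims = cosine_similarity(np.asarray(test_feat, dtype=float),
                             np.asarray(ref_feats, dtype=float))
    ranked = []
    for idx in np.argsort(-sims):
        label = ref_labels[int(idx)]
        if label not in ranked:
            ranked.append(label)
        if len(ranked) == k:
            break
    return ranked


# Hypothetical 2D embeddings; labels are placeholder space group numbers.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ref_labels = [5, 5, 9]
ranking = rank_space_groups([0.1, 1.0], ref_feats, ref_labels, k=2)
print(ranking)
```

A test sample counts as a top-k hit when its true space group appears in the returned ranking, so the same top-k metric applies to trained and untrained space groups alike.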

Classification results for individual space groups
We investigated the classification results for individual space groups. Only the 49 SG and 72 SG cases were analyzed (Figure 4a and Figure 4b). An interesting observation for both benchmarks and our model is that the accuracy is generally higher for SGs in high-symmetry crystal systems. The classification process tends to work much better for cubic/hexagonal/trigonal systems than for monoclinic/orthorhombic ones. Triclinic systems are an exception, largely due to the insufficient number of materials belonging to these systems. In Figure 4c and Figure 4d, while the benchmarks show the highest accuracy for cubic systems, the accuracy of our model is the highest for trigonal and hexagonal systems rather than cubic systems. In particular, for the 49 SG dataset, it is observed that for all space groups corresponding to trigonal and hexagonal systems (#146−#194), the classification accuracy is excellent, being over 90%. The accuracy improvements in our model over the benchmarks appear to be universal for most SGs.
To identify the source of these improvements, we now decompose the contributions for each crystal system (Figure 4c and Figure 4d).
Next, we focus on further characterizing the incorrect classifications obtained from the benchmark (spot DPs+Ziletti et al.) and our model (shaped DPs+MSDN). In Figure 4e and 4f, for instance, the [monoclinic, orthorhombic] coordinate in the matrices represents the materials belonging to an SG corresponding to a monoclinic system that were incorrectly classified as belonging to an orthorhombic system. In the comparisons between the benchmark and our model, the most prominent changes are observed in two areas, i.e., the monoclinic/orthorhombic and trigonal/hexagonal pairs. This indicates that the benchmark model often finds it difficult to correctly classify SGs corresponding to monoclinic vs. orthorhombic systems or to trigonal vs. hexagonal systems, whereas our model performs much better in resolving this confusion. We speculate that such confusion may occur mainly between symmetrically close crystal systems. For instance, monoclinic and orthorhombic systems are very close in terms of lattice symmetry, differing only in the lattice angle requirements (90° angle requirements). Therefore, similar spot distributions in spot DPs can possibly arise even from materials from different crystal systems, which may undermine the performance of spot-DP-based benchmark models.
The matrices in Figure 4e and 4f show the distribution rates (%) of incorrect predictions for the 49 SG (e) and 72 SG (f) datasets. If the rate is, for example, 20% for the [monoclinic, orthorhombic] coordinate in a matrix, this means that 20% of the materials belonging to monoclinic systems in our dataset are incorrectly classified as belonging to SGs corresponding to orthorhombic systems. Red dotted boxes highlight the regions that are considerably different between the benchmark and our model.
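A matrix of this kind can be computed from true and predicted space group numbers by mapping each number to its crystal system and normalizing the misclassification counts by the number of materials in the true system. The sketch below uses the standard space group number ranges for the seven crystal systems; the function names and the toy predictions are hypothetical, and counting a wrong space group within the same crystal system on the diagonal is an assumption about how the figure was built.

```python
import numpy as np

SYSTEMS = ["triclinic", "monoclinic", "orthorhombic", "tetragonal",
           "trigonal", "hexagonal", "cubic"]
# Highest space group number belonging to each crystal system.
UPPER_BOUNDS = [2, 15, 74, 142, 167, 194, 230]


def crystal_system(sg):
    """Map a space group number (1-230) to its crystal system name."""
    if not 1 <= sg <= 230:
        raise ValueError(f"invalid space group number: {sg}")
    for name, upper in zip(SYSTEMS, UPPER_BOUNDS):
        if sg <= upper:
            return name


def misclassification_rates(true_sgs, pred_sgs):
    """rates[i, j]: % of system-i materials wrongly assigned to an SG of system j."""
    n = len(SYSTEMS)
    wrong = np.zeros((n, n))
    totals = np.zeros(n)
    for t, p in zip(true_sgs, pred_sgs):
        i = SYSTEMS.index(crystal_system(t))
        totals[i] += 1
        if t != p:  # only incorrect space group predictions are counted
            wrong[i, SYSTEMS.index(crystal_system(p))] += 1
    # Avoid division by zero for systems absent from the test data.
    return 100.0 * wrong / np.maximum(totals, 1)[:, None]


# Hypothetical predictions: one of five monoclinic samples goes to orthorhombic,
# and one of two trigonal samples goes to hexagonal.
true_sgs = [14, 14, 14, 14, 14, 166, 166]
pred_sgs = [14, 74, 14, 14, 14, 176, 166]
rates = misclassification_rates(true_sgs, pred_sgs)
print(rates[1, 2], rates[4, 5])
```

Reading a row of `rates` then shows where the errors of one crystal system end up, which is how the monoclinic/orthorhombic and trigonal/hexagonal confusions stand out.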
To further justify our observation that our model (shaped DPs+MSDN) can largely resolve the confusion between symmetrically close systems, we scrutinize the DPs of several test samples. Figure 5 shows exemplary cases in which spot DPs fail and shaped DPs succeed in yielding correct SG classifications. For the first two example pairs of mp-1076884 (SG #1, triclinic) vs. mp-6406 (SG #7, monoclinic) and mp-6019 (SG #14, monoclinic) vs. mp-556003 (SG #74, orthorhombic), the raw and spot DPs are both too similar (almost identical) to be easily differentiated. This is consistent with the powder X-ray diffraction data available in the MP library, in which the peak locations and intensities are alike.
However, the shaped DPs look substantially different, enabling the correct SG classification of these samples. Comparing the appearance of the shaped DPs, we find that they appear more symmetric for the higher-symmetry crystal system, as seen in the R-channel image for the first example pair (triclinic vs. monoclinic) and the G- and B-channel images for the second example pair (monoclinic vs. orthorhombic). The result indicates that the shape analysis can distinguish even small differences (barely observable by human eyes) in node position, size, and brightness, which are likely to be induced by the different levels of lattice symmetry of the crystal systems. For the latter two example pairs of mp-757070 (SG #166, trigonal) vs. mp-1195186 (SG #176, hexagonal) and mp-5055 (SG #186, hexagonal) vs. mp-29211 (SG #160, trigonal), although the raw and spot DPs do look slightly different, the benchmark models unfortunately do not predict the correct SGs for these samples. In the shaped DPs, however, these subtle differences are maximized. Notably, the distance information of adjacent node pairs, which is often related to the lattice parameters, is greatly amplified in the shaped DPs, as observed in the R and B channels of the 4th example pair. From these case studies, we find that the shaping strategy enhances the uniqueness of the raw DPs more than the superimposition strategy used to produce the spot DPs does; hence, even small observable differences in pattern between symmetrically close crystal systems (e.g., monoclinic vs. orthorhombic or trigonal vs. hexagonal) become pronounced.
In addition to the shaping strategy, the MSDN architecture also contributes to the performance improvements; here, we would like to discuss the benefits of this network. Figure 6 visualizes both the conv layers from the MSVGG and the DB layers from the MSDN for selected diffraction images. Several additional examples are presented in Supplementary Figures 4 and 5. The visualization results show that the patterns captured in the MSDN are clearer, richer, and less redundant than those in the MSVGG. Indeed, several feature patterns in the MSVGG are redundant, such as those for samples A, C, and D (highlighted in the red dotted boxes), while such redundant feature patterns are not found in the MSDN. This is likely because the MSDN reuses the features from previous layers to prevent redundancy within the network (Supplementary Figure 6).
We also compared the computational and memory efficiency of the MSVGG and MSDN. The MSDN is superior to the MSVGG in terms of both space complexity (total number of parameters) and time complexity (FLOPs: floating-point operations). The numbers of parameters and FLOPs are 128.85M and 515.37M, respectively, for the MSVGG, while they are much smaller at 1.54M (84 times smaller) and 5.75M (90 times smaller), respectively, for the MSDN. In fact, the number of parameters of the MSVGG is enormous because every single layer has its own weights and biases (W and b) to be learned. In the MSDN, this complexity is avoided by optimizing the parameters and simplifying the connectivity between layers because it is unnecessary to learn redundant feature maps. Such a large difference is possible because the MSDN can receive direct supervision for the propagation of the error signal from the preceding layers to the final layer. These comparisons indicate that DP image processing is extremely fast and efficient in our MSDN model.
[...] evaluation scores. For the testing scheme, the test set images were used to evaluate the performance of our network.
For the proposed model (shaped DPs+MSDN), we used the Adam optimizer 41 with a learning rate of 1.0×10⁻⁵ and a weight decay and momentum of 1.0×10⁻⁷ and 0.9, respectively. The MSDN consists of four dense-block layers and three transition layers in each substream (Figure 1c). The structure of a dense block is illustrated in Supplementary Figure 6b. Let DB be a dense block with l layers H_l, composed of conv, rectified linear unit (ReLU) and dropout 42 layers:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]), \qquad (1)$$
where x_0~x_{l-1} represent feature outputs and [•••] is defined as a concatenation operator. Then, a transition layer is implemented in every block that performs 1×1 conv and avgpool operations. Supplementary Table 1 shows the configuration of the proposed network in detail. During training, we defined a total loss (ℓtotal) function as the sum of the softmax cross-entropy losses of the three substreams.
For the alternative model (shaped DPs+MSVGG), we used the Adam optimizer with a learning rate of 1.0×10⁻⁵ and a weight decay and momentum of 1.0×10⁻⁷ and 0.9, respectively. This network consists of

Figure 1. Shaped diffraction patterns in an MSDN. a, A scheme that describes the automated determination of crystal symmetry based on diffraction experiments. b, A scheme describing the generation process for shaped DPs as well as two exemplary results from space groups #187 and #205. Note that in the generation scheme, the line thickness depends on the node size, which makes the shapes more unique. c, The network architecture of the MSDN.

Figure 2. Population distribution of the diffraction pattern dataset. a, The number of materials in each space group, along with the crystal system information. The background colors represent seven types of crystal systems: triclinic in red, monoclinic in orange, orthorhombic in yellow, tetragonal in green, trigonal in blue, hexagonal in light gray, and cubic in dark gray. b, The usage of our dataset for the experiments.

Figure 3. Space group classification performance. a, Top-1 accuracy as a function of the number of space groups for classification. b-e, Top-k accuracies for the datasets consisting of 8 SGs (b), 20 SGs (c), 49 SGs (d), and 72 SGs (e). f, Top-k accuracies for testing an untrained space with an additional 30 SGs. The top-k accuracy refers to the percentage of cases in which the correct class label appears among the top-k probabilities.
In the per-crystal-system decomposition (Figure 4c and Figure 4d), the model named spot DPs+Ziletti et al. is selected as the representative benchmark due to its relatively high performance. Triclinic systems are excluded from the analysis due to the statistically insufficient number of materials. The enhancements in accuracy are ranked as follows: trigonal (24.1 %p) > monoclinic (19.7 %p) > hexagonal (18.1 %p) ≈ tetragonal (18.11 %p) > orthorhombic (13.7 %p) > cubic (4.8 %p), where the values in parentheses are the average values for the 49 and 72 SG datasets. The contribution for cubic systems is much smaller than those for the other crystal systems.

Figure 4. Decomposition analysis to identify the origins of the performance improvement. a-b, Classification results for individual space groups from the 49 SG (a) and 72 SG (b) datasets. The background colors represent the seven types of crystal systems, as in Figure 2a. c-d, Average classification accuracy by crystal system type for the 49 SG (c) and 72 SG (d) datasets. e-f, Matrices showing the distribution rates (%) of incorrect predictions for the 49 SG (e) and 72 SG (f) datasets.

Figure 5. Case studies in which spot DPs fail and shaped DPs succeed in yielding correct SG classifications. The top row provides the material information of the test samples, which are available in the MP library, including the MP id, SG #, and powder X-ray diffraction data. The chemical formula of each material is as follows: mp-1076884 (Sr6Ca2Fe7CoO20), mp-6406 (Na2MgSiO4), mp-6019 (Sr2YNbO6), mp-556003 (CaTiO3), mp-757070 (BaCaI4), mp-1195186 (RbLa2C6N6ClO6), mp-5055 (Na6MnS4), mp-29211 (V4Cu3S8). The next four rows show the spot DPs and shaped DPs of each material. The green and red boxes indicate success and failure cases, respectively, for SG classification, and the blue boxes refer to the reference data in the training set. Best viewed in an electronic version.

Figure 6. Benefits of the MSDN over the MSVGG in processing DP images. For selected exemplary diffraction images A, B, C, and D, the block layers of the MSVGG (1, 2, 3, and 4) and MSDN (5, 6, 7, and 8) are visualized. The 3rd conv block of the MSVGG and the DB2 layer of the MSDN are shown for comparison. The red dotted box indicates redundant (almost identical) feature maps. Best viewed in an electronic version.

The total loss (ℓtotal) function consists of a sum of the softmax cross-entropies ℓ of logit vectors and their respective encoded labels, as follows:
$$\ell_{total} = \ell(S_R) + \ell(S_G) + \ell(S_B), \qquad (2)$$
$$\ell(S_*) = -\sum_{i=1}^{T}\sum_{j=1}^{C} L_{ij}\,\log\left[\delta_{SG}(F_*)_{ij}\right], \qquad (3)$$
where * denotes the zone axis information (one of the colors R, G, and B), F is a flatten layer, L denotes the class labels, T is the number of training samples, C is the number of classes, and δSG(•) is the output layer, implemented with the softmax function. The ℓtotal function provides joint supervision for the training process of the MSDN; it can robustly aggregate the descriptors from the different substreams.
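The joint loss can be illustrated numerically as the sum of three softmax cross-entropy terms, one per substream, evaluated against the same one-hot labels. The sketch below is an assumption-laden illustration rather than the training code: the function names are hypothetical, the logits are placeholders for the flattened substream outputs, and each term is summed (not averaged) over the samples.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(logits, onehot):
    """Softmax cross-entropy summed over T samples and C classes."""
    return float(-(onehot * np.log(softmax(logits))).sum())


def total_loss(logits_by_stream, onehot):
    """Sum of the per-substream losses for the R, G, and B streams."""
    return sum(cross_entropy(logits_by_stream[c], onehot) for c in "RGB")


# Hypothetical setup: T = 2 samples, C = 4 classes, uninformative (zero) logits.
onehot = np.eye(4)[[1, 3]]
logits_by_stream = {c: np.zeros((2, 4)) for c in "RGB"}
loss = total_loss(logits_by_stream, onehot)
print(loss)
```

With zero logits every class receives probability 1/4, so each sample contributes log 4 per stream and the total is 6 log 4; any stream that predicts poorly raises the joint loss, which is what ties the three substreams to a common objective.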