Introduction

Innovations in high-throughput and autonomous experimentation1,2,3,4 are exceedingly increasing the acquisition rate of data, particularly in the case of XRD. Manual analysis of combinatorial datasets, i.e. for the identification of composition–structure-relationships, is a challenging cognitive task that requires the ability to recognize patterns under the awareness of several constraints5. Recent progress has been made in using AI for unsupervised XRD dataset decomposition6,7, crystal structure classification8,9,10,11,12,13,14,15,16,17,18,19,20, and integrating the latter with autonomous experimentation21. Classification models are particularly promising, having been developed for a broad scope (classifying crystal system, space group, point group)10,11,12,15, and specific challenges integrating experimental information13,17,18,22. Further refinement of classification results could be achieved by model interpretation9. In broader approaches, experimentally relevant domain knowledge can be integrated modularly, e.g. by adding chemical composition15. Broader models operate without specific domain knowledge, such as predicted phases of an investigated materials or experimental measurement parameters, and can be directly applied to a broad suite of classification challenges without fine tuning17,18. In contrast, experiment specific models have a more relevant, yet more narrow, distribution or training data, and thus struggle when encountering data outside of this distribution. Nonetheless, it is the domain knowledge and prior information (e.g. simulated X-ray diffractograms of expected phases from crystallographic database entries that encompass the breadth of possible experimental non-idealities) that makes these AI agents so successful18.

These priors stem from expert knowledge and expectation, and as such, even accurate, feed-forward classifiers can come to false conclusions when operating on new or novel data. Here, we define novelty as XRD patterns that are unknown to the AI, e.g. structures that were absent in the training data or XRD patterns of phase mixtures which were not considered in the training data. In the context of materials discovery, material novelty comprises unreported materials of certain chemical composition and crystal structure.

Hence, the recognition of novelty in large amounts of high-throughput data is a key challenge for scientists, and subsequently the tools that they use. An AI agent that is sensitive to novelty (i.e. aware of ‘what it doesn’t know‘) could inform scientists about experiments that mandate further investigation due to failure modes or material novelty. In this way, scientists can invest essential resources into the most promising areas and focus their manual effort where human experience is valued most.

Here, we use a variational autoencoder (VAE) (Fig. 1a) trained on a synthetic dataset as a prior18,23 to solve commonly occurring visualization and novelty detection challenges in XRD analysis. Notably, this same synthetic dataset can be used to train a state-of-the-art classification model in tandem18,24. Firstly, we identify regions of learned similarity and potentially degenerate solutions in our prior. Next, we develop a dynamic visualization tool for experimental XRD patterns and the VAE latent space. While this offers a correlation with structural classification, we show that visualizing the latent space with respect to the reconstruction error of the VAE allows for novelty detection during an experiment.

Fig. 1: Visualization of the VAE latent space representation.
figure 1

a Schematic VAE architecture. XRD patterns are encoded into a low dimensional representation (mean, variance). The red cross marker indicates the latent space position of the encoded XRD pattern with respect to the prior (circles). The latent vector is decoded into a reconstructed XRD pattern. b Latent space embedding with color-coded diffraction angle 2Θ of maximum XRD intensity. c Latent space embedding of an exemplary synthetic dataset. Im\(\overline{3}\)m is clearly separated from Fm\(\overline{3}\)m and P63/mmc. The x-markers show the latent space position of the corresponding XRD patterns in (d). d XRD patterns of the x-marked latent space positions in (c). The latent space embedding in b) clearly shows the main reflection axes of the different crystal structures, i.e. P63/mmc has six main reflection axes in the angular range from 20 to 90°2Θ (for Cu Kα). It further elucidates possible ambiguities between different structures that exhibit a preferred orientation: Fm\(\overline{3}\)m (111) and P63/mmc (002) as well as Im\(\overline{3}\)m (020) and P63/mmc (102) have peaks at a similar diffraction angle. This is important during the pattern matching task, as an experimental XRD pattern could be a result of either structure.

Results and discussion

Visualizing a synthetic dataset with a variational autoencoder

Autoencoders are a class of self-supervised artificial neural networks that compress input data into a lower dimensional feature-value space (latent space) and then decodes this latent representation back into its original space. These have been broadly applied for the purpose of dimensionality reduction, denoising25, anomaly26 and novelty detection27 and specific materials science applications such as detecting phase transitions28, and translating between different dimensional representations, i.e., 2-d images and 1-d spectra29. VAEs are a special class of autoencoders that approximate a posterior over latent random variables30,31, jointly optimizing the reconstruction of the input and the Kullback-Leibler (KL) divergence between the latent representation and a smooth (often normal) distribution. This later loss function acts as a regularization mechanism, resulting in an efficient distribution over the latent space, and can be expanded to include physical constraints32,33. The resulting probabilistic latent space has made these approaches popular for generating new data based on the prior training data34, and inverse design in materials science35,36.

We first study the behavior of a VAE on XRD data with regards to latent space distribution using a synthetic dataset of 15000 one-dimensional XRD patterns covering three spacegroups (Fm\(\overline{3}\)m (ISCD: 108308), Im\(\overline{3}\)m (ISCD: 108347), P63/mmc (ISCD: 622438), 5000 XRD patterns each). The dataset includes different aberrations that are frequently encountered in thin-film XRD patterns such as texture and preferred orientation, peak broadening, and peak shifting18. Figure 1c shows the latent space distribution of the validation dataset, color-coded by the spacegroup labels. The latent space representation provides direct visual evidence of the clustering properties of the encoder model and distribution of main reflection axes in the XRD patterns (Fig. 1b). In the latent feature space, proximity is a first indicator of structure type. P63/mmc and Fm\(\overline{3}\)m show an overlapping region which stems from homometrics: similar patterns that arise from distinct structures with preferred orientation in the training dataset. The XRD patterns, marked by green, red and blue crosses in Fig. 1c, are shown in Fig. 1d and highlight the similarity of Im\(\overline{3}\)m and P63/mmc for texturing along (020) and (102), respectively. By examining the latent space distribution with respect to the location of maximal intensity in the XRD patterns, we can see that the latent space is organized according to the main reflection axes of the crystal structures (Fig. 1b). This demonstrates how the learned latent representation can directly point towards possible structural ambiguities in the prior, and creates an explainable AI tool for understanding what determines location in latent space. Classification boundaries are mapped over this latent space to validate that the model is learning a physically meaningful representation, and to provide a visualization of the latent distribution and structural similarity (Fig. 2).

Fig. 2: Classification and novelty detection using the reconstruction error.
figure 2

The decision boundaries of a KNN classifier are outlined. a Latent space representation of a known phase color-coded by the reconstruction error. b Latent space representation of an unknown phase color-coded by the reconstruction error. The unknown phase shows a distinctly higher reconstruction error (average reconstruction error = 0.09) compared to phases that are recognized by the model (average reconstruction error = 0.017). The classification test score in this idealized case was 99.16%.

By interfacing a VAE trained on synthetic data with an experimental datastream37, this latent representation can visualize a measurement in real time. To demonstrate how this visualization behaves with new or unknown phases, a set of 1000 XRD patterns of a novel phase, unknown to the model, with P42/mnm spacegroup (ICSD: 601378) are generated using the same parameter variation as in the training dataset. In Fig. 2, the reconstruction error is mapped over the latent space for the test set of all phases (known and unknown/ novel). The decision boundaries of a KNN classifier, trained on the latent representation are outlined. While it could be naively assumed from the visualized location in latent space (Fig. 2b) that the P42/mnm phase was the Im\(\overline{3}\)m or P63/mmc phase, the reconstruction error increases by an order of magnitude, which means that the naive assumption is most likely wrong. This indicates a posterior distribution which does not capture the information contained in the input XRD pattern: this is new to the model. The model encodes a latent representation for the P42/mnm phase near the P63/mmc and Im\(\overline{3}\)m phases which suggests similarly important reflections (e.g.: P42/mnm(330) = 43.6°2Θ, Im\(\overline{3}\)m(011) = 43.13°2Θ or P42/mnm(411) = 47.03°2Θ, P63/mmc(101) = 47.36°2Θ); however, the reconstruction error provides evidence of an unknown structure.

The reconstruction loss between the input and the decoder can also elicit knowledge of phase mixtures. Phase mixtures could be considered during dataset synthesis by generating binary and ternary mixtures; however, the combinatorial explosion in multinary material systems would drive the dataset size to numbers that cannot be handled efficiently16. To test the VAE behavior with phase mixtures, we generate a dataset of binary combinations of the three structures (Fm\(\overline{3}\)m, Im\(\overline{3}\)m, P63/mmc) for 50 different binary compositions between 0 and 100%, and calculate the reconstruction error (Fig. 3a). An increase in average reconstruction error of approximately one order of magnitude is observed for mixtures of known phases and a pure unknown phase. Additionally, the distributions show a larger spread. When considering the reconstruction error as a function of phase fraction, the error is maximized at approximately 50% for all binary mixtures (Fig. 3b), indicating a maximum in reconstruction error for XRD patterns that are furthest apart from the training data. The large standard deviation results from cases where the mixture of certain XRD patterns with preferred orientation shows similarity to a pure phase. These properties show that unlike contemporary classification models, a VAE has an indication of when it encounters something unfamiliar. The reconstruction error can be used for decision making by alerting a scientist to review the classification results of a corollary classification model8,15,16,18, guiding an acquisition function for curiosity or exploration driven experiments, or to select an appropriate de-mixing model for multi-phase classification22,24,38.

Fig. 3: VAE reconstruction error of phases and phase mixtures.
figure 3

a Violin plot showing the statistics of the reconstruction error for pure phases, two-phase mixtures and an unknown pure structure. The median of phase mixtures is significantly higher than for pure phases. The median of the unknown phase shows a one order of magnitude increase in reconstruction error with respect to pure phases. Pure phases show a multimodal distribution, phase mixtures a bimodal distribution and the unknown phase a normal distribution. A broader distribution is observed for phase mixtures and the unknown phase. b Reconstruction error versus mixing ratio of binary mixtures. The reconstruction error shows a maximum at appr. 50% mixture for which the VAE has the highest uncertainty. We suggest that the reconstruction error be used as a metric for uncertainty that indicates XRD patterns that are either mixtures of phases or new phases that are not contained in the training data. The reconstruction error could be used to guide a data-driven acquisition function in the search for single phase regions in large chemical composition spaces and the discovery of new materials.

Visualization and anomaly detection in an experimental setting

We further tested the VAE in an experimental setting where a multitude of candidate phases are possible. An exemplary experimental dataset acquired from a materials library of the thin film system Co-Ni-Cr-Re contains 225 XRD patterns. Nine quasi-ternary composition spreads are distributed on a rectangular grid over a 100 mm substrate (Supplementary Figure 1). We selected 21 possible structures from the ICSD database and generated a synthetic dataset following the procedures outlined by Maffettone et al.18. The dataset contains several phases that show similar crystal structures for different chemical compositions. The chemical composition is a natural constraint in XRD analysis that bounds phase stability. A sensible alternative condition, beyond the scope of this work, is the phase transition temperature. The chemical composition acts as a constraint using a conditional VAE (cVAE)39. In this case, the cVAE encodes the concatenation of chemical composition and 1D XRD pattern (Supplementary Information). Latent variables and chemical composition are concatenated and passed to the decoder. The relation between chemical composition and crystal structure is coded in the cVAEs parameters after training on the selected 21 structures from database entries.

The VAE is now conditioned on the prior of both the XRD pattern and the chemical composition, which implicates that a mismatch between the conditional (chemical composition) and the XRD pattern will respond with an increased reconstruction error. We exemplify this behavior by recreating the above approach with synthetic data, but using a cVAE (Supplementary Figures 2 and 3). Following training on the synthetic data, the reconstruction error of the experimental data is evaluated and mapped over the physical coordinates (Fig. 4a). We inspect three XRD patterns with different reconstruction errors in Fig. 4b. We also demonstrate the veracity of the latent representation by training a Gaussian Naïve Bayes (GNB) classifier on an ensemble of cVAEs to predict space group from latent vector and chemical composition (Supplementary Figure 4). As evident from the spacegroup classification results (Supplementary Fig. 4) the majority of the XRD patterns have a high probability of Fm\(\overline{3}\)m spacegroup while with increasing Co and Re concentration P63/mmc and P42/mnm probabilities are enhanced.

Fig. 4: XRD pattern reconstruction by VAE and reconstruction error as a measure of novelty.
figure 4

a Materials library map of reconstruction error. Each position marks the measurement area where an XRD pattern was acquired for a unique chemical composition within the library. b Experimental XRD patterns (color-coded) and VAE reconstructed-XRD patterns (black) of highlighted data points on materials library.

The sample with the composition Co54Cr5Ni31Re10 (Fig. 4, red) shows a highly textured Fm\(\overline{3}\)m – Ni structure that was part of the synthetic dataset. Hence, the reconstruction error is minimal and the VAE-reconstructed XRD pattern closely matches the experimental pattern. The classification probability of the GNB is 73% (Supplementary Figure 4). The sample Co63Cr18Ni14Re5 (Fig. 4, blue) shows a phase mixture of Fm\(\overline{3}\)m – Ni (GNB probability: 45%) and P63/mmc – Co (GNB probability: 40%). As phase mixtures are unknown by the model, the reconstruction error is amplified, which informed us that closer inspection of this XRD pattern is required. The sample Co29Cr29Ni6Re36 (Fig. 4, yellow) shows a large peak at 39°2Θ that matches to Fm\(\overline{3}\)m – Ni (GNB probability: 43%) and a broad side-peak that could be associated with P42/mnm (GNB probability: 39%) according to the GNB prediction.

In summary, we have demonstrated that variational autoencoders provide a powerful toolset to assist in the analysis of XRD datasets that is complementary to other analysis methods. The crystal structure representation provided by the VAE latent vectors highlights intrinsic features of the dataset. This representation can be used for on-the-fly analysis of a dataset distribution across different structures, in which latent space proximity is a structural estimator. The reconstruction error informs the experimenter or adaptive AI driving an autonomous experiment of suspicious data, and assists in detecting unknown phases or phase mixtures. These models can be improved in the future by incorporating domain specific physics into the latent space as well as the dataset32,33. We envision the VAE working cooperatively together in a federation of specialized AI agents. Combining the power of modern classification agents16,18,38, data reduction agents40, and the wisdom of ignorance offered by the VAE with adaptive experimental agents21,41 will significantly increase the veracity and efficacy of high-throughput diffraction measurements.

Methods

XRD pattern simulation

The computed X-ray diffraction datasets used in this study were synthesized using a custom python package based on CCTBX42. Three crystallography information files (CIF) from three different spacegroups (ICSD-ids: 108347, 108308, 622438) were chosen. 5000 XRD patterns were simulated for Cu Kα radiation in an angular range from 20–90°2Θ. For each structure the same parameters that are optimized in a Rietveld refinement are varied. The angular range is propagated on 2048 datapoints. The 21-structure dataset for the Co–Ni–Cr–Re system was generated using the same parameters. Binary phase mixtures are calculated according to the formula \(x\star{\mathrm{XRD}}_{\mathrm{A}} + \left( {1 - x} \right)\star{\mathrm{XRD}}_{\mathrm{B}}\) and subsequent normalization of the intensity to 0 and 1.

Variational autoencoder models

The variational autoencoders were written with Python 3.7, using Tensorflow 2.1 and the Keras module. The encoder network consists of an input layer (2048) followed by two dense layers for the encoder (sizes 256 and 128) and a latent space with 2 dimensions (µ and σ). A sampling layer z randomly samples datapoints from a normal distribution (µ = 0 and σ = 1). The output of z is connected to two dense layers of the decoder (sizes 128 and 256) that is connected to an output layer (2048). The loss function is the weighted sum of the binary cross entropy between input XRD and decoded output and the KL-divergence calculated from the latent space distribution and a 2D Gaussian function. The reconstruction loss is weighted by a factor of 2048 to obtain a clear separation of structural features in the latent space. The conditional VAE used on the experimental dataset comprises the concatenation of the 1D XRD and the normalized chemical compositions as inputs. The latent variable output and the chemical composition are concatenated and provide the input to the decoder.

Synthesis of composition spread materials library

The Co–Ni–Cr–Re thin film composition spread materials library was deposited in a high vacuum magnetron sputter system (DCA Finland). Deposition was conducted from four 100 mm diameter magnetron sputter cathodes with pure elemental targets in Ar at a pressure of 0.66 Pa and no intentional heating of the substrate. The Ar flow was 60 sccm. Composition gradients were created by using a computer-controlled substrate shutter. The quaternary composition spread type thin film materials library was fabricated by a wedge-type multilayer technique by successive deposition of 200 nanoscale wedge-type elemental layers. A similar approach is described in detail by Salomon et al.43. The wedge-type layers were made using a moving shutter, which was set to shield the substrate and was then slowly retracted during deposition. In case of Co and Ni single wedges with length of 67.5 mm were deposited with a rotation of 90° in between. In case of the Cr and Re wedge triplets consisting of three wedges with a length of 22.5 mm were deposited again with a rotation of 90° in between. This the nominal thickness of the wedges varies from 0 to ~13 nm in case of Co, Cr and Ni and from 0 to ~3 nm in case of Re. After deposition, the materials library was annealed at 600 °C for 24 hours in vacuum at 2.6 × 10−4 Pa.

XRD measurements

XRD measurements were done in a Bruker D8 Discover with a 2D detector (Vantec 500) with Cu Kα radiation. Four 2D-frames were measured to cover a 2θ range from 20° to 100°. The frames were merged and integrated into 1D-intensity diffractograms. The background was subtracted by fitting a 4-degree polynomial function.