Extracting structural motifs from pair distribution function data of nanostructures using explainable machine learning

Anker, Andy S.; Kjær, Emil T. S.; Juelsholt, Mikkel; Christiansen, Troels Lindahl; Skjærvø, Susanne Linn; Jørgensen, Mads Ry Vogel; Kantor, Innokenty; Sørensen, Daniel Risskov; Billinge, Simon J. L.; Selvan, Raghavendra; Jensen, Kirsten M. Ø.

doi:10.1038/s41524-022-00896-3

Open access
Published: 01 October 2022

npj Computational Materials volume 8, Article number: 213 (2022)

Abstract

Characterization of material structure with X-ray or neutron scattering using e.g. Pair Distribution Function (PDF) analysis most often relies on refining a structure model against an experimental dataset. However, identifying a suitable model is often a bottleneck. Recently, automated approaches have made it possible to test thousands of models for each dataset, but these methods are computationally expensive and analysing the output, i.e. extracting structural information from the resulting fits in a meaningful way, is challenging. Our Machine Learning based Motif Extractor (ML-MotEx) trains an ML algorithm on thousands of fits, and uses SHAP (SHapley Additive exPlanation) values to identify which model features are important for the fit quality. We use the method for 4 different chemical systems, including disordered nanomaterials and clusters. ML-MotEx opens up a type of modelling where each feature in a model is assigned an importance value for the fit quality based on explainable ML.

Introduction

The development of advanced, functional materials builds on an understanding of the intricate relationship between material structure and properties, and over the past century, crystallographic methods using scattering and diffraction have thus been essential for materials science. Crystallography allows ab initio determination of crystal structures from diffraction data and has provided us with the vast knowledge of crystal chemistry that is now used in the design of functional materials. However, in the case of nanomaterials with limited long-range order, crystallographic methods are challenged, and ab initio structure determination, or structure solution, is not currently possible. Over the past decades, total scattering with Pair Distribution Function (PDF) analysis has become an essential tool for the characterization of nanomaterial structure^1,2. The PDF is the Fourier transform of normalized and corrected X-ray, neutron, or electron scattering intensities, and is a function in real space representing a histogram of inter-atomic distances in the sample. Compared to crystallographic methods relying on long-range order, PDF analysis can be applied for nanomaterials^3,4,5, disordered^1,6,7, or amorphous materials^3,5,8. However, structure solution from the PDF is not possible except in a very few simple cases⁹, using either the Reverse Monte Carlo method¹⁰ or the LIGA algorithm^11,12. In the absence of broadly applicable ab initio nanostructure determination methods, it is, therefore, necessary to propose reasonable starting models and to then ‘refine’ the model parameters against the data using local minimization methods. The step of finding a starting model can be a major challenge and is thus a bottleneck in complex material characterization. 
In the case of PDF analysis of nanomaterials, such models are often guessed at by considering related bulk materials; however, these are often not good starting models for very small clusters and nanoparticles, where significant structural changes may take place^3,5,13,14. A way of building plausible starting models is thus needed, where structure models can be built capturing local bonding topologies suggested by known chemistries.

Recently, automated methods such as ‘structure mining’ and ‘cluster mining’ have appeared in the literature to help overcome this challenge^15,16,17. In a study of the structure of metallic nanoparticles, Banerjee et al. automatically generated thousands of discrete metal nanocluster structures and fitted PDFs from each of them to experimental data to identify the best model in an automated manner¹⁷. In a recent study of molybdenum oxide nanomaterials, we introduced another approach, where we automatically generated a large number of MoO_x cluster structure models and compared their PDFs to experimental data in order to identify dominating structural motifs in the sample, i.e. arrangements of atoms that dominate the material structure on the local scale⁷. We hypothesized that the structural motifs present in amorphous molybdenum oxides can also be found in crystalline structures, and therefore used crystal structures of molybdenum oxides as starting models. From these models, we cut out thousands of different cluster structure models of different sizes to build a ‘catalogue’ of structure candidates. These models were all tested against our data to identify the best fitting structural motif. We recently used a similar approach for the identification of a bismuth oxido cluster intermediate structure in a study of cluster growth¹⁸.

While these approaches can extend the structural space searched when identifying models for structure refinement, new challenges arise. Firstly, the refinement processes can be computationally heavy, which can limit the number of catalogue structures that are tested. For example, our brute-force approach for cluster identification above generates 2^N − 1 structures for starting model sizes with N atoms. Each structure must have its PDF computed and then refined against the target measured PDF, so that its fit quality can be evaluated. This process is computationally costly and does not scale well with the number of structure candidates. Furthermore, for disordered, amorphous, and nanostructured systems, many hundreds of models may provide similar fit qualities, and if only a few of them are reported, it is difficult to assess which structural features of these models are important. We, therefore, need effective and unbiased methods to compare many fits to extract structural information.

Here, we introduce a method that uses an explainable Machine Learning (ML) model that, after training, will predict the agreement factor for a test cluster with a given dataset. Furthermore, the use of explainable ML indicates which features in the model are important for the agreement factor^{19,20,21,22,23,24}. Our Machine Learning based Motif Extractor (ML-MotEx) model is illustrated in Fig. 1. Firstly, it builds a large catalogue of thousands of candidate structural motifs, which are ‘cut outs’ from a chosen bulk structure^7,18 (step 1). The PDF is then computed from each one, and each model is fit to the target dataset (step 2). The structures and R_wp values from each fit are handed to an ML algorithm applying gradient boosting decision trees (GBDTs)²⁵, which learns to predict R_wp values for new fits based on an atomic structure model (step 3). The ML-MotEx algorithm then outputs quantified values of how important each atom or feature in the starting structure is for the fit to yield a low R_wp value with the given fitting algorithm (step 4). This is done by using SHAP (SHapley Additive exPlanation)^26,27 values, an established method for explaining tree-based ML models. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature pushes the R_wp value of the fit towards 1 (poor fit) or 0 (perfect fit), in other words why it is important.

**Fig. 1: Illustration of the ML-MotEx process.**

Compared to the automated, brute-force methods previously introduced for PDF analysis^7,15,16,17, we can screen a larger number of structures much faster. Our method only needs to screen a sub-sample (~10,000) of the much larger number of motifs that can be generated from bulk material to learn how to predict which structures provide a good agreement with the data. The analysis done for the examples presented below would take ~24 days for starting models with 24 atoms, ~3 × 10⁶ years for starting models with 48 atoms, and ~6 × 10¹³ years for starting models with 72 atoms using a brute-force approach (Supplementary Notes 1), while ML-MotEx analysis is done in minutes or hours. Furthermore, the use of explainable ML provides a way to better analyse the output of the screening: instead of just identifying the model that provides the lowest R_wp value, we are able to output a measure of how important each atom or feature (e.g. size or shape) in the starting model is for the fit to yield a low R_wp value (step 4). This procedure is automated and can be done in quasi-real experimental time and without human bias.
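The brute-force estimates quoted above can be checked with simple arithmetic. The sketch below backs out the per-fit cost that each estimate implies when all 2^N − 1 subclusters are fitted. The function name `implied_seconds_per_fit` is hypothetical, and a 365.25-day year is assumed; the actual timing model is the one given in Supplementary Notes 1.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: Julian year


def implied_seconds_per_fit(n_atoms, total_seconds):
    """Per-fit cost implied by a brute-force run over all 2^N - 1 subclusters."""
    return total_seconds / (2 ** n_atoms - 1)


# Brute-force totals as quoted in the text (N atoms -> total time in seconds).
quoted = {
    24: 24 * SECONDS_PER_DAY,
    48: 3e6 * SECONDS_PER_YEAR,
    72: 6e13 * SECONDS_PER_YEAR,
}

per_fit = {n: implied_seconds_per_fit(n, t) for n, t in quoted.items()}
for n, t in per_fit.items():
    print(f"N = {n}: ~{t:.2f} s per fit")
```

Under these assumptions each fit costs well under a second, so it is the exponential number of subclusters, not the cost of a single fit, that makes the brute-force approach intractable.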

We illustrate the use of ML-MotEx using four different examples. We first show the principles of the method using a simple model system based on simulated X-ray PDF data from a C₆₀ buckyball. We further demonstrate the use of ML-MotEx on experimental X-ray PDF data from amorphous, disordered molybdenum oxides⁷ and tungstate α-Keggin clusters in solution²⁸, where it allows identifying the main structural motifs present in the samples using different starting models. Lastly, we extend the method to use a ‘cookie-cutter’ strategy to generate structures for the catalogue of candidate motifs. Here, the algorithm is used to identify a bismuth oxido cluster by using a cut-out of the β-Bi₂O₃ structure as starting model. The examples illustrate that it is possible to obtain knowledge of dominating structural motifs from PDF in an automated manner using ML.

Results

ML-MotEx algorithm

ML-MotEx consists of four steps. These four steps are shown in Fig. 1 and the simplified pseudo-code of the algorithm in Fig. 2. In the first step, a starting structure model is used to generate a catalogue of candidate structure motifs. As detailed in the Methods section, the structures are generated by removing different numbers of atoms from the original starting structure, which results in thousands of smaller, candidate structure motifs. In the second step, a fitting script is used to fit the generated candidate structures to the dataset. In the third step, the fitting results are handed to the explainable ML algorithm, which is optimized and trained. By using this information, SHAP values of the atoms or structural features in the starting model are calculated in the fourth step. The output of the algorithm is thus the starting model along with SHAP values, indicating the importance of each individual atom in the structure for the fit quality, or in other words, how much each individual atom or feature affects the R_wp value either positively or negatively. We refer to this value as the “atom contribution value”. We furthermore define the ratio between the atom contribution value and its uncertainty as the “confidence factor”. Further definitions and descriptions of the individual steps of the algorithm are given in the Methods section.
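Step 1 can be pictured as sampling keep/remove masks over the atoms of the starting model. The sketch below is a minimal illustration under that assumption. The function name `build_catalogue`, the choice to draw the motif size before drawing the atoms, and the fixed seed are all hypothetical and are not taken from the authors' code; the procedure actually used is the one described in the Methods section.

```python
import random


def build_catalogue(n_atoms, n_structures, seed=0):
    """Draw unique, non-empty keep/remove masks (1 = atom kept, 0 = removed)."""
    if n_structures > 2 ** n_atoms - 1:
        raise ValueError("more structures requested than subclusters exist")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n_structures:
        # Assumption: pick the motif size first so that small and large
        # candidate motifs are both represented in the catalogue.
        size = rng.randint(1, n_atoms)
        kept = set(rng.sample(range(n_atoms), size))
        seen.add(tuple(1 if i in kept else 0 for i in range(n_atoms)))
    return sorted(seen)


catalogue = build_catalogue(n_atoms=8, n_structures=50)
print(len(catalogue), "candidate motifs; first mask:", catalogue[0])
```

Each mask would then be turned into a structure model, fitted to the PDF (step 2), and paired with its R_wp value as one training example for step 3.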

**Fig. 2: Pseudo-code describing the four steps of ML-MotEx.**

Example 1: Proof-of-concept: identification of the C₆₀ buckyball

We first show the use of ML-MotEx with a simple, proof-of-concept example, using a calculated PDF from an ideal C₆₀ buckyball (Fig. 3a). The aim is to identify the structural motif, the C₆₀ buckyball, from the data.

**Fig. 3: Analysis of a simulated PDF from a C₆₀ buckyball.**

We first need a starting structure that contains the motifs we are looking for. In this simplified example, we use a single unit cell of the crystal structure of C₆₀²⁹. However, we discarded all symmetry and generated a discrete structure model corresponding to the 132 atoms in one unit cell. This model is shown in Fig. 3b, where one whole C₆₀ structure (Fig. 3a) is seen along with fragments of the neighbouring C₆₀ buckyballs. The simulated PDF of the C₆₀ buckyball and the starting model is shown in Fig. 3c.

We can now use this starting model to generate a catalogue of structures, which are all fitted to the data. The structures are created by removing different numbers of atoms from the original starting structure, which results in thousands of smaller, candidate structure motifs. These model generation and fitting steps are identical to our previously reported brute-force approach, where we simply compare the R_wp values of all the fits to identify the best structure motif. We first consider this simple approach. One of the limitations of the brute-force method is that the number of possible candidate structures is exponential in N, the number of atoms in the model. Since each atom in the starting model can be present or absent, the number of possible subclusters is equal to 2^N − 1. For large models such as the C₆₀ starting model containing 132 atoms, this is ~10⁴⁰, a gigantic number, making it impossible to investigate all candidate structures. For this example, we used 384,260 structures to train ML-MotEx, which is only a very small fraction of the 2¹³² − 1 possible candidate structures. Note that the model with a single C₆₀ buckyball was not in the generated structure catalogue.
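The size of the search space, and how little of it the training catalogue covers, can be made concrete with a few lines of arithmetic. This is only an illustrative calculation; the function name `n_subclusters` is hypothetical.

```python
def n_subclusters(n_atoms: int) -> int:
    """Non-empty subsets of atoms: each atom is either kept or removed."""
    return 2 ** n_atoms - 1


n_total = n_subclusters(132)   # C60 starting model with 132 atoms
n_fitted = 384_260             # structures used to train ML-MotEx

print(f"possible subclusters: {n_total:.2e}")
print(f"fraction fitted:      {n_fitted / n_total:.1e}")
```

The count is of the order of 10⁴⁰, in line with the figure quoted above, so the catalogue samples only a vanishing fraction of the possible subclusters.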

All these 384,260 structures were fitted to the PDF calculated from the C₆₀ cluster. Only a scale factor, an isotropic expansion/contraction factor, and isotropic Atomic Displacement Parameters (ADPs) were refined, as detailed in Supplementary Table 2. We note that refinement of the atom positions can be added to the fitting procedure to expand the chemical space that is investigated. However, this would be computationally expensive, and it would allow deviations from the chemical topologies set up in the starting model.
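To make the fit-quality measure concrete, the sketch below computes R_wp for a calculated PDF against an observed one, assuming the conventional definition R_wp = sqrt(Σ w (G_obs − G_calc)² / Σ w G_obs²) with unit weights by default, and refines only a scale factor by linear least squares. The function names `best_scale` and `r_wp` are hypothetical; the expansion/contraction factor and ADPs that are also refined in the actual fits are not covered here.

```python
import math


def best_scale(g_obs, g_calc):
    """Least-squares scale factor s that minimises sum (g_obs - s * g_calc)^2."""
    denom = sum(c * c for c in g_calc)
    return sum(o * c for o, c in zip(g_obs, g_calc)) / denom if denom else 0.0


def r_wp(g_obs, g_calc, weights=None):
    """Weighted residual between observed and calculated PDFs (0 = perfect fit)."""
    if weights is None:
        weights = [1.0] * len(g_obs)
    num = sum(w * (o - c) ** 2 for w, o, c in zip(weights, g_obs, g_calc))
    den = sum(w * o * o for w, o in zip(weights, g_obs))
    return math.sqrt(num / den)


# Toy data: the calculated PDF has the right shape but half the amplitude.
g_obs = [0.0, 1.0, -1.0, 2.0]
g_calc = [0.0, 0.5, -0.5, 1.0]

scale = best_scale(g_obs, g_calc)
print("R_wp before scaling:", r_wp(g_obs, g_calc))
print("R_wp after scaling: ", r_wp(g_obs, [scale * c for c in g_calc]))
```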

To get an overview of the results from these fits, we plot the R_wp value versus the number of atoms in the structure. To further investigate the results, one must visually inspect the fits of the catalogue of candidate structure motifs and their R_wp value. Some of the candidate structure motifs are shown as insets in Fig. 3e, where transparent grey atoms represent atoms deleted from the models. The fits of these structures to the dataset are presented in Fig. 3e, along with the R_wp values. The R_wp value appears to drop when the ‘outer’ atoms are removed, while it increases when the atoms that are part of the centre C₆₀ buckyball are removed. From investigating these few, manually selected structures and their corresponding fitted R_wp values, one can hypothesize that the structure giving the best fit should be the C₆₀ buckyball. However, this method can be biased by human interaction, and it is time-consuming and difficult to go through the many fits to extract structural information.

We, therefore, move on to the ML-MotEx method. Using the catalogue of candidate structure motifs and the corresponding R_wp values obtained above, we train a GBDT model on the training set to predict the R_wp value of the candidate structure motifs. Figure 4 shows the predicted R_wp values of the ML algorithm versus the R_wp value of the structures when they are fitted to the simulated C₆₀ dataset in DiffPy-CMI³⁰. For the structures used in the test set, the GBDT model predicts the R_wp value with a mean absolute error of 2.0%.
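The idea behind step 3 can be illustrated with a deliberately small gradient-boosting regressor built from depth-1 trees (stumps) on atom-present/atom-absent features. This is a pure-Python stand-in written for illustration, not the GBDT implementation used in the paper. The toy target `toy_rwp`, the hyperparameters, and all function names are assumptions.

```python
import random


def fit_stump(X, residual):
    """Best single-atom split (absent vs. present) for the current residuals."""
    best = None
    for j in range(len(X[0])):
        off = [r for x, r in zip(X, residual) if x[j] == 0]
        on = [r for x, r in zip(X, residual) if x[j] == 1]
        if not off or not on:
            continue  # atom never or always present: no split possible
        m_off, m_on = sum(off) / len(off), sum(on) / len(on)
        sse = sum((r - m_off) ** 2 for r in off) + sum((r - m_on) ** 2 for r in on)
        if best is None or sse < best[0]:
            best = (sse, j, m_off, m_on)
    if best is None:
        raise ValueError("no feature varies across the training set")
    return best[1:]


def fit_gbdt(X, y, n_rounds=200, lr=0.3):
    """Squared-loss gradient boosting: each stump fits the remaining residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residual = [t - p for t, p in zip(y, pred)]
        j, m_off, m_on = fit_stump(X, residual)
        stumps.append((j, lr * m_off, lr * m_on))
        pred = [p + (lr * m_on if x[j] else lr * m_off) for x, p in zip(X, pred)]
    return base, stumps


def predict(model, x):
    base, stumps = model
    return base + sum(v_on if x[j] else v_off for j, v_off, v_on in stumps)


# Toy "fits": atoms 0-2 form the motif and lower R_wp; the others raise it.
def toy_rwp(mask, motif=(0, 1, 2)):
    good = sum(mask[i] for i in motif)
    bad = sum(mask) - good
    return 0.8 - 0.2 * good + 0.02 * bad


rng = random.Random(0)
X = [tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(400)]
y = [toy_rwp(m) for m in X]
model = fit_gbdt(X[:300], y[:300])
mae = sum(abs(predict(model, x) - t) for x, t in zip(X[300:], y[300:])) / 100
print(f"held-out mean absolute error: {mae:.4f}")
```

Because the toy target is additive in the atoms, stumps are sufficient here; real fits involve interactions between atoms, which is one reason deeper trees and a tuned GBDT library are used in practice.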

**Fig. 4: Predicted R_wp values versus true R_wp values.**

We now use explainable ML to explain R_wp values with the use of the feature importance tool SHAP values²⁷. As described in detail in the Methods section, a SHAP value is calculated for each structural feature (here, each atom and the cluster size) for each candidate structure motif that is fitted to the PDF during the training process. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature affects the R_wp value of the fit towards 100% (poor fit) or 0% (perfect fit), in other words why it is important.
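How the sign and amplitude of a SHAP value arise can be seen on a model small enough to evaluate exactly. The sketch below computes exact Shapley values by enumerating all coalitions of features, assuming that each absent-from-coalition feature is averaged uniformly over 0 and 1. This brute-force calculation is an illustration only and is not the tree-based SHAP algorithm used in the paper; the toy model `toy_rwp` and the function names are hypothetical.

```python
from itertools import combinations, product
from math import factorial


def value(f, x, subset, n):
    """Mean of f with features in `subset` fixed to x and the rest averaged over {0, 1}."""
    free = [i for i in range(n) if i not in subset]
    total = 0.0
    for bits in product((0, 1), repeat=len(free)):
        z = list(x)
        for i, b in zip(free, bits):
            z[i] = b
        total += f(z)
    return total / 2 ** len(free)


def shapley_values(f, x):
    """Exact Shapley value of every feature of x by coalition enumeration."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                s = set(S)
                contrib += weight * (value(f, x, s | {i}, n) - value(f, x, s, n))
        phi.append(contrib)
    return phi


# Toy predictor: atoms 0 and 1 lower R_wp when present, atom 2 raises it.
def toy_rwp(mask):
    return 0.8 - 0.2 * mask[0] - 0.2 * mask[1] + 0.05 * mask[2]


x = (1, 0, 1)  # atom 0 kept, atom 1 removed, atom 2 kept
phi = shapley_values(toy_rwp, x)
print([round(p, 3) for p in phi])
```

In this toy case the kept motif atom receives a negative value, the removed motif atom a positive value of the same amplitude, and the kept non-motif atom a small positive value, which mirrors the interpretation of sign and amplitude given above.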

Figure 5a shows the most important results from the SHAP value analysis. The first feature we consider is the number of atoms, with SHAP values shown in the top part of Fig. 5a. The plot represents SHAP values for the cluster size feature with the size shown on a colour scale, going from small (blue) to large clusters (red). From the large amplitude of some of the SHAP values observed from this feature, we see that the number of atoms in the structure motif is the most important feature for the R_wp value. All small clusters (0–34 atoms, plotted in blue colours) show a large positive SHAP value, which implies that the R_wp value of the fit to the PDF data is high, i.e. the fit quality is low. All small clusters can thereby be discarded as structural models, as they do not describe the data satisfactorily.

**Fig. 5: Summary of the ML-MotEx analysis of C₆₀ PDF.**

Next, we can investigate the SHAP values obtained for the individual atoms in the structure. We first consider atom 13, as labelled in the structure drawing in Fig. 5b. The SHAP values obtained from this atom for each of the fits in the training set are all plotted on the SHAP axis. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where the atom is present. If we first consider the cases where the atom is kept in the model, the atom 13 SHAP values are generally negative, which means that the presence of this atom pushes the R_wp value towards 0%. We interpret this as ML-MotEx wanting to keep the atom in the model. The SHAP values obtained for the fits without the atom present are positive, which confirms that if the atom is removed, the fit quality becomes worse. Based on the SHAP values obtained for the atom in each fit, we calculate an atom contribution value. The atom contribution value is defined in the Methods section and is calculated as the difference between the average SHAP values obtained for the atom when kept in the model, and when removed from the model. A negative atom contribution value means that the atom pushes the R_wp value down if kept in the structure. The atom contribution value obtained for atom 13 is negative, and we, therefore, colour it yellow in the structural representation in Fig. 5b to indicate that it should be kept in the model. We use this strategy to automatically go through all the atoms in the starting model and colour them yellow/black based on their impact on the R_wp value. The result can be seen in Fig. 5b where the 60 atoms with the lowest atom contribution values are coloured yellow. The results are also shown in Supplementary Fig. 1, where the atom contribution values are plotted using a continuous colour bar. The results show that ML-MotEx mainly favours the atoms comprising the central buckyball. 
While the average confidence factor (as defined in the Methods section) is 1.26 for all of the atoms in the starting model, we observe that the average confidence factor of the mislabelled atoms is 0.37, meaning that ML-MotEx is less confident about the atom contribution values of those.
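The atom contribution value described above, the mean SHAP value of an atom when kept minus its mean SHAP value when removed, can be sketched as follows. The function names `atom_contributions` and `colour_atoms` and the toy numbers are hypothetical, and the sketch assumes every atom is both kept and removed at least once in the catalogue. The confidence factor is not computed here, because the uncertainty it relies on is defined in the Methods section.

```python
from statistics import mean


def atom_contributions(masks, shap_values):
    """Mean SHAP value when the atom is kept minus mean SHAP value when removed."""
    contributions = []
    for j in range(len(masks[0])):
        kept = [s[j] for m, s in zip(masks, shap_values) if m[j] == 1]
        removed = [s[j] for m, s in zip(masks, shap_values) if m[j] == 0]
        contributions.append(mean(kept) - mean(removed))
    return contributions


def colour_atoms(contributions, n_keep):
    """Colour the n_keep atoms with the lowest contribution values yellow."""
    order = sorted(range(len(contributions)), key=contributions.__getitem__)
    keep = set(order[:n_keep])
    return ["yellow" if j in keep else "black" for j in range(len(contributions))]


# Toy catalogue of four fits over two atoms (1 = kept, 0 = removed),
# with one SHAP value per atom per fit.
masks = [(1, 0), (0, 1), (1, 1), (0, 0)]
shap_values = [(-0.1, -0.025), (0.1, 0.025), (-0.1, 0.025), (0.1, -0.025)]

contributions = atom_contributions(masks, shap_values)
colours = colour_atoms(contributions, n_keep=1)
print(contributions, colours)
```

A negative contribution marks an atom whose presence lowers R_wp, so it is coloured yellow and kept; the number of atoms to keep (`n_keep`) is a user choice, such as the 60 atoms used for the C₆₀ example.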

The ML-MotEx algorithm thus provides an unbiased method to extract important motifs from PDF data, without any inputs other than a starting model and a fitting script. We emphasize that the structural motifs extracted with ML-MotEx are based on the R_wp value of the fits and are thereby not necessarily physically reasonable. It is therefore important to still critically consider the extracted motif with chemical knowledge, in the same manner as for conventional PDF refinements. In this process, one could refine additional parameters such as atom positions. Consequently, in Fig. 5b, the user should identify the full C₆₀ buckyball as the structural motif rather than just choosing the motif of the yellow atoms. Another approach to avoid unphysical results from ML-MotEx would be to include e.g. density functional theory (DFT) calculations in the goodness-of-fit value.

Example 2: Identification of structural motifs in disordered molybdenum oxides

As discussed above, we have recently used the brute-force automated modelling method to identify structural motifs in disordered molybdenum oxides from PDF analysis⁷. Here we show that by reassessing the data with ML-MotEx, we can reproduce the results from Christiansen et al.⁷ but in an automated way that allows analysis of the resulting structure model using SHAP values. Figure 6a shows the difference-PDF (d-PDF) obtained from amorphous molybdenum oxide supported on γ-Al₂O₃ nanoparticles (15 wt% Mo), where the signal from the γ-Al₂O₃ nanoparticles has been subtracted. The d-PDF thus only reflects the structure of the supported material. The aim is to develop a structural model for the amorphous MoO_x. In our previous paper, different starting models were tested, which were all based on structures of molybdenum-based polyoxometalates (POMs) built from [MoO₄] tetrahedra and [MoO₆] octahedra. The analysis showed that the best fitting models did not contain tetrahedral motifs. Instead, the brute-force automated modelling approach hinted at a unit of three edge-sharing [MoO₆] octahedra, or a ‘triad’, to be present in the structure. However, the use of the computationally expensive brute-force method limited the number of atoms that could be included in the starting model. This meant that a range of different smaller starting models were used to test different structure hypotheses. With ML-MotEx, we can instead test much larger systems and thereby include several different structural motifs at the same time in one starting model, as well as quantitatively analyse the results using SHAP values. We, therefore, use a larger POM as starting model, namely the entire Mo₃₆O₁₂₈ cluster cut-out of the K₈(Mo₃₆O₁₁₂(H₂O)₁₆)·(H₂O)₃₇ crystal structure³¹, which contains a range of different chemical topologies. Figure 6a shows the simulated data from the Mo₃₆O₁₂₈ cluster, which has some similarities to the experimental PDF, and Fig. 6b shows the structure of the Mo₃₆O₁₂₈ cluster.

**Fig. 6: Analysis of experimental PDF from disordered molybdenum oxide.**

We apply ML-MotEx to the molybdenum oxide system in the same manner as we did to the C₆₀ buckyball. First, we used the starting model to make a catalogue of candidate structure motifs, as described in detail in the Methods section. These are all fit to the experimental PDF, and the results are used to train the GBDT model. The fits are made with the same fitting algorithm as used in the paper from Christiansen et al.⁷ Figure 6c illustrates the R_wp values of the fits, plotted as a function of the number of molybdenum atoms present in the structural motif. The best fitting models contain 5–7 molybdenum atoms. The model that fits the data with the lowest R_wp value (45%) can be identified as a Mo₅O₂₄ structure, as shown in Supplementary Fig. 2. However, it is difficult to justify that this structural model is unique in representing the structure in the sample, purely based on the R_wp value.

We, therefore, use steps 3 and 4 of ML-MotEx to analyse the results of the ensemble of fits. The resulting SHAP values are shown in Fig. 7a. The plot should be interpreted in the same way as for the C₆₀ example: Each atom is assigned a SHAP value in each of the fits in the training set. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where the atom is present. When considering the amplitudes of the SHAP values, we see that the atoms labelled with 14, 15, 19, and 20 are marked as very important by ML-MotEx. When these atoms are present in the structure (red), they all have large negative SHAP values, indicating that their presence in the model pushes the R_wp down. When they are not present in the structure (blue), they all have large positive SHAP values, also indicating that they should be present in the structure to obtain a good fit. Atoms 22 and 23 are examples of atoms that ML-MotEx does not suggest keeping in the structure. As seen from the SHAP values, their presence pushes up the R_wp value.

**Fig. 7: Summary of the ML-MotEx analysis of experimental PDF from disordered molybdenum oxide.**

Based on the SHAP analysis, atom contribution values were calculated. The results are visually illustrated in Fig. 7b, where each molybdenum atom in the structure is coloured yellow if it contributed to a better fit quality and black otherwise. Figure 7b clearly shows a specific motif that ML-MotEx wants to keep in the model. The yellow molybdenum atoms are all part of a ‘triad’ structure, where three [MoO₆] octahedra share edges, and all oxygen atoms that bond to 3 or 4 Mo atoms are connected to yellow molybdenum atoms. This is further illustrated in Supplementary Fig. 3. Specifically, the resulting structural unit that ML-MotEx wants to keep is similar to heptamolybdate [Mo₇O₂₄]⁶⁻, which can be described as several triads connected through edge-sharing. These results indicate that a motif of connected edge-sharing triads, as shown in Fig. 7b, is important in order to describe the data of the disordered molybdenum oxides, which was also found by Christiansen et al.⁷ We note here that when fitting this model to the PDF itself, we cannot describe the medium-range order present in the PDF. ML-MotEx rather allows identifying the main local motifs in the data.

Example 3: Identification of the ionic cluster structure from PDFs

To investigate the reproducibility of the ML-MotEx method, we investigate if similar results are achieved with different starting models, all containing the correct structure motif. We here model a PDF obtained from a solution of 0.05 M ammonium metatungstate hydrate, (NH₄)₆[H₂W₁₂O₄₀]·H₂O in water, which dissolves to form monodisperse α-Keggin clusters²⁸. Experimental details are provided in Supplementary Methods.

To test the ML-MotEx method, we use four different starting models of tungstate oxide crystals, all including the α-Keggin cluster motif with varying complexity. Unit cells from the four following crystal structures were used as starting models: [Hpy]₄H₂[H₂W₁₂O₄₀] (py = pyridine) [1]³², ((CH₃)₄N)₄SiW₁₂O₄₀ [2]³³, (((CH₃)₂NH₂)₆ (Cu(HCON(CH₃)₂)₄)(GeW₁₂O₄₀)₂)(HCON(CH₃)₂)₂ [3]³⁴, and ((CH₃)₂NH₂)₃(PW₁₂O₄₀) [4]³⁵. Again, we discarded all symmetry and generated a discrete structure model corresponding to the atoms in one single unit cell. All atoms other than tungsten and oxygen were furthermore removed from the structures before catalogue structures were created. Figure 8a shows the experimental dataset with simulated PDFs from the four different starting models. Figure 8b illustrates a W₁₂O₄₀ α-Keggin structure.

**Fig. 8: Experimental PDF from Keggin clusters in solution.**

Again, we first build structure catalogues based on the starting models (step 1) and fit them to the experimental PDF (step 2). In this case, we extract 10⁴ structures from each starting model, which is just a small fraction of all possible structures that can be made from the starting models that have 24 ([2]), 48 ([1] and [3]), and 72 ([4]) atoms that are permuted. Again, a GBDT model was trained to predict the R_wp values of the structures (step 3), and SHAP values were obtained to calculate atom contribution values (step 4). The resulting SHAP value plots can be seen in Supplementary Figs. 4–5. While ML-MotEx takes about 100 s on an AMD Ryzen Threadripper 3990X with 64-core 2.9/4.3 GHz using 10⁴ fits on a structure with 48 atoms, it would take ~3 × 10⁶ years (Supplementary Notes 1) to make fits of all the 2⁴⁸ − 1 possible structures using the brute-force approach. Supplementary Table 5 in the Supplementary Information shows the exact computer time of the fits on a MacBook Pro and a Threadripper, which clearly demonstrates the scalability of ML-MotEx.

Figure 9 shows the results of applying ML-MotEx to the 4 different starting models. For structures [1], [3], and [4], the 24 atoms most preferred by ML-MotEx were coloured yellow, while the rest were coloured black. For structure [2], 12 atoms were coloured yellow. In all 4 examples, the yellow atoms form the motif of an α-Keggin cluster; however, in Fig. 9c, d, we see a few mislabelled atoms (2 of 24 atoms in the worst case). The mislabelled atoms are found in the starting models containing most atoms, i.e. with the highest permutation value N. To achieve a better prediction, we could have built larger catalogues of candidate structure motifs and thus performed more fits. We, therefore, conclude that the ML-MotEx method is not completely insensitive to the starting model, but that it yields very similar results for all the tested starting models, provided they contain similar motifs. Furthermore, the example shows that ML-MotEx can be used to investigate PDF data from clusters in solution, whose structure is also part of known crystal structures. As seen from the results in Supplementary Figs. 7–10, we performed an identical analysis of a different dataset obtained from a second solution of 0.05 M ammonium metatungstate hydrate. This analysis provided highly comparable results, as discussed in the Supplementary Information. This illustrates the reproducibility of the method. In Supplementary Discussion 1, we discuss what happens if a poor starting model is used, and how one can identify if the starting model does contain the right motif using the confidence factor.

**Fig. 9: Summary of the ML-MotEx analysis of experimental PDFs from Keggin clusters in solution.**

We have also used the ML-MotEx method for a larger ionic cluster, namely [Bi₃₈O₄₅]. Here, we use the β-Bi₂O₃ structure as starting model and used a ‘cookie-cutter’ strategy to generate structures for the motif catalogue. This example is described in Supplementary Discussion 2, and the ‘cookie-cutter’ approach is shown in Supplementary Fig. 11.

Discussion

In the four examples presented above, we have shown how explainable ML can aid in identifying structural motifs in nanostructured materials and presented an approach to structure characterization. Traditional PDF analysis investigates how an entire structure model agrees with an experimental PDF, rather than identifying how different features in the model affect the fit quality. Instead, ML-MotEx provides a quantitative measure of how each atom or feature contributes to the fit. The use of ML furthermore allows the screening of a large number of models in an automated and fast manner. In the examples described here, ML-MotEx has been used with various starting models with up to 256 metal atoms; however, the algorithm can handle larger systems, as it is highly scalable. In comparison, a full brute-force approach is computationally restricted to systems with up to 15–30 atoms. For the type of systems described here, it is possible to use the method in quasi-experimental time, which could, for example, be useful for analysis of time-resolved scattering data, where the structural motifs present might change with time, which would be revealed by changing SHAP values.

ML-MotEx shares some similarities with the cluster build-up algorithm LIGA^11,12, which automatically builds clusters of different sizes based on information that is contained in inter-atomic distance lists extracted from the PDF. LIGA has been shown to be successful at automatically reconstructing clusters (up to 150 atoms) with no user input except the inter-atomic distance list, extracted from an experimental PDF, and at a low computational cost. However, its use has not caught on because extracting the distance list from the data presents significant practical difficulties, and the list is not unique. Like ML-MotEx, it uses the error that each atom in a cluster contributes to the fit to weigh the decision about which atoms to include in the model. Presumably, part of the success of LIGA and ML-MotEx is their use of this atom contribution for rapidly finding good candidate motifs. Unlike LIGA, ML-MotEx requires a starting model that contains the target structural motif, and it leverages ML to rapidly compute the atom contributions. It can therefore be positioned between traditional refinement (where the complete starting model is needed) and LIGA (which is ab initio) as it finds structural motifs from within a larger model as a starting model for subsequent refinement. However, it has a significant advantage over LIGA in that it works directly on the measured PDF and does not require the inter-atomic distance list to be extracted from the PDF data, and we expect it to be of great practical value. With this in mind, we plan to deploy ML-MotEx as an application on the PDFitc.org web server³⁶.

It may be considered a significant drawback that ML-MotEx requires as an input a structure fragment that contains the target motif within it in order to work. We provide a confidence factor for the starting model, but ML-MotEx still requires significant chemical/structural knowledge and intuition to be of use. We first note that such intuition is widespread in the chemistry community and is unlikely to be a significant drawback in practice. For example, we have recently used the method to identify the structure of intermediates in the formation of transition metal tungstates from polyoxometalate ions using in situ PDF data³⁷, and for identifying stacking fault domain sizes in manganese oxides from PDF and PXRD³⁸. We also note that the method is sufficiently fast that it would be possible to combine it with structural screening applications such as structureMining@PDFitc^15,36. Given chemical information about elements that are present, structureMining searches structural databases for candidate structures. These are then refined to a target dataset, and a rank-ordered list is returned to the user. If the PDF represents a signal from a short-range ordered structural motif, we could insert ML-MotEx between the database mining and refinement steps to search over sets of plausible structures to look for structural sub-motifs. It may be possible to first use structure mining to identify starting models, which could then be used for ML-MotEx analysis. The models could then be further evaluated using both the resulting R_wp values and confidence factor.

The ML-MotEx method is currently limited to PDF analysis in the fitting procedure of the algorithm (step 2); however, the rest of ML-MotEx (steps 1, 3 and 4) is ready to use with data from other techniques. We are confident that a similar approach, taking advantage of explainable ML and SHAP values, can be broadly useful for enhancing and developing how models for data analysis are identified and constructed.

Methods

Step 1: Creation of a catalogue of candidate structure motifs

The first step in ML-MotEx is to use a starting structure model to generate a catalogue of candidate structure motifs, which are all fitted to the data. The structures are generated by removing different numbers of atoms from the original starting structure resulting in thousands of smaller, candidate structure motifs.

This process, which we refer to as ‘structure permutation’, is illustrated in Fig. 10. Here, the starting model contains four metal atoms, which are each bonded to six oxygen atoms. Before candidate structure motifs are generated, we select which atom type should be included in the permutation process. For the project discussed here, this selection is based on the X-ray scattering power of the atoms (i.e. heavier atoms scatter X-rays strongly, while lighter ones do not), and we, therefore, choose to permute over the four metal atoms in the structure rather than oxygen atoms. The total number of atoms that are selected for permutation (here 4) is referred to as the permutation number, N. Note that we do not take symmetry into account in this process.

**Fig. 10: Example of how structure motifs can be extracted from a starting model with 4 metal atoms coordinated to oxygen and used as input to the GBDT model.**

The selected atoms are removed or kept in the model by randomly associating them with zeros and ones, where 0 means that we remove the atom and 1 means we keep it. This is repeated multiple times to generate a large catalogue of candidate structure motifs. The total number of possible motifs from the permutations is equal to 2^N − 1, but only a small fraction of these needs to be produced for ML-MotEx to provide satisfactory results. In Supplementary Discussion 3, we discuss how large a catalogue of candidate structure motifs ML-MotEx needs as training data to output reasonable results. This is likely to be highly system dependent and especially dependent on N and structural symmetry. For the examples presented in the paper, we use ~140–3000 structure motifs per N.
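The random keep/remove assignment can be illustrated with a minimal sketch. It is not taken from the ML-MotEx code base: the function name `generate_catalogue`, the fixed random seed, the removal of duplicate masks and the exclusion of the empty (all-zero) mask are assumptions made for illustration.

```python
import random

def generate_catalogue(n_atoms, n_motifs, seed=0):
    """Draw unique random keep (1) / remove (0) masks over the permuted atoms."""
    max_motifs = 2 ** n_atoms - 1            # every mask except the empty one
    if n_motifs > max_motifs:
        raise ValueError("requested more motifs than 2^N - 1")
    rng = random.Random(seed)
    catalogue = set()
    while len(catalogue) < n_motifs:
        mask = tuple(rng.randint(0, 1) for _ in range(n_atoms))
        if any(mask):                        # assumption: skip the empty structure
            catalogue.add(mask)
    return sorted(catalogue)

# Four permuted metal atoms (N = 4) allow at most 2^4 - 1 = 15 motifs.
catalogue = generate_catalogue(n_atoms=4, n_motifs=10)
```

Each tuple in `catalogue` is one candidate motif: position k tells whether metal atom k of the starting model is kept. For large N, only a small sample of the 2^N − 1 masks would be drawn, in line with the catalogue sizes quoted above.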

The atoms which were not chosen for permutation, in this case oxygen, are removed if they are not within a distance threshold from any other atom. The threshold is user-defined and can be set according to PDF peaks and/or chemically valid distances (i.e. bond lengths) for the expected compounds.
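A minimal sketch of this pruning step is given below. It assumes that 'any other atom' means any metal atom kept in the motif, and the function name `prune_unbonded`, the coordinates and the 2.5 Å threshold are hypothetical values chosen only for illustration.

```python
import math

def prune_unbonded(kept_metals, oxygens, threshold):
    """Keep only the oxygens lying within `threshold` of at least one kept metal."""
    return [o for o in oxygens
            if any(math.dist(o, m) <= threshold for m in kept_metals)]

# Two metal atoms 4 Å apart, with one bridging and two terminal oxygens.
metals = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
oxygens = [(2.0, 0.0, 0.0), (-2.0, 0.0, 0.0), (6.0, 0.0, 0.0)]

mask = (1, 0)                                # keep the first metal, remove the second
kept_metals = [m for m, keep in zip(metals, mask) if keep]
kept_oxygens = prune_unbonded(kept_metals, oxygens, threshold=2.5)
```

Here the oxygen at x = 6 Å is dropped because its only metal neighbour was removed, while the bridging oxygen survives because it is still bonded to the kept metal.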

Step 2: Fitting the catalogue of candidate structure motifs to the data

In the next step, we fit each of the candidate structures in the catalogue to the experimental PDF. Here we use the Python-based program DiffPy-CMI³⁰ for PDF fitting^39,40,41 and apply the Debye equation for the calculation of scattering intensities and PDFs from the structures. The fitting strategies and parameters for each of the examples presented in this work are listed in Supplementary Table 2. The output of the fit is an R_wp value reflecting the quality of the fit:

$$R_{\mathrm{wp}} = \sqrt{\frac{\sum_{i=1}^{n} \left[ G_{\mathrm{obs}}(r_i) - G_{\mathrm{calc}}(r_i, P) \right]^2}{\sum_{i=1}^{n} G_{\mathrm{obs}}(r_i)^2}} \cdot 100\,\%$$

(1)

Here, G_obs and G_calc are the observed and calculated PDFs evaluated at the n points r_i, and P denotes the refinement parameters of the model.
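As an illustration of Eq. 1, the following sketch evaluates R_wp for two PDFs sampled on the same r-grid. The function name `r_wp` and the example numbers are hypothetical; in ML-MotEx the value is obtained from the DiffPy-CMI fit rather than computed separately.

```python
import math

def r_wp(g_obs, g_calc):
    """Eq. 1: residual between observed and calculated PDFs, in percent."""
    if len(g_obs) != len(g_calc):
        raise ValueError("both PDFs must be sampled on the same r-grid")
    numerator = sum((o - c) ** 2 for o, c in zip(g_obs, g_calc))
    denominator = sum(o ** 2 for o in g_obs)
    return math.sqrt(numerator / denominator) * 100.0

g_obs = [3.0, 4.0]        # made-up observed PDF values
g_calc = [3.0, 1.0]       # made-up calculated PDF values
quality = r_wp(g_obs, g_calc)   # sqrt(9 / 25) * 100
```

A calculated PDF identical to the observed one gives 0%, and a calculated PDF that is zero everywhere gives 100%.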

Step 3: Predicting R_wp values using Gradient Boosting Decision Trees

Gradient Boosting Decision Trees (GBDTs)²⁵ are an ML method that performs classification or regression using an ensemble of decision trees. In this work, we use XGBoost²⁵ as the GBDT algorithm for the regression task of predicting the fit quality (step 2) from the structural input given as zeros and ones (step 1) and the number of atoms in the structure. Figure 10b shows the input to the GBDT model.

The optimization is done by building trees of ‘yes’ and ‘no’ questions on whether an atom is kept in the structure or not, based on the resulting R_wp value. A hypothetical example of a simple tree can be seen in Fig. 1, step 3. When atom 4 is present in the structure, the GBDT model predicts an R_wp value that is 5% lower than if atom 4 is absent. In the same way, it predicts an R_wp value that is 12% lower if atom 1 is present in the structure. In the decision tree, the algorithm will therefore say ‘yes’ to keeping both atoms 1 and 4 in the structure. In this project, the GBDT model predicts the R_wp value using a weighted average of 100 trees.
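The idea of boosting trees of yes/no questions can be sketched with a deliberately simplified toy model. This is not XGBoost and not the ML-MotEx implementation: each tree asks a single question about one atom (a decision stump), squared-error boosting with a fixed learning rate is assumed, the number of atoms is not used as an input feature, and the R_wp values are synthetic, constructed so that atom 1 lowers R_wp by 12 and atom 4 lowers it by 5, as in the hypothetical example above.

```python
from itertools import product

def fit_stump(X, residuals):
    """Pick the yes/no question 'is atom j kept?' that best splits the residuals."""
    best = None
    for j in range(len(X[0])):
        yes = [r for x, r in zip(X, residuals) if x[j] == 1]
        no = [r for x, r in zip(X, residuals) if x[j] == 0]
        if not yes or not no:
            continue                         # atom j never varies, so it cannot split
        mean_yes, mean_no = sum(yes) / len(yes), sum(no) / len(no)
        sse = (sum((r - mean_yes) ** 2 for r in yes)
               + sum((r - mean_no) ** 2 for r in no))
        if best is None or sse < best[0]:
            best = (sse, j, mean_yes, mean_no)
    if best is None:
        raise ValueError("no atom varies across the catalogue")
    return best[1:]

def fit_gbdt(X, y, n_trees=100, learning_rate=0.1):
    """Boost one-question trees on the residuals of the current prediction."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, preds)]
        j, mean_yes, mean_no = fit_stump(X, residuals)
        stumps.append((j, mean_yes, mean_no))
        preds = [p + learning_rate * (mean_yes if x[j] else mean_no)
                 for x, p in zip(X, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    """Mean R_wp plus the shrunken sum of all stump outputs for mask x."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        mean_yes if x[j] else mean_no for j, mean_yes, mean_no in stumps)

# Synthetic data: atom 1 (index 0) lowers R_wp by 12, atom 4 (index 3) by 5.
# All 16 keep/remove patterns are used, including the empty one, only to keep
# the toy data balanced.
X = list(product((0, 1), repeat=4))
y = [60.0 - 12.0 * x[0] - 5.0 * x[3] for x in X]
model = fit_gbdt(X, y)
```

With these synthetic values, `predict(model, (1, 0, 0, 1))` comes out close to 43, that is, 60 − 12 − 5. The questions selected by the 100 stumps concern only atoms 1 and 4, which is the behaviour the hypothetical tree describes.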

The GBDT model performance is improved with a large amount of training data, which in this tool is provided by creating a larger catalogue of candidate structure motifs and fitting them to the data.

The GBDT model is trained on 80% of the data, which is referred to as the training set. XGBoost²⁵ was used with default parameters except for the learning rate and maximum tree depth, which were optimized by Bayesian optimization using 50 iterations and 3-fold cross-validation^42,43. While this procedure automates the hyperparameter tuning, we demonstrate in Supplementary Fig. 12 that similar results are achieved across various hyperparameters, and in Supplementary Fig. 13 we show that similar results are achieved across various seeds. The remaining 20% of the data is used to evaluate the performance of the algorithm and is referred to as the test set.
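The data partitioning described here can be sketched as follows. Only the 80/20 split and the three cross-validation folds are shown; the Bayesian optimization itself is not reproduced. The helper names `split_train_test` and `k_fold`, the random seed and the way the folds are interleaved are assumptions for illustration.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(items, k=3):
    """Yield (training, validation) lists for k-fold cross-validation."""
    for i in range(k):
        validation = [x for j, x in enumerate(items) if j % k == i]
        training = [x for j, x in enumerate(items) if j % k != i]
        yield training, validation

fits = list(range(1000))                 # stand-in for 1000 (motif, R_wp) pairs
train_set, test_set = split_train_test(fits)
folds = list(k_fold(train_set, k=3))     # used only while tuning hyperparameters
```

Each candidate pair of learning rate and maximum depth would be scored on the three validation folds drawn from the training set, so the test set stays untouched until the final evaluation.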

Step 4: Quantifying the effect of structural features using SHAP values, assigning atomic contribution values

SHAP values are used to analyse the R_wp values resulting from the process described above. For each fit (step 2), each atom in the starting model is assigned a SHAP value. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature pushes the R_wp value of the fit towards 1 (poor fit) or 0 (perfect fit); in other words, why it is important. Each atom in the starting model thus gets F SHAP values, where F is the number of fits made in step 2 of the algorithm. We divide the F SHAP values into two categories: first, those where the atom was kept in the structure motif (kept atom SHAP value list) and second, those where the atom was removed to create the structure motif (removed atom SHAP value list). From each of the two lists, an average SHAP value for the atom can be calculated, defined as SHAP_average-kept and SHAP_average-removed. We then define an atom contribution value, which is calculated as the difference between the two average SHAP values, i.e.:

$$\mathrm{Atom\,contribution\,value} = \mathrm{SHAP}_{\mathrm{average\mbox{-}kept}} - \mathrm{SHAP}_{\mathrm{average\mbox{-}removed}}$$

(2)

We also define the uncertainty of this value as described in Eq. 3:

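$$\mathrm{Atom\,contribution\,value\,RMS} = \left[ \left( \mathrm{SHAP}_{\mathrm{average\mbox{-}kept}}^{\mathrm{RMS}} \right)^{2} - \left( \mathrm{SHAP}_{\mathrm{average\mbox{-}removed}}^{\mathrm{RMS}} \right)^{2} \right]^{0.5}$$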

(3)

We define a confidence factor for each atom that describes how confident we can be about including/excluding that atom in a structural motif:

$$\mathrm{Confidence\,factor} = \frac{\mathrm{Atom\,contribution\,value}}{\mathrm{Atom\,contribution\,value\,RMS}}$$

(4)
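Equations 2 and 4 can be illustrated for a single atom with the sketch below. The SHAP values, the kept/removed flags and the RMS value are invented numbers, and the function names are hypothetical. The RMS of Eq. 3 is taken as a given input rather than recomputed here.

```python
def atom_contribution_value(shap_values, kept_flags):
    """Eq. 2: mean SHAP value when the atom is kept minus mean when it is removed."""
    kept = [s for s, k in zip(shap_values, kept_flags) if k == 1]
    removed = [s for s, k in zip(shap_values, kept_flags) if k == 0]
    if not kept or not removed:
        raise ValueError("the atom must be both kept and removed at least once")
    return sum(kept) / len(kept) - sum(removed) / len(removed)

def confidence_factor(contribution, contribution_rms):
    """Eq. 4: atom contribution value divided by its RMS."""
    return contribution / contribution_rms

# One atom across F = 5 fits: SHAP value per fit and whether the atom was kept.
shap_values = [-2.0, -1.0, 3.0, 2.0, -3.0]
kept_flags = [1, 1, 0, 0, 1]

contribution = atom_contribution_value(shap_values, kept_flags)     # -2.0 - 2.5
confidence = confidence_factor(contribution, contribution_rms=1.5)  # made-up RMS
```

In this example the kept average is −2.0 and the removed average is 2.5, so the atom contribution value is −4.5: keeping the atom pushes R_wp down, which argues for including it in the motif.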

The results of ML-MotEx can be inspected visually, as the atoms in the starting model are coloured according to their atom contribution value, using yellow for a low atom contribution value (tendency to keep the atom, pushing R_wp down) and black for a high atom contribution value (tendency to remove the atom, pushing R_wp up). ML-MotEx outputs VESTA⁴⁴ and CrystalMaker⁴⁵ files in which all atoms are coloured according to their atom contribution value.
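One way to picture the colouring is a linear scale from yellow to black. The sketch below is an assumption for illustration only: the actual colour map and file writers used by ML-MotEx are not specified here, and the function name `contribution_colours` is hypothetical.

```python
def contribution_colours(values):
    """Map atom contribution values to RGB: lowest -> yellow, highest -> black."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    colours = []
    for value in values:
        t = 0.0 if span == 0 else (value - lowest) / span   # 0 = lowest, 1 = highest
        colours.append((1.0 - t, 1.0 - t, 0.0))             # fade yellow to black
    return colours

colours = contribution_colours([-4.5, 0.0, 4.5])
```

The three example atoms receive pure yellow, a darker olive tone and black, respectively, so atoms that improve the fit stand out as the brightest ones.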

Data availability

The authors declare that the data supporting this study are available within the paper, its Supplementary Information files and the GitHub repository associated with the paper: https://github.com/AndySAnker/ML-MotEx. Additional data that support the findings of this study are available from the corresponding author upon request.

Code availability

The authors declare that the code supporting this study is available in the GitHub repository associated with the paper: https://github.com/AndySAnker/ML-MotEx. Additional code that supports the findings of this study is available from the corresponding author upon request.

References

1. Billinge, S. J. L. & Kanatzidis, M. G. Beyond crystallography: the study of disorder, nanocrystallinity and crystallographically challenged materials with pair distribution functions. Chem. Commun. 7, 749–760 (2004).
2. Keen, D. A. & Goodwin, A. L. The crystallography of correlated disorder. Nature 521, 303–309 (2015).
3. Christiansen, T. L., Cooper, S. R. & Jensen, K. M. Ø. There’s no place like real-space: elucidating size-dependent atomic structure of nanomaterials using pair distribution function analysis. Nanoscale Adv. 2, 2234–2254 (2020).
4. Billinge, S. J. L. & Levin, I. The problem with determining atomic structure at the nanoscale. Science 316, 561–565 (2007).
5. Juelsholt, M. et al. Size-induced amorphous structure in tungsten oxide nanoparticles. Nanoscale 13, 20144–20156 (2021).
6. Yang, X. et al. Confirmation of disordered structure of ultrasmall CdSe nanoparticles from X-ray atomic pair distribution function analysis. Phys. Chem. Chem. Phys. 15, 8480–8486 (2013).
7. Christiansen, T. L. et al. Structure analysis of supported disordered molybdenum oxides using pair distribution function analysis and automated cluster modelling. J. Appl. Crystallogr. 53, 148–158 (2020).
8. Bennett, T. D. & Cheetham, A. K. Amorphous metal–organic frameworks. Acc. Chem. Res. 47, 1555–1562 (2014).
9. Kjær, E. T. S. et al. DeepStruc: Towards structure solution from pair distribution function data using deep generative models. Preprint at https://chemrxiv.org/engage/chemrxiv/article-details/62fa600cd0c5cb353465329f (2022).
10. Cliffe, M. J., Dove, M. T., Drabold, D. & Goodwin, A. L. Structure determination of disordered materials from diffraction data. Phys. Rev. Lett. 104, 125501 (2010).
11. Juhás, P., Cherba, D. M., Duxbury, P. M., Punch, W. F. & Billinge, S. J. L. Ab initio determination of solid-state nanostructure. Nature 440, 655–658 (2006).
12. Juhás, P., Granlund, L., Duxbury, P. M., Punch, W. F. & Billinge, S. J. L. The Liga algorithm for ab initio determination of nanostructure. Acta Crystallogr. A 64, 631–640 (2008).
13. Christiansen, T. L., Bøjesen, E. D., Juelsholt, M., Etheridge, J. & Jensen, K. M. Ø. Size induced structural changes in molybdenum oxide nanoparticles. ACS Nano 13, 8725–8735 (2019).
14. Aalling-Frederiksen, O., Juelsholt, M., Anker, A. S. & Jensen, K. M. Ø. Formation and growth mechanism for niobium oxide nanoparticles: atomistic insight from in situ X-ray total scattering. Nanoscale 13, 8087–8097 (2021).
15. Yang, L., Juhás, P., Terban, M. W., Tucker, M. G. & Billinge, S. J. L. Structure-mining: screening structure models by automated fitting to the atomic pair distribution function over large numbers of models. Acta Crystallogr. A 76, 395–409 (2020).
16. Anker, A. S. et al. Characterising the Atomic Structure of Mono-Metallic Nanoparticles from X-Ray Scattering Data Using Conditional Generative Models. In Proc. 16th International Workshop on Mining and Learning with Graphs (MLG) (Association for Computing Machinery, New York, NY, 2020). https://www.mlgworkshop.org/2020/.
17. Banerjee, S. et al. Cluster-mining: an approach for determining core structures of metallic nanoparticles from atomic pair distribution function data. Acta Crystallogr. A 76, 24–31 (2020).
18. Anker, A. S. et al. Structural changes during the growth of atomically precise metal oxido nanoclusters from combined pair distribution function and small-angle X-ray scattering analysis. Angew. Chem. Int. Ed. 60, 2–12 (2021).
19. Butler, K. T., Le, M. D., Thiyagalingam, J. & Perring, T. G. Interpretable, calibrated neural networks for analysis and understanding of inelastic neutron scattering data. J. Phys. Condens. Matter 33, 194006 (2021).
20. Suzuki, Y. et al. Symmetry prediction and knowledge discovery from X-ray diffraction patterns using an interpretable machine learning approach. Sci. Rep. 10, 21790 (2020).
21. Torrisi, S. B. et al. Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships. npj Comput. Mater. 6, 109 (2020).
22. Oviedo, F., Ferres, J. L., Buonassisi, T. & Butler, K. T. Interpretable and Explainable Machine Learning for Materials Science and Chemistry. Acc. Mater. Res. 3, 597–607 (2022).
23. Schmidt, J., Marques, M. R. G., Botti, S. & Marques, M. A. L. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater. 5, 83 (2019).
24. Lee, K., Ayyasamy, M. V., Delsa, P., Hartnett, T. Q. & Balachandran, P. V. Phase classification of multi-principal element alloys via interpretable machine learning. npj Comput. Mater. 8, 25 (2022).
25. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, San Francisco, 2016).
26. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).
27. Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proc. 31st International Conference on Neural Information Processing Systems, 4765–4774 (Curran Associates, Inc., 2017).
28. Juelsholt, M., Lindahl Christiansen, T. & Jensen, K. M. Ø. Mechanisms for tungsten oxide nanoparticle formation in solvothermal synthesis: from polyoxometalates to crystalline materials. J. Phys. Chem. C 123, 5110–5119 (2019).
29. Chen, X. & Yamanaka, S. Single-crystal X-ray structural refinement of the ‘tetragonal’ C₆₀ polymer. Chem. Phys. Lett. 360, 501–508 (2002).
30. Juhas, P., Farrow, C. L., Yang, X., Knox, K. R. & Billinge, S. J. L. Complex modeling: a strategy and software program for combining multiple information sources to solve ill posed structure and nanostructure inverse problems. Acta Crystallogr. A 71, 562–568 (2015).
31. Krebs, B. & Paulat-Böschen, I. The structure of the potassium isopolymolybdate K₈[Mo₃₆O₁₂(H₂O)₁₆]·nH₂O (n = 36⋯40). Acta Crystallogr. B 38, 1710–1718 (1982).
32. Niu, J., Zhao, J., Wang, J. & Bo, Y. Syntheses, spectroscopic characterization, thermal behavior, electrochemistry and crystal structures of two novel pyridine metatungstates. J. Coord. Chem. 57, 935–946 (2004).
33. Joachim, F., Axel, T. & Rosemarie, P. Strukturen und Schwingungsspektren des Tetramethylammonium-α-dodekawolframatosilikats und des Tetrabutylammonium-β-dodekawolframatosilikats: Structures and Vibrational Spectra of Tetramethylammonium α-Dodecatungstosilicate and Tetrabutylammonium β-Dodecatungstosilicate. Z. Naturforsch. B 36, 161–171 (1981).
34. Niu, J.-Y., Han, Q.-X. & Wang, J.-P. A novel Keggin units-supported complex: synthesis, characterization and crystal structure of [(CH₃)₂NH₂]₆[Cu(DMF)₄(GeW₁₂O₄₀)₂]·2DMF. J. Coord. Chem. 56, 523–530 (2003).
35. Busbongthong, S. & Ozeki, T. Structural relationships among methyl-, dimethyl-, and trimethylammonium phosphododecatungstates. Bull. Chem. Soc. Jpn. 82, 1393–1397 (2009).
36. Yang, L. et al. A cloud platform for atomic pair distribution function analysis: PDFitc. Acta Crystallogr. A 77, 2–6 (2021).
37. Skjærvø, S. L. et al. Atomic structural changes in the formation of transition metal tungstates: the role of polyoxometalate structures in material crystallization. Preprint at https://chemrxiv.org/engage/chemrxiv/article-details/62ebefdcd131b71fc70c4ef2 (2022).
38. Magnard, N. P. L., Anker, A. S., Aalling-Frederiksen, O., Kirsch, A. & Jensen, K. M. Ø. Characterisation of intergrowth in metal oxide materials using structure-mining: the case of γ-MnO₂. Dalton Trans., https://doi.org/10.1039/D2DT02153F (2022).
39. Proffen, T. & Neder, R. B. DISCUS, a program for diffuse scattering and defect structure simulations – update. J. Appl. Crystallogr. 32, 838–839 (1999).
40. Proffen, T. & Neder, R. B. DISCUS: a program for diffuse scattering and defect-structure simulation. J. Appl. Crystallogr. 30, 171–175 (1997).
41. Coelho, A. A. TOPAS and TOPAS-Academic: an optimization program integrating computer algebra and crystallographic objects written in C++. J. Appl. Crystallogr. 51, 210–218 (2018).
42. Nogueira, F. Bayesian Optimization: Open source constrained global optimization tool for Python. https://github.com/fmfn/BayesianOptimization (2014).
43. Putatunda, S. & Rama, K. A Comparative Analysis of Hyperopt as Against Other Approaches for Hyper-Parameter Optimization of XGBoost. In Proc. 2018 International Conference on Signal Processing and Machine Learning, 6–10 (Association for Computing Machinery, New York, 2018).
44. Momma, K. & Izumi, F. VESTA 3 for three-dimensional visualization of crystal, volumetric and morphology data. J. Appl. Crystallogr. 44, 1272–1276 (2011).
45. Palmer, D. C. Visualization and analysis of crystal structures using CrystalMaker software. Z. Kristallogr. Cryst. Mater. 230, 559–572 (2015).

Acknowledgements

This work is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement No. 804066). We are grateful to the Villum Foundation for financial support through a Villum Young Investigator grant (VKR00015416). Funding from the Danish Ministry of Higher Education and Science through the SMART Lighthouse is gratefully acknowledged. We furthermore thank DANSCATT (supported by the Danish Agency for Science and Higher Education) for support. We acknowledge DESY (Hamburg, Germany), a member of the Helmholtz Association HGF, for the provision of experimental facilities. Parts of this research were carried out at beamline P02.1 and P07 at Petra III and we would like to thank Martin Etter, Jozef Bednarcik, and Ann-Christin Dippel for assistance in using the beamlines. This research used resources of the Advanced Photon Source, a US Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under contract No. DE-AC02-06CH11357. We acknowledge MAX IV Laboratory for time on Beamline DanMAX under Proposal 20200731. Research conducted at MAX IV is supported by the Swedish Research council under contract 2018-07152, the Swedish Governmental Agency for Innovation Systems under contract 2018-04969, and Formas under contract 2019-02496. DanMAX is funded by the NUFI grant no. 4059-00009B. S.J.L.B. was supported by the U.S. National Science Foundation through grant DMREF-1922234.

Author information

Authors and Affiliations

Department of Chemistry and Nano-Science Center, University of Copenhagen, 2100, Copenhagen, Denmark
Andy S. Anker, Emil T. S. Kjær, Troels Lindahl Christiansen, Susanne Linn Skjærvø & Kirsten M. Ø. Jensen
Department of Materials, University of Oxford, Parks Road, Oxford, UK
Mikkel Juelsholt
Department of Chemistry & iNANO, Aarhus University, 8000, Aarhus, Denmark
Mads Ry Vogel Jørgensen & Daniel Risskov Sørensen
MAX IV Laboratory, Lund University, 224 84, Lund, Sweden
Mads Ry Vogel Jørgensen, Innokenty Kantor & Daniel Risskov Sørensen
Department of Physics, Technical University of Denmark, 2880, Lyngby, Denmark
Innokenty Kantor
Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY, 10027, USA
Simon J. L. Billinge
Condensed Matter Physics and Materials Science Department, Brookhaven National Laboratory, Upton, NY, 11973, USA
Simon J. L. Billinge
Department of Computer Science, University of Copenhagen, 2100, Copenhagen, Denmark
Raghavendra Selvan
Department of Neuroscience, University of Copenhagen, 2200, Copenhagen, Denmark
Raghavendra Selvan

Contributions

A.S.A., E.T.S.K., and K.M.Ø.J. conceptualized the project. A.S.A., E.T.S.K., M.J., T.L.C., S.L.S., S.J.L.B., and R.S. designed the methodology and A.S.A. and E.T.S.K. wrote the code. A.S.A., M.J., T.L.C., M.R.V.J., I.K., and D.R.S. collected the data. K.M.Ø.J. procured funding and supervised the project. All authors were involved with the writing of the paper.

Corresponding author

Correspondence to Kirsten M. Ø. Jensen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Anker, A.S., Kjær, E.T.S., Juelsholt, M. et al. Extracting structural motifs from pair distribution function data of nanostructures using explainable machine learning. npj Comput Mater 8, 213 (2022). https://doi.org/10.1038/s41524-022-00896-3

Received: 25 April 2022
Accepted: 14 September 2022
Published: 01 October 2022
DOI: https://doi.org/10.1038/s41524-022-00896-3
