Introduction

The development of advanced, functional materials builds on an understanding of the intricate relationship between material structure and properties, and over the past century, crystallographic methods using scattering and diffraction have thus been essential for materials science. Crystallography allows ab initio determination of crystal structures from diffraction data and has provided us with the vast knowledge of crystal chemistry that is now used in the design of functional materials. However, in the case of nanomaterials with limited long-range order, crystallographic methods are challenged, and ab initio structure determination, or structure solution, is not currently possible. Over the past decades, total scattering with Pair Distribution Function (PDF) analysis has become an essential tool for the characterization of nanomaterial structure1,2. The PDF is the Fourier transform of normalized and corrected X-ray, neutron, or electron scattering intensities, and is a function in real space representing a histogram of inter-atomic distances in the sample. Compared to crystallographic methods relying on long-range order, PDF analysis can be applied for nanomaterials3,4,5, disordered1,6,7, or amorphous materials3,5,8. However, structure solution from the PDF is not possible except in a very few simple cases9, using either the Reverse Monte Carlo method10 or the LIGA algorithm11,12. In the absence of broadly applicable ab initio nanostructure determination methods, it is, therefore, necessary to propose reasonable starting models and to then ‘refine’ the model parameters against the data using local minimization methods. The step of finding a starting model can be a major challenge and is thus a bottleneck in complex material characterization. 
In the case of PDF analysis of nanomaterials, such models are often guessed at by considering related bulk materials; however, these are often not good starting models for very small clusters and nanoparticles, where significant structural changes may take place3,5,13,14. A way of building plausible starting models is thus needed, where structure models can be built capturing local bonding topologies suggested by known chemistries.
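The description of the PDF as a histogram of inter-atomic distances can be illustrated with a minimal sketch. It only counts pair distances in a toy cluster; a measured PDF additionally involves scattering-power weighting, normalization, and peak broadening, none of which is modelled here. The function name, the bin width, and the four-atom cluster are illustrative assumptions.

```python
import itertools
import math


def distance_histogram(positions, r_max=2.0, bin_width=0.25):
    """Count inter-atomic distances in bins of width bin_width up to r_max."""
    n_bins = int(round(r_max / bin_width))
    counts = [0] * n_bins
    # Every unordered pair of atoms contributes one distance.
    for a, b in itertools.combinations(positions, 2):
        index = int(math.dist(a, b) / bin_width)
        if index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical toy cluster: four atoms on the corners of a unit square (in Å).
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
histogram = distance_histogram(square)
print(histogram)
```

The four sides of the square fall in the bin starting at 1.0 Å and the two diagonals in the bin starting at 1.25 Å, which is the kind of distance fingerprint that distinguishes one candidate motif from another.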

Recently, automated methods such as ‘structure mining’ and ‘cluster mining’ have appeared in the literature to help overcome this challenge15,16,17. In a study of the structure of metallic nanoparticles, Banerjee et al. automatically generated thousands of discrete metal nanocluster structures and fitted PDFs from each of them to experimental data to identify the best model in an automated manner17. In a recent study of molybdenum oxide nanomaterials, we introduced another approach, where we automatically generated a large number of MoOx cluster structure models and compared their PDFs to experimental data in order to identify dominating structural motifs in the sample, i.e. arrangements of atoms that dominate the material structure on the local scale7. We hypothesized that the structural motifs present in amorphous molybdenum oxides can also be found in crystalline structures, and therefore used crystal structures of molybdenum oxides as starting models. From these models, we cut out thousands of different cluster structure models of different sizes to build a ‘catalogue’ of structure candidates. These models were all tested against our data to identify the best fitting structural motif. We recently used a similar approach for the identification of a bismuth oxido cluster intermediate structure in a study of cluster growth18.

While these approaches can extend the structural space searched when identifying models for structure refinement, new challenges arise. Firstly, the refinement processes can be computationally heavy, which can limit the number of catalogue structures that are tested. For example, our brute-force approach for cluster identification above generates 2^N − 1 structures for a starting model with N atoms. Each structure must have its PDF computed and then refined against the target measured PDF, so that its fit quality can be evaluated. This process is computationally costly and does not scale well with the number of structure candidates. Furthermore, for disordered, amorphous, and nanostructured systems, many hundreds of models may provide similar fit qualities, and if only a few of them are reported, it is difficult to assess which structural features of these models are important. We, therefore, need effective and unbiased methods to compare many fits to extract structural information.

Here, we introduce a method that uses an explainable Machine Learning (ML) model that, after training, will predict the agreement factor for a test cluster with a given dataset. Furthermore, the use of explainable ML informs which features in the model are important for the agreement factor19,20,21,22,23,24. Our Machine Learning based Motif Extractor (ML-MotEx) model is illustrated in Fig. 1. Firstly, it builds a large catalogue of thousands of candidate structural motifs, which are ‘cut outs’ from a chosen bulk structure7,18 (step 1). The PDF is then computed from each one, and each model is fit to the target dataset (step 2). The structures and Rwp values from each fit are handed to an ML algorithm applying gradient boosting decision trees (GBDTs)25, which learns to predict Rwp values for new fits based on an atomic structure model (step 3). The ML-MotEx algorithm then outputs quantified values of how important each atom or feature in the starting structure is for the fit to yield a low Rwp value with the given fitting algorithm (step 4). This is done by using SHAP (SHapley Additive exPlanations)26,27 values, an established tool for explaining tree-based ML models. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature pushes the Rwp value of the fit towards 1 (poor fit) or 0 (perfect fit), in other words why it is important.

Fig. 1: Illustration of the ML-MotEx process.

Firstly, a starting model is provided. Using this starting model, a structure catalogue is generated, and the structures in the catalogue are fitted to the experimental data in question. An ML algorithm is then trained to predict Rwp values, and finally quantified values of feature importance for the fit quality are calculated.

Compared to the automated, brute-force methods previously introduced for PDF analysis7,15,16,17, our method can screen a larger number of structures much faster. It only needs to screen a sub-sample (~10,000) of the much larger number of motifs that can be generated from the bulk material to learn how to predict which structures provide a good agreement with the data. The analysis done for the examples presented below would take ~24 days for starting models with 24 atoms, ~3 × 10^6 years for starting models with 48 atoms, and ~6 × 10^13 years for starting models with 72 atoms using a brute-force approach (Supplementary Notes 1), while the ML-MotEx analysis is done in minutes or hours. Furthermore, the use of explainable ML provides a way to better analyse the output of the screening: instead of just identifying the model that provides the lowest Rwp value, we are able to output a measure of how important each atom or feature (e.g. size or shape) in the starting model is for the fit to yield a low Rwp value (step 4). This procedure is automated and can be done in quasi-real experimental time and without human bias.
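The scaling behind these brute-force estimates can be sketched with a few lines of arithmetic. The time per fit used below is an assumed placeholder, not a value taken from Supplementary Notes 1, and in practice the time per fit also grows with the size of the model, so the sketch reproduces only the order-of-magnitude trend and not the exact figures quoted above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def brute_force_fit_count(n_atoms):
    """Number of non-empty subclusters of a starting model with n_atoms atoms."""
    return 2 ** n_atoms - 1


def brute_force_years(n_atoms, seconds_per_fit):
    """Wall time in years to fit every subcluster one after another."""
    return brute_force_fit_count(n_atoms) * seconds_per_fit / SECONDS_PER_YEAR


# Assumption for illustration only: a constant 0.1 s per fit.
ASSUMED_SECONDS_PER_FIT = 0.1

for n_atoms in (24, 48, 72):
    years = brute_force_years(n_atoms, ASSUMED_SECONDS_PER_FIT)
    fits = brute_force_fit_count(n_atoms)
    print(f"{n_atoms} atoms: {fits:.2e} fits, ~{years:.1e} years")
```

Because the number of fits doubles with every added atom, a fixed catalogue of ~10,000 sampled structures becomes a vanishingly small fraction of the full space as the starting model grows.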

We illustrate the use of ML-MotEx using four different examples. We first show the principles of the method using a simple model system based on simulated X-ray PDF data from a C60 buckyball. We further demonstrate the use of ML-MotEx on experimental X-ray PDF data from amorphous, disordered molybdenum oxides7 and tungstate α-Keggin clusters in solution28, where it allows identifying the main structural motifs present in the samples using different starting models. Lastly, we extend the method to use a ‘cookie-cutter’ strategy to generate structures for the catalogue of candidate motifs. Here, the algorithm is used to identify a bismuth oxido cluster by using a cut-out of the β-Bi2O3 structure as starting model. The examples illustrate that it is possible to obtain knowledge of dominating structural motifs from PDF in an automated manner using ML.

Results

ML-MotEx algorithm

ML-MotEx consists of four steps, which are shown in Fig. 1, with simplified pseudo-code of the algorithm given in Fig. 2. In the first step, a starting structure model is used to generate a catalogue of candidate structure motifs. As detailed in the Methods section, the structures are generated by removing different numbers of atoms from the original starting structure, which results in thousands of smaller, candidate structure motifs. In the second step, a fitting script is used to fit the generated candidate structures to the dataset. In the third step, the fitting results are handed to the explainable ML algorithm, which is optimized and trained. By using this information, SHAP values of the atoms or structural features in the starting model are calculated in the fourth step. The output of the algorithm is thus the starting model along with SHAP values, indicating the importance of each individual atom in the structure for the fit quality, or, in other words, how much each individual atom or feature affects the Rwp value either positively or negatively. We refer to this value as the “atom contribution value”. We furthermore define the ratio between the atom contribution value and its uncertainty as the “confidence factor”. Further definitions and descriptions of the individual steps of the algorithm are given in the Methods section.

Fig. 2: Pseudo-code describing the four steps of ML-MotEx.

A starting model, fitting script, and dataset are given as input. Firstly, a catalogue of candidate structure motifs is generated (step 1), which are fitted to the dataset (step 2). The output from steps 1 and 2 is then given to an ML algorithm, which learns to predict goodness-of-fit (Rwp) values based on the structure motif (step 3). Lastly, SHAP values are calculated for each feature (step 4), which can be converted to atom contribution values.

Example 1: Proof-of-concept: identification of the C60 buckyball

We first show the use of ML-MotEx with a simple, proof-of-concept example, using a calculated PDF from an ideal C60 buckyball (Fig. 3a). The aim is to identify the structural motif, the C60 buckyball, from the data.

Fig. 3: Analysis of a simulated PDF from a C60 buckyball.

a C60 buckyball, b single C60 unit cell29, treated as a discrete structure with 132 atoms and c their simulated PDFs. The simulation parameters mimic typical values of a PDF dataset and are presented in Supplementary Table 3. d Rwp values obtained in the fits using the C60 structure catalogue, plotted as a function of number of atoms in the structure motifs. Note that the model with a single C60 buckyball is not included in the set of 384,260 structures tested. This would result in a perfect fit with Rwp = 0%. e Examples of candidate structure motifs with their corresponding fits to the simulated C60 buckyball data. Grey, semitransparent atoms are removed from the starting model.

We first need a starting structure that contains the motifs we are looking for. In this simplified example, we use a single unit cell of the crystal structure of C6029. However, we discarded all symmetry and generated a discrete structure model corresponding to the 132 atoms in one unit cell. This model is shown in Fig. 3b, where one whole C60 structure (Fig. 3a) is seen along with fragments of the neighbouring C60 buckyballs. The simulated PDFs of the C60 buckyball and the starting model are shown in Fig. 3c.

We can now use this starting model to generate a catalogue of structures, which are all fitted to the data. The structures are created by removing different numbers of atoms from the original starting structure, which results in thousands of smaller, candidate structure motifs. These model generation and fitting steps are identical to our previously reported brute-force approach, where we simply compare the Rwp values of all the fits to identify the best structure motif. We first consider this simple approach. One of the limitations of the brute-force method is that the number of possible candidate structures is exponential in N, the number of atoms in the model. Since each atom in the starting model can be present or absent, the number of possible subclusters is equal to 2^N − 1. For large models such as the C60 starting model containing 132 atoms, this is ~10^40, a gigantic number, making it impossible to investigate all candidate structures. For this example, we used 384,260 structures to train ML-MotEx, which is only a very small fraction of the 2^132 − 1 possible candidate structures. Note that the model with a single C60 buckyball was not in the generated structure catalogue.
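The catalogue generation can be sketched as sampling binary keep/remove masks over the atoms of the starting model. The sampling scheme below (a uniformly random number of kept atoms, then a random choice of which atoms to keep) is an assumption made for illustration; the scheme actually used is described in the Methods section. The function name and the catalogue size of 1,000 are likewise hypothetical.

```python
import random


def generate_catalogue(n_atoms, n_structures, seed=0):
    """Return unique keep/remove masks; 1 keeps an atom, 0 removes it."""
    if n_structures > 2 ** n_atoms - 1:
        raise ValueError("more structures requested than subclusters exist")
    rng = random.Random(seed)
    catalogue = set()
    while len(catalogue) < n_structures:
        n_keep = rng.randint(1, n_atoms)           # how many atoms to keep
        kept = rng.sample(range(n_atoms), n_keep)  # which atoms to keep
        mask = [0] * n_atoms
        for index in kept:
            mask[index] = 1
        catalogue.add(tuple(mask))                 # the set discards duplicates
    return sorted(catalogue)


# Starting model with 132 atoms, as in the C60 unit-cell example.
catalogue = generate_catalogue(n_atoms=132, n_structures=1000)
fraction_sampled = len(catalogue) / (2 ** 132 - 1)
print(f"{len(catalogue)} structures, fraction of all subclusters: {fraction_sampled:.1e}")
```

Each mask doubles as the feature vector for the ML step, since it records directly which atoms are present in a given candidate motif.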

All these 384,260 structures were fitted to the PDF calculated from the C60 cluster. Only a scale factor, an isotropic expansion/contraction factor, and isotropic Atomic Displacement Parameters (ADPs) were refined, as detailed in Supplementary Table 2. We note that refinement of the atom positions can be added to the fitting procedure to expand the chemical space that is investigated. However, this would be computationally expensive, and it would allow deviations from the chemical topologies set up in the starting model.
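As a reduced illustration of the fitting step, the sketch below refines only a scale factor, which has a closed-form least-squares solution, and evaluates the fit with the conventional weighted-profile R-factor, Rwp = sqrt(Σ w (G_obs − G_calc)² / Σ w G_obs²). The isotropic expansion/contraction factor and ADPs refined in the actual fits are not modelled, the DiffPy-CMI API is not used, and the function names and toy data are hypothetical.

```python
import math


def best_scale(g_obs, g_calc):
    """Least-squares scale factor s that minimises sum((g_obs - s * g_calc)^2)."""
    numerator = sum(o * c for o, c in zip(g_obs, g_calc))
    denominator = sum(c * c for c in g_calc)
    return numerator / denominator


def rwp(g_obs, g_calc, weights=None):
    """Weighted-profile R-factor as a fraction (0 is a perfect fit)."""
    if weights is None:
        weights = [1.0] * len(g_obs)  # unit weights as a default assumption
    residual = sum(w * (o - c) ** 2 for w, o, c in zip(weights, g_obs, g_calc))
    norm = sum(w * o * o for w, o in zip(weights, g_obs))
    return math.sqrt(residual / norm)


# Toy data: the model has the right peak shape but half the amplitude.
g_obs = [0.0, 2.0, -1.0, 4.0]
g_model = [0.0, 1.0, -0.5, 2.0]

scale = best_scale(g_obs, g_model)
g_scaled = [scale * c for c in g_model]
print(rwp(g_obs, g_model), rwp(g_obs, g_scaled))
```

Keeping the number of refined parameters this small is what makes it feasible to fit many thousands of catalogue structures.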

To get an overview of the results from these fits, we plot the Rwp value versus the number of atoms in the structure (Fig. 3d). To further investigate the results, one must visually inspect the fits of the catalogue of candidate structure motifs and their Rwp values. Some of the candidate structure motifs are shown as insets in Fig. 3e, where transparent grey atoms represent atoms deleted from the models. The fits of these structures to the dataset are presented in Fig. 3e, along with the Rwp values. The Rwp value appears to drop when the ‘outer’ atoms are removed, while it increases when the atoms that are part of the centre C60 buckyball are removed. From investigating these few, manually selected structures and their corresponding fitted Rwp values, one can hypothesize that the structure giving the best fit should be the C60 buckyball. However, this method can be biased by human interaction, and it is time-consuming and difficult to go through the many fits to extract structural information.

We, therefore, move on to the ML-MotEx method. Using the catalogue of candidate structure motifs and the corresponding Rwp values obtained above, we train a GBDT model on the training set to predict the Rwp value of the candidate structure motifs. Figure 4 shows the predicted Rwp values of the ML algorithm versus the Rwp value of the structures when they are fitted to the simulated C60 dataset in DiffPy-CMI30. For the structures used in the test set, the GBDT model predicts the Rwp value with a mean absolute error of 2.0%.

Fig. 4: Predicted Rwp values versus true Rwp values.

Rwp values from the fits of the catalogue structures to the simulated C60 dataset, plotted versus the predicted Rwp values from the GBDT model from the same structures. The mean squared error (MSE) and the mean absolute error (MAE) are based on all 76,852 predictions in the test set, which are structures the model has not been trained on.
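The two error metrics quoted for the test set can be written out explicitly. The sketch below uses made-up Rwp values purely to show the arithmetic; it also checks that the 76,852 test structures are consistent with holding out one fifth of the 384,260 catalogue structures, which is an inference from the two numbers rather than a split stated in this section.

```python
def mean_absolute_error(true_values, predicted_values):
    """Average absolute difference between true and predicted values."""
    pairs = list(zip(true_values, predicted_values))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


def mean_squared_error(true_values, predicted_values):
    """Average squared difference between true and predicted values."""
    pairs = list(zip(true_values, predicted_values))
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


# Hypothetical Rwp values in percent, for illustration only.
true_rwp = [10.0, 20.0, 30.0, 40.0]
predicted_rwp = [12.0, 19.0, 33.0, 40.0]

mae = mean_absolute_error(true_rwp, predicted_rwp)
mse = mean_squared_error(true_rwp, predicted_rwp)

# One fifth of the catalogue matches the test-set size given in the caption.
n_test = 384260 // 5
print(mae, mse, n_test)
```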

We now use explainable ML to explain Rwp values with the use of the feature importance tool SHAP values27. As described in detail in the Methods section, a SHAP value is calculated for each structural feature (here, each atom and the cluster size) for each candidate structure motif that is fitted to the PDF during the training process. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature affects the Rwp value of the fit towards 100% (poor fit) or 0% (perfect fit), in other words why it is important.
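The meaning of the sign and amplitude of a SHAP value can be illustrated with exact Shapley values for a toy predictor. The sketch enumerates all feature coalitions directly and uses a single all-atoms-removed baseline; this is a simplification for illustration and not the tree-based SHAP computation applied to the GBDT model, which averages over background data and avoids the exponential enumeration. The toy Rwp function and its coefficients are invented.

```python
import itertools
import math


def shapley_values(predict, sample, baseline):
    """Exact Shapley values of predict(sample) relative to predict(baseline)."""
    n = len(sample)

    def value(coalition):
        # Features in the coalition take the sample value, others the baseline.
        mixed = [sample[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                without = set(subset)
                values[i] += weight * (value(without | {i}) - value(without))
    return values


def toy_rwp(mask):
    """Hypothetical Rwp (in %) for a three-atom model; 1 means the atom is kept."""
    return 80 - 20 * mask[0] - 10 * mask[1] + 15 * mask[2] - 5 * mask[0] * mask[1]


shap_toy = shapley_values(toy_rwp, sample=[1, 1, 1], baseline=[0, 0, 0])
print(shap_toy)
```

In this toy case the first two atoms receive negative values because keeping them lowers the predicted Rwp, the third receives a positive value because keeping it raises the Rwp, and the three values sum to the difference between the prediction for the full model and for the baseline.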

Figure 5a shows the most important results from the SHAP value analysis. The first feature we consider is the number of atoms, with SHAP values shown in the top part of Fig. 5a. The plot represents SHAP values for the cluster size feature with the size shown on a colour scale, going from small (blue) to large clusters (red). From the large amplitude of some of the SHAP values observed for this feature, we see that the number of atoms in the structure motif is the most important feature for the Rwp value. All small clusters (0–34 atoms, plotted in blue colours) show a large positive SHAP value, which implies that the Rwp value of the fit to the PDF data is high, i.e. the fit quality is low. All small clusters can thereby be discarded as structural models, as they do not describe the data satisfactorily.

Fig. 5: Summary of the ML-MotEx analysis of C60 PDF.

a Plot of the SHAP values obtained in the C60 analysis, showing if atoms in the starting model are favourable for the fit quality. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where it is present. The SHAP values are plotted as a violin plot. An enlarged summary SHAP plot of panel a is shown in Supplementary Fig. 14. b Structural visualization of kept and removed atoms. The atoms with the 60 lowest atom contribution values have been coloured yellow, while the rest are coloured black. Supplementary Fig. 1 shows a similar representation but where the atom contribution values are directly shown using a continuous colour bar.

Next, we can investigate the SHAP values obtained for the individual atoms in the structure. We first consider atom 13, as labelled in the structure drawing in Fig. 5b. The SHAP values obtained from this atom for each of the fits in the training set are all plotted on the SHAP axis. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where it is present. Considering first the cases where the atom is kept in the model, the atom 13 SHAP values are generally negative, which means that the presence of this atom pushes the Rwp value towards 0%. We interpret this as ML-MotEx wanting to keep the atom in the model. The SHAP values obtained for the fits without the atom present are positive, which confirms that removing the atom makes the fit quality worse. Based on the SHAP values obtained for the atom in each fit, we calculate an atom contribution value. The atom contribution value is defined in the Methods section and is calculated as the difference between the average SHAP values obtained for the atom when kept in the model, and when removed from the model. A negative atom contribution value means that the atom pushes the Rwp value down if kept in the structure. The atom contribution value obtained for atom 13 is negative, and we, therefore, colour it yellow in the structural representation in Fig. 5b to indicate that it should be kept in the model. We use this strategy to automatically go through all the atoms in the starting model and colour them yellow or black based on their impact on the Rwp value. The result can be seen in Fig. 5b, where the 60 atoms with the lowest atom contribution values are coloured yellow. The results are also shown in Supplementary Fig. 1, where the atom contribution values are plotted using a continuous colour bar. The results show that ML-MotEx mainly favours the atoms comprising the central buckyball.
While the average confidence factor (as defined in the Methods section) is 1.26 for all of the atoms in the starting model, we observe that the average confidence factor of the mislabelled atoms is 0.37, meaning that ML-MotEx is less confident about the atom contribution values of those.
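The atom contribution value described above, i.e. the average SHAP value of an atom when kept minus its average SHAP value when removed, can be sketched as follows, together with the selection of the atoms with the lowest values. The masks and SHAP values are invented toy numbers, the function names are hypothetical, and the confidence factor is not computed because its uncertainty term is defined in the Methods section rather than here.

```python
def atom_contribution_values(masks, shap_values):
    """Mean SHAP value with the atom kept minus mean SHAP value with it removed.

    masks[i][j] is 1 if atom j is kept in structure i, else 0;
    shap_values[i][j] is the SHAP value of atom j for structure i.
    Each atom must be kept at least once and removed at least once.
    """
    n_atoms = len(masks[0])
    contributions = []
    for atom in range(n_atoms):
        kept = [row[atom] for mask, row in zip(masks, shap_values) if mask[atom] == 1]
        removed = [row[atom] for mask, row in zip(masks, shap_values) if mask[atom] == 0]
        contributions.append(sum(kept) / len(kept) - sum(removed) / len(removed))
    return contributions


def atoms_to_keep(contributions, n_keep):
    """Indices of the n_keep atoms with the lowest atom contribution values."""
    ranked = sorted(range(len(contributions)), key=lambda atom: contributions[atom])
    return sorted(ranked[:n_keep])


# Toy catalogue of four structures cut from a three-atom starting model.
masks = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
shap_matrix = [[-2.0, 1.0, 0.5], [3.0, -1.0, 0.5], [-4.0, -3.0, -0.5], [1.0, 1.0, 1.5]]

contributions = atom_contribution_values(masks, shap_matrix)
kept_atoms = atoms_to_keep(contributions, n_keep=2)
print(contributions, kept_atoms)
```

In the toy data the first two atoms have negative contribution values and would be coloured yellow, while the third has a positive value and would be coloured black.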

The ML-MotEx algorithm thus provides an unbiased method to extract important motifs from PDF data, without any inputs other than a starting model and a fitting script. We emphasize that the structural motifs extracted with ML-MotEx are based on the Rwp value of the fits and are thereby not necessarily physically reasonable. It is therefore important to still critically consider the extracted motif with chemical knowledge, in the same manner as for conventional PDF refinements. In this process, one could refine additional parameters such as atom positions. Consequently, in Fig. 5b, the user should identify the full C60 buckyball as the structural motif rather than just choosing the motif of the yellow atoms. Another approach to avoid unphysical results from ML-MotEx would be to include e.g. density functional theory (DFT) calculations in the goodness-of-fit value.

Example 2: Identification of structural motifs in disordered molybdenum oxides

As discussed above, we have recently used the brute-force automated modelling method to identify structural motifs in disordered molybdenum oxides from PDF analysis7. Here we show that by reassessing the data with ML-MotEx, we can reproduce the results from Christiansen et al.7 but in an automated way that allows analysis of the resulting structure model using SHAP values. Figure 6a shows the difference-PDF (d-PDF) obtained from amorphous molybdenum oxide supported on γ-Al2O3 nanoparticles (15 wt% Mo), where the signal from the γ-Al2O3 nanoparticles has been subtracted. The d-PDF thus only reflects the structure of the supported material. The aim is to develop a structural model for the amorphous MoOx. In our previous paper, different starting models were tested, which were all based on structures of molybdenum-based polyoxometalates (POMs) built from [MoO4] tetrahedra and [MoO6] octahedra. The analysis showed that the best fitting models did not contain tetrahedral motifs. Instead, the brute-force automated modelling approach hinted at a unit of three edge-sharing [MoO6] octahedra, or a ‘triad’, being present in the structure. However, the use of the computationally expensive brute-force method limited the number of atoms that could be included in the starting model. This meant that a range of different smaller starting models were used to test different structure hypotheses. With ML-MotEx, we can instead test much larger systems and thereby include several different structural motifs at the same time in one starting model, as well as quantitatively analyse the results using SHAP values. We, therefore, use a larger POM as starting model, namely the entire Mo36O128 cluster cut out of the K8(Mo36O112(H2O)16)·(H2O)37 crystal structure31, which contains a range of different chemical topologies. Figure 6a shows the simulated data from the Mo36O128 cluster, which has some similarities to the experimental PDF, and Fig. 6b shows the structure of the Mo36O128 cluster.

Fig. 6: Analysis of experimental PDF from disordered molybdenum oxide.

a Comparison of the experimental PDF from a disordered molybdenum oxide7 and simulated data from the Mo36O128 cluster, used as starting model. The simulation parameters mimic typical values of a PDF dataset and can be seen in Supplementary Table 3. b Structure of the Mo36O128 cluster. c Rwp values obtained in the fits using the Mo36O128 structure catalogue, plotted as a function of number of atoms in the structure motifs.

We apply ML-MotEx to the molybdenum oxide system in the same manner as we did to the C60 buckyball. First, we used the starting model to make a catalogue of candidate structure motifs, as described in detail in the Methods section. These are all fit to the experimental PDF, and the results are used to train the GBDT model. The fits are made with the same fitting algorithm as used in the paper from Christiansen et al.7 Figure 6c illustrates the Rwp values of the fits, plotted as a function of the number of molybdenum atoms present in the structural motif. The best fitting models contain 5–7 molybdenum atoms. The model that fits the data with the lowest Rwp value (45%) can be identified as a Mo5O24 structure, as shown in Supplementary Fig. 2. However, it is difficult to justify that this structural model is unique in representing the structure in the sample, purely based on the Rwp value.

We, therefore, use steps 3 and 4 of ML-MotEx to analyse the results of the ensemble of fits. The resulting SHAP values are shown in Fig. 7a. The plot should be interpreted in the same way as for the C60 example: each atom is assigned a SHAP value in each of the fits in the training set. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where it is present. When considering the amplitudes of the SHAP values, we see that the atoms labelled 14, 15, 19, and 20 are marked as very important by ML-MotEx. When these atoms are present in the structure (red), they all have large negative SHAP values, indicating that their presence in the model pushes the Rwp down. When they are not present in the structure (blue), they all have large positive SHAP values, also indicating that they should be present in the structure to obtain a good fit. Atoms 22 and 23 are examples of atoms that ML-MotEx does not suggest keeping in the structure. As seen from the SHAP values, their presence pushes up the Rwp value.

Fig. 7: Summary of the ML-MotEx analysis of experimental PDF from disordered molybdenum oxide.

a Plot of the SHAP values obtained in the molybdenum oxide analysis, showing if atoms in the starting model Mo36O128 are favourable for the fit quality. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where it is present. The SHAP values are plotted as a violin plot. An enlarged summary SHAP plot of panel a is shown in Supplementary Fig. 15. b Structural visualization of kept (yellow) and removed (black) atoms. Supplementary Fig. 3 shows a similar representation but where the atom contribution values are directly shown using a continuous colour bar.

Based on the SHAP analysis, atom contribution values were calculated. The results are visually illustrated in Fig. 7b, where each molybdenum atom in the structure is coloured yellow if it contributed to a better fit quality and black otherwise. Figure 7b clearly shows a specific motif that ML-MotEx wants to keep in the model. The yellow molybdenum atoms are all part of a ‘triad’ structure, where three [MoO6] octahedra share edges, and all oxygen atoms that bond to 3 or 4 Mo atoms are connected to yellow molybdenum atoms. This is further illustrated in Supplementary Fig. 3. Specifically, the resulting structural unit that ML-MotEx wants to keep is similar to heptamolybdate [Mo7O24]6−, which can be described as several triads connected through edge-sharing. These results indicate that a motif of connected edge-sharing triads, as shown in Fig. 7b, is important in order to describe the data of the disordered molybdenum oxides, which was also found by Christiansen et al.7 We note here that when fitting this model to the PDF itself, we cannot describe the medium-range order present in the PDF. ML-MotEx rather allows identifying the main local motifs in the data.

Example 3: Identification of the ionic cluster structure from PDFs

To investigate the reproducibility of the ML-MotEx method, we investigate if similar results are achieved with different starting models, all containing the correct structure motif. We here model a PDF obtained from a solution of 0.05 M ammonium metatungstate hydrate, (NH4)6[H2W12O40]·H2O in water, which dissolves to form monodisperse α-Keggin clusters28. Experimental details are provided in Supplementary Methods.

To test the ML-MotEx method, we use four different starting models of tungstate oxide crystals, all including the α-Keggin cluster motif with varying complexity. Unit cells from the four following crystal structures were used as starting models: [Hpy]4H2[H2W12O40] (py = pyridine) [1]32, ((CH3)4N)4SiW12O40 [2]33, (((CH3)2NH2)6(Cu(HCON(CH3)2)4)(GeW12O40)2)(HCON(CH3)2)2 [3]34, and ((CH3)2NH2)3(PW12O40) [4]35. Again, we discarded all symmetry and generated a discrete structure model corresponding to the atoms in one single unit cell. All atoms other than tungsten and oxygen were furthermore removed from the structures before catalogue structures were created. Figure 8a shows the experimental dataset with simulated PDFs from the four different starting models. Figure 8b illustrates a W12O40 α-Keggin structure.

Fig. 8: Experimental PDF from Keggin clusters in solution.

a Comparison of experimental data from a 0.05 M ammonium metatungstate hydrate solution, and simulated PDFs from the four different starting models [1]–[4]. The simulation parameters mimic typical values of a PDF dataset and can be seen in Supplementary Table 3. b The W12O40 α-Keggin structure.

Again, we first build structure catalogues based on the starting models (step 1) and fit them to the experimental PDF (step 2). In this case, we extract 10^4 structures from each starting model, which is just a small fraction of all possible structures that can be made from the starting models that have 24 ([2]), 48 ([1] and [3]), and 72 ([4]) atoms that are permuted. Again, a GBDT model was trained to predict the Rwp values of the structures (step 3), and SHAP values were obtained to calculate atom contribution values (step 4). The resulting SHAP value plots can be seen in Supplementary Figs. 4–5. While ML-MotEx takes about 100 s on an AMD Ryzen Threadripper 3990X (64 cores, 2.9/4.3 GHz) using 10^4 fits on a structure with 48 atoms, it would take ~3 × 10^6 years (Supplementary Notes 1) to make fits of all the 2^48 − 1 possible structures using the brute-force approach. Supplementary Table 5 in the Supplementary Information shows the exact computer time of the fits on a MacBook Pro and a Threadripper, which clearly demonstrates the scalability of ML-MotEx.

Figure 9 shows the results of applying ML-MotEx to the 4 different starting models. For structures [1], [3], and [4], the 24 atoms most preferred by ML-MotEx were coloured yellow, while the rest were coloured black. For structure [2], 12 atoms were coloured yellow. In all 4 examples, the yellow atoms form the motif of an α-Keggin cluster; however, in Fig. 9c, d, we see a few mislabelled atoms (2 of 24 atoms in the worst case). The mislabelled atoms are found in the starting models containing most atoms, i.e. with the highest permutation value N. To achieve a better prediction, we could have built larger catalogues of candidate structure motifs and thus performed more fits. We, therefore, conclude that the ML-MotEx method is not completely insensitive to the starting model, but that it yields very similar results for all the tested starting models if they contain similar motifs. Furthermore, the example shows that ML-MotEx can be used to investigate PDF data from clusters in solution, whose structure is also part of known crystal structures. As seen from the results in Supplementary Figs. 7–10, we performed an identical analysis of a different dataset, obtained from a second solution of 0.05 M ammonium metatungstate hydrate. This analysis provided highly comparable results, as discussed in the Supplementary Information. This illustrates the reproducibility of the method. In Supplementary Discussion 1, we discuss what happens if a poor starting model is used, and how one can identify whether the starting model contains the right motif using the confidence factor.

Fig. 9: Summary of the ML-MotEx analysis of experimental PDFs from Keggin clusters in solution.

Results from the ML-MotEx method on a PDF from a solution of ammonium metatungstate hydrate, using four different starting models: a [Hpy]4H2[H2W12O40] (py = pyridine)32, b ((CH3)4N)4SiW12O4033, c (((CH3)2NH2)6(Cu(HCON(CH3)2)4)(GeW12O40)2)(HCON(CH3)2)234, d ((CH3)2NH2)3(PW12O40)35. Atoms kept by ML-MotEx are shown in yellow while removed atoms are shown in black. The kept atoms were chosen as the 24 atoms (a), 12 atoms (b), 24 atoms (c), and 24 atoms (d) with the lowest atom contribution values. In Supplementary Fig. 6 a similar representation is shown, but where the atom contribution values are directly shown using a continuous colour bar.

We have also used the ML-MotEx method for a larger ionic cluster, namely [Bi38O45]. Here, we used the β-Bi2O3 structure as the starting model and a ‘cookie-cutter’ strategy to generate structures for the motif catalogue. This example is described in Supplementary Discussion 2, and the ‘cookie-cutter’ approach is shown in Supplementary Fig. 11.

Discussion

In the four examples presented above, we have shown how explainable ML can aid in identifying structural motifs in nanostructured materials and presented an approach to structure characterization. Traditional PDF analysis investigates how an entire structure model agrees with an experimental PDF, rather than identifying how different features in the model affect the fit quality. Instead, ML-MotEx provides a quantitative measure of how each atom or feature contributes to the fit. The use of ML furthermore allows the screening of a large number of models in an automated and fast manner. In the examples described here, ML-MotEx has been used with various starting models with up to 256 metal atoms; however, the algorithm can handle larger systems, as it is highly scalable. In comparison, a full brute-force approach is computationally restricted to systems with up to 15–30 atoms. For the type of systems described here, it is possible to use the method in quasi-experimental time, which could, for example, be useful for analysis of time-resolved scattering data, where the structural motifs present might change with time, which would be revealed by changing SHAP values.

ML-MotEx shares some similarities with the cluster build-up algorithm LIGA11,12, which automatically builds clusters of different sizes based on information contained in inter-atomic distance lists extracted from the PDF. LIGA has been shown to be successful at automatically reconstructing clusters (up to 150 atoms) at a low computational cost and with no user input except the inter-atomic distance list extracted from an experimental PDF. However, its use has not caught on because extracting the distance list from the data presents significant practical difficulties, and the extracted list is not unique. Like ML-MotEx, LIGA uses the error that each atom in a cluster contributes to the fit to weigh the decision about which atoms to include in the model. Presumably, part of the success of both LIGA and ML-MotEx is their use of this atom contribution for rapidly finding good candidate motifs. Unlike LIGA, ML-MotEx requires a starting model that contains the target structural motif, and it leverages ML to rapidly compute the atom contributions. ML-MotEx can therefore be positioned between traditional refinement (where the complete starting model is needed) and LIGA (which is ab initio), as it finds structural motifs within a larger model that can serve as starting models for subsequent refinement. However, it has the significant advantage over LIGA that it works directly on the measured PDF and does not require the inter-atomic distance list to be extracted from the PDF data, and we expect it to be of great practical value. With this in mind, we plan to deploy ML-MotEx as an application on the PDFitc.org web server36.

It may be considered a significant drawback that ML-MotEx requires as an input a structure fragment that contains the target motif within it in order to work. We provide a confidence factor for the starting model, but ML-MotEx still requires significant chemical/structural knowledge and intuition to be of use. We first note that such intuition is widespread in the chemistry community and is unlikely to be a significant drawback in practice. For example, we have recently used the method to identify the structure of intermediates in the formation of transition metal tungstates from polyoxometalate ions using in situ PDF data37, and for identifying stacking fault domain sizes in manganese oxides from PDF and PXRD38. We also note that the method is sufficiently fast that it would be possible to combine it with structural screening applications such as structureMining@PDFitc15,36. Given chemical information about elements that are present, structureMining searches structural databases for candidate structures. These are then refined to a target dataset, and a rank-ordered list is returned to the user. If the PDF represents a signal from a short-range ordered structural motif, we could insert ML-MotEx between the database mining and refinement steps to search over sets of plausible structures to look for structural sub-motifs. It may be possible to first use structure mining to identify starting models, which could then be used for ML-MotEx analysis. The models could then be further evaluated using both the resulting Rwp values and confidence factor.

The ML-MotEx method is currently limited to PDF analysis in the fitting procedure of the algorithm (step 2); however, the rest of ML-MotEx (steps 1, 3, and 4) is ready to use with data from other techniques. We are confident that a similar approach, taking advantage of explainable ML and SHAP values, can be broadly useful for enhancing and developing how models for data analysis are identified and constructed.

Methods

Step 1: Creation of a catalogue of candidate structure motifs

The first step in ML-MotEx is to use a starting structure model to generate a catalogue of candidate structure motifs, which are all fitted to the data. The structures are generated by removing different numbers of atoms from the original starting structure resulting in thousands of smaller, candidate structure motifs.

This process, which we refer to as ‘structure permutation’, is illustrated in Fig. 10. Here, the starting model contains four metal atoms, which are each bonded to six oxygen atoms. Before candidate structure motifs are generated, we select which atom type should be included in the permutation process. For the project discussed here, this selection is based on the X-ray scattering power of the atoms (i.e. heavier atoms scatter X-rays strongly, while lighter ones do not), and we, therefore, choose to permute over the four metal atoms in the structure rather than oxygen atoms. The total number of atoms that are selected for permutation (here 4) is referred to as the permutation number, N. Note that we do not take symmetry into account in this process.

Fig. 10: Example of how structure motifs can be extracted from a starting model with 4 metal atoms coordinated to oxygen and used as input to the GBDT model.

a The metal atoms are permuted randomly by creating an array of zeros and ones, where 0 refers to a deleted atom and 1 refers to an atom that is kept in the structure. Oxygen atoms are removed if they do not bond to any metal atoms within a distance threshold that is set by the user. Note that the metal atoms (blue) are slightly distorted from the centre of the octahedra. b Example of how the four structures from panel a and Fig. 1 are given as input to the GBDT model which predicts the Rwp value.

The selected atoms are removed or kept in the model by randomly associating them with zeros and ones, where 0 means that we remove the atom and 1 means we keep it. This is repeated multiple times to generate a large catalogue of candidate structure motifs. The total number of possible motifs from the permutations is equal to 2^N − 1, but only a small fraction of these needs to be produced for ML-MotEx to provide satisfactory results. In Supplementary Discussion 3, we discuss how large a catalogue of candidate structure motifs ML-MotEx needs as training data to output reasonable results. This is likely to be highly system dependent and especially dependent on N and structural symmetry. For the examples presented in the paper, we use ~140–3000 structure motifs per N.
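
To give a sense of scale, consider a hypothetical permutation number N = 24 and a catalogue of 3000 motifs (both values are assumptions chosen for illustration, not the settings of any specific example in this work):

$$2^{24} - 1 = 16\,777\,215, \qquad \frac{3000}{16\,777\,215} \approx 1.8 \times 10^{-4}$$

Under these assumptions, the catalogue would cover roughly 0.02% of the possible motifs.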

The atoms which were not chosen for permutation, in this case oxygen, are removed if they are not within a distance threshold from any other atom. The threshold is user-defined and can be set according to PDF peaks and/or chemically valid distances (i.e. bond lengths) for the expected compounds.
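
The structure permutation described above can be sketched in a few lines of Python. This is a minimal illustration and not the ML-MotEx implementation: the function names, the toy coordinates, and the choice to draw only unique, non-empty masks are assumptions, and oxygen atoms are pruned here against the kept metal atoms only.

```python
import math
import random

def random_masks(n_atoms, n_motifs, seed=0):
    """Draw unique, non-empty keep/remove masks (1 = keep, 0 = remove)."""
    rng = random.Random(seed)
    n_possible = 2 ** n_atoms - 1  # the all-zero (empty) structure is excluded
    n_motifs = min(n_motifs, n_possible)
    masks = set()
    while len(masks) < n_motifs:
        mask = tuple(rng.randint(0, 1) for _ in range(n_atoms))
        if any(mask):
            masks.add(mask)
    return sorted(masks)

def build_motif(metals, oxygens, mask, threshold):
    """Keep the masked metals, then keep oxygens within `threshold` of a kept metal."""
    kept_metals = [xyz for xyz, keep in zip(metals, mask) if keep]
    kept_oxygens = [
        o for o in oxygens
        if any(math.dist(o, m) <= threshold for m in kept_metals)
    ]
    return kept_metals, kept_oxygens

# Hypothetical toy geometry (in Å): two metal atoms, each with one nearby oxygen.
metals = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
oxygens = [(1.9, 0.0, 0.0), (5.9, 0.0, 0.0)]

masks = random_masks(n_atoms=2, n_motifs=3)
motif = build_motif(metals, oxygens, mask=(1, 0), threshold=2.5)
```

With the mask (1, 0), only the first metal atom and the oxygen bonded to it survive; the second oxygen is dropped because no kept metal atom lies within the threshold.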

Step 2: Fitting the catalogue of candidate structure motifs to the data

In the next step, we fit each of the candidate structures in the catalogue to the experimental PDF. We here use the Python-based program DiffPy-CMI30 for PDF fitting39,40,41 and apply the Debye equation for the calculation of scattering intensities and PDFs from the structures. The fitting strategies and parameters for each of the examples presented in this paper are listed in Supplementary Table 2. The output of the fit is an Rwp value reflecting the quality of the fit:

$$R_{\mathrm{wp}} = \sqrt{\frac{\sum_{\mathrm{i}=1}^{\mathrm{n}} \left[ G_{\mathrm{obs}}\left(r_{\mathrm{i}}\right) - G_{\mathrm{calc}}\left(r_{\mathrm{i}}, P\right) \right]^2}{\sum_{\mathrm{i}=1}^{\mathrm{n}} G_{\mathrm{obs}}\left(r_{\mathrm{i}}\right)^2}} \cdot 100\,\%$$
(1)

Here, Gobs and Gcalc are the observed and calculated PDFs, and P denotes the refinement parameters in the model.
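
As a minimal sketch of Eq. 1, the function below evaluates Rwp for two PDFs sampled on the same r-grid. The function name and the toy numbers are assumptions made for illustration; in practice the calculated PDF would come from the fitting program.

```python
import math

def rwp(g_obs, g_calc):
    """Eq. 1: Rwp in percent for two PDFs sampled on the same r-grid."""
    if len(g_obs) != len(g_calc):
        raise ValueError("observed and calculated PDFs must share one r-grid")
    numerator = sum((o - c) ** 2 for o, c in zip(g_obs, g_calc))
    denominator = sum(o ** 2 for o in g_obs)
    return math.sqrt(numerator / denominator) * 100.0

# Hypothetical three-point PDFs: the squared residual sum is 9 and the
# squared observed sum is 25, so Rwp = sqrt(9 / 25) * 100 %.
g_obs = [0.0, 3.0, 4.0]
g_calc = [0.0, 3.0, 1.0]
value = rwp(g_obs, g_calc)
```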

Step 3: Predicting Rwp values using Gradient Boosting Decision Trees

Gradient Boosting Decision Trees (GBDTs)25 are a machine-learning method for classification or regression based on ensembles of decision trees. In this work, we use XGBoost25 as the GBDT algorithm for the regression task of predicting the fit quality (step 2) from the structural input given as zeros and ones (step 1) and the number of atoms in the structure. Figure 10b shows the input to the GBDT model.

The optimization is done by making trees of ‘yes’ and ‘no’ questions on whether to keep an atom in the structure or not, based on the resulting Rwp value. A hypothetical example of a simple tree can be seen in Fig. 1, step 3. When atom 4 is present in the structure, the GBDT model will predict an Rwp value which is 5% lower than if atom 4 is not present in the structure. In the same way, it will predict an Rwp value which is 12% lower if atom 1 is present in the structure. In the decision tree, the algorithm will therefore say ‘yes’ to keeping both atoms 1 and 4 in the structure. In this project, the GBDT model predicts the Rwp value using a weighted average of 100 trees.

The GBDT model performance is improved with a large amount of training data, which in this tool is provided by creating a larger catalogue of candidate structure motifs and fitting them to the data.

The GBDT model is trained on 80% of the data, which is referred to as the training set. XGBoost25 was used with default parameters except for the learning rate and maximum depth, which were optimized with Bayesian optimization using 50 iterations and a three-fold cross-validation split42,43. While this procedure automates the hyperparameter tuning, we demonstrate in Supplementary Fig. 12 that similar results are achieved across various hyperparameters, and in Supplementary Fig. 13 we show that similar results are achieved across various seeds. The remaining 20% of the data is used to evaluate the performance of the algorithm and is referred to as the test set.
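
To illustrate the idea behind step 3 without depending on any external library, the sketch below implements a toy gradient-boosting regressor built from single-split trees (stumps) on binary keep/remove features. It is a simplified stand-in and not XGBoost: there is no regularization, no hyperparameter optimization, and the number of atoms is not used as a feature. The synthetic Rwp values, in which atom 1 lowers Rwp by 12 and atom 4 lowers it by 5, are assumptions that mirror the hypothetical tree discussed above.

```python
import random

def fit_stump(X, residuals):
    """Pick the binary feature whose 0/1 split best reduces the squared error."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j] == 1]
        off = [r for x, r in zip(X, residuals) if x[j] == 0]
        if not on or not off:
            continue  # feature is constant in this data set
        m_on, m_off = sum(on) / len(on), sum(off) / len(off)
        sse = sum((r - m_on) ** 2 for r in on) + sum((r - m_off) ** 2 for r in off)
        if best is None or sse < best[0]:
            best = (sse, j, m_on, m_off)
    return best[1:] if best else None

def fit_gbdt(X, y, n_trees=100, learning_rate=0.3):
    """Squared-loss gradient boosting: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, m_on, m_off = stump
        stumps.append(stump)
        pred = [p + learning_rate * (m_on if x[j] else m_off) for x, p in zip(X, pred)]
    return base, stumps, learning_rate

def predict(model, x):
    """Base value plus the shrunken contribution of every stump."""
    base, stumps, lr = model
    return base + sum(lr * (m_on if x[j] else m_off) for j, m_on, m_off in stumps)

# Synthetic catalogue: 200 random masks over 6 atoms with made-up Rwp values.
rng = random.Random(0)
X = [[rng.randint(0, 1) for _ in range(6)] for _ in range(200)]
y = [80.0 - 12.0 * x[0] - 5.0 * x[3] + rng.uniform(-0.5, 0.5) for x in X]

n_train = int(0.8 * len(X))  # 80% training set, 20% test set
model = fit_gbdt(X[:n_train], y[:n_train])
test_errors = [abs(predict(model, x) - t) for x, t in zip(X[n_train:], y[n_train:])]
test_mae = sum(test_errors) / len(test_errors)
effect_atom_1 = predict(model, [1, 0, 0, 0, 0, 0]) - predict(model, [0, 0, 0, 0, 0, 0])
```

On this synthetic data, the difference between the predictions with and without atom 1 should come out close to the −12 built into the toy Rwp values, which is the kind of per-atom effect that the SHAP analysis in step 4 quantifies.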

Step 4: Quantifying the effect of structural features using SHAP values, assigning atomic contribution values

SHAP values are used to analyse the Rwp values resulting from the process described above. For each fit (step 2), each atom in the starting model is assigned a SHAP value. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature pushes the Rwp value of the fit towards 1 (poor fit) or 0 (perfect fit), in other words, why it is important. Each atom in the starting model thus gets F SHAP values, where F is the number of fits made in step 2 of the algorithm. We divide the F SHAP values into two categories: first, those where the atom was kept in the structure motif (kept atom SHAP value list) and second, those where the atom was removed to create the structure motif (removed atom SHAP value list). From each of the two lists, an average SHAP value for the atom can be calculated, defined as SHAPaverage-kept and SHAPaverage-removed. We then define an atom contribution value, which is calculated as the difference between the two average SHAP values, i.e.:

$$\mathrm{Atom\,contribution\,value} = \mathrm{SHAP}_{\mathrm{average\text{-}kept}} - \mathrm{SHAP}_{\mathrm{average\text{-}removed}}$$
(2)

We also define the uncertainty of this value as described in Eq. 3:

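$$\mathrm{Atom\,contribution\,value\,RMS} = \left( \left(\mathrm{SHAP}_{\mathrm{average\text{-}kept}}^{\mathrm{RMS}}\right)^{2} - \left(\mathrm{SHAP}_{\mathrm{average\text{-}removed}}^{\mathrm{RMS}}\right)^{2} \right)^{0.5}$$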
(3)

We define a confidence factor for each atom that describes how confident we can be about including or excluding that atom in a structural motif:

$$\mathrm{Confidence\,factor} = \frac{\mathrm{Atom\,contribution\,value}}{\mathrm{Atom\,contribution\,value\,RMS}}$$
(4)
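
The bookkeeping behind Eq. 2 can be sketched as follows, assuming the SHAP values have already been computed for every fit. The function names and the toy SHAP values are hypothetical, and the uncertainty and confidence factor of Eqs. 3 and 4 are left out of this sketch.

```python
def atom_contributions(masks, shap_values):
    """Eq. 2 per atom: mean SHAP value when kept minus mean SHAP value when removed."""
    n_atoms = len(masks[0])
    contributions = []
    for a in range(n_atoms):
        kept = [s[a] for m, s in zip(masks, shap_values) if m[a] == 1]
        removed = [s[a] for m, s in zip(masks, shap_values) if m[a] == 0]
        if not kept or not removed:
            raise ValueError(f"atom {a} is never kept or never removed")
        contributions.append(sum(kept) / len(kept) - sum(removed) / len(removed))
    return contributions

def select_atoms(contributions, n_keep):
    """Indices of the n_keep atoms with the lowest atom contribution values."""
    order = sorted(range(len(contributions)), key=lambda a: contributions[a])
    return sorted(order[:n_keep])

# Hypothetical SHAP values for F = 4 fits of a three-atom starting model.
# Row f holds the mask and the per-atom SHAP values of fit f.
masks = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 1)]
shap_values = [
    (-6.0, 1.0, -0.5),
    (-6.0, -1.0, 0.5),
    (6.0, 1.0, 0.5),
    (6.0, -1.0, 0.5),
]

contrib = atom_contributions(masks, shap_values)
kept_atoms = select_atoms(contrib, n_keep=1)
```

In this toy case the first atom has a strongly negative atom contribution value (keeping it lowers Rwp), so it is the atom selected for the motif.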

The results of ML-MotEx can be visually inspected as the atoms in the starting model are coloured according to their atom contribution value, using yellow for low atom contribution value (tendency to keep atom, pushing Rwp down) and black for high atom contribution value (tendency to remove atom, pushing Rwp up). ML-MotEx outputs a VESTA44 and CrystalMaker45 file where all the atoms are coloured with regard to their atom contribution value.