Abstract
Characterization of material structure with X-ray or neutron scattering using e.g. Pair Distribution Function (PDF) analysis most often relies on refining a structure model against an experimental dataset. However, identifying a suitable model is often a bottleneck. Recently, automated approaches have made it possible to test thousands of models for each dataset, but these methods are computationally expensive and analysing the output, i.e. extracting structural information from the resulting fits in a meaningful way, is challenging. Our Machine Learning based Motif Extractor (ML-MotEx) trains an ML algorithm on thousands of fits, and uses SHAP (SHapley Additive exPlanation) values to identify which model features are important for the fit quality. We use the method for 4 different chemical systems, including disordered nanomaterials and clusters. ML-MotEx opens up a type of modelling where each feature in a model is assigned an importance value for the fit quality based on explainable ML.
Introduction
The development of advanced, functional materials builds on an understanding of the intricate relationship between material structure and properties, and over the past century, crystallographic methods using scattering and diffraction have thus been essential for materials science. Crystallography allows ab initio determination of crystal structures from diffraction data and has provided us with the vast knowledge of crystal chemistry that is now used in the design of functional materials. However, in the case of nanomaterials with limited long-range order, crystallographic methods are challenged, and ab initio structure determination, or structure solution, is not currently possible. Over the past decades, total scattering with Pair Distribution Function (PDF) analysis has become an essential tool for the characterization of nanomaterial structure1,2. The PDF is the Fourier transform of normalized and corrected X-ray, neutron, or electron scattering intensities, and is a function in real space representing a histogram of inter-atomic distances in the sample. Compared to crystallographic methods relying on long-range order, PDF analysis can be applied for nanomaterials3,4,5, disordered1,6,7, or amorphous materials3,5,8. However, structure solution from the PDF is not possible except in a very few simple cases9, using either the Reverse Monte Carlo method10 or the LIGA algorithm11,12. In the absence of broadly applicable ab initio nanostructure determination methods, it is, therefore, necessary to propose reasonable starting models and to then ‘refine’ the model parameters against the data using local minimization methods. The step of finding a starting model can be a major challenge and is thus a bottleneck in complex material characterization. 
In the case of PDF analysis of nanomaterials, such models are often guessed at by considering related bulk materials; however, these are often not good starting models for very small clusters and nanoparticles, where significant structural changes may take place3,5,13,14. A way of building plausible starting models is thus needed, where structure models can be built capturing local bonding topologies suggested by known chemistries.
Recently, automated methods such as ‘structure mining’ and ‘cluster mining’ have appeared in the literature to help overcome this challenge15,16,17. In a study of the structure of metallic nanoparticles, Banerjee et al. automatically generated thousands of discrete metal nanocluster structures and fitted PDFs from each of them to experimental data to identify the best model in an automated manner17. In a recent study of molybdenum oxide nanomaterials, we introduced another approach, where we automatically generated a large number of MoOx cluster structure models and compared their PDFs to experimental data in order to identify dominating structural motifs in the sample, i.e. arrangements of atoms that dominate the material structure on the local scale7. We hypothesized that the structural motifs present in amorphous molybdenum oxides can also be found in crystalline structures, and therefore used crystal structures of molybdenum oxides as starting models. From these models, we cut out thousands of different cluster structure models of different sizes to build a ‘catalogue’ of structure candidates. These models were all tested against our data to identify the best fitting structural motif. We recently used a similar approach for the identification of a bismuth oxido cluster intermediate structure in a study of cluster growth18.
While these approaches can extend the structural space searched when identifying models for structure refinement, new challenges arise. Firstly, the refinement processes can be computationally heavy, which can limit the number of catalogue structures that are tested. For example, our brute-force approach for cluster identification above generates 2^N − 1 structures for a starting model with N atoms. Each structure must have its PDF computed and then refined against the target measured PDF, so that its fit quality can be evaluated. This process is computationally costly and does not scale well with the number of structure candidates. Furthermore, for disordered, amorphous, and nanostructured systems, many hundreds of models may provide similar fit qualities, and if only a few of them are reported, it is difficult to assess which structural features of these models are important. We, therefore, need effective and unbiased methods to compare many fits to extract structural information.
Here, we introduce a method that uses an explainable Machine Learning (ML) model that, after training, will predict the agreement factor for a test cluster with a given dataset. Furthermore, the use of explainable ML informs which features in the model are important for the agreement factor19,20,21,22,23,24. Our Machine Learning based Motif Extractor (ML-MotEx) model is illustrated in Fig. 1. Firstly, it builds a large catalogue of thousands of candidate structural motifs, which are ‘cut outs’ from a chosen bulk structure7,18 (step 1). The PDF is then computed from each one, and each model is fit to the target dataset (step 2). The structures and Rwp values from each fit are handed to an ML algorithm applying gradient boosting decision trees (GBDTs)25, which learns to predict Rwp values for new fits based on an atomic structure model (step 3). The ML-MotEx algorithm then outputs quantified values of how important each atom or feature in the starting structure is for the fit to yield a low Rwp value with the given fitting algorithm (step 4). This is done by using SHAP (Shapley Additive exPlanation)26,27 values, which is a known method for explaining tree-based ML models. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature affects the Rwp value of the fit towards 1 (poor fit) or 0 (perfect fit), in other words why it is important.
Compared to the automated, brute-force methods previously introduced for PDF analysis7,15,16,17, we can screen a larger number of structures much faster. Our method only needs to screen a sub-sample (~10,000) of the much larger number of motifs that can be generated from bulk material to learn how to predict which structures provide a good agreement with the data. The analysis done for the examples presented below would take ~24 days for starting models with 24 atoms, ~3 × 10^6 years for starting models with 48 atoms, and ~6 × 10^13 years for starting models with 72 atoms using a brute-force approach (Supplementary Notes 1), while ML-MotEx analysis is done in minutes or hours. Furthermore, the use of explainable ML provides a way to better analyse the output of the screening: instead of just identifying the model that provides the lowest Rwp value, we are able to output a measure of how important each atom or feature (e.g. size or shape) in the starting model is for the fit to yield a low Rwp value (step 4). This procedure is automated and can be done in quasi-real experimental time and without human bias.
We illustrate the use of ML-MotEx using four different examples. We first show the principles of the method using a simple model system based on simulated X-ray PDF data from a C60 buckyball. We further demonstrate the use of ML-MotEx on experimental X-ray PDF data from amorphous, disordered molybdenum oxides7 and tungstate α-Keggin clusters in solution28, where it allows identifying the main structural motifs present in the samples using different starting models. Lastly, we extend the method to use a ‘cookie-cutter’ strategy to generate structures for the catalogue of candidate motifs. Here, the algorithm is used to identify a bismuth oxido cluster by using a cut-out of the β-Bi2O3 structure as starting model. The examples illustrate that it is possible to obtain knowledge of dominating structural motifs from PDF in an automated manner using ML.
Results
ML-MotEx algorithm
ML-MotEx consists of four steps. The steps are shown in Fig. 1, and simplified pseudo-code of the algorithm is given in Fig. 2. In the first step, a starting structure model is used to generate a catalogue of candidate structure motifs. As detailed in the Methods section, the structures are generated by removing different numbers of atoms from the original starting structure, which results in thousands of smaller, candidate structure motifs. In the second step, a fitting script is used to fit the generated candidate structures to the dataset. In the third step, the fitting results are handed to the explainable ML algorithm, which is optimized and trained. By using this information, SHAP values of the atoms or structural features in the starting model are calculated in the fourth step. The output of the algorithm is thus the starting model along with SHAP values, indicating the importance of each individual atom in the structure for the fit quality, in other words, how much each individual atom or feature affects the Rwp value either positively or negatively. We refer to this value as the “atom contribution value”. We furthermore define the ratio between the atom contribution value and its uncertainty as the “confidence factor”. Further definitions and descriptions of the individual steps of the algorithm are given in the Methods section.
Example 1: Proof-of-concept: identification of the C60 buckyball
We first show the use of ML-MotEx with a simple, proof-of-concept example, using a calculated PDF from an ideal C60 buckyball (Fig. 3a). The aim is to identify the structural motif, the C60 buckyball, from the data.
We first need a starting structure that contains the motifs we are looking for. In this simplified example, we use a single unit cell of the crystal structure of C6029. However, we discarded all symmetry and generated a discrete structure model corresponding to the 132 atoms in one unit cell. This model is shown in Fig. 3b, where one whole C60 structure (Fig. 3a) is seen along with fragments of the neighbouring C60 buckyballs. The simulated PDF of the C60 buckyball and the starting model is shown in Fig. 3c.
We can now use this starting model to generate a catalogue of structures, which are all fitted to the data. The structures are created by removing different numbers of atoms from the original starting structure, which results in thousands of smaller, candidate structure motifs. These model generation and fitting steps are identical to our previously reported brute-force approach, where we simply compare the Rwp values of all the fits to identify the best structure motif. We first consider this simple approach. One of the limitations of the brute-force method is that the number of possible candidate structures is exponential in N, the number of atoms in the model. Since each atom in the starting model can be present or absent, the number of possible subclusters is equal to 2^N − 1. For large models such as the C60 starting model containing 132 atoms, this is ~10^40, a gigantic number, making it impossible to investigate all candidate structures. For this example, we used 384,260 structures to train ML-MotEx, which is only a very small fraction of the 2^132 − 1 possible candidate structures. Note that the model with a single C60 buckyball was not in the generated structure catalogue.
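The size of the search space follows directly from the 2^N − 1 count. The short sketch below simply evaluates that expression for the values of N mentioned in the text and the fraction covered by 384,260 structures; the function name `n_subclusters` is illustrative and not part of ML-MotEx.

```python
def n_subclusters(n_atoms: int) -> int:
    """Number of non-empty subclusters when each of n_atoms is either kept or removed."""
    return 2 ** n_atoms - 1

# N values quoted in the text: 24, 48, 72 (Keggin starting models) and 132 (C60 unit cell).
for n in (24, 48, 72, 132):
    print(f"N = {n:3d}: {n_subclusters(n):.3e} candidate structures")

# Fraction of the N = 132 search space covered by the 384,260 fitted structures.
fraction_sampled = 384_260 / n_subclusters(132)
print(f"sampled fraction for N = 132: {fraction_sampled:.1e}")
```

Python integers are arbitrary precision, so the counts are exact before being formatted for display.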
All these 384,260 structures were fitted to the PDF calculated from the C60 cluster. Only a scale factor, an isotropic expansion/contraction factor, and isotropic Atomic Displacement Parameters (ADPs) were refined, as detailed in Supplementary Table 2. We note that refinement of the atom positions can be added to the fitting procedure to expand the chemical space that is investigated. However, this would be computationally expensive, and it would allow deviations from the chemical topologies set up in the starting model.
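As a rough illustration of what evaluating a fit involves, the sketch below computes an unweighted R-value, sqrt(Σ(G_obs − s·G_calc)² / Σ G_obs²), after solving for the least-squares scale factor s. It assumes unit weights and refines only the scale factor; the expansion/contraction factor and ADPs refined in the actual fits are not modelled, and the function names and numbers are made up for the example.

```python
import math

def best_scale(g_obs, g_calc):
    """Least-squares scale factor s that minimises sum((g_obs - s * g_calc)**2)."""
    denom = sum(c * c for c in g_calc)
    return sum(o * c for o, c in zip(g_obs, g_calc)) / denom if denom else 0.0

def rwp(g_obs, g_calc, scale=1.0):
    """Unit-weight R-value: sqrt(sum((obs - scale * calc)**2) / sum(obs**2))."""
    num = sum((o - scale * c) ** 2 for o, c in zip(g_obs, g_calc))
    den = sum(o * o for o in g_obs)
    return math.sqrt(num / den)

# Toy "PDFs" on a five-point grid (illustrative numbers, not real data).
g_obs = [0.0, 1.0, -0.5, 2.0, -1.0]
g_calc = [0.0, 0.5, -0.25, 1.0, -0.5]  # same shape as g_obs, half the amplitude

s = best_scale(g_obs, g_calc)
print("scale:", s)
print("R before scaling:", rwp(g_obs, g_calc))
print("R after scaling: ", rwp(g_obs, g_calc, s))
```

Because the toy model has exactly the right shape, refining the scale alone brings the R-value to zero; a real candidate motif would leave a residual that the remaining parameters can only partly absorb.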
To get an overview of the results from these fits, we plot the Rwp value versus the number of atoms in the structure. To further investigate the results, one must visually inspect the fits of the catalogue of candidate structure motifs and their Rwp values. Some of the candidate structure motifs are shown as insets in Fig. 3e, where transparent grey atoms represent atoms deleted from the models. The fits of these structures to the dataset are presented in Fig. 3e, along with the Rwp values. The Rwp value appears to drop when the ‘outer’ atoms are removed, while it increases when the atoms that are part of the centre C60 buckyball are removed. From investigating these few, manually selected structures and their corresponding fitted Rwp values, one can hypothesize that the structure giving the best fit should be the C60 buckyball. However, this method can be biased by human interaction, and it is time-consuming and difficult to go through the many fits to extract structural information.
We, therefore, move on to the ML-MotEx method. Using the catalogue of candidate structure motifs and the corresponding Rwp values obtained above, we train a GBDT model on the training set to predict the Rwp value of the candidate structure motifs. Figure 4 shows the predicted Rwp values of the ML algorithm versus the Rwp value of the structures when they are fitted to the simulated C60 dataset in DiffPy-CMI30. For the structures used in the test set, the GBDT model predicts the Rwp value with a mean absolute error of 2.0%.
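To make the training step concrete, here is a minimal, standard-library sketch of gradient boosting with depth-1 trees (stumps) that learns to predict an Rwp-like value from a binary keep/remove vector. It is a toy stand-in, not the GBDT implementation or hyperparameters used by ML-MotEx; the synthetic target function, the number of atoms, and all names are assumptions made for the example.

```python
import random

def fit_stump(X, residuals):
    """Least-squares depth-1 tree on binary features.
    Returns (feature index, value if atom absent, value if atom present) or None."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j]]
        off = [r for x, r in zip(X, residuals) if not x[j]]
        if not on or not off:
            continue  # feature is constant in this sample, so it cannot split
        m_on, m_off = sum(on) / len(on), sum(off) / len(off)
        sse = sum((r - m_on) ** 2 for r in on) + sum((r - m_off) ** 2 for r in off)
        if best is None or sse < best[0]:
            best = (sse, j, m_off, m_on)
    return None if best is None else best[1:]

def fit_gbdt(X, y, n_rounds=200, learning_rate=0.1):
    """Squared-error gradient boosting: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, v_off, v_on = stump
        v_off, v_on = learning_rate * v_off, learning_rate * v_on
        stumps.append((j, v_off, v_on))
        pred = [p + (v_on if x[j] else v_off) for x, p in zip(X, pred)]
    return base, stumps

def predict(model, x):
    base, stumps = model
    return base + sum(v_on if x[j] else v_off for j, v_off, v_on in stumps)

# Synthetic "fits" (assumption): atoms 0-2 lower the R-value, atoms 5-6 raise it.
rng = random.Random(0)

def synthetic_rwp(mask):
    value = 0.9 - 0.1 * (mask[0] + mask[1] + mask[2]) + 0.05 * (mask[5] + mask[6])
    return value + rng.gauss(0.0, 0.01)  # small noise mimicking fit-to-fit scatter

X = [[rng.randint(0, 1) for _ in range(8)] for _ in range(500)]
y = [synthetic_rwp(x) for x in X]
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

model = fit_gbdt(X_train, y_train)
mae = sum(abs(predict(model, x) - t) for x, t in zip(X_test, y_test)) / len(y_test)
print(f"test MAE: {mae:.4f}")
```

Stumps can only represent additive atom effects; deeper trees, as used in practical GBDT libraries, are needed to capture interactions between atoms.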
We now use explainable ML to explain Rwp values with the use of the feature importance tool SHAP values27. As described in detail in the Methods section, a SHAP value is calculated for each structural feature (here, each atom and the cluster size) for each candidate structure motif that is fitted to the PDF during the training process. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature affects the Rwp value of the fit towards 100% (poor fit) or 0% (perfect fit), in other words why it is important.
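The sign convention can be illustrated with exact Shapley values on a toy model. The sketch below enumerates all feature coalitions, which is only feasible for a handful of features; SHAP for tree models uses a much faster tree-specific algorithm and averages over the training data. The toy Rwp function, the single reference point of 0.5 per atom (standing in for "present in half of the catalogue"), and all names are assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their actual value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_rwp(z):
    """Made-up predictor: atoms 0 and 1 improve the fit, atom 2 worsens it."""
    return 0.9 - 0.2 * z[0] - 0.1 * z[1] + 0.05 * z[2] - 0.1 * z[0] * z[1]

x = [1, 0, 1]               # atom 0 kept, atom 1 removed, atom 2 kept
baseline = [0.5, 0.5, 0.5]  # assumed reference: each atom present half the time

phi = shapley_values(toy_rwp, x, baseline)
for atom, value in enumerate(phi):
    print(f"atom {atom}: SHAP = {value:+.4f}")
```

In this toy case the kept, helpful atom 0 receives a negative value, while the removed, helpful atom 1 and the kept, harmful atom 2 receive positive values, matching the interpretation given in the text. The values also sum to the difference between the prediction at `x` and at the baseline.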
Figure 5a shows the most important results from the SHAP value analysis. The first feature we consider is the number of atoms, with SHAP values shown in the top part of Fig. 5a. The plot represents SHAP values for the cluster size feature with the size shown on a colour scale, going from small (blue) to large clusters (red). From the large amplitude of some of the SHAP values observed from this feature, we see that the number of atoms in the structure motif is the most important feature for the Rwp value. All small clusters (0–34 atoms, plotted in blue colours) show a large positive SHAP value, which implies that the Rwp value of the fit to the PDF data is high, i.e. the fit quality is low. All small clusters can thereby be discarded as structural models for satisfyingly describing the data.
Next, we can investigate the SHAP values obtained for the individual atoms in the structure. We first consider atom 13, as labelled in the structure drawing in Fig. 5b. The SHAP values obtained from this atom for each of the fits in the training set are all plotted on the SHAP axis. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where the atom is present. Considering first the cases where the atom is kept in the model, the atom 13 SHAP values are generally negative, which means that the presence of this atom pushes the Rwp value towards 0%. We interpret this as ML-MotEx wanting to keep the atom in the model. The SHAP values obtained for the fits without the atom present are positive, which confirms that removing the atom makes the fit quality worse. Based on the SHAP values obtained for the atom in each fit, we calculate an atom contribution value. The atom contribution value is defined in the Methods section and is calculated as the difference between the average SHAP values obtained for the atom when kept in the model, and when removed from the model. A negative atom contribution value means that the atom pushes the Rwp value down if kept in the structure. The atom contribution value obtained for atom 13 is negative, and we, therefore, colour it yellow in the structural representation in Fig. 5b to indicate that it should be kept in the model. We use this strategy to automatically go through all the atoms in the starting model and colour them yellow/black based on their impact on the Rwp value. The result can be seen in Fig. 5b where the 60 atoms with the lowest atom contribution values are coloured yellow. The results are also shown in Supplementary Fig. 1, where the atom contribution values are plotted using a continuous colour bar. The results show that ML-MotEx mainly favours the atoms comprising the central buckyball.
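The atom contribution value described above reduces to a difference of two averages. A minimal sketch, assuming the SHAP values are available as a fits × atoms table alongside the matching keep/remove masks (the numbers, variable names, and the helper `atom_contribution` are illustrative only):

```python
def atom_contribution(shap_column, present_column):
    """Mean SHAP value when the atom is kept minus mean SHAP value when it is removed."""
    kept = [s for s, p in zip(shap_column, present_column) if p]
    removed = [s for s, p in zip(shap_column, present_column) if not p]
    if not kept or not removed:
        raise ValueError("atom must be both kept and removed somewhere in the catalogue")
    return sum(kept) / len(kept) - sum(removed) / len(removed)

# Four fits, three atoms (made-up values). masks[f][a] == 1 means atom a is kept in fit f.
masks = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
shap = [[-0.10, 0.02, 0.03],
        [-0.12, -0.01, -0.02],
        [0.09, -0.02, 0.04],
        [0.11, 0.01, 0.05]]

n_atoms = len(masks[0])
contributions = [
    atom_contribution([row[a] for row in shap], [row[a] for row in masks])
    for a in range(n_atoms)
]

# Keep the k atoms with the lowest (most negative) contribution values,
# analogous to colouring the 60 lowest atoms yellow in the C60 example.
k = 2
keep = sorted(sorted(range(n_atoms), key=lambda a: contributions[a])[:k])
print(contributions, keep)
```

The number of atoms to keep, `k`, is a user choice in this sketch; in the C60 example it is set to 60 because the target motif is known.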
While the average confidence factor (as defined in the Methods section) is 1.26 for all of the atoms in the starting model, we observe that the average confidence factor of the mislabelled atoms is 0.37, meaning that ML-MotEx is less confident about the atom contribution values of those.
The ML-MotEx algorithm thus provides an unbiased method to extract important motifs from PDF data, without any inputs other than a starting model and a fitting script. We emphasize that the structural motifs extracted with ML-MotEx are based on the Rwp value of the fits and are thereby not necessarily physically reasonable. It is therefore important to still critically consider the extracted motif with chemical knowledge, in the same manner as for conventional PDF refinements. In this process, one could refine additional parameters such as atom positions. Consequently, in Fig. 5b, the user should identify the full C60 buckyball as the structural motif rather than just choosing the motif of the yellow atoms. Another approach to avoid unphysical results from ML-MotEx would be to include e.g. density functional theory (DFT) calculations in the goodness-of-fit value.
Example 2: Identification of structural motifs in disordered molybdenum oxides
As discussed above, we have recently used the brute-force automated modelling method to identify structural motifs in disordered molybdenum oxides from PDF analysis7. Here we show that by reassessing the data with ML-MotEx, we can reproduce the results from Christiansen et al.7 but in an automated way that allows analysis of the resulting structure model using SHAP values. Figure 6a shows the difference-PDF (d-PDF) obtained from amorphous molybdenum oxide supported on γ-Al2O3 nanoparticles (15 wt% Mo), where the signal from the γ-Al2O3 nanoparticles has been subtracted. The d-PDF thus only reflects the structure of the supported material. The aim is to develop a structural model for the amorphous MoOx. In our previous paper, different starting models were tested, which were all based on structures of molybdenum-based polyoxometalates (POMs) built from [MoO4] tetrahedra and [MoO6] octahedra. The analysis showed that the best fitting models did not contain tetrahedral motifs. Instead, the brute-force automated modelling approach hinted at a unit of three edge-sharing [MoO6] octahedra, or a ‘triad’, being present in the structure. However, the use of the computationally expensive brute-force method limited the number of atoms that could be included in the starting model. This meant that a range of different smaller starting models were used to test different structure hypotheses. With ML-MotEx, we can instead test much larger systems and thereby include several different structural motifs at the same time in one starting model, as well as quantitatively analyse the results using SHAP values. We, therefore, use a larger POM as starting model, namely the entire Mo36O128 cluster cut out of the K8(Mo36O112(H2O)16)·(H2O)37 crystal structure31, which contains a range of different chemical topologies. Figure 6a shows the simulated data from the Mo36O128 cluster, which has some similarities to the experimental PDF, and Fig. 6b shows the structure of the Mo36O128 cluster.
We apply ML-MotEx to the molybdenum oxide system in the same manner as we did to the C60 buckyball. First, we used the starting model to make a catalogue of candidate structure motifs, as described in detail in the Methods section. These are all fit to the experimental PDF, and the results are used to train the GBDT model. The fits are made with the same fitting algorithm as used in the paper from Christiansen et al.7 Figure 6c illustrates the Rwp values of the fits, plotted as a function of the number of molybdenum atoms present in the structural motif. The best fitting models contain 5–7 molybdenum atoms. The model that fits the data with the lowest Rwp value (45%) can be identified as a Mo5O24 structure, as shown in Supplementary Fig. 2. However, it is difficult to justify that this structural model is unique in representing the structure in the sample, purely based on the Rwp value.
We, therefore, use steps 3 and 4 of ML-MotEx to analyse the results of the ensemble of fits. The resulting SHAP values are shown in Fig. 7a. The plot should be interpreted in the same way as for the C60 example: Each atom is assigned a SHAP value in each of the fits in the training set. For the models where the atom is not present, the SHAP value is shown in blue, while it is shown in red for the models where the atom is present. When considering the amplitudes of the SHAP values, we see that the atoms labelled with 14, 15, 19, and 20 are marked as very important by ML-MotEx. When these atoms are present in the structure (red), they all have large negative SHAP values, indicating that their presence in the model pushes the Rwp down. When they are not present in the structure (blue), they all have large positive SHAP values, also indicating that they should be present in the structure to obtain a good fit. Atoms 22 and 23 are examples of atoms that ML-MotEx does not suggest keeping in the structure. As seen from the SHAP values, their presence pushes up the Rwp value.
Based on the SHAP analysis, atom contribution values were calculated. The results are visually illustrated in Fig. 7b, where the molybdenum atoms in the structure are coloured yellow if the atom contributed to a better fit quality, and black otherwise. Figure 7b clearly shows a specific motif that ML-MotEx wants to keep in the model. The yellow molybdenum atoms are all part of a ‘triad’ structure, where three [MoO6] octahedra share edges, and all oxygen atoms that bond to 3 or 4 Mo atoms are connected to yellow molybdenum atoms. This is further illustrated in Supplementary Fig. 3. Specifically, the resulting structural unit that ML-MotEx wants to keep is similar to heptamolybdate [Mo7O24]6−, which can be described as several triads connected through edge-sharing. These results indicate that a motif of connected edge-sharing triads, as shown in Fig. 7b, is important in order to describe the data of the disordered molybdenum oxides, which was also found by Christiansen et al.7 We note here that when fitting this model to the PDF itself, we cannot describe the medium-range order present in the PDF. ML-MotEx rather allows identifying the main local motifs in the data.
Example 3: Identification of the ionic cluster structure from PDFs
To investigate the reproducibility of the ML-MotEx method, we investigate if similar results are achieved with different starting models, all containing the correct structure motif. We here model a PDF obtained from a solution of 0.05 M ammonium metatungstate hydrate, (NH4)6[H2W12O40]·H2O in water, which dissolves to form monodisperse α-Keggin clusters28. Experimental details are provided in Supplementary Methods.
To test the ML-MotEx method, we use four different starting models of tungstate oxide crystals, all including the α-Keggin cluster motif with varying complexity. Unit cells from the four following crystal structures were used as starting models: [Hpy]4H2[H2W12O40] (py = pyridine) [1]32, ((CH3)4N)4SiW12O40 [2]33, (((CH3)2NH2)6(Cu(HCON(CH3)2)4)(GeW12O40)2)(HCON(CH3)2)2 [3]34, and ((CH3)2NH2)3(PW12O40) [4]35. Again, we discarded all symmetry and generated a discrete structure model corresponding to the atoms in one single unit cell. All atoms other than tungsten and oxygen were furthermore removed from the structures before catalogue structures were created. Figure 8a shows the experimental dataset with simulated PDFs from the four different starting models. Figure 8b illustrates a W12O40 α-Keggin structure.
Again, we first build structure catalogues based on the starting models (step 1) and fit them to the experimental PDF (step 2). In this case, we extract 10^4 structures from each starting model, which is just a small fraction of all possible structures that can be made from the starting models that have 24 ([2]), 48 ([1] and [3]), and 72 ([4]) atoms that are permuted. Again, a GBDT model was trained to predict the Rwp values of the structures (step 3), and SHAP values were obtained to calculate atom contribution values (step 4). The resulting SHAP value plots can be seen in Supplementary Figs. 4–5. While ML-MotEx takes about 100 s on a 64-core AMD Ryzen Threadripper 3990X (2.9/4.3 GHz) using 10^4 fits on a structure with 48 atoms, it would take ~3 × 10^6 years (Supplementary Notes 1) to make fits of all the 2^48 − 1 possible structures using the brute-force approach. Supplementary Table 5 in the Supplementary Information shows the exact computer time of the fits on a MacBook Pro and a Threadripper, which clearly demonstrates the scalability of ML-MotEx.
Figure 9 shows the results of applying ML-MotEx to the 4 different starting models. For structures [1], [3], and [4], the 24 atoms most preferred by ML-MotEx were coloured yellow, while the rest were coloured black. For structure [2], 12 atoms were coloured yellow. In all 4 examples, the yellow atoms form the motif of an α-Keggin cluster; however, in Fig. 9c, d, we see a few mislabelled atoms (2 of 24 atoms in the worst case). The mislabelled atoms are found in the starting models containing the most atoms, i.e. with the highest permutation number N. To achieve a better prediction, we could have built larger catalogues of candidate structure motifs and thus performed more fits. We, therefore, conclude that the ML-MotEx method is not completely insensitive to the starting model, but that it yields very similar results for all the tested starting models, provided they contain similar motifs. Furthermore, the example shows that ML-MotEx can be used to investigate PDF data from clusters in solution whose structures are also part of known crystal structures. As seen from the results in Supplementary Figs. 7–10, we performed an identical analysis of a different dataset obtained from a second solution of 0.05 M ammonium metatungstate hydrate. This analysis provided highly comparable results, as discussed in the Supplementary Information. This illustrates the reproducibility of the method. In Supplementary Discussion 1, we discuss what happens if a poor starting model is used, and how one can identify if the starting model does contain the right motif using the confidence factor.
We have also used the ML-MotEx method for a larger ionic cluster, namely [Bi38O45]. Here, we use the β-Bi2O3 structure as starting model and used a ‘cookie-cutter’ strategy to generate structures for the motif catalogue. This example is described in Supplementary Discussion 2, and the ‘cookie-cutter’ approach is shown in Supplementary Fig. 11.
Discussion
In the four examples presented above, we have shown how explainable ML can aid in identifying structural motifs in nanostructured materials and presented an approach to structure characterization. Traditional PDF analysis investigates how an entire structure model agrees with an experimental PDF, rather than identifying how different features in the model affect the fit quality. Instead, ML-MotEx provides a quantitative measure of how each atom or feature contributes to the fit. The use of ML furthermore allows the screening of a large number of models in an automated and fast manner. In the examples described here, ML-MotEx has been used with various starting models with up to 256 metal atoms; however, the algorithm can handle larger systems, as it is highly scalable. In comparison, a full brute-force approach is computationally restricted to systems with up to 15–30 atoms. For the type of systems described here, it is possible to use the method in quasi-experimental time, which could, for example, be useful for analysis of time-resolved scattering data, where the structural motifs present might change with time, which would be revealed by changing SHAP values.
ML-MotEx shares some similarities with the cluster build-up algorithm LIGA11,12, which automatically builds clusters of different sizes based on information that is contained in inter-atomic distance lists extracted from the PDF. LIGA has been shown to be successful at automatically reconstructing clusters (up to 150 atoms) with no user input except the inter-atomic distance list, extracted from an experimental PDF, and at a low computational cost. However, its use has not caught on because extracting the distance list from the data presents significant practical difficulties, and the extracted list is not unique. Like ML-MotEx, it uses the error that each atom in a cluster contributes to the fit to weigh the decision about which atoms to include in the model. Presumably, part of the success of LIGA and ML-MotEx is their use of this atom contribution for rapidly finding good candidate motifs. Unlike LIGA, ML-MotEx requires a starting model that contains the target structural motif, and it leverages ML to rapidly compute the atom contributions. It can therefore be positioned between traditional refinement (where the complete starting model is needed) and LIGA (which is ab initio), as it finds structural motifs from within a larger model that can serve as a starting model for subsequent refinement. However, it has the significant advantage over LIGA that it works directly on the measured PDF and does not require the inter-atomic distance list to be extracted from the PDF data, and we expect it to be of great practical value. With this in mind, we plan to deploy ML-MotEx as an application on the PDFitc.org web server36.
It may be considered a significant drawback that ML-MotEx requires as an input a structure fragment that contains the target motif within it in order to work. We provide a confidence factor for the starting model, but ML-MotEx still requires significant chemical/structural knowledge and intuition to be of use. We first note that such intuition is widespread in the chemistry community and is unlikely to be a significant drawback in practice. For example, we have recently used the method to identify the structure of intermediates in the formation of transition metal tungstates from polyoxometalate ions using in situ PDF data37, and for identifying stacking fault domain sizes in manganese oxides from PDF and PXRD38. We also note that the method is sufficiently fast that it would be possible to combine it with structural screening applications such as structureMining@PDFitc15,36. Given chemical information about elements that are present, structureMining searches structural databases for candidate structures. These are then refined to a target dataset, and a rank-ordered list is returned to the user. If the PDF represents a signal from a short-range ordered structural motif, we could insert ML-MotEx between the database mining and refinement steps to search over sets of plausible structures to look for structural sub-motifs. It may be possible to first use structure mining to identify starting models, which could then be used for ML-MotEx analysis. The models could then be further evaluated using both the resulting Rwp values and confidence factor.
The ML-MotEx method is currently limited to PDF analysis in the fitting procedure of the algorithm (step 2). However, the rest of ML-MotEx (steps 1, 3, and 4) is ready to use with data from other techniques. We are confident that a similar approach, taking advantage of explainable ML and SHAP values, can be broadly useful for enhancing and developing how models for data analysis are identified and constructed.
Methods
Step 1: Creation of a catalogue of candidate structure motifs
The first step in ML-MotEx is to use a starting structure model to generate a catalogue of candidate structure motifs, which are all fitted to the data. The structures are generated by removing different numbers of atoms from the original starting structure resulting in thousands of smaller, candidate structure motifs.
This process, which we refer to as ‘structure permutation’, is illustrated in Fig. 10. Here, the starting model contains four metal atoms, which are each bonded to six oxygen atoms. Before candidate structure motifs are generated, we select which atom type should be included in the permutation process. For the project discussed here, this selection is based on the X-ray scattering power of the atoms (i.e. heavier atoms scatter X-rays strongly, while lighter ones do not), and we, therefore, choose to permute over the four metal atoms in the structure rather than oxygen atoms. The total number of atoms that are selected for permutation (here 4) is referred to as the permutation number, N. Note that we do not take symmetry into account in this process.
The selected atoms are removed or kept in the model by randomly associating them with zeros and ones, where 0 means that we remove the atom and 1 means we keep it. This is repeated multiple times to generate a large catalogue of candidate structure motifs. The total number of possible motifs from the permutations is equal to 2^N − 1, but only a small fraction of these needs to be produced for ML-MotEx to provide satisfactory results. In Supplementary Discussion 3, we discuss how large a catalogue of candidate structure motifs ML-MotEx needs as training data to output reasonable results. This is likely to be highly system dependent and especially dependent on N and structural symmetry. For the examples presented in the paper, we use ~140–3000 structure motifs per N.
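As a minimal sketch of this permutation step, the snippet below draws unique random keep/remove masks for N atoms. The function name generate_motif_masks and the decision to skip the all-zero mask (which would leave no permuted atoms, in line with the 2^N − 1 count) are illustrative assumptions and are not taken from the ML-MotEx code base.

```python
import random


def generate_motif_masks(n_atoms, n_motifs, seed=0):
    """Return a sorted list of unique random keep/remove masks (tuples of 0/1).

    1 means the atom is kept, 0 means it is removed. The all-zero mask is
    skipped (assumption), so at most 2**n_atoms - 1 distinct masks exist.
    """
    max_motifs = 2 ** n_atoms - 1
    if n_motifs > max_motifs:
        raise ValueError(
            f"only {max_motifs} distinct motifs exist for N={n_atoms}"
        )
    rng = random.Random(seed)
    masks = set()
    while len(masks) < n_motifs:
        mask = tuple(rng.randint(0, 1) for _ in range(n_atoms))
        if any(mask):  # skip the structure with every permuted atom removed
            masks.add(mask)
    return sorted(masks)


# Example: 10 candidate motifs for a starting model with N = 4 metal atoms.
masks = generate_motif_masks(n_atoms=4, n_motifs=10)
```

Rejection sampling of duplicates is cheap when only a small fraction of the possible masks is requested, which is the regime described in the text for larger N.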
The atoms which were not chosen for permutation, in this case oxygen, are removed if they are not within a distance threshold from any other atom. The threshold is user-defined and can be set according to PDF peaks and/or chemically valid distances (i.e. bond lengths) for the expected compounds.
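The pruning of non-permuted atoms can be sketched as follows. This is a simplified illustration, not the ML-MotEx implementation: it assumes that only the kept metal atoms count as neighbours when deciding whether an oxygen atom stays, and the function name prune_unbonded, the coordinates, and the 2.2 Å threshold are hypothetical.

```python
import math


def prune_unbonded(metals, oxygens, mask, threshold):
    """Apply a keep/remove mask to the metals and drop stranded oxygens.

    metals, oxygens: lists of (x, y, z) coordinates in angstrom.
    mask: one 0/1 flag per metal atom (1 = keep).
    threshold: maximum metal-oxygen distance for an oxygen to be retained.
    Assumption: only kept metal atoms are considered as neighbours.
    """
    kept_metals = [xyz for xyz, keep in zip(metals, mask) if keep]
    kept_oxygens = [
        o for o in oxygens
        if any(math.dist(o, m) <= threshold for m in kept_metals)
    ]
    return kept_metals, kept_oxygens


# Hypothetical two-metal fragment with one bridging and two terminal oxygens.
metals = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
oxygens = [(2.0, 0.0, 0.0), (-2.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
kept_m, kept_o = prune_unbonded(metals, oxygens, mask=(1, 0), threshold=2.2)
```

With the second metal removed, the oxygen at x = 6 Å has no remaining neighbour within the threshold and is dropped, while the bridging oxygen stays because it is still bonded to the first metal.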
Step 2: Fitting the catalogue of candidate structure motifs to the data
In the next step, we fit each of the candidate structures in the catalogue to the experimental PDF. We here use the Python-based program DiffPy-CMI30 for PDF fitting39,40,41 and apply the Debye equation for the calculation of scattering intensities and PDFs from the structures. The fitting strategies and parameters for each of the examples presented below are listed in Supplementary Table 2. The output of the fit is an Rwp value reflecting the quality of the fit:
$$R_{wp} = \sqrt{\frac{\sum_{i=1}^{n} \left[G_{obs}(r_i) - G_{calc}(r_i, P)\right]^2}{\sum_{i=1}^{n} G_{obs}(r_i)^2}} \qquad (1)$$
Here, Gobs and Gcalc are the observed and calculated PDFs, evaluated at the n points r_i, and P represents the refinement parameters in the model.
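For illustration, the fit-quality measure can be computed from two PDFs sampled on the same r-grid as sketched below. The sketch assumes the unweighted residual form commonly used in PDF refinement (the square root of the summed squared difference divided by the summed squared observed PDF); the function name rwp is hypothetical, and DiffPy-CMI reports this quantity itself during a real fit.

```python
import math


def rwp(g_obs, g_calc):
    """Unweighted Rwp between an observed and a calculated PDF.

    Both inputs are sequences of G(r) values on an identical r-grid.
    Returns 0 for a perfect fit and 1 when g_calc is zero everywhere.
    """
    if len(g_obs) != len(g_calc):
        raise ValueError("g_obs and g_calc must have the same length")
    numerator = sum((o - c) ** 2 for o, c in zip(g_obs, g_calc))
    denominator = sum(o ** 2 for o in g_obs)
    return math.sqrt(numerator / denominator)


# Toy numbers: the squared difference is 16 and the squared signal is 25.
example = rwp([3.0, 4.0], [3.0, 0.0])
```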
Step 3: Predicting Rwp values using Gradient Boosting Decision Trees
Gradient Boosting Decision Trees (GBDTs)25 are an ML method that performs classification or regression using an ensemble of decision trees. In this work, we use XGBoost25 as the GBDT algorithm for the regression task of predicting the fit quality (step 2) from the structural input given as zeros and ones (step 1) and the number of atoms in the structure. Figure 10b shows the input to the GBDT model.
The optimization is done by building trees of ‘yes’ and ‘no’ questions on whether to keep an atom in the structure or not, based on the resulting Rwp value. A hypothetical example of a simple tree can be seen in Fig. 1, step 3. When atom 4 is present in the structure, the GBDT model predicts an Rwp value that is 5% lower than when atom 4 is absent. In the same way, it predicts an Rwp value that is 12% lower if atom 1 is present. In the decision tree, the algorithm will therefore say ‘yes’ to keeping both atoms 1 and 4 in the structure. In this project, the GBDT model predicts the Rwp value using a weighted average of 100 trees.
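The idea of boosting yes/no questions about individual atoms can be illustrated with a toy implementation in plain Python. This is a deliberately simplified stand-in for XGBoost, not the algorithm used in ML-MotEx: every tree is a single-question stump, the loss is the squared error, and the training data are synthetic Rwp values in which atom 1 lowers Rwp by 0.12 and atom 4 lowers it by 0.05, mirroring the hypothetical example above. All names and numbers are assumptions made for illustration.

```python
from itertools import product


def fit_boosted_stumps(X, y, n_trees=100, learning_rate=0.3):
    """Fit an additive model of one-question trees to binary features."""
    base = sum(y) / len(y)  # start from the mean Rwp
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        best = None
        for j in range(len(X[0])):
            on = [r for x, r in zip(X, residual) if x[j] == 1]
            off = [r for x, r in zip(X, residual) if x[j] == 0]
            if not on or not off:
                continue  # atom j is never or always present: no split
            mean_on, mean_off = sum(on) / len(on), sum(off) / len(off)
            sse = (sum((r - mean_on) ** 2 for r in on)
                   + sum((r - mean_off) ** 2 for r in off))
            if best is None or sse < best[0]:
                best = (sse, j, mean_on, mean_off)
        if best is None:
            break
        _, j, mean_on, mean_off = best
        v_on, v_off = learning_rate * mean_on, learning_rate * mean_off
        stumps.append((j, v_on, v_off))
        pred = [p + (v_on if x[j] else v_off) for x, p in zip(X, pred)]
    return base, stumps


def predict(model, x):
    """Predict Rwp for one keep/remove mask x."""
    base, stumps = model
    return base + sum(v_on if x[j] else v_off for j, v_on, v_off in stumps)


# Synthetic catalogue: all non-empty masks for four atoms.
X = [m for m in product((0, 1), repeat=4) if any(m)]
y = [0.60 - 0.12 * m[0] - 0.05 * m[3] for m in X]
model = fit_boosted_stumps(X, y)
```

Comparing predict(model, (1, 0, 0, 1)) with predict(model, (0, 0, 0, 1)) recovers the effect of atom 1 on the predicted Rwp. A production workflow would rely on XGBoost, which grows deeper trees and adds regularization.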
The GBDT model performance is improved with a large amount of training data, which in this tool is provided by creating a larger catalogue of candidate structure motifs and fitting them to the data.
The GBDT model is trained on 80% of the data, which is referred to as the training set. XGBoost25 was used with default parameters except for the learning rate and maximum tree depth, which were tuned by Bayesian optimization using 50 iterations and three cross-validation splits42,43. While this procedure automates the hyperparameter tuning, we demonstrate in Supplementary Fig. 12 that similar results are achieved across various hyperparameters, and in Supplementary Fig. 13 that similar results are achieved across various random seeds. The remaining 20% of the data, referred to as the test set, is used to evaluate the performance of the algorithm.
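The data partitioning described above can be sketched with the standard library alone. The helper names train_test_split and k_fold are hypothetical (they are not the scikit-learn functions of similar name), and the Bayesian optimization loop that would score each hyperparameter candidate on the three folds is omitted.

```python
import random


def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=3):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield training, folds[i]


# Example: 1000 fitted motifs split 80/20, then three folds for tuning.
train_idx, test_idx = train_test_split(1000)
cv_splits = list(k_fold(train_idx, k=3))
```

Keeping the test indices out of the cross-validation folds means the final performance estimate is not influenced by the hyperparameter search.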
Step 4: Quantifying the effect of structural features using SHAP values, assigning atomic contribution values
SHAP values are used to analyse the Rwp values resulting from the process described above. For each fit (step 2), each atom in the starting model is assigned a SHAP value. The amplitude of the SHAP value reflects how important a structural feature is for the fit quality, while the sign of the SHAP value reflects whether the feature pushes the Rwp value of the fit towards 1 (poor fit) or 0 (perfect fit), in other words, why it is important. Each atom in the starting model thus gets F SHAP values, where F is the number of fits made in step 2 of the algorithm. We divide these F SHAP values into two lists: those from fits where the atom was kept in the structure motif (kept atom SHAP value list) and those from fits where the atom was removed to create the structure motif (removed atom SHAP value list). From each list, an average SHAP value for the atom is calculated, defined as SHAPaverage-kept and SHAPaverage-removed, respectively. We then define an atom contribution value, which is calculated as the difference between the two average SHAP values, i.e.:
$$\text{Atom contribution value} = \mathrm{SHAP}_{\text{average-kept}} - \mathrm{SHAP}_{\text{average-removed}} \qquad (2)$$
We also define the uncertainty of this value as described in Eq. 3:
We define a confidence factor for each atom that describes how confident we can be about including/excluding that atom in a structural motif;
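The atom contribution value can be sketched in a few lines, given one keep/remove mask and one row of per-atom SHAP values for each fit. The sketch assumes that the difference is taken as kept minus removed, so that atoms whose presence lowers Rwp receive negative values; the function name and the SHAP numbers are invented for illustration. The uncertainty and confidence factor are left out because the sketch does not assume a particular form for them.

```python
from statistics import fmean


def atom_contributions(masks, shap_values):
    """Average SHAP value when kept minus average SHAP value when removed.

    masks: one tuple of 0/1 flags per fit (1 = atom kept in the motif).
    shap_values: one tuple of per-atom SHAP values per fit, same shape.
    """
    contributions = []
    for j in range(len(masks[0])):
        kept = [s[j] for m, s in zip(masks, shap_values) if m[j] == 1]
        removed = [s[j] for m, s in zip(masks, shap_values) if m[j] == 0]
        if not kept or not removed:
            raise ValueError(f"atom {j} needs both kept and removed fits")
        contributions.append(fmean(kept) - fmean(removed))
    return contributions


# Hypothetical SHAP values for four fits of a two-atom starting model.
masks = [(1, 0), (1, 1), (0, 1), (0, 0)]
shap_values = [(-0.06, 0.01), (-0.05, -0.01), (0.07, -0.02), (0.06, 0.02)]
contributions = atom_contributions(masks, shap_values)
```

In this toy case the first atom has a strongly negative contribution (about −0.12) and the second a weakly negative one (about −0.03), so both would tend to be kept, the first with more weight.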
The results of ML-MotEx can be inspected visually, as the atoms in the starting model are coloured according to their atom contribution value, using yellow for a low atom contribution value (tendency to keep the atom, pushing Rwp down) and black for a high atom contribution value (tendency to remove the atom, pushing Rwp up). ML-MotEx outputs VESTA44 and CrystalMaker45 files in which all atoms are coloured according to their atom contribution value.
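One way to turn atom contribution values into such a colour scale is a linear interpolation between yellow and black, as sketched below. The linear mapping, the min/max normalization, and the function name are assumptions; the actual scaling and the VESTA and CrystalMaker file writing in ML-MotEx may differ.

```python
def contribution_to_rgb(values):
    """Map contribution values to RGB tuples from yellow (low) to black (high)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    colours = []
    for v in values:
        # t runs from 0 for the lowest value to 1 for the highest value.
        t = 0.0 if span == 0 else (v - lo) / span
        level = round(255 * (1 - t))
        colours.append((level, level, 0))  # equal red and green gives yellow
    return colours


# Hypothetical contribution values for three atoms.
colours = contribution_to_rgb([-0.12, 0.01, 0.08])
```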
Data availability
The authors declare that the data supporting this study are available within the paper, its Supplementary Information files, and the GitHub repository associated with the paper: https://github.com/AndySAnker/ML-MotEx. Additional data that support the findings of this study are available from the corresponding author upon request.
Code availability
The authors declare that the code supporting this study is available in the GitHub repository associated with the paper: https://github.com/AndySAnker/ML-MotEx. Additional code that supports the findings of this study is available from the corresponding author upon request.
References
Billinge, S. J. L. & Kanatzidis, M. G. Beyond crystallography: the study of disorder, nanocrystallinity and crystallographically challenged materials with pair distribution functions. Chem. Commun. 7, 749–760 (2004).
Keen, D. A. & Goodwin, A. L. The crystallography of correlated disorder. Nature 521, 303–309 (2015).
Christiansen, T. L., Cooper, S. R. & Jensen, K. M. Ø. There’s no place like real-space: elucidating size-dependent atomic structure of nanomaterials using pair distribution function analysis. Nanoscale Adv. 2, 2234–2254 (2020).
Billinge, S. J. L. & Levin, I. The problem with determining atomic structure at the nanoscale. Science 316, 561–565 (2007).
Juelsholt, M. et al. Size-induced amorphous structure in tungsten oxide nanoparticles. Nanoscale 13, 20144–20156 (2021).
Yang, X. et al. Confirmation of disordered structure of ultrasmall CdSe nanoparticles from X-ray atomic pair distribution function analysis. Phys. Chem. Chem. Phys. 15, 8480–8486 (2013).
Christiansen, T. L. et al. Structure analysis of supported disordered molybdenum oxides using pair distribution function analysis and automated cluster modelling. J. Appl. Crystallogr. 53, 148–158 (2020).
Bennett, T. D. & Cheetham, A. K. Amorphous metal–organic frameworks. Acc. Chem. Res. 47, 1555–1562 (2014).
Kjær, E. T. S. et al. DeepStruc: Towards structure solution from pair distribution function data using deep generative models. Preprint at https://chemrxiv.org/engage/chemrxiv/article-details/62fa600cd0c5cb353465329f (2022).
Cliffe, M. J., Dove, M. T., Drabold, D. & Goodwin, A. L. Structure determination of disordered materials from diffraction data. Phys. Rev. Lett. 104, 125501 (2010).
Juhás, P., Cherba, D. M., Duxbury, P. M., Punch, W. F. & Billinge, S. J. L. Ab initio determination of solid-state nanostructure. Nature 440, 655–658 (2006).
Juhás, P., Granlund, L., Duxbury, P. M., Punch, W. F. & Billinge, S. J. L. The Liga algorithm for ab initio determination of nanostructure. Acta Crystallogr. A 64, 631–640 (2008).
Christiansen, T. L., Bøjesen, E. D., Juelsholt, M., Etheridge, J. & Jensen, K. M. Ø. Size induced structural changes in molybdenum oxide nanoparticles. ACS Nano 13, 8725–8735 (2019).
Aalling-Frederiksen, O., Juelsholt, M., Anker, A. S. & Jensen, K. M. Ø. Formation and growth mechanism for niobium oxide nanoparticles: atomistic insight from in situ X-ray total scattering. Nanoscale 13, 8087–8097 (2021).
Yang, L., Juhás, P., Terban, M. W., Tucker, M. G. & Billinge, S. J. L. Structure-mining: screening structure models by automated fitting to the atomic pair distribution function over large numbers of models. Acta Crystallogr. A 76, 395–409 (2020).
Anker, A. S. et al. Characterising the Atomic Structure of Mono-Metallic Nanoparticles from X-Ray Scattering Data Using Conditional Generative Models. In Proc. 16th International Workshop on Mining and Learning with Graphs (MLG). (Association for Computing Machinery, New York, NY, 2020) https://www.mlgworkshop.org/2020/.
Banerjee, S. et al. Cluster-mining: an approach for determining core structures of metallic nanoparticles from atomic pair distribution function data. Acta Crystallogr. A 76, 24–31 (2020).
Anker, A. S. et al. Structural changes during the growth of atomically precise metal oxido nanoclusters from combined pair distribution function and small-angle X-ray scattering analysis. Angew. Chem. Int. Ed. 60, 2–12 (2021).
Butler, K. T., Le, M. D., Thiyagalingam, J. & Perring, T. G. Interpretable, calibrated neural networks for analysis and understanding of inelastic neutron scattering data. J. Phys. Condens. Matter 33, 194006 (2021).
Suzuki, Y. et al. Symmetry prediction and knowledge discovery from X-ray diffraction patterns using an interpretable machine learning approach. Sci. Rep. 10, 21790 (2020).
Torrisi, S. B. et al. Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships. npj Comput. Mater. 6, 109 (2020).
Oviedo, F., Ferres, J. L., Buonassisi, T. & Butler, K. T. Interpretable and explainable machine learning for materials science and chemistry. Acc. Mater. Res. 3, 597–607 (2022).
Schmidt, J., Marques, M. R. G., Botti, S. & Marques, M. A. L. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater. 5, 83 (2019).
Lee, K., Ayyasamy, M. V., Delsa, P., Hartnett, T. Q. & Balachandran, P. V. Phase classification of multi-principal element alloys via interpretable machine learning. npj Comput. Mater. 8, 25 (2022).
Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, San Francisco, 2016).
Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).
Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proc. 31st International Conference on Neural Information Processing Systems, 4765–4774 (Curran Associates, Inc., 2017).
Juelsholt, M., Lindahl Christiansen, T. & Jensen, K. M. Ø. Mechanisms for tungsten oxide nanoparticle formation in solvothermal synthesis: from polyoxometalates to crystalline materials. J. Phys. Chem. C 123, 5110–5119 (2019).
Chen, X. & Yamanaka, S. Single-crystal X-ray structural refinement of the ‘tetragonal’ C60 polymer. Chem. Phys. Lett. 360, 501–508 (2002).
Juhás, P., Farrow, C. L., Yang, X., Knox, K. R. & Billinge, S. J. L. Complex modeling: a strategy and software program for combining multiple information sources to solve ill posed structure and nanostructure inverse problems. Acta Crystallogr. A 71, 562–568 (2015).
Krebs, B. & Paulat-Böschen, I. The structure of the potassium isopolymolybdate K8[Mo36O12(H2O)16]·nH2O (n = 36⋯40). Acta Crystallogr. B 38, 1710–1718 (1982).
Niu, J., Zhao, J., Wang, J. & Bo, Y. Syntheses, spectroscopic characterization, thermal behavior, electrochemistry and crystal structures of two novel pyridine metatungstates. J. Coord. Chem. 57, 935–946 (2004).
Joachim, F., Axel, T. & Rosemarie, P. Strukturen und Schwingungsspektren des Tetramethylammonium-α-dodekawolframatosilikats und des Tetrabutylammonium-β-dodekawolframatosilikats: Structures and Vibrational Spectra of Tetramethylammonium α-Dodecatungstosilicate and Tetrabutylammonium β-Dodecatungstosilicate. Z. Naturforsch. B 36, 161–171 (1981).
Niu, J.-Y., Han, Q.-X. & Wang, J.-P. A novel Keggin units-supported complex: synthesis, characterization and crystal structure of [(CH3)2NH2]6[Cu(DMF)4(GeW12O40)2]·2DMF. J. Coord. Chem. 56, 523–530 (2003).
Busbongthong, S. & Ozeki, T. Structural relationships among methyl-, dimethyl-, and trimethylammonium phosphododecatungstates. Bull. Chem. Soc. Jpn. 82, 1393–1397 (2009).
Yang, L. et al. A cloud platform for atomic pair distribution function analysis: PDFitc. Acta Crystallogr. A 77, 2–6 (2021).
Skjærvø, S. L. et al. Atomic structural changes in the formation of transition metal tungstates: the role of polyoxometalate structures in material crystallization. Preprint at https://chemrxiv.org/engage/chemrxiv/article-details/62ebefdcd131b71fc70c4ef2 (2022).
Magnard, N. P. L., Anker, A. S., Aalling-Frederiksen, O., Kirsch, A. & Jensen, K. M. Ø. Characterisation of intergrowth in metal oxide materials using structure-mining: the case of γ-MnO2. Dalton Trans., https://doi.org/10.1039/D2DT02153F (2022).
Proffen, T. & Neder, R. B. DISCUS, a program for diffuse scattering and defect structure simulations – update. J. Appl. Crystallogr. 32, 838–839 (1999).
Proffen, T. & Neder, R. B. DISCUS: a program for diffuse scattering and defect-structure simulation. J. Appl. Crystallogr. 30, 171–175 (1997).
Coelho, A. A. TOPAS and TOPAS-Academic: an optimization program integrating computer algebra and crystallographic objects written in C++. J. Appl. Crystallogr. 51, 210–218 (2018).
Nogueira, F. Bayesian Optimization: Open source constrained global optimization tool for Python. https://github.com/fmfn/BayesianOptimization (2014).
Putatunda, S. & Rama, K. A Comparative Analysis of Hyperopt as Against Other Approaches for Hyper-Parameter Optimization of XGBoost. In Proc. 2018 International Conference on Signal Processing and Machine Learning, 6–10 (Association for Computing Machinery, New York, 2018).
Momma, K. & Izumi, F. VESTA 3 for three-dimensional visualization of crystal, volumetric and morphology data. J. Appl. Crystallogr. 44, 1272–1276 (2011).
Palmer, D. C. Visualization and analysis of crystal structures using CrystalMaker software. Z. Kristallogr. Cryst. Mater. 230, 559–572 (2015).
Acknowledgements
This work is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement No. 804066). We are grateful to the Villum Foundation for financial support through a Villum Young Investigator grant (VKR00015416). Funding from the Danish Ministry of Higher Education and Science through the SMART Lighthouse is gratefully acknowledged. We furthermore thank DANSCATT (supported by the Danish Agency for Science and Higher Education) for support. We acknowledge DESY (Hamburg, Germany), a member of the Helmholtz Association HGF, for the provision of experimental facilities. Parts of this research were carried out at beamlines P02.1 and P07 at PETRA III, and we would like to thank Martin Etter, Jozef Bednarcik, and Ann-Christin Dippel for assistance in using the beamlines. This research used resources of the Advanced Photon Source, a US Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under contract No. DE-AC02-06CH11357. We acknowledge MAX IV Laboratory for time on Beamline DanMAX under Proposal 20200731. Research conducted at MAX IV is supported by the Swedish Research Council under contract 2018-07152, the Swedish Governmental Agency for Innovation Systems under contract 2018-04969, and Formas under contract 2019-02496. DanMAX is funded by the NUFI grant no. 4059-00009B. S.J.L.B. was supported by the U.S. National Science Foundation through grant DMREF-1922234.
Author information
Contributions
A.S.A., E.T.S.K., and K.M.Ø.J. conceptualized the project. A.S.A., E.T.S.K., M.J., T.L.C., S.L.S., S.J.L.B., and R.S. designed the methodology and A.S.A. and E.T.S.K. wrote the code. A.S.A., M.J., T.L.C., M.R.V.J., I.K., and D.R.S. collected the data. K.M.Ø.J. procured funding and supervised the project. All authors were involved with the writing of the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Anker, A.S., Kjær, E.T.S., Juelsholt, M. et al. Extracting structural motifs from pair distribution function data of nanostructures using explainable machine learning. npj Comput Mater 8, 213 (2022). https://doi.org/10.1038/s41524-022-00896-3
DOI: https://doi.org/10.1038/s41524-022-00896-3