Abstract
X-ray absorption near-edge structure (XANES) spectra are the fingerprint of the local atomic and electronic structures around the absorbing atom. However, the quantitative analysis of these spectra is not straightforward. Even with the most recent advances in this area, for a given spectrum, it is not clear a priori which structural parameters can be refined and how uncertainties should be estimated. Here, we present an alternative concept for the analysis of XANES spectra, which is based on machine learning algorithms and establishes the relationship between intuitive descriptors of spectra (edge position, and the intensities, positions, and curvatures of minima and maxima) on the one hand, and descriptors of the local atomic and electronic structure (coordination numbers, bond distances and angles, and oxidation state) on the other hand. This approach overcomes the problem of the systematic difference between theoretical and experimental spectra. Furthermore, the numerical relations can be expressed in analytical formulas, providing a simple and fast tool to extract structural parameters based on the spectral shape. The methodology was successfully applied to experimental data for the multicomponent Fe:SiO_{2} system and reference iron compounds, demonstrating the high prediction quality for both the theoretical validation sets and experimental data.
Introduction
X-ray absorption spectroscopy is widely employed to probe the local atomic and electronic structure around the absorbing atom^{1,2,3}. The X-ray absorption near-edge structure (XANES), spanning a region of 50–200 eV above the absorption edge, contains information about structural descriptors such as bond distances and angles, the type of surrounding ligands, and the oxidation state, which affect the spectral descriptors: edge position and the shapes and positions of spectral maxima and minima. An experienced researcher can, for example, distinguish the pure metallic state from a metal oxide compound, or discriminate between tetrahedral and octahedral surroundings based on a qualitative inspection of the related spectral features.
Figure 1 shows a series of typical experimental Cu K-edge XANES spectra for different copper compounds. The pre-edge feature A originates from the transition to the spatially localized 3d states. The pre-edge shape depends on the number of electrons in the d-shell^{4}, its intensity is proportional to the amount of 3d–4p hybridization^{5}, while its energy position can be employed to calibrate the 3d metal oxidation state^{6}. The sharp shoulder B on the rising edge is indicative of a linear or square planar geometry with a lower energy of empty 4p orbitals perpendicular to the chemical bonds^{7}. A similar shoulder appears in the spectra of metals. K-edge XANES of metals with an fcc structure is further characterized by the splitting of the main peak into M1 and M2 features. Intensities of M2 and M3 are sensitive to the scattering from the second coordination shell^{8}, similar to the feature D in molecular covalent complexes, and their reduction can therefore be used to probe nanosize effects^{9}. Positions of M4 and further high-energy maxima relate to the interatomic distances via the semi-empirical Natoli’s rule^{10}. The absorption edge position depends on the oxidation state^{11} and also on interatomic distances^{12}. The intensity of the white line C is higher in spectra of metal complexes with octahedral coordination. Planar complexes are characterized by energy splitting of this peak^{13}. Characteristic spectral features can be further established for the K-edges of light atoms, L_{2,3} edges for 3d metals with strong multiplet splitting, or L_{2,3} spectra for 4d metals possessing a characteristic white line.
For a data scientist, the above-mentioned spectral features are recognized as descriptors, and the relationships between spectral descriptors and the structural ones (coordination number, geometry, bond distances, angles…) can be established, for example, by using machine learning (ML) algorithms. Using all points of a spectrum as descriptors, Zheng et al.^{14} managed to classify the atomic coordination environments via random forest models. A convolutional neural network was applied to predict Cu–Cu coordination numbers (CN)^{15} and to evaluate several CNs for platinum nanoparticles to refine their sizes and shapes^{16}. Rankine et al.^{17} demonstrated the ability of a deep neural network to predict a XANES spectrum from geometric information about the local environment around the absorbing atom. To achieve better performance of ML, the dimensionality of both spectral and structural descriptors should be reduced. For example, 3N atomic coordinates for N atoms can be converted into radial and angle distribution functions^{18} or into generalized radial distribution functions^{19}. Alternatively, geometry parameters can be filtered in terms of their importance for XANES variation^{20}. The multiple points of a spectrum can also be reduced to only a few descriptors. Commonly used spectral descriptors are the pre-edge centroid and the pre-edge area which are, for example, combined to analyze the Fe oxidation state and coordination number^{6}. Carbone et al. demonstrated that the principal components calculated from a series of theoretical spectra can be used to classify four-, five-, and six-coordinated metal environments^{21} and types of functional groups^{22}. Recently, Torrisi et al.^{23} demonstrated the concept of constructing descriptors from a polynomial fit of equidistant energy intervals of the spectrum.
The present work aims to extend these approaches by identifying descriptors of the spectral features, such as the positions of the absorption edge, minima, and maxima, together with their amplitudes and curvatures. We present a step-by-step procedure to prepare a training set, evaluate the descriptors, train the machine learning algorithm, apply cross-validation, and finally analyze the experimental data. The variable structural parameters are introduced and the problem of classification of the calculated spectra in terms of these parameters is addressed. The analytical formulas establishing the relations between the spectral features and the structural parameters are then derived. Finally, we validate the approach for a set of experimental spectra belonging to oxides, silicates, geological samples (tektites, impactites), and amorphous glasses as well as silica-supported Fe single-site catalysts prepared via surface organometallic chemistry^{24,25,26}.
Results and discussion
Descriptors of spectrum
In general, the theoretical XANES spectrum contains ~100 energy points. A common approach to improve the efficiency of ML algorithms is to reduce the dimensionality of such an object by extracting only informative features, notably the spectral descriptors^{23}. Table 1 and Fig. 2 describe a set of descriptors evaluated for each spectrum: edge position (feature A), white line position and intensity (feature B), first pit (minimum) position and intensity (feature C), the curvature of the white line, and projections on the principal components (further called PC descriptors). The arctangent function (red dotted line) was used to fit the whole spectrum, and the position of its center and its slope were taken as the values of edge position (Edge_{E}) and slope (Edge_{slope}). For some deformations in the local geometry, the white line in the calculated spectra can consist of several close maxima. To obtain a monotonic variation of descriptors across the training set, we performed an additional convolution (5 eV Lorentzian width) of the spectral regions near extrema B and C before evaluating the curvature, amplitude, and energy position of these features. The same convolution was applied to the experimental spectra before the evaluation of their descriptors.
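The descriptor evaluation described above can be sketched roughly as follows. This is a minimal illustration and not the authors' supplementary software: the function names, the parabola-fit window, the definition of the slope as the derivative of the fitted step at its center, and the synthetic spectrum are all assumptions, and the 5 eV pre-smoothing step is omitted.

```python
import numpy as np
from scipy.optimize import curve_fit


def arctan_step(energy, amplitude, center, width):
    """Arctangent step rising from 0 to `amplitude` around `center`."""
    return amplitude * (0.5 + np.arctan((energy - center) / width) / np.pi)


def edge_descriptors(energy, mu):
    """Fit an arctangent to the whole spectrum and return (Edge_E, Edge_slope)."""
    # Start from the steepest point of the spectrum and a 2 eV width (a guess).
    start = [mu.max(), energy[np.argmax(np.gradient(mu, energy))], 2.0]
    (amplitude, center, width), _ = curve_fit(arctan_step, energy, mu, p0=start)
    # Assumed definition of the slope: derivative of the fitted step at its center.
    return center, amplitude / (np.pi * abs(width))


def extremum_descriptors(energy, mu, index, half_window=3):
    """Return (energy, intensity, curvature) of the extremum at `index`."""
    lo, hi = max(index - half_window, 0), min(index + half_window + 1, len(mu))
    # Curvature is the second derivative of a parabola fitted around the extremum.
    a, _, _ = np.polyfit(energy[lo:hi] - energy[index], mu[lo:hi], 2)
    return energy[index], mu[index], 2.0 * a


def white_line_and_pit(energy, mu):
    """Locate the main maximum and the first minimum that follows it."""
    wl = int(np.argmax(mu))
    rising = np.nonzero(np.diff(mu[wl:]) > 0)[0]
    pit = wl + (int(rising[0]) if rising.size else len(mu) - 1 - wl)
    return extremum_descriptors(energy, mu, wl), extremum_descriptors(energy, mu, pit)


# Synthetic spectrum: step + white-line bump - shallow dip (illustrative numbers only).
energy = np.linspace(7100.0, 7200.0, 401)
mu = (arctan_step(energy, 1.0, 7120.0, 2.0)
      + 0.5 * np.exp(-((energy - 7130.0) / 4.0) ** 2)
      - 0.2 * np.exp(-((energy - 7150.0) / 6.0) ** 2))
(wl_e, wl_int, wl_curv), (pit_e, pit_int, pit_curv) = white_line_and_pit(energy, mu)
edge_e, edge_slope = edge_descriptors(energy, mu)
```

A negative curvature at the white line and a positive one at the pit are what one would expect from such a fit; real spectra with split maxima would need the smoothing step before the extrema are searched.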
Principal component analysis (PCA) was applied to the whole data set of theoretical spectra. Based on the singular value decomposition (SVD, see details in the Supplementary Methods section), the first three principal components were evaluated. Each spectrum was projected on these components and the projections were used as the descriptors of the spectrum. We also applied SVD analysis to the data set in which all edge positions (point A in Fig. 2) were aligned. In this way, another set of three principal components was obtained and used to calculate projections of every theoretical spectrum. We call these projections relative PC descriptors (rPC).
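A rough numpy sketch of PC descriptors obtained through SVD is given below. Mean-centering the spectra before the decomposition is an assumption of this sketch (the details are in the Supplementary Methods, which are not reproduced here), and the function names are hypothetical. The rPC variant would apply the same routine to spectra that have first been shifted to a common edge position.

```python
import numpy as np


def pc_descriptors(spectra, n_components=3):
    """Project spectra (n_spectra, n_points) on their leading principal components."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # Rows of vt are orthonormal directions ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean


def project(spectrum, components, mean):
    """Descriptors of a new (e.g. experimental) spectrum on the same energy grid."""
    return (spectrum - mean) @ components.T


# Toy library built from two hidden basis shapes (synthetic, for illustration).
rng = np.random.default_rng(0)
basis = rng.random((2, 100))
library = rng.random((50, 2)) @ basis + 1.0
projections, components, mean = pc_descriptors(library)
```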
Compared with descriptors based on the curvature of fixed energy intervals of a spectrum, as in the work of Torrisi et al.^{23}, the search for minima and maxima along with edge characteristics relies on physically motivated features of the spectrum. As we show below, good prediction quality can be achieved by using just 2 or 3 such descriptors. The stable definition of the extrema may be tricky for flattened spectra or L_{2,3} edges with rich multiplet splitting. Such systems may require additional spectral descriptors such as total variance, centers of mass and areas, fitted peak profiles, etc. These descriptors are beyond the scope of the present work but are included in the supplementary software.
Relationship between spectral features and structure
3d metal complexes can be found in a wide range of CNs and local symmetries around the metal center. The type of ligands determines the interatomic distances and symmetry for the given oxidation and spin state of the metal. Valuable catalysts or geological materials can contain iron ions in a silica matrix where oxygen coordination provides both Fe^{2+} and Fe^{3+} oxidation states along with several possible CNs. To address the problem of quantitative iron speciation, we calculated the training set consisting of spectra for Fe(SiO_{4})_{CN} complexes for CN = 2–6. The first shell distances and bond angles were varied for every CN using the improved Latin hypercube sampling (IHS) resulting in 3000 spectra calculated for all CN. A chemical shift was then applied for each spectrum to simulate absorption from Fe^{2+} and Fe^{3+} sites for every deformation. Figure 3 shows the clusters used for simulations and the variable structural parameters p. The ranges of their variation are listed in Table 2.
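For illustration, structural parameters can be sampled with a plain Latin hypercube as sketched below. Note that this is the basic stratified scheme, not the improved (IHS) variant used for the training set, and the parameter ranges in the example are hypothetical placeholders rather than the values of Table 2.

```python
import numpy as np


def latin_hypercube(n_samples, bounds, seed=0):
    """Plain Latin hypercube: one sample per equal-width stratum in every dimension."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    n_dims = len(bounds)
    # A random point inside each of the n_samples strata of the unit interval.
    unit = (np.arange(n_samples)[:, None] + rng.random((n_samples, n_dims))) / n_samples
    # Shuffle each column independently so strata are paired at random across dimensions.
    for d in range(n_dims):
        unit[:, d] = rng.permutation(unit[:, d])
    return bounds[:, 0] + unit * (bounds[:, 1] - bounds[:, 0])


# Hypothetical ranges: a bond distance in angstroms and a bond angle in degrees.
example_bounds = [(1.8, 2.2), (90.0, 120.0)]
samples = latin_hypercube(10, example_bounds)
```

Each row of `samples` would then define one deformed cluster for which a spectrum is calculated.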
The calculated spectra for Fe^{2+} are shown in Fig. 4. The library of spectra for Fe^{3+} contains the same entries but shifted according to the 1s core level energy difference evaluated within an accurate molecular orbital approach (see Methods section). Therefore, the total number of spectra in the training set was 6000, i.e., twice as many as shown in Fig. 4.
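Doubling the library by a rigid energy shift can be sketched as below. The shift of 2.0 eV is a purely illustrative placeholder (the actual values come from the molecular orbital calculations described in Methods), and resampling by linear interpolation with end-value padding is an assumption of this sketch.

```python
import numpy as np


def shift_spectrum(energy, mu, shift):
    """Return mu shifted by `shift` eV, resampled on the original energy grid."""
    # mu_shifted(E) = mu(E - shift); np.interp pads with the end values at the borders.
    return np.interp(energy - shift, energy, mu)


energy = np.linspace(7100.0, 7200.0, 1001)
# Three synthetic "Fe2+" spectra with slightly different edge widths (illustrative).
fe2_library = np.array([0.5 + np.arctan((energy - 7120.0) / w) / np.pi
                        for w in (1.5, 2.0, 2.5)])
placeholder_shift = 2.0  # eV, hypothetical value standing in for the computed shift
fe3_library = np.array([shift_spectrum(energy, mu, placeholder_shift)
                        for mu in fe2_library])
training_spectra = np.vstack([fe2_library, fe3_library])
oxidation_state = np.array([2] * len(fe2_library) + [3] * len(fe3_library))
```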
Each spectrum in the training set is characterized by several descriptors (Table 1). The whole training set can be projected on a 2D plot for the selected pair of descriptors. Figure 5 compares the distribution of points in the training set over different 2D maps where each point is colored according to its structural parameters. From a mathematical point of view, each color in Fig. 5 defines a class. If points for different classes are well separated on the 2D map, the chosen pair of descriptors is appropriate for the classification. For demonstration, we selected those pairs of the descriptors which separated points according to iron coordination number, Fe‒O distances, or oxidation state. In particular, the best descriptors for discriminating different CN were the curvature of the white line (WL_{curv}), pit energy position (Pit_{E}), edge position (Edge_{E}). The average distances in the iron first coordination shell could be distinguished according to the energy positions of the pit (Pit_{E}) and edge (Edge_{E}). Projections on the principal components were able to separate structures with different distances and oxidation states, while pit energy and white line position (WL_{E}) distinguished between structures with different oxidation states.
Beyond the two-dimensional scatter plots, which are informative for the qualitative selection of good descriptors, the best quality of classification and the best choice of descriptors for the ML algorithm were determined (Table 3) for combinations of 1, 2, 3, or 4 descriptors to predict CN, oxidation state, or distance in a pure compound (mixtures will be discussed further in section 2.4).
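A search over descriptor combinations of this kind might look like the sketch below, which scores every subset of a given size by cross-validated R^{2} with an Extra Trees regressor from scikit-learn. The number of trees, the three-fold split, and the synthetic data (in which the target depends only on the first and third columns) are assumptions made for illustration; the descriptor names are borrowed from Table 1 only as labels.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score


def rank_descriptor_subsets(X, y, names, size, cv=3, seed=0):
    """Score every descriptor subset of the given size; best score first."""
    scores = []
    for subset in combinations(range(X.shape[1]), size):
        model = ExtraTreesRegressor(n_estimators=50, random_state=seed)
        r2 = cross_val_score(model, X[:, list(subset)], y, cv=cv, scoring="r2").mean()
        scores.append((r2, tuple(names[i] for i in subset)))
    return sorted(scores, reverse=True)


# Synthetic descriptors: only columns 0 and 2 carry information about the target.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = X[:, 0] + 2.0 * X[:, 2]
names = ["Edge_E", "WL_int", "Pit_E", "WL_curv"]
ranking = rank_descriptor_subsets(X, y, names, size=2)
best_score, best_pair = ranking[0]
```

For a categorical target such as CN, `ExtraTreesClassifier` with an accuracy score could be substituted in the same loop.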
Two descriptors of spectra contain up to 95% of the information necessary for discrimination between Fe^{2+} and Fe^{3+}. Using the value of the edge energy alone provided 80% of the prediction quality. Considering the white line intensity in addition to Edge_{E} improved the quality to 90%. Other informative descriptors for the oxidation state were the energies of the main maximum and pit, and the first principal components. The Fe‒O mean distance is uniquely characterized by the combination of edge and pit energies (95% quality). Projections on the second and third relative principal components, rPC_{2} and rPC_{3}, were more important for this task than the first PC. Higher CNs are characterized by a sharp white line and a steep rising edge. Good quality of prediction for CN requires at least three descriptors, which include edge energy, slope, and curvature of the main maximum. The lowest accuracy in cross-validation analysis was observed for the standard deviation from the mean, which measures the disorder in the first coordination shell. Four descriptors were necessary to reach a prediction quality of 90%.
The optimal choice of the descriptors in Table 3 does not guarantee their transferability to the experimental data or to the analysis of multicomponent systems. In Section 2.4, we address the quality of structural analysis by using descriptors in the training set composed of linear combinations of spectra.
Analytical relations between descriptors: beyond Natoli’s rule
In the early eighties, Natoli formulated an empirical rule^{10} that establishes a dependence between peak positions in the XANES spectrum and interatomic distances for structures with similar symmetry. The rule applies, for example, to metals within the same space group (e.g., fcc Cu and Ni, Supplementary Fig. 3) and to structures that undergo a volume expansion, such as palladium after hydrogen sorption^{27}. In the latter case, we have previously observed that the relative intensities of the first two XANES maxima are proportional to the H/Pd ratio in palladium hydride samples^{28,29}. Another example by Zhang et al.^{30} provides an analytical relation between energy positions of maxima in U L_{3}-edge XANES spectra of uranyl complexes and distances between the uranium absorber and oxygen ligand atoms. Although they represent a useful tool for the analysis of XANES spectra, all these examples are limited to the usage of only one spectral descriptor and one descriptor of structure. In this section, we extend such methodology to derive the analytical relation between any set of spectral descriptors and structural parameters using a machine learning algorithm. The common approach to find simple analytical relations between known parameters p_{1}…p_{n} and target variable y is the construction of linear regression:
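In its standard form, with one weight w_{i} per parameter, such a linear regression can be written as below. The intercept w_{0} and the index range are assumptions of this sketch, chosen to match the coefficient notation used in the surrounding text.

```latex
y = w_0 + \sum_{i=1}^{n} w_i \, p_i \tag{1}
```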
More complex cases include pairwise and higher-degree products of the parameters p_{1}…p_{n}. We are interested in pretty solutions with good approximation quality. Prettiness means the absence of large coefficients w_{i} and the smallest possible number of nonzero w_{i}. For the integer relations problem, prettiness is achieved by applying special algorithms of integer orthogonalization (see e.g., ^{31} and §2.2 in ref. ^{32}). In the real-valued case, we use the feature-selecting properties of the Elastic Net algorithm^{33} combined with some heuristics. For the theoretical data set, we restrict ourselves to the parameters p_{1}…p_{n} and their pairwise products, thus Eq. (1) takes the form
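With linear weights w_{i} and pairwise weights w_{ij}, a regression restricted to the parameters and their pairwise products has the general shape below. Whether the squared terms (i = j) are included is an assumption here, suggested by the later mention of quadratic formulas; the intercept w_{0} is likewise assumed.

```latex
y = w_0 + \sum_{i=1}^{n} w_i \, p_i + \sum_{i=1}^{n} \sum_{j=i}^{n} w_{ij} \, p_i \, p_j \tag{2}
```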
In the first step, the data were normalized to zero mean and unit standard deviation. We implemented the elastic net method, which combines the LASSO^{34} and ridge regression penalties. In the case of a group of highly correlated variables, the LASSO algorithm tends to select one variable from the group and ignore the others, thus performing feature selection. If the linear formula returned by Elastic Net is heavy, we try to simplify it at the expense of model precision. To do so, we sort the coefficients (w_{i}, w_{ij}) returned by Elastic Net by their absolute values and try to build a linear model based on subsets of features with the largest absolute coefficients. The analysis was performed for subsets of each size (1, 2, 3, …), and the R^{2} score was evaluated for all of them. Afterward, one can choose between pretty models with moderate quality or more complicated models with higher precision. Table 4 shows the selected analytical relations between descriptors of spectra and structural parameters.
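The simplification procedure described above can be sketched with scikit-learn as follows. The regularization settings (`alpha`, `l1_ratio`), the use of the training R^{2} rather than a cross-validated one, and the synthetic target are assumptions of this sketch and are not taken from the paper.

```python
from itertools import combinations_with_replacement

import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression


def standardize(a):
    """Scale to zero mean and unit standard deviation (column-wise for 2D input)."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


def pairwise_features(P, names):
    """Stack the parameters and all their pairwise products, with readable labels."""
    cols, labels = [P[:, i] for i in range(P.shape[1])], list(names)
    for i, j in combinations_with_replacement(range(P.shape[1]), 2):
        cols.append(P[:, i] * P[:, j])
        labels.append(f"{names[i]}*{names[j]}")
    return np.column_stack(cols), labels


def simplified_models(P, y, names, max_terms=3, alpha=0.01, l1_ratio=0.5):
    """Rank terms by |Elastic Net coefficient|, then refit on the top-k terms."""
    F, labels = pairwise_features(P, names)
    F, y = standardize(F), standardize(y)
    coef = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(F, y).coef_
    order = np.argsort(-np.abs(coef))
    models = []
    for k in range(1, max_terms + 1):
        keep = order[:k]
        r2 = LinearRegression().fit(F[:, keep], y).score(F[:, keep], y)
        models.append(([labels[i] for i in keep], r2))
    return models


# Synthetic target that depends on one parameter and one pairwise product.
rng = np.random.default_rng(0)
P = rng.normal(size=(300, 3))
y = 2.0 * P[:, 0] - P[:, 1] * P[:, 2] + 0.01 * rng.normal(size=300)
models = simplified_models(P, y, ["p1", "p2", "p3"])
```

Each entry of `models` pairs a list of retained terms with its R^{2}, so one can pick the shortest formula whose score is still acceptable.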
Analytical relations between descriptors extend the qualitative classification of the 2d scatter plots. The obtained formulas explore dependencies between any number of spectral features and structural parameters. While, in general, ML algorithms work as a black box, Table 4 provides a geometrical interpretation of the best combinations of descriptors. For example, up to 90% prediction quality can be achieved for the interatomic distances if the energy positions of the edge, first maximum, and minimum are considered.
The accuracy of the quadratic analytical formulas for the oxidation state is above 80%. The quality of analysis could be improved if chemically relevant restrictions were imposed on the Fe‒O distances for Fe^{2+} or Fe^{3+} ions in the training set. For better generalization, we assumed that the ranges of variations of structural parameters were equal for both the oxidation states. Therefore, the chemical shift of the whole spectrum can be misinterpreted by the edge shift upon distance variation. This effect is partially accounted for by the main maximum intensity (WL_{int}) descriptor that enters the formula. The intensity of the main maximum changes along with Fe‒O bonds contraction therefore this descriptor can help to discriminate between shifts related to the oxidation state or volume changes. Formulas for CN depend on the curvature of the main maximum, which is consistent with the general behavior of EXAFS oscillations, whose amplitude is proportional to CN. One should note, however, that this conclusion should not be generalized to the structures with different types of bonds (e.g., metallic iron has larger CN, but the white line intensity is higher in the octahedral FeO oxide).
The second part of Table 4 interprets the features of the XANES spectrum in terms of geometry parameters. The slope of the edge depends on the average distances and coordination number. The curvature of the white line correlates with the disorder in the first coordination shell of iron, i.e., larger disorder makes the first maximum broader. The position of the first minimum is quite an important feature in the spectrum though it is less often analyzed as compared to the maxima. This feature is by almost 90% determined by the CN and Fe‒O distance. Its intensity is determined by the CN and disorder in the first coordination shell.
Fitting a multicomponent system
If the distribution of absorbing atoms in a material is heterogeneous, a linear combination of theoretical spectra with different oxidation states and coordination is required to describe the experimental spectrum. In this section, we extend the descriptor approach to the case of linear combinations and apply the descriptor analysis to the experimental Fe K-edge XANES data of iron oxide and iron silicate systems. The algorithm was applied to 56 experimental spectra of crystalline compounds^{35}, glasses^{36}, tektites, and impactites^{37,38}, as well as a single-site silica-supported Fe catalyst^{24,39}. Figure 6 shows experimental spectra and Supplementary Tables 2–6 provide a description of each sample and results of ML analysis. Iron coordination and oxidation state are heavily dependent on the conditions of synthesis. Studied samples are inherently heterogeneous systems. In particular, tektites are formed from molten high-speed ejecta during the early stages of impact crater formation^{40}. Impactites have a more complex history of their formation and are the result of the melting of various types of rocks located at different depths in the Earth’s crust. Iron in the amorphous silica structure has the potential to be a probe of impact rock formation conditions, such as pressure (P), temperature (T), and oxygen fugacity^{41,42}.
Figure 7 shows the steps required to apply the descriptor approach to the multicomponent system. We have constructed a database of linear combinations of theoretical spectra in the training set, using several random concentrations for every pair of spectra. The descriptors were then evaluated for the database. A cross-validation procedure was applied to different combinations of the descriptors to understand which combination works better for the mixtures.
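A database of two-component mixtures can be generated along the following lines. Labelling each mixture with the concentration-weighted average of the component labels is an assumption of this sketch, as are the function name and the number of concentrations per pair. For a library of thousands of spectra, the pairs would have to be subsampled rather than enumerated exhaustively as done here.

```python
from itertools import combinations

import numpy as np


def mix_pairs(spectra, labels, n_conc=3, seed=0):
    """Linear combinations of every pair of spectra at random concentrations."""
    rng = np.random.default_rng(seed)
    mixed_spectra, mixed_labels = [], []
    for i, j in combinations(range(len(spectra)), 2):
        for c in rng.random(n_conc):
            mixed_spectra.append(c * spectra[i] + (1.0 - c) * spectra[j])
            # Assumed labelling: concentration-weighted mean of the pure labels.
            mixed_labels.append(c * labels[i] + (1.0 - c) * labels[j])
    return np.array(mixed_spectra), np.array(mixed_labels)


# Toy library: four synthetic spectra labelled by coordination number.
rng = np.random.default_rng(1)
pure_spectra = rng.random((4, 50))
pure_cn = np.array([2.0, 4.0, 5.0, 6.0])
mixture_spectra, mixture_cn = mix_pairs(pure_spectra, pure_cn)
```

The descriptors would then be evaluated on `mixture_spectra` exactly as for the pure spectra.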
The appropriate choice of the descriptors for the given structural property should provide good quality of analysis both for the theoretical validation set and a set of experimental references. Therefore, we have calculated the descriptors of the experimental and calculated spectra for the reference structures. The pairs of theoretical and experimental descriptors for the known structures can be used in step 2.2 to calibrate the descriptors in the theoretical training set for systematic energy shifts or intensity differences. The calibration step for intensities may be necessary when experimental spectra are measured in fluorescence mode and are flattened owing to self-absorption. In this work, we did not apply any calibration after the convolution of theoretical spectra. In step 2.4, the reference spectra are used for validation before predicting results for the unknown structures. Figure 8 shows selected scatter plots for pairs of descriptors that can discriminate efficiently between iron oxidation state, CN, and average Fe‒O distance in the two-component mixture. While the classes were well separated in Fig. 5, they overlap in Fig. 8 due to the linear combinations added to the training sample. We projected spectra of several references (hollow circles) on the two-dimensional scatter plots. Reference oxides and silicates have quite different structures from the entries in the training set, but descriptors Pit_{E}, WL_{E}, and WL_{curv} provided surprisingly good quality for their analysis. α-Fe_{2}O_{3} and NaFeSi_{2}O_{6} were properly projected to the region of 6-coordinated species. γ-Fe_{2}O_{3} and Fe_{3}O_{4} contain one-third of Fe ions in the tetrahedral positions and these points are projected to the region where 4-, 5-, and 6-coordinated points overlap (Fig. 8a). The Fe_{2}SiO_{4} reference has the longest Fe‒O distances, equal to 2.2 Å, and it is properly projected to the blue region of the plot in Fig. 8b, while γ-Fe_{2}O_{3} has the shortest. In Fig. 8c, Fe_{2}O_{3} and NaFeSi_{2}O_{6} are assigned to Fe^{3+}, Fe_{2}SiO_{4} to Fe^{2+}, whereas Fe_{3}O_{4} contains a mixture of Fe^{2+} and Fe^{3+} sites.
The classes in the training set overlap when linear combinations of spectra are introduced along with the pure species. Figure 8a shows how CN classes are mixed compared with Fig. 5a. The points with intermediate average valence also overlap. In general, the prediction quality is 5–10% lower for the mixture than for the pure compound. The main difference was observed for the iron valence. For common structural parameters, the oxidation state affects only the energy position of spectra. Linear combination of spectra smears the localized distributions of Fe^{2+} and Fe^{3+} points in the scatter plots (compare Figs. 5e and 8c, respectively). Two descriptors can provide a quality of valence discrimination in the mixture of up to 80%, and the use of three or more descriptors is recommended. A better choice should consider the joint analysis of descriptors from several spectral regions. Thus, in ref. ^{43}, the multivariate approach was applied to XANES spectra to determine the iron redox state in silicate glasses. It was demonstrated that using the full spectral region from the pre-edge to the EXAFS provides more accurate results. Pre-edge descriptors alone can be applied to the charge state analysis as well. Wilke et al. demonstrated for the Fe K-XANES^{44} that the pre-edge contains information about both the oxidation state and coordination number. The method analyses the 2D scatter plot of the integrated pre-edge intensities versus the pre-edge centroid positions. The set of reference spectra was distributed in the localized regions attributed to the 4-, 5-, and 6-coordinated Fe ions in oxidation states Fe^{2+} and Fe^{3+}. The limitations of this methodology arise from the need for well-defined reference spectra, since pre-edge XANES simulations are still difficult for real systems. Moreover, no references with CNs below 4 were reported.
Tables 5 and 6 demonstrate the best combinations of descriptors in terms of their quality calculated over the whole theoretical database or set of experimental references. The best triples of descriptors are different for these two tasks. The fact can be understood due to statistical considerations. The area of variation of parameters in the theoretical training set is large and includes even chemically irrelevant species, e.g., Fe^{3+} with distances longer than in Fe^{2+}. In contrast, the range of structural parameters covered by experimental spectra is smaller and represents a subclass of the training set. The R^{2} score quality is evaluated in the crossvalidation procedure and depends on the size of the sample and its dimensions. Therefore Table 6 contains also the mean absolute error evaluated along with R^{2} score for the experimental validation set. The triples of good descriptors for experimental analysis are listed in Table 6 for each structural parameter: [WL_{E}, Pit_{E}, rPC_{2}] for CN, [WL_{int}, Pit_{int}, rPC_{2}] for Fe‒O distance, [Edge_{E}, WL_{E}, PC_{3}] for valence. Figure 9 reports the predicted structural parameters for reference experimental spectra compared with their actual values. Prediction for all experimental spectra can be found in Supplementary Tables 8–10. The mean absolute errors over the validation set were 0.1 for oxidation state, 0.4 for CNs, and 0.03 for distances. The largest errors of the ML algorithm were observed for crystalline compounds, which have a significantly different structure from entries in the training set. The latter was adapted for Fe:SiO_{2} systems and contains silicon in the second coordination shell, while some reference minerals are composed of oxygen and iron/Al/CO in the nearest coordination shells. The methodology can be directly applied to a new training set extended by ligands of different types. In this case, additional labels (e.g., atom types) should be added as descriptors of structure.
The obtained results for the studied samples (“unknown” in Fig. 9) are in good agreement with other experimental methods. Mössbauer spectroscopy confirmed that the fraction of Fe^{3+} ions was larger in impactites (zhamanshinite, irghizite). A number of XAS-based^{38,45} and non-XAS investigations have shown that the iron oxidation state in tektites from different strewn fields is close to Fe^{2+}, with the Fe^{3+}/ƩFe ratio generally <0.15^{42,46}. Iron in impact glasses can cover a wider range of Fe oxidation states^{37,47,48} as compared with tektites, from purely Fe^{2+} to purely Fe^{3+}, and Fe^{3+}/ƩFe values are mainly within 0.25–0.59^{42}. Fe‒O distances are generally smaller for Fe^{3+} ions and we observed a similar trend for impactites as compared to tektites. The Fe CN in tektites is still a disputed issue. EXAFS studies have reported that mean Fe CNs in tektites are close to 4^{45}, whereas the coexistence of four- and five-coordinated Fe was observed in ref. ^{38}. Our estimations fall in the range CN = 3.5–4.5, reproducing a similar trend as in EXAFS analysis. The absolute values of CN obtained from EXAFS analysis highly correlate with the Debye–Waller factor and can also be affected by self-absorption effects in the fluorescence regime of measurements (iron catalyst samples). Therefore, in the corresponding panel of Fig. 9, we omitted the expected values of CN to avoid confusion.
Fig. 10 represents the formation process of the single-site Fe catalyst on silica. The analysis for this system implies that Fe remains at oxidation state +2 throughout the process, consistent with Mössbauer analysis and magnetic characterization^{24}. It also shows that after grafting of the molecular precursor, dimeric Fe(II) tris(tert-butoxy) siloxide, on SiO_{2} dehydroxylated at 1080 °C, the coordination number of Fe, CN(Fe), remains close to 4 (Fe@SiO_{2} 1), whereas it decreases to 3 after thermolysis at 1020 °C (Fe@SiO_{2} 2), consistent with previously reported characterization data that show a similar decrease of CN(Fe), albeit to a value of 2^{24}. This confirms that thermal treatment leads to Fe(II) species with a low coordination number, probably situated between 2 and 3. It is noteworthy that the sample prepared at lower temperature for both the hydroxylation and thermolysis steps displays Fe sites with a larger coordination number of 4.
As a concluding remark, we note that usually, the ML algorithms work as a “black box” for researchers since it is difficult to understand what structural information is contained in each part of the spectrum. We approach such understanding by using selected descriptors of the spectrum instead of individual points. The whole spectrum is substituted by several descriptors that intuitively characterize its shape, i.e., energy position of edge, minima, maxima, their intensities, and curvatures. Machine learning analysis established the rational choice of the combinations of descriptors providing the highest prediction accuracy for the structural parameters both for pure compounds and their mixtures. To visualize the spectrum–structure relations we use scatter plots and derive analytical dependencies between the descriptors of the spectrum and structural parameters.
Rational choice of descriptors isolates those features of spectra that are most sensitive to specific structural parameters, avoiding fitting the whole spectrum. The major problem of the practical application of ML methods for experimental data analysis arises from the systematic differences between theoretical calculations and measured data. This discrepancy can arise either from limitations of the theoretical approach or from experimental artefacts. The benefit of using descriptors over the full spectrum lies in the possibility of correcting the systematic differences by calibration on a dataset of theoretical and experimental spectra of reference compounds. However, as with all methodologies based on supervised learning, our results are limited to the family of structures described by the training set. As an illustration, the algorithm was trained on the Fe–O–Si system; it will thus fail to predict proper parameters for metallic Fe or sulfide compounds, which belong to very different types of materials. This certainly calls for expanding the training set in order to allow for distinguishing, for instance, the ligand types apart from coordination number or interatomic distances.
The further development of the approach is directed toward new ways of descriptor evaluation. A complete set of descriptors should provide the same amount of structural information as in a full spectrum. We foresee that a combination of descriptors from complementary experimental methods (nuclear magnetic resonance, electron paramagnetic resonance, X-ray diffraction, etc.) would significantly improve the quality of prediction.
Methods
XANES simulations and energy alignment
Fe K-edge XANES spectra were calculated using the full-potential finite difference method^{49} implemented in the FDMNES software^{50}. The photoelectron wave functions were evaluated on a grid of points in a 5.5 Å sphere around the absorbing atom with a 0.2 Å interpoint distance. To account for the core-hole lifetime broadening and the instrumental energy resolution, the theoretical spectra were further convoluted with a Lorentzian whose energy-dependent width was modeled by an arctangent function.
For an accurate energy calibration of the spectra, the iron 1s core-level energy shifts between the Fe^{2+} and Fe^{3+} oxidation states were estimated for each coordination number within the molecular orbital approach. The energy levels and the corresponding wave functions were calculated by density functional theory using the B3LYP exchange-correlation functional^{51}. The largest available basis set, QZ4P, implemented in the ADF2019 software^{52,53} was used. For every coordination number in the range from two to six, we constructed a symmetric complex with Fe-O distances equal to 2 Å and evaluated the transition matrix elements in a 50 eV energy interval for both the Fe^{2+} and Fe^{3+} oxidation states. The proper oxidation state was set by specifying the charge and spin state of the whole complex. After the self-consistent procedure had converged, the charge states of the iron atoms were confirmed by Mulliken charge analysis. The chemical shifts of the 1s core levels were evaluated and applied to the spectrum calculated by the finite difference method. In this way, we simulated absorption from Fe^{2+} and Fe^{3+} sites for given values of the structural parameters.
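The broadening step can be sketched as follows. This is a minimal illustration, not the FDMNES implementation: the arctangent form of the width is one commonly used parametrization, all parameter values (`e_fermi`, `gamma_hole`, `gamma_max`, `e_cent`, `e_larg`) are placeholders, and evaluating the width at the output energy is a simplifying assumption.

```python
import numpy as np

def lorentzian_width(energy, e_fermi, gamma_hole, gamma_max, e_cent, e_larg):
    """Energy-dependent Lorentzian full width with an arctangent shape.

    Below e_fermi the width equals the core-hole width gamma_hole; above it
    the width grows smoothly toward gamma_hole + gamma_max.
    """
    width = np.full(energy.shape, float(gamma_hole))
    above = energy > e_fermi
    e = (energy[above] - e_fermi) / e_cent
    arg = (np.pi / 3.0) * (gamma_max / e_larg) * (e - 1.0 / e**2)
    width[above] += gamma_max * (0.5 + np.arctan(arg) / np.pi)
    return width

def broaden(energy, mu, width):
    """Convolve mu(E) with a Lorentzian of energy-dependent full width.

    Assumption: the width is taken at the output energy, and every row of
    the kernel is renormalized to limit truncation effects at the ends of
    the energy grid.
    """
    diff = energy[:, None] - energy[None, :]
    half = 0.5 * width[:, None]
    kernel = (half / np.pi) / (diff**2 + half**2)
    weights = kernel * np.gradient(energy)[None, :]
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ mu

# Synthetic example: a sharp step edge broadened with placeholder parameters.
energy = np.linspace(7100.0, 7250.0, 601)
mu_raw = (energy > 7120.0).astype(float)
width = lorentzian_width(energy, e_fermi=7115.0, gamma_hole=1.25,
                         gamma_max=15.0, e_cent=30.0, e_larg=30.0)
mu_broad = broaden(energy, mu_raw, width)
```

With these placeholder values the width stays at 1.25 eV below the Fermi level and increases monotonically above it, so features near the edge remain sharp while the high-energy oscillations are smoothed.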
Machine learning algorithms
When we apply machine learning based on spectrum descriptors (to calculate the quality of label prediction or to predict labels for experimental data), we use Extra Trees regressor or classifier models^{54}. Such a model consists of several randomly generated decision trees. A decision tree represents a flowchart of threshold conditions on the parameters and divides the parameter space into non-intersecting rectangles. In each rectangle, the objective function μ(E, P) is approximated by a linear one using the least-squares method for regression, and a probability table is calculated for classification. The results obtained from the individual trees are averaged.
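As a rough illustration of descriptor-based regression, the sketch below trains scikit-learn's `ExtraTreesRegressor` on a synthetic table of descriptors. The data, the linear relations used to generate them, and the choice of two descriptors are invented for the example. Note also that the scikit-learn implementation predicts a constant value in each leaf, so it is a stand-in for, not a reproduction of, the linear leaf approximation described above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic training table (not real data): two spectral descriptors per
# spectrum and one structural label, here an Fe-O distance in angstroms.
n_samples = 400
distance = rng.uniform(1.8, 2.2, n_samples)
edge_energy = 7120.0 + 5.0 * (2.0 - distance) + rng.normal(0.0, 0.05, n_samples)
max_intensity = 1.5 - 0.8 * (distance - 2.0) + rng.normal(0.0, 0.01, n_samples)
descriptors = np.column_stack([edge_energy, max_intensity])

# Train on the first 300 rows, predict the label for the remaining 100.
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(descriptors[:300], distance[:300])
predicted = model.predict(descriptors[300:])
rmse = float(np.sqrt(np.mean((predicted - distance[300:]) ** 2)))
```

For a classification label such as the oxidation state, `ExtraTreesClassifier` from the same module can be used in the same way.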
For XANES approximation (Supplementary Fig. 8), we use a supervised machine learning algorithm based on radial basis functions (RBF), which constructs a continuous approximation of the spectrum, μ(E), as a function of the structural parameters P = (p_{1}, p_{2}, …, p_{k}). The RBF method is a well-proven mesh-free method^{55,56,57}. The unknown function \(\widehat{\mu}(E, P)\) is represented in terms of a set of basis functions characterized by certain factors and polynomial terms as follows:
\[\widehat{\mu}(E, P) = \sum_{i=1}^{N} w_i \, K\left( \left\| P - P_i \right\| \right) + \mathrm{Polynomial}_E(P) \qquad (1)\]
where K(r) is the radial basis function and Polynomial_{E}(P) is a polynomial function of the k-dimensional vector of structural parameters P with energy-dependent coefficients. The training set is composed of N calculated spectra. The points (N = 600 for each structure in Fig. 3) in the space of structural parameters P were chosen according to improved distributed hypercube sampling (IHS)^{58}. The unknown factors w_{i} and the polynomial coefficients are obtained by the ridge quadric regression method. Every basis function is a function of the distance from the training set point P_{i}. In our task, good results were obtained using linear basis functions and a second-order polynomial (see also Supplementary Table 1 for comparison with other ML methods).
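A minimal sketch of such an approximation, with a linear basis function K(r) = r and a second-order polynomial, is given below. It assumes a plain Euclidean norm and a single ridge-regularized least-squares solve for all coefficients at once; the function names, the ridge value, and the synthetic "spectra" are illustrative and are not taken from the published code.

```python
import itertools
import numpy as np

def quadratic_features(P):
    """Columns 1, p_j and p_j*p_l (j <= l) of a second-order polynomial."""
    k = P.shape[1]
    cols = [np.ones(len(P))]
    cols += [P[:, j] for j in range(k)]
    cols += [P[:, j] * P[:, l]
             for j, l in itertools.combinations_with_replacement(range(k), 2)]
    return np.column_stack(cols)

def design_matrix(P, centers):
    """Linear radial basis K(r) = r on the training points, plus polynomial terms."""
    r = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    return np.hstack([r, quadratic_features(P)])

def fit_rbf(P_train, spectra, ridge=1e-6):
    """Ridge least squares for the basis factors and polynomial coefficients.

    One column of coefficients is obtained per energy point.
    """
    A = design_matrix(P_train, P_train)
    lhs = A.T @ A + ridge * np.eye(A.shape[1])
    return np.linalg.solve(lhs, A.T @ spectra)

def predict_rbf(P_new, P_train, coef):
    return design_matrix(P_new, P_train) @ coef

# Synthetic stand-in for calculated spectra: an arctangent edge whose
# position and width depend on two structural parameters.
rng = np.random.default_rng(0)
energy = np.linspace(0.0, 10.0, 40)

def toy_spectrum(P):
    shift, broad = P[:, :1], 1.0 + P[:, 1:2]
    return 0.5 + np.arctan((energy[None, :] - 5.0 - shift) / broad) / np.pi

P_train = rng.uniform(0.0, 1.0, size=(100, 2))
coef = fit_rbf(P_train, toy_spectrum(P_train))
P_test = rng.uniform(0.1, 0.9, size=(20, 2))
max_error = np.max(np.abs(predict_rbf(P_test, P_train, coef) - toy_spectrum(P_test)))
```

Once the coefficients are fitted, evaluating the approximation for new structural parameters requires only a distance computation and a matrix product, which is what makes it fast enough for interactive fitting.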
It is important to define a proper norm in Eq. (1) to measure the distance between P and P_{i} in order to obtain a good quality of approximation. The structural parameters p_{1}, p_{2}, …, p_{k} have different scales, e.g., interatomic distances and angles. Moreover, the variation of the target function, \(\widehat{\mu}(E, p_1, p_2, \ldots, p_k)\), differs greatly between structural parameters: spectral changes caused by an angle transformation are an order of magnitude smaller than those caused by a modification of the interatomic distances. We therefore first estimate the average partial variance of the target function (Δ_{i}μ) for each p_{i} and rescale the structural parameters in the following way:
The quality of approximation and prediction is evaluated by 10-fold cross-validation. The training set, composed of spectra (for the task of approximating XANES as a continuous function of structural parameters) or descriptors (for the task of predicting structural parameters from several spectral features), is divided randomly into 10 parts, nine of which are used for training the algorithm and the tenth for validation. The quantitative measure of quality is the R^{2} score for regression tasks and the accuracy for classification. Details of their evaluation are described in the Supplementary Methods section, while the supplementary Jupyter Notebook reports the steps necessary to repeat the calculations in the manuscript Fig. 10.
Section 2.4 of the main text deals with multicomponent systems. The algorithm is then trained on linear combinations of spectra instead of pure theoretical spectra. In total, more than 5000 pairs were constructed with randomized fractions of components having different CNs, valences, and Fe-O distances. The flowchart in Fig. 7 describes the details of the procedure for mixture analysis. We found that the prediction quality for reference experimental data may be improved when sampling is performed according to an adaptive sampling scheme. Whereas the IHS scheme provides uniform sampling over each structural parameter, adaptive sampling (or active learning)^{59,60} chooses the points of the training sample so as to ensure a uniform variation of the XANES in the selected region of structural parameters. Both training sets are available as SI.
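The cross-validation procedure can be sketched in a few lines. The example below is an illustration under stated assumptions: the R^{2} score is computed once over the pooled out-of-fold predictions (averaging per-fold scores is an equally common choice), and an ordinary least-squares model on synthetic data serves as a placeholder for the actual regressor.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def cross_val_r2(X, y, fit, predict, n_folds=10, seed=0):
    """Shuffle the data, split it into n_folds parts, train on n_folds - 1
    parts and predict the held-out part; return R^2 of all held-out predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_pred = np.empty(len(y), dtype=float)
    for k, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        model = fit(X[train_idx], y[train_idx])
        y_pred[val_idx] = predict(model, X[val_idx])
    return r2_score(y, y_pred)

# Placeholder model: ordinary least squares with an intercept.
def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic descriptors and label (not real data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
score = cross_val_r2(X, y, fit_linear, predict_linear)
```

Any regressor that exposes a fit and a predict step, including a tree ensemble, can be passed in place of the placeholder functions.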
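Constructing a training set of two-component mixtures can be sketched as follows. The array layout, the function name `make_mixture_set`, and the choice of targets (the fraction of the first component followed by the labels of both components) are assumptions made for illustration; the pure spectra and labels are random placeholders.

```python
import numpy as np

def make_mixture_set(spectra, labels, n_pairs, seed=0):
    """Random two-component linear combinations of pure spectra.

    spectra: array (n_pure, n_energy) of pure-component spectra.
    labels:  array (n_pure, n_labels), e.g. CN, valence, Fe-O distance.
    Returns the mixed spectra and, per mixture, the fraction of the first
    component followed by the labels of both components.
    """
    rng = np.random.default_rng(seed)
    n_pure = len(spectra)
    i = rng.integers(0, n_pure, n_pairs)
    j = rng.integers(0, n_pure, n_pairs)
    frac = rng.uniform(0.0, 1.0, n_pairs)[:, None]
    mixed = frac * spectra[i] + (1.0 - frac) * spectra[j]
    targets = np.hstack([frac, labels[i], labels[j]])
    return mixed, targets

# Placeholder pure spectra and labels (random numbers, not real data).
rng = np.random.default_rng(2)
pure_spectra = rng.uniform(0.0, 1.5, size=(20, 50))
pure_labels = np.column_stack([
    rng.integers(2, 7, 20),          # coordination number
    rng.integers(2, 4, 20),          # valence (2 or 3)
    rng.uniform(1.8, 2.2, 20),       # Fe-O distance, angstroms
])
mixed, targets = make_mixture_set(pure_spectra, pure_labels, n_pairs=5000)
```

Descriptors are then computed from the mixed spectra in the same way as for pure spectra before training.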
Data availability
The data that support the findings of this study are available at the repository https://github.com/gudasergey/XANES_descriptors along with the source code.
Code availability
The source code and executable Jupyter Notebook used to train the models and generate the figures in this publication are publicly available at the repository https://github.com/gudasergey/XANES_descriptors.
References
Calvin, S. XAFS for Everyone (Taylor & Francis, 2013).
Henderson, G. S., de Groot, F. M. F. & Moulton, B. J. A. X-ray absorption near-edge structure (XANES) spectroscopy. Rev. Mineral. Geochem. 78, 75–138 (2014).
Lamberti, C. & van Bokhoven, J. A. Introduction: historical perspective on XAS. In X‐Ray Absorption and X‐Ray Emission Spectroscopy 1–21 (John Wiley & Sons Ltd, 2016).
de Groot, F., Vanko, G. & Glatzel, P. The 1s x-ray absorption pre-edge structures in transition metal oxides. J. Phys. Condens. Matter 21, 104207 (2009).
Westre, T. E. et al. A multiplet analysis of Fe K-edge 1s→3d pre-edge features of iron complexes. J. Am. Chem. Soc. 119, 6297–6314 (1997).
Wilke, M., Farges, F., Petit, P. E., Brown, G. E. & Martin, F. Oxidation state and coordination of Fe in minerals: an Fe K-XANES spectroscopic study. Am. Mineral. 86, 714–730 (2001).
Zhang, R. Q. & McEwen, J. S. Local environment sensitivity of the Cu K-edge XANES features in Cu-SSZ-13: analysis from first-principles. J. Phys. Chem. Lett. 9, 3035–3042 (2018).
Oyanagi, H. et al. Small copper clusters studied by x-ray absorption near-edge structure. J. Appl. Phys. 111, 084315 (2012).
Gombac, V. et al. CuOx-TiO2 photocatalysts for H2 production from ethanol and glycerol solutions. J. Phys. Chem. A 114, 3916–3925 (2010).
Natoli, C. R. Distance dependence of continuum and bound state of excitonic resonances in X-ray absorption near-edge structure (XANES). In EXAFS and Near Edge Structure III. Springer Proceedings in Physics, 2. 38–42 (Springer, 1984).
Arcon, I., Mirtic, B. & Kodre, A. Determination of valence states of chromium in calcium chromates by using X-ray absorption near-edge structure (XANES) spectroscopy. J. Am. Ceram. Soc. 81, 222–224 (1998).
Glatzel, P., Smolentsev, G. & Bunker, G. The electronic structure in 3d transition metal complexes: can we measure oxidation states? J. Phys. Conf. Ser. 190, 012046 (2009).
Chaboy, J., Munoz-Paez, A., Carrera, F., Merkling, P. & Marcos, E. S. Ab initio x-ray absorption study of copper K-edge XANES spectra in Cu(II) compounds. Phys. Rev. B 71, 134208 (2005).
Zheng, C., Chen, C., Chen, Y. & Ong, S. P. Random forest models for accurate identification of coordination environments from X-ray absorption near-edge structure. Patterns 1, 100013 (2020).
Liu, Y. et al. Mapping XANES spectra on structural descriptors of copper oxide clusters using supervised machine learning. J. Chem. Phys. 151, 164201 (2019).
Timoshenko, J., Lu, D. Y., Lin, Y. W. & Frenkel, A. I. Supervised machine-learning-based determination of three-dimensional structure of metallic nanoparticles. J. Phys. Chem. Lett. 8, 5091–5098 (2017).
Rankine, C. D., Madkhali, M. M. M. & Penfold, T. J. A deep neural network for the rapid prediction of X-ray absorption spectra. J. Phys. Chem. A 124, 4263–4270 (2020).
Martini, A. et al. PyFitit: The software for quantitative analysis of XANES spectra using machine-learning algorithms. Comput. Phys. Commun. 250, 107064 (2019).
Schmidt, J., Marques, M. R. G., Botti, S. & Marques, M. A. L. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater. 5, 83 (2019).
Trejo, O. et al. Elucidating the evolving atomic structure in atomic layer deposition reactions with in situ XANES and machine learning. Chem. Mater. 31, 8937–8947 (2019).
Carbone, M. R., Yoo, S., Topsakal, M. & Lu, D. Y. Classification of local chemical environments from x-ray absorption spectra using supervised machine learning. Phys. Rev. Mater. 3, 033604 (2019).
Carbone, M. R., Topsakal, M., Lu, D. Y. & Yoo, S. Machine-learning X-ray absorption spectra to quantitative accuracy. Phys. Rev. Lett. 124, 156401 (2020).
Torrisi, S. B. et al. Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships. npj Comput. Mater. 6, 109 (2020).
Sot, P. et al. Non-oxidative methane coupling over silica versus silica-supported iron(II) single sites. Chem. Eur. J. 26, 8012–8016 (2020).
Pak, C., Bell, A. T. & Tilley, T. D. Oxidative dehydrogenation of propane over vanadia-magnesia catalysts prepared by thermolysis of OV(O^{t}Bu)_{3} in the presence of nanocrystalline MgO. J. Catal. 206, 49–59 (2002).
Coperet, C. et al. Surface organometallic and coordination chemistry toward single-site heterogeneous catalysts: strategies, methods, structures, and activities. Chem. Rev. 116, 323–421 (2016).
Bugaev, A. L. et al. Temperature- and pressure-dependent hydrogen concentration in supported PdHx nanoparticles by Pd K-edge X-ray absorption spectroscopy. J. Phys. Chem. C 118, 10416–10423 (2014).
Bugaev, A. L., Srabionyan, V. V., Soldatov, A. V., Bugaev, L. A. & van Bokhoven, J. A. The role of hydrogen in formation of Pd XANES in Pd-nanoparticles. J. Phys. Conf. Ser. 430, 012028 (2013).
Bugaev, A. L. et al. Hydride phase formation in carbon supported palladium hydride nanoparticles by in situ EXAFS and XRD. J. Phys. Conf. Ser. 712, 012032 (2016).
Zhang, L. J. et al. Extraction of local coordination structure in a low-concentration uranyl system by XANES. J. Synchrotron Rad. 23, 758–768 (2016).
Bailey, D. H. Integer relation detection. Comput. Sci. Eng. 2, 24–28 (2000).
Bailey, D. H. et al. Experimental Mathematics in Action (CRC Press, 2007).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005).
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
Giuli, G., Paris, E., Pratesi, G., Koeberl, C. & Cipriani, C. Iron oxidation state in the Fe-rich layer and silica matrix of Libyan Desert Glass: a high-resolution XANES study. Meteorit. Planet. Sci. 38, 1181–1186 (2003).
Berry, A. J., O’Neill, H. S., Jayasuriya, K. D., Campbell, S. J. & Foran, G. J. XANES calibrations for the oxidation state of iron in a silicate glass. Am. Mineral. 88, 967–977 (2003).
Giuli, G., Eeckhout, S. G., Paris, E., Koeberl, C. & Pratesi, G. Iron oxidation state in impact glass from the K/T boundary at Beloc, Haiti, by high-resolution XANES spectroscopy. Meteorit. Planet. Sci. 40, 1575–1580 (2005).
Wang, L. et al. Local structure of iron in tektites and natural glass: an insight through X-ray absorption fine structure spectroscopy. J. Mineral. Petrol. Sci. 108, 288–294 (2013).
Holland, A. W. et al. New Fe/SiO2 materials prepared using diiron molecular precursors: synthesis, characterization and catalysis. J. Catal. 235, 150–163 (2005).
Artemieva, N. High-velocity impact ejecta: tektites and martian meteorites. In Catastrophic Events Caused by Cosmic Objects 267–289 (Springer, Dordrecht, 2008).
Moretti, R. & Ottonello, G. Polymerization and disproportionation of iron and sulfur in silicate melts: insights from an optical basicity-based approach. J. Non Cryst. Solids 323, 111–119 (2003).
Lukanin, O. A. & Kadik, A. A. Decompression mechanism of ferric iron reduction in tektite melts during their formation in the impact process. Geochem. Int. 45, 857–881 (2007).
Dyar, M. D., McCanta, M., Breves, E., Carey, C. J. & Lanzirotti, A. Accurate predictions of iron redox state in silicate glasses: a multivariate approach using X-ray absorption spectroscopy. Am. Mineral. 101, 744–747 (2016).
Wilke, M., Farges, F. O., Petit, P.E., Brown, G. E. Jr. & Martin, F. O. Oxidation state and coordination of Fe in minerals: an Fe K-XANES spectroscopic study. Am. Mineral. 86, 714–730 (2001).
Giuli, G., Pratesi, G., Cipriani, C. & Paris, E. Iron local structure in tektites and impact glasses by extended X-ray absorption fine structure and high-resolution X-ray absorption near-edge structure spectroscopy. Geochim. Cosmochim. Acta 66, 4347–4353 (2002).
Giuli, G. Tektites and microtektites iron oxidation state and water content. Rend. Lincei Sci. Fis. Nat. 28, 615–621 (2017).
Giuli, G., Eeckhout, S. G., Koeberl, C., Pratesi, G. & Paris, E. Yellow impact glass from the K/T boundary at Beloc (Haiti): XANES determination of the Fe oxidation state and implications for formation conditions. Meteorit. Planet. Sci. 43, 981–986 (2008).
Kravtsova, A. N. et al. Iron oxidation state of impact glasses from the Zhamanshin crater studied by X-ray absorption spectroscopy. Radiat. Phys. Chem. 175, 108097 (2020).
Joly, Y. X-ray absorption near-edge structure calculations beyond the muffin-tin approximation. Phys. Rev. B 63, 125120 (2001).
Guda, S. A. et al. Optimized finite difference method for the full-potential XANES simulations: application to molecular adsorption geometries in MOFs and metal-ligand intersystem crossing transients. J. Chem. Theory Comput. 11, 4512–4521 (2015).
Reiher, M., Salomon, O. & Hess, B. A. Reparameterization of hybrid functionals based on energy differences of states of different multiplicity. Theor. Chem. Acc. 107, 48–55 (2001).
Guerra, C. F., Snijders, J. G., te Velde, G. & Baerends, E. J. Towards an order-N DFT method. Theor. Chem. Acc. 99, 391–403 (1998).
te Velde, G. et al. Chemistry with ADF. J. Comput. Chem. 22, 931–967 (2001).
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
Fasshauer, G. E. Meshfree Approximation Methods with MATLAB Vol. 6 (World Scientific, 2007).
Myers, D. E. Smoothing and interpolation with radial basis functions. In Boundary Element Technology XIII: Incorporating Computational Methods and Testing for Engineering Integrity 2, 365–374 (WIT Press, 1999).
Wendland, H. Computational aspects of radial basis function approximation. In Studies in Computational Mathematics, Vol. 12, 231–256 (Elsevier, 2006).
Beachkofski, B. & Grandhi, R. Improved distributed hypercube sampling. In 43rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference (2002).
Fuhg, J. N., Fau, A. & Nackenhorst, U. State-of-the-art and comparative review of adaptive sampling methods for kriging. Arch. Comput. Methods Eng. 28, 2689–2747 (2021).
Liu, H., Ong, Y.S. & Cai, J. A survey of adaptive sampling for global metamodeling in support of simulation-based complex engineering design. Struct. Multidiscipl. Optim. 57, 393–416 (2018).
Acknowledgements
A. Guda acknowledges financial support from the Russian Foundation for Basic Research (project number 203270227) for the work on the multicomponent mixtures. A. Bugaev and A.V. Soldatov acknowledge the Russian Science Foundation grant #204301015 for financial support of the work on the spectral descriptors. The authors acknowledge D.D. Badyukov from the Vernadsky Institute of Geochemistry and Analytical Chemistry of the Russian Academy of Sciences for providing samples for analysis. P. Šot acknowledges Shell Global Solutions International B.V. for funding the work on the synthesis of the Fe-containing catalyst, the European Synchrotron Radiation Facility for awarded beamtimes at beamlines ID26 and BM25, and the Swiss Light Source for the beamtime at the SuperXAS beamline.
Author information
Contributions
A.A.G., A.M., and S.A.G. contributed equally and developed the concept of the approach, selected a set of descriptors for spectra, and wrote the manuscript. A.A. contributed to the development of the open-source PyFitIt code within the Jupyter Notebook interface. A.V.S. and A.B. designed the theoretical structures, calculated training sets, and performed calibration. A.N.K., L.V.G., and S.P.K. performed experimental characterization of references, studied tektites and impactites, and interpreted the results of the machine learning analysis for these samples. P.Š., J.A.v.B., and C.C. performed the synthesis and measurements of Fe K-edge XANES for the Fe@SiO_{2} catalyst. A.V.S. supervised the project and provided guidance. All authors provided contributions to the manuscript and discussed the results.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Guda, A.A., Guda, S.A., Martini, A. et al. Understanding X-ray absorption spectra by means of descriptors and machine learning algorithms. npj Comput Mater 7, 203 (2021). https://doi.org/10.1038/s41524021006649