Introduction

X-ray absorption spectroscopy is widely employed to probe the local atomic and electronic structure around the absorbing atom1,2,3. The X-ray absorption near-edge structure (XANES), spanning a region of 50–200 eV above the absorption edge, carries information about structural descriptors such as bond distances and angles, the type of surrounding ligands, and the oxidation state. These in turn determine the spectral descriptors: the edge position and the shapes and positions of spectral maxima and minima. An experienced researcher can, for example, distinguish a pure metallic state from a metal oxide, or discriminate between tetrahedral and octahedral surroundings, based on a qualitative inspection of the related spectral features.

Figure 1 shows a series of typical experimental Cu K-edge XANES spectra for different copper compounds. The pre-edge feature A originates from the transition to the spatially localized 3d states. The pre-edge shape depends on the number of electrons in the d-shell4, its intensity is proportional to the amount of 3d–4p hybridization5, while its energy position can be employed to realize the calibration of the 3d metal oxidation state6. The sharp shoulder B on the rising edge is indicative of a linear or square planar geometry with a lower energy of empty 4p orbitals perpendicular to the chemical bonds7. A similar shoulder appears in the spectra of metals. K-edge XANES of metals with an fcc structure is further characterized by the splitting of the main peak into M1 and M2 features. Intensities of M2 and M3 are sensitive to the scattering from the second coordination shell8 similar to the feature D in molecular covalent complexes, and their reduction can be therefore used to probe the nanosized effects9. Positions of M4 and further high-energy maxima relate to the interatomic distances via the semi-empirical Natoli’s rule10. The absorption edge position depends on the oxidation state11 and also interatomic distances12. The intensity of the white line C is higher in spectra of metal complexes with the octahedral coordination. Planar complexes are characterized by energy splitting of this peak13. Characteristic spectral features can be further established for the K-edges of light atoms, L2,3 edges for 3d metals with strong multiplet splitting, or L2,3 spectra for 4d metals possessing a characteristic white line.

Fig. 1: Characteristic spectral features for different structural motifs.

Cu K-edge XANES spectra for different oxidation states and local coordinations of the copper atom (from bottom to top): Cu0 in fcc metal, linear CuI in Cu2O, pseudo-tetrahedral CuI and CuII in Cu(phen)2 and [CuCl4]2− complexes, square planar CuII in CuO, pseudo-octahedral CuII in [Cu(H2O)6]2+.

For a data scientist, the above-mentioned spectral features are recognized as descriptors, and the relationships between spectral descriptors and the structural ones (coordination number, geometry, bond distances, angles…) can be established, for example, by using machine learning (ML) algorithms. Using all points of a spectrum as descriptors, Zheng et al.14 managed to classify the atomic coordination environments via random forest models. The convolutional neural network was applied to predict Cu–Cu coordination numbers (CN)15 and to evaluate several CNs for platinum nanoparticles to refine their sizes and shapes16. Rankine et al.17 demonstrated the ability of a deep neural network to predict a XANES spectrum from geometric information about the local environment around the absorbing atom. To achieve better performance of ML, the dimensionality of both spectral and structural descriptors should be reduced. For example, 3 N atomic coordinates for N atoms can be converted into radial and angle distribution functions18 or into generalized radial distribution functions19. Alternatively, geometry parameters can be filtered in terms of their importance for XANES variation20. The multiple points of a spectrum can also be reduced to only a few descriptors. Commonly used spectral descriptors are the pre-edge centroid and the pre-edge area which are, for example, combined to analyze the Fe oxidation state and coordination number6. Carbone et al. demonstrated that the principal components calculated from a series of theoretical spectra can be used to realize the classification of four-, five-, and six-coordinated metal environments21 and of type of functional groups22. Recently, Torrisi et al.23 demonstrated the concept of constructing descriptors from a polynomial fit of equidistant energy intervals of the spectrum.

The present work aims to extend these approaches of identifying descriptors of the spectral features, such as positions of the absorption edge, minima, and maxima, their amplitudes, and curvatures. We present a step-by-step procedure to prepare a training set, evaluate the descriptors, train the machine learning algorithm, apply cross-validation, and finally analyze the experimental data. The variable structural parameters are introduced and the problem of classification of the calculated spectra in terms of these parameters is addressed. The analytical formulas establishing the relations between the spectral features and the structural parameters are then derived. Finally, we validate the approach for a set of experimental spectra belonging to oxides, silicates, geological samples (tektites, impactites), and amorphous glasses as well as silica-supported Fe single-site catalysts prepared via surface organometallic chemistry24,25,26.

Results and discussion

Descriptors of spectrum

In general, a theoretical XANES spectrum contains ~100 energy points. A common approach to improve the efficiency of ML algorithms is to reduce the dimensionality of such an object by extracting only informative features, notably the spectral descriptors23. Table 1 and Fig. 2 describe the set of descriptors evaluated for each spectrum: edge position (feature A), white line position and intensity (feature B), position and intensity of the first pit, i.e., minimum (feature C), the curvature of the white line, and projections on the principal components (further called PC descriptors). An arctangent function (red dotted line) was fitted to the whole spectrum, and the position of its center and its slope were taken as the edge position (EdgeE) and edge slope (Edgeslope). For some deformations of the local geometry, the white line in the calculated spectra can consist of several close maxima. To obtain a monotonic variation of descriptors across the training set, we performed an additional convolution (5 eV Lorentzian width) of the spectral regions near extrema B and C before evaluating the curvature, amplitude, and energy position of these features. The same convolution was applied to the experimental spectra before the evaluation of their descriptors.
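The arctangent fit used for EdgeE and Edgeslope can be sketched as follows. This is only an illustration under stated assumptions: the function names, the starting values, and the synthetic spectrum are hypothetical, and `scipy.optimize.curve_fit` stands in for whatever fitting routine the supplementary software uses.

```python
import numpy as np
from scipy.optimize import curve_fit

def arctan_edge(energy, e0, width, height, offset):
    """Arctangent step centered at e0; a smaller width gives a steeper edge."""
    return offset + height * (0.5 + np.arctan((energy - e0) / width) / np.pi)

def fit_edge(energy, mu):
    """Return (edge position, slope at the edge center) from an arctangent fit."""
    # Starting point: center at the half-height crossing, 2 eV width (assumed).
    half = 0.5 * (mu.min() + mu.max())
    e0_guess = energy[np.argmax(mu >= half)]
    p0 = [e0_guess, 2.0, mu.max() - mu.min(), mu.min()]
    (e0, width, height, _), _ = curve_fit(arctan_edge, energy, mu, p0=p0)
    slope = height / (np.pi * width)  # derivative of the step at energy = e0
    return e0, slope

# Synthetic spectrum (hypothetical): a step at 7120 eV plus a Gaussian "white line".
energy = np.linspace(7100.0, 7180.0, 161)
mu = arctan_edge(energy, 7120.0, 1.5, 1.0, 0.0)
mu += 0.2 * np.exp(-0.5 * ((energy - 7128.0) / 3.0) ** 2)
edge_e, edge_slope = fit_edge(energy, mu)
```

Because the step is fitted to the whole spectrum, the white line pulls the fitted center slightly away from the pure-step value, which is why the same procedure should be applied to theoretical and experimental spectra alike.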

Table 1 Descriptors of spectra with their short notation and details on the evaluation.
Fig. 2: A set of suggested descriptors for a XANES spectrum.

Edge energy and edge slope are evaluated at point A; white line intensity, curvature, and position at B; pit intensity, curvature, and position at C. An arctangent function (red dotted line) is used to determine the edge position and its slope.

Principal component analysis (PCA) was applied to the whole data set of theoretical spectra. The first three principal components were evaluated by singular value decomposition (SVD; see details in the Supplementary Methods section). Each spectrum was projected on these components, and the projections were used as descriptors of the spectrum. We also applied the SVD analysis to the data set in which all edge positions (point A in Fig. 2) were aligned. This yielded another set of three principal components, which was used to calculate projections of every theoretical spectrum. We call these projections relative PC descriptors (rPC).
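A minimal sketch of the PC descriptors, assuming the spectra are stored row-wise on a common energy grid and are mean-centered before the SVD (the centering and the toy data are assumptions, not details taken from the Supplementary Methods):

```python
import numpy as np

def pc_descriptors(spectra, n_components=3):
    """Project each spectrum (one per row) on the leading principal components."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # Rows of vt are orthonormal principal components ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean

rng = np.random.default_rng(0)
energy = np.linspace(0.0, 100.0, 100)
# Toy data set: 50 arctangent-like edges with random position and width.
centers = rng.uniform(20.0, 30.0, size=50)
widths = rng.uniform(1.0, 3.0, size=50)
spectra = 0.5 + np.arctan((energy[None, :] - centers[:, None]) / widths[:, None]) / np.pi
scores, components, mean = pc_descriptors(spectra)
# A new (e.g., experimental) spectrum is projected with the same mean and components.
new_scores = (spectra[0] - mean) @ components.T
```

The rPC variant would follow the same steps after shifting every spectrum so that its edge position coincides with a common reference energy.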

Compared with descriptors based on the curvature of fixed energy intervals of the spectrum, as used by Torrisi et al.23, the search for minima and maxima along with edge characteristics relies on physically motivated features of the spectrum. As we show below, good prediction quality can be achieved by using just 2 or 3 such descriptors. A stable definition of the extrema may be difficult for flattened spectra or for L2,3 edges with rich multiplet splitting. Such systems may require additional spectral descriptors such as total variance, centers of mass and areas, fitted peak profiles, etc. These descriptors are beyond the scope of the present work but are included in the supplementary software.

Relationship between spectral features and structure

3d metal complexes can be found in a wide range of CNs and local symmetries around the metal center. The type of ligands determines the interatomic distances and symmetry for the given oxidation and spin state of the metal. Valuable catalysts or geological materials can contain iron ions in a silica matrix where oxygen coordination provides both Fe2+ and Fe3+ oxidation states along with several possible CNs. To address the problem of quantitative iron speciation, we calculated the training set consisting of spectra for Fe(SiO4)CN complexes for CN = 2–6. The first shell distances and bond angles were varied for every CN using the improved Latin hypercube sampling (IHS) resulting in 3000 spectra calculated for all CN. A chemical shift was then applied for each spectrum to simulate absorption from Fe2+ and Fe3+ sites for every deformation. Figure 3 shows the clusters used for simulations and the variable structural parameters p. The ranges of their variation are listed in Table 2.
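The sampling of structural parameters can be illustrated with a plain Latin hypercube. Note that this sketch is not the improved scheme (IHS) used for the training set, which additionally optimizes the spacing between points; the parameter ranges below are hypothetical placeholders rather than the values of Table 2.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, rng):
    """Plain Latin hypercube: one sample per stratum along every parameter axis."""
    bounds = np.asarray(bounds, dtype=float)
    n_dims = len(bounds)
    # Jittered position inside each of the n_samples strata, per dimension.
    unit = (np.arange(n_samples)[:, None] + rng.random((n_samples, n_dims))) / n_samples
    # Shuffle the strata independently in each dimension.
    for d in range(n_dims):
        unit[:, d] = rng.permutation(unit[:, d])
    low, high = bounds[:, 0], bounds[:, 1]
    return low + unit * (high - low)

# Hypothetical ranges: Fe-O distance (Å) and two bending angles (degrees).
bounds = [(1.8, 2.2), (-15.0, 15.0), (-15.0, 15.0)]
samples = latin_hypercube(600, bounds, np.random.default_rng(0))
```

Each row of `samples` would define one deformed cluster for which a spectrum is calculated; stratification guarantees that every parameter range is covered evenly even with a few hundred points.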

Fig. 3: The structures and their deformations applied for constructing the training set.

Fe(SiO4)CN clusters constructed for coordination numbers CN = 2, 3, …, 6. The variable structural deformations p1...p5 are applied to each structure and reproduce the variety of iron local geometries in amorphous silica.

Table 2 The ranges of variation for structural parameters p1, p2, p3, p4.

The calculated spectra for Fe2+ are shown in Fig. 4. The library of spectra for Fe3+ contains the same entries but shifted according to the 1 s core level energy difference evaluated within an accurate molecular orbital approach (see Methods section). Therefore, the total number of spectra in the training set was 6000, i.e. twice more than shown in Fig. 4.
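Doubling the library by a rigid energy shift can be sketched as below. The shift value is a placeholder (the actual value comes from the 1s core-level calculation described in the Methods section), and the toy spectrum and function name are hypothetical.

```python
import numpy as np

def shift_spectrum(energy, mu, delta_e):
    """Return mu shifted by delta_e (eV) and resampled on the original energy grid."""
    # Values requested outside the calculated range are held at the end points.
    return np.interp(energy - delta_e, energy, mu)

energy = np.linspace(7100.0, 7200.0, 201)
mu_fe2 = 0.5 + np.arctan((energy - 7120.0) / 1.5) / np.pi  # toy Fe2+ spectrum
DELTA_E = 2.0  # eV, placeholder; not the value used in the paper
mu_fe3 = shift_spectrum(energy, mu_fe2, DELTA_E)
library = np.vstack([mu_fe2, mu_fe3])  # each deformation contributes two entries
valence = np.array([2, 3])             # oxidation-state label of each entry
```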

Fig. 4: Visualization of the theoretical training set and the trends in variations of XANES spectra upon studied deformations.

a Six hundred Fe K-edge XANES spectra calculated for each coordination number by varying structural parameters (Fig. 3). b Comparison of spectral shapes for different coordination numbers with all Fe-O bond lengths fixed to 2.1 Å. c Sensitivity of the spectrum to variations of bending angles and interatomic distance for a five-coordinated model (only the first coordination shell is shown for simplicity).

Each spectrum in the training set is characterized by several descriptors (Table 1). The whole training set can be projected on a 2D plot for the selected pair of descriptors. Figure 5 compares the distribution of points in the training set over different 2D maps where each point is colored according to its structural parameters. From a mathematical point of view, each color in Fig. 5 defines a class. If points for different classes are well separated on the 2D map, the chosen pair of descriptors is appropriate for the classification. For demonstration, we selected those pairs of the descriptors which separated points according to iron coordination number, Fe‒O distances, or oxidation state. In particular, the best descriptors for discriminating different CN were the curvature of the white line (WLcurv), pit energy position (PitE), edge position (EdgeE). The average distances in the iron first coordination shell could be distinguished according to the energy positions of the pit (PitE) and edge (EdgeE). Projections on the principal components were able to separate structures with different distances and oxidation states, while pit energy and white line position (WLE) distinguished between structures with different oxidation states.

Fig. 5: Scatter plots for the selected pairs of descriptors.

Each point corresponds to a single spectrum in the training set from Fig. 4a. The color reflects the CN values in a, b, the average Fe-O distance in c, d, and iron valence in e, f.

Beyond the two-dimensional scatter plots, which are informative for the qualitative selection of good descriptors, the best quality of classification and the best choice of descriptors for ML algorithm was determined (Table 3) for combinations of 1, 2, 3, or 4 descriptors to predict CN, oxidation state, or distance in a pure compound (mixtures will be discussed further in section 2.4).

Table 3 The quality of structural parameters prediction by using selected good combinations of the descriptors of spectra from data set in Fig. 4a.

Two descriptors of a spectrum contain up to 95% of the information necessary for discrimination between Fe2+ and Fe3+. The edge energy alone provided a prediction quality of 80%. Adding the white line intensity to EdgeE improved the quality to 90%. Other informative descriptors for the oxidation state were the energies of the main maximum and of the pit, and the first principal components. The mean Fe‒O distance is well characterized by the combination of edge and pit energies (95% quality). Projections on the second and third relative principal components, rPC2 and rPC3, were more important for this task than the first PC. Higher CNs are characterized by a sharp white line and a steep rising edge. Good prediction quality for CN requires the use of at least three descriptors, which include the edge energy, the edge slope, and the curvature of the main maximum. The lowest accuracy in the cross-validation analysis was observed for the standard deviation from the mean, which measures the disorder in the first coordination shell: four descriptors were necessary to reach a prediction quality of 90%.
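The search over descriptor combinations can be sketched as an exhaustive loop with cross-validation. The sketch assumes scikit-learn's `ExtraTreesRegressor` as the learner and the R2 score as the quality measure; the descriptor matrix is synthetic, and the dependence of the target on EdgeE and PitE is built in by hand purely for illustration.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

def rank_descriptor_sets(X, y, names, size, cv=5):
    """Cross-validated R2 for every combination of `size` descriptors, best first."""
    results = []
    for cols in combinations(range(X.shape[1]), size):
        model = ExtraTreesRegressor(n_estimators=50, random_state=0)
        score = cross_val_score(model, X[:, list(cols)], y, cv=cv, scoring="r2").mean()
        results.append((score, tuple(names[c] for c in cols)))
    return sorted(results, reverse=True)

# Synthetic stand-in for a descriptor table (one row per spectrum).
rng = np.random.default_rng(0)
names = ["EdgeE", "WLint", "PitE", "WLcurv"]
X = rng.normal(size=(300, 4))
# Hypothetical target that depends only on EdgeE and PitE.
y = X[:, 0] + 2.0 * X[:, 2] + 0.05 * rng.normal(size=300)
ranking = rank_descriptor_sets(X, y, names, size=2)
best_score, best_pair = ranking[0]
```

The number of combinations grows quickly with the number of descriptors, so an exhaustive search of this kind is practical only for small subset sizes such as the 1 to 4 descriptors considered here.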

The optimal choice of the descriptors in Table 3 does not guarantee their transferability to the experimental data and problem of the multicomponent system analysis. In Section 2.4, we address the quality of structural analysis by using descriptors in the training set composed of linear combinations of spectra.

Analytical relations between descriptors: beyond Natoli’s rule

In the early eighties, Natoli formulated an empirical rule10 that relates peak positions in the XANES spectrum to interatomic distances for structures with similar symmetry. The rule applies, for example, to metals within the same space group (e.g., fcc Cu and Ni, Supplementary Fig. 3) and to structures that undergo a volume expansion, such as palladium after hydrogen sorption27. In the latter case, we have previously observed that the relative intensities of the first two XANES maxima are proportional to the H/Pd ratio in palladium hydride samples28,29. Another example by Zhang et al.30 provides an analytical relation between the energy positions of maxima in U L3-edge XANES spectra of uranyl complexes and the distances between the uranium absorber and oxygen ligand atoms. Although they are useful tools for the analysis of XANES spectra, all these examples are limited to one spectral descriptor and one descriptor of structure. In this section, we extend this methodology to derive analytical relations between any set of spectral descriptors and structural parameters using a machine learning algorithm. The common approach to finding simple analytical relations between known parameters p1, …, pn and a target variable y is to construct a linear regression:

$$y = w_1\cdot p_1 + w_2\cdot p_2 + \ldots + w_n\cdot p_n$$
(1)

More complex cases include pairwise and higher-degree products of the parameters p1, …, pn. We are interested in pretty solutions with good approximation quality. Here, prettiness means the absence of large coefficients wi and the smallest possible number of nonzero wi. For the integer relations problem, prettiness is achieved by applying special algorithms of integer orthogonalization (see e.g., ref. 31 and §2.2 in ref. 32). In the real-valued case, we use the feature-selecting properties of the Elastic Net algorithm33 combined with some heuristics. For the theoretical data set, we restrict ourselves to the parameters p1, …, pn and their pairwise products, so that Eq. (1) takes the form

$$y = \sum_{i=1}^{n} w_i\cdot p_i + \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\cdot p_i\cdot p_j$$
(2)

In the first step, the data were normalized to zero mean and unit standard deviation. We implemented the elastic net method that includes the LASSO34 and ridge regression. In the case of the group of highly correlated variables, the LASSO algorithm tends to select one variable from a group and ignores the others, thus making feature selection. If the linear formula returned by Elastic Net is heavy, we try to simplify it at the expense of model precision. To do so, we sort the coefficients (wi, wij) returned by Elastic Net by their absolute values and try to build a linear model based on subsets of features with the largest absolute coefficients. The analysis was performed for the subsets of each size: 1, 2, 3,…, and for all of them R2-score was evaluated. Afterward, one can choose between pretty models with moderate quality or more complicated models with higher precision. Table 4 shows the selected analytical relations between descriptors of spectra and structural parameters.
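The procedure above can be sketched with scikit-learn. The regularization settings, the toy data, and the use of the training-set R2 (rather than a cross-validated score) are assumptions made for brevity; the products p_i·p_j are generated for i ≤ j, which is equivalent to the double sum in Eq. (2) up to a relabeling of the coefficients.

```python
from itertools import combinations_with_replacement

import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

def quadratic_features(P, names):
    """Columns p_i and all products p_i * p_j (i <= j), as in Eq. (2)."""
    cols, labels = list(P.T), list(names)
    for i, j in combinations_with_replacement(range(P.shape[1]), 2):
        cols.append(P[:, i] * P[:, j])
        labels.append(f"{names[i]}*{names[j]}")
    return np.column_stack(cols), labels

def simplified_models(P, y, names, alpha=0.05, l1_ratio=0.5):
    """Rank terms by |Elastic Net coefficient|, then refit nested linear models."""
    F, labels = quadratic_features(P, names)
    F = (F - F.mean(axis=0)) / F.std(axis=0)  # zero mean, unit standard deviation
    y = (y - y.mean()) / y.std()
    coef = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(F, y).coef_
    order = [k for k in np.argsort(-np.abs(coef)) if coef[k] != 0.0]
    models = []
    for size in range(1, len(order) + 1):
        subset = order[:size]
        r2 = LinearRegression().fit(F[:, subset], y).score(F[:, subset], y)
        models.append(([labels[k] for k in subset], r2))
    return models

# Toy data with a known relation: y = 3*p1 - 2*p2*p3 + noise (hypothetical).
rng = np.random.default_rng(0)
P = rng.normal(size=(500, 3))
y = 3.0 * P[:, 0] - 2.0 * P[:, 1] * P[:, 2] + 0.1 * rng.normal(size=500)
models = simplified_models(P, y, ["p1", "p2", "p3"])
```

Each entry of `models` pairs a list of retained terms with its R2 score, so one can pick the shortest formula whose score is acceptable.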

Table 4 Analytical relations between descriptors of spectra and descriptors of structure.

Analytical relations between descriptors extend the qualitative classification of the 2d scatter plots. The obtained formulas explore dependencies between any number of spectral features and structural parameters. While, in general, ML algorithms work as a black box, Table 4 provides a geometrical interpretation of the best combinations of descriptors. For example, up to 90% prediction quality can be achieved for the interatomic distances if the energy positions of the edge, first maximum, and minimum are considered.

The accuracy of the quadratic analytical formulas for the oxidation state is above 80%. The quality of the analysis could be improved if chemically relevant restrictions were imposed on the Fe‒O distances for Fe2+ or Fe3+ ions in the training set. For better generalization, we assumed that the ranges of variation of the structural parameters were equal for both oxidation states. As a result, the chemical shift of the whole spectrum can be confused with the edge shift caused by distance variation. This effect is partially accounted for by the main maximum intensity (WLint) descriptor that enters the formula. The intensity of the main maximum changes as the Fe‒O bonds contract; therefore, this descriptor can help to discriminate between shifts related to the oxidation state and those related to volume changes. Formulas for CN depend on the curvature of the main maximum, which is consistent with the general behavior of EXAFS oscillations, whose amplitude is proportional to CN. One should note, however, that this conclusion should not be generalized to structures with different types of bonds (e.g., metallic iron has a larger CN, but the white line intensity is higher in the octahedral Fe-O oxide).

The second part of Table 4 interprets the features of the XANES spectrum in terms of geometry parameters. The slope of the edge depends on the average distances and the coordination number. The curvature of the white line correlates with the disorder in the first coordination shell of iron, i.e., larger disorder makes the first maximum broader. The position of the first minimum is an important feature of the spectrum, although it is analyzed less often than the maxima. This feature is almost 90% determined by the CN and the Fe‒O distance. Its intensity is determined by the CN and the disorder in the first coordination shell.

Fitting a multi-component system

If the distribution of absorbing atoms in a material is heterogeneous, a linear combination of theoretical spectra with different oxidation states and coordinations is required to describe the experimental spectrum. In this section, we extend the descriptor approach to the case of linear combinations and apply the descriptor analysis to experimental Fe K-edge XANES data of iron oxide and iron silicate systems. The algorithm was applied to 56 experimental spectra of crystalline compounds35, glasses36, tektites, and impactites37,38, as well as a single-site silica-supported Fe catalyst24,39. Figure 6 shows the experimental spectra, and Supplementary Tables 2–6 provide a description of each sample and the results of the ML analysis. Iron coordination and oxidation state depend heavily on the conditions of synthesis, and the studied samples are inherently heterogeneous systems. In particular, tektites are formed from molten high-speed ejecta during the early stages of impact crater formation40. Impactites have a more complex formation history and result from the melting of various types of rocks located at different depths in the Earth’s crust. Iron in the amorphous silica structure has the potential to be a probe of impact rock formation conditions, such as pressure (P), temperature (T), and oxygen fugacity41,42.

Fig. 6: Overview of the experimental validation dataset.

The types of systems covered by the theoretical Fe(SiO4)CN training set (a) and the series of analyzed experimental Fe K-edge XANES spectra for (b) glasses, tektites, impactites, (c) single-site silica-supported Fe catalyst, (d) crystalline minerals. Only the energy intervals shown were used for the analysis of Fe valence, Fe-O distances, and coordination numbers. See the complete list of studied samples and their description in Supplementary Tables 2–5.

Figure 7 shows the steps required to apply the descriptor approach to the multicomponent system. We have constructed a database of linear combinations of theoretical spectra in the training set, using several random concentrations for every pair of spectra. The descriptors were then evaluated for the database. A cross-validation procedure was applied to different combinations of the descriptors to understand which combination works better for the mixtures.
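The construction of the mixture database can be sketched as follows. The sketch assumes that the label of a mixture is the concentration-weighted average of the labels of its two components; the random arrays stand in for the calculated spectra and their structural parameters, and all names are hypothetical.

```python
import numpy as np

def mix_pairs(spectra, labels, n_pairs, n_conc, rng):
    """Build two-component mixtures and concentration-weighted labels."""
    n = len(spectra)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    # Several random concentrations c of component i for every pair (i, j).
    c = rng.random((n_pairs, n_conc))[..., None]
    mixed = c * spectra[i][:, None, :] + (1.0 - c) * spectra[j][:, None, :]
    mixed_labels = c * labels[i][:, None, :] + (1.0 - c) * labels[j][:, None, :]
    return mixed.reshape(-1, spectra.shape[1]), mixed_labels.reshape(-1, labels.shape[1])

rng = np.random.default_rng(0)
spectra = rng.random((20, 100))  # stand-in for calculated spectra on a common grid
labels = np.column_stack([
    rng.integers(2, 7, 20),      # coordination number, 2..6
    rng.choice([2, 3], 20),      # oxidation state
    rng.uniform(1.8, 2.2, 20),   # mean Fe-O distance in Å (hypothetical range)
])
mixtures, mixture_labels = mix_pairs(spectra, labels, n_pairs=50, n_conc=4, rng=rng)
```

Descriptors are then evaluated on the rows of `mixtures` exactly as for pure spectra, which is what makes the classes overlap in descriptor space.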

Fig. 7: The flowchart demonstrating how the descriptor analysis was applied to the mixture of spectra.

Before training the algorithm, we normalize the descriptors to make them comparable (for a set of descriptors subtract the average and divide all entries by standard deviation over the set).

The appropriate choice of the descriptors for a given structural property should provide good quality of analysis both for the theoretical validation set and for a set of experimental references. Therefore, we calculated the descriptors of both the experimental and the calculated spectra for the reference structures. The pairs of theoretical and experimental descriptors for the known structures can be used in step 2.2 to calibrate the descriptors in the theoretical training set for systematic energy shifts or intensity differences. The calibration step for intensities may be necessary when experimental spectra are measured in fluorescence mode and are flattened owing to self-absorption. In this work, we did not apply any calibration after the convolution of theoretical spectra. In step 2.4, the reference spectra are used for validation before predicting results for the unknown structures. Figure 8 shows selected scatter plots for pairs of descriptors that can discriminate efficiently between iron oxidation state, CN, and average Fe‒O distance in the two-component mixture. While the classes were well separated in Fig. 5, they overlap in Fig. 8 owing to the linear combinations added to the training sample. We projected the spectra of several references (hollow circles) on the two-dimensional scatter plots. The reference oxides and silicates have quite different structures from the entries in the training set, but the descriptors PitE, WLE, and WLcurv provided surprisingly good quality for their analysis. α-Fe2O3 and NaFeSi2O6 were properly projected to the region of 6-coordinated species. γ-Fe2O3 and Fe3O4 contain one-third of Fe ions in the tetrahedral positions, and their points are projected to the region where 4-, 5-, and 6-coordinated points overlap (Fig. 8a). The Fe2SiO4 reference has the longest Fe‒O distances, equal to 2.2 Å, and it is properly projected to the blue region of the plot in Fig. 8b, while γ-Fe2O3 has the shortest. In Fig. 8c, Fe2O3 and NaFeSi2O6 are assigned to Fe3+, Fe2SiO4 to Fe2+, whereas Fe3O4 contains a mixture of Fe2+ and Fe3+ sites.

Fig. 8: Scatter plots for the selected pairs of descriptors for the library of mixtures.

Descriptors were calculated for the theoretical database extended with the linear combinations of the spectra from Fig. 4. The experimental data for selected references (hollow circles) were projected onto each plot. Compare to the analogous Fig. 5 constructed for pure compounds.

The classes in the training set overlap when linear combinations of spectra are introduced along with the pure species. Figure 8a shows how the CN classes are mixed compared with Fig. 5a. The points with intermediate average valence also overlap. In general, the prediction quality is 5–10% lower for mixtures than for pure compounds. The main difference was observed for the iron valence. For identical structural parameters, the oxidation state affects only the energy position of the spectra. Linear combination of spectra smears the localized distributions of Fe2+ and Fe3+ points in the scatter plots (compare Figs. 5e and 8c, respectively). Two descriptors can provide a quality of valence discrimination in the mixture of up to 80%, and the use of three or more descriptors is recommended. A better choice should consider the joint analysis of descriptors from several spectral regions. For instance, in ref. 43, a multivariate approach was applied to XANES spectra to determine the iron redox state in silicate glasses, and it was demonstrated that using the full spectral region from the pre-edge to the EXAFS provides more accurate results. Pre-edge descriptors alone can be applied to the charge state analysis as well. Wilke et al. demonstrated for Fe K-edge XANES44 that the pre-edge contains information about both the oxidation state and the coordination number. The method analyzes the 2D scatter plot of the integrated pre-edge intensities versus the pre-edge centroid positions. The reference spectra were distributed in localized regions attributed to 4-, 5-, and 6-coordinated Fe ions in oxidation states Fe2+ and Fe3+. The limitations of this methodology arise from the need for well-defined reference spectra, since pre-edge XANES simulations are still difficult for real systems; moreover, no references were reported with CNs below 4.

Tables 5 and 6 list the best combinations of descriptors in terms of their quality calculated over the whole theoretical database or over the set of experimental references. The best triples of descriptors are different for these two tasks, which can be understood from statistical considerations. The range of parameter variation in the theoretical training set is large and includes even chemically irrelevant species, e.g., Fe3+ with distances longer than in Fe2+. In contrast, the range of structural parameters covered by the experimental spectra is smaller and represents a subclass of the training set. The R2 score is evaluated in the cross-validation procedure and depends on the size of the sample and its dimensions. Therefore, Table 6 also contains the mean absolute error, evaluated along with the R2 score for the experimental validation set. The triples of good descriptors for experimental analysis are listed in Table 6 for each structural parameter: [WLE, PitE, rPC2] for CN, [WLint, Pitint, rPC2] for Fe‒O distance, [EdgeE, WLE, PC3] for valence. Figure 9 reports the predicted structural parameters for reference experimental spectra compared with their actual values. Predictions for all experimental spectra can be found in Supplementary Tables 8–10. The mean absolute errors over the validation set were 0.1 for oxidation state, 0.4 for CNs, and 0.03 Å for distances. The largest errors of the ML algorithm were observed for crystalline compounds, which have a significantly different structure from the entries in the training set. The latter was adapted for Fe:SiO2 systems and contains silicon in the second coordination shell, while some reference minerals are composed of oxygen and iron/Al/CO in the nearest coordination shells. The methodology can be directly applied to a new training set extended by ligands of different types. In this case, additional labels (e.g., atom types) should be added as descriptors of structure.

Table 5 Descriptor performance over the database of linear combinations (mixtures) of theoretical spectra from the training set in Fig. 4a.
Table 6 Descriptor performance over the database of experimental references.
Fig. 9: Experimental validation of the Extra Trees algorithm trained over the theoretical data set.

Prediction is based on three descriptors: [WLint, Pitint, rPC2] for distances (a, b), [EdgeE, WLE, PC3] for iron valence (c, d), and [WLE, PitE, rPC2] for CN (e, f). The green bars in a, c, e are the values reported in the literature, and those in b, d are the results of EXAFS and Mössbauer analysis performed independently by the present authors. The error bars in b indicate the range of uncertainties provided by the EXAFS analysis. See also the Supplementary Notes section for details.

The obtained results for the studied samples (“unknown” in Fig. 9) are in good agreement with other experimental methods. Mössbauer spectroscopy confirmed that the fraction of Fe3+ ions was larger in impactites (zhamanshinite, irghizite). A number of XAS-based38,45 and non-XAS investigations have shown that the iron oxidation state in tektites from different strew fields is close to Fe2+, with the Fe3+/ΣFe ratio generally below 0.15 (refs. 42, 46). Iron in impact glasses can cover a wider range of oxidation states37,47,48 than in tektites, from purely Fe2+ to purely Fe3+, with Fe3+/ΣFe values mainly within 0.25–0.59 (ref. 42). Fe‒O distances are generally smaller for Fe3+ ions, and we observed a similar trend for impactites as compared to tektites. The Fe CN in tektites is still disputed. EXAFS studies have reported that the mean Fe CN in tektites is close to 4 (ref. 45), whereas the coexistence of four- and five-coordinated Fe was observed in ref. 38. Our estimations fall in the range CN = 3.5–4.5, reproducing a trend similar to that of the EXAFS analysis. The absolute values of CN obtained from EXAFS analysis correlate strongly with the Debye–Waller factor and can also be affected by self-absorption effects when measurements are performed in fluorescence mode (iron catalyst samples). Therefore, in the corresponding panel of Fig. 9, we omitted the expected values of CN to avoid confusion.

Fig. 10 represents the formation process of the single-site Fe catalyst on silica. The analysis of this system implies that Fe remains in oxidation state +2 throughout the process, consistent with Mössbauer analysis and magnetic characterization24. It also shows that after grafting of the molecular precursor, dimeric Fe(II) tris(tert-butoxy) siloxide, on SiO2 dehydroxylated at 1080 °C, the coordination number of Fe, CN(Fe), remains close to 4 (Fe@SiO2 1), whereas it decreases to 3 after thermolysis at 1020 °C (Fe@SiO2 2). This is consistent with previously reported characterization data that show a similar decrease of CN(Fe), albeit to a value of 2 (ref. 24). This confirms that thermal treatment leads to Fe(II) species with a low coordination number, probably between 2 and 3. It is noteworthy that the sample prepared at lower temperature for both the hydroxylation and thermolysis steps displays Fe sites with a larger coordination number of 4.

As a concluding remark, we note that ML algorithms usually work as a “black box” for researchers, since it is difficult to understand what structural information is contained in each part of the spectrum. We approach such understanding by using selected descriptors of the spectrum instead of individual points. The whole spectrum is replaced by several descriptors that intuitively characterize its shape, i.e., the energy positions of the edge, minima, and maxima, together with their intensities and curvatures. Machine learning analysis established a rational choice of descriptor combinations that provide the highest prediction accuracy for the structural parameters, both for pure compounds and for their mixtures. To visualize the spectrum–structure relations, we use scatter plots and derive analytical dependencies between the descriptors of the spectrum and the structural parameters.
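
The idea of replacing a spectrum by a handful of shape descriptors can be sketched in a few lines of Python. The sketch below is only an illustration: the function name `spectrum_descriptors`, the half-step definition of the edge position, the three-point parabola used for the maximum, and the synthetic spectrum are all assumptions made here, not the definitions used in this work.

```python
import numpy as np

def spectrum_descriptors(energy, mu, edge_level=0.5):
    """Extract a few shape descriptors from a normalized spectrum.

    The definitions used here (half-step edge position, parabola through the
    three points around the global maximum) are illustrative assumptions.
    """
    # Edge position: first crossing of edge_level, refined by linear interpolation.
    i = int(np.nonzero(mu >= edge_level)[0][0])
    e_edge = np.interp(edge_level, [mu[i - 1], mu[i]], [energy[i - 1], energy[i]])

    # Main maximum: fit a parabola to the three points around the highest sample.
    j = int(np.argmax(mu))
    x = energy[j - 1:j + 2] - energy[j]  # shift the abscissa for numerical stability
    a, b, c = np.polyfit(x, mu[j - 1:j + 2], 2)
    return {
        "edge_energy": float(e_edge),
        "max_energy": float(energy[j] - b / (2 * a)),
        "max_intensity": float(c - b**2 / (4 * a)),
        "max_curvature": float(2 * a),  # second derivative at the maximum
    }

# Synthetic spectrum (not real data): arctangent edge step plus a Gaussian peak.
energy = np.linspace(7100.0, 7200.0, 1001)
mu = (0.5 + np.arctan((energy - 7120.0) / 2.0) / np.pi
      + 0.4 * np.exp(-((energy - 7130.0) / 4.0) ** 2))
descriptors = spectrum_descriptors(energy, mu)
print(descriptors)
```

Minima and further maxima could be handled in the same way by restricting the search to an energy window around each feature.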

A rational choice of descriptors isolates those features of the spectra that are most sensitive to specific structural parameters, avoiding a fit of the whole spectrum. The major problem in the practical application of ML methods to experimental data analysis arises from the systematic differences between theoretical calculations and measured data. This discrepancy can arise either from limitations of the theoretical approach or from experimental artefacts. The benefit of using descriptors rather than the full spectrum lies in the possibility of correcting the systematic differences by calibration on a dataset of theoretical and experimental spectra of reference compounds. However, like all methodologies based on supervised learning, our results are limited to the family of structures described by the training set. As an illustration, the algorithm was trained on the Fe–O–Si system; it will thus fail to predict proper parameters for metallic Fe or sulfide compounds, which belong to very different types of materials. This calls for expanding the training set in order to distinguish, for instance, the ligand types in addition to the coordination number and interatomic distances.

The further development of the approach is directed toward new ways of descriptor evaluation. A complete set of descriptors should provide the same amount of structural information as in a full spectrum. We foresee that a combination of descriptors from complementary experimental methods (nuclear magnetic resonance, electron paramagnetic resonance, X-ray diffraction, etc.) would significantly improve the quality of prediction.

Methods

XANES simulations and energy alignment

Fe K-edge XANES spectra were calculated utilizing the full potential finite difference method49 implemented in the FDMNES software50. The photoelectron wave functions were evaluated on a grid of points in a 5.5 Å sphere around the absorbing atom with 0.2 Å interpoint distance. To account for the core-hole lifetime broadening and instrumental energy resolution, theoretical spectra were further convoluted using the arctangent function to model the energy dependence of the Lorentzian width.
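
A minimal sketch of an energy-dependent Lorentzian broadening of this kind is given below. The functional form of the width, the parameter names (`e_fermi`, `gamma_hole`, `gamma_max`, `e_cent`, `e_larg`) and all numerical values are assumptions chosen for illustration; they are not the settings used for the calculations reported here. The sketch also assumes a uniform energy grid.

```python
import numpy as np

def arctan_width(energy, e_fermi, gamma_hole, gamma_max, e_cent, e_larg):
    """Lorentzian FWHM that rises smoothly from gamma_hole towards
    gamma_hole + gamma_max above e_fermi (assumed arctangent form)."""
    e = np.clip((energy - e_fermi) / e_cent, 1e-6, None)
    arg = (np.pi / 3.0) * (gamma_max / e_larg) * (e - 1.0 / e**2)
    return gamma_hole + gamma_max * (0.5 + np.arctan(arg) / np.pi)

def broaden(energy, mu, width):
    """Smooth mu with a Lorentzian whose FWHM depends on the output energy.

    Assumes a uniform energy grid, so plain sums replace the integrals.
    """
    out = np.empty_like(mu)
    for k, (e0, g) in enumerate(zip(energy, width)):
        kernel = 1.0 / ((energy - e0) ** 2 + (g / 2.0) ** 2)
        kernel /= kernel.sum()  # normalize on the finite grid
        out[k] = np.sum(kernel * mu)
    return out

# Toy unbroadened "spectrum": a sharp step at an illustrative edge energy.
energy = np.linspace(7090.0, 7250.0, 801)
mu_raw = (energy >= 7112.0).astype(float)
width = arctan_width(energy, e_fermi=7112.0, gamma_hole=1.25,
                     gamma_max=15.0, e_cent=30.0, e_larg=30.0)
mu_broad = broaden(energy, mu_raw, width)
```

With this form, features close to the edge are broadened mainly by the core-hole width, whereas features far above the edge are smeared much more strongly.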

For an accurate energy calibration of the spectra, the iron 1s core-level energy shifts between the Fe2+ and Fe3+ oxidation states for each coordination number were estimated within the molecular orbital approach. The energy levels and the corresponding wave functions were calculated by density functional theory using the B3LYP exchange-correlation functional51. The largest available QZ4P basis set implemented in the ADF-2019 software52,53 was used. For every coordination number in the range between two and six, we constructed a symmetric complex with Fe–O distances equal to 2 Å and evaluated transition matrix elements in the 50 eV energy interval for both the Fe2+ and Fe3+ oxidation states. The proper oxidation state was achieved by specifying the charge and spin state of the whole complex. After convergence of the self-consistent procedure was achieved, the charge states of the iron atoms were confirmed by Mulliken charge analysis. Chemical shifts of the 1s core levels were evaluated and applied to the spectrum calculated by the finite difference method. In this way, we simulated absorption from Fe2+ and Fe3+ sites for given values of the structural parameters.

Machine learning algorithms

When we apply machine learning based on spectrum descriptors (to calculate the quality of label prediction or to predict labels for experimental data), we use Extra Trees regressor or classifier models54. Such a model consists of several randomly generated decision trees. A decision tree represents a flowchart of threshold conditions on the parameters and divides the parameter space into non-intersecting rectangles. In each rectangle, for regression, the objective function μ(E, P) is approximated by a linear one using the least-squares method, and for classification, a probability table is calculated. The results obtained from the individual trees are averaged.
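
As an illustration, the following sketch trains an Extra Trees regressor from scikit-learn on synthetic descriptors. The data, the descriptor names and the hyperparameters are assumptions made for this example only. Note also that the scikit-learn trees predict a constant value in each leaf, which may differ from the variant described above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic structural label: an Fe-O distance in angstroms (not real data).
distance = rng.uniform(1.8, 2.2, size=500)

# Two synthetic descriptors that depend on the label, with some noise.
edge_shift = -10.0 * (distance - 2.0) + rng.normal(0.0, 0.1, size=500)
peak_intensity = 1.2 + 0.5 * (distance - 2.0) ** 2 + rng.normal(0.0, 0.01, size=500)
X = np.column_stack([edge_shift, peak_intensity])

# Train on the first 400 samples, evaluate on the remaining 100.
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X[:400], distance[:400])
predicted = model.predict(X[400:])
score = r2_score(distance[400:], predicted)
print(f"R2 on held-out samples: {score:.3f}")
```

The fitted model also exposes `feature_importances_`, which gives a rough indication of how much each descriptor contributes to the prediction.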

For XANES approximation (Supplementary Fig. 8) we use a supervised machine learning algorithm based on Radial Basis Functions (RBF) that constructs a continuous approximation of the spectrum, μ(E), as a function of the structural parameters P = (p1, p2, …, pk). The RBF method is a well-proven mesh-free method55,56,57. The unknown function \(\widehat {\upmu}\)(E, P) is represented in terms of a set of basis functions characterized by certain factors and polynomial terms as follows:

$$\widehat{\upmu}\left(E,\boldsymbol{P}\right) = \sum_{i=1}^{N} w_i\left(E\right)\cdot K\left(\left\lVert \boldsymbol{P} - \boldsymbol{P}_i \right\rVert\right) + \mathrm{Polynomial}_E\left(\boldsymbol{P}\right)$$
(3)

where K(r) is the radial basis function and \(\mathrm{Polynomial}_E(\boldsymbol{P})\) is a polynomial function of the k-dimensional vector of structural parameters P with energy-dependent coefficients. The training set is composed of N calculated spectra. The points (N = 600 for each structure in Fig. 3) in the space of structural parameters P were chosen according to the IHS58. The unknown factors \(w_i\) and the polynomial coefficients are obtained by the ridge quadric regression method. Every basis function is a function of the distance from the training set point \(\boldsymbol{P}_i\). In our task, good results were obtained using linear basis functions and a second-order polynomial (see also Supplementary Table 1 for comparison with other ML methods).
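
The structure of Eq. (3) can be sketched with NumPy as follows, using a linear basis function K(r) = r, a second-order polynomial, and ridge-regularized least squares. The helper names (`poly2_features`, `design_matrix`, `fit_rbf`, `predict_rbf`), the toy spectrum, the random sampling of training points and the value of the ridge parameter are assumptions for illustration; this is not the implementation used in this work.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2_features(P):
    """Columns 1, p_i and p_i * p_j (i <= j) of a second-order polynomial."""
    n, k = P.shape
    cols = [np.ones(n)] + [P[:, i] for i in range(k)]
    cols += [P[:, i] * P[:, j]
             for i, j in combinations_with_replacement(range(k), 2)]
    return np.column_stack(cols)

def design_matrix(P, centers):
    """Linear radial basis K(r) = r for every center, plus polynomial columns."""
    r = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    return np.hstack([r, poly2_features(P)])

def fit_rbf(P_train, spectra, ridge=1e-8):
    """Ridge-regularized least squares for the weights w_i(E) and the
    polynomial coefficients, solved for all energy points at once."""
    A = design_matrix(P_train, P_train)
    n_cols = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(ridge) * np.eye(n_cols)])
    Y_aug = np.vstack([spectra, np.zeros((n_cols, spectra.shape[1]))])
    coef, *_ = np.linalg.lstsq(A_aug, Y_aug, rcond=None)
    return coef

def predict_rbf(P_new, P_train, coef):
    return design_matrix(P_new, P_train) @ coef

# Toy "spectrum" depending on two parameters (synthetic, not a XANES calculation).
E = np.linspace(-10.0, 30.0, 81)
def toy_spectrum(p):
    return (0.5 + np.arctan((E - 5.0 * p[0]) / 2.0) / np.pi
            + (0.3 + 0.2 * p[1]) * np.exp(-((E - 10.0) / 3.0) ** 2))

rng = np.random.default_rng(0)
P_train = rng.uniform(0.0, 1.0, size=(200, 2))
P_test = rng.uniform(0.0, 1.0, size=(50, 2))
coef = fit_rbf(P_train, np.array([toy_spectrum(p) for p in P_train]))
pred = predict_rbf(P_test, P_train, coef)
rmse = np.sqrt(np.mean((pred - np.array([toy_spectrum(p) for p in P_test])) ** 2))
print(f"RMSE on unseen parameter points: {rmse:.4f}")
```

Each column of `coef` holds the weights and polynomial coefficients for one energy point, which corresponds to the energy dependence of the coefficients in Eq. (3).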

For a good quality of approximation, it is important to define a proper norm in Eq. (3) to measure the distance between P and \(\boldsymbol{P}_i\). Structural parameters \(p_1,p_2, \ldots ,p_k\) have different scales, e.g., interatomic distances and angles. Moreover, the variation of the target function, \(\widehat {\upmu}\left( {E,p_1,p_2, \ldots ,p_k} \right)\), differs greatly between structural parameters. Spectral changes caused by angle transformation are an order of magnitude smaller than those caused by modification of the interatomic distances. Therefore, we first estimate the average partial variance of the target function, \(\Delta_i\upmu\), for each \(p_i\) and rescale the structural parameters in the following way:

$$p_i^\prime = p_i\,\frac{\Delta_i\upmu}{\max p_i - \min p_i}$$
(4)
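
The rescaling of Eq. (4) can be sketched as follows. How the partial variation Δiμ is estimated is an assumption of this example: here it is taken as the root-mean-square change of a toy spectrum when one parameter is moved from its minimum to its maximum while the others are held at random values. The names `partial_variation`, `rescale` and `toy_model` are hypothetical.

```python
import numpy as np

E = np.linspace(-10.0, 30.0, 81)

def toy_model(p):
    """Synthetic spectrum: strong dependence on a distance, weak on an angle."""
    d, angle = p
    return (0.5 + np.arctan((E - 20.0 * (d - 2.0)) / 2.0) / np.pi
            + 0.02 * (angle - 90.0) / 10.0 * np.exp(-((E - 10.0) / 3.0) ** 2))

def partial_variation(model, lower, upper, n_samples=64, seed=0):
    """Assumed estimator of Delta_i mu: RMS spectral change when parameter i
    goes from its minimum to its maximum, averaged over random base points."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(lower, upper, size=(n_samples, len(lower)))
    delta = np.empty(len(lower))
    for i in range(len(lower)):
        lo, hi = base.copy(), base.copy()
        lo[:, i], hi[:, i] = lower[i], upper[i]
        diff = np.array([model(h) - model(l) for h, l in zip(hi, lo)])
        delta[i] = np.sqrt(np.mean(diff ** 2))
    return delta

def rescale(P, delta, lower, upper):
    """Eq. (4): p_i' = p_i * Delta_i mu / (max p_i - min p_i)."""
    return P * delta / (upper - lower)

lower = np.array([1.8, 80.0])   # distance in angstroms, angle in degrees
upper = np.array([2.2, 100.0])
delta = partial_variation(toy_model, lower, upper)
span = rescale(upper, delta, lower, upper) - rescale(lower, delta, lower, upper)
print(delta, span)
```

After rescaling, the range spanned by each parameter equals its Δiμ, so parameters that change the spectrum more strongly contribute more to the distance in Eq. (3).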

The quality of approximation and prediction is evaluated by 10-fold cross-validation. The training set, composed of spectra (for the task of approximating XANES as a continuous function of structural parameters) or descriptors (for the task of predicting structural parameters from several spectral features), is divided randomly into 10 parts, nine of which are used for training the algorithm and the tenth for validation. The quantitative measure of quality is the R2 score for regression and the accuracy for classification. Details of their evaluation are described in the Supplementary Methods section, while the supplementary Jupyter Notebook reports the steps necessary to repeat the calculations in the manuscript Fig. 10.
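
A 10-fold cross-validation of this kind can be sketched with scikit-learn as shown below. The synthetic descriptors and labels are assumptions made for illustration and do not reproduce the training sets of this work.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 600

# Synthetic labels (not real data): coordination number and Fe-O distance.
cn = rng.choice([4, 5, 6], size=n)
distance = rng.uniform(1.8, 2.2, size=n)

# Synthetic descriptors: edge energy tracks the distance, peak height tracks CN.
edge_energy = 7120.0 - 10.0 * (distance - 2.0) + rng.normal(0.0, 0.1, size=n)
peak_height = 1.0 + 0.1 * cn + rng.normal(0.0, 0.02, size=n)
X = np.column_stack([edge_energy, peak_height])

# Random split into 10 parts; each part serves once as the validation set.
folds = KFold(n_splits=10, shuffle=True, random_state=0)

r2_scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=100, random_state=0),
    X, distance, cv=folds, scoring="r2")
accuracy_scores = cross_val_score(
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    X, cn, cv=folds, scoring="accuracy")

print(f"R2 (distance): {r2_scores.mean():.3f}")
print(f"accuracy (CN): {accuracy_scores.mean():.3f}")
```

Reporting the mean over the ten folds, possibly together with the spread, gives a less optimistic estimate than a score computed on the training data itself.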

Fig. 10: The grafting and thermolysis process for highly dehydroxylated SiO2.
The iron coordination number and local symmetry change upon thermal treatment depending on the temperature and atmosphere.

Section 2.4 of the main text deals with multicomponent systems. In this case, the algorithm is trained on linear combinations of theoretical spectra instead of pure theoretical spectra. In total, more than 5000 pairs were constructed for randomized fractions of components with different CNs, valences, and Fe–O distances. The flowchart in Fig. 7 describes the details of the procedure for mixture analysis. We found that the prediction quality for reference experimental data may be improved when sampling is performed according to an adaptive sampling scheme. Whereas the IHS scheme provides uniform sampling over each structural parameter, adaptive sampling (or active learning)59,60 chooses the points in the training sample so as to ensure a uniform variation of the XANES in the selected region of structural parameters. Both training sets are available as Supplementary Information.
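
The construction of a two-component training set can be sketched as follows. The function name `make_mixtures`, the three synthetic "pure" spectra and the choice of the fraction-weighted mean CN as the target label are assumptions made for this example; the actual procedure is the one described in Fig. 7.

```python
import numpy as np

def make_mixtures(spectra, labels, n_pairs, seed=0):
    """Build random two-component mixtures as linear combinations of spectra.

    Returns the mixed spectra, the fraction-weighted mean of the labels
    (an assumed choice of target), and the fraction of the first component.
    """
    rng = np.random.default_rng(seed)
    first = rng.integers(0, len(spectra), size=n_pairs)
    second = rng.integers(0, len(spectra), size=n_pairs)
    frac = rng.uniform(0.0, 1.0, size=n_pairs)
    mixed = frac[:, None] * spectra[first] + (1.0 - frac)[:, None] * spectra[second]
    mean_label = frac * labels[first] + (1.0 - frac) * labels[second]
    return mixed, mean_label, frac

# Three synthetic "pure" spectra labelled by coordination number (not real data).
energy = np.linspace(7100.0, 7200.0, 201)
cn_labels = np.array([4.0, 5.0, 6.0])
pure = np.array([
    0.5 + np.arctan((energy - 7120.0) / 2.0) / np.pi
    + 0.1 * cn * np.exp(-((energy - 7130.0) / 4.0) ** 2)
    for cn in cn_labels
])

mixed, mean_cn, frac = make_mixtures(pure, cn_labels, n_pairs=5000)
print(mixed.shape, mean_cn.min(), mean_cn.max())
```

Descriptors would then be extracted from each mixed spectrum in the same way as for pure spectra before training the model.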