Understanding X-ray absorption spectra by means of descriptors and machine learning algorithms

X-ray absorption near-edge structure (XANES) spectra are the fingerprint of the local atomic and electronic structures around the absorbing atom. However, the quantitative analysis of these spectra is not straightforward. Even with the most recent advances in this area, for a given spectrum, it is not clear a priori which structural parameters can be refined and how uncertainties should be estimated. Here, we present an alternative concept for the analysis of XANES spectra, which is based on machine learning algorithms and establishes the relationship between intuitive descriptors of spectra, such as edge position, intensities, positions, and curvatures of minima and maxima on the one hand, and those related to the local atomic and electronic structure, namely coordination numbers, bond distances and angles, and oxidation state, on the other hand. This approach overcomes the problem of the systematic difference between theoretical and experimental spectra. Furthermore, the numerical relations can be expressed in analytical formulas providing a simple and fast tool to extract structural parameters based on the spectral shape. The methodology was successfully applied to experimental data for the multicomponent Fe:SiO2 system and reference iron compounds, demonstrating the high prediction quality for both the theoretical validation sets and experimental data.

INTRODUCTION
X-ray absorption spectroscopy is widely employed to probe the local atomic and electronic structure around the absorbing atom [1][2][3]. The X-ray absorption near-edge structure (XANES), spanning a region of 50-200 eV above the absorption edge, contains information about structural descriptors (bond distances and angles, type of ligand surrounding, oxidation state), which affect the spectral descriptors: edge position and the shapes and positions of spectral maxima and minima. An experienced researcher can, for example, distinguish the pure metallic state from a metal oxide compound, or discriminate between tetrahedral and octahedral surroundings based on a qualitative inspection of the related spectral features. Figure 1 shows a series of typical experimental Cu K-edge XANES spectra for different copper compounds. The pre-edge feature A originates from the transition to the spatially localized 3d states. The pre-edge shape depends on the number of electrons in the d-shell 4 , its intensity is proportional to the amount of 3d-4p hybridization 5 , while its energy position can be employed to calibrate the 3d metal oxidation state 6 . The sharp shoulder B on the rising edge is indicative of a linear or square planar geometry with a lower energy of empty 4p orbitals perpendicular to the chemical bonds 7 . A similar shoulder appears in the spectra of metals. K-edge XANES of metals with an fcc structure is further characterized by the splitting of the main peak into M1 and M2 features. Intensities of M2 and M3 are sensitive to the scattering from the second coordination shell 8 , similar to the feature D in molecular covalent complexes, and their reduction can therefore be used to probe the nanosized effects 9 . Positions of M4 and further high-energy maxima relate to the interatomic distances via the semi-empirical Natoli's rule 10 . The absorption edge position depends on the oxidation state 11 and also on interatomic distances 12 .
The intensity of the white line C is higher in spectra of metal complexes with the octahedral coordination. Planar complexes are characterized by energy splitting of this peak 13 . Characteristic spectral features can be further established for the K-edges of light atoms, L 2,3 edges for 3d metals with strong multiplet splitting, or L 2,3 spectra for 4d metals possessing a characteristic white line.
For a data scientist, the above-mentioned spectral features are recognized as descriptors, and the relationships between spectral descriptors and the structural ones (coordination number, geometry, bond distances, angles…) can be established, for example, by using machine learning (ML) algorithms. Using all points of a spectrum as descriptors, Zheng et al. 14 managed to classify the atomic coordination environments via random forest models. A convolutional neural network was applied to predict Cu-Cu coordination numbers (CN) 15 and to evaluate several CNs for platinum nanoparticles to refine their sizes and shapes 16 . Rankine et al. 17 demonstrated the ability of a deep neural network to predict a XANES spectrum from geometric information about the local environment around the absorbing atom. To achieve better performance of ML, the dimensionality of both spectral and structural descriptors should be reduced. For example, 3N atomic coordinates for N atoms can be converted into radial and angle distribution functions 18 or into generalized radial distribution functions 19 . Alternatively, geometry parameters can be filtered in terms of their importance for XANES variation 20 . The multiple points of a spectrum can also be reduced to only a few descriptors. Commonly used spectral descriptors are the pre-edge centroid and the pre-edge area which are, for example, combined to analyze the Fe oxidation state and coordination number 6 . Carbone et al. demonstrated that the principal components calculated from a series of theoretical spectra can be used to classify four-, five-, and six-coordinated metal environments 21 and the type of functional groups 22 . Recently, Torrisi et al. 23 demonstrated the concept of constructing descriptors from a polynomial fit of equidistant energy intervals of the spectrum.
The present work aims to extend these approaches of identifying descriptors of the spectral features, such as positions of the absorption edge, minima, and maxima, their amplitudes, and curvatures. We present a step-by-step procedure to prepare a training set, evaluate the descriptors, train the machine learning algorithm, apply cross-validation, and finally analyze the experimental data. The variable structural parameters are introduced and the problem of classification of the calculated spectra in terms of these parameters is addressed. The analytical formulas establishing the relations between the spectral features and the structural parameters are then derived. Finally, we validate the approach for a set of experimental spectra belonging to oxides, silicates, geological samples (tektites, impactites), and amorphous glasses as well as silica-supported Fe single-site catalysts prepared via surface organometallic chemistry [24][25][26] .

RESULTS AND DISCUSSION
Descriptors of spectrum
In general, the theoretical XANES spectrum contains ~100 energy points. A common approach to improve the efficiency of ML algorithms is to reduce the dimensionality of such an object by extracting only informative features, notably the spectral descriptors 23 . Table 1 and Fig. 2 describe a set of descriptors evaluated for each spectrum: edge position (feature A), white line position and intensity (feature B), first pit (minimum) position and intensity (feature C), the curvature of the white line, and projections on the principal components (further called PC descriptors). The arctangent function (red dotted line) was used to fit the whole spectrum, and the position of its center and its slope were taken as the values of edge position (Edge E ) and slope (Edge slope ). For some deformations in the local geometry, the white line in the calculated spectra can consist of several close maxima. To obtain a monotonic variation of descriptors across the training set, we performed an additional convolution (5 eV Lorentzian width) of the spectral regions near extrema B and C before evaluating the curvature, amplitude, and energy position of these features. The same convolution was applied to the experimental spectra before the evaluation of their descriptors. Principal component analysis (PCA) was applied to the whole data set of theoretical spectra. Based on the singular value decomposition (SVD, see details in Supplementary Methods Section), the first three principal components were evaluated. Each spectrum was projected on these components and the projections were used as the descriptors of the spectrum. We also applied SVD analysis to the data set where all edge positions (point A in Fig. 2) were aligned. In this way, another set of three principal components was used to calculate projections of every theoretical spectrum. We call these projections relative PC descriptors (rPC).
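The descriptor evaluation described above can be illustrated with a minimal sketch. It assumes a synthetic spectrum, takes the white line as the global maximum and the pit as the first local minimum after it, estimates curvature from a local parabola fit, and centers the spectra before the SVD. All function names, the dictionary keys, and the synthetic parameters are hypothetical choices made for illustration, and the 5 eV Lorentzian smoothing step is omitted.

```python
import numpy as np
from scipy.optimize import curve_fit

def arctan_step(e, e0, slope, height):
    """Arctangent edge model rising from 0 to `height` around e0."""
    return height * (0.5 + np.arctan(slope * (e - e0)) / np.pi)

def parabola_curvature(energy, mu, idx, half_window=3):
    """Second derivative of a parabola fitted around grid point idx."""
    lo, hi = max(idx - half_window, 0), min(idx + half_window + 1, len(mu))
    a, _, _ = np.polyfit(energy[lo:hi] - energy[idx], mu[lo:hi], 2)
    return 2.0 * a

def spectrum_descriptors(energy, mu):
    """Return edge, white-line and pit descriptors of one spectrum."""
    # Edge: center and slope of an arctangent fitted to the whole spectrum.
    e_half = energy[np.argmin(np.abs(mu - 0.5 * mu.max()))]
    (e0, slope, _), _ = curve_fit(arctan_step, energy, mu,
                                  p0=[e_half, 1.0, mu[-1]])
    # White line: taken here as the global maximum (an assumption).
    i_wl = int(np.argmax(mu))
    # Pit: first local minimum after the white line.
    d = np.diff(mu[i_wl:])
    turns = np.where((d[:-1] < 0) & (d[1:] >= 0))[0]
    i_pit = i_wl + int(turns[0]) + 1 if turns.size else len(mu) - 1
    return {
        "Edge_E": e0, "Edge_slope": slope,
        "WL_E": energy[i_wl], "WL_int": mu[i_wl],
        "WL_curv": parabola_curvature(energy, mu, i_wl),
        "Pit_E": energy[i_pit], "Pit_int": mu[i_pit],
    }

def pc_projections(spectra, n_components=3):
    """Project (centered) spectra on the first principal components via SVD."""
    centered = spectra - spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def synthetic_spectrum(energy, e0=7120.0):
    """Toy spectrum: arctangent edge + Gaussian white line - Gaussian dip."""
    return (arctan_step(energy, e0, 0.8, 1.0)
            + 0.4 * np.exp(-((energy - e0 - 10.0) / 4.0) ** 2)
            - 0.15 * np.exp(-((energy - e0 - 30.0) / 6.0) ** 2))

energy = np.linspace(7100.0, 7200.0, 201)
mu = synthetic_spectrum(energy)
desc = spectrum_descriptors(energy, mu)

# A small toy "training set" with shifted edges, used only for the PC demo.
library = np.array([synthetic_spectrum(energy, 7118.0 + 0.2 * k) for k in range(20)])
proj = pc_projections(library)
```

Relative PC descriptors (rPC) would follow from the same `pc_projections` call applied to spectra whose energy axes have first been shifted so that all edge positions coincide.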
Compared with descriptors based on the curvature of fixed energy intervals of the spectrum, as in the work of Torrisi et al. 23 , the search for minima and maxima along with edge characteristics relies on physically motivated features of the spectrum. As we show below, good prediction quality can be achieved by using just 2 or 3 such descriptors. A stable definition of the extrema may be tricky for flattened spectra or L 2,3 edges with rich multiplet splitting. Such systems may require additional spectral descriptors such as total variance, centers of mass and areas, fitted peak profiles, etc. These descriptors are beyond the scope of the present work but are included in the supplementary software.
Relationship between spectral features and structure
3d metal complexes can be found in a wide range of CNs and local symmetries around the metal center. The type of ligands determines the interatomic distances and symmetry for the given oxidation and spin state of the metal. Valuable catalysts or geological materials can contain iron ions in a silica matrix where oxygen coordination provides both Fe 2+ and Fe 3+ oxidation states along with several possible CNs. To address the problem of quantitative iron speciation, we calculated the training set consisting of spectra for Fe(SiO 4 ) CN complexes for CN = 2-6. The first shell distances and bond angles were varied for every CN using the improved Latin hypercube sampling (IHS) resulting in 3000 spectra calculated for all CN. A chemical shift was then applied for each spectrum to simulate absorption from Fe 2+ and Fe 3+ sites for every deformation. Figure 3 shows the clusters used for simulations and the variable structural parameters p. The ranges of their variation are listed in Table 2.
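The sampling of structural parameters can be sketched as follows. This example uses the plain Latin hypercube sampler from SciPy rather than the improved variant (IHS) named in the text, and the parameter names and ranges are hypothetical placeholders, not the values of Table 2. The sample count of 600 per coordination number corresponds to 3000 spectra over five CNs.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical deformation ranges for one coordination number.
bounds = {
    "R_FeO_1 (A)": (1.8, 2.3),
    "R_FeO_2 (A)": (1.8, 2.3),
    "bend_angle (deg)": (-20.0, 20.0),
}

def sample_deformations(bounds, n_samples, seed=0):
    """Draw a space-filling set of structural parameters within `bounds`."""
    names = list(bounds)
    lower = np.array([bounds[k][0] for k in names])
    upper = np.array([bounds[k][1] for k in names])
    sampler = qmc.LatinHypercube(d=len(names), seed=seed)
    unit = sampler.random(n=n_samples)        # points in the unit hypercube
    return names, qmc.scale(unit, lower, upper)

names, samples = sample_deformations(bounds, n_samples=600)
```

Each row of `samples` would then define one deformed cluster for which a spectrum is calculated.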
The calculated spectra for Fe 2+ are shown in Fig. 4. The library of spectra for Fe 3+ contains the same entries but shifted according to the 1s core level energy difference evaluated within an accurate molecular orbital approach (see Methods section). Therefore, the total number of spectra in the training set was 6000, i.e., twice as many as shown in Fig. 4.
Each spectrum in the training set is characterized by several descriptors ( Table 1). The whole training set can be projected on a 2D plot for the selected pair of descriptors. Figure 5 compares the distribution of points in the training set over different 2D maps where each point is colored according to its structural parameters.

Beyond the two-dimensional scatter plots, which are informative for the qualitative selection of good descriptors, the best quality of classification and the best choice of descriptors for the ML algorithm were determined ( Table 3) for combinations of 1, 2, 3, or 4 descriptors to predict CN, oxidation state, or distance in a pure compound (mixtures will be discussed further in section 2.4).
Two descriptors of spectra contain up to 95% of the information necessary for discrimination between Fe 2+ and Fe 3+ . Using the value of the edge energy alone provided 80% of the prediction quality. Considering the white line intensity in addition to the Edge E improved the quality to 90%. Other informative descriptors for the oxidation state were the energies of the main maximum and pit, the first principal components. Fe-O mean distance is uniquely characterized by the combination of edge and pit energies (95% quality). Projections on the second and third relative principal components, rPC 2 and rPC 3 , were more important for this task than the first PC. Higher CNs are characterized by a sharp white line and a steep rising edge.
Good quality of prediction for CN requires the use of at least three descriptors, which include edge energy, slope, and curvature of the main maximum. The lowest accuracy in cross-validation analysis was observed for the standard deviation from the mean that measures the disorder in the first coordination shell. Four descriptors were necessary to reach a prediction quality of 90%.
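The search for the best combination of descriptors can be illustrated by a short sketch that cross-validates an Extra Trees regressor on every subset of descriptors up to a given size. The data are synthetic, the descriptor names are reused from the text only as labels, and the R 2 score is assumed as the quality measure. The helper name `rank_descriptor_sets` is hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

def rank_descriptor_sets(X, y, names, max_size=2, cv=5, seed=0):
    """Cross-validated R2 for every descriptor subset, best first."""
    results = []
    for size in range(1, max_size + 1):
        for cols in combinations(range(len(names)), size):
            model = ExtraTreesRegressor(n_estimators=50, random_state=seed)
            r2 = cross_val_score(model, X[:, list(cols)], y,
                                 cv=cv, scoring="r2").mean()
            results.append((r2, tuple(names[c] for c in cols)))
    return sorted(results, reverse=True)

# Synthetic stand-in for a descriptor table: the target depends on two
# of the four columns only.
rng = np.random.default_rng(0)
names = ["Edge_E", "Pit_E", "WL_int", "noise"]
X = rng.normal(size=(300, len(names)))
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)

ranking = rank_descriptor_sets(X, y, names, max_size=2)
best_r2, best_set = ranking[0]
```

On this toy data the pair containing both informative columns ranks first; with real descriptors the ranking would be repeated separately for each structural parameter.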
The optimal choice of the descriptors in Table 3 does not guarantee their transferability to the experimental data or to the problem of multicomponent system analysis. In Section 2.4, we address the quality of structural analysis by using descriptors in the training set composed of linear combinations of spectra.

Analytical relations between descriptors: beyond Natoli's rule
In the early eighties, Natoli formulated an empirical rule 10 that establishes a dependence between peak positions in the XANES spectrum and interatomic distances for structures with similar symmetry. The rule applies to metals within the same space group (e.g., fcc Cu and Ni, Supplementary Fig. 3) and to structures that undergo a volume expansion, such as palladium after hydrogen sorption 27 . In the latter case, we have previously observed that the relative intensities of the first two XANES maxima are proportional to the H/Pd ratio in palladium hydride samples 28,29 . Another example by Zhang et al. 30 provides an analytical relation between energy positions of maxima in U L 3 -edge XANES spectra of uranyl complexes and distances between the uranium absorber and oxygen ligand atoms. Although they represent a useful tool for the analysis of XANES spectra, all these examples are limited to the usage of only one spectral descriptor and one descriptor of structure. In this section, we extend such methodology to derive the analytical relation between any set of spectral descriptors and structural parameters using a machine learning algorithm. The common approach to find simple analytical relations between known parameters p 1 …p n and target variable y is the construction of linear regression:

$$ y = \sum_{i=1}^{n} w_i p_i \qquad (1) $$

More complex cases include pairwise and higher degree multiplications alongside parameters p 1 …p n . We are interested in pretty solutions with good approximation quality. The prettiness means the absence of large coefficients w i and the smallest possible number of nonzero w i . For the integer relations problem, the prettiness is achieved by applying special algorithms of integer orthogonalization (see e.g., 31 and §2.2 in ref. 32 ). In a real-valued case, we use feature selecting properties of the Elastic Net algorithm 33 combined with some heuristics. For the theoretical data set, we restrict ourselves to the parameters p 1 …p n and their pairwise multiplications, thus Eq. (1) takes the form

$$ y = \sum_{i=1}^{n} w_i p_i + \sum_{i \le j} w_{ij} p_i p_j \qquad (2) $$

In the first step, the data were normalized to zero mean and unit standard deviation. We implemented the elastic net method that includes the LASSO 34 and ridge regression. In the case of a group of highly correlated variables, the LASSO algorithm tends to select one variable from the group and ignores the others, thus performing feature selection. If the linear formula returned by Elastic Net is heavy, we try to simplify it at the expense of model precision. To do so, we sort the coefficients (w i , w ij ) returned by Elastic Net by their absolute values and try to build a linear model based on subsets of features with the largest absolute coefficients. The analysis was performed for the subsets of each size: 1, 2, 3,…, and for all of them the R 2 -score was evaluated. Afterward, one can choose between pretty models with moderate quality or more complicated models with higher precision. Table 4 shows the selected analytical relations between descriptors of spectra and structural parameters.

Fig. 4 Visualization of the theoretical training set and the trends in variations of XANES spectra upon studied deformations. a Six hundred Fe K-edge XANES spectra calculated for each coordination number by varying structural parameters (Fig. 3). b Comparison between shapes of spectra upon variation of coordination numbers while all Fe-O bond lengths are fixed to 2.1 Å. c Sensitivity of the spectrum to variations of bending angles and interatomic distance for a five-coordinated model (only the first coordination shell is shown for simplicity).
Analytical relations between descriptors extend the qualitative classification of the 2D scatter plots. The obtained formulas explore dependencies between any number of spectral features and structural parameters. Table 4 provides a geometrical interpretation of the best combinations of descriptors. For example, up to 90% prediction quality can be achieved for the interatomic distances if the energy positions of the edge, first maximum, and minimum are considered.
The accuracy of the quadratic analytical formulas for the oxidation state is above 80%. The quality of analysis could be improved if chemically relevant restrictions were imposed on the Fe-O distances for Fe 2+ or Fe 3+ ions in the training set. For better generalization, we assumed that the ranges of variations of structural parameters were equal for both the oxidation states. Therefore, the chemical shift of the whole spectrum can be misinterpreted by the edge shift upon distance variation. This effect is partially accounted for by the main maximum intensity (WL int ) descriptor that enters the formula. The intensity of the main maximum changes along with Fe-O bonds contraction therefore this descriptor can help to discriminate between shifts related to the oxidation state or volume changes. Formulas for CN depend on the curvature of the main maximum, which is consistent with the general behavior of EXAFS oscillations, whose amplitude is proportional to CN. One should note, however, that this conclusion should not be generalized to the structures with different types of bonds (e.g., metallic iron has larger CN, but the white line intensity is higher in the octahedral Fe-O oxide).
The second part of Table 4 interprets the features of the XANES spectrum in terms of geometry parameters. The slope of the edge depends on the average distances and coordination number. The curvature of the white line correlates with the disorder in the first coordination shell of iron, i.e., larger disorder makes the first maximum broader. The position of the first minimum is quite an important feature in the spectrum though it is less often analyzed as compared to the maxima. This feature is by almost 90% determined by the CN and Fe-O distance. Its intensity is determined by the CN and disorder in the first coordination shell.
Fitting a multi-component system
If the distribution of absorbing atoms in a material is heterogeneous, a linear combination of theoretical spectra with different oxidation states and coordination is required to describe the experimental spectrum. In this section, we extend the descriptor approach to the case of linear combinations and apply the descriptor analysis to the experimental Fe K-edge XANES data of iron oxide and iron silicate systems. The algorithm was applied to 56 experimental spectra of crystalline compounds 35 , glasses 36 , tektites, and impactites 37,38 , as well as a single-site silica-supported Fe catalyst 24,39 . Figure 6 shows experimental spectra and Supplementary Tables 2-6 provide a description of each sample and results of ML-analysis. Iron coordination and oxidation state are heavily dependent on the conditions of synthesis. Studied samples are inherently heterogeneous systems. In particular, tektites are formed from molten high-speed ejecta during the early stages of impact crater formation 40 . Impactites have a more complex history of their formation and are the result of the melting of various types of rocks located at different depths in the Earth's crust. Iron in the amorphous silica structure has the potential to be a probe of impact rock formation conditions, such as pressure (P), temperature (T), oxygen fugacity 41,42 . Figure 7 shows the steps required to apply the descriptor approach to the multicomponent system. We have constructed a database of linear combinations of theoretical spectra in the training set, using several random concentrations for every pair of spectra. The descriptors were then evaluated for the database. A cross-validation procedure was applied to different combinations of the descriptors to understand which combination works better for the mixtures.

Table 3. The quality of structural parameters prediction by using selected good combinations of the descriptors of spectra from data set in Fig. 4a
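The construction of the database of linear combinations can be sketched as below. This is a simplified illustration: pairs are drawn at random instead of enumerating every pair, one concentration is drawn per pair, and the label of a mixture is assumed to be the concentration-weighted average of the labels of its components (e.g., an average oxidation state). The function name and the toy arrays are hypothetical.

```python
import numpy as np

def mix_training_set(spectra, labels, n_mixtures, seed=0):
    """Two-component mixtures of spectra with concentration-weighted labels."""
    rng = np.random.default_rng(seed)
    n = len(spectra)
    i = rng.integers(0, n, size=n_mixtures)    # first component of each pair
    j = rng.integers(0, n, size=n_mixtures)    # second component
    c = rng.uniform(0.0, 1.0, size=n_mixtures)  # concentration of the first
    mixed_spectra = c[:, None] * spectra[i] + (1.0 - c[:, None]) * spectra[j]
    mixed_labels = c * labels[i] + (1.0 - c) * labels[j]
    return mixed_spectra, mixed_labels

# Toy library: 50 "spectra" of 100 points, labelled with oxidation state 2 or 3.
rng = np.random.default_rng(42)
spectra = rng.random((50, 100))
oxidation = rng.choice([2.0, 3.0], size=50)

mixed_spectra, mixed_oxidation = mix_training_set(spectra, oxidation, n_mixtures=200)
```

Descriptors are then evaluated on `mixed_spectra` exactly as for pure spectra, so that the regressor is trained on intermediate values of the structural parameters as well.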
The appropriate choice of the descriptors for the given structural property should provide good quality of analysis both for the theoretical validation set and a set of experimental references. Therefore, we have calculated the descriptors of the experimental and calculated spectra for the reference structures. The pairs of theoretical and experimental descriptors for the known structures can be used in step 2.2 to calibrate the descriptors in the theoretical training set for systematic energy shifts or intensity differences. The calibration step for intensities may be necessary when experimental spectra are measured in fluorescence mode and are flattened owing to self-absorption. In this work, we did not apply any calibration after the convolution of theoretical spectra. In step 2.4, the reference spectra are used for validation before predicting results for the unknown structures. Figure 8 shows selected scatter plots for pairs of descriptors that can discriminate efficiently between iron oxidation state, CN, and average Fe-O distance in the two-component mixture. While the classes were well separated in Fig. 5, their overlap occurs in Fig. 8 due to the linear combinations added to the training sample. We projected spectra of several references (hollow circles) on the two-dimensional scatter plots. Reference oxides and silicates have quite different structures from the entries in the training set, but descriptors Pit E , WL E , and WL curv provided surprisingly good quality for their analysis. α-Fe 2 O 3 and NaFeSi 2 O 6 were properly projected to the region of 6-coordinated species. γ-Fe 2 O 3 and Fe 3 O 4 contain one-third of Fe ions in the tetrahedral positions and this point is projected to the region where 4-, 5-, and 6-coordinated points overlap (Fig. 8a). The Fe 2 SiO 4 reference has the longest Fe-O distances, equal to 2.2 Å, and it is properly projected to the blue region of the plot in Fig. 8b, while γ-Fe 2 O 3 has the shortest.
In Fig. 8c, Fe 2 O 3 and NaFeSi 2 O 6 are assigned to Fe 3+ , Fe 2 SiO 4 to Fe 2+ , whereas Fe 3 O 4 contains a mixture of Fe 2+ and Fe 3+ sites.
The classes in the training set overlap when linear combinations of spectra are introduced along with the pure species. Figure 8a shows how CN classes are mixed if compared with Fig. 5a. The points with intermediate average valence also overlap. In general, the prediction quality is 5-10% lower for the mixture if compared to the pure compound. The main difference was observed for the iron valence. For common structural parameters, the oxidation state affects only the energy position of spectra. Linear combination of spectra smears the localized distributions of Fe 2+ and Fe 3+ points in the scatter plots (compare Figs. 5e and 8c, respectively). Two descriptors can provide a quality of valence discrimination in the mixture of up to 80%, and the use of three or more descriptors is recommended. A better choice should consider the joint analysis of descriptors from several spectral regions. Thus, in ref. 43 , the multivariate approach was applied to XANES spectra to determine the iron redox state in silicate glasses. It was demonstrated that using the full spectral region from the pre-edge to the EXAFS provides more accurate results. Pre-edge descriptors alone can be applied to the charge state analysis as well. Wilke et al. demonstrated for the Fe K-XANES 44 that the pre-edge contains information both about the oxidation state and coordination number. The method analyses the 2D scatter plot of the integrated pre-edge intensities versus the pre-edge centroid positions. The set of reference spectra was distributed in the localized regions attributed to the 4-, 5-, and 6-coordinated Fe ions in oxidation state Fe 2+ and Fe 3+ . The limitations of this methodology arise from the need for well-defined reference spectra since pre-edge XANES simulations are still difficult for real systems. However, no references were reported with the CNs below 4.
Tables 5 and 6 demonstrate the best combinations of descriptors in terms of their quality calculated over the whole theoretical database or set of experimental references. The best triples of descriptors are different for these two tasks. This fact can be understood from statistical considerations. The area of variation of parameters in the theoretical training set is large and includes even chemically irrelevant species, e.g., Fe 3+ with distances longer than in Fe 2+ . In contrast, the range of structural parameters covered by experimental spectra is smaller and represents a subclass of the training set. The R 2 score quality is evaluated in the cross-validation procedure and depends on the size of the sample and its dimensions. Therefore, Table 6 also contains the mean absolute error evaluated along with the R 2 score for the experimental validation set. The triples of good descriptors for experimental analysis are listed in Table 6. Figure 9 reports the predicted structural parameters for reference experimental spectra compared with their actual values. Prediction for all experimental spectra can be found in Supplementary Tables 8-10. The mean absolute errors over the validation set were 0.1 for oxidation state, 0.4 for CNs, and 0.03 for distances. The largest errors of the ML algorithm were observed for crystalline compounds, which have a significantly different structure from entries in the training set. The latter was adapted for Fe:SiO 2 systems and contains silicon in the second coordination shell, while some reference minerals are composed of oxygen and iron/Al/CO in the nearest coordination shells. The methodology can be directly applied to a new training set extended by ligands of different types. In this case, additional labels (e.g., atom types) should be added as descriptors of structure.

Table 4. Analytical relations between descriptors of spectra and descriptors of structure.
No. | Descriptor | Analytical formula | R 2 score
Descriptors of structure
 | Fe oxidation state | 0.97·Edge E + 0.52·Pit int | 0.7
Label "R Fe-O " is used for the average Fe-O distances in the first coordination shell. "Std" is used for the standard deviation of Fe-O distances from mean, the parameter which measures disorder in the first coordination shell. Before constructing analytical dependencies, the descriptors of the training set were normalized to zero mean and unit standard deviation.
The obtained results for studied samples ("unknown" in Fig. 9) are in good agreement with other experimental methods. Mössbauer spectroscopy confirmed that the fraction of Fe 3+ ions was larger in impactites (zhamanshinite, irghizite). A number of XAS-based 38,45 and non-XAS investigations have shown that the iron oxidation state in tektites from different strew fields is close to Fe 2+ , with the Fe 3+ /ΣFe ratio generally <0.15 42,46 . Iron in impact glasses can cover a wider range of Fe oxidation states 37,47,48 as compared with tektites, from purely Fe 2+ to purely Fe 3+ , and Fe 3+ /ΣFe values are mainly within 0.25-0.59 42 . Fe-O distances are generally smaller for Fe 3+ ions and we observed a similar trend for impactites as compared to tektites. The Fe CN in tektites is still a disputed issue. EXAFS studies have reported that mean Fe CNs in tektites are close to 4 45 , whereas the coexistence of four- and five-coordinated Fe was observed in 38 . Our estimations fall in the range CN = 3.5-4.5, reproducing a similar trend as in EXAFS analysis. The absolute values of CN obtained from EXAFS analysis highly correlate with the Debye-Waller factor and can also be affected by self-absorption effects in the fluorescence regime of measurements (iron catalyst samples). Therefore, in the corresponding panel of Fig. 9, we omitted the expected values of CN to avoid confusion. Fig. 10 represents the formation process of the single-site Fe catalyst on silica. The analysis for this system implies that Fe remains at oxidation state +2 throughout the process, consistent with Mössbauer analysis and magnetic characterization 24 .
It also shows that after grafting of the molecular precursor, dimeric Fe(II) tris(tert-butoxy) siloxide, on SiO 2 dehydroxylated at 1080°C, the coordination number of Fe, CN(Fe), remains close to 4 (Fe@SiO 2 1), whereas it decreases to 3 after thermolysis at 1020°C (Fe@SiO 2 2), consistent with previously reported characterization data that show a similar decrease of CN(Fe), albeit to a value of 2 24 . This confirms that thermal treatment leads to Fe(II) species with low coordination number, probably situated between 2 and 3. It is noteworthy that the sample prepared at lower temperature both for the hydroxylation and thermolysis steps displays Fe sites with a larger coordination number of 4.

Fig. 7 The flowchart demonstrating how the descriptor analysis was applied to the mixture of spectra. Before training the algorithm, we normalize the descriptors to make them comparable (for a set of descriptors subtract the average and divide all entries by standard deviation over the set).
As a concluding remark, we note that ML algorithms usually work as a "black box" for researchers since it is difficult to understand what structural information is contained in each part of the spectrum. We approach such understanding by using selected descriptors of the spectrum instead of individual points. The whole spectrum is substituted by several descriptors that intuitively characterize its shape, i.e., energy position of edge, minima, maxima, their intensities, and curvatures. Machine learning analysis established the rational choice of the combinations of descriptors providing the highest prediction accuracy for the structural parameters both for pure compounds and their mixtures. To visualize the spectrum-structure relations we use scatter plots and derive analytical dependencies between the descriptors of the spectrum and structural parameters.
Rational choice of descriptors isolates those features of spectra that are most sensitive to specific structural parameters, avoiding fitting the whole spectrum. The major problem of the practical application of ML methods for experimental data analysis arises from the systematic differences between theoretical calculations and measured data. This discrepancy can arise either from limitations of the theoretical approach or from experimental artefacts. The benefit of using descriptors over the full spectrum lies in the possibility to correct the systematic differences by calibration on a dataset of theoretical and experimental spectra of reference compounds. However, like all methodologies based on supervised learning, our results are limited to the family of structures described by the training set. As an illustration, the algorithm was trained on the Fe-O-Si system; it will thus fail to predict proper parameters for metallic Fe or sulfide compounds that belong to very different types of materials. This certainly calls for expanding the training set in order to allow for distinguishing, for instance, the ligand types apart from coordination number or interatomic distances.
The further development of the approach is directed toward new ways of descriptor evaluation. A complete set of descriptors should provide the same amount of structural information as in a full spectrum. We foresee that a combination of descriptors from complementary experimental methods (nuclear magnetic resonance, electron paramagnetic resonance, X-ray diffraction, etc.) would significantly improve the quality of prediction.

XANES simulations and energy alignment
Fe K-edge XANES spectra were calculated using the full-potential finite difference method 49 implemented in the FDMNES software 50. The photoelectron wave functions were evaluated on a grid of points in a 5.5 Å sphere around the absorbing atom with a 0.2 Å interpoint distance. To account for the core-hole lifetime broadening and the instrumental energy resolution, theoretical spectra were further convoluted using the arctangent function to model the energy dependence of the Lorentzian width. For an accurate energy calibration of the spectra, the iron 1s core level energy shifts between the Fe2+ and Fe3+ oxidation states were estimated for each coordination number within the molecular orbital approach. The energy levels and the corresponding wave functions were calculated by density functional theory using the B3LYP exchange-correlation functional 51. The largest available basis set, QZ4P, implemented in the ADF-2019 software 52,53 was used. For every coordination number in the range between two and six, we constructed a symmetric complex with Fe-O distances equal to 2 Å and evaluated transition matrix elements in a 50 eV energy interval for both the Fe2+ and Fe3+ oxidation states. The proper oxidation state was achieved by specifying the charge and spin state of the whole complex. After the self-consistent procedure had converged, the charge states of the iron atoms were confirmed by Mulliken charge analysis. Chemical shifts of the 1s core levels were evaluated and applied to the spectrum calculated by the finite difference method. In this way, we simulated absorption from Fe2+ and Fe3+ sites for given values of the structural parameters.
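The broadening step described above can be illustrated with a minimal sketch. It assumes an arctangent-shaped width modeled on the form commonly used with FDMNES; the exact functional form, the function names, and all parameter values (gamma_hole, gamma_max, e_cent, e_larg, e_fermi) are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

def arctan_width(energy, e_fermi=0.0, gamma_hole=1.3, gamma_max=15.0,
                 e_cent=30.0, e_larg=30.0):
    """Energy-dependent Lorentzian FWHM (eV) with an arctangent shape.

    All default values are placeholders chosen for illustration only.
    """
    e = (np.asarray(energy, dtype=float) - e_fermi) / e_cent
    width = np.full(e.shape, gamma_hole, dtype=float)
    above = e > 0  # below the Fermi level only the core-hole width is kept
    x = e[above]
    arg = np.pi / 3.0 * gamma_max / e_larg * (x - 1.0 / x**2)
    width[above] = gamma_hole + gamma_max * (0.5 + np.arctan(arg) / np.pi)
    return width

def broaden(energy, mu, **width_kwargs):
    """Convolve mu(E) with a Lorentzian whose width depends on the output energy."""
    energy = np.asarray(energy, dtype=float)
    mu = np.asarray(mu, dtype=float)
    half = 0.5 * arctan_width(energy, **width_kwargs)[:, None]
    # kernel[i, j]: Lorentzian centred at energy[i], evaluated at energy[j]
    diff = energy[None, :] - energy[:, None]
    kernel = (half / np.pi) / (diff**2 + half**2)
    # Weight by the local grid step and normalise each row numerically,
    # so that a constant spectrum stays constant after broadening.
    weights = kernel * np.gradient(energy)[None, :]
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ mu

energy = np.linspace(-20.0, 100.0, 241)
raw = (energy > 10.0).astype(float)   # toy step-like "edge", not a real spectrum
smooth = broaden(energy, raw)
```

The width grows monotonically from gamma_hole at the Fermi level toward gamma_hole + gamma_max at high energies, so features far above the edge are smeared more strongly than the pre-edge region.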

Machine learning algorithms
When we apply machine learning based on spectrum descriptors (to calculate the quality of label prediction or to predict labels for experimental data), we use Extra Trees regressor or classifier models 54. Such a model consists of several randomly generated decision trees. A decision tree represents a flowchart of threshold conditions on parameters and divides the parameter space into non-intersecting rectangles. In each rectangle, for regression, the objective function μ(E, P) is approximated by a linear one using the least-squares method, and for classification a probability table is calculated. The results obtained from several trees are averaged.
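As a rough illustration of descriptor-based prediction, the sketch below uses the Extra Trees models from scikit-learn as a stand-in. This is an assumption: scikit-learn predicts a constant value in each leaf, whereas the text describes a linear least-squares fit per rectangle, so the implementation in the accompanying repository may differ. The descriptor table and labels are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical descriptor table: the columns stand for quantities such as
# edge position, main-peak intensity and curvature (synthetic numbers).
X = rng.uniform(0.0, 1.0, size=(300, 3))

# Synthetic continuous label standing in for, e.g., a bond length (regression).
y_reg = 1.9 + 0.3 * X[:, 0] - 0.1 * X[:, 1] ** 2 + 0.01 * rng.normal(size=300)
# Synthetic discrete label standing in for, e.g., an oxidation state (classification).
y_cls = (X[:, 0] > 0.5).astype(int)

X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
    X, y_reg, y_cls, test_size=0.2, random_state=0
)

regressor = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_tr, yr_tr)
classifier = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_tr, yc_tr)

r2_test = regressor.score(X_te, yr_te)         # R^2 on held-out descriptors
accuracy_test = classifier.score(X_te, yc_te)  # fraction of correct classes
```

Averaging over many randomized trees is what smooths the piecewise predictions of the individual trees; the number of trees (100 here) is an arbitrary choice for the sketch.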
For XANES approximation (Supplementary Fig. 8) we use a supervised machine learning algorithm based on radial basis functions (RBF), which constructs a continuous approximation of the spectrum, $\mu(E)$, as a function of the structural parameters $P = (p_1, p_2, \ldots, p_k)$. The RBF method is a well-proven mesh-free method [55][56][57]. The unknown function $\hat{\mu}(E, P)$ is represented in terms of a set of basis functions characterized by certain factors and polynomial terms as follows:

$$\hat{\mu}(E, P) = \sum_{i=1}^{N} w_i \, K\left(\lVert P - P_i \rVert\right) + \mathrm{Polynomial}_E(P), \tag{1}$$

where $K(r)$ is the radial basis function and $\mathrm{Polynomial}_E(P)$ is a polynomial function of the $k$-dimensional vector of structural parameters $P$ with energy-dependent coefficients. The training set is composed of $N$ calculated spectra. The points ($N = 600$ for each structure in Fig. 3) in the space of structural parameters $P$ were chosen according to the IHS 58. The unknown factors $w_i$ and the polynomial coefficients are obtained by the ridge quadric regression method. Every basis function is a function of the distance from the training set point $P_i$. In our task, good results were obtained using linear basis functions and a second-order polynomial (see also Supplementary Table 1 for comparison with other ML methods). For a good quality of approximation, it is important to define a proper norm in (1) to measure the distance between $P$ and $P_i$. The structural parameters $p_1, p_2, \ldots, p_k$ have different scales, e.g., interatomic distances and angles. Moreover, the variation of the target function, $\hat{\mu}(E, p_1, p_2, \ldots, p_k)$, differs greatly between structural parameters. Spectrum changes caused by an angle transformation are an order of magnitude smaller than those caused by a modification of interatomic distances. For this reason, we first estimate the average partial variance of the target function, $\Delta_i \mu$, for each $p_i$ and rescale the structural parameters accordingly. The quality of approximation and prediction is calculated during 10-fold cross-validation.
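A minimal sketch of such an approximation is given below, assuming a two-stage fit: a ridge-regularized second-order polynomial followed by linear-kernel RBF interpolation of the residuals. The two-stage scheme, the class name RBFSpectrumModel, the per-parameter `scale` weights, and the toy spectrum are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_features(P):
    """Columns 1, p_a and p_a * p_b (a <= b): a second-order polynomial basis."""
    k = P.shape[1]
    cols = [np.ones(len(P))] + [P[:, a] for a in range(k)]
    cols += [P[:, a] * P[:, b] for a, b in combinations_with_replacement(range(k), 2)]
    return np.column_stack(cols)

def pairwise_distance(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

class RBFSpectrumModel:
    def __init__(self, scale, ridge=1e-8):
        self.scale = np.asarray(scale, dtype=float)  # hypothetical rescaling weights
        self.ridge = ridge

    def _transform(self, P):
        return (np.atleast_2d(P) - self.center) * self.scale

    def fit(self, P, spectra):
        P = np.atleast_2d(P)
        self.center = P.mean(axis=0)  # centring improves conditioning
        self.nodes = self._transform(P)
        Phi = quadratic_features(self.nodes)
        gram = Phi.T @ Phi + self.ridge * np.eye(Phi.shape[1])
        self.coef = np.linalg.solve(gram, Phi.T @ spectra)  # one column per energy
        residual = spectra - Phi @ self.coef
        # Linear kernel K(r) = r; the weights interpolate the polynomial residual
        self.w = np.linalg.solve(pairwise_distance(self.nodes, self.nodes), residual)
        return self

    def predict(self, P):
        Q = self._transform(P)
        return quadratic_features(Q) @ self.coef + pairwise_distance(Q, self.nodes) @ self.w

def toy_spectrum(P, energy):
    """Synthetic 'spectrum' depending on a distance-like and an angle-like parameter."""
    d, a = P[:, :1], P[:, 1:]
    edge = 0.5 + np.arctan(energy[None, :] - 5.0 * d) / np.pi
    return edge + 0.1 * a * np.exp(-(energy[None, :] - 12.0) ** 2 / 8.0)

rng = np.random.default_rng(0)
energy = np.linspace(0.0, 20.0, 81)
P_train = rng.uniform([1.8, 0.0], [2.2, 1.0], size=(60, 2))
train_spectra = toy_spectrum(P_train, energy)
model = RBFSpectrumModel(scale=[5.0, 1.0]).fit(P_train, train_spectra)

P_new = rng.uniform([1.8, 0.0], [2.2, 1.0], size=(20, 2))
mean_abs_error = np.mean(np.abs(model.predict(P_new) - toy_spectrum(P_new, energy)))
```

In this sketch the weights and polynomial coefficients are stored per energy point, and the `scale` vector plays the role of the rescaling of structural parameters: a larger weight makes the distance in (1) more sensitive to that parameter.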
The training set, composed of spectra (for the task of XANES approximation as a continuous function of structural parameters) or descriptors (for the task of structural parameter prediction based on several spectral features), is divided randomly into 10 parts, nine of which are used for algorithm training and the tenth for validation. The quantitative measure of the quality is the R2 score for the regression task and the accuracy for the classification task. Details of their evaluation are described in the Supplementary Methods section, while the Supplementary Jupyter Notebook reports the steps necessary to repeat the calculations in the manuscript.
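The validation scheme can be sketched as follows. The helper names are hypothetical, a plain linear least-squares model is used as a placeholder for the actual ML model, and pooling the out-of-fold predictions before computing a single R2 score is an assumption (averaging per-fold scores is an equally common choice).

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def accuracy(labels_true, labels_pred):
    """Fraction of correctly predicted class labels."""
    return np.mean(np.asarray(labels_true) == np.asarray(labels_pred))

def cross_val_predict(fit, predict, X, y, n_folds=10, seed=0):
    """Return out-of-fold predictions: each sample is predicted by a model
    trained on the other n_folds - 1 parts of a random split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    y_pred = np.empty(len(y), dtype=float)
    for k, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = fit(X[train], y[train])
        y_pred[val] = predict(model, X[val])
    return y_pred

def fit_linear(X, y):
    """Placeholder model: ordinary least squares with an intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic descriptors and labels, used only to exercise the procedure
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=200)

y_cv = cross_val_predict(fit_linear, predict_linear, X, y)
cv_r2 = r2_score(y, y_cv)
```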
Section 2.4 of the main text deals with multicomponent systems. In that case, the algorithm is trained on linear combinations of theoretical spectra instead of pure theoretical spectra. In total, more than 5000 pairs were constructed for randomized fractions of components with different CNs, valences, and Fe-O distances. The flowchart in Fig. 7 describes the details of the procedure for mixture analysis. We found that the prediction quality for reference experimental data may be improved when sampling is performed according to the adaptive sampling scheme. Whereas the IHS scheme provides uniform sampling over each structural parameter, adaptive sampling (or active learning) 59,60 chooses the points in the training sample to ensure a uniform variation of the XANES in the selected region of structural parameters. Both training sets are available as SI.
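The construction of a mixture training set can be sketched as below. The function name make_mixtures is hypothetical, and labelling each mixture with the fraction-weighted average of the component labels is an assumption made for illustration; the actual labelling follows the procedure in Fig. 7.

```python
import numpy as np

def make_mixtures(spectra, labels, n_pairs, seed=0):
    """Build two-component mixtures of pure spectra with random fractions.

    spectra: array (n_pure, n_energy); labels: array (n_pure, n_labels).
    Returns mixed spectra, fraction-weighted labels, the fraction of the
    first component, and the indices of both components.
    """
    rng = np.random.default_rng(seed)
    n = len(spectra)
    i = rng.integers(0, n, size=n_pairs)
    # A non-zero offset modulo n guarantees that the second component differs
    j = (i + rng.integers(1, n, size=n_pairs)) % n
    x = rng.uniform(0.0, 1.0, size=n_pairs)[:, None]
    mixed = x * spectra[i] + (1.0 - x) * spectra[j]
    mixed_labels = x * labels[i] + (1.0 - x) * labels[j]
    return mixed, mixed_labels, x[:, 0], i, j

# Toy pure components: two flat "spectra" with coordination numbers 2 and 6
pure_spectra = np.array([[0.0] * 5, [1.0] * 5])
pure_labels = np.array([[2.0], [6.0]])
mixed, mixed_labels, fractions, first, second = make_mixtures(
    pure_spectra, pure_labels, n_pairs=1000
)
```

Because both the spectrum and the label are combined with the same fractions, the toy labels here vary linearly with the mixed signal, which makes the sketch easy to check by hand.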

DATA AVAILABILITY
The data that support the findings of this study are available at the repository https://github.com/gudasergey/XANES_descriptors along with the source code.

Fig. 10 The grafting and thermolysis process for highly dehydroxylated SiO2. The iron coordination number and local symmetry change upon thermal treatment depending on the temperature and atmosphere.