Introduction

Recent years have seen a considerable improvement in the throughput of the fabrication and characterisation of materials.1,2 The use of a multi-target sputtering technique, for example, has made it possible to fabricate a single sample containing all phases of an alloy.3 In the case of X-ray diffraction at synchrotron radiation facilities, 5000 samples can be measured per day.4 However, although accelerating the fabrication and characterisation of materials is recognised as an important problem and has attracted considerable attention, measured data are still often analysed manually. Because data analysis using this conventional approach can take several days to several months, it may become a bottleneck in the process. The objective of efficient materials research with materials informatics is to eliminate this bottleneck and to accelerate the research flow consisting of the fabrication and characterisation of materials followed by data analysis.5,6,7

It is thus important to establish a methodology that automatically and quantitatively extracts the materials parameter from the measured data.8,9 This technique allows the on-the-fly data analysis to be completed as part of the online characterisation in that it provides a combined procedure ranging from material fabrication to material discovery, thereby eliminating the bottleneck. The efficiency of the investigation of a material can be expected to drastically improve if the entire procedure flows smoothly. Therefore, automating and enhancing the speed of data analysis for high-throughput materials research has become increasingly important for the discovery of innovative materials.10,11

Spectroscopy is widely employed to evaluate the properties of materials. Examples of spectroscopic methods are X-ray absorption spectroscopy (XAS) and electron energy-loss spectroscopy (EELS), which provide information about the electronic and chemical state of a specific atom. The crystal field parameter 10 Dq is one of the most significant material parameters that can be obtained from XAS and EELS. It represents the energy splitting originating from the crystal field and provides an important clue to material properties such as magnetism and optical properties.

It is possible to calculate the XAS or EELS spectrum of a 3d transition metal if the values of the materials parameters are given. XAS and EELS spectra are usually calculated with atomic multiplet calculations, including crystal field multiplet and charge transfer multiplet calculations,12 or with first-principles calculations.13,14,15 However, the spectral shape is very complicated, and estimating a materials parameter (the crystal field parameter) directly from the spectrum is expected to be an ill-posed inverse problem that is mathematically intractable. Previously, the typical method to evaluate a physical value from a spectrum consisted of visually comparing the measured spectrum with the calculated spectrum.16,17

This suggests that if the similarity measures for spectrum comparison could be established, the task of spectrum analysis could be automated and the materials parameters extracted automatically. In addition, statistical machine-learning methods (e.g. clustering18 and matrix factorisation19) could be applied to materials research, whereupon further improvement in the research efficiency could be expected.

In materials informatics, the methodology to analyse the big data obtained from high-throughput experiments and simulations is extremely important when attempting to utilise machine learning.8,20,21,22,23,24,25 Unsupervised learning methods such as clustering extract information based on the relationships among input data; thus, the results may greatly depend on the similarity measure that is used.26,27 The selection of appropriate measures is one of the most important steps when applying unsupervised learning methods to evaluate and analyse materials.26,27,28,29

The estimation of materials parameters from experimental data via similarity measures has great potential for automated data analysis in materials research. The most important aspect of automated materials parameter estimation is the choice of a similarity measure (kernel function),30 which is not trivial and varies with the experimental method. The similarity measure should be selected specifically for each combination of materials parameter (e.g. 10 Dq, lattice parameters), measurement method (e.g. XAS, XRD), and material (e.g. transition metal, metal oxide). Data obtained from high-throughput characterisation often include imperfections such as added noise and deteriorated resolution; hence, the similarity measures for these data should be robust against these imperfections. We suggest that good similarity measures should meet the following two requirements: 1. The measure can accurately estimate the materials parameter of interest. 2. The measure is robust against imperfections of measurement such as noise addition and resolution deterioration.

We depicted the workflow for the estimation of materials parameters from spectra in Fig. 1. In the first step, the spectral dataset for building the statistical models is prepared by simulation or from a large set of experimental data. Then, the spectra are mapped into kernel (similarity) space, using appropriate similarity measures. The similarity between the measured spectrum and the standard spectrum is calculated to estimate the materials parameter from measured spectra. The discrete materials parameters (i.e. the charge of an atom) are estimated with dimensionality reduction (unsupervised learning) and human decision. If needed, this step can be replaced by classification (supervised learning), which does not need a human decision. The continuous materials parameters (i.e. the crystal field parameter) of measured spectra are estimated using the regression model (supervised learning). In the following section, we describe the results of each step in this workflow.

Fig. 1

Statistical learning methodology. First, the spectral dataset is prepared by simulation with various parameters. The similarity values between the reference spectrum and each of the spectra are calculated. This enables the statistical learning model for estimating the materials parameters from the spectra to be constructed. Continuous values are estimated by utilising regression models, whereas a discrete value or category is estimated by employing dimensionality reduction. In this procedure, the physical parameters of each spectrum are paired with a similarity value. Finally, the physical parameters of the measured spectra are estimated with the models

We compared the similarity measures for XAS spectra to determine whether data from high-throughput measurement could be analysed promptly. In this respect, the Euclidean distance (ED) (L2 norm) and Manhattan distance (L1 norm) are widely used in many fields as similarity/distance metrics; however, these metrics may perform poorly as similarity measures between measured data,26,27,29 and the appropriate measure of similarity is not trivial.26,29,31

We investigated measures that are robust to noise and peak broadening and are sensitive to changes in the material parameters. We demonstrate that an important material property, such as the crystal field parameter 10Dq, can be estimated automatically and promptly by the constructed regression model based on the similarity measure.

Results

Similarity measures

The spectra of interest are the Mn2+ L2,3 XAS or EELS spectra of MnO obtained from both calculations and experiments. The similarity measures can be defined by using various distance metrics. A distance is a metric that represents how far apart objects are. When the distance between vector x and y is written as d(x, y), d is known as the distance function, and the following conditions are satisfied:32

$$d(x,y) \ge 0,$$
(1)
$$x = y \Rightarrow d(x,y) = 0,$$
(2)
$$d(x,y) = d(y,x),$$
(3)
$$d(x,y) + d(y,z) \ge d(x,z).$$
(4)

The similarity s and the distance d are related as s = 1 − d when d is normalised to the range [0, 1]. In general, distances are normalised by their value range; in this work, normalisation was achieved by using the maximum distance estimated from physical constraints. The crystal field parameter 10 Dq is the difference between the energy levels that arises when the degeneracy of the electron orbital states is lifted. The maximum value of 10 Dq can be estimated from physical properties such as the atomic number and the crystal structure. By including these physical constraints, normalisation is possible even for a metric whose value is not bounded by a maximum and/or a minimum.
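The conversion from a distance to a similarity can be sketched as follows. This is only a minimal illustration: the function name and the argument `d_max` (the maximum distance, assumed here to be supplied from physical constraints) are hypothetical and are not taken from the original code.

```python
def similarity_from_distance(d, d_max):
    """Map a distance d to a similarity s = 1 - d / d_max, clipped to [0, 1].

    d_max is an assumed upper bound on the distance (hypothetical input).
    """
    if d_max <= 0:
        raise ValueError("d_max must be positive")
    d_norm = min(max(d / d_max, 0.0), 1.0)  # normalise to the range [0, 1]
    return 1.0 - d_norm


print(similarity_from_distance(0.0, 2.0))  # identical spectra
print(similarity_from_distance(0.5, 2.0))  # a quarter of the maximum distance
```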

This study evaluated the following distance functions: the ED, city block distance (CD), cosine, Jensen–Shannon divergence (JSD), Pearson's correlation coefficient (PCC), dynamic time warping (DTW), and earth mover’s distance (EMD). DTW and EMD require a base measure, and the Manhattan distance was employed in this work.

Let x and y be n-dimensional vectors represented by x = (x1, x2,...,xn). The definitions of the metrics that were used are as follows.

The ED and CD are special cases of the Minkowski distance (p = 2,1):

$$d_{{\mathrm{Minkowski}}}(x,y) = \left( {\mathop {\sum}\limits_{i = 1}^n | x_i - y_i|^p} \right)^{{1}/{p}}.$$
(5)
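As a minimal sketch of Eq. (5), the Minkowski distance can be written in a few lines of plain Python; the function name and the toy vectors are illustrative assumptions only.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Toy vectors (not real spectra)
x = [0.0, 3.0, 1.0]
y = [0.0, 0.0, 5.0]
ed = minkowski_distance(x, y, 2)  # Euclidean distance (p = 2)
cd = minkowski_distance(x, y, 1)  # city block distance (p = 1)
print(ed, cd)
```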

The cosine metric represents the cosine of the angle between vectors x and y in n-dimensional space; it is invariant to changes in the length of the vectors, so the cosine metric is robust to intensity changes of the whole spectrum:

$$d_{{\mathrm{Cosine}}}(x,y) = \frac{{\mathop {\sum}_{i = 1}^n {x_i} y_i}}{{\sqrt {\mathop {\sum}_{i = 1}^n {x_i^2}} \sqrt {\mathop {\sum}_{i = 1}^n {y_i^2}} }}.$$
(6)

Pearson’s product–moment correlation coefficient (PCC) is similar to the cosine metric; it is the cosine of the angle between the vectors x and y after their respective means have been subtracted:

$$d_{{\mathrm{PCC}}}(x,y) = \frac{{\mathop {\sum}_{i = 1}^n {(x_i - \bar x)} (y_i - \bar y)}}{{\sqrt {\mathop {\sum}_{i = 1}^n {(x_i - \bar x)^2}} \sqrt {\mathop {\sum}_{i = 1}^n {(y_i - \bar y)^2}} }}.$$
(7)
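The relationship between the cosine metric and PCC can be illustrated with a short sketch. It uses the standard definitions, in which each norm is the square root of a sum of squares; the function names and the toy spectrum are assumptions made for illustration.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """PCC computed as the cosine similarity of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


spectrum = [0.1, 0.4, 1.0, 0.3, 0.2]   # toy spectrum
scaled = [3.0 * v for v in spectrum]   # overall intensity change
shifted = [v + 0.5 for v in spectrum]  # constant baseline offset

print(cosine_similarity(spectrum, scaled))     # unaffected by scaling
print(cosine_similarity(spectrum, shifted))    # reduced by the offset
print(pearson_correlation(spectrum, shifted))  # unaffected by the offset
```

In this toy case the cosine metric ignores the overall scaling but not the baseline offset, whereas PCC ignores both.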

JSD is one of the metrics representing the distance between probability distributions, and it is a modification of the Kullback–Leibler divergence (KLD) to satisfy the symmetry rule.32 KLD is a metric of the extent to which one probability distribution diverges from another and is known as the relative entropy:

$$d_{{\mathrm{JSD}}}(x,y) = \frac{1}{2}D_{{\mathrm{KL}}}(x,M) + \frac{1}{2}D_{{\mathrm{KL}}}(y,M),$$
(8)
$$D_{{\mathrm{KL}}}(x,y) = \mathop {\sum}\limits_{i = 1}^n x_i{\mathrm{log}}\frac{{x_i}}{{y_i}},$$
(9)
$$M = \frac{1}{2}(x + y).$$
(10)

Here we assume that vectors x and y are normalised to be non-negative and that the summation of elements is one.
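A minimal sketch of Eqs. (8)–(10) is given below. It assumes non-negative input vectors and normalises them to unit sum before the divergence is evaluated; the helper names are hypothetical, and terms with a zero probability are taken to contribute zero to the KLD.

```python
import math


def normalise(x):
    """Scale a non-negative vector so that its elements sum to one."""
    total = sum(x)
    return [v / total for v in x]


def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(x, y):
    """Jensen-Shannon divergence between two non-negative vectors."""
    p, q = normalise(x), normalise(y)
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]  # mixture distribution M
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(js_divergence([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

With the natural logarithm, the divergence is bounded above by ln 2, which is reached for distributions that do not overlap at all.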

DTW, which makes it possible to compare the similarity of distributions that do not have the same length, is utilised especially in voice recognition. The DTW of vector or time-series data x and y is calculated according to the following procedure: first, set the window size; stretch the length of x in each window to minimise the distance to y. The summation of these distances is the DTW of x and y.33
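For illustration, the textbook dynamic-programming formulation of DTW with the Manhattan distance as the base measure can be sketched as follows. This is not necessarily the configuration of the R package used in this work; the function name and the optional `window` argument (a band limiting how far the alignment may deviate from the diagonal) are assumptions.

```python
def dtw_distance(x, y, window=None):
    """DTW distance between sequences x and y with |a - b| as local cost."""
    n, m = len(x), len(y)
    if window is None:
        window = max(n, m)            # no constraint on the warping path
    window = max(window, abs(n - m))  # the band must reach the corner (n, m)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            local = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three admissible predecessor paths
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, first point repeated
print(dtw_distance(a, b))
```

The two toy sequences have different lengths but the same shape, so the warping aligns them without any residual cost.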

EMD is a metric related to the transportation optimisation problem. The Minkowski distance and KLD are bin-to-bin distances that compare only the same bins of histograms, so their similarity values decrease with even a slight shift of the histograms. EMD is a cross-bin distance and is therefore robust to a shift of the whole histogram.32,34,35 Neither DTW nor EMD is an exact distance, because they do not satisfy the triangle inequality;36 instead, they are designed to preserve a specific characteristic.
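The difference between bin-to-bin and cross-bin distances can be illustrated with a small sketch. It assumes one-dimensional histograms on a common, evenly spaced grid, normalised to unit mass, with the Manhattan distance between bin indices as the base measure; under these assumptions the EMD reduces to the summed absolute difference of the cumulative distributions. The function names are hypothetical.

```python
from itertools import accumulate


def normalise(x):
    """Scale a non-negative histogram so that its bins sum to one."""
    total = sum(x)
    return [v / total for v in x]


def emd_1d(x, y):
    """1D earth mover's distance (in units of bins) via cumulative sums."""
    cdf_x = accumulate(normalise(x))
    cdf_y = accumulate(normalise(y))
    return sum(abs(a - b) for a, b in zip(cdf_x, cdf_y))


def city_block(x, y):
    """Bin-to-bin Manhattan distance, for comparison."""
    return sum(abs(a - b) for a, b in zip(x, y))


peak = [0.0, 1.0, 0.0, 0.0, 0.0]
shift1 = [0.0, 0.0, 1.0, 0.0, 0.0]  # peak moved by one bin
shift2 = [0.0, 0.0, 0.0, 1.0, 0.0]  # peak moved by two bins

print(emd_1d(peak, shift1), emd_1d(peak, shift2))
print(city_block(peak, shift1), city_block(peak, shift2))
```

In this toy case the EMD grows in proportion to the shift, whereas the bin-to-bin distance saturates as soon as the peaks no longer overlap.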

Dimensionality reduction and visualisation

Before the materials parameter (10 Dq) is estimated from the spectra, it is important to determine the element and valence of the material. Elements can be differentiated from one another by the photon energies at which the peaks appear in the absorption spectra.

In many cases, the intrinsic dimension of high-dimensional data is low, and the data are distributed on low-dimensional manifolds.37,38 Based on that idea, we attempted to reduce the dimension of the spectra by manifold learning and to visualise them. Multi-dimensional scaling (MDS) is one of the simplest dimensionality reduction algorithms and can represent high-dimensional data in a low-dimensional space by approximating the distances in the original space.39
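The idea behind MDS can be sketched with classical (Torgerson) MDS written in NumPy. This is a stand-in chosen for brevity and may differ from the MDS variant implemented in scikit-learn; the function name and the toy Gaussian spectra are assumptions and do not correspond to the simulated Mn spectra.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from a distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring    # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy spectra: single Gaussian peaks whose position changes gradually
energy = np.linspace(-5.0, 5.0, 101)
centres = [-1.0, -0.5, 0.0, 0.5, 1.0]
spectra = np.array([np.exp(-0.5 * (energy - c) ** 2) for c in centres])

# Pairwise Euclidean distances between the spectra
diff = spectra[:, None, :] - spectra[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

embedding = classical_mds(distances)
print(embedding.shape)  # one two-dimensional point per spectrum
```

Any other distance from this work could be substituted for the Euclidean distance when the distance matrix is built.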

In general, there are several intrinsic dimension estimation methods to estimate the optimal number of dimensions,40,41,42 although we do not put emphasis on them in this work. The spectra of Mn with various valences were calculated and, together with the experimentally obtained spectrum of MnO, represented in two dimensions by MDS. The results are shown in Fig. 2. The numbers in the figure represent the value of the crystal field parameter (eV) multiplied by 10. The valence of Mn was set as 2+, 3+, and 4+ with Oh symmetry. The ED was used as the distance metric for the sake of simplicity. As can be seen from Fig. 2, spectra with different valences are distinctly separated in the data space, and the distance between the spectra corresponds to the difference in the value of 10 Dq.

Fig. 2

Dimensionality reduction and visualisation results for Mn2,3,4+ X-ray absorption spectroscopy (XAS) spectra with multi-dimensional scaling (MDS). The numbers in this figure correspond to the value of the crystal field parameter (eV) multiplied by 10. The inset indicates the experimentally obtained XAS spectrum for MnO.16 The red dot corresponds to the experimentally obtained MnO XAS spectrum

The automated data analysis for XAS/EELS spectra using dimensionality reduction was validated with the experimentally obtained Mn XAS spectrum of MnO, which corresponds to Mn2+ and 10 Dq = 0.9 eV.16 The MnO XAS spectrum and the corresponding dimensionality reduction result (red dot) are plotted in the figure. The experimental point lies close to that of the simulated spectrum for Mn2+ and 10 Dq = 0.9 eV. This suggests that the estimation of a physical quantity (e.g. the charge or 10 Dq) could be realised by evaluating the distance between the spectra.

Comparison of the similarity measure

Although there are several ways to define the similarity of XAS and EELS spectra, we adopt the simplest one. We define the similarity of spectra as the similarity between the target spectrum and the standard spectrum, in this case, the simulated spectrum with a 10 Dq value of zero. The spectra of interest are the Mn2+ L2,3 XAS or EELS spectra of MnO. We compare the behaviour of each of the similarity measures as a function of the materials parameter 10 Dq. Figure 3 shows the similarity of MnO 2p XAS as a function of 10 Dq. The similarities are calculated between the spectra simulated with varying values of 10 Dq and the standard spectrum simulated with a 10 Dq value of zero. All of the measures except DTW were found to show a one-to-one relationship between the similarity and the materials parameter. As seen in Fig. 3, PCC, cosine, and JSD were insensitive to 10 Dq below 1.0 eV. If the estimated 10 Dq value is in the insensitive range, coupling with another measure could be expected to produce a good result.

Fig. 3

Similarity as a function of the crystal field parameter 10 Dq. The similarity of spectra is defined as the similarity between the target spectra and the reference spectrum, in this case, the simulated spectrum with a 10 Dq value of zero

Estimation of the materials parameter

We built a regression model to estimate the value of the materials parameter 10 Dq from the similarity of the spectra. Because the way in which the similarity changes with the materials parameter is not trivial, the regression model is built from the data of the similarity measure vs. the materials parameter. A regression model is built for each similarity measure with a polynomial function, where the degree of the polynomial is selected by the Akaike information criterion (AIC).43 The performance of the regression model is sufficient for the estimation of 10 Dq from the similarity.
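The model selection step can be sketched as follows. This is an illustration only: the similarity curve is a synthetic exponential with a small amount of noise rather than a value computed from simulated spectra, the function name is hypothetical, and the AIC is evaluated in its least-squares form, n ln(RSS/n) + 2k, with k counting the polynomial coefficients plus the error variance.

```python
import numpy as np


def fit_polynomial_by_aic(similarity, parameter, max_degree=6):
    """Fit parameter = poly(similarity); choose the degree with the lowest AIC."""
    s = np.asarray(similarity, dtype=float)
    t = np.asarray(parameter, dtype=float)
    n = s.size
    best = None
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(s, t, degree)
        rss = float(np.sum((np.polyval(coeffs, s) - t) ** 2))
        k = degree + 2  # polynomial coefficients plus the error variance
        aic = n * np.log(rss / n) + 2 * k
        if best is None or aic < best[0]:
            best = (aic, degree, coeffs)
    return best[1], best[2]


# Synthetic training data: 10 Dq from 0 to 2.5 eV in 0.1 eV steps
rng = np.random.default_rng(0)
ten_dq = np.linspace(0.0, 2.5, 26)
similarity = np.exp(-0.5 * ten_dq) + rng.normal(0.0, 0.002, ten_dq.size)

degree, coeffs = fit_polynomial_by_aic(similarity, ten_dq)

# Estimate 10 Dq for a "measured" similarity that corresponds to 0.9 eV
estimate = float(np.polyval(coeffs, np.exp(-0.5 * 0.9)))
print(degree, estimate)
```

The fitted polynomial maps a similarity value directly to an estimate of the materials parameter, so that the estimation for a new spectrum requires only one similarity calculation and one polynomial evaluation.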

The performance of the regression model for experimental data was validated by the experimentally obtained 2p XAS spectrum of MnO.16 The spectrum of Mn2+ reconstructed from the estimated value of 10 Dq of 0.9 eV with PCC similarity and the experimentally obtained MnO XAS spectrum are shown in Fig. 4. The figure shows that the spectrum predicted from the similarity measure of PCC corresponds well to the experimentally obtained spectrum. According to the literature,16 the value of 10 Dq estimated by human visual inspection is 0.9 eV, which corresponds well with the estimation from the regression model for PCC.

Fig. 4

Comparison of the experimentally obtained MnO X-ray absorption spectroscopy (XAS) spectrum, and the Mn2+ XAS spectrum calculated with the estimated value of 10 Dq (0.9 eV) from the regression model for Pearson's correlation coefficient (PCC)

We compared the performance of the similarity measures in the estimation of the 10 Dq value. The DTW measure was not used since its similarity was not determined uniquely by 10 Dq. All the similarity measures estimated the value of 10 Dq as ~1.0 eV. In particular, PCC and cosine correctly estimated the value of 10 Dq as 0.9 eV. The calculation time for the estimation is several milliseconds on a typical laptop computer, and we were able to estimate the materials parameters from more than 10,000 spectra taken by scanning transmission X-ray microscopy in a reasonably short time.

Therefore, it was demonstrated that the crystal field parameter 10 Dq can be estimated automatically and promptly by using the appropriate measures.

It should be noted that the appropriate similarity measure could be optimised automatically by distance metric learning, which has been studied recently and may also help to mitigate the insensitivity.44,45,46 We are currently working on the automated determination of appropriate similarity measures for a variety of measurement data from other materials characterisation techniques.

Robustness against noise

In high-throughput measurements, the influence of noise is the most significant factor owing to the short measurement time. Thus, similarity that is robust against noise is indispensable for these measurements.

We hence examined whether the similarity measures are robust against noise. We modelled the noise in XAS or EELS spectroscopy as Gaussian noise. Noise drawn from Gaussian distributions of varied variance was added to the calculated 2p XAS spectrum of Mn2+. The similarity with and without Gaussian noise is shown in Fig. 5.

Fig. 5

Result of robustness against the addition of noise. The noise in the X-ray absorption spectroscopy (XAS) or electron energy-loss spectra (EELS) spectroscopy is modelled as Gaussian noise. The signal-to-noise (S/N) ratio is defined as the ratio between the peak height of the true spectrum and the standard deviation of the noise. a Similarity of the spectra with Gaussian noise. Pearson's correlation coefficient (PCC) showed excellent robustness against the addition of noise. b Simulated Mn2+ XAS spectra with Gaussian noise

The signal-to-noise (S/N) ratio in Fig. 5 is defined as the ratio between the peak height of the true spectrum and the standard deviation of the noise. Obviously, PCC showed excellent robustness against the addition of noise. The results of ED and CD showed the same behaviour.

Using PCC, the similarity of the noisy spectrum with an S/N ratio of 30 was calculated at almost 1.0, whereas it was calculated at below 0.9 with the other measures. Particularly, the result with both ED and CD shows poor robustness against noise, despite the fact that these are commonly used measures. This result suggests that the measurement time can be significantly reduced if an appropriate similarity metric such as PCC is selected.
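The noise test can be reproduced in outline with the following sketch. The two-peak spectrum is a hypothetical stand-in for the calculated Mn2+ spectrum, the random seed is arbitrary, and the S/N ratio follows the definition used here (peak height of the true spectrum divided by the standard deviation of the noise).

```python
import math
import random


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def add_gaussian_noise(spectrum, snr, rng):
    """Add Gaussian noise with standard deviation = peak height / snr."""
    sigma = max(spectrum) / snr
    return [v + rng.gauss(0.0, sigma) for v in spectrum]


# Hypothetical noise-free spectrum with two Gaussian peaks
energy = [-5.0 + 0.05 * i for i in range(201)]
clean = [math.exp(-0.5 * ((e + 1.0) / 0.3) ** 2)
         + 0.6 * math.exp(-0.5 * ((e - 1.5) / 0.3) ** 2)
         for e in energy]

rng = random.Random(0)  # fixed seed for reproducibility
pcc_by_snr = {snr: pearson_correlation(clean, add_gaussian_noise(clean, snr, rng))
              for snr in (3, 10, 30, 100)}
for snr, pcc in pcc_by_snr.items():
    print(snr, round(pcc, 3))
```

The other measures can be tested in the same way by replacing `pearson_correlation` with the corresponding function.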

Robustness against peak broadening

In practical spectroscopy measurements, the energy resolution is one of the most important specifications of the measurement system. The ability to estimate the material parameters with equipment with poor energy resolution may lead to a significant reduction in the cost of an experiment. In this work we established an appropriate similarity measure that is robust against deteriorated energy resolution. We calculated the convolution of the XAS spectra with Gaussian functions of varied width. The similarity of the spectra as a function of the width of the Gaussian broadening is shown in Fig. 6. The standard deviation of the Gaussian function, σ, was varied in the range from 0.02 to 0.21 eV, and each spectrum was compared to the spectrum with σ = 0.02 eV, which represents the energy resolution of the measurement system. A good measure requires robustness to peak broadening such that it can be applied to a low-resolution measurement. As shown in Fig. 6, PCC, JSD, and cosine are more robust, whereas ED, CD, and DTW have poor robustness to broadening.

Fig. 6

Results of robustness against peak broadening. Similarities are plotted as a function of peak broadening (σ). The standard deviation of the Gaussian function, σ, was varied in the range from 0.02 to 0.21 eV, and compared to the spectrum with σ = 0.02 eV, which represents the energy resolution of the measurement system. It is shown that Pearson's correlation coefficient (PCC), Jensen–Shannon divergence (JSD), and cosine are more robust, whereas Euclidean distance (ED), city block distance (CD), and dynamic time warping (DTW) have poor robustness to broadening

This result suggests the importance of choosing an appropriate measure that enables the estimation of a materials parameter even from measurement systems with poor energy resolution.
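The broadening test can be sketched in the same spirit. The peak positions, intensities, and the intrinsic width of 0.2 eV are hypothetical values chosen for illustration, and the sketch uses the fact that convolving a Gaussian peak with a Gaussian of width σ gives a Gaussian whose width is the two widths added in quadrature, so no numerical convolution is needed. The toy numbers are not expected to reproduce Fig. 6.

```python
import math


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def broadened_spectrum(energy, peaks, intrinsic_width, sigma):
    """Sum of area-normalised Gaussian peaks after instrumental broadening."""
    width = math.sqrt(intrinsic_width ** 2 + sigma ** 2)  # widths add in quadrature
    norm = 1.0 / (width * math.sqrt(2.0 * math.pi))
    return [sum(intensity * norm * math.exp(-0.5 * ((e - centre) / width) ** 2)
                for centre, intensity in peaks)
            for e in energy]


energy = [0.01 * i for i in range(501)]         # 0 to 5 eV
peaks = [(1.5, 1.0), (2.5, 0.6), (3.7, 0.4)]    # hypothetical (position, intensity)
reference = broadened_spectrum(energy, peaks, 0.2, 0.02)

pcc_by_sigma = {sigma: pearson_correlation(reference,
                                           broadened_spectrum(energy, peaks, 0.2, sigma))
                for sigma in (0.02, 0.11, 0.21)}
for sigma, pcc in pcc_by_sigma.items():
    print(sigma, round(pcc, 3))
```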

Discussion

PCC shows the best performance among the similarity measures for the estimation of 10 Dq, robustness against spectral broadening, and noise. It should be noted that robustness against noise is a very important property required for a similarity measure.

PCC can be regarded as the cosine similarity between data vectors from which their mean values have been subtracted, and it should therefore be robust against fluctuations in the baseline of the spectra caused by noise. In the case of noisy spectra, the real signal components become smaller when the spectra are normalised. The cosine similarity measures the angle between vectors, and the length of the vectors does not affect it. From this point of view, both PCC and cosine similarity should be robust against noise.

We focused on the estimation of materials parameters from Mn XAS/EELS spectra in this study; however, the proposed approach is extensible to a wide range of XAS/EELS spectra. A large open database with 500,000 K-edge X-ray absorption near-edge spectra for more than 40,000 unique materials has recently become available.47 As the next step, we will combine our approach, based on dimensionality reduction and an appropriate similarity measure for XAS, with this large XAS dataset to realise automated knowledge discovery from XAS/EELS data measured in high-throughput experiments.

It should be noted that there is at present no generalisable approach to choosing the appropriate measure for an unknown materials parameter. We think that the approach proposed in this study can be automated and applied to choosing the appropriate measure even for an unknown materials parameter. Another approach, called distance metric learning or similarity learning, constructs a new similarity function or distance metric for an unknown materials parameter by learning from experimental or simulated datasets.

In many high-throughput experiments, the most important outcome is not the acquired data themselves but the material parameters (e.g. the electronic structure and the lattice parameter) obtained by analysing them. The measurement time should therefore be reduced to the minimum that is necessary and sufficient for the desired parameters to be extracted. For this purpose, it is necessary to coordinate the experimental measurements with the data analysis; however, this coordination is still under development.8,22 This study identified measures that are robust to noise and to a deterioration of resolution and that are suited to high-throughput measurement. This result is the basis for a technique that makes it possible to extract a material parameter on the fly during the measurement. In future, our result is expected to contribute to the realisation of true high-throughput materials discovery, which integrates high-throughput fabrication, characterisation and on-the-fly data analysis.

It is important to point out that the use of similarity measures allows the number of measurement points in an experiment to be reduced, which significantly accelerates the characterisation of materials and the automated extraction of material properties, both of which are essential for materials informatics. We are now working on this problem and have made notable progress.5,48

Method

Simulation of XAS/EELS spectra

We used CTM4XAS for the simulation of XAS/EELS spectra.12 The dataset for the XAS spectra was prepared by calculating the Mn 2p XAS spectra with Mn2,3,4+ configuration by changing the 10 Dq value from 0 to 2.5 eV. Since the spectra of interest are the L2,3 XAS or EELS spectra of MnO, we set the symmetry of the crystal field to be octahedral (Oh) and tetrahedral (Td), and the other parameters used in this study are identical to those published before.17 The dataset used in this study is available at https://doi.org/10.5281/zenodo.2532856.

Dimensionality reduction

There are a number of dimensionality reduction or manifold learning algorithms, such as Isomap, locally linear embedding, Laplacian eigenmaps, and t-distributed stochastic neighbour embedding.37 Among them, MDS is the simplest algorithm. In order to clarify the validity of dimensionality reduction for spectroscopy data, we employed this simplest algorithm. The dimensionality reduction was performed with the scikit-learn package for Python.49 The code for dimensionality reduction can be found in the repository.

Regression model

The similarity or distance between spectra is calculated using R with the packages proxy, transport, and dtw.50 The regression model is built for each similarity measure with a polynomial function, where the degree of the polynomial is selected by AIC.51 The training data for the regression are the similarities of the Mn2+ 2p XAS spectra with 10 Dq values from 0 to 2.5 eV in steps of 0.1 eV. To validate the regression model, we use the experimentally obtained XAS spectrum for MnO, which was scanned from the literature.16 The best regression model is selected with the MuMIn package in R using automated information-theoretic model selection with AIC. The code for the regression model can be found in the repository.