Introduction

Experimental design that reduces the cost of experiments is a fundamental challenge across science and industry and has been extensively studied1. Sequential experimental design, which selects measurement points one at a time, has been realized by active learning2.

In spectral experiments, two active learning approaches have primarily been evaluated. One is to use a Gaussian process regression (GPR) model as the learning model3,4,5,6,7,8. Because this approach is model-agnostic, it can be applied to experiments without a formulated physical model. However, applying it to the parameter estimation of physical models can be challenging9. Another issue is how measurement noise is treated2.

The other approach is to fix a physical model before the experiment and use it as the learning model10,11,12,13. This approach is suitable for estimating the parameters of physical models, but it cannot be applied to experiments in which the physical model is not fixed in advance.

In practice, however, the analysis of experimental data proceeds by selecting a physical model from candidates and then estimating its parameters. To improve the efficiency of such experiments, active learning with model selection for parametric models is required. Active learning with model selection has been studied separately in various fields, such as linear regression14, labeling problems15, and kernel selection for GPR16. However, none of these methods is applicable to spectral experiments.

In this study, we propose an active learning method with model selection that uses multiple parametric models as learning models to improve model selection and parameter estimation in spectral experiments. First, the posterior distributions of the model and its parameters are computed; they are then used to select the next measurement point for model selection and parameter estimation. The posterior probabilities are approximated using the exchange Monte Carlo method17,18, which allows our method to be applied to complex physical models.

The results of the present study demonstrate the effectiveness of the proposed method for spectral deconvolution and Hamiltonian selection in X-ray photoelectron spectroscopy (XPS). In numerical experiments, our method improved the accuracy of model selection and parameter estimation while reducing the experiment time compared with experiments without active learning and with active learning using GPR.

Bayesian model selection and its parameter estimation

We consider the problem of selecting the physical model M from the candidates \(\mathcal {M} = \{M_1, \dots , M_K\}\) and estimating its parameter \(\theta _M\). Let \(D = \{x_i, y_i\}_{i=1}^N\) be the data, where \(x_i\) is the measurement point and \(y_i\) is the observed value. If the model M and its parameter \(\theta _M\) are given, the probability of the data D is given by

$$\begin{aligned} p(D | M, \theta _M) = \prod _{i=1}^N p(y_i | x_i, M, \theta _M), \end{aligned}$$
(1)

where the observed value \(y_i\) is assumed to be independently generated.

From Bayes’ theorem, the posterior probability of model M and its parameter \(\theta _M\) is given by

$$\begin{aligned}&p(M | D) = \frac{\int p(D|\theta _M,M)p(\theta _M)p(M) d \theta _M}{\sum _{M\in \mathcal {M}} \int p(D|\theta _M,M)p(\theta _M)p(M) d \theta _M}, \end{aligned}$$
(2)
$$\begin{aligned}&p(\theta _M|D,M) = \frac{p(D|\theta _M,M)p(\theta _M)}{\int p(D|\theta _M,M)p(\theta _M)d \theta _M}, \end{aligned}$$
(3)

where p(M) and \(p(\theta _M)\) are the prior probabilities of model M and its parameter \(\theta _M\), respectively. The numerical computation of these posterior distributions can be realized by the exchange Monte Carlo method17,18.
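For intuition, the marginal likelihoods appearing in Eqs. (2) and (3) can be approximated for a toy one-dimensional parameter by simple numerical integration. The sketch below is illustrative only, with assumed model definitions and priors of our own; the paper itself uses the exchange Monte Carlo method for realistic models.

```python
import numpy as np
from scipy.stats import poisson

# Toy illustration of Eqs. (2)-(3): two candidate Poisson models for count
# data, each with a single rate parameter that is integrated out on a grid.
# (Illustrative only; the paper uses the exchange Monte Carlo method.)

def log_likelihood(y, rate):
    # Eq. (1): independent Poisson observations sharing one rate.
    return poisson.logpmf(y, rate).sum()

def log_marginal_likelihood(y, rate_grid, log_prior):
    # \int p(D|theta, M) p(theta) dtheta, evaluated by the trapezoidal rule.
    log_integrand = np.array([log_likelihood(y, r) for r in rate_grid]) + log_prior
    m = log_integrand.max()
    return m + np.log(np.trapz(np.exp(log_integrand - m), rate_grid))

y = np.array([3, 5, 4, 6, 2])                  # observed counts D
grid = np.linspace(0.1, 20.0, 400)

# M1: uniform prior on [0.1, 20]; M2: exponential prior favoring small rates.
logZ1 = log_marginal_likelihood(y, grid, np.full(grid.shape, -np.log(19.9)))
logZ2 = log_marginal_likelihood(y, grid, -grid - np.log(np.trapz(np.exp(-grid), grid)))

# Eq. (2) with equal model priors p(M1) = p(M2) = 1/2.
post_M1 = 1.0 / (1.0 + np.exp(logZ2 - logZ1))
print(f"p(M1|D) = {post_M1:.3f}, p(M2|D) = {1.0 - post_M1:.3f}")
```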

Bayesian active learning with model selection for parametric models

The objective of the active learning is to maximize the estimation accuracy of model M and its parameter \(\theta _M\) by sequentially selecting the next measurement point. In this study, we propose an active learning method to select the next measurement point based on two criteria: the expected improvement of the parameter estimation and that of the model selection (Fig. 1). The detailed equation transformations are given in the supplementary materials.

Figure 1
figure 1

Criteria for active learning. (a) Parameter estimation. The gray line represents data D and \(\theta _1, \theta _2\) follow \(p(\theta _M|D,M)\). \(\widetilde{\mathcal {I}}_M(x)\) corresponds to the difference between \(f_M(x;\theta _1)\) and \(f_M(x;\theta _2)\) integrated numerically over \(p(\theta _M|D,M)\). (b) Model selection. The gray line represents data D and \(\theta _1, \theta _2\) follow \(p(\theta _{M_s}|D,M_s), p(\theta _{M_c}|D,M_c)\), respectively. \(\widetilde{\mathcal {I}}_{s,c}(x)\) corresponds to the difference between \(f_{M_s}(x;\theta _1)\) and \(f_{M_c}(x;\theta _2)\) integrated numerically over \(p(\theta _{M_s}|D,M_s)\) and \(p(\theta _{M_c}|D,M_c)\).

Active learning criterion for parameter estimation

When \(\{x,y\}\) is added to the data D, the information gain of the posterior distribution of the parameter \(\theta _M\) is represented by

$$\begin{aligned} \mathcal {J}_M(x; y) = H(p(\theta _M|D,M)) - H(p(\theta _M|D\cup \{x,y\},M)), \end{aligned}$$
(4)

where H(p) is the entropy of \(p(\cdot )\). Therefore, the expected gain provided by x is

$$\begin{aligned} \mathcal {I}_M(x)&= \int \mathcal {J}_M(x; y) p(y|x, D, M) d y \end{aligned}$$
(5)
$$\begin{aligned}&= \int _\Theta KL (p_{x,\theta _M}||p_{x,D})p_D(\theta _M)d \theta _M, \end{aligned}$$
(6)

where \(p_{x,\theta }(y) = p(y|x,\theta _M,M)\), \(p_D(\theta _M) = p(\theta _M|D,M)\), \(p_{x,D}(y) = p(y|x,D,M) = \int p(y|x,\theta _M, M)p_D(\theta _M)d \theta _M\), and \(KL (p||q)\) is the Kullback–Leibler (KL) divergence between p and q19. From the convexity of the KL divergence, \(\mathcal {I}_M(x)\) is bounded as follows:

$$\begin{aligned} \mathcal {I}_M(x)&\le \int \int KL (p_{x,\theta _M}||p_{x,\theta _M'})p_D(\theta _M)p_D(\theta _M')d \theta _M' d \theta _M \end{aligned}$$
(7)
$$\begin{aligned}&= \widetilde{\mathcal {I}}_M(x) \end{aligned}$$
(8)

When the model M is expressed as

$$\begin{aligned} p(y|x,\theta _M,M)&= Poisson (y;f_{M}(x;\theta _M)) \end{aligned}$$
(9)
$$\begin{aligned}&= \frac{f_M(x;\theta _M)^y\exp (-f_M(x;\theta _M))}{y!}, \end{aligned}$$
(10)

the KL divergence between \(p_{x,\theta _M}\) and \(p_{x,\theta _M'}\) is calculated as follows:

$$\begin{aligned} KL (p_{x,\theta _M}||p_{x,\theta _M'}) = f_M(x;\theta _M') - f_M(x;\theta _M) + f_M(x;\theta _M)\log \frac{f_M(x;\theta _M)}{f_M(x;\theta _M')}, \end{aligned}$$
(11)

where \(f_M(x;\theta _M)\) is the physical model and y follows a Poisson distribution. By setting \(M = \widehat{M} = \mathop {argmax }_{M \in \mathcal {M}}\, p(M|D)\), \(\widetilde{\mathcal {I}}_M(x)\) can be calculated numerically with \(p_D(\theta _M)\), which is obtained by the exchange Monte Carlo method.

Therefore, we consider selecting the next measurement point x that maximizes \(\widetilde{\mathcal {I}}_M(x)\).
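A minimal sketch of this selection rule is given below, assuming that posterior samples of \(\theta _M\) (e.g., from the exchange Monte Carlo run) are available and that a function returning the Poisson mean \(f_M(x;\theta _M)\) is provided; all names are illustrative.

```python
import numpy as np

def poisson_kl(lam_p, lam_q):
    # Eq. (11): KL(Poisson(lam_p) || Poisson(lam_q)).
    return lam_q - lam_p + lam_p * np.log(lam_p / lam_q)

def utility_parameter(x_grid, f_M, theta_samples):
    """Monte Carlo estimate of the bound in Eqs. (7)-(8).

    x_grid:        candidate measurement points, shape (P,)
    f_M:           physical model; f_M(x_grid, theta) -> Poisson means, shape (P,)
    theta_samples: posterior samples of theta_M (e.g., from exchange Monte Carlo)
    """
    means = np.stack([f_M(x_grid, th) for th in theta_samples])  # (S, P)
    # Average the pairwise KL over all (theta, theta') sample pairs,
    # which approximates the double integral in Eq. (7).
    return poisson_kl(means[:, None, :], means[None, :, :]).mean(axis=(0, 1))

# The next measurement point is x_grid[np.argmax(utility_parameter(...))].
```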

Active learning criterion for model selection

The aforementioned criterion improves the accuracy of parameter estimation when \(\widehat{M}\) is the true model. Here, we consider a criterion that helps \(\widehat{M}\) become the true model. When the data are few, a low signal-to-noise ratio can make complex structures in spectral data less discernible, leading to a higher likelihood of selecting simpler models20. Therefore, we consider a criterion that selects samples favoring the more complex model.

Let \(M' = \mathop {argmax }_{M \ne \widehat{M}}\,p(M|D)\) be the second-best model, let \(\{M_s, M_c\} = \{\widehat{M}, M'\}\) be the two competing models, and let \(M_s\) have a smaller parameter dimension than \(M_c\) (specifically, if \(\widehat{M}\) is simpler than \(M'\), then \(M_s = \widehat{M}\) and \(M_c = M'\); otherwise, \(M_s = M'\) and \(M_c = \widehat{M}\)). We consider the following criterion to make \(p(M_c|D \cup \{x,y\} )\) larger than \(p(M_s|D \cup \{x,y\} )\):

$$\begin{aligned} \mathcal {I}_{s,c}(x)&= \int \log \frac{p(M_c|D \cup \{x,y\} )}{p(M_s|D \cup \{x,y\} )}p(y|x,D) d y \end{aligned}$$
(12)
$$\begin{aligned}&= \int \log \frac{p(y|x,D,M_c)}{p(y|x,D,M_s)}p(y|x,D,M_c)dy + C, \end{aligned}$$
(13)

where C is a constant independent of x.

The ratio \(\frac{p(y|x,D,M_c)}{p(y|x,D,M_s)}\) is referred to as the Bayes factor, a concept well explored in Bayesian decision theory21,22.

From the convexity of the KL divergence, \(\mathcal {I}_{s,c}(x)\) is bounded as follows:

$$\begin{aligned}&\mathcal {I}_{s,c}(x) - C \nonumber \\&\le \int \int KL (p_{x,\theta _{M_c}}||p_{x,\theta _{M_s}})p_D(\theta _{M_c})p_D(\theta _{M_s})d \theta _{M_c}d \theta _{M_s} \end{aligned}$$
(14)
$$\begin{aligned}&= \widetilde{\mathcal {I}}_{s,c}(x) \end{aligned}$$
(15)

\(\widetilde{\mathcal {I}}_{s,c}(x)\) can be calculated with \(p_D(\theta _{M_c}),p_D(\theta _{M_s})\), which are obtained by the exchange Monte Carlo method.

Therefore, the next measurement point x that maximizes \(\widetilde{\mathcal {I}}_{s,c}(x)\) is also selected.
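The model-selection bound can be estimated in the same way from the two posterior sample sets. The following is a hedged sketch reusing poisson_kl from the earlier block; all names are illustrative.

```python
import numpy as np

def utility_model_selection(x_grid, f_Mc, f_Ms, samples_c, samples_s):
    # Monte Carlo estimate of Eqs. (14)-(15): the average KL divergence between
    # the Poisson predictions of the complex model M_c and the simple model M_s,
    # evaluated over the two posterior sample sets.
    means_c = np.stack([f_Mc(x_grid, th) for th in samples_c])  # (Sc, P)
    means_s = np.stack([f_Ms(x_grid, th) for th in samples_s])  # (Ss, P)
    return poisson_kl(means_c[:, None, :], means_s[None, :, :]).mean(axis=(0, 1))
```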

Spectral deconvolution

Our proposed method was applied to spectral deconvolution in XPS, which poses the challenge of estimating the number of peaks and their parameters20.

Problem setting

Let \(M_K\) be a model with K peaks, let the parameter set be \(\theta _{M_K} = \{\{a_k, \mu _k, \sigma _k\}_{k = 1}^K ,B\}\), and let the physical model be \(f_{M_K}(x;\theta _{M_K}) = \sum _{k = 1}^{K} a_k\exp \left( -\frac{(x-\mu _k)^2}{2\sigma _k^2}\right) + B\), where \(a_k,\mu _k,\sigma _k\), and B correspond to the peak intensity, peak position, peak width, and background intensity, respectively. Since the measurement is performed by photon counting in XPS, the probability distribution of the number of observed photons \(p(y|f_{M}(x;\theta _M))\) is \(Poisson (y;f_{M}(x;\theta _M)\times T)\) with measurement time T.
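As a concrete illustration of this forward model, a minimal sketch follows; the parameter packing convention is our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_MK(x, theta):
    # Gaussian peaks on a constant background: theta = (a, mu, sigma, B),
    # where a, mu, sigma are length-K arrays and B is a scalar.
    a, mu, sigma, B = theta
    x = np.atleast_1d(x)[:, None]
    return (a * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))).sum(axis=1) + B

def observe(x, theta, T):
    # Photon counting: y ~ Poisson(f_MK(x; theta) * T) for measurement time T.
    return rng.poisson(f_MK(x, theta) * T)
```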

Detailed algorithm for spectral deconvolution

To apply our method, a set of candidate models must be given in advance; however, in Bayesian spectral deconvolution, the number of peaks K can be any positive integer. Therefore, we change the candidate model set sequentially.

We define the initial model set as \(\mathcal {M} = \{M_1,M_2,M_3\}\). At each step, let \(\widehat{K}\) be the number of peaks of the best predicted model \(\widehat{M}\) (\(\widehat{M} = M_{\widehat{K}}\)). The following model set is then used in the next estimation:

$$\begin{aligned} \mathcal {M} = \left\{ \begin{array}{ll} \{M_1,M_2,M_3\} &{} (\widehat{K} = 1)\\ \{M_{\widehat{K} - 1},M_{\widehat{K}},M_{\widehat{K}+ 1}\}. &{} (otherwise ) \end{array} \right. \end{aligned}$$
(16)

In addition, in the spectral measurement, a short initial measurement is performed first, followed by longer measurements. The specific algorithm that takes these considerations into account is shown in Algorithm 1.

Algorithm 1
figure a

Sequential experiment for Bayesian spectral deconvolution.
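Since Algorithm 1 is available only as an image, the following is a hedged reconstruction of its main loop from the text: a short initial scan over all candidate points, repeated posterior computation by the exchange Monte Carlo method, the model-set update of Eq. (16), and acquisition with the two criteria. The helpers measure and run_emc and the attributes of post are hypothetical, and utility_parameter and utility_model_selection are the earlier sketches.

```python
import numpy as np

def sequential_deconvolution(x_grid, T, n_rounds, measure, run_emc):
    # Hedged reconstruction of Algorithm 1 (the published listing is an image).
    # `measure(x, T)` returns an observed count; `run_emc(D, models)` is assumed
    # to return per-model posterior quantities: post.prob[K], post.samples[K],
    # and post.f[K] (the physical model f_{M_K}). These names are hypothetical.
    D = [(x, measure(x, T)) for x in x_grid]           # short initial scan
    models = [1, 2, 3]                                 # initial candidate set
    for _ in range(n_rounds):
        post = run_emc(D, models)                      # exchange Monte Carlo
        K_hat = max(models, key=lambda K: post.prob[K])
        K2 = max((K for K in models if K != K_hat), key=lambda K: post.prob[K])
        Ks, Kc = sorted((K_hat, K2))                   # fewer peaks = simpler
        # One measurement per criterion (utility_* are the earlier sketches).
        x1 = x_grid[np.argmax(utility_parameter(
            x_grid, post.f[K_hat], post.samples[K_hat]))]
        x2 = x_grid[np.argmax(utility_model_selection(
            x_grid, post.f[Kc], post.f[Ks], post.samples[Kc], post.samples[Ks]))]
        D += [(x1, measure(x1, T)), (x2, measure(x2, T))]
        # Eq. (16): recenter the candidate model set on the current best K.
        models = [1, 2, 3] if K_hat == 1 else [K_hat - 1, K_hat, K_hat + 1]
    return D, post
```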

Conventional methods

We compare our method with the following two conventional methods.

Passive learning

Passive learning allocates the same measurement time to every measurement point. This is the most common approach in spectral experiments.

Active learning with GPR

In active learning with GPR, a GPR model is used as the learning model, and the next measurement point is the one that maximizes the expected improvement of the estimate of the measured value. The detailed algorithm is given in the supplementary materials.
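For reference, a minimal GPR-based acquisition in scikit-learn is sketched below; here the predictive standard deviation is used as a stand-in acquisition value, whereas the paper's exact criterion is defined in its supplementary materials.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def next_point_gpr(X_obs, y_obs, x_grid):
    # Fit a GPR model to the observations and pick the candidate point with the
    # largest predictive uncertainty, a common acquisition rule used here as a
    # stand-in; the paper's exact criterion is in its supplementary materials.
    gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gpr.fit(np.asarray(X_obs).reshape(-1, 1), np.asarray(y_obs))
    _, std = gpr.predict(np.asarray(x_grid).reshape(-1, 1), return_std=True)
    return x_grid[int(np.argmax(std))]
```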

Result

Let the true model be the model \(M_3\) with \(K=3\) peaks, and the true values of the parameters \(\theta _{M_3}^* = \{ \{a_k^*,\mu _k^*,\sigma _k^*\}_{k=1}^3,B^*\}\) be as follows:

$$\begin{aligned} \begin{pmatrix} a_1^*\\ a_2^*\\ a_3^*\\ \end{pmatrix}&= \begin{pmatrix} 0.587\\ 1.522\\ 1.183\\ \end{pmatrix} ,\ \begin{pmatrix} \mu _1^*\\ \mu _2^*\\ \mu _3^*\\ \end{pmatrix} = \begin{pmatrix} 161.032\\ 161.852\\ 162.677\\ \end{pmatrix} , \end{aligned}$$
(17)
$$\begin{aligned} \begin{pmatrix} \sigma _1^*\\ \sigma _2^*\\ \sigma _3^*\\ \end{pmatrix}&= \begin{pmatrix} 0.341\\ 0.275\\ 0.260\\ \end{pmatrix},\ B^* = 0.1. \end{aligned}$$
(18)

The modeling function \(f_{M_3}(x;\theta _{M_3}^*)\) is shown in Fig. 2.

Figure 2
figure 2

Value of the modeling function \(f_{M_3}(x;\theta _{M_3}^*)\) and an example of the observed data \(\{x_i,y_i\}_{i=1}^{400}\) when \(T = 6\).

Let the measurement time for one measurement in active learning be \(T=1\), the number of measurement points per experiment be \(n = 10\), and the candidate set of measurement points be \(\mathcal {X} = \{157 + 0.025(i-1)\ (eV) \}_{i=1}^{400}\). The prior distributions are given in the supplementary materials.

The flow of the measurement is shown in Fig. 3. At first, the signal-to-noise ratio is poor at all points. However, as the experiment progresses, the signal-to-noise ratio near the peaks improves owing to focused measurements. The data and the fit by the MAP estimator (\(\hat{\theta }_{M_3} = \{\{\hat{a}_k, \hat{\mu }_k, \hat{\sigma }_k\}_{k = 1}^3 ,\hat{B}\} = \underset{\theta }{\text {argmax }}\,p(\theta | D, M_3)\)) when the total measurement time is 2400 are shown in Fig. 4 (the parameter indices are ordered so that \(\mu _1<\mu _2<\mu _3\)). This figure shows that the proposed method concentrates measurements near the peaks, which are considered important in spectral deconvolution.

Figure 3
figure 3

Flow of the proposed method for spectral deconvolution. The upper figure shows the observed values per measurement time \(\bar{y}_i = \frac{\sum _{x_j = x_i} y_j}{t_i}\), and the lower figure shows the total measurement time per measurement point \(t_i = \#\{j|x_j = x_i\}\times T\). Although the signal-to-noise ratio of the initial data is poor at all measurement points, the signal-to-noise ratio of the data near the peak is improved by repeating the experiments.

Figure 4
figure 4

Data and fitting obtained by experiments on spectral deconvolution. The upper figure shows the number of photons observed per measurement time \(\bar{y}_i\) and the fit by the MAP estimator. The lower figure shows the total measurement time per measurement point. (a,d) Passive learning. (b,e) Active learning with GPR. (c,f) Proposed method.

In addition, we calculated \(p(K = 3 | D)\) and \(p(\theta _{M_3}|D, M_3)\) when the total measurement time is \(\{ 400 + 100i\}_{i=0}^{36}\). Figure 5A shows the result of the model selection. Active learning with GPR does not improve the model selection because of the strong measurement noise. In contrast, our method improves the model selection compared with passive learning. Figure 5B shows the 99% credible intervals of the estimated peak positions \(\mu _1, \mu _2, \mu _3\). Our method narrowed the interval widths and thus improved the parameter estimation.

Figure 5
figure 5

(A) Model selection results. The horizontal axis is the total measurement time; the vertical axis, the probability of the true model; the blue line, the result of passive learning; the green line, the result of active learning with GPR; and the orange line, the result of our method. (B) 99% credible interval of the parameter estimation of peak positions \(\mu _1, \mu _2, \mu _3\). The horizontal axis is the total measurement time. The gray area represents the result of passive learning; the colored area, the result of our method; and the dotted lines, the true value of \(\mu _1, \mu _2, \mu _3\).

Moreover, the results of 10 independent measurements were compared: (a) passive learning with a total measurement time of 2400, (b) the proposed method with a total measurement time of 2400, and (c) passive learning with a total measurement time of 7200. Figure 6 shows the results of the model selection, and Fig. 7 shows the parameter estimation accuracy. Here, we define the parameter estimation accuracies \(W_{\mu _1}, W_{\mu _2}, W_{\mu _3}\) for \(\mu _1,\mu _2,\mu _3\) as follows:

$$\begin{aligned} W_{\mu _i} = \max _{\alpha \in [0.005,0.995]}|\mu _i^* - \mu _{i,\alpha }|, \end{aligned}$$
(19)

where

$$\begin{aligned} \mu _{i,\alpha }&= \min _{\mu } \left\{ \left( \int _{\mu _i < \mu }p(\mu _i|D,M_3)d \mu _i\right) > \alpha \right\} . \end{aligned}$$
(20)
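Because the quantile \(\mu _{i,\alpha }\) is monotone in \(\alpha\), the maximum over \(\alpha \in [0.005,0.995]\) is attained at one of the two endpoints, so \(W_{\mu _i}\) can be computed from posterior samples as in the small sketch below; names are illustrative.

```python
import numpy as np

def credible_width(mu_samples, mu_true, lo=0.005, hi=0.995):
    # Eqs. (19)-(20): W = max over alpha in [lo, hi] of |mu_true - mu_alpha|,
    # where mu_alpha is the alpha-quantile of the posterior. The quantile is
    # monotone in alpha, so the maximum is attained at an endpoint.
    q_lo, q_hi = np.quantile(mu_samples, [lo, hi])
    return max(abs(mu_true - q_lo), abs(mu_true - q_hi))
```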

Both results show that our method improved the estimation accuracy and shortened the measurement time.

Figure 6
figure 6

Bar graphs of \(p(M_1|D),p(M_2|D),p(M_3|D),p(M_4|D)\) for the 10 independent trials. (a) Passive learning with a total measurement time of 2400. (b) Proposed method with a total measurement time of 2400. (c) Passive learning with a total measurement time of 7200.

Figure 7
figure 7

Boxplots of the parameter estimation accuracy of the peak positions. The left, middle, and right panels show \(W_{\mu _1}\), \(W_{\mu _2}\), and \(W_{\mu _3}\), respectively. (a) Passive learning with a total measurement time of 2400. (b) Proposed method with a total measurement time of 2400. (c) Passive learning with a total measurement time of 7200.

Hamiltonian selection

The Hamiltonian selection in XPS23 was also considered in this study.

Problem setting

Let \(M_2\) be a model using a two-state Hamiltonian \(H_2\), let \(M_3\) be a model using a three-state Hamiltonian \(H_3\), and let \(\mathcal {M} = \{M_2,M_3\}\) be the set of candidate models. Let \(\theta _{M_2} = \{\Delta , V, \Gamma , U_{fc},b\}\) and \(\theta _{M_3} = \{\Delta , V, \Gamma , U_{fc},U_{ff}, b\}\). The physical models \(f_{M_2}(x;\theta _{M_2})\) and \(f_{M_3}(x;\theta _{M_3})\) are given in the supplementary materials. As the measurement is performed by photon counting in XPS, the probability distribution of the number of observed photons \(p(y|f_{M}(x;\theta _M))\) is \(Poisson (y;f_{M}(x;\theta _M)\times T)\) with measurement time T.

Detailed algorithm for Hamiltonian selection

Unlike in the case of spectral deconvolution, the model set \(\mathcal {M} = \{M_2,M_3\}\) is fixed. The specific algorithm is shown in Algorithm 2.

Algorithm 2
figure b

Sequential experiment for Bayesian Hamiltonian selection.

Conventional methods

We compare our method with passive learning and active learning with GPR as in the case of spectral deconvolution.

Result

Let the true model be the model \(M_3\) with \(H_3\) and the true values of its parameters be as follows:

$$\begin{aligned} \Delta ^*&= 7.66,\ V^* = 0.76,\ U_{ff}^* = 10.5, \end{aligned}$$
(21)
$$\begin{aligned} U_{fc}^*&= 12.7,\ \ \Gamma ^* = 0.7,\ \ \ \ \ b^* = 0. \end{aligned}$$
(22)

These true parameter values are taken from ref.23. The physical function \(f_{M_3}(x;\theta _{M_3}^*)\) with the true parameters \(\theta _{M_3}^* = \{\Delta ^*, V^*, \Gamma ^*, U_{fc}^*,U_{ff}^*, b^*\}\) is shown in Fig. 8. The peak around \(x=5\) is small, which makes the selection between model \(M_2\), which generates two peaks, and model \(M_3\), which generates three peaks, difficult. Let the measurement time for one measurement in active learning be \(T=1\), the number of measurement points per experiment be \(n = 10\), and the candidate set of measurement points be \(\mathcal {X} = \{-30 + 0.125(i-1)\}_{i=1}^{400}\). The prior distributions are given in the supplementary materials.

Figure 8
figure 8

Plot of the modeling function \(f_{M_3}(x;\theta _{M_3}^*)\) and the example of the observed data \(\{x_i,y_i\}_{i=1}^{400}\) when \(T = 25\). The peak around \(x=5\) is small, indicating the difficulty of the model selection.

The flow of the measurement is shown in Fig. 9. At first, the signal-to-noise ratio is poor at all points. However, as the experiment progresses, the signal-to-noise ratio near the peaks improves owing to focused measurements. The data and the fit by the MAP estimator (\(\hat{\theta }_{M_3} = \underset{\theta }{\text {argmax }}\,p(\theta | D, M_3)\)) when the total measurement time is 10,000 are shown in Fig. 10. This figure shows that the proposed method concentrates on the area near the peaks, particularly near the small peak around \(x=5\).

Figure 9
figure 9

Flow of the proposed method for Hamiltonian selection. The upper figure shows the observed values per measurement time \(\bar{y}_i = \frac{\sum _{x_j = x_i} y_j}{t_i}\), and the lower figure shows the total measurement time per measurement point \(t_i = \#\{j|x_j = x_i\}\times T\). It can be observed that the area near the peaks, particularly near the small peak around \(x=5\), is measured intensively.

Figure 10
figure 10

Data and fitting obtained by experiments on the Hamiltonian selection. The upper figure shows the number of photons observed per measurement time \(\bar{y}_i\) and the fit by the MAP estimator. The lower figure shows the total measurement time per measurement point. (a,d) Passive learning. (b,e) Active learning with GPR. (c,f) Proposed method.

In addition, we calculated \(p(M_3|D)\) and \(p(\theta _{M_3}|D, M_3)\) when the total measurement time is \(\{ 400 + 300i\}_{i=0}^{32}\). Figure 11A shows the result of the model selection. As in the previous section, active learning with GPR did not improve the model selection because of the strong measurement noise. In contrast, our method improved the model selection compared with passive learning. Figure 11B shows the 99% credible intervals of the estimated parameters \(\Delta , \Gamma , U_{fc}\). Our method narrowed the interval widths and improved the parameter estimation.

Figure 11
figure 11

(A) Model selection results. The horizontal axis is the total measurement time; the vertical axis is the probability of the true model; the blue line is the result of passive learning; the green line is the result of active learning with GPR; and the orange line is the result of our method. (B) 99% credible intervals of the estimated parameters \(\Delta , \Gamma , U_{fc}\). The horizontal axis is the total measurement time. The gray area indicates the result of passive learning, while the colored area indicates the result of our method; the dotted lines represent the true values of \(\Delta , \Gamma , U_{fc}\).

Figure 12
figure 12

Bar graphs of \(p(M_2|D),p(M_3|D)\) for the 10 independent trials. (a) Passive learning with a total measurement time of 10,000. (b) Proposed method with a total measurement time of 10,000. (c) Passive learning with a total measurement time of 40,000.

Figure 13
figure 13

Boxplots of the parameter estimation accuracy of the Hamiltonian parameters. The left, middle, and right panels show \(W_{\Delta }\), \(W_{\Gamma }\), and \(W_{U_{fc}}\), respectively. (a) Passive learning with a total measurement time of 10,000. (b) Proposed method with a total measurement time of 10,000. (c) Passive learning with a total measurement time of 40,000.

Moreover, 10 independent measurements were performed, and the following results were compared: (a) passive learning with a total measurement time of 10,000, (b) the proposed method with a total measurement time of 10,000, and (c) passive learning with a total measurement time of 40,000. Figure 12 shows the results of the model selection, and Fig. 13 shows the parameter estimation accuracy. Here, we define the parameter estimation accuracies \(W_{\Delta }, W_{\Gamma }, W_{U_{fc}}\) for \(\Delta , \Gamma , U_{fc}\) as follows:

$$\begin{aligned} W_{\Delta }&= \max _{\alpha \in [0.005,0.995]}|\Delta ^* - \Delta _{\alpha }|, \end{aligned}$$
(23)
$$\begin{aligned} W_{\Gamma }&= \max _{\alpha \in [0.005,0.995]}|\Gamma ^* - \Gamma _{\alpha }|, \end{aligned}$$
(24)
$$\begin{aligned} W_{U_{fc}}&= \max _{\alpha \in [0.005,0.995]}|U_{fc}^* - U_{fc,\alpha }|, \end{aligned}$$
(25)

where

$$\begin{aligned} \Delta _{\alpha }&= \min _{\Delta } \left\{ \left( \int _{\Delta ' < \Delta }p(\Delta '|D,M_3)d \Delta '\right) > \alpha \right\} , \end{aligned}$$
(26)
$$\begin{aligned} \Gamma _{\alpha }&= \min _{\Gamma } \left\{ \left( \int _{\Gamma ' < \Gamma }p(\Gamma '|D,M_3)d \Gamma '\right) > \alpha \right\} , \end{aligned}$$
(27)
$$\begin{aligned} U_{fc,\alpha }&= \min _{U_{fc}} \left\{ \left( \int _{U_{fc}' < U_{fc}}p(U_{fc}'|D,M_3)d U_{fc}'\right) > \alpha \right\} . \end{aligned}$$
(28)

Both results show that our method improved the estimation accuracy and shortened the measurement time.

Conclusion and future work

We developed an active learning method that uses multiple parametric models as learning models to improve the accuracy of model selection and parameter estimation in spectral experiments. In our method, the next measurement points that are important for model selection and parameter estimation are selected using the posterior distributions of the model and its parameters. We applied our method to two spectral experiments, namely spectral deconvolution and Hamiltonian selection. In both experiments, the proposed method improved the model selection and the accuracy of parameter estimation compared with passive learning and active learning with GPR.

To apply our method to a broader range of actual spectral experiments, the following points must be considered. First, there is a concern about the computational cost of the proposed method. To reduce the actual experimental time using the proposed method, the computation time of the Monte Carlo method should be sufficiently small compared with the experiment time. However, in cases with a large number of parameters, the exploration range of the Monte Carlo method expands, leading to longer convergence times. Additionally, the computation time per iteration often scales proportionally with the number of measurement points. Therefore, in scenarios with many measurement points, such as high-dimensional spectral data, the computation time can increase significantly. To mitigate the computational cost, one approach is to employ the concept of sequential Monte Carlo methods24, using samples obtained from previous simulations to perform sampling from the new posterior distribution. Moreover, convergence can be improved by appropriately setting the prior distribution using prior knowledge about the experiment25.

Another challenge is adjusting the Monte Carlo parameters automatically. The Monte Carlo method has many parameters, and setting them for each experiment is difficult. Thus, an algorithm such as NUTS26 for adjusting the Monte Carlo parameters will be required.

Finally, our method is applicable only when the candidate models are known in advance. However, background functions often have limited prior knowledge, making modeling challenging in many cases. Such challenges can be addressed by using semi-parametric models27.