Bayesian active learning with model selection for spectral experiments

Active learning is a common approach to improving the efficiency of spectral experiments. Model selection from candidates and parameter estimation are often required in the analysis of spectral experiments. Therefore, we proposed an active-learning method with model selection that uses multiple parametric models as learning models. Points important for model selection and its parameter estimation were actively measured using the Bayesian posterior distribution. The present study demonstrated the effectiveness of our proposed method for spectral deconvolution and Hamiltonian selection in X-ray photoelectron spectroscopy.


Tomohiro Nabika 1, Kenji Nagata 2, Masaichiro Mizumaki 3, Shun Katakami 1 & Masato Okada 1*
Experimental design to reduce the cost of experiments is a fundamental challenge from science to industry and has been extensively studied 1. A sequential experimental design, which selects the measurement points sequentially, has been realized by active learning 2.
In spectral experiments, two active learning methods have been primarily evaluated. One method uses a Gaussian process regression (GPR) model as the learning model [3][4][5][6][7][8]. As this approach is model-agnostic, it can be applied to an experiment without a formulated physical model. However, applying it to the parameter estimation of physical models can be a challenge 9. Another issue is how it handles measurement noise 2.
The other method fixes a physical model before the experiment and uses it as the learning model [10][11][12][13]. This approach is suitable for the parameter estimation of physical models but cannot be applied to experiments where the physical model is not fixed in advance.
In practice, in the analysis of experimental data, a physical model is first selected from the candidates, and then its parameters are estimated. To improve the efficiency of such experiments, active learning with model selection for parametric models is required. Active learning with model selection has been studied separately in various fields, such as linear regression 14, labeling problems 15, and kernel selection for GPR 16. However, none of these is applicable to spectral experiments.
In this study, we propose an active-learning method with model selection that uses multiple parametric models as learning models to improve the model selection and parameter estimation in spectral experiments. First, the posterior distribution of the model and its parameters is calculated; then, it is used to select the next measurement points for model selection and parameter estimation. The posterior probabilities are approximated using the exchange Monte Carlo method 17,18, which allows our method to be applied to complex physical models.
The results of the present study demonstrate the effectiveness of the proposed method for spectral deconvolution and Hamiltonian selection in X-ray photoelectron spectroscopy (XPS). In the numerical experiments, our method improved the accuracy of model selection and parameter estimation while reducing the experiment time compared with experiments without active learning or with active learning using GPR.
Consider data D = {(x_i, y_i)}_{i=1}^N obtained by measuring a point x_i and observing a value y_i. For a model M with parameter θ_M, the likelihood is

p(D | θ_M, M) = ∏_{i=1}^N p(y_i | x_i, θ_M, M),

where the observed value y_i is assumed to be independently generated.
From Bayes' theorem, the posterior probability of model M and its parameter θ_M is given by

p(M, θ_M | D) = p(D | θ_M, M) p(θ_M) p(M) / p(D),

where p(M) and p(θ_M) are the prior probabilities of model M and its parameter θ_M, respectively. The numerical computation of these posterior distributions can be realized by the exchange Monte Carlo method 17,18.
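The posterior computation above can be illustrated with a small numerical sketch. The following is a minimal, grid-based stand-in for the exchange Monte Carlo method (the toy models, grids, and all names below are ours, not the paper's): it computes p(M|D) for two candidate models by marginalizing a Poisson likelihood over a uniform parameter grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: counts from a flat background of intensity 3.0 with T = 1.
x = np.linspace(0.0, 1.0, 20)
y = rng.poisson(3.0, size=x.size)

def log_poisson(y, lam):
    # log Poisson(y; lam), dropping the y!-term (constant across models)
    return (y * np.log(lam) - lam).sum()

# Model A: constant background b.  Model B: background b plus slope a*x.
b_grid = np.linspace(0.5, 10.0, 50)
a_grid = np.linspace(-2.0, 2.0, 21)

# Marginal likelihood p(D|M): mean of the likelihood over a uniform prior grid.
llA = np.array([log_poisson(y, np.full(x.size, b)) for b in b_grid])
llB = np.array([log_poisson(y, np.clip(b + a * x, 1e-9, None))
                for b in b_grid for a in a_grid])

def log_mean_exp(v):
    m = v.max()
    return m + np.log(np.mean(np.exp(v - m)))

logZA, logZB = log_mean_exp(llA), log_mean_exp(llB)
# With equal model priors p(M), the posterior odds equal the Bayes factor.
pA = 1.0 / (1.0 + np.exp(logZB - logZA))
print(f"p(M_A | D) = {pA:.3f}")
```

The exchange Monte Carlo method replaces the grid with tempered sampling, which scales to the high-dimensional parameter spaces of real physical models.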

Bayesian active learning with model selection for parametric models
The objective of the active learning is to maximize the estimation accuracy of the model M and its parameter θ_M by sequentially selecting the next measurement point. In this study, we propose an active-learning method that selects the next measurement point based on two criteria: the expected improvement of the parameter estimation and that of the model selection (Fig. 1). The detailed equation transformations are given in the supplementary materials.

Active learning criterion for parameter estimation
When {x, y} is added to the data D, the information gain of the posterior distribution of the parameter θ_M is represented by

H(p(θ_M | D, M)) − H(p(θ_M | D ∪ {x, y}, M)),

where H(p) is the entropy of p(·). Therefore, the expected gain provided by x is

I_M(x) = ∫ p_D(θ_M) KL(p_{x,θ}(y) || p_{x,D}(y)) dθ_M,

where p_{x,θ}(y) = p(y|x, θ_M, M), p_D(θ_M) = p(θ_M|D, M), p_{x,D}(y) = p(y|x, D, M) = ∫ p(y|x, θ_M, M) p_D(θ_M) dθ_M, and KL(p||q) is the Kullback-Leibler (KL) divergence between p and q 19. From the convexity of the KL divergence, I_M(x) is bounded as follows:

I_M(x) ≤ ∫∫ p_D(θ_M) p_D(θ′_M) KL(p_{x,θ}(y) || p_{x,θ′}(y)) dθ_M dθ′_M.

The model M is expressed as

p(y | x, θ_M, M) = Poisson(y; f_M(x; θ_M) × T),    (5)

where f_M(x; θ_M) is a physical model and y follows a Poisson distribution with measurement time T. By setting M = M̂ = argmax_M p(M|D), I_M(x) can be calculated numerically with p_D(θ_M), which is obtained by the exchange Monte Carlo method. Therefore, we consider selecting the next measurement point x that maximizes I_M(x).
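The bound on I_M(x) can be estimated by Monte Carlo averaging over pairs of posterior samples; for Poisson observations the inner KL divergence has the closed form KL(Poisson(λ_1) || Poisson(λ_2)) = λ_1 log(λ_1/λ_2) + λ_2 − λ_1. The sketch below assumes posterior samples are already available; the single-peak model and the jittered samples are illustrative stand-ins for exchange Monte Carlo output, and all names are ours.

```python
import numpy as np

def poisson_kl(l1, l2):
    # KL(Poisson(l1) || Poisson(l2)) in closed form
    return l1 * np.log(l1 / l2) + l2 - l1

def i_param(x, f, samples, T=1.0):
    """Monte Carlo estimate of the bound on I_M(x).
    f(x, theta): physical model; samples: posterior draws of theta."""
    lam = np.array([f(x, th) * T for th in samples])  # predictive rates
    l1, l2 = np.meshgrid(lam, lam)                    # all sample pairs
    return poisson_kl(l1, l2).mean()

# Hypothetical single-Gaussian-peak model used only for illustration.
def f(x, th):
    a, mu, sigma, B = th
    return a * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) + B

rng = np.random.default_rng(1)
# Stand-in for exchange-MC draws: jittered copies of a reference parameter.
samples = [(2.0 + rng.normal(0, 0.2), 0.5 + rng.normal(0, 0.05), 0.1, 1.0)
           for _ in range(50)]

xs = np.linspace(0.0, 1.0, 101)
gains = [i_param(x, f, samples) for x in xs]
x_next = xs[int(np.argmax(gains))]  # measure where the posterior disagrees most
print(f"next measurement point: {x_next:.2f}")
```

The selected point lies near the peak, where the posterior uncertainty in the peak parameters changes the predicted count rate most strongly.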

Active learning criterion for model selection
The aforementioned criterion improves the accuracy of the parameter estimation when M̂ is the true model. Here, we consider a criterion to make M̂ the true model. When the data are few, a low signal-to-noise ratio can make complex structures in the spectral data less discernible, leading to a higher likelihood of selecting simpler models 20. Therefore, we consider a criterion that selects samples favoring the more complex model. Let M_s be the best model and M_c be the second-best model, with M_c more complex than M_s; the expected gain for the model selection is

I_{s,c}(x) = ∫ p(y | x, D, M_c) log [ p(y | x, D, M_c) / p(y | x, D, M_s) ] dy + C,

where C is a constant independent of x.
The ratio p(y|x, D, M_c) / p(y|x, D, M_s) is referred to as the Bayes factor, a concept well explored in Bayesian decision theory 21,22. From the convexity of the KL divergence, I_{s,c}(x) is bounded as follows:

I_{s,c}(x) ≤ ∫∫ p_D(θ_{M_c}) p_D(θ_{M_s}) KL(p_{x,θ_{M_c}}(y) || p_{x,θ_{M_s}}(y)) dθ_{M_c} dθ_{M_s} + C.

This bound can be calculated with p_D(θ_{M_c}) and p_D(θ_{M_s}), which are obtained by the exchange Monte Carlo method.
Therefore, we also select the next measurement point x that maximizes I_{s,c}(x).
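The model-selection bound admits the same Monte Carlo treatment: average the Poisson KL divergence between the predictive rates of the more complex model M_c and the simpler model M_s over their posterior samples. Everything below (the one-peak versus two-peak candidates, the sample generators, and the names) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def poisson_kl(l1, l2):
    # KL(Poisson(l1) || Poisson(l2)) in closed form
    return l1 * np.log(l1 / l2) + l2 - l1

def i_model(x, f_c, f_s, samples_c, samples_s, T=1.0):
    """Monte Carlo estimate of the bound on I_s,c(x), up to the constant C."""
    lam_c = np.array([f_c(x, th) * T for th in samples_c])
    lam_s = np.array([f_s(x, th) * T for th in samples_s])
    lc, ls = np.meshgrid(lam_c, lam_s)
    return poisson_kl(lc, ls).mean()

def peak(x, a, mu, s):
    return a * np.exp(-(x - mu) ** 2 / (2 * s ** 2))

# Hypothetical candidates: M_s has one peak, M_c adds a small second peak.
f_s = lambda x, th: peak(x, th[0], th[1], 0.1) + 1.0
f_c = lambda x, th: peak(x, th[0], th[1], 0.1) + peak(x, th[2], th[3], 0.1) + 1.0

rng = np.random.default_rng(2)
samples_s = [(2.0 + rng.normal(0, 0.1), 0.3) for _ in range(30)]
samples_c = [(2.0 + rng.normal(0, 0.1), 0.3, 0.5, 0.7) for _ in range(30)]

xs = np.linspace(0.0, 1.0, 101)
gains = [i_model(x, f_c, f_s, samples_c, samples_s) for x in xs]
x_next = xs[int(np.argmax(gains))]
print(f"next point for model selection: {x_next:.2f}")
```

The two candidates disagree only around the small second peak at x = 0.7, so the criterion concentrates measurements there, exactly where the Bayes factor is most informative.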

Spectral deconvolution
Our proposed method was applied to the spectral deconvolution in XPS, which poses a challenge in estimating the number of peaks and their parameters 20 .

Problem setting
Let M_K be a model with K peaks, the parameter set θ_{M_K} be {{a_k, μ_k, σ_k}_{k=1}^K, B}, and the physical model f_{M_K}(x; θ_{M_K}) be a superposition of K peaks on a background B, where a_k, μ_k, and σ_k are the intensity, position, and width of the k-th peak.
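A minimal sketch of this observation model, assuming Gaussian peak shapes (the text specifies only the peak intensity a_k, position μ_k, width σ_k, and background B; the specific peak values below are hypothetical):

```python
import numpy as np

def f_MK(x, peaks, B):
    """Superposition of K peaks (a_k, mu_k, sigma_k) on a background B,
    assuming Gaussian peak shapes."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, B, dtype=float)
    for a, mu, sigma in peaks:
        out += a * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    return out

# Photon counting: y ~ Poisson(f_MK(x) * T) with measurement time T.
rng = np.random.default_rng(0)
x = np.linspace(157.0, 167.0, 400)
peaks = [(5.0, 159.0, 0.4), (3.0, 161.0, 0.5), (4.0, 163.0, 0.4)]  # hypothetical
y = rng.poisson(f_MK(x, peaks, B=1.0) * 6.0)  # T = 6, as in Fig. 2
print(y.shape)
```

Short measurement times T give low counts and hence a poor signal-to-noise ratio, which is what drives the sequential design below.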

Detailed algorithm for spectral deconvolution
To apply our method, a set of candidate models must be given in advance; however, in Bayesian spectral deconvolution, the number of peaks K can take any positive integer. Therefore, we change the candidate model set sequentially.
We define the initial model set as M = {M_1, M_2, M_3}. At each step, let K̂ be the number of peaks of the best predicted model M̂ (M̂ = M_{K̂}). The model set centered on K̂, M = {M_{K̂−1}, M_{K̂}, M_{K̂+1}} (with K ≥ 1), is used in the next estimation. In addition, in the spectral measurement, a short-time measurement is performed first, followed by a long-time measurement. The specific algorithm that takes these considerations into account is shown in Algorithm 1.
Algorithm 1. Sequential experiment for Bayesian spectral deconvolution. (M_s and M_c denote the best and second-best models in M, with M_c more complex than M_s.)
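The control flow of Algorithm 1 can be sketched as follows. The exchange Monte Carlo posterior and the two acquisition criteria are replaced by stubs, and the even n/2–n/2 budget split mirrors Algorithm 2; all names below are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_posterior_stub(D):
    # Stand-in for exchange MC: pretend the best model has K_hat peaks.
    return {"K_hat": 2}

def i_param_stub(x, post):   # stand-in for the parameter-estimation criterion
    return np.exp(-(x - 160.0) ** 2)

def i_model_stub(x, post):   # stand-in for the model-selection criterion
    return np.exp(-(x - 163.0) ** 2)

X = [157.0 + 0.025 * i for i in range(400)]   # candidate measurement points
n, T, D = 10, 1.0, []
candidates = {1, 2, 3}                        # initial model set {M1, M2, M3}

for step in range(3):
    post = fit_posterior_stub(D)
    K = post["K_hat"]
    candidates = {max(K - 1, 1), K, K + 1}    # re-centre the candidate set on K_hat
    # n/2 points for parameter estimation, n/2 for model selection:
    xs_p = sorted(X, key=lambda x: -i_param_stub(x, post))[: n // 2]
    xs_m = sorted(X, key=lambda x: -i_model_stub(x, post))[: n // 2]
    for x in xs_p + xs_m:
        D.append((x, rng.poisson(3.0 * T), T))  # simulated measurement

print(len(D), sorted(candidates))
```

Replacing the stubs with the exchange-MC posterior and the two KL-based criteria recovers the full sequential experiment.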

Conventional methods
We compare our method with the following two conventional methods.

Passive learning
Passive learning allocates the same measurement time to every measurement point. This is the most common approach in spectral experiments.

Active learning with GPR
In active learning with GPR, a GPR model is used as the learning model, and the next measurement point is selected to maximize the expected improvement of the estimate of the measured value. A detailed algorithm is given in the supplementary materials.

Result
Let the true model be the model M_3 with K = 3 peaks, and the true values of the parameters θ*_{M_3} = {{a*_k, μ*_k, σ*_k}_{k=1}^3, B*} be as follows: The modeling function f_{M_3}(x; θ*_{M_3}) is shown in Fig. 2. Let the measurement time for one measurement in active learning be T = 1, the number of measurement points per experiment be n = 10, and the candidate set of measurement points be X = {157 + 0.025(i − 1) (eV)}_{i=1}^{400}. The prior distributions are shown in the supplementary materials.
The flow of the measurement is shown in Fig. 3. The signal-to-noise ratio is poor at all points at first. However, as the experiment progresses, the signal-to-noise ratio near the peaks improves owing to focused measurements. The data and the fitting by the MAP estimator (θ̂_{M_3} = {{â_k, μ̂_k, σ̂_k}_{k=1}^3, B̂} = argmax_θ p(θ|D, M_3)) when the total measurement time is 2400 are shown in Fig. 4 (the parameter indices are ordered so that μ_1 < μ_2 < μ_3). This figure shows that the proposed method focuses on the measurement points near the peaks, which are considered important in the spectral deconvolution. In addition, we calculated p(K = 3|D) and p(θ_{M_3}|D, M_3) when the total measurement time is {400 + 100i}_{i=0}^{36}. Figure 5A shows the result of the model selection. Active learning with GPR does not improve the model selection because of the high-intensity measurement noise. However, our method improves the model selection compared with passive learning. Figure 5B shows the 99% credible interval of the parameter estimation of the peak positions μ_1, μ_2, μ_3. Our method narrowed the interval width and improved the parameter estimation.
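Credible-interval widths of this kind are computed directly from posterior samples. A minimal sketch follows, in which synthetic normal draws stand in for exchange Monte Carlo samples of a peak position (the numbers are illustrative, not the paper's):

```python
import numpy as np

def ci_width(samples, level=0.99):
    """Width of the central credible interval estimated from samples."""
    lo, hi = np.quantile(samples, [(1 - level) / 2, (1 + level) / 2])
    return hi - lo

rng = np.random.default_rng(3)
# Hypothetical posterior draws of a peak position mu_1 around 159 eV.
mu1_samples = rng.normal(159.0, 0.05, size=20000)
W = ci_width(mu1_samples)
print(f"99% interval width: {W:.3f}")
```

A narrower width indicates a sharper posterior, i.e., a more accurate parameter estimate.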
Moreover, the results of 10 independent measurements were compared: (a) passive learning with a total measurement time of 2400, (b) active learning with a total measurement time of 2400, and (c) passive learning with a total measurement time of 7200. Figure 6 shows the result of the model selection. Figure 7 shows the parameter estimation accuracy. Here, we defined the parameter estimation accuracies W_{μ_1}, W_{μ_2}, W_{μ_3} for μ_1, μ_2, μ_3 as follows: Both results show that our method improved the estimation accuracy and shortened the measurement time.

Hamiltonian selection
The Hamiltonian selection in XPS 23 was also considered in this study.

Problem setting
Let M_2 be a model using a two-state Hamiltonian H_2, M_3 be a model using a three-state Hamiltonian H_3, and M = {M_2, M_3} be the set of candidate models. Let θ_{M_2} = {Δ, V, Γ, U_fc, b} and θ_{M_3} = {Δ, V, Γ, U_fc, U_ff, b}; the Hamiltonians and physical models are shown in the supplementary materials. As the measurement is performed by photon counting in XPS, the probability distribution of the number of observed photons p(y|f_M(x; θ_M)) is considered to be Poisson(y; f_M(x; θ_M) × T) with measurement time T.

Detailed algorithm for Hamiltonian selection
Unlike in the case of spectral deconvolution, the model set M = {M_2, M_3} is fixed. The specific algorithm is shown in Algorithm 2: select n/2 points {x_1, ..., x_{n/2}} in descending order of {I_M(x_i)}_{x_i∈X} and n/2 points {x_{n/2+1}, ..., x_n} in descending order of {I_{s,c}(x_i)}_{x_i∈X}.

Conventional methods
We compare our method with passive learning and active learning with GPR as in the case of spectral deconvolution.

Result
Let the true model be the model M_3 with H_3, and the true values of its parameters be as follows: These true parameter values are taken from ref. 23. The physical function f_{M_3}(x; θ*_{M_3}) with the true parameter θ*_{M_3} = {Δ*, V*, Γ*, U*_fc, U*_ff, b*} is shown in Fig. 8. The peak around x = 5 is small, indicating that the model selection between model M_2, which generates two peaks, and model M_3, which generates three peaks, is difficult. Let the measurement time for one measurement in active learning be T = 1, the number of measurement points per experiment be n = 10, and the candidate set of measurement points be X = {−30 + 0.125(i − 1)}_{i=1}^{400}. The prior distribution is shown in the supplementary materials.
The flow of the measurement is shown in Fig. 9. The signal-to-noise ratio is poor at all points at first. However, as the experiment progresses, the signal-to-noise ratio near the peaks improves owing to focused measurements. The data and the fitting by the MAP estimator (θ̂_{M_3} = argmax_θ p(θ|D, M_3)) when the total measurement time is 10,000 are shown in Fig. 10. This figure shows that the proposed method focuses on the area near the peaks, particularly near the small peak around x = 5.
In addition, we calculated p(M_3|D) and p(θ_{M_3}|D, M_3) when the total measurement time is {400 + 300i}_{i=0}^{32}. Figure 11A shows the result of the model selection. As in the previous section, active learning with GPR did not improve the model selection because of the high-intensity measurement noise. However, our method improved the model selection compared with passive learning. Figure 11B shows the 99% credible interval of the parameter estimation of Δ, Γ, U_fc. Our method narrowed the interval width and improved the parameter estimation. Moreover, 10 independent measurements were performed, and the following results were compared: (a) passive learning with a total measurement time of 10,000, (b) active learning with a total measurement time of 10,000, and (c) passive learning with a total measurement time of 40,000. Figure 12 shows the result of the model selection. Figure 13 shows the accuracy of the parameter estimation. Here, we defined the accuracies of the parameter estimation W_Δ, W_Γ, W_{U_fc} for Δ, Γ, U_fc as follows: Both results show that our method improved the estimation accuracy and shortened the measurement time.

Conclusion and future work
We developed an active-learning method that uses multiple parametric models as learning models to improve the accuracy of model selection and parameter estimation in spectral experiments. In our method, the next measurement points important for model selection and parameter estimation were selected using the posterior distribution of the model and its parameters. We applied our method to two spectral experiments, namely spectral deconvolution and Hamiltonian selection. In both experiments, the proposed method improved the model selection and the accuracy of parameter estimation compared with passive learning and active learning with GPR.
To apply our method to a broader range of actual spectral experiments, the following points must be considered. First, there is a concern about the computational cost of the proposed method. To reduce the actual experimental time using the proposed method, the computational time of the Monte Carlo method should be sufficiently small compared with the experiment time. Nevertheless, in cases with a large number of parameters, the exploration range of the Monte Carlo method expands, leading to longer convergence times. Additionally, the computation time per iteration often scales proportionally with the number of measurement points. Therefore, in scenarios with many measurement points, such as high-dimensional spectral data, the computation time can increase significantly. To mitigate the computational time, one approach is to employ the concept of sequential Monte Carlo methods 24, utilizing samples obtained from previous simulations to perform sampling from the new posterior distribution. Moreover, the convergence can be improved by appropriately setting the prior distribution using prior knowledge about the experiment 25.
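The sequential Monte Carlo idea mentioned above can be sketched as importance reweighting: samples from the previous posterior are reweighted by the likelihood of the newly measured point and resampled, instead of restarting the sampler from scratch. The gamma-distributed "previous posterior" and all numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_poisson(y, lam):
    # log Poisson(y; lam) up to the y!-term (constant across samples)
    return y * np.log(lam) - lam

# Old posterior samples of a count rate (stand-ins for exchange-MC draws).
theta_old = rng.gamma(shape=9.0, scale=1.0 / 3.0, size=5000)  # centred near 3.0

# A new measurement arrives: y_new counts observed over time T.
y_new, T = 7, 2.0
logw = log_poisson(y_new, theta_old * T)       # importance weights (log scale)
w = np.exp(logw - logw.max())
w /= w.sum()

# Resample according to the weights to approximate the updated posterior.
theta_new = rng.choice(theta_old, size=5000, p=w)
print(f"updated posterior mean: {theta_new.mean():.2f}")
```

This reuses all previous sampling effort; in practice the reweighted particles would also be refreshed with a few Monte Carlo moves to avoid degeneracy.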
Another challenge is adjusting the Monte Carlo parameters automatically. The Monte Carlo method has many parameters, and setting them for each experiment is difficult. Thus, an algorithm such as NUTS 26 for adjusting the Monte Carlo parameters will be required.
Finally, our method is only applicable when the candidate models are known in advance. However, prior knowledge about background functions is often limited, making modeling challenging in many cases. Such challenges can be addressed by using semi-parametric models 27.

Figure 1. Criteria for active learning. (a) Parameter estimation. The gray line represents the data D, and θ_1, θ_2 follow p(θ_M|D, M). I_M(x) corresponds to the difference between f_M(x; θ_1) and f_M(x; θ_2) integrated numerically over p(θ_M|D, M). (b) Model selection. The gray line represents the data D, and θ_1, θ_2 follow p(θ_{M_s}|D, M_s) and p(θ_{M_c}|D, M_c), respectively. I_{s,c}(x) corresponds to the difference between f_{M_s}(x; θ_1) and f_{M_c}(x; θ_2) integrated numerically over p(θ_{M_s}|D, M_s) and p(θ_{M_c}|D, M_c).
(a_k, μ_k, σ_k, and B correspond to the peak intensity, peak position, peak width, and background intensity, respectively.) Since the measurement is performed by photon counting in XPS, the probability distribution of the number of observed photons p(y|f_M(x; θ_M)) is Poisson(y; f_M(x; θ_M) × T) with measurement time T.

Figure 2. Value of the modeling function f_{M_3}(x; θ*_{M_3}) and an example of the observed data {x_i, y_i}_{i=1}^{400} when T = 6.

Figure 3. Flow of the proposed method for spectral deconvolution. The upper figure shows the observed values per measurement time ȳ_i = Σ_{j: x_j = x_i} y_j / t_i, and the lower figure shows the total measurement time per measurement point t_i = #{j | x_j = x_i} × T. Although the signal-to-noise ratio of the initial data is poor at all measurement points, the signal-to-noise ratio of the data near the peaks is improved by repeating the experiments.

Figure 4. Data and fitting obtained by experiments on the spectral deconvolution. The upper figure shows the number of photons observed per measurement time ȳ_i and the fitting by the MAP estimator. The lower figure shows the total measurement time per measurement point. (a,d) Passive learning. (b,e) Active learning with GPR. (c,f) Proposed method.

Figure 5. (A) Model selection results. The horizontal axis is the total measurement time; the vertical axis, the probability of the true model; the blue line, the result of passive learning; the green line, the result of active learning with GPR; and the orange line, the result of our method. (B) 99% credible interval of the parameter estimation of the peak positions μ_1, μ_2, μ_3. The horizontal axis is the total measurement time. The gray area represents the result of passive learning; the colored area, the result of our method; and the dotted lines, the true values of μ_1, μ_2, μ_3.

Figure 6. Bar graphs of p(M_1|D), p(M_2|D), p(M_3|D), p(M_4|D) for the 10 independent trials. (a) Passive learning with a total measurement time of 2400. (b) Proposed method with a total measurement time of 2400. (c) Passive learning with a total measurement time of 7200.

Figure 7. Boxplots of the accuracy of the parameter estimation of the peak positions. The left, middle, and right panels show the boxplots of W_{μ_1}, W_{μ_2}, and W_{μ_3}, respectively. (a) Passive learning with a total measurement time of 2400. (b) Proposed method with a total measurement time of 2400. (c) Passive learning with a total measurement time of 7200.

Figure 8. Plot of the modeling function f_{M_3}(x; θ*_{M_3}) and an example of the observed data {x_i, y_i}_{i=1}^{400} when T = 25. The peak around x = 5 is small, indicating the difficulty of the model selection.

Figure 9. Flow of the proposed method for Hamiltonian selection. The upper figure shows the observed values per measurement time ȳ_i = Σ_{j: x_j = x_i} y_j / t_i, and the lower figure shows the total measurement time per measurement point t_i = #{j | x_j = x_i} × T. It can be observed that the area near the peaks, particularly near the small peak around x = 5, is measured intensively.

Figure 10. Data and fitting obtained by experiments on the Hamiltonian selection. The upper figure shows the number of photons observed per measurement time ȳ_i and the fitting by the MAP estimator. The lower figure shows the total measurement time per measurement point. (a,d) Passive learning. (b,e) Active learning with GPR. (c,f) Proposed method.

Figure 11. (A) Model selection results. The horizontal axis is the total measurement time; the vertical axis, the probability of the true model; the blue line, the result of passive learning; the green line, the result of active learning with GPR; and the orange line, the result of our method. (B) 99% credible interval of the parameter estimation of Δ, Γ, U_fc. The horizontal axis is the total measurement time. The gray area indicates the result of passive learning; the colored area, the result of our method; and the dotted lines, the true values of Δ, Γ, U_fc.

Figure 12. Bar graphs of p(M_2|D), p(M_3|D) for the 10 independent trials. (a) Passive learning with a total measurement time of 10,000. (b) Proposed method with a total measurement time of 10,000. (c) Passive learning with a total measurement time of 40,000.

Figure 13. Boxplots of the accuracy of the parameter estimation of the Hamiltonian parameters. The left, middle, and right panels show the boxplots of W_Δ, W_Γ, and W_{U_fc}, respectively. (a) Passive learning with a total measurement time of 10,000. (b) Proposed method with a total measurement time of 10,000. (c) Passive learning with a total measurement time of 40,000.