Local Bayesian Dirichlet mixing of imperfect models

To improve the predictability of complex computational models in the experimentally-unknown domains, we propose a Bayesian statistical machine learning framework utilizing the Dirichlet distribution that combines results of several imperfect models. This framework can be viewed as an extension of Bayesian stacking. To illustrate the method, we study the ability of Bayesian model averaging and mixing techniques to mine nuclear masses. We show that the global and local mixtures of models reach excellent performance on both prediction accuracy and uncertainty quantification and are preferable to classical Bayesian model averaging. Additionally, our statistical analysis indicates that improving model predictions through mixing rather than mixing of corrected models leads to more robust extrapolations.


Introduction
When considering predictions or extrapolations of physical quantities into unknown domains, a reliance on a single imperfect theoretical model can be misleading.To improve the quality of extrapolated prediction, it is in fact advisable to use several different models and mix their results [1][2][3][4] .In this way, the "collective wisdom" of several models could be maximized by providing the best prediction rooted in the most current experimental information.To carry out the mixing, statistical machine learning (ML) methods, with their ability to capture the local features, are tools of choice.Specifically, Bayesian Model Averaging (BMA) can be used to combine multiple models to produce more reliable predictions since it is the natural Bayesian framework to account for the uncertainty on the model itself [1][2][3] .In absence of another established methodology, the application of BMA to scenarios where several models compete to describe the same phenomenon has been relatively widespread in such diverse fields as weather forecasting 5 , political science 6 , transportation 7 , and nuclear physics 8,9 .
It is important to remember, however, that BMA relies on the assumption that one of the proposed models is the true model (i.e., a model that perfectly describes the physical reality), which is clearly inappropriate when dealing with complex systems and approximate modeling.In practice, it often happens that none of the competing state-of-the-art models can be dominated by the others, in the sense that each model does something better than the others.In such a setup, models should not be viewed as exclusive but as complementary, and BMA seems theoretically ill-grounded.In addition, in the standard implementations of the BMA, the weights are global, i.e., they are constant over the input domain, and thus unable to catch local model preferences.
Besides BMA, there exist other methods to combine results of different models.In fact, combining models has been the subject of much research in ML that has led to the development of the topical "ensemble learning methods" (bagging and boosting).These methods remain in their spirit and purpose very close to BMA and typically do not fix the inadequacies mentioned above: their goal is to identify the best performing model given a set of models.See Ref. 10 for review of additional approaches.
In this work, we develop and apply the Local Bayesian Model Mixing (LBMM), an extension of Bayesian stacking [11][12][13][14][15] , for managing competing models.Under the Bayesian stacking framework, one assumes that the true model is a linear combination of the models instead of being one of the models.The extrapolations are thus obtained via a direct mixture of the models, as compared to the mixture of posterior distributions under the standard BMA.Unlike the BMA weights which reflect the fit of a statistical model to data, independently of the set of available models except for normalization, the weights based on model mixing or stacking reflect the model's contribution to the final predictions 13 .The LBMM used in this study makes the use of Dirichlet distribution to infer stacking weights which hierarchically depend on the model input space and thus highlight the local fidelity of theoretical models.Additionally, the LBMM framework well captures uncertainties of individual models and their mixing weights through the proposed hierarchical structure.Below, we first present the general LBMM framework followed by a pedagogical case of global mixture of models that corresponds to classical Bayesian stacking.Subsequently, we let the model weights vary across the model input space and consider several hierarchical Bayesian models based on the Dirichlet distribution.
As an example, we apply the new method to predicting nuclear mass, or binding energy, which is the basic property of the atomic nucleus.Since we consider BMA to be a point of comparison for our LBMM methodology, we briefly review the general predictive framework of BMA in Methods section.The binding energy determines nuclear stability as well as nuclear reactions and decays.Quantifying the nuclear binding is important for many nuclear structure and reaction questions, and for understanding the origin of the elements in the universe.The astrophysical processes responsible for the nucleosynthesis in stars often take place far from the valley of beta stability, where experimental masses are not known.In such cases, missing nuclear information must be provided by extrapolations.Accurate values for nuclear masses and their uncertainties beyond the range of available experimental data are also used in other scientific fields, such as atomic and high-energy physics, as well as in many practical applications.In order to improve the quality of model-based predictions of masses of rare isotopes far from stability, ML approaches can be used that utilize experimental and theoretical information.A broad range of ML tools have been used to mine unknown nuclear masses, including Gaussian processes (GPs), neural networks, frequency-domain bootstrap and kernel ridge regression [16][17][18][19][20][21][22][23][24][25][26] (see the recent review 27 for more references).In a series of papers 8,[28][29][30][31] , the BMA methodology has been applied to nuclear mass predictions.In this work, we propose the LBMM approach to produce model-informed extrapolations of nuclear masses that overcome the limits of BMA mentioned above.

Bayesian model mixing
Let us consider experimental observations y(x i ) of a physical process at locations x i ∈ X ⊂ R q , i = 1, . . ., n, governed by a "true model" y * (x), and let us assume that the true model is not fully captured by one of the proposed models f k , k = 1, . . ., p, but rather a combination of these models.It is then natural to consider a statistical mixture model of the general form: where σ represents the scale of the error of the mixture model, ε i iid ∼ N(0, 1), and f 1 (x i ), . . ., f p (x i ) are theoretical values for the datum y(x i ) provided by the p theoretical models considered.
In practice, the weights ω ω ω(x i ) = (ω 1 (x i ), . . ., ω p (x i )) must be taken in a space where inference is possible.This can be done in many ways.In this work, we will highlight a few alternative models for ω ω ω(x i ) which are tractable, suggestive, and fully Bayesian.
Additionally, one can improve the models by accounting for systematic errors.This can be done by adding to each model the systematic error correction δ f ,k :

Global mixtures
First, we present the simplest application of Bayesian model mixing (BMM) where one assumes global weights (GBMM), i.e., weights that are constant over the input domain.

Linear model (L)
Let us first model the underlying physical process by a global (linear) mixture of the individual models: Using Eq. (3) the log-likelihood of the model can be written as: where y y y = (y(x 1 ), . . ., y(x n )).To ensure that the weights ω ω ω have the same support as the model weights in BMA, it may also be justified to assume the simplex constraints ω k ≥ 0 and ∑ k ω k = 1.In that case, the posterior distributions should also satisfy the simplex constraints.While the first condition is easily met using non-negative priors, the second can be more challenging to enforce with priors.Nevertheless the naive idea of projecting the unconstrained posteriors appears to be relatively efficient in the case of simple linear models 32 .Here projecting the simplex constraints corresponds to substituting

Dirichlet model (D)
As a refinement of the global linear mixture and a step towards local mixtures with simplex constraints, one can suppose that the weights ω ω ω are given hierarchically by a Dirichlet distribution: with the hyperprior π(α α α) on the hyperparameter α α α.The reason for this additional modeling layer is twofold.First, it allows us to express uncertainty about the prior model weighing imposed by α α α.Note that a Dirichlet distribution with size p and the parameter vector α α α > 0 is a multivariate continuous distribution on the simplex {ω 1 , ..., ω p ≥ 0 : ∑ p k=1 ω k = 1} where the average value of ω j is α j / ∑ p k=1 α k .Looking at the shape of the distribution in Fig. 1, it is clear that α α α < 1 is close to model selection while α α α > 1 encourages true mixing.The hyperprior π(α α α) allows us to quantify our uncertainty about these two regimes.Secondly, the hierarchical model for ω ω ω permits, with slight modification, a heterogeneity of weights based on the value of x, which we shall exploit shortly.
Log density of the Dirichlet distribution when p = 3 as a function of ω 1 and ω 2 (ω 3 = 1 − ω 1 − ω 2 ).The parametrization of Dirichlet distribution is such that α α α < 1 is close to model selection (one dominant weight with a high probability) and α α α > 1 leads to true mixing (several large-probability weights).Left: α α α = (0.3, 0.3, 0.3); Right: Consequently, up to a choice of prior π(α α α, δ δ δ f , σ ), the joint posterior distribution of (ω ω ω, α α α, δ δ δ f , σ ) is given by which does not have a closed form in general and needs to be approximated using MCMC.Predictions of observations from a physical process at new locations are then obtained by propagating the posterior samples of (α α α, δ δ δ f , σ ) through the hierarchy described above.The Dirichlet weights encapsulate the contribution of each model to the mixture, which makes the interpretation of the weights probabilistic.In that sense, they carry a different meaning than the BMA weights, which measure the fidelity of individual models.Still, both weights can be compared as they play the same role in the the final predictions -keeping in mind that the information contained in the posterior distribution of the Dirichlet weights is richer than the point values produced by BMA.

Local mixtures
Let us now consider that the observations y(x i ) follow the general statistical model given by Eq. ( 2).The key feature here is that now the weights depend on the location x.Without additional information, the functions ω ω ω(x i ) shall be estimated with a non-parametric estimator satisfying the simplex constraints.

3/15
In order to account for the local dependency of the model weights while satisfying the simplex constraints, we propose a hierarchical framework based on the Dirichlet distribution.Specifically, we take for every x weights ω ω ω(x) as Dirichlet random variables defined by parameters α α α(x) = (α 1 (x), . . ., α p (x)).The correlations between ω k (x)'s for different values of x shall be contained in the corresponding α k (x)-correlations.We investigate two models for α k (x): a Generalized Linear Dirichlet model (GLD) and a Gaussian Process Dirichlet model (GPD).
In particular, we assume that at every location x the model weights ω ω ω(x) follow a Dirichlet distribution with parameters α k (x), for k = 1, . . ., p, now depending on x, that will encode the spatial relationships between the model weights over the input space.Since the Dirichlet distribution is defined for parameter values α k > 0, we additionally apply an exponential link function, i.e., we consider γ k (x) := log(α k (x)) which can be regressed symmetrically.Thus, the purpose of the link function is to allow for unconstrained modeling of γ k (x).
The GLD version of our local mixing framework represents γ k (x) parametrically as where ) is a parameter vector.The linear nature of Eq. ( 7) corresponds to the assumptions that the correlations between local weights have a relatively large spatial range.
As a finer version of LBMM, we propose a non-parametric GPD model for γ k (x) defined by a Gaussian process prior parametrized with a constant mean γ ∞ k and covariances c k (x, x ′ ) given by quadratic exponential kernels 33 .Additionally, we assume that γ k (x) and γ j (x) are statistically independent for k ̸ = j.Note that the proposed hierarchical structure takes into account not only the relationship between ω k (x)'s for different values of x (via α k (x)) but also the correlations between the weights of models at given spatial location x (via Dirichlet distribution).This would not be possible if one choose to model ω k (x)'s directly, let's say with a GP over ω k (x).

Application: nuclear mass extrapolation
As a case study for the Bayesian model mixing framework described above, we consider the separation energies of atomic nuclei, which were the subject of our previous investigations 8,18,28,29,31 .Our particular goal is to compare the following alternatives: (i) Raw models without statistical correction δ f (/w δ f ) vs. models corrected with δ j (w/ δ f ); (ii) BMA vs. global BMM; (iii) Global BMM vs. local BMM.
The two-neutron separation energy (S 2n ) is a fundamental property of the atomic nucleus defined as the energy required to remove two neutrons from the nucleus.It can be expressed as a difference of nuclear masses.Here, the input space X is represented by the numbers of protons Z and neutrons N. Consequently, in this study q = 2, x i := (Z i , N i ) and y i is the observed two-neutron separation energy at x i .We are particularly interested in even-even nuclei, for which both N and Z are even numbers.We use the most recent measured values of two-neutron separation energies for nuclei from the AME2003 dataset 34 as training data (n = 521) for BMM and the GP systematic corrections; for BMA calculations we retain as evidence dataset a subset of this training data consisting of 8 nuclei: 3 proton-rich nuclei 148 Er, 188 Po, 242 Cf, and 5 neutron-rich nuclei 64 Cr, 116 Ru, 160 Nd, 168 Hf, 232 Ra.We keep additional data tabulated in AME2020 35 for an out of sample extrapolative testing dataset (n = 59).These three domains are depicted in Fig. 2.
As for prediction, we will use the largest domain on which two-neutron separation energies are positive, i.e., the corresponding nuclei are predicted to exist.In line with our previous studies, we consider seven theoretical models based on the nuclear density functional theory (DFT) which is capable of describing the whole nuclear chart: SkM * 36 , SkP 37 , SLy4 38 , SV-min 39 , UNEDF0 40 , UNEDF1 41 , and UNEDF2 42 .The DFT data were taken from the theoretical database 43 .The above set of DFT models was augmented by two well-fitted mass models FRDM-2012 44 and HFB-24 45 that have significantly more parameters than the (less phenomenological) DFT models, resulting in a better fit to measured masses.
In the subsequent Bayesian analyses, we use independent priors for the different statistical parameters and keep consistent notation throughout the model mixing variants.In general, we use normal priors for unconstrained parameters, Gamma priors for positive parameters, and uniform priors for bounded parameters.Recall that a Gamma distribution is parametrized by a shape parameter a and a rate parameter b, and has its mean given by a/b and variance given by a/b 2 .For the error scale parameters σ , we use Gamma priors with scale parameter 5 and rate parameter 10, with mean 0.5 MeV and standard deviation 0.22 MeV.In the case of LBMM variant with GLD defined by Eq. ( 7), we take independent normal prior distributions with mean 0 and standard deviation 1 for the elements of β β β 1 , . . ., β β β p .For LBMM with GPD, we used independent squared exponential kernels for the GP: namely, characterized by three hyperparameters η k , ρ Z , and ρ N .We have chosen to take the length-scale parameters ρ common to all nuclear models, but leave the GP intensity parameters η be different for each model, in order to ensure stability and convergence.As a result the GPD weights follow the frequency of the residuals for each model.In the case of Dirichlet distribution with GPD local mixture, we take independent normal priors with mean 0 and variance 1 for the GP mean parameter γ ∞ k , and independent Gamma priors for the three scale parameters η k , ρ Z and ρ N .These priors are taken with respective parameters (10, 2), (5, 2), (5, 2); this corresponds to slightly informative priors which helped to ensure convergence towards weights localized on an appropriate scale.The parameter γ ∞ k determines the long range weight of each model, i.e., far from the training data.Note that taking a zero-mean GP, i.e., setting γ ∞ k = 0, would amount to uniform weights far from the data.When it comes to the global GBMM+L mixture, Eq. (1), we take for ω ω ω independent uniform priors on [0, 1].In practice, the simplex constraint is satisfied implicitly, without the need to apply Eq. ( 4).This confirms that all the individual models are well conceived.For the GBMM+D variant, we take for α a half-normal prior with standard deviation 2.
For δ f ,k , we use the systematic correction for two-neutron separation energy residuals (i.e., differences between theoretical and measured two-neutron separation energies) computed in 29 using Bayesian Gaussian processes; these are fixed with no priors.The training dataset in 29 agrees with Fig. 2 up to a set of 5 additional nuclei.In what follows, we do not use these nuclei during training whenever uncorrected models are considered since they would be extrapolative from the GP's perspective.The consequences of this omission is negligible due to the overall size of the training set.
All model weights were trained with the S 2n values from the full training dataset with the exception of BMA, where we have used only the evidence set shown in Fig. 2 that consists of 8 nuclei.Similar to 8,28,29 , we compute the BMA weights only on a set of representative nuclei because computing evidences on a large dataset inevitably leads to model selection.This happens due to the exponential and multiplicative nature of Gaussian likelihood which punishes large deviations more than it favors good fits (see 4,30 for details).These weights are in turn applied to obtain predictions for S 2n .The proton-rich limit of X , determined by two-proton separation energies, was identified in the previous study 28 .
Tables 1 and 2 summarize the results of our model variants and are discussed in the following paragraphs.

Uncorrected models versus corrected models
We first discuss the fidelity of individual models.To this end, we study root-mean-square (rms) deviations for all the modeling variants discussed in this paper.We consider raw model predictions and the predictions including the systematic corrections δ f ,k .It is seen that for the individual models, the corrected variants generally outperform the raw predictions, see Ref. 18 for discussion.The exception is HFB-24, which has been carefully calibrated to experimental masses; in this case the correction term δ f does not lead to a lower rms deviation on the testing dataset.We want to point out that mixing of corrected models should be done with caution.It is our empirical experience that mixing (GBMM or LBMM) of previously corrected models can lead to overfitting as one tends to fit the statistical models to the small leftover noise since the residuals of all corrected models on the training dataset are practically zero.This can be clearly observed in Table 1: the rms deviations for both local and global mixtures slightly outperform the combinations of corrected models on the the testing dataset.For instance, GBMM+D of uncorrected models gives 0.31 MeV rms deviation on the testing dataset as compared to 0.46 MeV on the corrected models and also as compared to 0.35 MeV rms deviations of BMA on corrected models.Since providing accurate extrapolations is the main focus of this work, in the following, we focus the discussions primarily on the uncorrected models.

BMA versus Global Mixtures
The BMA evidence integrals were calculated on the evidence dataset by means of Monte Carlo (MC), Laplace approximations, and in a closed form under conjugate priors.We denote the corresponding BMA variants as follows: BMA(MC), BMA(Lap) and BMA(ex), respectively (see Methods section for the calculations of BMA weights).In Table 2, we see that the model weights produced by BMA are consistent across all three evidence computation approaches, irrespective of whether the systematic correction has been applied.Averaging corrected models is more democratic as compared to the raw models: this is expected since the GP-based δ f ,k corrections fit the training data closely irrespective of the theoretical model.Still, the BMA testing rms with uncorrected and corrected models are very similar, with a slight preference for the uncorrected models.
The global mixtures of uncorrected models are generally slightly outperforming BMA on both training and testing datasets (see Table 1).This is in fact expected, given that these weights are designed to maximize the predictive power of the model mixture.Indeed, the GBMM+L model is the Bayesian counterpart to a frequentist linear regression against the different nuclear model predictions that minimizes the rms on the training set.This principle still holds despite the uniform prior used for the GBMM+L model which is very informative and plays a regularizing role that reduces overfitting and favors mixing.We can see that the Dirichlet mixture model yields very similar weights, with the benefits of having its weights natively located on the simplex.This comparison of global weights already speaks in favour of ruling out BMA for the purpose of combining imperfect models, in the favor of a Bayesian Dirichlet model.Table 1 also shows the posterior mean of the noise scale parameter σ for comparison.As a rule of thumb, a statistical model with a conservative (liberal) uncertainty quantification (UQ) would have σ above (below) the test rms, and a statistical model with high-fidelity UQ has σ close to the test rms.A more comprehensive view of UQ that reflects the fully propagated prediction uncertainty can be gleaned from Fig. 3 that shows the empirical Empirical coverage probability Credibility level coverage probability 46,47 (ECP).Each curve in Fig. 3 corresponds to the proportion of predictions in the testing dataset falling into the respective credible intervals (equal-tailed credible intervals).If the ECP curve closely follows the diagonal, then the actual fidelity of the credible interval corresponds to the nominal value.Thus we see that the GBMM has both a superior prediction performance and a better UQ then BMA and individual models.

Global versus Local Mixtures
The posterior mean of the LBMM+GPD weights are shown in Fig 4 .The same plots but for LBMM+GLD are given in the Supplementary Information.As discussed earlier, mixing models locally corrected for systematic errors is highly susceptible to overfitting and we therefore focus on uncorrected models, i.e. without δ f .Both LBMM variants show the dominance of the well-fitted HFB-24 mass model throughout the nuclear landscape.As expected, the simplistic linear dependence of weights ω ω ω on (Z, N) in the GLD variant is insufficient to fully capture the complex local behaviour of the mass models learned by a more flexible GPD variant.While the HFB-24 mass model dominates, the final LBMM+GPD results involve other models, primarily FRDM2012, UNEDF0, and SkM * .The weight distribution naturally depends on the choice of models involved in the analysis.This suggests, that a preselection of diversified models to be used in LBMM could also be considered beforehand.
In terms of the rms deviations, the GPD variant does better than the GLD local mixture, which reflects the ability of the GPD to capture the local performance of mass models.Local mixtures perform better than global mixtures on the training Table 2. Global weights calculated on the training dataset with different methods: BMA on the evidence subset (see Fig 2)  obtained from a closed form computation as well as Monte Carlo and Laplace approximations, and the two global mixtures GBMM+L and GBMM+D obtained over the whole training set.For compactness, the following abbreviations are used: SV=SV-min, UNEn=UNEDFn (n=0,1,2), FRDM=FRDM-2012, and HFB=HFB-24.

Discussion
In this work, we propose and implement a Bayesian Dirichlet model mixing framework.The proposed method is illustrated by applying it to nuclear mass models to assess their local fidelity and improve predictability.Raw theoretical models and their statistically-corrected versions were considered to better understand the interplay between GP modeling, BMA, and the BMM frameworks.
Bayesian model mixing of raw models results in testing rms that are at least as good or better than Bayesian model averaging (irrespective of models being corrected) with clearly superior UQ.Thus, improving model predictions through mixing rather than mixing of corrected models leads to the best performance in terms of both prediction accuracy and UQ.Since BMM is trained on a sizable training set, it is also more robust to the choice of priors than BMA which can be very prior sensitive if the evidence in data is weak 1 .
BMM of corrected models should be performed with caution as it may lead to overfitting.In this case, one likely achieves a better improvement with standard BMA based on a well chosen evidence set.
The LBMM+GPD variant achieved the smallest training error on the training dataset (0.25 MeV) which demonstrates that the LBMM well captures the local presences of individual models.Furthermore, the local mixtures clearly surpass all the other modeling strategies explored in this work in terms of UQ fidelity.This shows that the proposed hierarchical Dirichlet model for LBMM effectively represents and propagates uncertainties which is essential for mass modeling into unexplored domains 29 .
The results of BMA depend on the choice of the evidence dataset.That is, by increasing the density of the evidence data in the region of interest, e.g., for applications or extrapolations, one can improve the predictive power of averaging procedure.Improvement in the performance of BMM can also be achieved by restricting the training dataset to the region of interest as opposed to training on the whole domain; this motivates our introduction of local BMM models.
The distributions of BMA and BMM weights also depend on the choice of theoretical models.Table 2 and Fig. 4 show that mixing a large set of models results in some having minimal contributions and point out to the existence of a class of models with 8/15  similar local preferences (e.g., UNEDFn class).This indicates that adding model preselection and model orhogonalization 49 to the BMM pipeline could lead to a further improvement in predictive performance.In fact, in the context of our GBMM+L model, it is well known that collinearity between the proposed theoretical models is a source of major instabilities.

BMA highlights
Let us consider the task of predicting observations from a physical process at new locations x * using the observations y y y = (y(x 1 ), . . ., y(x n )).The BMA posterior predictive distribution is is the prior density of model's parameters (noise scale σ k and systematic discrepancy δ δ δ f ,k ), p(y y y|σ k , δ δ δ f ,k , M k ) is the data likelihood, and π(M k ) is the prior probability that M k is the true model -assuming that one of the models is true.
There is only a handful of statistical distributions under which the evidence integral can be expressed in a closed form.One such scenarios is linear regression models with conjugate priors; the statistical model M k with a constant discrepancy term δ f ,k belongs to this case.For each model, let us consider the prior where δ f ,k , conditionally on a theoretical model choice M k and a precision parameter (the inverse of the variance) λ k , follows a normal distribution with mean µ and variance 1/λ k .Let us further assign to the precision λ k a gamma prior with shape parameter a and rate parameter b.Then, the evidence integral has the closed form solution: where 2(1+n) , κ n = 1 + n, while denoting d i := y i − y k (x i ) and d := (∑ i d i )/n.This solution can be obtained by simple but tedious algebraic manipulations, see Ref. 50 for details.As stated in the main manuscript, we use for the parameter σ a gamma prior with scale and rate parameters 5 and 10.In order to match the mean and standard deviation of 1/σ 2 when σ is distributed according to the common Gamma prior with shape and scale parameters 5 and 10, the results for the closed form BMA were obtained under a gamma prior for the precision (inverse variance) parameter with shape and scale parameters 0.252 and 0.030.
When evidences cannot be obtained explicitly, a MC estimate can be computed as Alternatively, when the discrepancy term is considered constant, the evidence integral can be approximated by a closed form expression.A technique frequently used is Laplace's method of integral quadrature 1 : where σk and δ f ,k represent the posterior modes and The computation of evidence integral is simplified in the scenarios without the constant discrepancy term δ f ,k as the statistical model contains only a single parameter σ k .We leave the details of this simple exercise in probability to the reader.

Application: summary of modeling choices
As a matter of clarity and to guarantee reproducibility of the results presented in section Application: nuclear mass exploration, Table I lists parameter choices and priors for each of the modeling variants discussed.Note that when we consider theoretical model without statistical correction, δ f ,k (x i ) := 0. I. Summary of statistical models, their parameters, and priors used in section Application: nuclear mass exploration.(5, 10), ω ω ω(x i )|α α α(x i ) ∼ Dirichlet(α α α(x i )), log(α k (x i )) ∼ GP(γ ∞ k , c k (x i , x ′ i )), where

MCMC computations
The MCMC approximate posterior distributions for all the modeling variants discussed in this work were obtained using the Hamiltonian Monte Carlo based No-U-Turn sampler (NUTS) 48 .In general, we obtained at least 50 × 10 3 samples from the posterior distributions after which we discarded half as a burn-in.While more conventional samplers such as Metropolis-Hastings (MH) algorithm 50 would be sufficient for both BMA and global mixtures, using NUTS is essential to achieve satisfactory convergence when it comes to LBMM.To illustrate this, we provide selected MCMC traceplots for LBMM+GPD variant with MH and NUTS approximations in METHODS Figs. 1 and 2, respectively.The NUTS whose performance tends to be superior to MH in scenarios with moderately large parameter spaces clearly shows the convergence of Markov Chain while the MH displays poor mixing.

Figure 2 .
Figure 2. Training (black dots), testing (red circles), and evidence (red dots) datasets of two-neutron separation energies of even-even nuclei used in this study.The eight evidence nuclei are also included in the testing dataset.Each nucleus is represented by the number of protons Z and neutrons N. See text for details.

Figure 3 .
Figure 3. Empirical coverage probability for raw models without statistical correction together with BMA and BMM variants.The empirical coverage probability was calculated with equal-tailed credibility intervals.The reference line (diagonal) is marked by a dashed line.

Figure 4 .
Figure 4. Posterior means of the local model weights in the LBMM+GPD variant across the nuclear landscape.

1 .
Traceplots of the scale parameter σ and GP mean parameters γ ∞ k obtained via the Metropolis-Hastings algorithm in the LBMM+GPD variant.

3 METHODS Figure 2 .
Similar as in METHODS Fig.1but for the No-U-Turn sampler.

Table 1 .
Rms deviations (in MeV) for all individual models, global (BMA, GBMM) and local (LBMM) mixtures.Values are provided with and without systematic corrections.For corrected models, we show only the test rms as the train rms and σ values are negligible after the GP fit.For abbreviations of BMA variants, see text.