A machine learning approach to Bayesian parameter estimation

Bayesian estimation is a powerful theoretical paradigm for the operation of quantum sensors. However, the Bayesian method for statistical inference generally suffers from demanding calibration requirements that have so far restricted its use to proof-of-principle experiments. In this theoretical study, we formulate parameter estimation as a classification task and use artificial neural networks to efficiently perform Bayesian estimation. We show that the network's posterior distribution is centered at the true (unknown) value of the parameter, within an uncertainty given by the inverse Fisher information, representing the ultimate sensitivity limit for the given apparatus. When only a limited number of calibration measurements are available, our machine-learning-based procedure outperforms standard calibration methods. Our work thus paves the way for Bayesian quantum sensors that can benefit from efficient optimization methods, such as adaptive schemes, and can take advantage of complex non-classical states. These capabilities can significantly enhance the sensitivity of future devices.

Precise parameter estimation in quantum systems can revolutionize current technology and prompt scientific discoveries [1,2]. Prominent examples include gravitational wave detection [3][4][5], time and frequency standards in atomic clocks [6], field sensing in magnetometers [7], inertial sensors [8,9] and biological imaging [10]. As such, improving the sensitivity of quantum sensors is currently an active area of research with most work focused on the control and reduction of noise and decoherence, and on the use of non-classical probe states [1]. Furthermore, the development of data analysis techniques to extract information encoded in complex quantum states [11][12][13][14][15][16][17][18][19] is another crucial, yet often overlooked step toward ultraprecise quantum sensing.
Among different strategies [20][21][22], Bayesian parameter estimation (BPE) is known to be particularly efficient and versatile. The output of BPE is a conditional probability distribution P(θ|µ) which is interpreted as a degree of belief that the parameter θ equals the true (unknown) value θ_true, given the sequence of m measurement results µ = µ_1, ..., µ_m and any prior information about θ_true [16,17]. BPE is free of any assumption about the probability distribution of the measurement data µ, and it can meaningfully assign a confidence interval to any result, even a single detection event (m = 1). As m becomes large, P(θ|µ) converges to a Gaussian centered at θ_true and with a width given by the inverse Fisher information, a result which crucially holds for any probability model and all values of the parameter θ_true [16,21,22]. Finally, BPE forms the basis of several adaptive protocols in parameter estimation [23][24][25][26][27][28][29][30]. However, performing BPE necessitates a detailed characterization of the measurement apparatus, which typically requires either modelling the sensor explicitly, or else collecting a prohibitively large amount of calibration data. Although BPE has been demonstrated in single-qubit systems such as NV-center magnetometers [25,29,[31][32][33]], its demanding calibration requirements remain a major limitation when moving to more complex systems. For example, complex non-classical states are now routinely generated in ensembles of ultra-cold atoms [1], yet BPE using entangled states has so far been limited to proof-of-principle investigations in few-particle systems [12,14,15]. To employ BPE in systems that cannot be easily modeled, methods must be developed to efficiently calibrate the device given limited data.
In this manuscript we provide a machine-learning approach to BPE. We propose that parameter estimation can be formulated as a classification task (similar to the identification of handwritten digits, see Fig. 1) that can be performed efficiently with supervised learning techniques based on artificial neural networks [34][35][36]. Classification problems are naturally Bayesian: for instance, the output of the classification network in Fig. 1(a) is the probability that the handwriting is one of the digits 0, ..., 9; a well-trained network should assign the highest probability to the digit 2. Analogously, we design a neural network adapted for parameter estimation whose output is, naturally, a Bayesian parameter distribution. Based on this interpretation, we provide a theoretical framework that enables a network to be trained using individual measurement results. This training provides a set of Bayesian distributions, one for each possible experimental outcome, together with a Bayesian prior that we unambiguously identify and directly link to the training of the network. These Bayesian distributions and prior are subsequently multiplied, depending on the experimental outcomes, to perform BPE of an arbitrary unknown parameter. We show that our BPE protocol is asymptotically unbiased and consistent: it obeys the relevant Bayesian bounds [17] dictated, in our examples, by quantum and statistical noise. Our method is tested on a variety of quantum states, demonstrating that classical sensitivity limits can be surpassed when using entangled states. Crucially, the neural network needs to be trained with only a relatively small amount of data and thus provides a practical advantage over the standard calibration-based BPE.

FIG. 1. (a) Trained on a data set of handwritten digits [37], the network can correctly classify handwritten digits with high accuracy. The network provides the conditional probability P(digit|data) that the image is assigned to a certain ideal digit (2 in this example) given the input pixel data. (b) In parameter estimation, the network input is the result µ of a measurement made at the output of a quantum sensor. In analogy with the digits 0-9, the true (but unknown) probability distribution P(µ|θ_j) represents a category, labelled by the discrete parameter θ_j. A handwritten digit is analogous to a crude sampling from this distribution, used to train the network. The output layer then assigns a conditional probability P(θ_j|µ) that a particular classification is correct, given the observed result µ.
Although there is a significant body of literature on the application of machine learning techniques to problems in quantum science [38,39], quantum sensing has received relatively little attention [40]. Current studies have mainly focused on the optimization of adaptive estimation protocols [41][42][43][44][45][46][47][48][49][50], improved readout for magnetometry [25,51] and state preparation [52]. Related tasks such as tomography [53][54][55][56][57][58][59], learning quantum states [60][61][62][63][64][65], Hamiltonian estimation [66][67][68][69] and state discrimination [70,71] have also been considered. Neural networks have been applied in the context of parameter estimation with the aim of inferring/forecasting noisy signals [72][73][74], and for the calibration of a frequentist estimator directly from training data [75]. Unlike these approaches, we show here that a properly trained neural network naturally performs BPE without any assumptions about the system. The machine-learning-based parameter estimation illustrated in this manuscript can be readily applied for data analysis in current quantum sensors, providing all the important advantages of BPE while enjoying less stringent calibration/training requirements. The method applies to any (mixed or pure) state and measurement observable. In practical applications, noise and decoherence that affect the apparatus are directly included (via the training process) in the Bayesian posterior distributions, which therefore fully account for experimental imperfections.

RESULTS
In a general parameter estimation problem, a probe state ρ undergoes a transformation that depends on an unknown parameter θ_true. The goal is to estimate θ_true from measurements performed on the output state ρ_θtrue. A detection event µ occurs with probability P(µ|θ_true) = Tr[ρ_θtrue Ê_µ], where {Ê_µ} is a set of operators that are positive, Ê_µ ≥ 0, and complete, Σ_µ Ê_µ = 1 [76].
The parameter estimation discussed in this manuscript is divided into two parts: (i) a neural network is trained, and (ii) Bayesian estimation is performed on a test set, as we detail below. A test set refers to an arbitrary sequence of measurement results µ of length m, possibly different from the number of measurements in the training set. To build intuition, we first illustrate the theory with a pedagogical example consisting of the estimation of the rotation angle of a single-qubit state exp(−iσ_y θ/2)|↑⟩ (σ_{x,y,z} are the usual Pauli matrices and |↑⟩, |↓⟩ are eigenstates of σ_z). The rotation angle θ is estimated by projecting the output state |ψ(θ)⟩ = exp(−iσ_y θ/2)|↑⟩ on σ_z. The two possible results, µ = ↑, ↓, occur with probabilities P(↑|θ) = cos²(θ/2) and P(↓|θ) = 1 − P(↑|θ), respectively, which are monotonic over the interval θ ∈ [0, π]. Aside from being pedagogical, such a system is relevant to NV-center magnetometers [25,29,[31][32][33]]. Later, we generalise to systems of many qubits, in separable and entangled states, eventually including noise during state preparation and/or in the output measurement.
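The single-qubit statistics above can be simulated in a few lines. The following is a minimal sketch (function and variable names are our own, not from the paper) that draws projective σ_z outcomes from the Born-rule likelihood P(↑|θ) = cos²(θ/2):

```python
import numpy as np

rng = np.random.default_rng(1)

def p_up(theta):
    """Born-rule likelihood P(up|theta) = cos^2(theta/2) for |psi(theta)>."""
    return np.cos(theta / 2.0) ** 2

def measure(theta, m):
    """Simulate m projective sigma_z measurements (1 = up, 0 = down)."""
    return (rng.random(m) < p_up(theta)).astype(int)

theta = 1.0
mu = measure(theta, 100000)
# The empirical frequency of 'up' approaches cos^2(theta/2) ~ 0.770
assert abs(mu.mean() - p_up(theta)) < 0.01
```

In an experiment, of course, the outcomes come from the apparatus and the likelihood is unknown; the simulation stands in for the sensor only for illustration.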

Training of the neural network
First, the parameter domain is discretized to form a uniform grid of d points θ_1, ..., θ_d which are assumed to be perfectly known. The training set consists of m_θj measurements performed at each θ_j. For example, the training set for a single qubit would contain d tuples {m_↑,θ, m_↓,θ}, where m_µ,θ is the number of times the result µ = ↑, ↓ was observed at a particular θ. During training, the network is shown all m_train = Σ_{j=1}^{d} m_θj measurement results µ, along with the labels θ_j, sampled from the (unknown) joint distribution P(µ, θ_j) = P(µ|θ_j) P(θ_j) [38]. Here, P(µ|θ_j) is the probability to observe a measurement result µ when the parameter is set to θ_j. This distribution fully characterizes the experimental apparatus (including all sources of noise and decoherence), is typically unknown to the experimentalist, and is never seen explicitly by the network. Additionally, the probabilities P(µ|θ_j) need not be sampled uniformly in θ_j, which may instead have some distribution P(θ_j). Via the optimization of the weights and links of artificial neurons, the network attempts to learn the conditional probability P_Λ(θ_j|µ), the degree of certainty that θ_j is the correct label given the particular µ shown during training. This is the essential idea of supervised learning. Here, the subscript Λ denotes the dependence of the output on the randomly chosen initial network, the training algorithm, and the training data itself. In Fig. 2(a) we show the two possible outputs of the network for the single-qubit example, P_Λ(θ_j|↑) and P_Λ(θ_j|↓) (blue dots), as a function of the label θ_j.
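The construction of such a training set, for the single-qubit example and a uniform sampling (hence a flat prior), can be sketched as follows. This is an illustrative simulation with hypothetical names; in practice the single-shot results would be recorded from the apparatus:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                                  # grid points theta_1 .. theta_d
theta_grid = np.linspace(0.0, np.pi, d)
m_per_point = 50                         # m_theta_j, uniform -> flat prior

# Single-qubit likelihood; in an experiment this is unknown and only
# ever sampled, never handed to the network.
def p_up(theta):
    return np.cos(theta / 2.0) ** 2

inputs, labels = [], []
for j, th in enumerate(theta_grid):
    # m_per_point single-shot results mu (1 = up, 0 = down), each with label j
    mu = (rng.random(m_per_point) < p_up(th)).astype(int)
    inputs.extend(mu.tolist())
    labels.extend([j] * m_per_point)

inputs = np.array(inputs)    # m_train = d * m_per_point single results
labels = np.array(labels)
```

Each training example is one measurement result paired with its grid label; the network never sees sequences of results during training.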

Bayesian inversion and prior distribution
Here, we recognize that the output of the neural network, P_Λ(θ_j|µ)δθ, can be interpreted as a Bayesian posterior distribution. As we have discretised the continuous random variable θ, it is necessary to account for the grid spacing δθ = θ_d/(d − 1). We show that the posterior distribution is formally obtained from the Bayes rule,

P_Λ(θ_j|µ) = P_Λ(µ|θ_j) P_Λ(θ_j) / P_Λ(µ).    (2)

We emphasise that the Bayesian inversion in Eq. (2) is performed indirectly by the network, which does not have access to any of the quantities on the right-hand side of Eq. (2). P_Λ(µ) normalises the posterior distribution, Σ_{j=1}^{d} P_Λ(θ_j|µ)δθ = 1, and P_Λ(θ_j) is the prior, which plays a conceptual as well as a practical role. Throughout this manuscript we treat the possible measurement results µ as a discrete random variable.
We calculate P_Λ(θ_j) from its definition as the marginal distribution, P_Λ(θ_j) = Σ_µ P_Λ(θ_j|µ) P_Λ(µ), with the sum extending over all possible measurement results µ. As P_Λ(µ) is also unknown, we eliminate it by again inserting the marginal expression P_Λ(µ) = Σ_{k=1}^{d} P_Λ(µ|θ_k) P_Λ(θ_k)δθ, which results in the implicit equation

P_Λ(θ_j) = Σ_µ P_Λ(θ_j|µ) Σ_{k=1}^{d} P_Λ(µ|θ_k) P_Λ(θ_k)δθ.    (3)

Equation (3) is a consistency relation that can be solved for P_Λ(θ_j), given the network output P_Λ(θ_j|µ) and the likelihood function P_Λ(µ|θ_j). Equation (3) can be solved for P_Λ(θ_j) ≡ p_j by recasting it as an eigenvalue problem Ap = 0, for the matrix

A_jk = δ_jk − Σ_µ P_Λ(θ_j|µ) P_Λ(µ|θ_k)δθ,    (4)

where δ_jk is the Kronecker delta. To evaluate Eq. (4) the likelihood P_Λ(µ|θ_k) is needed, whereas the network only provides P_Λ(θ_j|µ). For a sufficiently well trained network, we can approximate it with the ideal likelihood distribution, P_Λ(µ|θ_k) ≈ P(µ|θ_k), which is either known from theory, or else can be well approximated by the relative frequencies observed in the training data, P_Λ(µ|θ_j) ≈ m_µ,θj/m_θj. We have found that the prior calculation in Eqs. (3) and (4) is robust to the choice of P_Λ(µ|θ_k). As shown in Fig. 3, the prior P_Λ(θ_j) is determined by the sampling of the training data. For instance, if the training data is distributed uniformly (m_θj = m independent of θ), then P_Λ(θ_j) is flat, as in Fig. 3(a, b). A non-flat prior can be achieved by choosing a nonuniform distribution of training measurements. For instance, if m_train is the total number of measurements collected in the training set, the number of measurements m_θj at each θ_j could be distributed according to m_θj = m_train q(θ_j), where q(θ_j) is a positive function of θ_j with Σ_{j=1}^{d} q(θ_j) = 1. In this case, a well-trained network will learn a prior well approximated by P_Λ(θ_j) ≈ q(θ_j). Two examples are shown in Fig. 3, panels (c, d) and (e, f).
The grid itself could also be varied, resulting in a non-uniform grid spacing δθ_j = θ_{j+1} − θ_j, which would likewise produce a non-flat prior. However, this is equivalent to a choice of q(θ_j) on a uniform grid, as is clearly illustrated by the step-function example [Fig. 3(c, d)]: rather than q(θ_j) itself being a step function, the same result could be achieved using a flat q(θ_j) over a grid spanning [π/2, π] (rather than [0, π]) but sampled at twice the density. For this reason, we consider only uniform grid spacing throughout this manuscript. The prior thus retains the subjective nature that characterizes the Bayesian formalism: here, this subjectivity is associated with the arbitrariness in the collection of the training data.
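The eigenvalue problem Ap = 0 of Eq. (4) can be solved numerically, e.g. by extracting the null space of A via a singular value decomposition. Below is a self-contained sketch (our own illustration, not the paper's code): we build the single-qubit posteriors a well-trained network would produce from a hidden nonuniform prior q(θ_j), and then recover that prior:

```python
import numpy as np

d = 40
theta = np.linspace(0.0, np.pi, d)
dth = theta[1] - theta[0]

# Ideal single-qubit likelihoods P(mu|theta_j); columns mu = up, down.
like = np.stack([np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2], axis=1)

# Posteriors a well-trained network would output, generated here from a
# hidden nonuniform prior q(theta_j) that the method should recover.
q = 1.0 + 0.5 * np.cos(theta)
q /= q.sum() * dth
post = like * q[:, None]
post /= post.sum(axis=0, keepdims=True) * dth   # sum_j P(theta_j|mu) dth = 1

# Eq. (4): A_jk = delta_jk - dth * sum_mu P(theta_j|mu) P(mu|theta_k)
A = np.eye(d) - dth * (post @ like.T)

# The prior p_j is the null vector of A (right-singular vector with the
# smallest singular value), normalised so that sum_j p_j dth = 1.
_, s, vt = np.linalg.svd(A)
p = np.abs(vt[-1])
p /= p.sum() * dth

assert s[-1] < 1e-10 and np.allclose(p, q, atol=1e-8)
```

The null space is one-dimensional here because Σ_µ P(µ|θ_j) = 1, so the recovered vector is unique up to normalisation.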

Network-based BPE
The training of the network gives access to the single-measurement (m = 1) conditional probabilities P_Λ(θ_j|µ) and the prior distribution P_Λ(θ_j). We thus proceed with the estimation of an unknown parameter θ_true (of course, in the numerical experiment θ_true is known, but this information is never used). Notice that θ_true does not need to coincide with one of the grid values θ_j. We sample m random measurement results µ = µ_1, ..., µ_m from P(µ|θ_true). The Bayesian posterior distribution corresponding to the sequence µ is

P_Λ(θ_j|µ) = N P_Λ(θ_j) Π_{i=1}^{m} P̃_Λ(θ_j|µ_i),    (5)

where P̃_Λ(θ_j|µ_i) = P_Λ(θ_j|µ_i)/P_Λ(θ_j) and N is a normalisation factor (obtained numerically). For concreteness, in the single-qubit example, if a sequence of m measurements gives m_↑ results ↑ and m_↓ = m − m_↑ results ↓, the corresponding Bayesian probability distribution is P_Λ(θ_j|µ) = N P_Λ(θ_j) P̃_Λ(θ_j|↓)^{m_↓} P̃_Λ(θ_j|↑)^{m_↑}, see Fig. 2(b, c). Equation (5) requires only the single-measurement probabilities P_Λ(θ_j|µ) and the prior P_Λ(θ_j). Indeed, a key advantage of our method is that, while the network is trained with single (m = 1) measurement events, the Bayesian analysis can be performed, according to Eq. (5), for arbitrarily large m. In other words, we do not need to train the network for each m: the network is trained for m = 1, which guarantees the optimal use of training data. We emphasize that the prior P_Λ(θ_j) in Eq. (5) is obtained by solving Eq. (3): even for uniform training, Eq. (3) gives better results than the flat prior P(θ_j) = 1/π.

FIG. 3. In (a, b) the m_θj are distributed uniformly, resulting in a flat prior. In (c, d) the data is distributed according to a step function, resulting in a prior that is zero-valued over the part of the domain where no measurements were performed. Finally, in (e, f) a smooth distribution q(θ_j) is used, which is clearly reproduced by the network.
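The product in Eq. (5) is conveniently accumulated in log space to avoid numerical underflow for large m. A minimal sketch (using ideal single-shot qubit posteriors with a flat prior as stand-ins for a trained network's outputs; all names are illustrative):

```python
import numpy as np

d = 500
theta = np.linspace(0.0, np.pi, d)
dth = theta[1] - theta[0]

# Stand-ins for the network outputs: ideal single-shot posteriors
# P(theta_j|mu) for mu = up (column 0) and down (column 1), flat prior.
like = np.stack([np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2], axis=1)
post1 = like / (like.sum(axis=0, keepdims=True) * dth)
prior = np.full(d, 1.0 / np.pi)

def sequence_posterior(mu_seq):
    """Eq. (5): N * prior * prod_i [post1[:, mu_i] / prior],
    accumulated in log space for numerical stability."""
    log_p = np.log(prior)
    for mu in mu_seq:
        log_p += np.log(np.clip(post1[:, mu], 1e-300, None)) - np.log(prior)
    log_p -= log_p.max()                 # rescale before exponentiating
    p = np.exp(log_p)
    return p / (p.sum() * dth)           # normalisation factor N

# 385 'up' and 115 'down' results out of m = 500, as if theta_true ~ 1.0
post = sequence_posterior([0] * 385 + [1] * 115)
theta_hat = np.sum(theta * post) * dth
assert abs(theta_hat - 1.0) < 0.05
```

Working with log probabilities is essential in practice: for m of a few thousand, the raw product of single-shot posteriors underflows double precision.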
Given P_Λ(θ_j|µ) we can estimate θ_true by, for instance, the posterior mean

Θ(µ) = Σ_{j=1}^{d} θ_j P_Λ(θ_j|µ)δθ,    (6)

where the corresponding parameter uncertainty is quantified by the posterior variance

∆²θ(µ) = Σ_{j=1}^{d} [θ_j − Θ(µ)]² P_Λ(θ_j|µ)δθ,    (7)

which assigns a confidence interval to any measurement sequence µ. In a sufficiently well trained network, as the number of measurements m increases, P_Λ(θ_j|µ) converges to the Gaussian distribution [16,21]

P_Λ(θ_j|µ) ≈ √(mF(θ_true)/2π) exp[−mF(θ_true)(θ_j − θ_true)²/2],    (8)

centered at the true value θ_true and with variance 1/mF(θ_true), where F(θ) = Σ_µ [∂P(µ|θ)/∂θ]²/P(µ|θ) is the Fisher information. F(θ) provides a frequentist bound on the precision of a generic estimator, ∆²θ ≥ ∆²θ_CRB = 1/mF(θ_true), called the Cramér-Rao bound. This behavior is clearly exhibited by the network in Fig. 2(b, c): the distribution narrows as a function of m and centres around θ_true. The result Eq. (8) is valid for a sufficiently dense grid (i.e. δθ ≪ 1/√(mF(θ_true))) and in an appropriate phase interval around θ_true, and holds for any prior distribution P(θ_j), provided that P(θ_j) is non-vanishing around θ_true. By repeating the measurements and using Eq. (5), we can thus gain a factor √m in sensitivity, ∆θ ∼ 1/√m, without requiring either additional training data or additional training for each m. In other words, a single network can be used to provide an estimate for any number of repeated measurements m, limited only by the grid size, meaningful for ∆θ ≳ δθ. In the opposite limit, i.e. for m ≳ 1/[F(θ_true)(δθ)²], the estimation is biased, namely |⟨Θ(µ)⟩ − θ_true| ≳ √⟨∆²θ⟩. The brackets ⟨···⟩ denote the average over the likelihood function P(µ|θ_true). The presence of an asymptotic bias is intrinsic to Bayesian estimation on a finite grid, when θ_true does not coincide with one of the grid points. The effect is present even when using ideal probabilities (namely in the limit m_train → ∞) and is not associated with the neural network. Of course, insufficient training produces a network that poorly generalises to larger m.
Figure 2 (d, e) shows convergence to the expected asymptotic result as a function of the number of training examples m θj , for a fixed number of measurement events m = 50.
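Equations (6)-(8) are easy to check numerically for the single qubit, where F(θ) = 1 for all θ. The following sketch (an illustration with ideal probabilities, not the paper's code) verifies that the posterior variance approaches the Cramér-Rao value 1/m:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 1000
theta = np.linspace(0.0, np.pi, d)
dth = theta[1] - theta[0]

# Ideal single-shot posteriors for the qubit (flat prior), as stand-ins
# for a well-trained network's outputs.
like = np.stack([np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2], axis=1)
post1 = like / (like.sum(axis=0, keepdims=True) * dth)

theta_true, m = 1.2, 2000
mu = (rng.random(m) < np.sin(theta_true / 2) ** 2).astype(int)  # 1 = down
n_down = int(mu.sum())

# With a flat prior, Eq. (5) reduces to a product of likelihood factors.
log_p = ((m - n_down) * np.log(np.clip(post1[:, 0], 1e-300, None))
         + n_down * np.log(np.clip(post1[:, 1], 1e-300, None)))
log_p -= log_p.max()
post = np.exp(log_p)
post /= post.sum() * dth

theta_hat = np.sum(theta * post) * dth               # Eq. (6)
var = np.sum((theta - theta_hat) ** 2 * post) * dth  # Eq. (7)

# For this model F(theta) = 1, so Eq. (8) predicts var ~ 1/(m F) = 1/m.
assert abs(var * m - 1.0) < 0.2
```

The grid here is dense enough that δθ ≪ 1/√(mF), so the discretization bias discussed above is negligible.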
The strategy of classifying a sequence µ after training on single measurement results µ only (µ = ↑, ↓ for the single-qubit case) is a key difference between this work and typical supervised learning problems such as image recognition [34][35][36]. In image recognition there is a risk that, during training, a network merely memorizes the training images and poorly generalises to unseen images (this is called overfitting). The single-measurement training that we use avoids this problem: our network is expected to generalise from the single measurement results seen during training to sequences with m > 1 via Eq. (5). Therefore, the network will never be asked to perform a prediction on an input µ not found in the training set (which, in the single-qubit example, only ever contains µ = ↑, ↓). Rather, if the machine-learned Bayesian posterior for the single measurements µ is noisy or imperfect, this error will quickly compound when Eq. (5) is applied. Therefore, it is important to compute metrics relevant to parameter estimation, such as the mean bias or posterior variance (as in Fig. 4).

Application to many-qubit states
In this section we extend our procedure to systems of N qubits and demonstrate its effectiveness for both separable and entangled states. We introduce the collective spin operators Ĵ_k = (1/2) Σ_{i=1}^{N} σ̂_k^(i), where σ̂_k^(i) is the kth Pauli matrix for the ith qubit. Making use of these observables, the generalisation from a single qubit to many qubits is straightforward: the network is trained to recognise the result of a single Ĵ_z measurement with N + 1 possible outcomes. The Bayesian posterior for many measurements is then obtained from Eq. (5). We consider phase-dependence encoded by a rotation about Ĵ_y, which is equivalent to a Mach-Zehnder interferometer [1]. In Fig. 4 we apply our method to a coherent-spin state (CSS) |CSS⟩ = |↓⟩^⊗N (top panels), a twin-Fock state (TFS) given by the symmetrized combination of N/2 spin-up and N/2 spin-down particles, |TFS⟩ = Symm{|↑⟩^⊗N/2 ⊗ |↓⟩^⊗N/2} (middle panels), and a depolarised TFS, ρ̂ = (1 − ε)|TFS⟩⟨TFS| + ε I/(N + 1), where I is the identity matrix (in the subspace of permutation-symmetric states) and ε = 0.1 (bottom panels). We quantify the performance of the network by the mean posterior variance ⟨∆²θ(µ)⟩ and bias ⟨Θ(µ)⟩ − θ_true, averaged over all possible measurement sequences µ. For all three states, Fig. 4 shows that our neural-network-based BPE is asymptotically efficient and unbiased when tested on a θ not found in the training grid. As expected for the CSS, the posterior variance saturates the standard quantum limit on average (SQL, ∆²θ_SQL = 1/mN). Similarly, the TFS posterior variance (7) overcomes the SQL and approaches, on average, the Cramér-Rao bound ∆²θ_CRB = 1/[mN(N/2 + 1)] in the limit of many repeated measurements m. The same is true for the depolarised TFS, demonstrating that our neural-network-based BPE is also applicable to mixed states. Furthermore, on average, the estimator (6) gives the true value of the parameter, as expected, so long as the training set is sufficiently large relative to the desired number of measurements m.
In particular, networks that are shown more measurements during training are better able to generalise to large m.
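For the CSS, the Ĵ_z outcome statistics are binomial (each qubit flips independently with probability sin²(θ/2) under the Ĵ_y rotation), which makes the SQL easy to verify numerically. A small sketch of this standard check (our own illustration, assuming the binomial model just stated):

```python
import numpy as np
from math import comb

N = 50  # number of qubits

def p_outcomes(theta):
    """P(n_up | theta) for a CSS |down>^N rotated about J_y: each qubit
    flips independently with probability sin^2(theta/2), so the J_z
    outcome is determined by a binomially distributed n_up."""
    p = np.sin(theta / 2) ** 2
    return np.array([comb(N, n) * p**n * (1 - p) ** (N - n)
                     for n in range(N + 1)])

def fisher(theta, eps=1e-6):
    """Numerical Fisher information F(theta) = sum_mu (dP/dtheta)^2 / P."""
    P = p_outcomes(theta)
    dP = (p_outcomes(theta + eps) - p_outcomes(theta - eps)) / (2 * eps)
    mask = P > 1e-12                      # skip numerically empty outcomes
    return float(np.sum(dP[mask] ** 2 / P[mask]))

# F = N for the CSS, reproducing the SQL variance 1/(m N).
assert abs(fisher(1.0) - N) < 1e-2
```

For the twin-Fock and over-squeezed states the outcome distributions are no longer binomial, which is precisely where learning them from data becomes valuable.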

Comparison to calibration-based BPE
It is natural to ask how the network compares to conventional (calibration-based) BPE [12,14,15] making use of the same training data. Consider a training set where m_θj measurements are performed at each θ_j, with result µ occurring m_µ,θj times at this θ_j. We assume a uniform distribution of the m_θj, corresponding to a flat prior. The standard approach to either Bayesian or maximum-likelihood estimation is to take this data set and estimate the likelihood functions P(µ|θ_j) using the relative frequencies, P(µ|θ_j) ≈ m_µ,θj/m_θj ≡ f_µ,θ, usually aided by some kind of fitting procedure [12,14]. The posterior distribution P(θ_j|µ) is then obtained by choosing a prior P(θ_j) and applying Bayes' theorem, P(θ_j|µ) = P(θ_j) Π_{i=1}^{m} P(µ_i|θ_j)/P(µ), where P(µ) provides normalization and µ = µ_1, ..., µ_m is a measurement sequence. We call this a calibration-based Bayesian analysis. A drawback is that it generally requires collecting a large calibration data set, such that the relative frequencies f_µ,θ well approximate the corresponding probabilities. A further problem is that it is not possible to associate a Bayesian probability to (rare) detection events that did not appear during the calibration, unless the probability is inferred through an arbitrary fit or interpolation procedure. Both issues are overcome by our neural-network-based BPE.
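The calibration-based baseline just described can be sketched as follows for the single qubit (an illustrative simulation, not the paper's implementation; names are hypothetical). Note the ad hoc floor needed when a frequency is zero, which is exactly the rare-event problem mentioned above:

```python
import numpy as np

rng = np.random.default_rng(7)
d, m_cal = 50, 100                       # grid points, calibration shots each
theta = np.linspace(0.0, np.pi, d)
dth = theta[1] - theta[0]

# Calibration: estimate P(up|theta_j) by the relative frequencies
# f_{mu,theta_j} observed in m_cal simulated shots per grid point.
p_up_exact = np.cos(theta / 2) ** 2
f_up = rng.binomial(m_cal, p_up_exact) / m_cal
f = np.stack([f_up, 1.0 - f_up], axis=1)  # columns: mu = up, down

def calib_posterior(mu_seq):
    """Bayes theorem with frequency-estimated likelihoods, flat prior.
    Zero observed frequencies force an arbitrary floor to avoid log(0)."""
    log_p = np.zeros(d)
    for mu in mu_seq:
        log_p += np.log(np.clip(f[:, mu], 1e-12, None))
    log_p -= log_p.max()
    p = np.exp(log_p)
    return p / (p.sum() * dth)

theta_true = 1.0
mu_seq = (rng.random(200) >= np.cos(theta_true / 2) ** 2).astype(int)
post = calib_posterior(mu_seq)
theta_hat = np.sum(theta * post) * dth   # noisy frequencies add extra error
```

With only m_cal = 100 calibration shots per grid point, the frequency noise propagates into the posterior; this is the regime where the trained network's smooth posteriors are expected to help.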
In Figure 5 we compare our network-based BPE to calibration-based BPE. We consider a multipartite entangled, non-Gaussian state (ENGS) of N = 50 qubits. Entanglement is generated using the one-axis-twisting Hamiltonian Ĥ_OAT = χĴ_z² [77], for χt = 0.3π, which is in the over-squeezed regime [78]. Being highly non-Gaussian, the state makes it difficult to aid the calibration with parametric curve fitting. The network, on the other hand, is well suited to learning arbitrary probability distributions. Figure 5(a) shows a typical example of a single-shot posterior distribution learned by the network, compared to the relative frequencies in Fig. 5(b). The relative frequencies are intrinsically coarse-grained, e.g. in Fig. 5(b) the resolution limit 1/m_θj is visible, unlike the network output, which is smooth. In Fig. 5(c, d) we compare the statistically-averaged posterior mean-square error (MSE), which quantifies the fluctuations in the deviation of the Bayesian estimate from θ_true (see [17] and Refs. therein). The posterior MSE is a useful figure of merit in realistic models (either a network or a calibration attempt) because imperfections due to unavoidable noise in training/calibration data can result in an individual estimate Θ(µ) deviating from the true value θ_true, even asymptotically. Calibration/training noise can result in positively or negatively biased estimates with equal frequency, which can lead to a deceptively low bias on average (this explains the low bias in Figure 4 when m_θj = 10). Figure 5(c, d) clearly shows that the neural network outperforms the calibration (see Methods for details), independently of the phase shift θ_true or the number of measurements m. As a sanity check, we have verified that the calibration and the network agree well when the training set is large enough. The solid orange curve is the exact result (as would be produced by a perfect calibration/network).
This is clear evidence that, with limited training/calibration data, our machine-learning approach can provide an advantage over conventional calibration techniques for states that are difficult or impossible to fit. Finally, in Fig. 5(e) we include the effects of finite detection resolution ∆µ, which is a major limitation in large-N systems [1]. Modelling of detection noise is discussed in Methods. Although the sensitivity is degraded, network-based BPE continues to outperform calibration-based BPE given equal training/calibration resources; see Methods for details.

DISCUSSION
By reformulating parameter estimation as a classification task, we have shown how to efficiently perform BPE using an artificial neural network with an optimal use of calibration data. The prior distribution -which is the characteristic trait of BPE -is directly linked to the training process : the subjectivity of prior knowledge is reflected by the subjective choice of training strategy.
BPE offers important advantages, most notably the asymptotic saturation of the frequentist Cramér-Rao bound, which holds regardless of the statistical model. Indeed, we have demonstrated that our strategy is consistent and efficient for both separable and entangled states of many qubits. Compared to other BPE protocols based on calibration data, our method is most effective for non-Gaussian states. We found that our neural-network-based BPE procedure can outperform standard calibration-based BPE protocols when the training/calibration data is limited and in the absence of an obvious or simple fitting function. This advantage persists in the presence of finite detection resolution and for noisy probe states. In fact, our approach is most valuable when the quantum sensor is a black box, namely when the conditional probabilities of the possible measurement results lack a simple explicit model based on a few fitting parameters. In this case our knowledge about the quantum sensor's operation is limited to calibration data.
Our neural-network-based BPE is readily applicable to current optical and atomic experiments, and therefore could enable BPE with entangled non-Gaussian states in current high-precision quantum sensors. Although we focus on single-parameter estimation, our results could also be extended to the simultaneous estimation of multiple parameters.

ACKNOWLEDGMENTS

We acknowledge funding from the project EMPIR-USOQS; EMPIR projects are co-funded by the European Union's Horizon 2020 research and innovation programme and the EMPIR Participating States. We also acknowledge financial support from the European Union's Horizon 2020 research and innovation programme (Qombs Project, FET Flagship on Quantum Technologies, grant no. 820419), and from the H2020 QuantERA ERA-NET Cofund in Quantum Technologies, projects QCLOCKS and CEBBEC.

METHODS
Machine learning methods. Throughout this manuscript we employ densely connected, feed-forward neural networks. The networks are implemented and trained using the Python-based, open-source package Keras [79]. All hidden layers use ReLU (rectified linear unit) neurons. All networks have a single input neuron, which accepts a single real number µ. The number of hidden layers depends on the system, but for a single qubit a single layer of 4 neurons is sufficient (see Fig. 2). For larger and more complex states, more layers and neurons can help, as in Fig. 4 or Fig. 5. The output layer consists of d softmax neurons, one for each grid point θ_j, whose values a_j are normalised, Σ_j a_j = 1, by construction. As we argue in the main text, the output of the network is interpreted as a Bayesian posterior distribution, P_Λ(θ_j|µ)δθ = a_j. The training process is described in depth elsewhere, see for instance Refs. [34,36]. Briefly, the network is first initialised with random weights. For efficiency, the training set is randomly divided into subsets called mini-batches. The label θ_j is encoded as a d-dimensional vector whose kth element is the Kronecker delta δ_jk. Each training element in the current mini-batch is fed into the network, and its label is used to evaluate a cost function C. We use the categorical cross-entropy, which for a µ with label θ_j is simply C = −log(a_j). C is then averaged over the whole mini-batch and minimised using the ADAM algorithm [80]. This is repeated until the entire training set is exhausted, which is called a training epoch. Typically many epochs are required to reach an optimal network.

Numerical details for Figures.
In Fig. 2, the network has a single input neuron (which takes as input the result of a single measurement µ), a single hidden layer of 4 neurons and 100 output neurons (corresponding to a θ grid with 100 points). The training set contained m_θj = 10³ training measurements per grid point, evenly distributed (corresponding to a flat prior). The network was trained for 5 epochs with a mini-batch size of 128.
In Fig. 3, networks were trained to perform inference on a single qubit, and have 40 output neurons (corresponding to a θ grid of 40 points), but otherwise have the same architecture as the network in Fig. 2. Training is performed for 10 epochs with a mini-batch size of 128. The training set contains a total of m_train = 40 × 10³ measurement results.
In Fig. 4, the network trained for coherent-spin states had 1 input neuron, 1 hidden layer of 8 neurons, and 1000 output neurons between 0 ≤ θ_j ≤ π. The twin-Fock-state network was more complex: 1 input neuron, 2 hidden layers with 32 neurons each, and 1000 output neurons uniformly distributed between 0 ≤ θ_j ≤ π/2. Training parameters are adapted to the size of the training set, which is uniform (corresponding to a flat prior). The coherent-spin-state training parameters for m_θj = 10, 100, 1000 are: 60 epochs with a mini-batch size of 8, 40 epochs with 16, and 20 epochs with 32, respectively. The twin-Fock-state training parameters for m_θj = 10, 100, 1000 are: 60 epochs with a mini-batch size of 8, 40 epochs with 16, and 30 epochs with 128, respectively.
In Fig. 5, the neural network had 3 hidden layers with 256 neurons in each, and an output grid with 2000 neurons between 0 ≤ θ j ≤ π. Training was for 60 epochs with a mini-batch size of 1024. The calibration was performed by approximating the likelihood function P (µ|θ j ) by the relative frequencies observed in the training data, smoothed with a cubic interpolation at twice the grid density. The interpolation was performed using interp1d from Python's scipy package.
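As a rough illustration of the training ingredients described above (a minimal numpy sketch, not the Keras implementation used in this work; weights, shapes and names are illustrative), the forward pass through one ReLU hidden layer, the softmax output a_j, and the categorical cross-entropy cost can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                              # output grid points
W1, b1 = rng.normal(0, 0.5, (4, 1)), np.zeros(4)    # 1 input -> 4 ReLU
W2, b2 = rng.normal(0, 0.5, (d, 4)), np.zeros(d)    # 4 -> d softmax

def forward(mu_batch):
    """Map a batch of scalar inputs mu to softmax outputs a_j (one column
    per input), normalised so that sum_j a_j = 1 by construction."""
    x = np.atleast_2d(mu_batch)                     # shape (1, batch)
    h = np.maximum(0.0, W1 @ x + b1[:, None])       # ReLU hidden layer
    z = W2 @ h + b2[:, None]
    z -= z.max(axis=0)                              # stable softmax
    a = np.exp(z)
    return a / a.sum(axis=0)

def cross_entropy(mu_batch, labels):
    """Categorical cross-entropy C = -log a_j, averaged over a mini-batch."""
    a = forward(mu_batch)
    return -np.mean(np.log(a[labels, np.arange(len(labels))]))

a = forward(np.array([0.5]))                        # one column, sums to 1
loss = cross_entropy(np.array([0.1, 0.9]), np.array([2, 7]))
```

In the actual experiments this forward pass and loss are handled by Keras, with the ADAM optimizer updating the weights over the mini-batches and epochs listed above.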

DATA AVAILABILITY
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

CODE AVAILABILITY
Any code used for the current study is available from the corresponding author on reasonable request.