Semiconductor quantum devices hold great promise for scalable quantum computation. In particular, individual electron spins in quantum dot devices have already shown long spin coherence times with respect to typical gate operation times, high fidelities, all-electrical control, and good prospects for scalability and integration with classical electronics.1

A crucial challenge of scaling spin qubits in quantum dots is that electrostatic confinement potentials vary strongly between devices, and even in time, due to randomly fluctuating charge traps in the host material. Characterising such devices, which requires measurements of current or conductance at different applied biases and gate voltages, can be very time consuming. It is normally carried out following simple scripts such as grid scans, i.e. sequential measurements taken from a 2D grid for a pair of voltages. We call a set of voltages that defines the state of a quantum dot a configuration. Measuring some configurations is more informative for characterising a quantum dot than measuring others; uncertain signals carry more information than predictable ones. However, grid scans do not prioritise informative measurements, instead acquiring data according to simple rules (e.g. following a raster pattern). Current efforts in the automation of quantum dot experiments are focused on tuning,2,3,4,5,6,7,8,9 with a large portion of these relying on grid scanning techniques for measurement. An optimised measurement method that can prioritise and select important configurations is thus key for fast characterisation and automatic tuning. Our method is therefore complementary to approaches that automate the tuning of quantum devices and holds the potential to increase their efficiency when combined with them.

In this paper, we present an algorithm that performs efficient real-time data acquisition for a quantum dot device (Fig. 1a). It starts from a low-resolution uniform grid of measurements, creates a set of full-resolution reconstructions, calculates the predicted information gain (i.e. the acquisition map), selects the most informative measurements to perform next, and repeats this process until the information gain from new measurements is marginal.
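The loop described above can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the reconstructions are stubbed as noisy copies of the partial data, and the acquisition map is the pixel-wise variance across reconstructions, used here as a simple proxy for the information gain described later. All function and parameter names are ours.

```python
import numpy as np

def acquisition_loop(true_map, n_init=8, n_recon=10, ig_threshold=1e-3, rng=None):
    """Skeleton of the measure-reconstruct-decide loop of Fig. 1a.

    `true_map` plays the role of the device; each 'measurement' reads one pixel.
    """
    rng = rng or np.random.default_rng(0)
    h, w = true_map.shape
    measured = np.zeros((h, w), dtype=bool)
    # (i) low-resolution uniform grid of initial measurements
    measured[::h // n_init, ::w // n_init] = True
    while not measured.all():
        # (ii) stub reconstructions: observed pixels kept, noise elsewhere
        recons = np.repeat(true_map[None], n_recon, axis=0).astype(float)
        noise = rng.normal(0.0, 1.0, recons.shape)
        recons[:, ~measured] = noise[:, ~measured]
        # (iii) acquisition map: disagreement between reconstructions
        acq = recons.var(axis=0)
        acq[measured] = -np.inf          # never re-measure a pixel
        if acq.max() < ig_threshold:     # stopping criterion
            break
        # (iv) measure the most informative pixel
        measured[np.unravel_index(acq.argmax(), acq.shape)] = True
    return measured
```

With the stubbed reconstructions the unmeasured-pixel disagreement never falls below a small threshold, so the loop acquires the full map; a large threshold stops it after the initial grid.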

Fig. 1
figure 1

Overview of the algorithm and the quantum dot device. a Schematic of the algorithm’s operation. Low-resolution measurements (i) are used to produce reconstructions (ii), which are used to infer the predicted information gain acquisition map (iii). Based on this map, the algorithm chooses the location of the next measurement (iv). The process is repeated until a stopping criterion is met. b Schematic of the device. A bias voltage \({V}_{\text{bias}}\) is applied between ohmic contacts to the two-dimensional electron gas. We apply gate voltages labelled \({V}_{1}\) to \({V}_{4}\) and \({V}_{\text{G}}\). c A measured current map as a function of \({V}_{\text{bias}}\) and \({V}_{\text{G}}\). The Coulomb diamonds are the white regions where electron transport is suppressed, and most of the information necessary to characterise a device is contained just outside these diamonds. d The sequential decision algorithm of a, illustrated with an example of a specific current map. In panel (iv), unmeasured pixels are plotted in black; the initial measurements (i), however, are represented so as to fill the entire panel (that is, the sparse grid of measurements is shown as a low-resolution image)

In order to select measurements based on information theory, we require a corresponding uncertainty measure (of random variables),10,11,12 and hence a probabilistic model of unobserved variables. One typical approach is to use a Gaussian process.13 Here, we use a conditional variational auto-encoder (CVAE),14 which is capable of generating high-resolution reconstructions given partial information and is fast enough for real-time decisions. Deep generative models such as the generative adversarial network (GAN),15 the variational auto-encoder (VAE)16 and its extensions, such as the CVAE, have shown great success in modelling multi-modal distributions and complex non-stationary patterns of data,17,18 similar to those observed in quantum device measurements. These are the main advantages of the CVAE over a basic Gaussian process. The CVAE is also more computationally efficient at generating multiple full-resolution reconstructions. Although progress has been made in addressing the limitations of Gaussian processes, deep generative models are overall a better fit to the requirements of efficient quantum device measurements. Deep generative models have been used for: speech synthesis;19 generating images of digits and human faces;20,21 transferring image style;22,23 and inpainting missing regions of images.24 Recently, VAE models have been used in scientific research to optimise molecular structures.25,26,27,28 In spite of their suitability, these models have not previously been applied to efficient data acquisition. An advantage of deep generative models over simple interpolation techniques, such as nearest-neighbour and bilinear interpolation, is that deep generative models can learn likely patterns from training data and incorporate them into their reconstructions. Since our method is data-driven, it generalises to different transport regimes, measurement configurations, and more complex device architectures if an appropriate training set is available.


The device

Our device is a laterally defined quantum dot fabricated by patterning Ti/Au gates over a GaAs/AlGaAs heterostructure containing a two-dimensional electron gas (Fig. 1b). In this device, electrons are subject to the confinement potential created electrostatically by gate voltages. Gate voltages \({V}_{1}\) to \({V}_{4}\) tune the tunneling rates while \({V}_{\text{G}}\) mainly shifts the electrical potential inside the quantum dot. The current through the device is determined both by these gate voltages and by the bias voltage \({V}_{\text{bias}}\). Measurements were performed at \(30\ {\rm{mK}}\).

The quantum dot is characterised by acquiring maps of the electrical current as a function of a pair of varied voltages, which we call a current map configuration. We first focus on varying \({V}_{\text{G}}\) and \({V}_{\text{bias}}\) for fixed values of \({V}_{1}\) to \({V}_{4}\); Fig. 1c shows a typical example. Diamond-shaped regions or ‘Coulomb diamonds’ indicate Coulomb blockade, where electron tunnelling is suppressed.29 Most current maps have large areas in which the current is almost constant, and measurements in these regions consequently slow down informative data acquisition. Our algorithm must, therefore, give measurement priority to the informative regions of the map. An overview of an algorithm-assisted measurement of a current map is shown in Fig. 1d.

Training the reconstruction model

The role of the reconstruction model is to characterise likely patterns in a training data set, given by a mixture of measured and simulated current maps. We can utilise these likely patterns to predict the unmeasured signals from partial measurements.

Deep generative models represent this pattern characterisation in a low-dimensional real-valued latent vector \({\bf{z}}\), which can be decoded to produce a full-resolution reconstruction. The latent space representation and the decoding are learned during training. Our CVAE consists of two convolutional neural networks, an encoder and a decoder. The encoder is trained to map full-resolution training examples of current maps \(Y\) to the latent space representation \({\bf{z}}\). The encoder also enforces that the distribution \(p({\bf{z}})\) of training examples in latent space is Gaussian.

The decoder is trained to reconstruct \(Y\) from the representation \({\bf{z}}\) combined with an \(8\times 8\) subsample of \(Y\). As a result, \({\bf{z}}\) learns to represent the information that is missing from the subsampled data. In a plain VAE, the input of the decoder is only \({\bf{z}}\); if the decoder takes additional input besides \({\bf{z}}\), the model is called a CVAE, and we found that the CVAE generates better reconstructions than the VAE for the considered measurements. The chosen loss function, which the CVAE tries to minimise, measures the difference between the training data and the corresponding reconstruction. To avoid blurry reconstructions, we define a contextual loss function that incorporates both pixel-by-pixel differences and higher-order differences such as edges, corners, and shapes. A detailed description of these networks and their training can be found in the Supplementary sections Training and loss function, and Network specification.
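One plausible form of such a contextual loss combines a pixel-wise L1 term with an L1 penalty on image-gradient differences, which penalises mismatched edges and corners. The specific form and the `edge_weight` value are our illustrative assumptions, not taken from the paper:

```python
import numpy as np

def contextual_loss(y_true, y_pred, edge_weight=1.0):
    """Pixel-wise L1 plus an L1 penalty on image-gradient differences.

    The gradient term discourages the blurry reconstructions that a
    purely pixel-wise loss tends to produce.
    """
    pixel_term = np.abs(y_true - y_pred).mean()
    # np.gradient returns the d/d(row) and d/d(col) maps of a 2D array
    gy_t, gx_t = np.gradient(y_true)
    gy_p, gx_p = np.gradient(y_pred)
    edge_term = (np.abs(gy_t - gy_p) + np.abs(gx_t - gx_p)).mean()
    return pixel_term + edge_weight * edge_term
```

A prediction with a misplaced edge is penalised more than one with the same mean pixel error but matching gradients.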

The model is trained using both simulated and measured current maps. We choose to work with current maps of resolution \(128\;\times 128\). The simulation is based on a constant-interaction model (see Methods). To measure the current maps for training, we set the bias and gate voltage ranges randomly from a uniform distribution. The training data set consists of 25,000 simulations and 25,000 real examples generated by randomly cropping 750 measured current maps. The current maps were subjected to random offsets, rescaling, and added noise to increase the variability of the training set.
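The augmentation step can be sketched as follows; the offset, rescaling, and noise ranges here are our illustrative choices, not values reported in the paper:

```python
import numpy as np

def augment(current_map, rng):
    """Apply a random offset, rescaling, and additive noise to one map."""
    offset = rng.uniform(-0.1, 0.1)       # random current offset
    scale = rng.uniform(0.8, 1.2)         # random rescaling
    noise = rng.normal(0.0, 0.01, size=current_map.shape)
    return scale * (current_map + offset) + noise

rng = np.random.default_rng(1)
batch = [augment(np.zeros((128, 128)), rng) for _ in range(4)]
```

Each call yields a distinct variant of the same underlying map, increasing the variability of the training set.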

Generating reconstructions from partial data

After training, only the trained decoder network is used in the algorithm of Fig. 1a to reconstruct full-resolution current maps from partial data. At each stage, the known partial current map is denoted \({Y}_{n}\), where \(n\le 12{8}^{2}=16,384\) is the number of pixels to be measured. To generate a reconstruction, the decoder takes as input the initial \(8\times 8\) grid scan \({Y}_{64}\), together with a latent vector \({\bf{z}}\) sampled from the posterior distribution \(p({\bf{z}}|{Y}_{n})\) (see Methods for detailed equations and Fig. S1 for the decoder diagram). Note that the posterior density is determined by the prior density \(p({\bf{z}})\) and a likelihood function that compares reconstructions with the partial data. Multiple posterior samples \({{\bf{z}}}_{m}\) are drawn by the Metropolis–Hastings (MH) method to approximate \(p({\bf{z}}|{Y}_{n})\), and the corresponding reconstructions, denoted \({\hat{Y}}_{m}\), are then generated. In this paper we set \(m=1,\ldots ,100\). The continuous posterior \(p({\bf{z}}|{Y}_{n})\) is thus approximated by a discrete posterior over samples, \({P}_{n}(m)\), which denotes how probable \({\hat{Y}}_{m}\) is. We refer to \({P}_{n}(m)\) as the posterior distribution of reconstructions.

Making measurement decisions

With each iteration of the decision algorithm, an acquisition map is computed from the accumulated partial measurements and the resulting reconstructions. This acquisition map assigns to each potential measurement location (i.e. to each pixel location in the current map) an information value for the posterior distribution of reconstructions (Fig. 2). The \((n+1)\)th measurement, whose result is \({y}_{n+1}\), is one pixel taken from the true current map and changes our posterior distribution from \({P}_{n}(m)\) to \({P}_{n+1}(m)\), rendering different reconstructions more or less probable.

Fig. 2
figure 2

Computing the acquisition map. a Partial current map. To illustrate the first step in the computation of the acquisition map, we consider a trace (green) through an unmeasured region of the map. b For the unmeasured trace in a, reconstructions provide \(100\) different predictions. Blue and yellow traces highlight two of these predictions. The objective is to determine the most informative measurement location. At \({x}_{2}\), all predictions are similar, so measuring here will have little impact on the posterior distribution of reconstructions. At \({x}_{1}\), predictions are dissimilar and, therefore, \({x}_{1}\) is a more informative measurement location, with a larger effect on the posterior distribution of reconstructions. c Information gain computed for the unmeasured trace in (a). d Acquisition map of information gain computed from the partial measurements in (a), and plotted over the entire current map range

The acquisition map is the expected information gain \({\rm{IG}}(x)\) at each potential measurement location \(x\). Our algorithm calculates it by a weighted sum over reconstructions:

$${\rm{IG}}(x)\equiv \mathop {\sum }\limits_{m}{P}_{n}(m)\times {{\rm{IG}}}_{m}(x),$$

where \({{\rm{IG}}}_{m}(x)\) is the Kullback–Leibler divergence between the distributions \({P}_{n}\) and \({P}_{n+1}\), calculated such that \({y}_{n+1}\) at location \(x\) is taken from reconstruction \({\hat{Y}}_{m}\). The most informative point is \({x}_{n+1}^{* }\equiv {{\rm{argmax}}}_{x}{\rm{IG}}(x)\). This criterion is equivalent both to minimising the expected information entropy of the posterior distribution and to Bayesian active learning by disagreement (BALD,10 see Methods). The difference between the proposed method and BALD is that the proposed method uses random reconstructions of the data, which can be multi-modal, whereas BALD assumes that the data are normally distributed.
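With a discrete posterior \({P}_{n}(m)\) over reconstructions, the weighted sum above can be computed directly: for each candidate location \(x\), a hypothetical measurement \({y}_{n+1}={\hat{Y}}_{m}(x)\) reweights the posterior through an exponential likelihood of the kind used in Methods, and the resulting KL divergences are averaged under \({P}_{n}\). A minimal sketch (the toy reconstructions and \(\lambda\) value are ours):

```python
import numpy as np

def information_gain(recons, P_n, x, lam=1.0):
    """Expected KL divergence IG(x) under the discrete posterior P_n.

    recons: (M, H, W) array of reconstructions; P_n: (M,) posterior weights.
    """
    vals = recons[:, x[0], x[1]]               # each reconstruction's value at x
    ig = 0.0
    for m, y_hyp in enumerate(vals):
        # posterior update if the measurement at x returned y_hyp
        w = P_n * np.exp(-lam * np.abs(vals - y_hyp))
        P_next = w / w.sum()
        kl = np.sum(P_next * np.log(P_next / P_n))   # KL(P_{n+1} || P_n)
        ig += P_n[m] * kl                            # weight by P_n(m)
    return ig
```

Locations where all reconstructions agree give zero information gain; locations where they disagree give a positive value, reproducing the intuition of Fig. 2b.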

We devised two methods to make decisions based on the acquisition map: a pixel-wise method and a batch method. The pixel-wise method selects the single best location in the acquisition map. In practice, this is often not optimal in terms of measurement time, because it does not take into account the time needed to ramp the gate voltage between measurement locations (which is limited by details of the measurement electronics and the device settling time). To address this limitation, we also devised a batch method, which selects multiple locations from the acquisition map and then acquires measurements by taking a fast route between them. This reduces the measurement time compared with the pixel-wise method.
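A minimal version of the batch method: take the \(k\) highest-value pixels from the acquisition map, then visit them in a greedy nearest-neighbour order to limit voltage-ramp time. The greedy route is our simplification; any fast route between the selected locations serves the same purpose:

```python
import numpy as np

def batch_route(acq_map, k):
    """Select the k most informative pixels and order them greedily."""
    flat = np.argsort(acq_map, axis=None)[::-1][:k]     # top-k flat indices
    pts = np.column_stack(np.unravel_index(flat, acq_map.shape)).astype(float)
    route = [pts[0]]                                    # start at the best pixel
    remaining = list(pts[1:])
    while remaining:
        last = route[-1]
        d = [np.hypot(*(p - last)) for p in remaining]  # Euclidean distances
        route.append(remaining.pop(int(np.argmin(d))))  # nearest unvisited next
    return np.array(route, dtype=int)
```

For example, starting from the highest-value pixel, the route always jumps to the closest remaining batch member rather than the next-highest-value one.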


To test the algorithm, it was used to acquire a series of current maps in real time. First, the device was thermally cycled to randomise the charge traps and thereby present the algorithm with a configuration not represented in its training data. Gate voltages \({V}_{1}\) to \({V}_{4}\) were set to a combination of values, and the algorithm was tasked to measure the corresponding current map using both the batch and the pixel-wise methods. This step was repeated for ten different combinations of bias and gate voltages. Fig. 3 presents data acquired by the algorithm at selected acquisition stages, together with selected reconstructions. As expected, reconstructions become less diverse as more measurements are acquired. The reconstructions do not necessarily replicate the measured current map for large \(n\), because their variability is limited by the training data. Decisions are made based on the patterns learned from the training data, which implies that the training data should contain the general patterns to be characterised, but need not include all possible features of a current map.

Fig. 3
figure 3

Updating reconstructions using information from new measurements. In each row, the first column shows the algorithm-assisted measurements, using the batch method, for a given \(n\). The remaining three columns contain example reconstructions given the corresponding \(n\) measurements. As \(n\) increases, the diversity of the reconstructions is reduced and their accuracy increases. As expected, the uncertainty is almost eliminated in the last row. The small remaining variance arises because slightly different reconstructions are nearly equally consistent with the data

As can be seen, the algorithm gives high priority to regions of the map where the current is rapidly varying, and avoids regions of nearly constant current, such as the interiors of Coulomb diamonds. This strategy is an emergent property of the algorithm and is wise; little information about the device characteristics can be found in regions of the current map with a low current gradient. This preference derives from the comparison between reconstructions, which exhibit the greatest disagreement outside Coulomb diamonds. This is also seen in Fig. 4a, which shows two representative measurement sequences using the batch method. The batch method collects grouped measurements, while the pixel-wise method distributes measurements more uniformly, since in that case the acquisition map is updated more frequently to take account of recently acquired information. Results for other current maps, including for the pixel-wise method, are shown in the Supplementary Figs. 2–6.

Fig. 4
figure 4

Measurements of Coulomb diamonds performed by the algorithm. a Sequential batch measurement in two different experiments. Each row displays algorithm assisted measurements of the current map as a function of \({V}_{\text{bias}}\) and \({V}_{\text{G}}\) for different values of \(n\). The last plot in each row is the full-resolution current map. b, d Current gradient map (defined by Eq. (2)) for each example in (a). c, e Measure of the algorithm’s performance \(r(n)\), real-time estimate of \(r(n)\) across reconstructions (with 90% credible interval shaded), and optimal \(r(n)\) for both examples in (a). The black line is the value of \(r(n)\) corresponding to the alternating grid scan method. The vertical orange line indicates the value of \(n\) determined by the stopping criterion. The corresponding current map in a is highlighted in orange

We compared the performance of the algorithm with an alternating grid scan method. This type of grid scan starts with 8\(\times\)8 measurements and alternately increases the vertical and the horizontal grid size by a factor of 2 (i.e. 16\(\times\)8, 16\(\,\times\)16, 32\(\,\times\)16, etc.), without performing the same measurement twice. Over the ten different current maps, the average time for full-resolution data acquisition with the alternating grid scan method is 554 seconds. This time is limited by our bias and gate voltage ramp rates and the chosen settling time. The batch method can be implemented with any batch size; however, for direct comparison with the alternating grid scan, we selected increasing batch sizes of 32\(\times {2}^{b}\), where \(b\) is the batch number starting from 1.
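For reference, the alternating grid scan sequence can be generated as follows; each stage doubles one dimension of the grid and yields only the not-yet-measured pixel indices:

```python
import numpy as np

def alternating_grid_stages(n=128, start=8):
    """Yield the new pixel indices of each stage: 8x8, 16x8, 16x16, ..."""
    seen = set()
    rows, cols = start, start
    while True:
        idx = {(r, c)
               for r in range(0, n, n // rows)
               for c in range(0, n, n // cols)}
        yield sorted(idx - seen)        # only measurements not yet performed
        seen |= idx
        if rows == n and cols == n:
            return
        if rows == cols:
            rows *= 2                   # alternately grow the vertical ...
        else:
            cols *= 2                   # ... and the horizontal resolution

stages = list(alternating_grid_stages())
```

The stage sizes are 64, 64, 128, 256, 512, ..., summing to the full \(128\times 128=16,384\) pixels with no repeats.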

Two types of computation are required to make a measurement decision: sampling reconstructions using the MH method, and constructing the acquisition map. One MH sampling iteration takes 63 ms. In the experiments, multiple sampling iterations are performed after each batch decision and measurement while acquisition is suspended. Since sampling can be performed simultaneously with measurement acquisition, our quoted measurement times exclude the time for sampling. Computing a single acquisition map takes approximately 50 ms using an NVIDIA GTX 1080 Ti graphics card and a TensorFlow30 implementation. The acquisition map must be computed for every batch or every pixel measurement, except for the initial \(8\times 8\) grid scan and the final acquisition step (which has no choice of which pixel(s) to measure). To acquire a full-resolution current map thus requires 7 computations (350 ms) for the batch method, and 16,319 computations (816 s) for the pixel-wise method. For the batch method, the computation time is negligible compared to the measurement time, but for the pixel-wise method it is a limiting factor in the measurement rate.

To quantify the algorithm’s performance, we have devised a measure based on the observation that the most informative regions of the current map are those where the current varies strongly with \({V}_{\text{G}}\) and \({V}_{\text{bias}}\). We therefore define the local gradient of the current map at each location \(x\) as

$$v(x)\equiv \;\parallel \nabla Y(x){\parallel }_{2}=\sqrt{{\left(\frac{\partial I(x)}{\partial {V}_{\text{G}}}\right)}^{2}+{\left(\frac{\partial I(x)}{\partial {V}_{\text{bias}}}\right)}^{2}},$$

where \(I(x)\) is the current measurement at \(x\), and the derivatives are calculated numerically. The error measure \(r(n)\) of a partial current map is the fraction of the total gradient that remains uncaptured, i.e.

$$r(n)\equiv 1-\frac{V(n)}{V(N)}$$

where \(V(n)\equiv {\sum }_{i=1}^{n}v({x}_{i})\) is the total acquired gradient and \({x}_{i}\) is the location of the \(i\)th measurement. This error can only be calculated after all measurements have been performed. However, we can utilise the \(m\)th reconstruction to generate an estimate \({\tilde{r}}_{m}(n)\) in real time by replacing \(\parallel \hskip -2pt \nabla Y(x){\parallel }_{2}\) with \(\parallel \hskip -2pt \nabla {\hat{Y}}_{m}(x){\parallel }_{2}\). The estimates from multiple reconstructions yield a credible interval for \(r(n)\). For an optimal algorithm, the error would be \(\bar{r}(n)=1-\frac{{V}^{* }(n)}{V(N)}\), where \({V}^{* }(n)\) is the sum of the largest \(n\) values of \(v(x)\). This would be achieved if each measurement location were chosen knowing the full-resolution current map, and thus the location of the highest unmeasured current gradient. No decision method can exceed this bound. For the real-time estimates of \(r(n)\), we have increased the number of reconstructions to 3,000 by adding different noise patterns that are present in typical measured current maps (see Supplementary section Noisy reconstruction). This increase in the variability of the reconstructions is needed to avoid an overconfident estimation of \(r(n)\).
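The error measure \(r(n)\) and its optimal lower bound can be computed directly from a gradient map. The sketch below uses a toy random map and a raster measurement order for illustration; in the experiment \(v(x)\) comes from the full-resolution measurement:

```python
import numpy as np

def gradient_map(Y):
    """v(x) = ||grad Y(x)||_2, computed with numerical derivatives."""
    gy, gx = np.gradient(Y)
    return np.hypot(gy, gx)

def error_curve(v, order):
    """r(n) for a given measurement order (flat indices), plus the optimal bound."""
    v_flat = v.ravel()
    total = v_flat.sum()
    acquired = np.cumsum(v_flat[order])                 # V(n) along the order
    r = 1.0 - acquired / total
    # optimal bound: acquire the largest gradients first
    r_opt = 1.0 - np.cumsum(np.sort(v_flat)[::-1]) / total
    return r, r_opt

v = gradient_map(np.random.default_rng(0).random((16, 16)))
r, r_opt = error_curve(v, np.arange(v.size))            # raster order
```

By construction \(r(N)=0\) and the optimal curve lower-bounds the curve of any measurement order.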

Performances for two experiments are shown in Fig. 4c, e. Grid scans reduce \(r(n)\) linearly with increasing \(n\). The decision algorithm outperforms a simple grid scan and is nearly optimal. When most of the current gradient is localised, the grid scan is far from optimal and even the decision algorithm has room for improvement. In this case, the performance of the algorithm is determined by how representative the training data is. Quantitative analysis of all 10 experiments is in Supplementary Figs. 5 and 6.

We propose a simple stopping criterion that uses the estimated reduction of the error \(r(n)\) to determine when to stop measuring a given current map, in a scenario where more experiments are waiting to be conducted. For a given current map containing \(n\) measured pixels, the error after the next measurement batch is estimated for reconstruction \(m\) to be \({\tilde{r}}_{m}(n+\Delta )\), where \(\Delta\) is the size of the batch. Thus the estimated rate at which the error decreases is \({\beta }_{m}\equiv |{\tilde{r}}_{m}(n+\Delta )-{\tilde{r}}_{m}(n)|/\Delta\). In the worst case among the candidate reconstructions, this rate is \(\beta \equiv \min\nolimits_{m}{\beta }_{m}\). However, if the algorithm begins to measure a new map, for which no reconstructions yet exist, the error of that map will decrease at a rate of at least \(\alpha \equiv 1/N\); this is the slope achieved by a grid scan and the worst case of the decision algorithm (black lines in Fig. 4c, e). Hence when \(\beta\, <\,\alpha\), it is beneficial to halt measurement and move on to a new current map that is awaiting measurement. Since \(\alpha\) and \(\beta\) are worst-case estimates, the criterion is conservative. The stopping points determined by this criterion are shown in Fig. 4c, e with orange dashed lines. The total average time (measurement time plus decision time) to reach the stopping criterion was 237 s, compared with 554 s to measure the complete current map by grid scan, reducing the time needed by a factor of between 1.84 and 3.70 across all 10 test cases. A more sophisticated stopping criterion utilising the number of remaining unmeasured current maps and a total measurement budget is presented in Methods.
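Once the per-reconstruction error estimates are available, the criterion itself is a short comparison of \(\beta\) against \(\alpha=1/N\) (the toy numbers in the usage note are ours):

```python
import numpy as np

def should_stop(r_now, r_next, delta, n_total):
    """Stop when even the slowest-improving reconstruction estimate
    falls below the grid-scan rate alpha = 1/N.

    r_now, r_next: per-reconstruction error estimates at n and n + delta.
    """
    beta = np.min(np.abs(np.asarray(r_next) - np.asarray(r_now))) / delta
    alpha = 1.0 / n_total
    return beta < alpha
```

For \(N=16,384\) and \(\Delta=1024\): an estimated error drop of 0.2 over the batch gives \(\beta\approx 2\times 10^{-4}>\alpha\) (keep measuring), while a drop of 0.01 gives \(\beta\approx 10^{-5}<\alpha\) (stop and move on).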

Generalising the algorithm

The algorithm described here does not require assumptions about the physics of the acquired data, such as requiring that it show Coulomb diamonds. Provided that training data are available, it should also work for other kinds of measurements. To test this, we applied it to a different current map configuration also encountered in quantum dot tuning. In this case the current flowing through the device is measured as a function of two gate voltages (\({V}_{1}\) and \({V}_{2}\)), while keeping other voltages fixed (\({V}_{\text{G}}\), \({V}_{\text{bias}}\), \({V}_{3}\) and \({V}_{4}\)). In these current maps, Coulomb blockade leads to large areas where the current scarcely changes, with diagonal features of allowed current. For the training set, we measured 382 current maps with a resolution of \(251\times 251\) which we randomly cropped to a resolution of \(128\times 128\) and subjected to simple image augmentation techniques (as for the previous training set).

We tested the performance of the algorithm in this new scenario by taking two different combinations of \({V}_{\text{G}}\), \({V}_{\text{bias}}\), \({V}_{3}\) and \({V}_{4}\) and measuring the corresponding current maps in real time (Fig. 5). The device was thermally cycled after the training set was acquired and also between the acquisition of the two current maps in Fig. 5. The algorithm focuses on measuring regions of high current gradient, the corner edges and, in particular, the Coulomb peaks close to these.

Fig. 5
figure 5

Measuring a different current map. a Sequential batch measurement. Each row displays the algorithm-assisted measurements of a current map as a function of \({V}_{1}\) and \({V}_{2}\) for different values of \(n\). The last plot in each row is the full-resolution current map. b, d Current gradient map for both examples in (a). c, e Measure of the algorithm’s performance \(r(n)\), average real-time estimate of \(r(n)\) with 90% credible interval, and optimal \(r(n)\) for both current maps in (a). The black line is the value of \(r(n)\) corresponding to the alternating grid scan method. The dashed orange line indicates the value of \(n\) determined by the stopping criterion. The corresponding current map in a is highlighted in orange. The alternating grid scan took 2267 s and 2333 s to acquire all measurements in the two cases. The batch method took 673 s and 1552 s to reach the stopping criterion

In the top row of Fig. 5a, \(n=4096\) was chosen by the stopping criterion. In the bottom row, the corner edges extended further across the current map and the stopping criterion chose \(n=8192\). Compared with the alternating grid scans, this reduced the time needed to measure the current maps by factors of 3.36 and 1.50, respectively, for the two test cases.


The proposed measurement algorithm makes real-time informed decisions about which measurements to perform next on a single quantum dot device. Decisions are based on the disagreement between competing reconstructions built from sparse measurements. The algorithm outperforms the grid scan in all cases, and in the majority of cases shows nearly optimal performance. The algorithm reduced the time required to observe the regions of finite current gradient by factors ranging from 1.5 to 3.7. Optimisation of batch sizes or a variable scan resolution might reduce this time further; however, the performance gain is limited by the spread of the information gain over the scan range. This is evidenced in both Fig. 4c, e and Fig. 5c, e, where we show that even an optimal algorithm does not significantly outperform ours.

Our algorithm can be re-trained, without modification, to measure different types of current maps. It simply requires a diverse data set of training examples from which to learn. The decision algorithm performed well even when trained on a small data set of only 382 current maps (at a resolution of \(251\times 251\)), implying that it is robust to limited training data. Our algorithm focuses on observing all informative regions present in the current map, making it generalisable to different types of measurements and devices. The acquisition function can also be designed to focus on specific transport features, such as Coulomb peaks or Coulomb diamond edges. In additional experiments, we demonstrate how this can be achieved by applying additional transformations to the reconstructions (see Supplementary Section Context-aware decision for stability diagrams).

We believe that our algorithm represents a significant first step in automating what to measure next in quantum devices. For a single quantum dot it provides a means of accelerating what can currently be achieved by human experimenters and other automation methods. When provided with an appropriate training data set our algorithm can be applied to a large variety of experiments. In particular, in any conventional qubit tuning method for which time-consuming grid scans are performed, our algorithm would allow for an improvement in measurement efficiency. It will not be long before this kind of approach enables experiments to be performed, and technology to be developed, that would not be feasible otherwise.


Distribution of reconstructions and sampling

Since deep generative models are known to work well when the data range from −1 to 1, all measurements are rescaled so that the maximum absolute value of the initial measurements is 1. Let \(Y\) be a random vector containing all pixel values. The observation \({Y}_{n}\), where \(n\ge 1\), is the set of pairs of locations \({x}_{j}\) and measurements \({y}_{j}\): \({Y}_{n}=\{({x}_{j},{y}_{j})|j=1,\ldots ,n\}\). We also define a subset of measurements: \({Y}_{n:n^{\prime} }=\{({x}_{j},{y}_{j})|j=n,\ldots ,n^{\prime} \}\). The likelihood of observations given \(Y\) is defined by

$$p({Y}_{n}|Y)\propto \exp (-\lambda {\Sigma }_{(x,y)\in {Y}_{n}}|y-Y(x)|),$$

where \(Y(x)\) is the pixel value of \(Y\) at \(x\), and \(\lambda\) is a free parameter that determines the sensitivity to the distance metric and is set to 1.0 for all experiments in this paper. The posterior probability distribution is defined by Bayes’ rule:

$$p(Y|{Y}_{n})\propto p({Y}_{n}|Y)\ p(Y).$$
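Over a discrete set of candidate reconstructions with a uniform prior, this Bayes update reduces to weighting each reconstruction by the exponential likelihood of the measured pixels. A minimal sketch, with \(\lambda =1\) as in the experiments (the log-domain normalisation is our numerical-stability choice):

```python
import numpy as np

def posterior_weights(recons, locations, values, lam=1.0):
    """P(m) proportional to exp(-lam * sum_j |y_j - Y_hat_m(x_j)|),
    over a uniform prior on the M candidate reconstructions."""
    rows, cols = zip(*locations)
    resid = np.abs(recons[:, rows, cols] - np.asarray(values))  # (M, n)
    log_w = -lam * resid.sum(axis=1)
    w = np.exp(log_w - log_w.max())     # subtract max for numerical stability
    return w / w.sum()
```

Reconstructions that match the measured pixels closely receive correspondingly larger posterior weight.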

Likewise, we can find the posterior distribution of \({\bf{z}}\) given measurements instead of \(Y\). Let \({\bf{z}}^{\prime}\) denote another input of the decoder, which is set to \({Y}_{64}\) in the experiments. Then the posterior distribution of \({\bf{z}}\) can be expressed with \({\bf{z}}^{\prime}\) when \(n\ge 64\):

$$\begin{array}{ccc}p({\bf{z}}|{Y}_{n},{\bf{z}}^{\prime} )&\propto &p({\bf{z}}|{\bf{z}}^{\prime} )\ p({Y}_{n}|{\bf{z}},{\bf{z}}^{\prime} )\\ &\propto &p({\bf{z}}){\int }_{Y}p({Y}_{n}|Y)\ p(Y|{\bf{z}},{\bf{z}}^{\prime} )\ dY\\ &\propto &p({\bf{z}})\ p({Y}_{n}|Y={\hat{Y}}_{{\bf{z}}}),\end{array}$$

where \({\hat{Y}}_{{\bf{z}}}\) is the reconstruction produced by the decoder given \({\bf{z}}\) and \({\bf{z}}^{\prime}\). Since all inputs of the decoder are given, \(p(Y|{\bf{z}},{\bf{z}}^{\prime} )\) is the Dirac delta function centred at \({\hat{Y}}_{{\bf{z}}}\). Also, \(p({\bf{z}}|{\bf{z}}^{\prime} )=p({\bf{z}})\), as \({\bf{z}}\) and \({\bf{z}}^{\prime}\) are assumed independent. The proposal distribution for MH is a multivariate normal distribution centred on the current sample, with a covariance matrix equal to one quarter of the identity matrix. For the experiments in this paper, 400 MCMC iterations are conducted when \(n=32\times {2}^{b}\), where \(b\) is any integer greater than or equal to 1. We found that 400 iterations result in good posterior samples. If \(({x}_{n+1},{y}_{n+1})\) is newly observed, then the posterior can be updated incrementally:

$$\begin{array}{ccc}p({\bf{z}}|{Y}_{n+1},{\bf{z}}^{\prime} )&=&\frac{p({x}_{n+1},{y}_{n+1}|{\bf{z}},{\bf{z}}^{\prime} )}{p({x}_{n+1},{y}_{n+1}|{Y}_{n},{\bf{z}}^{\prime} )}\ p({\bf{z}}|{Y}_{n},{\bf{z}}^{\prime} )\\ &=&\frac{p({x}_{n+1},{y}_{n+1}|{\hat{Y}}_{{\bf{z}}})}{p({x}_{n+1},{y}_{n+1}|{Y}_{n},{\bf{z}}^{\prime} )}\ p({\bf{z}}|{Y}_{n},{\bf{z}}^{\prime} ),\end{array}$$

because each term in (4) can be separated.
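A minimal random-walk MH sampler with these settings can be sketched as follows. The decoder here is a stand-in linear map (the real decoder is the trained network), and the unnormalised log-posterior combines a standard-normal prior on \({\bf{z}}\) with the exponential likelihood defined above; all names are ours:

```python
import numpy as np

def mh_sample_z(decoder, y_obs, obs_idx, dim_z=8, n_iter=400, lam=1.0, seed=0):
    """Random-walk Metropolis-Hastings over the latent z.

    Proposal: N(z_current, 0.25 * I), as in the text.
    Target: p(z) * p(Y_n | Y_hat_z), up to normalisation.
    """
    rng = np.random.default_rng(seed)

    def log_post(z):
        y_hat = decoder(z)
        return -0.5 * z @ z - lam * np.abs(y_obs - y_hat[obs_idx]).sum()

    z = rng.normal(size=dim_z)
    lp = log_post(z)
    samples = []
    for _ in range(n_iter):
        z_prop = z + rng.normal(scale=0.5, size=dim_z)   # covariance 0.25 * I
        lp_prop = log_post(z_prop)
        if np.log(rng.uniform()) < lp_prop - lp:          # accept/reject step
            z, lp = z_prop, lp_prop
        samples.append(z.copy())
    return np.array(samples)

# stand-in 'decoder': a fixed random linear map from z to a 16-pixel map
W = np.random.default_rng(42).normal(size=(16, 8))
samples = mh_sample_z(lambda z: W @ z, y_obs=np.zeros(4), obs_idx=np.arange(4))
```

The returned chain of \({\bf{z}}\) samples would then be decoded into the reconstructions \({\hat{Y}}_{m}\) used by the decision algorithm.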

Decision algorithm

In this section, we derive a computationally simple form of the information gain and show that maximising the information gain is equivalent to minimising the entropy. Let \({p}_{n}(\cdot )=p(\cdot |{Y}_{n},{\bf{z}}^{\prime} )\); every probabilistic quantity involving \({y}_{n+1}\) is implicitly conditioned on \({x}_{n+1}\), which we omit for brevity.

The continuous version of the information gain equation is

$$\begin{array}{ccc}&&{{\mathbb{E}}}_{{y}_{n+1}}[{\bf{KL}}({p}_{n}({\bf{z}}|{y}_{n+1})\parallel {p}_{n}({\bf{z}}))]\\ &=&\int _{{y}_{n+1}}{p}_{n}({y}_{n+1}){\bf{KL}}({p}_{n}({\bf{z}}|{y}_{n+1})\parallel {p}_{n}({\bf{z}}))d{y}_{n+1}\\ &=&\int _{{y}_{n+1}}{p}_{n}({y}_{n+1})\int _{{\bf{z}}}{p}_{n}({\bf{z}}|{y}_{n+1})\mathrm{log}\frac{{p}_{n}({\bf{z}}|{y}_{n+1})}{{p}_{n}({\bf{z}})}d{\bf{z}}d{y}_{n+1}\\ &=&\int _{{y}_{n+1}}\int_{\bf{z}} {p}_{n}({\bf{z}},{y}_{n+1})\mathrm{log}\frac{{p}_{n}({\bf{z}},{y}_{n+1})}{{p}_{n}({\bf{z}}){p}_{n}({y}_{n+1})}d{\bf{z}}d{y}_{n+1}\\ &=&I({\bf{z}}|{Y}_{n};\ {y}_{n+1}|{Y}_{n}),\end{array}$$

where \({\bf{KL}}\) is the Kullback–Leibler divergence and \(I(\cdot ;\cdot )\) is the mutual information. Since \(I({\bf{z}}|{Y}_{n};\ {y}_{n+1}|{Y}_{n})={\rm{H}}({\bf{z}}|{Y}_{n})-{\rm{H}}({\bf{z}}|{Y}_{n},{y}_{n+1})\), maximising the expected KL divergence is equivalent to minimising \({\rm{H}}({\bf{z}}|{Y}_{n},{y}_{n+1})\), the entropy of \({\bf{z}}\) after observing \({y}_{n+1}\).

Since this integral is hard to compute, we approximate the probability density functions (PDFs) with samples and substitute them into (6). Let \({n}_{s}\) denote the number of measurements used for sampling the latent vectors \({\hat{{\bf{z}}}}_{1},\ldots ,{\hat{{\bf{z}}}}_{M}\) (the samples are decoded into the reconstructions \({\hat{Y}}_{1},\ldots ,{\hat{Y}}_{M}\)). Then \({p}_{{n}_{s}}({\bf{z}})\approx \frac{1}{M}{\sum }_{m}{\delta }_{{\hat{{\bf{z}}}}_{m}}({\bf{z}})\), or, in terms of the sample index \(m\), \({P}_{{n}_{s}}(m)=1/M\). For any \(n\ge {n}_{s}\), the probability is updated with the measurements acquired after \({n}_{s}\): \({P}_{n}(m;{n}_{s})=\frac{p({Y}_{{n}_{s}+1:n}|{\hat{Y}}_{m})}{{\Sigma }_{m}p({Y}_{{n}_{s}+1:n}|{\hat{Y}}_{m})}\), which can be derived from importance sampling. For brevity, the sampling-distribution index \({n}_{s}\) is omitted for the rest of this section. Likewise, \({p}_{n}({y}_{n+1})={\int }_{{\bf{z}}}{p}_{n}({y}_{n+1}|{\bf{z}})\ {p}_{n}({\bf{z}})\ d{\bf{z}}\approx {\sum }_{m}{P}_{n}(m)\ {p}_{n}({y}_{n+1}|{{\bf{z}}}_{m})\). Lastly, for simple and efficient computation, we use the value of \({\hat{Y}}_{m}\) at \({x}_{n+1}\) as a sample of \({p}_{n}({y}_{n+1}|{{\bf{z}}}_{m})\). As a result, the information gain is approximated, up to a constant \(c\), by:

$$\begin{array}{ccc}&&{{\mathbb{E}}}_{{y}_{n+1}}\left[\right.{\bf{KL}}({p}_{n}({\bf{z}}|{y}_{n+1})\ \parallel \ {p}_{n}({\bf{z}}))\left]\right.\\ &&\approx \sum\limits_{m}{P}_{n}(m)\ {\bf{KL}}({P}_{n+1}\ \parallel \ {P}_{n}) + c.\end{array}$$
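A minimal sketch of this sample-based acquisition map, assuming an i.i.d. Gaussian measurement likelihood with noise scale `sigma` (a hypothetical parameter; the paper's likelihood model may differ):

```python
import numpy as np

def information_gain_map(recons, weights, sigma=0.1):
    """Approximate the expected KL(P_{n+1} || P_n) at every candidate pixel.

    recons : (M, H, W) array of candidate full-resolution reconstructions.
    weights: (M,) current posterior weights P_n(m), summing to 1.
    sigma  : assumed Gaussian measurement-noise scale.
    Returns an (H, W) acquisition map (up to the constant c in the text).
    """
    gain = np.zeros(recons.shape[1:])
    log_w = np.log(weights)[:, None, None]
    for m in range(len(recons)):
        # Hypothesise that the new measurement equals reconstruction m's
        # value at each pixel, then reweight every sample with a Gaussian
        # likelihood to obtain the hypothetical posterior P_{n+1}.
        log_post = log_w - 0.5 * ((recons - recons[m]) / sigma) ** 2
        log_post -= log_post.max(axis=0)            # numerical stabilisation
        post = np.exp(log_post)
        post /= post.sum(axis=0)                    # P_{n+1}(.) per pixel
        kl = (post * (np.log(post + 1e-12) - log_w)).sum(axis=0)
        gain += weights[m] * kl                     # outer expectation over y
    return gain
```

Pixels where the candidate reconstructions agree yield zero gain; pixels where they disagree relative to the noise scale yield a gain of up to \(\mathrm{log}\ M\), so the map naturally prioritises uncertain regions.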

Simulator for training data

To aid the training of the model and prevent over-fitting, simulated training data was used. The simulated data was produced via a simple implementation of the constant interaction model29, together with basic data augmentation techniques. These techniques were not intended to be physically accurate, but rather to quickly produce a diverse set of examples containing features that mimic real data.

The constant interaction model assumes that all interactions felt by an electron confined within the dot can be captured by a single constant capacitance \({C}_{\Sigma }\), given by \({C}_{\Sigma }={C}_{\text{S}}+{C}_{\text{D}}+{C}_{\text{G}}\), where \({C}_{\text{S}}\), \({C}_{\text{D}}\), and \({C}_{\text{G}}\) are the capacitances to the source, drain, and gate, respectively. Under this assumption, the total energy of the dot \(U(N)\), where \(N\) is the number of electrons occupying the dot, is \(U(N)=\frac{{(-|e|(N-{N}_{0})+{C}_{\text{S}}{V}_{\text{S}}+{C}_{\text{D}}{V}_{\text{D}}+{C}_{\text{G}}{V}_{\text{G}})}^{2}}{2{C}_{\Sigma }}+\mathop{\sum }\nolimits_{n=1}^{N}{E}_{n}\), where \({N}_{0}\) compensates for the background charge and the \({E}_{n}\) are the occupied single-electron energy levels, which are characterised by the confinement potential.

Using this, we derive the electrochemical potential \(\mu (N)=U(N)-U(N-1)=\frac{{e}^{2}}{{C}_{\Sigma }}(N-{N}_{0}-\frac{1}{2})-\frac{|e|}{{C}_{\Sigma }}({V}_{\text{S}}{C}_{\text{S}}+{V}_{\text{D}}{C}_{\text{D}}+{V}_{\text{G}}{C}_{\text{G}})+{E}_{N}\), where \({E}_{N}\) is the highest occupied single-electron level.

To produce a training example, random values are generated for \({C}_{\text{S}}\), \({C}_{\text{D}}\), and \({C}_{\text{G}}\). The energy levels within a randomly generated gate-voltage window and source–drain bias window are then counted. To aid generalisation to real data, we randomly generate energy-level transitions (which are also counted) and slightly scale \({C}_{\Sigma }\), \({C}_{\text{S}}\), \({C}_{\text{D}}\), and \({C}_{\text{G}}\) linearly with \(N\). This linear scaling is also randomly generated and produces diamonds that vary in size with respect to \({V}_{\text{G}}\). Examples of the training data produced by this simulator can be seen in Supplementary Fig. 1.
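A toy version of such a simulator can be sketched as follows; the voltage ranges, capacitance distributions, and transport condition (current flows when some \(\mu (N)\) lies inside the bias window) are illustrative assumptions rather than the exact implementation:

```python
import numpy as np

def simulate_diamonds(n_vg=64, n_vb=64, n_levels=30, rng=None):
    """Toy Coulomb-diamond generator from the constant interaction model.

    A pixel carries current when some electrochemical potential mu(N)
    lies inside the source-drain bias window (here V_D = 0, V_S = v_b,
    and N_0 = 0). All ranges and distributions are illustrative choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    e = 1.0                                      # work in units of |e|
    c_s, c_d, c_g = rng.uniform(0.2, 1.0, 3)     # random capacitances
    c_sigma = c_s + c_d + c_g
    levels = np.cumsum(rng.uniform(0.01, 0.05, n_levels))  # E_1 < E_2 < ...

    vg, vb = np.meshgrid(np.linspace(0.0, 5.0, n_vg),
                         np.linspace(-0.5, 0.5, n_vb))
    lo = np.minimum(0.0, -e * vb)                # bias-window edges
    hi = np.maximum(0.0, -e * vb)

    current = np.zeros_like(vg, dtype=bool)
    for n in range(1, n_levels + 1):
        # mu(N) from the constant interaction model, V_S = vb, V_D = 0
        mu = (e**2 / c_sigma) * (n - 0.5) \
             - (e / c_sigma) * (vb * c_s + vg * c_g) + levels[n - 1]
        current |= (mu >= lo) & (mu <= hi)
    return current.astype(float)
```

The random level transitions and \(N\)-dependent capacitance scaling described above would be added on top of this basic generator.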

Stopping criterion

Utility, denoted by \(u\), is the ratio of the total measured gradient to the total gradient of a stability diagram: \(u(n)=1.0-r(n)\). Here, we assume that \(K\) more stability diagrams remain to be measured. The location of each diagram is defined by a different voltage range, and \(k=0,\ldots ,K\) indexes the diagrams, with \(k=0\) denoting the diagram currently being measured.

Let \(T\) denote the total measurement budget for the current and remaining stability diagrams. In this paper, we assume a unit budget of 1.0 for measuring one pixel. The total utility is

$$\begin{array}{l}{u}_{\text{tot}}=\mathop{\sum }\limits_{k=0}^{K}{u}_{k}({t}_{k}) ={u}_{0}({t}_{0})+{u}_{\text{nxt}}(T-{t}_{0}),\end{array}$$

where \({u}_{k}(\cdot )\) is the utility from measuring \(k\)th diagram, \({t}_{k}\) is the planned budget for \(k\)th diagram satisfying \({\sum }_{k=0}^{K}{t}_{k}=T\), and \({u}_{\text{nxt}}(T-{t}_{0})={\sum }_{k=1}^{K}{u}_{k}({t}_{k})\).

Let \(t\) denote the budget already spent on the current diagram, \(t\le {t}_{0}\). If we stop the measurement then \({t}_{0}=t\), whereas \({t}_{0}=t+\Delta\) if we decide to continue, where \(\Delta\) is a predefined batch size. For the decision, the utilities of the two cases are compared: when \({t}_{0}=t\),

$${u}_{\text{tot}}={u}_{0}(t)+{u}_{\text{nxt}}(T-t).$$
Otherwise, \({t}_{0}=t+\Delta\) and

$${u}_{\text{tot}}={u}_{0}(t+\Delta )+{u}_{\text{nxt}}(T-(t+\Delta )).$$

If (8) \(<\) (7), it is better to stop and move to the next voltage range. Rearranging the inequality leads to

$${u}_{0}(t+\Delta )-{u}_{0}(t)\,{<}\,{u}_{\text{nxt}}(T-t)-{u}_{\text{nxt}}(T-(t+\Delta )).$$

The left-hand-side (lhs) of (9) is the gain in utility from investing \(\Delta\) more budget in the current diagram, and the right-hand-side the gain when \(\Delta\) more budget is used for the remaining diagrams. As discussed in the Results section, we can calculate multiple slope estimates \({\beta }_{m}\) for spending \(\Delta\) on the current diagram: \({u}_{0}(t+\Delta )-{u}_{0}(t)\approx {\beta }_{m}\Delta\).

The right-hand-side (rhs) of (9) can be approximated by \(\alpha \Delta\) if \(K=\infty\), where \(\alpha =1/16,384\) is the slope of a grid scan measuring a new stability diagram. Note that \(\alpha\) can be considered the empirical worst-case performance of the decision algorithm on a new diagram, as it holds for all the experiments we have conducted. If \(\Delta =N\), this approximation is exact for any algorithm, as all algorithms satisfy \(r(0)=1.0\) and \(r(N)=0.0\). Since \(\alpha\) can be interpreted as a worst-case estimate, we also approximate the lhs of (9) with the worst-case estimate \(\beta =\mathop{\min }\nolimits_{m}{\beta }_{m}\).

If \(K\,{<}\,\infty\) and the remaining budget \(T-t\) exceeds the budget needed to measure all of the remaining diagrams, there is no utility to be gained after all measurements are finished. Hence, the approximation is capped:

$${u}_{\text{nxt}}(T-t)=\alpha \min (T-t,N\,\times K),$$

where \(K\) is the number of remaining diagrams to be measured.

As a result, the stopping criterion when \(K=\infty\) is

$$\beta \,{<}\,\alpha .$$

The stopping criterion when \(K\,{<}\,\infty\) is

$$\beta \,{<}\,\frac{\alpha (\min (T-t,N\,\times K)-\min (T-(t+\Delta ),N\,\times K))}{\Delta }.$$

The rhs of (12) is always less than or equal to \(\alpha\), and a larger total budget \(T\) makes it smaller, leading to later stopping or no stopping.
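The resulting decision rule, combining the two criteria above, can be sketched as:

```python
def should_stop(beta, alpha, T, t, delta, N, K=None):
    """Return True if measurement of the current diagram should stop.

    beta  : worst-case slope estimate min_m beta_m for the current diagram.
    alpha : empirical worst-case slope of measuring a new diagram
            (1/16,384 in the text).
    T, t  : total and already-spent budget; delta is the batch size.
    N     : budget to fully measure one diagram; K is the number of
            remaining diagrams (None means K = infinity).
    """
    if K is None:                          # criterion (11): beta < alpha
        return beta < alpha
    # criterion (12): cap the remaining utility at N * K
    capped_now = min(T - t, N * K)
    capped_next = min(T - (t + delta), N * K)
    return beta < alpha * (capped_now - capped_next) / delta
```

With a very large budget \(T\) and finite \(K\), both capped terms saturate at \(N\times K\), the rhs becomes zero, and the rule never stops, as noted above.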