Efficiently measuring a quantum device using machine learning

Scalable quantum technologies will present challenges for characterizing and tuning quantum devices. This is a time-consuming activity, and as the size of quantum systems increases, this task will become intractable without the aid of automation. We present measurements on a quantum dot device performed by a machine learning algorithm. The algorithm selects the most informative measurements to perform next using information theory and a probabilistic deep-generative model, the latter capable of generating multiple full-resolution reconstructions from scattered partial measurements. We demonstrate, for two different measurement configurations, that the algorithm outperforms standard grid scan techniques, reducing the number of measurements required by up to 4 times and the measurement time by 3.7 times. Our contribution goes beyond the use of machine learning for data search and analysis, and instead presents the use of algorithms to automate measurement. This work lays the foundation for automated control of large quantum circuits.

Semiconductor quantum devices hold great promise for scalable quantum computation. In particular, individual electron spins in quantum dot devices have shown long spin coherence times with respect to typical gate operation times, high fidelities, all-electrical control, and good prospects for scalability and integration with standard semiconductor technologies [1].
A crucial challenge of scaling spin qubits in quantum dots is that electrostatic confinement potentials vary widely among devices and even in time, predominantly due to charge traps. The characterisation of such devices, which implies measurements of current or conductance at different applied biases and gate voltages, can be very time-consuming. It would normally be carried out following simple scripts such as a grid scan, which is inefficient and slow. Optimising this process involves selecting those biases and gate voltages for which measurements are more informative. An optimised method for device measurement is key to automating device tuning. Current efforts towards automating quantum dot tuning are based on grid scans and require several hours to execute, lack generality across devices, and/or require manual input [2][3][4].
In this paper, we present an algorithm that performs efficient real-time data acquisition for a quantum dot device. It starts from a low-resolution uniform grid of measurements, creates a set of full-resolution reconstructions, calculates the predicted information gain (or acquisition map), and selects the measurements which will give the maximum information gain (Fig. 1a). This process is iterated until the information gain from new measurements is marginal.
Such an information-theoretic criterion for selecting measurements is based on an uncertainty measure of random variables [5][6][7], and hence a probabilistic model is required on the unobserved variables. Rather than using a Gaussian process [8], we use a conditional variational auto-encoder (CVAE) [9], which is capable of generating high-resolution reconstructions given partial information and is fast enough for real-time decisions. In spite of their suitability, these models have not previously been applied to efficient data acquisition. Deep generative models such as the variational auto-encoder (VAE) [10] and generative adversarial networks (GAN) [11] have shown great success in generating complex non-stationary patterns of data and multi-modal distributions [12,13]. Deep generative models are used for: speech synthesis [14]; generating images of digits and human faces [15]; transferring image style [16,17]; and inpainting missing regions of images [18]. Recently, VAE models have been used in scientific research to optimise molecular structures [19][20][21][22]. An advantage of deep generative models over simple interpolation techniques, such as nearest-neighbour and bilinear interpolation, is that deep generative models can learn likely patterns from training data and utilise such patterns to make reconstructions.
Our device is a laterally defined quantum dot fabricated by patterning Ti/Au gates over a GaAs/AlGaAs heterostructure containing a two-dimensional electron gas (Fig. 1b). In this device, electrons are subject to the confinement potential created electrostatically by gate voltages. Gate voltages $V_1$ to $V_4$ tune the tunnelling rates while $V_G$ mainly shifts the electrical potential of the dot level. The current through the device is determined both by these gate voltages and by the bias voltage $V_\mathrm{bias}$. Our measurements were performed at 30 mK. In Fig. 1c we show a current map as a function of $V_G$ and $V_\mathrm{bias}$ for fixed values of $V_1$ to $V_4$. Diamond-shaped regions, or 'Coulomb diamonds', correspond to Coulomb blockade, where electron tunnelling is suppressed [23]. Most current maps have large areas in which the current is almost constant, and consequently measurements in these regions slow down informative data acquisition dramatically. The device current gradient is the derivative of the current with respect to bias and gate voltages, and therefore regions of high current gradient are typically very informative for device characterization. Our algorithm gives measurement priority to the informative regions of the current map, which leads to measurements that concentrate in regions of high current gradient. An overview of an algorithm-assisted measurement of a current map is shown in Fig. 1d.
FIG. 1. Overview of the algorithm and the quantum dot device. (a) Schematic of the algorithm's operation. Low-resolution measurements (i) are used to produce reconstructions (ii), which are used to infer the predicted information gain acquisition map (iii). Based on this map, the algorithm chooses the location of the next measurement (iv). The process is repeated until a stopping criterion is met. (b) Schematic of the device. A bias voltage $V_\mathrm{bias}$ is applied between ohmic contacts to the two-dimensional electron gas. We apply gate voltages labelled $V_1$ to $V_4$ and $V_G$. (c) A measured current map as a function of $V_\mathrm{bias}$ and $V_G$. The Coulomb diamonds are the white regions where electron transport is suppressed, and most of the information necessary to characterize a device is contained just outside these regions. (d) Sequential decision algorithm in (a) illustrated with an example of a specific current map. In panel (iv), unmeasured pixels are plotted black; however, initial measurements (i) are represented so as to fill the entire panel (that is, the sparse grid of measurements is represented as a low-resolution image).

RECONSTRUCTION MODEL AND TRAINING
The role of the reconstruction model is to characterise likely patterns in a training dataset, derived from a mixture of measured and simulated current maps. We can utilise these likely patterns to predict the unmeasured signals from given partial measurements.
Deep generative models represent this pattern characterisation in a low-dimensional real-valued latent vector z, which can be decoded to produce a full-resolution reconstruction. The latent space representation and the decoding are learned during training. The CVAE that we use consists of two convolutional neural networks, an encoder and a decoder. The encoder is trained to map full-resolution training examples of current maps Y to the latent space representation z.
The encoder also enforces that the distribution $p(z)$ of training examples in latent space is Gaussian. The decoder is trained to reconstruct $Y$ from the representation $z$ combined with an 8 × 8 subsample of $Y$. As a result, $z$ attempts to represent all the information that is missing from the subsampled data. The chosen loss function, which the CVAE attempts to minimise, is a measure of the difference between the training data and the corresponding reconstruction. To avoid blurry reconstructions, we define a contextual loss function that incorporates both pixel-by-pixel differences and higher-order differences such as edges, corners, and shapes. A detailed description of these networks and their training can be found in the Supplementary Information.
The model is trained using both simulated and measured current maps. We choose to work with current maps of resolution 128 × 128. The simulation is based on a constant-interaction model (see Methods). To measure the current maps for training, we set the bias and gate voltages ranges randomly from a uniform distribution. The training dataset consists of 25,000 simulations and 25,000 real examples generated by randomly cropping 750 measured current maps. The current maps were subjected to random offsets, rescaling, and added noise to increase the variability of the training set.

GENERATING RECONSTRUCTIONS FROM PARTIAL DATA
The trained decoder network is now used in the algorithm of Fig. 1a to reconstruct full-resolution current maps from partial data. At each stage, the known partial current map is denoted $Y_n$, where $n \le 128^2 = 16{,}384$ is the number of measured pixels. To generate a reconstruction, the decoder takes as input the initial 8 × 8 grid scan $Y_{64}$, together with a latent vector $z$ sampled from the posterior distribution $p(z \mid Y_n)$. The latent space of $z$ and the prior probability $p(z)$ are constructed by the CVAE during training, but the posterior distribution takes account of all $n$ measurements (for details, see Methods). The posterior samples are drawn from $p(z \mid Y_n)$ by the Metropolis-Hastings (MH) method, each iteration of which moves the previous samples towards $p(z \mid Y_n)$; more iterations yield samples that represent $p(z \mid Y_n)$ more faithfully. The samples of $z$ are then decoded into reconstructions $\hat{Y}_m$, where $m = 1, \dots, 100$ is the reconstruction index. The continuous posterior $p(z \mid Y_n)$ is then approximated by a discrete posterior over the samples, $P_n(m)$, which denotes how probable $\hat{Y}_m$ is. We refer to $P_n(m)$ as the posterior distribution of reconstructions.
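The sampling step can be illustrated with a minimal random-walk Metropolis-Hastings sketch. It is not the implementation used for the experiments: the target `toy_log_posterior` is a hypothetical stand-in for the log of the posterior over $z$ (in practice it would combine the prior and the likelihood evaluated through the decoder), and the proposal is assumed to be a Gaussian step centred on the current sample, with a standard deviation of 0.5 in each dimension so that its covariance is one quarter of the identity, as stated in Methods.

```python
import math
import random

def metropolis_hastings(log_post, z0, n_iter, step_std=0.5, rng=None):
    """Random-walk Metropolis-Hastings; returns the list of visited states."""
    rng = rng or random.Random(0)
    z = list(z0)
    lp = log_post(z)
    chain = []
    for _ in range(n_iter):
        # Gaussian proposal centred on the current state (assumption);
        # step_std = 0.5 gives a covariance of one quarter of the identity.
        cand = [zi + rng.gauss(0.0, step_std) for zi in z]
        lp_cand = log_post(cand)
        # Accept with probability min(1, p(cand) / p(z)).
        if math.log(rng.random() + 1e-300) < lp_cand - lp:
            z, lp = cand, lp_cand
        chain.append(list(z))
    return chain

def toy_log_posterior(z):
    # Hypothetical target: an isotropic Gaussian centred at 3 in each dimension.
    return -0.5 * sum((zi - 3.0) ** 2 for zi in z)

chain = metropolis_hastings(toy_log_posterior, [0.0, 0.0], 5000)
tail = chain[2500:]  # discard the first half as burn-in
mean = [sum(s[d] for s in tail) / len(tail) for d in range(2)]
```

Starting far from the target, the chain drifts towards the high-probability region, which is the sense in which more iterations give better samples.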

SEQUENTIAL MEASUREMENT DECISION
With each iteration of the decision algorithm, an acquisition map is computed from the accumulated partial measurements and the resulting reconstructions. The purpose of this acquisition map is to indicate how informative potential measurement locations are for the posterior distribution of reconstructions (Fig. 2). The $(n+1)$th measurement, whose result $y_{n+1}$ is one pixel taken from the true current map, changes our posterior distribution from $P_n(m)$ to $P_{n+1}(m)$, rendering different reconstructions more or less probable.
The acquisition map is the expected information gain $\mathrm{IG}(x)$ at each potential measurement location $x$. Our algorithm calculates it by a weighted sum over reconstructions:
$$\mathrm{IG}(x) = \sum_m P_n(m)\, \mathrm{IG}_m(x),$$
where $\mathrm{IG}_m(x)$ is the Kullback-Leibler divergence between the distributions $P_n$ and $P_{n+1}$, calculated under the assumption that $y_{n+1}$ is taken at location $x$ from reconstruction $\hat{Y}_m$. The most informative point is
$$x_{n+1} = \arg\max_x \mathrm{IG}(x).$$
This criterion is equivalent both to minimising the expected information entropy of the posterior distribution and to Bayesian active learning by disagreement (BALD) [5] (see Methods).
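A small sketch may help to make the acquisition map concrete. It assumes a likelihood that decays exponentially with the absolute difference between a hypothetical measurement and each reconstruction, with sensitivity `lam`, and it takes the Kullback-Leibler divergence from the updated posterior to the current one. The function names and the three-pixel toy reconstructions are illustrative only.

```python
import math

def posterior_update(prior, recons, x, y, lam=1.0):
    """Reweight reconstructions after a hypothetical measurement y at pixel x."""
    w = [p * math.exp(-lam * abs(y - r[x])) for p, r in zip(prior, recons)]
    total = sum(w)
    return [wi / total for wi in w]

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def acquisition_map(prior, recons, lam=1.0):
    """Expected information gain at every pixel, averaged over reconstructions."""
    ig = []
    for x in range(len(recons[0])):
        total = 0.0
        for m, r in enumerate(recons):
            # Pretend the next measurement returns the value predicted by
            # reconstruction m, and see how much the posterior would change.
            post = posterior_update(prior, recons, x, r[x], lam)
            total += prior[m] * kl(post, prior)
        ig.append(total)
    return ig

# Three toy reconstructions of a 3-pixel map; they disagree only at pixel 2.
recons = [[0.0, 0.2, 1.0], [0.0, 0.2, -1.0], [0.0, 0.2, 1.0]]
prior = [0.25, 0.5, 0.25]
ig = acquisition_map(prior, recons)
best = max(range(len(ig)), key=ig.__getitem__)
```

Pixels on which all reconstructions agree carry no expected information, so the selected location is the one where the reconstructions disagree.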
We devised two methods to make decisions based on the acquisition map: a pixel-wise method and a batch method. The pixel-wise method selects the single best location in the acquisition map. This method is not optimal in terms of measurement time, as it might collect data from locations that require a large gate voltage ramp. The ramp rate is limited by the measurement electronics and the device settling time. The batch method selects multiple locations from the acquisition map, and then acquires the selected measurements taking into account the distance between locations, thus reducing the measurement time compared with the pixel-wise method.
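One way to picture the batch method is sketched below. The text does not specify how the distance between locations is taken into account, so the greedy nearest-neighbour ordering and the Manhattan distance used here are assumptions made for illustration, as are the function names.

```python
def select_batch(ig, width, batch_size, measured):
    """Pick the unmeasured pixels with the largest information gain."""
    candidates = [i for i in range(len(ig)) if i not in measured]
    candidates.sort(key=lambda i: ig[i], reverse=True)
    return [(i // width, i % width) for i in candidates[:batch_size]]

def order_by_proximity(points, start):
    """Greedy nearest-neighbour ordering (assumed) to shorten voltage ramps."""
    remaining = list(points)
    route, here = [], start
    while remaining:
        nxt = min(remaining,
                  key=lambda p: abs(p[0] - here[0]) + abs(p[1] - here[1]))
        remaining.remove(nxt)
        route.append(nxt)
        here = nxt
    return route

# Toy 4 x 4 acquisition map, flattened row by row.
ig = [0.9, 0.7, 0.0, 0.0,
      0.0, 0.0, 0.0, 0.0,
      0.0, 0.0, 0.0, 0.0,
      0.0, 0.0, 0.8, 1.0]
batch = select_batch(ig, width=4, batch_size=4, measured=set())
route = order_by_proximity(batch, start=(0, 0))
```

Ranked purely by information gain, the batch alternates between opposite corners of the map; after reordering, each corner is visited once.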

RESULTS
To test the algorithm, it was used to acquire a series of current maps in real time. First, the device was thermally cycled, to randomise the charge traps and therefore present the algorithm with a configuration not represented in its training data. Next, gate voltages $V_1$ to $V_4$ were set to a new combination of values, and the algorithm was tasked to measure the corresponding current map using both the batch and the pixel-wise methods. This step was repeated for ten different voltage combinations. Fig. 3 presents data acquired during a typical iteration, together with selected reconstructions at each stage. As expected, reconstructions become less diverse as more measurements are acquired. The reconstructions do not necessarily replicate the measured current map for large $n$. This is because reconstructions have a limited variability given by the training data. Decisions are made based on the learned patterns from the training data, which implies that the training data should contain at least general patterns which are to be characterised. Consequently, the training dataset does not need to include all possible features in a current map.
In Fig. 4a two representative measurement sequences using the batch method are shown. The algorithm avoids measurements in regions of low current gradient. These regions coincide with the interiors of the Coulomb diamonds for the cases considered. This strategy is an emergent property of the algorithm and is wise; little information about the device characteristics can be found in low-current-gradient regions of the current map. This preference derives from the comparison between reconstructions, which exhibit the greatest disagreement outside Coulomb diamonds. The performance for other current maps, and for the pixel-wise method, is shown in the Supplementary Information.
FIG. 3. Posterior update of reconstructions. In each row, the first column shows the algorithm-assisted measurements, using the batch method, for a given $n$. The remaining three columns contain example reconstructions given the corresponding $n$ measurements. As $n$ increases, the diversity of the reconstructions is reduced and their accuracy increased. There is still uncertainty remaining even in the last row: the posterior distribution still contains variance.
We compared the performance of the algorithm with an alternating grid scan method. This type of grid scan starts with 8×8 measurements and alternately doubles the vertical and the horizontal grid size (i.e. 16×8, 16×16, 32×16, etc.), without performing the same measurement twice. Over the ten different current maps, the average time for full-resolution data acquisition with the alternating grid scan method is 554 seconds. This time is limited by our bias and gate voltage ramp rate and chosen settling time. The batch method can be implemented with any batch size; however, for direct comparison with the alternating grid scan we selected increasing batches of $32 \times 2^b$, where $b$ is the batch number starting from 1.
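The alternating grid scan can be written down in a few lines. This is a sketch under the assumption that the grids are nested and aligned to the top-left pixel of a 128 × 128 map; the function name is illustrative.

```python
def alternating_grid_stages(size=128, start=8):
    """Return [((rows, cols), new_points), ...] for an alternating grid scan."""
    rows, cols = start, start
    seen = set()
    stages = []
    grow_rows = True
    while True:
        pts = [(i * size // rows, j * size // cols)
               for i in range(rows) for j in range(cols)]
        # Only measure points that were not part of a coarser grid.
        new = [p for p in pts if p not in seen]
        seen.update(new)
        stages.append(((rows, cols), new))
        if rows == size and cols == size:
            break
        # Alternately double the vertical and the horizontal grid size.
        if grow_rows:
            rows *= 2
        else:
            cols *= 2
        grow_rows = not grow_rows
    return stages

stages = alternating_grid_stages()
new_counts = [len(new) for _, new in stages]
```

After the initial 64 points, each stage adds as many new points as have already been measured, which is why batches of $32 \times 2^b$ give a like-for-like comparison.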
Two types of computation are required to make a measurement decision: sampling reconstructions using MH and constructing the acquisition map. One MH sampling iteration takes 63 ms. In our experiments, multiple sampling iterations are performed while measurement is suspended, each time $n$ reaches one of the batch decision points. Since sampling can be performed simultaneously with periods of measurement acquisition, and thus need not add to the measurement time, the measurement times reported in this paper exclude the time for sampling. Computing a single acquisition map takes approximately 50 ms using an NVIDIA GTX 1080 Ti graphics card and a TensorFlow [24] implementation. The acquisition map must be computed for every batch or every pixel measurement, except for the initial 8 × 8 grid scan and the final acquisition step (in which there is no choice of which pixel(s) to measure). Acquiring a full-resolution current map thus requires 7 computations (350 ms) for the batch method, and 16,319 computations (816 s) for the pixel-wise method. For the batch method, the computation time is negligible compared to the measurement time, but for the pixel-wise method it is a limiting factor in the measurement rate.
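The computation counts quoted above follow from simple bookkeeping, reproduced here as a short check. The only assumption is that each batch doubles the number of measured pixels, consistent with the batch sizes of $32 \times 2^b$.

```python
N = 128 * 128          # pixels in a full-resolution map
initial = 8 * 8        # initial grid scan, no decision needed
map_time_s = 0.050     # time to compute one acquisition map

# Batch method: the number of measured pixels doubles with each batch.
n, batches = initial, 0
while n < N:
    n *= 2
    batches += 1
batch_decisions = batches - 1        # the final batch leaves no choice
batch_time_s = batch_decisions * map_time_s

# Pixel-wise method: one decision per pixel, except the very last one.
pixel_decisions = N - initial - 1
pixel_time_s = pixel_decisions * map_time_s
```

This gives 7 acquisition maps (0.35 s) for the batch method and 16,319 maps (about 816 s) for the pixel-wise method.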
We have developed a measure of the algorithm's performance that is based on the observation that measurements in regions of low current gradient are less informative than those in regions of high current gradient. Let $v(x)$ denote the numerically approximated Euclidean norm of the gradient, $\lVert \nabla Y(x) \rVert_2$, which is equivalent to the norm of the current gradient at $x$. Then the error is defined as
$$r(n) = 1 - \frac{V(n)}{V(N)},$$
where $N$ is the total number of pixels (in our case 16,384), and $V(n) = \sum_{i=1}^{n} v(x_i)$, in which $x_i$ is the location of the $i$th measurement. Hence $r(n)$ is the ratio of total current gradient at unmeasured locations to total current gradient in the entire map. This error can only be calculated after all measurements have been performed. However, we can utilise the $m$th reconstruction to generate an estimate $\hat{r}_m(n)$ in real time by replacing $\lVert \nabla Y(x) \rVert_2$ with $\lVert \nabla \hat{Y}_m(x) \rVert_2$. The estimates from multiple reconstructions yield a credibility interval for $r(n)$. The value of $r(n)$ for an optimal algorithm is obtained by replacing $V(n)$ with the sum of the $n$ largest values of $v(x)$ over the map. This is the performance that would be obtained if each measurement location were chosen knowing the full-resolution current map, so that the next measurement location is always the one with the highest unmeasured current gradient. No decision method can exceed this bound. For the real-time estimates of $r(n)$, we have increased the number of reconstructions to 3,000 by adding different noise patterns that are present in typical measured current maps (see Supplementary Information). This increase in the variability of the reconstructions is needed to avoid an overconfident estimation of $r(n)$.
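The error measure and its optimal bound can be sketched as follows. The finite-difference scheme (central differences in the interior, one-sided at the borders) is an assumption, since the text only states that the gradient is approximated numerically, and the 4 × 4 step-edge map is a toy example.

```python
import math

def gradient_norm(Y):
    """Finite-difference Euclidean norm of the gradient at every pixel."""
    h, w = len(Y), len(Y[0])
    v = {}
    for i in range(h):
        for j in range(w):
            ip, im = min(i + 1, h - 1), max(i - 1, 0)
            jp, jm = min(j + 1, w - 1), max(j - 1, 0)
            di = (Y[ip][j] - Y[im][j]) / (ip - im)
            dj = (Y[i][jp] - Y[i][jm]) / (jp - jm)
            v[(i, j)] = math.hypot(di, dj)
    return v

def error_curve(v, order):
    """r(n) = 1 - V(n)/V(N) for n = 0..N along a given measurement order."""
    total = sum(v.values())
    r, acc = [1.0], 0.0
    for x in order:
        acc += v[x]
        r.append(1.0 - acc / total)
    return r

# Toy map with a single vertical step edge.
Y = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
v = gradient_norm(Y)
raster = [(i, j) for i in range(4) for j in range(4)]
best_first = sorted(raster, key=lambda x: v[x], reverse=True)
r_raster = error_curve(v, raster)
r_opt = error_curve(v, best_first)  # optimal bound: largest gradients first
```

In this example the optimal ordering removes all of the error after measuring the eight pixels adjacent to the edge, whereas the raster order needs almost the full map.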
Performances for the two experiments are shown in Figs. 4c and 4e. Grid scans reduce $r(n)$ linearly with increasing $n$. The decision algorithm outperforms a simple grid scan and is nearly optimal. When most of the current gradient is localised, the grid scan is far from optimal.
We propose a simple stopping criterion that uses the estimated reduction of the error $r(n)$ to determine when to stop measuring a given current map, in a scenario where more experiments are waiting to be conducted. For a given current map, from which $n$ pixels have been measured, the error after the next measurement batch is estimated for reconstruction $m$ to be $\hat{r}_m(n+\Delta)$, where $\Delta$ is the size of the batch. Thus the estimated rate at which the error decreases is $\beta_m \equiv |\hat{r}_m(n+\Delta) - \hat{r}_m(n)|/\Delta$. In the worst case among the candidate reconstructions, this rate is $\beta \equiv \min_m \beta_m$. However, if the algorithm begins to measure a new map, for which no reconstructions yet exist, the error of that map will decrease at a rate of at least $\alpha \equiv 1/N$; this is the slope achieved by a grid scan and the worst case of the decision algorithm (black lines in Figs. 4c and 4e). Hence when $\beta < \alpha$, it is beneficial to halt measurement and move on to a new current map that is awaiting measurement. Since $\alpha$ and $\beta$ are the worst-case estimates for each case, the criterion is conservative. The stopping points given by this criterion are shown in Figs. 4c and 4e with orange dashed lines. The total average time (measurement time plus decision time) to reach the stopping criterion was 237 s, compared with 554 s to measure the complete current map by grid scan, reducing the time needed by a factor between 1.84 and 3.70 across all 10 test cases. A more sophisticated stopping criterion utilising the number of remaining unmeasured current maps and total measurement budget is given in Methods.
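The simple stopping rule amounts to a single comparison, sketched below. The function name and the numerical error estimates are hypothetical; `r_now` and `r_next` stand for the per-reconstruction estimates of the error before and after the next batch.

```python
def should_stop(r_now, r_next, delta, n_pixels=128 * 128):
    """Return (stop, beta, alpha) for the criterion beta < alpha.

    r_now[m] and r_next[m] are the error estimates from reconstruction m
    at n and at n + delta measured pixels.
    """
    alpha = 1.0 / n_pixels  # slope of a grid scan on a fresh map
    betas = [abs(b - a) / delta for a, b in zip(r_now, r_next)]
    beta = min(betas)       # worst case among the reconstructions
    return beta < alpha, beta, alpha

# Late in a scan: little error is left, so the next batch gains little.
stop_late, beta_late, alpha = should_stop(
    [0.10, 0.12, 0.08], [0.02, 0.03, 0.01], delta=4096)

# Early in a scan: the error is still falling quickly.
stop_early, beta_early, _ = should_stop(
    [0.90, 0.80], [0.70, 0.65], delta=128)
```

Taking the minimum over reconstructions makes the rule conservative: measurement halts only when even the least favourable reconstruction predicts a gain per pixel below that of a grid scan on a fresh map.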

GENERALITY
To demonstrate the versatility of the algorithm, which does not require assumptions about the characteristics of the acquired data, we applied it to a different measurement configuration also encountered in quantum dot tuning. In this configuration the current flowing through the device is measured as a function of two gate voltages ($V_1$ and $V_2$), while keeping the other voltages fixed ($V_G$, $V_\mathrm{bias}$, $V_3$ and $V_4$). The current map in this case has large areas where the current scarcely changes, and diagonal features indicative of Coulomb blockade. For the training set, we measured 382 current maps with a resolution of 251×251, which we cropped randomly and processed with simple image augmentation techniques.
We tested the performance of the algorithm in this new scenario by taking two different combinations of $V_G$, $V_\mathrm{bias}$, $V_3$ and $V_4$ and measuring the corresponding current maps in real time (Fig. 5). The device was thermally cycled after the training set was acquired and also between the acquisition of the two current maps in Fig. 5. The algorithm focused on measuring regions of high current gradient, the corner edges and, in particular, the Coulomb peaks close to these.
In the top row of Fig. 5a, $n = 4{,}096$ was chosen by the stopping criterion. In the bottom row, the corner edges extended further in the current map and the stopping criterion chose $n = 8{,}192$. This reduced the time needed to measure the current maps by factors of 3.36 and 1.50 for the two test cases when compared with the alternating grid scans.

AUTOMATING WHAT TO MEASURE NEXT
The proposed measurement algorithm makes real-time informed decisions on which measurements to perform next on a single quantum dot device. Decisions are based on the disagreement of competing reconstructions built from sparse measurements. The algorithm outperforms grid scan in all cases, and in the majority of cases shows nearly optimum performance. The algorithm reduced the time required to observe the regions of finite current gradient by factors ranging from 1.5 to 3.7 times.
Our algorithm with no modifications can be re-trained to measure different current map configurations. It simply requires a diverse dataset of training examples from which to learn. The decision algorithm performed well even when trained on a small data set of only 382 current maps (at a resolution of 251 × 251), implying that it is robust to limited training datasets. Our algorithm focused on observing all informative regions present in the current map. However, if only specific features such as Coulomb peaks or Coulomb diamond edges are of interest, the acquisition function can be specifically designed to focus on them (see Supplementary Information). We believe that our algorithm represents a significant first step in automating what to measure next in quantum devices. For a single quantum dot it provides a means of accelerating what can currently be achieved by human experimenters and other automation methods. Our algorithm can be applied to acquire data in other types of experiments provided an appropriate training data set is accessible and the generative model is retrained. It will not be long before this kind of approach enables experiments to be performed, and technology to be developed, that would not be feasible otherwise.

Distribution of reconstructions and sampling
Since deep generative models are known to work well when the data range is from -1 to 1, all measurements are rescaled so that the maximum absolute value of the initial measurements is 1. Let $Y$ be a random vector containing all pixel values. The CVAE model defines a distribution $p(Y \mid Y_i)$ and enables sampling from it. The observation $Y_n$, where $n \ge 1$, is the set of pairs of location $x_j$ and measurement $y_j$: $Y_n = \{(x_j, y_j) \mid j = 1, \dots, n\}$. A subset of measurements is also defined: $Y_{n:n'} = \{(x_j, y_j) \mid j = n, \dots, n'\}$. The likelihood of observations given $Y$ is defined by
$$p(Y_n \mid Y) \propto \exp\Big(-\lambda \sum_{j=1}^{n} \big|y_j - Y(x_j)\big|\Big), \qquad (2)$$
where $Y(x)$ is the pixel value of $Y$ at $x$, and $\lambda$ is a free parameter that determines the sensitivity to the distance metric and is set to 1.0 for all experiments in this paper. The posterior probability distribution is defined by Bayes' rule:
$$p(Y \mid Y_n) = \frac{p(Y_n \mid Y)\, p(Y)}{p(Y_n)}.$$
Likewise, we can find the posterior distribution of $z$ given measurements instead of $Y$. Let $z'$ denote the other input of the decoder, which is set to $Y_i$ in the experiments. Then the posterior distribution of $z$ can be expressed with $z'$ when $n \ge i$:
$$p(z \mid Y_n, z') \propto p(z \mid z') \int p(Y_n \mid Y)\, p(Y \mid z, z')\, dY = p(z)\, p(Y_n \mid \hat{Y}_z), \qquad (3)$$
where $\hat{Y}_z$ is the reconstruction produced by the decoder given $z$ and $Y_i$. Since all inputs of the decoder are given, $p(Y \mid z, z')$ is the Dirac delta function centred at $\hat{Y}_z$. Also, $p(z \mid z') = p(z)$, as $z$ and $z'$ are independent. The proposal distribution for MH is set to a multivariate normal distribution with a centred mean and a covariance matrix equal to one quarter of the identity matrix. For the experiments in this paper, 400 MCMC iterations are conducted when $n = 64 \times 2^b$, where $b$ is any integer larger than or equal to 0. We found that 400 iterations result in good posterior samples. If $(x_{n+1}, y_{n+1})$ is newly observed, then the posterior can be updated incrementally:
$$p(z \mid Y_{n+1}, z') \propto p(Y_{n+1:n+1} \mid \hat{Y}_z)\, p(z \mid Y_n, z'),$$
because each term in (2) can be separated.
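The separability that allows the incremental update can be checked numerically. The sketch below assumes a likelihood that decays exponentially with the summed absolute differences between measurements and reconstruction, and treats a handful of reconstructions as equally weighted samples; the pixel indices and values are made up for the example.

```python
import math

def log_likelihood(obs, recon, lam=1.0):
    """Log-likelihood of observations [(x, y), ...] under one reconstruction."""
    return -lam * sum(abs(y - recon[x]) for x, y in obs)

def weights(obs, recons, lam=1.0):
    """Normalised posterior weights computed from all observations at once."""
    logs = [log_likelihood(obs, r, lam) for r in recons]
    top = max(logs)  # subtract the maximum for numerical stability
    w = [math.exp(l - top) for l in logs]
    s = sum(w)
    return [wi / s for wi in w]

def update(prev, new_obs, recons, lam=1.0):
    """Incremental update of the weights with a single new observation."""
    x, y = new_obs
    w = [p * math.exp(-lam * abs(y - r[x])) for p, r in zip(prev, recons)]
    s = sum(w)
    return [wi / s for wi in w]

recons = [[0.1, 0.5, 0.9], [0.0, 0.4, -0.2], [0.3, 0.6, 0.7]]
obs = [(0, 0.1), (1, 0.55), (2, 0.8)]

batch = weights(obs, recons)                          # all three at once
incremental = update(weights(obs[:2], recons), obs[2], recons)
```

Both routes give the same weights, because the likelihood factorises over individual measurements.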

Decision algorithm
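In this section, we derive a computationally simple form of the information gain and show that maximising the information gain is equivalent to minimising the entropy. Let $p_n(\cdot) = p(\cdot \mid Y_n, z')$; any probabilistic quantity of $y_{n+1}$ is conditioned on $x_{n+1}$, which is omitted for brevity.
The continuous version of the information gain equation is
$$\mathrm{IG}(x_{n+1}) = \mathbb{E}_{p_n(y_{n+1})}\Big[\mathrm{KL}\big(p_n(z \mid y_{n+1}) \,\big\|\, p_n(z)\big)\Big] = \iint p_n(y_{n+1}, z) \log \frac{p_n(z \mid y_{n+1})}{p_n(z)}\, dz\, dy_{n+1} = I(z; y_{n+1}), \qquad (4)$$
where KL is the Kullback-Leibler divergence and $I(\cdot\,;\cdot)$ is the mutual information. Since $I(z; y_{n+1}) = H(z \mid Y_n) - H(z \mid Y_n, y_{n+1})$, and $H(z \mid Y_n)$ does not depend on the next measurement, maximising the expected KL divergence is equivalent to minimising $H(z \mid Y_n, y_{n+1})$, which is the entropy of $z$ after observing $y_{n+1}$. Since this integral is hard to compute, we approximate probability density functions (PDFs) with samples and substitute them into (4). Let $n_s$ be the number of measurements that are used for sampling the reconstructions $\hat{Y}_m$. The posterior is then approximated by the sample weights
$$P_n(m) = \frac{p(Y_{n_s+1:n} \mid \hat{Y}_m)}{\sum_{m'} p(Y_{n_s+1:n} \mid \hat{Y}_{m'})}.$$
For brevity, the sampling distribution information $n_s$ is omitted for the remainder of this section. Likewise, $p_n(y_{n+1}) = \int p_n(y_{n+1} \mid z)\, p_n(z)\, dz \approx \sum_m P_n(m)\, p_n(y_{n+1} \mid z_m)$. Lastly, we use the value of $\hat{Y}_m$ at $x_{n+1}$ as a sample of $p_n(y_{n+1} \mid z_m)$ for simple and efficient computation. As a result, the information gain is approximated by
$$\mathrm{IG}(x_{n+1}) \approx \sum_m P_n(m)\, \mathrm{KL}\big(P_{n+1}^{(m)} \,\big\|\, P_n\big),$$
where $P_{n+1}^{(m)}$ is the posterior distribution of reconstructions obtained by updating $P_n$ under the assumption that $y_{n+1} = \hat{Y}_m(x_{n+1})$.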

Simulator for Training data
To aid the training of the model and prevent over-fitting, simulated training data were used. The simulated data were produced via a simple implementation of the constant interaction model [23], along with basic data augmentation techniques. These techniques were not intended to be physically accurate but instead to produce quickly a diverse set of examples that contain features that mimic real data. The constant interaction model assumes that all interactions felt by a confined electron within the dot can be captured by a single constant capacitance $C_\Sigma$, given by $C_\Sigma = C_S + C_D + C_G$, where $C_S$, $C_D$ and $C_G$ are the capacitances to the source, drain and gate respectively. Under this assumption, the total energy of the dot $U(N)$, where $N$ is the number of electrons occupying the dot, is
$$U(N) = \frac{\big(-|e|(N - N_0) + C_S V_S + C_D V_D + C_G V_G\big)^2}{2 C_\Sigma} + \sum_{n=1}^{N} E_n,$$
where $V_S$, $V_D$ and $V_G$ are the source, drain and gate voltages, $N_0$ compensates for the background charge, and $E_n$ is a term that represents the occupied single-electron energy levels, which are characterised by the confinement potential.
Using this we derive the electrochemical potential
$$\mu(N) = U(N) - U(N-1) = \Big(N - N_0 - \tfrac{1}{2}\Big) E_C - \frac{E_C}{|e|}\big(C_S V_S + C_D V_D + C_G V_G\big) + E_N,$$
where $E_C = e^2/C_\Sigma$ is the charging energy. To produce a training example, random values are generated for $C_S$, $C_D$ and $C_G$. The energy levels within a randomly generated gate voltage window and source-drain bias window are then counted. To aid generalisation to real data, we randomly generated energy level transitions (which are also counted) and applied a slight linear scaling of $C_\Sigma$, $C_S$, $C_D$, and $C_G$ with $N$. This linear scaling was also randomly generated and produces diamonds that vary in size with respect to $V_G$. Examples of the training data produced by this simulator can be seen in the Supplementary Material.
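A stripped-down version of such a simulator is sketched below. It is not the simulator used to build the training set: it works in units where the elementary charge is 1, grounds the drain, ignores the single-electron level spacing, excited-state transitions and the capacitance scaling, and uses the number of dot levels inside the bias window as a crude proxy for current. All function names and parameter values are illustrative.

```python
def mu(n, v_s, v_d, v_g, c_s, c_d, c_g, n0=0.0, level=0.0):
    """Electrochemical potential mu(N) in units where |e| = 1."""
    c_sigma = c_s + c_d + c_g
    e_c = 1.0 / c_sigma                      # charging energy e^2 / C_sigma
    gate_term = c_s * v_s + c_d * v_d + c_g * v_g
    return (n - n0 - 0.5) * e_c - e_c * gate_term + level

def levels_in_window(v_bias, v_g, c_s, c_d, c_g, n_max=20):
    """Count dot levels lying between the source and drain potentials."""
    mu_s, mu_d = -v_bias, 0.0                # electron energies, drain grounded
    lo, hi = min(mu_s, mu_d), max(mu_s, mu_d)
    return sum(1 for n in range(1, n_max + 1)
               if lo <= mu(n, v_bias, 0.0, v_g, c_s, c_d, c_g) <= hi)

# Toy map: rows are bias voltages, columns are gate voltages.
biases = [-1.0, -0.5, 0.0, 0.5, 1.0]
gates = [0.1 * k for k in range(40)]
current_map = [[levels_in_window(vb, vg, 0.25, 0.25, 0.5) for vg in gates]
               for vb in biases]

blockaded = levels_in_window(0.0, 0.3, 0.25, 0.25, 0.5)    # inside a diamond
conducting = levels_in_window(-1.0, 0.0, 0.25, 0.25, 0.5)  # wide bias window
```

At zero bias and a generic gate voltage no level lies in the window, which corresponds to the interior of a Coulomb diamond; widening the bias window admits levels and hence current.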

Stopping criterion
Utility, denoted by $u$, is the ratio of total measured gradient to the total gradient of a stability diagram: $u(n) = 1.0 - r(n)$. Here, we assume that we have $K$ more stability diagrams to be measured. The location of each diagram is defined by a different voltage range, and $k = 0, \dots, K$ is the index of the diagrams, where $k = 0$ is the index of the diagram that we are currently measuring.
Let $T$ denote the total measurement budget for the current and remaining stability diagrams. In this paper we assume that the unit budget for measuring one pixel is 1.0. The total utility is
$$u_\mathrm{tot} = \sum_{k=0}^{K} u_k(t_k) = u_0(t_0) + u_\mathrm{nxt}(T - t_0),$$
where $u_k(\cdot)$ is the utility from measuring the $k$th diagram, $t_k$ is the planned budget for the $k$th diagram satisfying $\sum_{k=0}^{K} t_k = T$, and $u_\mathrm{nxt}(T - t_0) = \sum_{k=1}^{K} u_k(t_k)$. Let $t$ denote the budget already spent on the current diagram, $t \le t_0$. If we stop the measurement then $t_0 = t$, whereas $t_0 = t + \Delta$ if we decide to continue the measurement, where $\Delta$ is a predefined batch size. For the decision, the utilities of the two cases are compared. When $t_0 = t$,
$$u_\mathrm{tot} = u_0(t) + u_\mathrm{nxt}(T - t). \qquad (5)$$
Otherwise, $t_0 = t + \Delta$ and
$$u_\mathrm{tot} = u_0(t + \Delta) + u_\mathrm{nxt}\big(T - (t + \Delta)\big). \qquad (6)$$
If (6) < (5), it is better to stop and move to the next voltage range. Rearranging the inequality leads to
$$u_0(t + \Delta) - u_0(t) < u_\mathrm{nxt}(T - t) - u_\mathrm{nxt}\big(T - (t + \Delta)\big). \qquad (7)$$
The left-hand side (lhs) of (7) is the gain in utility if we invest $\Delta$ more budget in the current diagram, and the right-hand side (rhs) is the gain when $\Delta$ more budget is used for the remaining diagrams. As discussed in the Results section, we can calculate multiple slope estimates $\beta_m$ for spending $\Delta$ on the current diagram:
$$u_0(t + \Delta) - u_0(t) \approx \beta_m \Delta.$$
The rhs of (7) can be approximated by $\alpha\Delta$ if $K = \infty$, where $\alpha = 1/16{,}384$ is the slope of a grid scan measuring a new stability diagram. Note that $\alpha$ can be considered as the empirical worst-case performance of the decision algorithm measuring a new diagram, as it holds for all the experiments we have conducted. If $\Delta = N$, this approximation is exact for any algorithm, as all algorithms satisfy $r(0) = 1.0$ and $r(N) = 0.0$. Since $\alpha$ can be interpreted as the worst-case estimate, we also approximate the lhs of (7) with the worst-case estimate $\beta = \min_m \beta_m$.
If $K < \infty$ and the remaining budget $T - t$ is more than the budget needed to measure all of the remaining diagrams, no utility is gained once all measurements are finished. Hence, the approximation of the rhs of (7) is capped accordingly, where $K$ is the number of remaining diagrams to be measured.
As a result, the stopping criterion when $K = \infty$ is $\beta < \alpha$.
The stopping criterion when $K < \infty$, (8), has the same form, with $\alpha$ replaced by the capped approximation divided by $\Delta$. The rhs of (8) is always less than or equal to $\alpha$, and a larger total budget $T$ lowers it, which leads to late stopping or no stopping.

The purpose of this section is to define the losses used for the training of the CVAE. The training is performed by minimising user-defined loss terms through changing the decoder and encoder parameters $\theta$ and $\phi$ using a gradient-descent-based method. The two loss terms that are minimised to train the encoder and decoder networks are the difference loss and the latent loss.
The difference loss consists of two difference metrics. The first is a sum of the pixel-wise difference between the reconstruction and the training example. The second is a contextual difference which is similar in concept to a GAN; the contextual loss is taken from another convolutional neural network called the discriminator. The discriminator is trained in tandem with the encoder and decoder and is trained to distinguish between reconstructions and training examples. The input to the discriminator is a training example $Y$ or reconstruction $\hat{Y}$ and the output is a value between 0 and 1, representing the probability the input is a training example or a reconstruction. As the discriminator is trained to distinguish between training examples and reconstructions, it learns to decode contextual features that distinguish reconstructions from training examples. We then calculate the difference between intermediate layer representations of the training example and intermediate layer representations of its reconstruction. If we ignore the contextual loss, the decoder produces only blurry reconstructions.
The latent loss is applied only to the encoder and forces the set of encoded training examples {z} to be normally distributed with mean of the zero vector and the covariance of a diagonal matrix. This can be achieved by minimising the Kullback-Leibler (KL) divergence between the output distribution of the encoder and the target zero-mean distribution.
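For a Gaussian encoder output with diagonal covariance, this KL divergence has a closed form. The sketch below assumes that the target distribution is the standard normal (zero mean, identity covariance), as is usual for VAEs, and that the encoder outputs a mean and a log-variance per latent dimension; the function name is illustrative.

```python
import math

def latent_kl(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))

matched = latent_kl([0.0, 0.0], [0.0, 0.0])  # already standard normal
shifted = latent_kl([1.0], [0.0])            # unit variance, mean offset of 1
```

The loss vanishes when the encoder output already matches the target and grows with the squared offset of the mean.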

Network specification
The specifications of the convolutional neural networks used in this paper are given in Tables S1 to S3. An exponential linear unit is applied after each layer except the final layer of the encoder, the decoder, and the discriminator. Batch normalisation is applied after all convolution layers except where separately described. The first and second numbers in parentheses in the layer names indicate the kernel size and the stride.

Noisy reconstruction
For the estimation of $\hat{r}_m(n)$, a single reconstruction $\hat{Y}_m$ is augmented to 30 noisy reconstructions:
$$\hat{Y}_{m,j,\mathrm{SNR}}(x) = \hat{Y}_m(x) + \alpha_{m,j,\mathrm{SNR}}\, \epsilon_{j,x},$$
where $E_j = \{(x, \epsilon_{j,x}) \mid x \in X\}$ is a noise profile consisting of pairs of location and noise, $X$ is the set of all voltage pairs in a 2D domain, and $\alpha_{m,j,\mathrm{SNR}}$ is a multiplier that sets the signal-to-noise ratio to SNR, where the signal is $\hat{Y}_m$ and the noise is $E_j$. We measured 10 noise profiles at non-conducting voltage ranges, but very close to Coulomb diamonds, and $j$ is the index of the profile. SNR is chosen from $\{20^2, 40^2, 80^2\}$, which correspond to high, medium, and low noise respectively.

Layer name           Output size
Initial              128x128x1
Conv (5,2)           64x64x64
Max pooling (3,2)    32x32x64
Conv (3,1)           32x32x128
Conv (3,2)           16x16x128
Conv (3,1)           16x16x128
Conv (3,2)           8x8x128
Conv (3,1)           8x8x128
Conv (3,2)           4x4x128
Fully connected      200
TABLE S1. Specification of the encoder.
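The noise augmentation described in this section can be sketched as follows. The definition of SNR as the ratio of mean-square signal to mean-square noise is an assumption (the text gives only the target values), and the signal and noise profiles below are made-up placeholders for a reconstruction and the ten measured profiles.

```python
import math

def noise_multiplier(signal, noise, snr):
    """Scale factor that brings signal power / noise power to `snr`."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(e * e for e in noise) / len(noise)
    return math.sqrt(p_signal / (snr * p_noise))

def augment(signal, noise_profiles, snrs=(20 ** 2, 40 ** 2, 80 ** 2)):
    """Return one noisy copy of `signal` per (noise profile, SNR) pair."""
    out = []
    for noise in noise_profiles:
        for snr in snrs:
            a = noise_multiplier(signal, noise, snr)
            out.append([s + a * e for s, e in zip(signal, noise)])
    return out

# Placeholder data: a flattened 4-pixel reconstruction and 10 noise profiles.
signal = [1.0, -1.0, 0.5, -0.5]
profiles = [[0.1 * (k + 1), -0.1 * (k + 1), 0.05, -0.05] for k in range(10)]
noisy = augment(signal, profiles)

# Check the achieved SNR for the first profile at the highest noise level.
a = noise_multiplier(signal, profiles[0], 20 ** 2)
p_signal = sum(s * s for s in signal) / len(signal)
p_noise = sum((a * e) ** 2 for e in profiles[0]) / len(profiles[0])
achieved = p_signal / p_noise
```

Ten profiles at three noise levels give the 30 noisy copies per reconstruction, and hence 3,000 in total for 100 reconstructions.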

Context-aware decision for stability diagrams
By converting reconstructions to context maps, we can make decisions related to the context map. We have developed a segmentation method that produces a segmentation map whose value is 1 if the location is inside a diamond and 0 otherwise. This segmentation method is based on another deep neural network called a U-net [25]. Training data for the segmentation network are pairs of current map and segmentation map, which are generated by the same simulator used for the reconstruction network. Fig. S7 shows the segmentation result of a trained network for 10 real stability diagrams.
By producing segmentation maps of reconstructions, their segmentation disagreement can be calculated. This produces large disagreement along the edges of reconstructions, resulting in measurements that focus on diamond edges as shown in Fig. S8a. Noise is also added outside the diamonds in the segmented maps. This supplies further disagreement between segmentation maps, which prioritises measurement outside of the diamonds after the edges are measured.
The success measure e(n) in Fig. S8b is calculated by applying the segmentation model to the fully measured current map and then applying a Sobel filter to the resulting segmented map; this produces an edge map. The error and optimal performance are then calculated as the ratio of this remaining quantity in the same way as was done for r(n) except substituting the edge maps for transconductance maps.