Machine learning enables completely automatic tuning of a quantum device faster than human experts

Variability is a problem for the scalability of semiconductor quantum devices. The parameter space is large, and the operating range is small. Our statistical tuning algorithm searches for specific electron transport features in gate-defined quantum dot devices with a gate voltage space of up to eight dimensions. Starting from the full range of each gate voltage, our machine learning algorithm can tune each device to optimal performance in a median time of under 70 minutes. This performance surpassed our best human benchmark (although both human and machine performance can be improved). The algorithm is approximately 180 times faster than an automated random search of the parameter space, and is suitable for different material systems and device architectures. Our results yield a quantitative measurement of device variability, from one device to another and after thermal cycling. Our machine learning algorithm can be extended to higher dimensions and other technologies.

Gate defined quantum dots are promising candidates for scalable quantum computation and simulation [1,2].They can be completely controlled electrically and are more compact than superconducting qubit implementations [1].These devices operate as transistors, in which electrons are controlled by applied gate voltages.If these gate voltages are set correctly, quantum dots are created, enabling single-electron control.If two such quantum dots are created in close proximity, the double quantum dot can be used to define robust spin qubits from the singlet and triplet states of two electrons [3,4].Due to device variability, caused by charge traps and other device defects, the combination of gate voltage settings which defines a double quantum dot varies unpredictably from device to device, and even in the same device after a thermal cycle.This variability is one of the key challenges that must be overcome in order to create scalable quantum circuits for technological applications such as quantum computing.Typical devices require several gate electrodes, creating a high dimensional parameter space difficult for humans to navigate.Tuning is thus a time consuming activity and we are reaching the limits of our ability to do this manually in arrays of quantum devices.To find, in a multi-dimensional space, the gate voltages which render the device operational is referred to in the literature as coarse tuning [5,6].
Here, we present a statistical algorithm which is able to explore the entire multi-dimensional gate voltage space available for electrostatically defined double quantum dots, with the aim of automatically tuning them and studying their variability.Until this work, coarse tuning required manual input [7] or was restricted to a small gate voltage subspace [8].We demonstrate a completely automated algorithm which is able to tune different devices with up to eight gate electrodes.This is a challenging endeavour because the desired transport features are only present in small regions of the gate-voltage space, surrounded by large regions in which the device is either completely pinched-off or tunnel barriers are too open to observe single electron transport.Moreover, the transport features that indicate the device is tuned to the double dot regime are time-consuming to measure and difficult to parametrise.Machine learning techniques and other automated approaches have been used for tuning quantum devices [5][6][7][8][9][10][11][12][13].These techniques are limited to small regions of the device parameter space or require information about the device characteristics.We believe our work significantly improves the state-of-the-art: our algorithm models the entire parameter space and tunes a device completely automatically (without human input), in approximately 70 minutes, faster than the typical tuning by a human expert [14].
Our algorithm explores the gate-voltage space by measuring the current flowing through the device, and its design makes only a few assumptions, allowing it to be readily applied to other device architectures.Our quan- † Both authors contributed equally and are displayed in random order.arXiv:2001.02589v1[cond-mat.mes-hall]8 Jan 2020 tum dot devices are defined in a 2-dimensional electron gas in a GaAs/AlGaAs heterostructure by Ti/Au gate electrodes.DC voltages applied to these gate electrodes, V 1 to V 8 , create a lateral confinement potential for electrons.A bias voltage V bias is applied to ohmic contacts to drive a current (I) through the device.The device schematic, designed for precise control of the confinement potential [15][16][17], is shown in Fig. 1a.Measurements were performed at 50 mK.
We consider the space defined by up to eight gate voltages between 0 and -2 V.This range was chosen to avoid leakage currents.In this parameter space, the algorithm has to find the desirable transport features within tens of mV.Identifying these features is slow, as it requires to measure I as a function of two plunger gate voltages, i.e. gate voltages that significantly modify the electron occupation of each quantum dot.Note that fast readout techniques are only suited for small regions of the parameter space.Our algorithm is thus designed to minimize the number of two dimensional current maps required.We make two observations.Firstly, that for very negative gate voltages, no current will flow through the device, i.e. the device is pinched-off.Conversely, for very positive gate voltages, full current will flow and single electron transport will not be achieved.This means that transport features are expected to be found on the hypersurface that separates low and high current regions in parameter space.The second observation is that to achieve single-electron transport, a confinement potential is needed.The particular transport features that evidence single-electron transport are Coulomb peaks, which are peaks in the current flowing through the device as a function of a single plunger gate voltage.These observations leads us to only two modelling assumptions: (i) single and double quantum dot transport features are embedded in a boundary hypersurface, shown in Fig. 1b, which separates regions in which a measurable current flows, from regions in which the current vanishes; (ii) large regions of this hypersurface do not display transport features.
The algorithm consists of two parts: a sampler that generates candidate coordinates on the hypersurface, and an investigation phase in which we collect data in the vicinity of each candidate coordinate to evaluate transport features (Fig. 2a).The result of the investigation phase feeds back into the sampler, which chooses a new candidate coordinate in the light of this information.The objective of the sampler is to produce candidate coordinates in gate voltage space for which the device operates as a double quantum dot.A block diagram of the algorithm is displayed in Fig. 2b.Our modelling assumptions are based on the physics of gate defined devices leading to minimal constraints; we do not assume a particular shape for the hypersurface, and we instead allow measurements to define it by fitting the data with a Gaussian process.
We demonstrate over several runs, in two different devices and over thermal cycles, that the algorithm is able to find transport features corresponding to double quantum dots.We perform an ablation study, which identifies The current threshold considered to define this hypersurface is 20% of the maximum measured current.The gate voltage parameter space, restricted to 3D for illustration, contains small regions in which double and single quantum dot transport features can be found.These regions typically appear darker in this representation because they produce complex boundaries.Right: For particular gate voltage coordinates marked with green crosses, the current as a function of V7 and V3 is displayed.The top and bottom current maps display double and single quantum dot transport features, respectively.
the relative contribution of each of the modules that constitute the algorithm, justifying its design.Finally, we demonstrate that our algorithm is capable of quantifying device variability, which has only been theoretically explored so far [18].
Automating experimental science has the impact to significantly accelerate the process of scientific discovery.In this work we show that a combination of simple physical principles and flexible probabilistic machine learning models can be used to efficiently characterise and tune a device.We envisage that in the near future such judicious application of machine learning will have tremendous impact even in areas where only small amounts of data are available and no clear fitness functions can be defined.

I. SAMPLING PHASE
The algorithm starts by fixing V bias and performing a measurement of current at the two extremes of the gate voltage space, V i = 0 and V i = −2 V for i = 0, ..., N where N indicates the number of gate electrodes.This initial calibration sets the current scale that the algorithm requires to define the hypersurface.The search range was chosen to span the typical gate voltage settings for tuning lateral quantum dots, which are usually a few hundred mV.Once this initialization phase is completed, the sampling phase begins.The aim of the sampler is to propose candidate gate voltage coordinates to be evaluated by the investigation phase described in section II.The sampler is very general, just relying on our two modelling assumptions, FIG. 2. Overview of the algorithm.a The sampling phase produces candidate coordinates in gate voltage space, which are on the boundary hypersurface (pink surface).The distance between a candidate coordinate (red cross) and the origin of the gate voltage space is marked with a dashed line.The investigation phase, evaluates the local region by, for example, measuring current maps which are evaluated by a score function.Evaluation results are fed back to the sampling phase.b Algorithm's block diagram.The algorithm is initialized by fixing V bias and performing a current measurement at the two extremes of the gate voltage space.The next phase is sampling, in which a candidate coordinate is first produced from a modelled hypersuface, selected and then found on the true hypersurface by performing a current measurement.Also, pruning of unusable regions of the gate voltage space is performed.The investigation phase first detects Coulomb peaks in the current measurement, and if successful, then the Coulomb peak spacing defines a gate voltage window to produce a current map at low resolution.The algorithm increases the scan resolution if the low resolution current map is promising in terms of double quantum dot transport features according to the score function.The result of the investigation phase is fed to the sampling phase after the high resolution scan, or earlier if no Coulomb peaks are detected or if the low resolution scan does not pass the threshold score.Humans label the high resolution map to confirm the presence of features corresponding to the double quantum dot regime.
as it models the hyper-surface (i), and prunes regions of the hyper-surface that do not contain transport features (ii).For pruning, feedback from the investigation phase is required to evaluate the presence of transport features in a given region of the hypersurface.The sampler learns from this feedback where desired transport features are likely to be found on the hypersurface.

A. Hypersurface model
To parametrise the hypersurface, we define the origin o of the gate voltage space as V i = −100 mV for i = 0, ..., N .This is to allow for measurements around o while remaining within the defined gate voltage range.We also indicate directions in gate voltage space with a unit vector u.We define the distance from o to the hypersurface for a given direction u as r(u).The corresponding coordinate on the hypersurface is v(u) = r(u)u + o.From measured distances {r(u i )} i=1,...,n in n sampling iterations, the algorithm predicts the distance r(u) for every u using a Gaussian process regression: r(u) ∼ N (m(u), s 2 (u)), i.e. r(u) follows a Gaussian distribution N with m(u) and s 2 (u) the posterior mean and variance, respectively (Fig. 3a).We define ṽ(u) = r(u)u+o the hypersurface coordinate predicted by the model in direction u.Details of the prediction models can be found in the Supplementary Material.
For a randomly generated direction u, the algorithm verifies this prediction by measuring the current along u.This current measurement does not commence at o, since it is time-consuming to sweep a big range of gate voltages.It instead starts from a coordinate g(u) = (m(u) − 2s(u))u + o, which is with high probability in a region of high current, and stops when the true hypersurface coordinate v(u) is found.The coordinate g(u) is checked to be always further away from V i = 0 mV than o.If this coordinate were to be in a pinched-off region, the algorithm measures the current along −u and when high values of current are found, it reverts to measuring along u to find v(u).
To sample the hypersurface uniformly, we cannot rely on uniformly distributed random directions of u, because the resulting set of predictions ṽ(u) will be denser in regions of the hypersurface for which r(u) is shorter.To achieve approximately uniform sampling, we simulate particles undergoing Brownian motion inside the volume delimited by the modelled hypersurface m(u).The algorithm collects the coordinates on this hypersurface which the particles hit.We have developed this Brownian motion approach based on the findings from [19], where an algorithm for generating random samples from geometric objects is established.

B. Selection and pruning
From the set {ṽ} collected by the Brownian particle simulation, one coordinate is chosen according to the estimated probability of finding transport features in the coordinate's vicinity.The particular transport features we are interested in are sets of Coulomb peaks evidencing single-electron tunneling.The algorithm uses Gaussian process classification [20] to estimate the probability of measuring Coulomb peaks Ppeak (v) in the vicinity of a given v.When selecting a coordinate to sample we must make a decision about whether to exploit our prediction of Ppeak (v) (by choosing the maximum of Ppeak (v)) or whether to explore the parameter space in order to better predict Ppeak (v).To balance this trade-off we utilise Thompson sampling which selects candidate coordinates The gate voltage space is restricted to two dimensions for illustration, and its origin is marked with a circle.The origin does not coincide with 0 gate voltages so that its surroundings are included in the considered gate voltage range.a Schematic hypersurface model.The predictions of the model are schematically indicated.Blue (pink) areas represent regions of predicted zero (non zero) current, and the green line represents the predicted hypersurface with uncertainty bounds.These uncertainty bounds vanish for each measured coordinate on the hypersurface.The arrows represent Brownian particles simulated to achieve uniform sampling of the hypersurface.Inset: Dark pink indicates the true non-zero current region and lighter pink the same region predicted by the model.The algorithm verifies the model prediction (green cross) by measuring the current starting from a coordinate g(u), chosen with high probability in a region of high current, and stops when the true hypersurface is encountered (red cross).b Selection and pruning.Dark pink indicates a schematic non-zero current region in a blue background representing currents close to zero.The dashed line shows the sweep in gate voltages from a coordinate on the hypersurface to a higher-current coordinate v δ (u).From v δ (u), black lines indicate current traces as a function of decreasing gate voltages 1 and 2. In this example, pinch-off is not observed as a function of gate voltage 1.The algorithm will then move the origin to match the gate voltage 2 component of v δ (u).The multi-colored line represents the estimated probability of finding transport features in the vicinity of a coordinate on the hyper-surface.
We have also introduced a pruning rule to rapidly exclude directions u for which the hypersurface cannot be intersected, since it lies outside the gate voltage range considered.The algorithm takes a pinch-off coordinate v(u), increases all gate voltages by a step-back parameter, and takes current traces along the gate voltage axes (in direction of decreasing gate voltages) to look for pinch-off.Because the step-back vector δ has all positive elements, the starting point of these traces v δ (u) = v(u) + δ will probably be located in a high current region.If a pinchoff coordinate can only be found along one gate voltage axis k, then the algorithm moves the k component of the origin o to match the k component of v δ (u) (Fig. 3b).This pruning rule excludes very quickly those regions of the parameter space for which the hypersurface cannot be intersected within the considered gate voltage range, and thus it is just applied for the first 30 iteration of the algorithm.We chose the step-back parameter as 100 mV for all experiments.Because o was chosen to allow for this step back, v δ (u) to always lie in the considered gate voltage range.

II. INVESTIGATION PHASE
The aim of the investigation phase is to evaluate the local region of a candidate coordinate to determine if desired transport features are present, and provide feedback to the sampling phase about its findings.It is important to obtain a fast but reliable evaluation of candidate coordinates.The algorithm first performs a quick measurement to determine if there are Coulomb peaks in the vicinity of a given candidate coordinate.This peak detection test will inform the selection of candidate coordinates at the sampling phase.
If Coulomb peaks are found in the vicinity of a candidate coordinate, indicating that it is in a quantum dot regime, the algorithm further explores the region to determine whether it is in a double quantum dot regime, and if so produces a score ranking its quality.

A. Peak detection
For a given pinch-off coordinate v(u), the algorithm explores the plane containing this coordinate and defined by directions in gate voltage space V3 and V7 , chosen as increasing plunger gate voltages V 3 and V 7 .The first measurement made by the investigation phase is a diagonal trace of current in this plane, i.e. in direction Ve = 1 √ 2 ( V3 + V7 ) opposite to pinch-off, starting from v(u).The trace is 128 mV long in gate voltage space and the voltage resolution is 1 mV.The voltage range and resolution are chosen parameters.Peak detection is used to determine the presence of Coulomb peaks in this trace.If the current trace does not reveal peaks, investigation of this coordinate v(u) ceases, and the sampler generates new candidate coordinate in the light of this information.If peaks are observed then the investigation phase progresses to further exploration of the plane in question.

B. Score function decision
If a Coulomb peak was observed at the peak detection stage, the algorithm proceeds to explore further the vicinity of the candidate coordinate.The average spacing between observed peaks is used to define a window in the plane given by Ve and the orthogonal vector Va = 1 √ 2 ( V3 − V7 ).This window in gate voltage space is square and it is bounded in direction Ve by the pinchoff coordinate v(u).The window's size is 3.5 times the average spacing between peaks.If less than three peaks are observed, the window's default size is chosen to be 100 mV to observe other peaks if present.A current map in this window can be measured with different resolutions.We define low (high) resolution as 16 × 16 (48 × 48) pixels in the window.
With a low resolution current map, the algorithm evaluates the quantum dot regime using a score function.This score function is a predefined mathematical expression designed to reward specific transport features.A honeycomb pattern observed in the current map indicates that the device is tuned to the double quantum dot regime.The score function thus rewards current maps in which sharp and curved lines are observed.A detailed description of the score function can be found in the Supplementary Material.
Based on this score, the algorithm makes a decision.If the score is bigger than or equal to a certain threshold, the current map is measured again with high resolution for a human to check if the device is indeed tuned to the double quantum dot regime.If the score is below the threshold, the algorithm reverts to the sampling phase.In this way, just current maps in which a honeycomb pattern is observed are selected (see appendix A).For our experiments we dynamically adjust the threshold during the experiment such that the algorithm proceeds to measure at high resolution only 15% of the low resolution current maps.The statistical analysis on the optimal score threshold can be found in the Supplementary Material.

III. RESULTS
The performance of our algorithm is assessed by a statistical analysis of the expected success time µ t .This is defined as the time it takes the algorithm to acquire a high-resolution current map that is confirmed a posteriori by humans as containing double quantum dot features.Because human labelling is subjective, three different researchers labelled all current maps, deciding in each case if they could identify features corresponding to the double quantum dot regime, with no other information available.See Supplementary Material for details of the multi-labeller statistical analysis.

A. Device tuning
To benchmark the tuning speed of our algorithm, we ran it several times on two different devices with identical gate architecture, Devices 1 and 2, and we compared its performance with a Pure random algorithm.The Pure random algorithm searches the whole gate voltage parameter space by producing a uniform distribution of candidate coordinates.It does not include a hypersurface model or pruning rules, but uses peak detection in its investigation phase.
As mentioned in the introduction, we consider a gate voltage space whose dimension is defined by the number of working gate electrodes, and we provide a gate voltage range that avoids leakage currents.While for Device 1 we considered the eight-dimensional parameter space defined by all its gate electrodes, for Device 2 we excluded gate electrode 6 by setting V 6 = 0 mV due to observed leakage currents associated with this gate.
We define the average count C as the number of current maps labelled by humans as displaying double quantum dot features divided by the number of labellers.For a run of the Pure random algorithm in Device 2 and five runs of our algorithm in Devices 1 and 2, we calculated C as a function of laboratory time (Fig. 4a,b).We observe that C is vastly superior for our algorithm compared with Pure random, illustrating the magnitude of the parameter space.
The labellers considered a total of 2048 current maps produced in different runs, including those of the ablation study in Section III B. The labellers had no information of the run in which each current map was produced or the algorithm used.For the Pure random approach, the labelled set was composed of 51 current maps produced by the algorithm and 100 randomly selected from the set of 2048.
The time µ t is estimated by the multi-labeller statistics.The multi-labeller statistics uses an average likelihood of µ t over multiple labellers and produces an aggregated posterior distribution (see Supplementary Material).From this distribution, the median and 80% (equal-tailed) credible interval of µ t is 2.8 hr and (1.9, 7.3) hr for Device 1 and 1.1 hr and (0.9, 1.6) hr for Device 2. Experienced humans require approximately 3 hr to tune a device of similar characteristics into exhibiting double quantum dot features [14].Our algorithm's performance might therefore be considered super human.Due to device variability, the hypersurfaces of these two devices are significantly different, showing our algorithm is capable of coping with those differences.In Fig. 4c,d, we compare the probability of measuring Coulomb peaks in the vicinity of a given v(u), P (peaks), for Pure random and different runs of our algorithm.We calculate P (peaks) as the number of sampled coordinates in the vicinity of which Coulomb peaks were detected over n.In this way, we confirm that P (peaks) is significantly increased by our algorithm.It has a rapid growth followed by saturation.
Figure 4e,f shows the high resolution current maps produced by Pure random and one of our algorithm runs.We observe that our algorithm produces high resolution current maps which are recognized by all labellers as displaying double quantum dot features within 1.53 hr.The three current maps in Figure 4f correspond to double quantum dot regimes found by our algorithm in different regions of the gate voltage space.The number of labellers C who identify the current maps produced by Pure random as corresponding to double quantum dots, C, is 0 or 1.
To significantly reduce tuning times, we then modified our algorithm to group gate electrodes that perform similar functions.The algorithm assigns equal gate voltages to gate electrodes in the same group.For Device 1, we organized the eight gate electrodes into four groups: , and G 4 = (V 4 , V 5 , V 6 ).In this case, the median and 80% credible interval of µ t improve to 0.6 hr and (0.4, 1.1) hr.This approach, by exploiting knowledge of the device architecture, reduces µ t by more than four times.

B. Ablation study
Our algorithm combines a sampling phase, which integrates the hypersurface model (section I A) with selection and pruning (section I B), and an investigation phase that includes peak detection (section II A) and score function decisions (section II B).Each of these modules contributes to the algorithm's performance.An ablation study identifies the relative contributions of each module, justifying the algorithm's architecture.For this ablation study we chose to compare our algorithm, Full decision, with three reduced versions that combine different modules; Pure random, Uniform surface and Peak selection (see Table I).
Pure random, defined in the previous section, produces a uniform distribution of candidate coordinates over the whole gate voltage space.It excludes the hypersurface model and pruning rules.(60 × 60) pixels.
To analyse the algorithm's performance, we estimate P (peaks) and the probability of success, i.e. the probability to acquire a high-resolution current map labelled as containing double quantum dot features, given Coulomb peak measurements P (success|peaks).To take measurement times into consideration, we define t 500 as the time to sample and investigate 500 coordinates in gate voltage space.The ablation study was performed in Device 1 keeping investigation phase parameters fixed.Results are displayed in Fig. 5.
Figure 5a shows that the introduction of the hypersurface model, and selection and pruning, increases t 500 .This is because P (peaks) increases with these modules (Fig. 5b), and thus the number of low and high resolution current maps required by the investigation phase is larger.Within uncertainty, P (success|peaks) remains mostly constant for the different algorithms considered.The result is a decreasing µ t from Pure random to Peak selection within experimental uncertainty.See appendix B for a mathematical analysis of these results.
The reason behind the use of peak detection in all the algorithms considered for this ablation study is the vast amount of measurement time that would have been required otherwise.Without peak detection, the posterior median estimate of µ t for Pure random is 680 hr.
To complete the ablation study, we compare the considered algorithms with the grouped gates approach described in the previous section, keeping parameters such as the current map resolutions are equal.We found µ t = 80.5 mins (see Supplementary Material).
In summary, comparing Pure random and Uniform surface, we show the importance of hypersurface modelling.The difference between Uniform surface and Peak selection highlights the importance of selection and pruning.The improved performance of Full decision with respect to Peak selection evidences the tuning speedup achieved by the introduction of the score function.These results demonstrate Full decision exhibits the shortest µ t and imply an improvement over Pure random without peak detection of approximately 180 times.

IV. DEVICE VARIABILITY
The variability of electrostatically defined quantum devices has not been quantitatively studied so far.We have been able to exploit our algorithms for this purpose.Using the uniform surface algorithm only (no investigation phase), we obtain a set of coordinates on the hyper-surface v a .Changes occurring to this hyper-surface are detected by running the algorithm again and comparing the new set of coordinates, v b , with v a .This comparison can be done by a point set registration method, which allows us to find a transformation between point sets, i.e. between sets of coordinates.
Affine transformations have proven adequate to find useful combinations of gate voltages for device tuning [9,10].To find a measure of device variability, understood as changes occurring to a device's hyper-surface, we thus use an affine transformation v t = Bv b , with B a matrix which is a function of the transformation's parameters.We are looking for a transformation of coordinates that converts v b into a set of coordinates v t which is as similar to v a as possible.
The particular point set registration method we used is coherent point drift registration [21].This method works with an affine transformation which includes a translation vector.We have modified the method to set this translation vector to zero, as the transformation between hyper-surfaces can be fully characterized by the matrix B (see Supplementary Material).
We have used this approach to quantify the variability between devices 1 and 2, and the effect of a thermal cycle in the hypersurface of Device 2. Figure 6 displays the matrix B c = B − I for each case, quantifying how much B, the transformation that converts a set of coordinates from one hypersurface onto the other, differs from the identity matrix (I).Non-zero elements of B c thus indicate device variability.Diagonal elements of B c are responsible for scale transformations and can be interpreted as a capacitance change for a given gate electrode.Off-diagonal elements are responsible for shear transformations and can be interpreted as a change in cross-capacitance between a pair of gate electrodes.Fig. 6a shows B c corresponding to the changes in the hypersurface of Device 2 after a thermal cycle.This transformation shows that device variability in a thermal cycle is dominated by a uniform change in capacitance for all gate electrodes.We have also measured B c for a thermal cycle of Device 1 (see Supplementary Material).Fig. 6b displays B c comparing the hypersurface of Device 1 with the hypersurface of Device 2. We observe that the variability between these devices, which share a similar gate architecture, is given by non-uniform changes in gate electrode capacitance, as well as by changes in crosscapacitance.This variability is attributed to charge traps and other device defects, such as a small differences in the patterning of gate electrodes.

V. CONCLUSION
We demonstrated an algorithm capable of tuning a quantum device with multiple gate electrodes in approximately 70 mins.This was achieved by efficiently navigating a multi dimensional parameter space without manual input or previous knowledge about the device architecture.This tuning time was reproduced in different runs of the algorithm, and in a different device with a similar gate architecture.Our tuning algorithm is able to tune devices with different number of gate electrodes with no modifications.We showed that gate electrodes with similar functions can be grouped to reduce the dimensionality of the gate voltage space and reduce tuning times to 36 mins.Tuning times might be further improved with efficient measurement techniques [22], as measurement and gate voltage ramping times were found to be the limiting factor.We analysed our algorithm design through an ablation study, which allowed us to justify and highlight the importance of each of its modules.The improvement over the pure random search without peak detection is estimated to be 179 times.
We showed that device variability can be quantified using point set registration by uniform sampling of the hypersurface separating regions of high and low current in gate voltage space.We found that variability between devices with similar gate architectures is given by non-uniform changes in gate capacitances and cross-capacitances.Variability across thermal cycles is only given by a uniform change in gate capacitances.
Other device architectures might use the sampling phase of our algorithm as a first tuning step, and the investigation phase can be adapted to tune quantum devices into more diverse configurations.To achieve full automated tuning of a singlet-triplet qubit, it will be necessary to go beyond this work by tuning the quantum dot tunnel barriers, identifying spin-selective transitions, and configuring the device for single-shot readout.
Appendix A: The score function as a classifier One of key strength of the proposed algorithm is that it does not require an ideal score function.It is important to highlight that we are using the score function just as a classifier, instead of aiming at finding the gate voltage configuration that maximises the score.The reason for this is threefold; (i) the score function is not always a smooth function; (ii) it does not always capture the quality of the transport features; (iii) it is just designed for a particular transport regime, in this case, honeycomb patterns.Therefore, the score threshold acts as a parameter that just controls the characteristics of the classifier.If the threshold is low, many high resolution scans not leading to double quantum dot transport features are produced.If the threshold is too high, then promising gate voltage windows are missed.The optimal threshold can be estimated by minimising the time required to produce a high-resolution current map that is labelled by humans as containing double quantum dot features.

Appendix B: Mathematical analysis of ablation study results
The results in the ablation study can be verified under a few assumptions by a mathematical derivation of µ t (see Supplementary Material).From this derivation, we can compare the expected times t abl = t 500 /500 and µ abl t for Pure random, Uniform surface, and Peak selection, where t 2D = t 2D-L + t 2D-H with t 2D-L(H) the expected measurement times corresponding to low (high) resolution current maps acquired in a sampling iteration.t others accounts for the rest of the investigation and sampling time, including ramping of gate voltages, peak detection, and computation time.The simulation of the Brownian particles is conducted in parallel with the investigation phase of the coordinate proposed by the sampler in a previous run, and it does not increase t others .Note that the ablation study algorithms do not include the score function decision, so high resolution current maps are always acquired.In this case, low resolution maps are not useful, but we have still included t 2D-L in t 2D to keep the comparison between algorithms consistent.We estimate t others ∼ 35 s, t 2D-L ∼ 33 s and t 2D-H ∼ 273 s.
Equations (B1) and (B2) confirm our results in Fig. 5a, as t abl grows for increasing P (peaks).They also show that the observed decrease in µ abl t is a consequence of an increase in P (peaks), given that P (success|peaks) remains mostly constant within experimental uncertainty (Fig. 5b).t 2D is constant and t others increases only slightly from Pure random to Peak selection (see Supplementary Material).We observe that optimising t 2D is key to decrease µ t , motivating the introduction of a score function.
For the Full decision algorithm, µ t and t are, , where P (highres) is the probability of acquiring a high resolution current map.The probability of acquiring a high resolution current map conditional on the observation of Coulomb peaks is P (highres|peaks).
The score function decision reduces t, because t abl − t full = P (peaks)(1 − P (highres|peaks))t 2D-H and P (highres|peaks) < 1.The reduction in t given by the introduction of the score function decision is observed in Fig 5a.
Comparisons between µ abl t and µ full t can be affected by the dependence of P (success|peaks) on the score function threshold.In Fig. 5b, however, we observe that P (success|peaks) is similar for Peak selection and Full decision.This implies that the introduction of a score function threshold does not reduce the probability of success.
In this case, This equation confirms that that the score function reduces µ t in the case that the score function threshold does not degrade P (success|peaks).Further analysis on the optimal threshold, i.e the threshold that minimizes µ full t , can be found in Supplementary Material.

SUPPLEMENTARY MATERIAL Hypersurface detector
Finding the hypersurface boundary or 'pinch-off' boundary requires an expected maximum and minimum current value.This is obtained by a single current trace before running the algorithm.The pinch-off is defined by a threshold, chosen as 20% of the maximum current.The algorithm measures a trace in direction u from o: ux i + o, where i is the step number and x i is the distance to the origin at the ith step.We choose x 0 = 0 and x i+1 = x i + 10 mV.For each step i, the pinch-off detector identifies the gate voltage coordinate for which the measured current is lower than the threshold.If the current continues to be below the threshold for a certain range R, set to 50mV for all experiments, the algorithm considers the device is pinched off.

Gaussian process models
In this paper, we use two separate models that rely on Gaussian processes: one for estimating the hypersurface r(u), and the other for estimating the location of Coulomb peaks Ppeak (v).Each Gaussian process requires a prior distribution, which is defined by a prior mean function and a covariance kernel.By setting the prior mean function m r (•) and covariance kernel k r (•, •), the prior distribution is r(u) ∼ N (m r (u), k r (u, u)), and the covariance is cov[r(u 1 ), r(u 2 )] = k r (u 1 , u 2 ).
For the model of r(u) we initially have no prior knowledge, except for the maximum and minimum gate voltage bounds.Thus the range of r is: r ∈ where N is the number of gate electrodes.We set the prior distribution in this range: m r (u) = (r max − r min )/2, k r (u, u) = (r max − r min ) 2 /4 2 .This prior means that r max and r min will be two standard deviations away from m r (u).
To model the probability Ppeak (v), we factorise it into two components Ppeak (v) = Ppeak|valid (v) Pvalid (v), where Pvalid (v) is the probability that v corresponds to a pinchoff location, and not to the boundary of the operable gate voltages [0,-2] V.Each component of Ppeak (v) is estimated with a Gaussian process classification model.
For all Gaussian process models the Matern 5/2 kernel is used for the prior covariance.The prior covariance is defined by length scales l k with k = 1, ..., N , setting the smoothness of the model predictions.These length scales are periodically optimised to the Maximum a posterior (MAP) using gathered data and a prior gamma distribution.The gamma distribution is set to have mean 0.4, and variance 0.1 2 for r(u); mean 500 and variance 100 2 for Ppeak|valid ; and mean 50, and variance 20 2 for Pvalid .

Score function
The score function is designed to distinguish between measurements containing double quantum dot features, single quantum dot features, and measurements for which the confinement potential is not well defined.This could be achieved using modern machine learning techniques, however, the score function we use is not reliant on training data and hence is less device and labeller specific.The score consists of three separate scores multiplied: the orientation of gradients score, the sharpness score, and the fit direction score.
The orientation of gradient score aims to capture if Coulomb peaks appear as straight lines in a current map or if a honeycomb pattern is present.This is achieved by taking the numerical derivative of the current map and producing gradient unit vectors.These gradient vectors will exhibit a distribution of angles.If Coulomb peaks appear as straight lines, then the distribution of angles will reveal two high-density collections of unit vectors, separated by π rad from each other.In this case, the high-density collections will have a small variance and a single line can fit the two points they define.For a honeycomb pattern, this variance will be larger or the high-density collections will not be well defined, and thus at least two lines will be required to fit the high-density collections.The value of the gradient orientation score is defined as the difference between the average of residuals corresponding to a fit through the high-density collections.Gradients smaller than a certain value are not considered for the fitting.This threshold was inferred from the current noise.
The sharpness score aims to prioritise 'sharp' Coulomb peaks.To compute the sharpness score the current map is first spit up into 16 smaller tiles.Each title is segmented into regions of high current and low current using a simple threshold.Then, the second order derivative of the current as a function of gate voltage is computed.This second order derivative is averaged over the high current regions and its standard deviation is calculated.The score for each tile is defined as the product of these average and standard deviation values.The scores for the 16 tiles are then averaged together giving the final sharpness score.
The fit direction score aims to further differentiate between single and double quantum dot behavior.The current map is first spit up into 16 smaller tiles, as for the sharpness score.For each of these tiles, a lines is fit to the high-density collections of gradient vectors, in the same way as for the orientation of gradient score.From this fit, a gradient vector angle is assigned to the tile.For each row of tiles (in direction Ve ), the standard deviation of gradient vector angles is calculated.In this way, the array of 4 × 4 tiles is reduced to a 1 × 4 vector.The average of the elements in this vector, defines the fit direction score.

Labelling procedure
Current maps that are measured by the algorithms in the tuning and ablation study sections, are labelled to determine if they contain features that indicate the device is tuned to the double quantum dot regime.This labelling is performed by human experts.To remove human bias, labelling was performed by 3 independent labellers.
Two separate data sets were labelled.The first contained 2048 current maps taken from all experiments presented in this paper, except the pure random experiment in section III A. The second contained 151 current maps, 51 of which were taken from the pure random experiment in section III A and 100 were randomly selected from the first data set.The current maps in each data set were randomly shuffled so that labellers could not identify which experiment produced a given current map.The labellers were the same for both data sets.

Point set registration
Coherent point drift [21] was our choice of point set registration, since it can be used for point sets of high dimensionality.It is also simple to implement, it is robust to noise in the point sets, and supports affine transformations as well as rigid and non-rigid transformations.We focus on affine transformations T (x) = Bx + t, where B is a square matrix, and t is a column vector.Reference [21] presents the mathematical derivation for this case.In our application of the affine transformation, t does not have significance in terms of the device physics, and it was therefore set to 0. A derivation for the case t = 0 was required.This was achieved by setting the centered point matrices X = X and Ŷ = Y in the derivation from section 4 (Rigid & affine point set registration) in Ref. [21].

Single-labeller statistics
To infer the expected tuning time µ t , and the probabilities P (peaks), and P (success|peaks), we use Bayesian inference with Jeffreys prior.Let p denote the probability (either P (peaks) or P (success|peaks)) that we want to estimate.The Jeffreys prior for a binomial distribution is p ∼ Beta(0.5, 0.5).If we observe k successes over n trials, then the posterior distribution of p is p|k, n ∼ Beta(0.5 + k, 0.5 + n − k).
To infer µ t , we assume that the rate of success over time follows a Poisson distribution k ∼ Poisson(λt tot ), where t tot is the total time for an algorithm run, and λ is a rate parameter.The expected time between two consecutive successes (i.e. the acquisition of currents maps displaying double quantum dot transport features) is µ t = 1/λ.The Jeffreys prior for a Poisson distribution is λ ∼ Gamma(0.5, 0), hence the posterior is λ|k, t tot ∼ Gamma(0.5 + k, t tot ).Consequently, µ t |k, t tot ∼ Inv-Gamma(0.5 + k, t tot ).

Multi-labeller statistics
Before introducing multi-labeller statistics, let us introduce necessary notation: D = {(v i , Y i )|i = 1 ∼ n}, where v i is a voltage configuration, and Y i is a high-resolution scan in the vicinity of v i .If no high-resolution scan was acquired at v i , then Y i is an empty array.The data D requires a labelling function ψ: ψ(Y i ) = 0 if there is no recognized double quantum dot transport features, or 1 otherwise.But labellers might not agree, and thus we need to marginalise this effect.Let θ denote a parameter to be inferred (µ t , P(peaks), or P(success|peaks)), then p(θ|D, ψ) = p(D|θ, ψ)p(θ)/p(D), where Ψ is a set of labellers.Note that ψ only affects the likelihood.To minimize the affect of ψ, the posterior can be marginalised over ψ and also approximated by samples of ψ: where |Ψ| is the number of labellers.Therefore, the cumulative distribution function of θ|D is

Mathematical derivation of µt
We define t i as the time taken to produce the ith sample, and ti = i j=1 t j .We also define t s as the interval between successes, or from start of the algorithm run to the first success if there were no previous success.Likewise, n s is the iteration number corresponding to the first success.We want to derive E[t s ].For brevity, we define shortened notations: α = P (peaks), β = P (success|peaks).Note that P (success) = αβ.
Then the distribution of t s is The expected time is The last term E[ ti |n s = i] depends on the tuning algorithm.
Derivation of µ abl t The time for each iteration of the algorithm is t n = t n,other + t n,2D I n,peaks , where I n,peaks is equal to 1 if peaks are detected at v(u n ) and 0 otherwise, t n,2D and t n,other is the time taken for 2D scans and other related times for iteration n.Consequently, tn = n i=1 t i,other + n i=1 t i,2D I i,peaks , and tn |(n s = n) = n i=1 t i,other + n−1 i=1 t i,2D I i,peaks + t n,2D , because n s = n implies that 2D scans are measured at n.We assume that t i,other and t i,2D are i.i.d.across i, and we denote the expected time t others and t 2D , respectively.Then the conditional expectation is For brevity, we use the following shortened notations: α = P (peaks, highres) = P (highres), β = P (success|peaks, highres) = P (success|highres).Deriving the conditional expectation is very similar to the previous section, Therefore, the marginalised expectation is Note that P (success) = αβ = α β , which leads to β = αβ/α .Also, α/α = P (peaks)/P (highres) = 1/P (highres|peaks).Substituting these yields where t OL = t others + αt 2D-L .Note that β is the recall, which is the proportion of the real double dot samples over all positively classified samples, of a classifier, and it is related with α .Therefore, a model of β in terms of α is required.

Perfect score function
We define a score function is perfect if there exists a threshold that perfectly classifies double-dot samples from others.In this case, The model for β is β = min(1, q/α ), where q is the actual proportion of double-dot samples.In other words, the recall is 1 if α ≤ q, or q/α otherwise.It is trivial to see that the expected time (S4) is minimum at α = q.If q is unknown, which is usual, then we can utilise a prior distribution on q and optimise α given the prior distribution.
In summary, from this analysis we conclude that it is beneficial to keep the threshold as big as possible if the recall does not drop significantly.Also, the incentive of using a score function as classifier is higher when the relative cost of a high resolution scan is big.

Imperfect score function
Let us assume that a score function is perfect with the probability ξ, or the score is random with the probability (1 − ξ).Then the model for β is β = min(q, α )ξ + α q(1 − ξ) α .
Again, it is trivial that the expected time (S4) increases when α increases if α ≤ q.
In case of α > q, (S4) becomes and the partial derivative w.r.t.α is Note that α does not change the sign of the derivative, but ξ changes the sign.Rearranging ∂E[ts|α ]   ∂α = 0 yields ξ = tOL tOL+t2D-H .It means that the optimal α is q if ξ > tOL tOL+t2D-H .Otherwise, increasing α always decreases the expected time, hence setting α as big as possible reduces the time.It means that there is no advantage of using the score function if ξ < tOL tOL+t2D-H .Figure S2 shows the tuning performance, on device 1, of the grouped gates implementation of the algorithm, under the same conditions as Section IIIA in the main text.We have also included the grouped gates implementation of the algorithm, under the same conditions as in Section IIIB, in Fig. S3.Additional results for the device variability study in Section IIIC, where device 1 is thermal cycled, is displayed in Fig. S4.The raw data used to calculate tuning performance in Section IIIA is shown in Table S1.The raw data used to calculate tuning performance in Section IIIB is shown in Table S2.

FIG. 1 .
FIG. 1.Overview of the device, and gate voltage space.a Schematic of a gate-defined double quantum dot device.b Left: Boundary hypersurface measured as a function of V2, V5, and V8, with fixed values of V1, V3, V4, V6 and V7.The current threshold considered to define this hypersurface is 20% of the maximum measured current.The gate voltage parameter space, restricted to 3D for illustration, contains small regions in which double and single quantum dot transport features can be found.These regions typically appear darker in this representation because they produce complex boundaries.Right: For particular gate voltage coordinates marked with green crosses, the current as a function of V7 and V3 is displayed.The top and bottom current maps display double and single quantum dot transport features, respectively.
v o lt a g e 3 Pruning (only for the first 30 iterations)

FIG. 4 .
FIG. 4. Algorithm's performance.a-d Average number of current maps displaying double quantum dot features, C, and P (peaks) as a function of laboratory time.Current maps are labelled by humans a posteriori, i.e. after the algorithm is stopped.a,c and b,d correspond to one run of Pure random and five runs of our algorithm, respectively.All algorithm runs displayed in main panels were performed in Device 2, while insets show runs of our algorithm in Device 1. e,f High resolution current maps produced by Pure random and one of our algorithm runs, respectively.We indicate the laboratory time at which they were acquired and the number of labellers, C, that identified them as displaying double quantum dot features.Current maps are ordered from left to right in decreasing order of C, and maps that have identical values of C are displayed in the order at which they were sampled.Each panel uses an independent colour scale running from red (highest current measured) to blue (lowest current).

FIG. 5 .
FIG. 5.Ablation study.a and b bar charts comparing µ (light green), t500 (dark green), P (peaks) (dark blue) and P (success|peaks) (light blue) for the different algorithms considered.Error bars represent 80% (equal-tailed) credible intervals.Due to a measurement problem, 459 sampling iterations instead of 500 were considered for the Full decision algorithm.c to f High resolution current maps sampled by Pure random, Uniform surface, Peak selection and Full decision, respectively.In each panel, we indicate C, the number of human labellers that identified a map as displaying double quantum dot features.Current maps with identical values of C are displayed in the order in which they were sampled, from top to bottom.Current maps with C = 0 were randomly selected.Each panel uses an independent colour scale running from red (highest current measured) to blue (lowest current).

FIG. 6 .
FIG. 6. Learning about device variability.Bc matrices obtained using point set registration.Indices are the gate voltage coordinates v b and vt.V6 = 0 mV was fixed in Device 2 to prevent leakage currents.a Transformation between the hypersurface of Device 2 before and after a thermal cycle.b Transformation between the hyperfsurfaces of Device 1 and Device 2.

Figure
Figure S1 is a detailed block diagram of the algorithm.FigureS2shows the tuning performance, on device 1, of the grouped gates implementation of the algorithm, under the same conditions as Section IIIA in the main text.We have also included the grouped gates implementation of

TABLE I .
Comparison of algorithms used in the ablation study.