Abstract
Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and errorprone, they will depend on faulttolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if—and only if—the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors, and cannot be compared directly to diamond norm thresholds. Here we use gate set tomography to completely characterize operations on a trappedYb^{+}ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm ≤6.7 × 10^{−4}).
Introduction
The field of quantum information processing has seen great growth over the past 30 years, driven by exciting quantum algorithms inaccessible to classical computers. Small quantum information processors have been demonstrated experimentally using superconducting circuits^{1,2,3}, electrons in semiconductors^{4,5,6}, trapped atoms and ions^{7,8,9} and photons^{10,11,12}. Trapped ions are among the most reliable qubits available today; up to 14 qubits have been addressed in a single trap^{13}, a 5qubit quantum information processor has been realized^{14}, and singlequbit gates have demonstrated randomized benchmarking (RB) infidelities as low as 10^{−6} (refs 15, 16, 17).
Unlike classical bits, qubits are intrinsically noisy and errorprone, and will require active, faulttolerant quantum error correction (FTQEC^{18}) to operate reliably. To function, FTQEC requires physical qubit operations to be high quality, with errors below a specific threshold. Fault tolerance (FT) thresholds for quantum computing have been proven against various noise models, and generally require pergate failure rates between 10^{−6} and 10^{−2} (refs 19, 20, 21, 22). However, the particular metric of ‘error rate’ depends on the noise model. Against realistic general errors, including small unitary errors, thresholds are stated in terms of the gates’ diamond norm error, (refs 20, 23, 24).
RB^{25,26}, the most commonly used method for qubit characterization, measures a single error rate () that closely approximates the gates’ average process infidelity. Because RB is relatively insensitive to unitary errors^{27} that dominate diamond norm error^{28} and have unpredictable consequences for FTQEC^{20}, it cannot efficiently measure diamond norm error to high precision. This makes it nearly impossible to demonstrate suitability for FT using RB alone, unless errors are assumed to be strictly incoherent. There are variants of RB that characterize and report additional parameters, but none of them are well suited for diamond norm characterization or comparison to FT thresholds^{29,30,31}.
We use a characterization method called gate set tomography (GST)^{32,33,34} to systematically debug and improve a 1qubit trappedYb^{+}ion quantum information processor, and—finally—to demonstrate with very high confidence that all three of its quantum logic operations surpass a proven threshold for FTQEC. GST provides a full and extremely accurate tomographic description of every gate, complete with statistical confidence bounds. We use this information to iteratively improve our singlequbit operations and to place tight bounds on the diamond norm error of the final gates, producing the first singlequbit gates whose errors are demonstrably below a rigorous threshold for faulttolerant error correction.
This is not a demonstration of FTQEC, which requires not just singlequbit gates, but also highfidelity twoqubit gates, repeatable measurements and (of course) more qubits. However, the GST methods that we use here to demonstrate 1qubit gate errors below the threshold do generalize to 2qubit gates, to the characterization of repeatable measurements, and to important properties of multiqubit systems such as crosstalk. So, while pushing singlequbit gate errors below the threshold is only one step toward achieving FTQEC, it is an important one.
Results
Gate set tomography
Our goals are (1) to implement quantum operations satisfying a FTQEC threshold and (2) to ‘prove’—that is, demonstrate conclusively—that we have done so. Genuine proofs are the domain of mathematics. In experimental science, the highest achievable standard is to provide experimental data (or summary statistics) that (1) are consistent with the desired outcome; and (2) are inconsistent with any other plausible theory, and thus rule out all alternatives to some high level of confidence. Our intent is not to provide an exclusive protocol for such demonstrations, but rather to establish that GST is sufficient to do so.
While RB and quantum process tomography^{35,36,37} can be used in this fashion, they each face nontrivial obstacles. RB’s insensitivity to unitary errors makes it a poor tool for bounding worstcase error rates (diamond norms). In process tomography^{5,38}, small calibration errors in the gates used to implement different measurements propagate to the final results, invalidating them. GST is a selfcalibrating tomography protocol that solves both of these problems. GST protocols based on short quantum circuits were developed at IBM^{38} and Sandia^{32}. The longsequence GST protocol demonstrated here is orders of magnitude more precise.
GST relies on two assumptions: (a) the system being characterized is a qubit with a 2dimensional Hilbert space; (b) each gate operation is stationary and Markovian. It treats the qubit as a black box with operation buttons (one for initialization, one for measurement and the rest for gate operations) as shown in Fig. 1a, and selfconsistently determines all operations up to a choice of basis (a gauge; see ‘Methods’ section). It can also detect and quantify violations of these assumptions (see next subsection).
In GST, the real (noisy) gates are modelled as tracepreserving linear maps on density matrices (TP maps). Such maps must be completely positive to be physical, and thus are usually referred to as ‘CPTP maps’; for technical reasons, we do not always impose the CP constraint in GST, but otherwise these maps are functionally the same as CPTP maps.
The qubit’s quantum state ρ is a fourelement vector in the vector space of 2 × 2 Hermitian matrices (Hilbert–Schmidt space)^{32}, and each gate is a 4 × 4 matrix G that acts on by left multiplication (that is, ). Measurement is represented by a twooutcome positive operatorvalued measure (POVM) {E, −E}. Our target state preparation and measurement (SPAM) are and .
Data for GST come from gate sequences (quantum circuits), each comprising: (1) initialization, (2) a series of gates and (3) measurement. Each sequence is repeated N times, and the frequency of 0/1 counts is recorded. In the experiments reported here, we implemented and used the set of gates {G_{I}, G_{X}, G_{Y}}, but GST can analyse any gate set rich enough to prepare an informationally complete set of probe states and measurements.
GST analysis proceeds as shown in Fig. 1b. First, a specific set of short sequences is analysed by linear inversion (see ‘Methods’ section) to get a rough estimate of the gates and SPAM operations. This estimate has an unavoidable gauge freedom; every observable probability is invariant under
for any invertible matrix M. We choose a gauge that makes the estimated gates as similar to the target gates as possible (see ‘Methods’ section). If the rough estimate is not already completely positive, we truncate each gate to the nearest CP map, to ensure physically valid probabilities in the next step. Next, using the rough estimate as a starting point, we iteratively add more data. In the mth iteration, we add data from gate sequences of length 2^{(m−1)} into the pool, then numerically adjust the estimate to minimize the χ^{2} divergence between the observed frequencies and estimated probabilities. This ‘minχ^{2}’ estimate is then used as the seed for a numerical maximization of the likelihood function . Finally, we perform another gauge optimization to maximize similarity to targets.
The GST gate sequences (see Fig. 1d) are chosen to (collectively) amplify every physical parameter in the gate set. Short sequences called germs are repeated many times, and these ‘germ power’ sequences are pre and postfixed by each of six fiducial sequences. In this work, we use six fiducial sequences, {, G_{X}, G_{Y}, G_{X}G_{X}G_{X}, G_{Y}G_{Y}G_{Y}, G_{X}G_{X}}, where denotes the null sequence, and G_{X} (G_{y}) are noisy π/2 rotations about x (y). These fiducials map ρ and E to (approximately) the six Pauli eigenstates, defining an informationally complete experimental reference frame. For further details on sequence design and a complete list of all experiments performed, see ‘Methods’ section.
Experiment
Our qubit is a single ^{171}Yb^{+} ion in a stateoftheart linear surface ion trap (Fig. 1a). Ions are trapped by photoionizing neutral ytterbium vapour that reaches the trapping volume through a slot from the back of the surface trap chip. The qubit is encoded in the hyperfine clock states of the ^{2}S_{1/2} ground state of ^{171}Yb^{+}: , . Standard laser cooling techniques are applied to Doppler cool the ion and prepare it in the 0> state^{39}. Standard fluorescence state detection^{39} is used to measure in the basis.
Three logic gates—G_{I} (the idle or identity gate), G_{X} (a π/2 X rotation) and G_{Y} (a π/2 Y rotation)—are realized by using a microwave horn to apply pulses nearresonant with the 12.6428 MHz separation of the qubit levels. Broadband composite pulses (BB1)^{40,41} are employed to minimize sensitivity to amplitude fluctuations in the microwave signal.
Using characterization procedures to debug and improve qubits has a long history. A Ramsey fringe or Rabi oscillation experiment is a kind of limited tomography, combined with physical intuition, which is used to improve the quality of the quantum logic operations. In typical tuneup procedures, different types of these experiments are iterated, until the qubit gates are deemed good enough to proceed.
Various improvements to this generic tuningup scheme outlined above have been offered, including augmenting oscillation experiments to detect microwave pulse distortions^{42}, designing small sequences for error amplification^{43}, supplementing sequence experiments with RB to do detailed noise spectroscopy^{44} or replacing them with iterative RB to guide the system toward higher RB fidelity operations^{45}. All of these techniques, and other experiments combined with physical intuition, can be and have been used to produce qubits with very highfidelity operations. In comparison, GST has the distinct advantage in that it includes all experiments necessary for full and loopholefree qubit characterization. It can be though of as systematic statistical inference on a provably sufficient set of Rabi/Ramsey experiments.
We used GST to analyse systematically and improve our trappedion qubit operations over the course of five experimental runs from 17 April 2014–30 March 2015. Experiments #1–2 used the Sandia Thunderbird trap^{46}, and Experiments #3–5 used Sandia’s highopticalaccess (HOA2) trap. Figure 2a summarizes the gates’ steady improvement over this period by tracking their process infidelities^{47}, which corresponds to the RB error rate^{48}.
Experiment #1 detected severely nonMarkovian behaviour. We sought to address this by stabilizing the microwave amplifier’s temperature, and stabilizing microwave πtimes using active feedback (drift control), as described in ‘Methods’ section. Experiment #2 showed improved fidelity in the G_{X} and G_{Y} gates, but no reduction in nonMarkovianity. We then moved our qubit to the HOA2 trap, and improved trap stability. In Experiment #3, GST showed significant improvements in fidelity and Markovianity, and that G_{I} remained worse than the other gates. To improve it, we changed G_{I} from ‘do nothing for one clock cycle’ to the dynamical decoupling pulse sequence X_{π}W_{1.25π}(X)_{−π}W_{1.25π}, where X_{π} and Y_{π} denote π rotations around X and Y, respectively, and W_{1.25π} means ‘wait for the duration of a 1.25πpulse’. We also applied active drift control of the qubit frequency, and improved the calibration of the BB1 pulse sequences. Experiment #4 showed reduction of coherent errors in G_{X} and G_{Y}, but persistent nonMarkovian errors in G_{I}. After we upgraded G_{I} to the secondorder dynamical decoupling sequence X_{π}Y_{π}X_{π}Y_{π} (ref. 49), Experiment #5 demonstrated uniformly excellent gates. Subsequent analysis indicates that the improved performance of G_{I} stemmed largely from the constant duty cycle of the microwave system, rather than from the intrinsic properties of the decoupling sequence used. This emphasizes that GST can identify specific errors, but not necessarily their cause. The estimated process matrices for the gates are shown in Fig. 3a. Figure 3b shows the error generators, defined as , where is the estimate and G_{0} is the target.
Demonstrating suitability for FT
Useful quantum computation is expected to require faulttolerant error correction. The most important milestone for a quantum operation is, therefore, ‘Is it suitable for use in FTQEC?’ Operations that induce too much error will cause FTQEC protocols to fail. Demonstrating conclusively that gates are suitable for FT requires: (1) establishing a sufficient condition for the gates to not induce failure; and (2) showing that the gates satisfy that condition, with high confidence, by means of experimental data that are inconsistent with all gates that do not satisfy the condition.
Demonstrating suitability for FT using infidelity alone is hard. Threshold theorems against general errors (arbitrary CP maps) are stated in terms of the diamond norm distance between the real and ideal gates^{24,50},
where d is the system’s Hilbert space dimension, is the trace norm and the supremum is over density matrices ρ with dimension d^{2} (ref. 51). Because the diamond norm error can be as large as (refs 27, 28), even a spectacular RB result like =10^{−6} (ref. 15) only establishes an upper bound of 10^{−3} on the diamond norm. The bestknown proof of FT against general noise^{20} derived a threshold of 2.3 × 10^{−5} against stochastic noise, and generalized it to a diamond norm threshold of ∼10^{−5} against general (coherent) noise. This was subsequently improved to 1.94 × 10^{−4} (ref. 21) and finally to 6.7 × 10^{−4} (ref. 22), the highest (currently) proven threshold against general noise.
Unlike RB, GST enables direct computation of the diamond norm between the estimated and target gates (we use a semidefinite program^{52}). Figure 2b shows the diamond norm error of our gates over time, culminating on 30 March 2015 in diamond norm error rates (with 95% confidence intervals) of (1.58±0.15) × 10^{−4}, (1.39±0.22) × 10^{−4} and (1.62±0.27) × 10^{−4} for G_{I}, G_{X} and G_{Y} respectively. All three gates surpass the threshold with 95% confidence. (In point of fact, they surpass even the older 1.94 × 10^{−4} threshold with 95% confidence.)
We note that, although we only demonstrated Clifford operations, and nonClifford operations are needed for universal control, FTQEC is possible only using Cliffords. Furthermore, we can still extrapolate the performance of nonClifford gates (for example, a T gate) in our system. A pessimistic estimate of the error on an X π/4 rotation, for example, would simply be the same as the error on the X π/2 gate that we characterized. This is because implementing the X π/4 gate in practice is equivalent to running the X π/2 gate for a shorter duration.
Quantifying nonMarkovianity
In real experimental systems, repeated quantum operations are never actually identical. For example, experimental imperfections in the stability of the system may cause quantum operations to drift over time. Collectively, we refer to all such nonrepeatability as nonMarkovianity. It represents a significant potential problem for FT, as proofs of FT thresholds are typically carried out using Markovian error models. So, to be confident that a gate set is suitable for FTQEC (that is, achieves a FT threshold), we would like to demonstrate that nonMarkovian behaviour is absent. This is not feasible, for two reasons. First, all physical systems (including qubits) are at least a tiny bit nonMarkovian. Second, ‘nonMarkovian noise’ is so general that there is always some conceivable mechanism that would elude detection by any protocol (not just GST). Our goal is to reduce detectable nonMarkovian behaviour to the point where its visible effects are consistent with the FT threshold.
We use GST results to debug nonMarkovian effects and achieve this goal, as illustrated in Fig. 2c. Doing this is nontrivial, because neither GST nor process tomography is actually designed to characterize nonMarkovianity. In GST’s underlying model, the qubit is Markovian: its state at time t+1 is determined completely by (1) its state at time t and (2) the operation applied at time t. This assumption is far reaching. It implies that noise in the logic gates is stationary, uncorrelated in time, memoryless and independent of context (for example, what gates were recently applied). It implies that the gate operations can (for a single qubit) be represented as static 4 × 4 superoperators, and that state preparation and measurements may each be represented as static fourdimensional vectors and dual vectors (respectively), in Hilbert–Schmidt space.
As indicated above, the Markovian assumption is not strictly true for any experimental system. In addition to slow drift, there may be correlations between errors in consecutive gates, and the ‘qubit’ may not even be a twolevel system (for example, due to leakage levels). These are all examples of nonMarkovianity, and lie outside the GST model.
In principle, all guarantees about GST are void in the presence of nonMarkovian noise, as there are no process matrices to measure or report. However, for many typical nonMarkovian behaviours, GST degrades in a quantifiable way. These kinds of nonMarkovian noise cause data that are consistent with no Markovian gate set, and this failure to fit the data can be quantified. Since data generated by any Markovian model could be fit with predictable accuracy, significant badnessoffit can be interpreted as violation of the model and therefore as nonMarkovianity, though the particular type cannot be easily identified. As long as the data appear sufficiently Markovian, the GST estimate will be fairly reliable and have significant predictive power.
To quantify nonMarkovianity, we consider the loglikelihood (equation (13))
where f_{s}=n_{s}/N. The best conceivable fit to a data set would be one where p_{s}=f_{s} for every sequence s. Thus, the entropy of a data set is an upper bound on log ,
We define the quantity .
Standard properties of maximumlikelihood estimation theory^{53} imply that if
 1
The data were in fact generated by some gate set,
 2
There are N_{p} free parameters in the gate set, and
 3
The data set contains N_{s}>N_{p} distinct gate sequences
then as N→∞, 2Δlog is a random variable with a χ_{k}^{2} distribution, where k=N_{s}−N_{p}. This means that its expected value is , and its RMS variance is . Thus, if the fit is ‘good’, then 2Δlog should lie roughly within the interval . Hence, by comparing the difference 2Δlog −k to , we can determine how well the Markovian model was able to fit the data.
We quantify goodnessoffit by N_{σ}, the number of s.d.’s from the expected mean the expected mean the loglikelihood score is:
We can also calculate 2Δlog for individual experiments or subsets of gate sequences. Figure 4 illustrates this, where 2Δlog is shown for every individual experiment associated with each power of each germ (for a total 36 experiments per collection, due to six preparation fiducials and six measurement fiducials). This analysis makes it possible to see whether nonMarkovianity increases with sequence length (it usually does, because longer sequences amplify slowly varying noise), and which sequences are particularly inconsistent with the best Markovian fit.
Figure 4 compares 2Δlog scores for a simulated (perfectly Markovian) data set to those for two experimental data sets, one from 2 March 2015, and the other from 30 March 2015. The March 2 experimental data set is highly nonMarkovian, while the March 30 data set looks very similar to the simulated data set. These data demonstrate the degree to which we are able to stabilize our qubit and reduce nonMarkovian effects. In the final run on March 30, Markovianity is violated only at the 4σ level. While this is statistically significant—it implies with high confidence that the gates are not perfectly Markovian—it is not practically significant. To see this, recall that this is an extraordinary sensitive experiment, as witnessed by the fact that the error bars on the diamond norm are ±2 × 10^{−5}. This sensitivity extends to nonMarkovian behaviour as well. Reducing the sensitivity of the experiment by a factor of 4 (either by reducing maximum L by a factor of 4, or by reducing N by a factor of 16) would render the nonMarkovianity undetectable, at the cost of increasing the error bars by a factor of 4 to ±8 × 10^{−5}. This implies that the observed nonMarkovianity is effectively equivalent to less than 10^{−4} additional diamond norm error, which is comfortably below the threshold.
Comparison to randomized benchmarking
As of this writing, RB is the de facto standard in qubit characterization. As a consistency check, we perform RB simultaneously with the final GST experiment (by interleaving the GST and RB sequences over the entire period of experimentation), to see whether GST correctly predicted the results of RB. We follow the experimental and analysis procedure of ref. 54, and use RB sequences ranging in length from two gates to 1,970 gates.
Strictly speaking, RB measures the error rate per Clifford operation. Our Clifford operations are, as is usual, compiled into elementary {G_{i}, G_{x}, G_{y}} gates, with an average of 3.125 elementary gates per Clifford. Analysis of the data in strict accordance with the literature (that is, plotting survival probability versus # of Cliffords in the sequence) yielded an experimental error rate of (1.65±0.03) × 10^{−4} per Clifford operation. Dividing this by 3.125 (a questionable but common practice) suggests a pergate error rate of about =(5.28±0.10) × 10^{−5}.
However, our main goal is to compare the RB data with GST’s predictions for it. For this purpose, we find it more informative to fit (and plot; see Fig. 5) the observed probabilities versus the number of elementary gates in the sequence. All the rest of the analysis in this section is based on this analysis method, which yields a perelementary gate RB error rate of =(5.31±0.16) × 10^{−5}. Error bars are 95% confidence intervals. The experimental error bars are calculated via nonparametric bootstrap (by resampling the experimental data with replacement). We then simulate those RB experiments using the GST estimates. The GST results predict an RB error rate of =(4.53±0.25) × 10^{−5} (see Fig. 5a,b). The simulated error bars are calculated via parametric bootstrap. (The GST estimate is used to generate many sets of simulated GST experiments, each of which in turn yields a new GST gate set estimate. This ensemble of estimates then generates an ensemble of simulated RB decay rates, from which the simulated error bars are derived).
While these decay rates are nearly identical, there is a statistically significant discrepancy. The most obvious explanation is a flaw in the GST analysis, but we find that extensive simulations with known Markovian gates rule this possibility out. We believe that the discrepancy stems from physical causes—that is, from nonMarkovian noise. The most common form of nonMarkovian errors is lowfrequency drift, which manifests in both RB and GST as coherent errors that remain nearly fixed over the course of any one sequence, but change from sequence to sequence (and between repetitions of a single sequence). In the presence of such effects, GST typically overestimates the RB decay rate, because GST amplifies coherent errors to which RB sequences are relatively insensitive (Markovian or not). Thus, GST typically reports a higher rate of Markovian noise in a quixotic attempt to fit its data, while RB simply does not see the noise.
But in this experiment, we observe the opposite effect. Instead of over estimating the RB error rate, GST under estimates it. While the exact cause remains uncertain, we observe that this behaviour is completely consistent with anticorrelated noise (each gate flips between under and overrotation at each application) induced by dynamically corrected gates (DCG)^{49}.
Here is a concrete model that reproduces this behaviour: Consider a unitary error that varies in time—but instead of varying slowly, it oscillates at the system’s Nyquist frequency (that is, flips sign every clock cycle). For simple gates implemented with a single pulse, this would be highly implausible. In this experiment, however, we implement DCG. The simplest DCG is a dynamically corrected idle gate (our G_{I}). This is nothing but dynamical decoupling—periodic X_{π} pulses that echo away small Z rotations. Such sequences create a ‘toggling frame’ for the qubit that flips sign twice per clock cycle. Any timing or amplitude errors in the pulses can leave a residual error that flips sign every clock cycle, making a plausible noise model for a DCG.
We model this effect by augmenting the qubit state space with a classical binary variable q∈{−1, +1}. We define a composite gate set , based on a standard gate set , which consists of two singlequbit gate sets and that act conditionally on the value of the classical bit q, which flips every time a gate is applied. These gate sets are identical to , except that the G_{X} and G_{Y} elements of have a fixed, slight over rotation by an angle θ, while the G_{X} and G_{Y} elements of have a fixed, slight under rotation by θ. At the beginning of each simulated experiment, q is chosen randomly.
acts on an eightdimensional state space, and data generated from it is not fully consistent with any Markovian singlequbit gate set. But GST can be applied to that data, and will find the Markovian singlequbit gate set that fits it best. (Indeed, as no experimental system is perfectly Markovian, this is in essence what GST always does).
We generate simulated data, with finitesample error, for all the GST and RB experiments performed on 30 March 2015. For this simulation, is defined by setting equal to the GST estimate from 30 March 2015, and setting θ=1.25 × 10^{−2}.
Analysing the GST data generated by yields an estimated gate set almost identical to that obtained from experimental data . All but two of the 36 free gate matrix elements are within the 95% confidence intervals assigned to , and the three remaining elements are at most 0.05σ outside them. Every gate is within 4.4 × 10^{−5} (in diamond norm) from the corresponding gate in . We conclude that GST, as performed, cannot distinguish the composite model from .
The RB data simulated with also matches the experimental RB data almost perfectly, yielding an RB error rate of (5.38±0.17) × 10^{−5} that is statistically indistinguishable from the experimentally observed RB decay rate of (5.31±0.16) × 10^{−5}. Both data sets (experimental and simulatedby) are shown in Fig. 5, along with RB data simulated from . We conclude that RB observes significantly different error rates for and .
This does not imply that our qubit really is described by , but it demonstrate a plausible nonMarkovian model that is fully consistent with our data. There might be many other (different) nonMarkovian models equally consistent with it. And while certain kinds of nonMarkovian noise can be detected by RB^{55} and GST, neither GST nor RB are designed to function reliably in the presence of any nonMarkovian noise, so neither of them is explicitly ‘right’ or ‘wrong’ for this case.
The relative power of RB and GST
RB and GST share the common framework of data from gate sequences (circuits) that are (1) diverse, (2) repeated and (3) long. But they are distinguished by the kind of sequences performed. RB sequences are random, for the specific purpose of ‘twirling’ the noise. GST sequences are structured and periodic, for the specific purpose of amplifying errors.
This difference is fundamental. It makes RB intrinsically insensitive to coherent errors, which dominate the diamond norm error metric^{27,28}. For example, suppose that one logic gate overrotates by a small angle θ, while the others are perfect. In random sequences containing L applications of this gate, it will (by construction) be interleaved with other gates chosen randomly. The rotations by θ will add up incoherently, producing (on average) a total rotation of , and therefore an error probability of Lθ^{2}. Thus, a coherent error by θ appears (in RB) as an incoherent error of probability θ^{2}.
But circuits of practical interest are not random. Since not all ‘useful’ circuits are known at this time, it is wise to consider how errors affect arbitrary circuits in the worst (most fragile) case. For the example given above, the worst case is a periodic sequence in which the imperfect gate appears L consecutive times. Rotations add up coherently, the final angle is Lθ and the final error probability is L^{2}θ^{2}. So, for example, a 10^{−3} rotation can cause a 1% failure rate after just L=100 gates. In randomized circuits, the same failure rate would require L=10^{4} gates.
The diamond norm metric is a strict upper bound on the rate at which failure probabilities can grow, and so it takes account (by construction) of the worstcase behaviour given above. The diamond norm error for a small coherent error by angle θ is O(θ). Process infidelity (closely related to the RB error rate) does not account for worstcase behaviour, and the process infidelity for a small coherent error by angle θ is O(θ^{2}).
GST intentionally implements a wide variety of periodic sequences, to ensure that at least one of them is approximately ‘worst case’ for every possible coherent error. This allows GST to detect coherent errors of size θ using sequences of length L=O(1/θ), repeated O(1) times. Detecting the same error with randomized sequences requires much long sequences of length L=O(1/θ^{2}), or else a much higher number of repetitions (both of which correspond to orders of magnitude more time and effort).
Periodic sequences might be incorporated into RB, to make it more sensitive. Doing so, however, would eliminate its characteristic feature. Such a protocol would no longer be RB. On the other hand, there are several interesting variations of RB that retain its randomized nature, most notably interleaved benchmarking^{29}, RB tomography^{30} and unitarity benchmarking^{31}. While interesting in their own right, they are all subject to the same trouble: random gate sequences are much less sensitive to coherent errors than periodic ones, and therefore every form of RB is necessarily inefficient at detecting coherent errors.
Unitarity benchmarking is particularly interesting, since (unlike other forms of RB) it can separate coherent and incoherent errors, and therefore provide good information about diamond norm error rates. Unfortunately, it is (compared with GST) extremely inefficient at doing so.
Wallman et al.^{31} defined a quantity u (unitarity), which measures the rate of purity decay. They gave an RBlike protocol for measuring it, and pointed out that u and r together could be used to bound the diamond norm error. If
then the noise is purely incoherent, and the diamond norm error is O(r). If u−u_{min} is sufficiently small, then the errors are primarily incoherent, and the diamond norm error remains O(r).
However, the actual bounds (see ref. 56) are of the form
We have demonstrated that our gates’ diamond norm error is O(r) using GST. Doing the same thing using unitarity requires showing that u−u_{min}=O(r^{2}). But 1−u is itself an RBtype quantity, meaning that it appears as an error rate (in experiments that measure purity), and is measured using RB. As a result, showing that u−u_{min}=O(r^{2}) is equivalent to:
 1
Performing standard RB to measure r, the decay rate of sequence fidelity.
 2
Performing a different RBlike experiment to measure r′=1−u.
 3
Demonstrating (based on those experiments) that r−r′=O(r^{2}).
For r=10^{−4} (the regime we access experimentally), this requires measuring both r and r′ to 10^{−8} precision. This is extraordinarily hard. The most efficient way to do it is using sequences of length L≈10^{4}. These would yield survival probabilities around p_{L}≈1/e. Achieving the necessary precision would require estimating p_{L} to ±10^{−4}, which would require approximately N=10^{8} repetitions (because the uncertainty is ). This is at least 10^{6} times more repetitions than would be required for standard RB, or for GST, and is completely impractical.
Validating 10^{−5} accuracy with simulations
We have claimed uncertainties (error bars/confidence regions—see ‘Methods’ section) of about 10^{−5} for diamond norms and process matrix elements. This is remarkable, and demands supporting evidence. To confirm this behaviour, we simulate GST experiments using (known) gate sets with unitary errors. The results (Fig. 6) confirm Heisenberg scaling: diamond norm distance between estimated and true gates decreases with the maximum sequence length (L) as 1/L. This scaling holds up to L≈1/, where is the stochastic error rate. This is consistent with the ±∼10^{−5} observed error bars on diamond norm errors in our final experiment, for which L=8,192.
Discussion
GST allows us to achieve highquality gates in a trappedYb^{+}ion qubit, and to characterize it to unprecedented precision. Although lowerRB error rates have been reported in trappedion qubits^{15,16}, our gates are the first to demonstrably surpass a rigorous FT threshold against general noise. GST is the first protocol that can efficiently demonstrate this important milestone and provide reliable feedback to debug and improve those gates.
Lowerror singlequbit gates are just one of several critical achievements required to enable faulttolerant quantum computing. Thus, this is only a first step. But GST—which can be generalized to 2qubit gates and measurements—does answer one key and pressing question: ‘Once suitable operations have been achieved, how can their performance be verified for a critical, objective observer?’ RB can provide reliable information about process fidelity (which unambiguously captures stochastic or incoherent errors), but as of this writing, process fidelity is not known to be the relevant metric for FT. An exciting recent development in this area is the introduction of randomized compiling^{57}, which has the potential to provably reduce the importance of coherent errors. But until and unless such techniques lead to a FT proof that is insensitive to them, and are confirmed to be practical in the context of FTQEC, coherent errors remain a point of concern. GST provides an efficient way to diagnose and bound all Markovian errors in gates.
Methods
Experimental details
In Sandia’s Thunderbird trap, ions were trapped 80 μm above the trap surface. Typical trap frequencies were 0.5, 1.8 and 2.3 MHz, for the axial and two radial modes, respectively. In the HOA2 trap, ions were trapped 68 μm above the trap surface and trap frequencies of 0.5, 2.2 and 2.8 MHz were achieved. Typical trapping times were several hours for the Thunderbird trap and up to 100 h for the HOA2 trap. Coherence times were measured to be ∼1 s in both traps, and were most likely clocklimited.
The microwave radiation used for qubit manipulation was generated by singleside band modulating the output of a 12.600 GHz dielectric resonator oscillator with the output of a direct digital synthesizer (DDS) near 42.812 MHz. The master clock for the dielectric resonator oscillator and DDS was generated by a rubidium frequency standard. The output of the singleside band modulator was amplified and directed parallel to the trap surface using a microwave horn. The microwave frequency and phase was controlled via the DDS and approximately square pulses were generated by switching the output of the DDS using a highisolation rf switch. An offset was added to the constituent pulses of the BB1 pulse sequence to compensate for switching imperfections.
Drift control of the microwave πtime was realized by interleaving experiments in which the ion was initialized, exposed to a 10.5π microwave pulse, and measured. The πtime was adjusted after state detection; upon seeing , πtime was decreased by 0.625 ns, while upon seeing , πtime was increased by 0.625 ns. For the next experiment, the πtime was then truncated to the time resolution of the experimental control (5 ns). Similarly, drift control of the qubit frequency was implemented by interleaving a Ramsey experiment in which the ion, after state initialization, is subject to: (1) a G_{X} gate, (2) a 25 ms wait time and (3) a G_{Y} gate. Upon state detection, the qubit frequency was adjusted by +8 mHz for a result, and by −8 mHz for a result.
Linear GST
Linearinversion GST (LGST) is a highly reliable but lowaccuracy way to obtain an initial estimate of the gate set that serves as a seed for further refinement by longsequence GST (see next subsection). LGST is essentially simultaneous ‘uncalibrated’ process and state tomography. By performing process tomographylike experiments on a set of gates, as well as the null operation (that is, the ‘do nothing for no time’ operation), LGST can provide rough estimates of all the gates involved, as well as the state preparation and measurement operations. LGST requires minimal assumptions about the various operations (unlike standard tomography), and computes its estimates using only basic linear algebra (the most complicated step is matrix inversion). A detailed explanation of the LGST procedure is provided in ref. 32; LGST is also described in ref. 33.
Analysing long sequences in GST
GST incorporates data from long sequences in two stages. The first stage consists of several iterations, each of which performs a minimumχ^{2} estimation. Each iteration takes the result of the previous iteration as a seed, and includes successively more of the longsequence data. The second stage is a maximumlikelihood estimation, which is seeded from the first stage and uses all of the data. This procedure consistently avoids local minima in the objective function. In this section, we give the details of this algorithm (outlined in Fig. 1).
The iterative fitting procedure starts by fitting only data from the shortest gate sequences (which are easy to fit and insensitive to most nonMarkovian noise), then successively adds longer and longer sequences (with base sequence length L≤1, 2, 4, 8, …). Since we get an estimate at each intermediate L, it is possible to quantify not just the goodness of the best fit, but how the goodnessoffit behaves as longer and longer sequences are added in, which is useful for debugging.
At each step in the iterative process, we vary the gate set to minimize Pearson’s χ^{2} test statistic, which measures the discrepancy between a predicted probability (p) and an observed frequency (f). It is defined as
where N is the number of samples taken. In this analysis, χ^{2} is used to compare the set of probabilities predicted by a gate set (p_{s}) and the frequencies obtained from a data set (f_{s}). Each experiment (that is, gate sequence) s is associated to two probabilities: ‘plus’ has probability p_{s} and ‘minus’ has probability 1−p_{s}. The χ^{2} of a single gate string s is
where N is the number of times the experiment s was performed, p_{s} is the probability of a ‘plus’ outcome as predicted by the gate set and f_{s} is the observed frequency of ‘plus’. The total χ^{2} for a data set is just the sum
To estimate our gate set parameters, we minimize at each iteration using the Levenberg–Marquardt algorithim implemented in SciPy^{58}.
The final stage in longsequence GST analysis is a maximumlikelihood estimation (MLE), based on numerical optimization of the loglikelihood function log . The loglikelihood for an noutcome system with predicted probabilities p_{i} and observed frequencies f_{i} (i=1 … n) is given by:
where N is the total number of counts. Like the χ^{2} statistic, log is used to compare the set of probabilities predicted by a gate set (p_{s}) to the frequencies obtained from a data set (f_{s}). Each experiment (that is, gate sequence) s is associated to two probabilities: ‘plus’ has probability p_{s} and ‘minus’ has probability 1−p_{s}. The log contribution of a single gate string s is
where N is the number of times the experiment s was performed, p_{s} is the probability of a ‘plus’ outcome as predicted by the gate set and f_{s} is the observed frequency of ‘plus’. The total loglikelihood for an entire data set is just the sum
We find the maximum of this quantity using the same Levenberg–Marquardt algorithm as above, in order to compute the final (modulo gauge optimization) estimate of the gates.
LGST would be a perfect estimator in the absence of finitesample error. However, it is inefficient with respect to accuracy. Like process tomography, its inaccuracy scales as , which means that achieving 10^{−5} error bars on all parameters would require around N=10^{10} repetitions of each experiment. Long sequences amplify errors proportional to L, enabling inaccuracy of for all parameters. (This scaling breaks down for L≥1/, where is the rate of stochastic decoherence. In our experiments, ≤10^{−4}, and we perform experiments as long as L=8192≈10^{4}).
We use a hybrid algorithm (involving both minχ^{2} and MLE) because each of its components have certain weaknesses. Empirically, we find that MLE is statistically wellmotivated and avoids any bias, whereas χ^{2} optimization is numerically more stable and faster computationally but yields biased estimators, especially for the SPAM parameters. Our hybrid method combines both virtues, by using the more efficient and reliable minχ^{2} algorithm to get a very good seed for the final (unbiased) MLE.
On a modern laptop, singlequbit GST with maximum L=1,024 can run in under 1 min; the analysis for maximum L=8,192 takes about 40 min.
Selecting gate sequences for GST
The data that GST use to reconstruct a gate set come from performing gate sequences (that is, quantum circuits). Every gate sequence necessarily comprises (i) initialization, (ii) some gates and (iii) measurement (which yields a count that is recorded in the data set). The sequences used for GST have an additional structure:
 1
Each GST sequence begins with a preparation fiducial sequence, and ends with a measurement fiducial sequence, with an ‘operation of interest’ sandwiched in the middle.
 2
The ‘operation of interest’, which could in principle be any gate sequence, is chosen to be a germ power sequence—that is, a short ‘germ’ sequence, repeated an integer number of times.
Thus, every GST sequence is of the form , where F_{i} and F_{j} are preparation and measurement fiducials (respectively), g_{k} is a germ and L is an integer. F_{i} and F_{j} range exhaustively over a set of six fiducial sequences, while g_{k} ranges exhaustively over a set of 11 germs. In this section, we explain how the fiducials and germs are chosen.
The purpose of the fiducials is to prepare a sufficiently diverse set of input states and measurements to completely probe the operation of interest. This is achieved if (and only if) the input states and the measurement effects are both informationally complete (IC). A set of matrices is IC if and only if it spans the vector space of matrices. This requires at least d^{2} linearly independent elements.
In general, any randomly chosen set of d^{2} states or effects will be IC. So, for singlequbit GST, we could choose d^{2}=4 random fiducial sequences. However, while the resulting {ρ_{i}} and {E_{j}} will almost certainly be linearly independent, they may be close to linearly dependent. This property is quantified by the spectrum of the Gram matrix , defined by
If either set fails to be IC, the Gram matrix will fail to have d^{2} nonzero (to machine precision) singular values. As any one of the d^{2} largest singular values becomes small, inverting the Gram matrix on its support (as is required for LGST) becomes illconditioned, and finitesample fluctuations in GST get amplified, causing poor accuracy.
We would like both preparation and measurement fiducials to be uniformly IC, meaning that they span as uniformly as possible, and the smallest singular value of the Gram matrix is as large as possible. There exists a singlequbit uniformly IC set with only four elements (the SICPOVM), but it cannot be generated with Clifford operations and stabilizer states. The smallest convenient uniformly IC set is the sixelement set of stabilizer states (the eigenstates of X, Y and Z). We choose six fiducial sequences so that, if the gates are ideal, they will prepare the stabilizer states exactly. They are
where indicates the null sequence (no gates).
Slightly imperfect gates will prepare states (and effects) that are close to the stabilizer states—and therefore close to uniformly IC, and almost as effective in probing the operation of interest. If the gates are sufficiently far from the targets, it can be detected by computing the singular values of the empirical Gram matrix, and then new fiducials can be chosen.
Once the fiducials are defined, we need to define ‘operations of interest’ for them to probe. By sandwiching any such operation between an exhaustive set of 36 fiducial pairs, we are essentially doing process tomography on the operation (although the algorithm for incorporating these data into the GST fit is more complex than simple process tomography).
The obvious operations of interest are the gates themselves (G_{x}, G_{y} and G_{i}). By probing each gate tomographically, and repeating each sequence N times, GST can estimate the gates to with accuracy (for some constant α). To achieve higher accuracy, we do tomography on powers of the gates, by designating, for example, or as an operation of interest. (Powers of 2 are chosen merely for convenience; any logarithmically spaced sequence of integer powers would work).
Repeating a gate L times—that is, performing sequences of the form —amplifies errors in the gate itself. So, for example, if G_{x} is actually a rotation by θ=π/2+, then is a rotation by 32. GST can now characterize that rotation to within , which equates to estimating θ to within . Raising gates to the Lth power amplifies deviations by L, which in turn reduces estimation error by a factor of L.
However, simple repetition of G_{x} does not amplify every error. For example, suppose that G_{x} is in fact a π/2 rotation, but around the wrong axis, corresponding to the unitary map
as opposed to the target unitary e^{−i(π/4)X}. (X and Y indicate the Pauli operators σ_{x} and σ_{y}).
This is a tilt error, and it is not amplified by . It’s easy to see this by observing that =1, so the error cancels itself out after just four repetitions.
More sophisticated sequences are needed to amplify tilt errors. For this example, it is sufficient to probe G_{x}G_{y}. Assuming (for now) that G_{y} is a perfect π/2 rotation around y, G_{x}G_{y} is a rotation by 2π/3+. Therefore, performing (G_{x}G_{y})^{L} amplifies the deviation by a factor of L, and setting it as an operation of interest allows GST to estimate to within . The short sequence G_{x}G_{y} is a germ, and repeating it L times yields a germ power sequence that can be sandwiched between fiducials to equip GST with high sensitivity to the parameter .
The general situation gets rapidly complicated—for example, if G_{y} is not perfect, then G_{x}G_{y} alone cannot distinguish between Y tilt in G_{x} and X tilt in G_{y}. Each germ is sensitive to some nontrivial linear combination of gate set parameters. To choose a set of germs, we list the possible germs (that is, all reasonably short sequences), and for each germ g we identify what linear combination of parameters it amplifies. We do this by computing a Jacobian,
where σ(g) is the gate sequence product for germ g (obtained by just multiplying together the process matrices), and is a vector containing all the parameters of the gate set (for example, the elements of all the process matrices).
In the singlequbit case, σ(g) is a 4 × 4 matrix, and is 48dimensional because it contains the elements of three 4 × 4 gate matrices. Constraining all gates to be tracepreserving reduces the number of free parameters to 12 and 36 (respectively), so ∇_{g} is a 12 × 36 matrix. Its 12 right singular vectors indicate linear combinations of gate set parameters that σ(g) amplifies (when repeated L times), and the corresponding singular values quantify how much they are amplified. A zero singular value indicates a parameter that is not amplified at all (like the tilt error discussed above). A set of germs {g_{1} … g_{N}} is, collectively, described by a Jacobian
Our goal is to choose germs that provide high sensitivity at ‘large’ values of L. In practice, it is not useful to make L larger than 1/, where is the rate of stochastic or depolarizing noise. To select germs, however, we ignore this effect and make the simplifying assumption that the gates (and therefore σ(g)) are reversible (a good approximation when is small). Under this assumption, it is possible to define the L→∞ limit of the Jacobian in equation (18). Using the product rule, and assuming that all the gates are unitary (and therefore σ(g)^{−1}=σ(g)^{†}),
As L→∞, the average over all powers n of σ(g) twirls . By Schur’s lemma, the effect of twirling is to project onto the commutant of σ(g)—that is, onto the subspace of matrices that commute with σ(g). Furthermore, multiplication by the unitary σ(g)^{−(L−1)} is merely a change of basis, and has no effect on the right singular vectors or the singular values of ∇_{g}. So, up to an irrelevant change of basis:
where Π_{σ(g)} is the projection onto the commutant of σ(g).
This framework defines a notion of informational completeness for germs. A set of germs {g_{i}} is amplificationally complete (AC) if and only if the right singular rank of its Jacobian equals the total number of physically accessible (gaugeinvariant) parameters in the gate set. For a general set of three singlequbit tracepreserving gates, a gauge transformation is G_{k}→TG_{k}T^{−1} where T is an invertible tracepreserving superoperator, so there are 12 gauge parameters and 36−12=24 gaugeinvariant parameters. To build an AC set of germs, it is sufficient to add germs to the set until its Jacobian has rank 24. By constructing a complete set of infinitesimal gauge transformations, we can actually construct the projector Π_{g−i} onto the (local) space of gaugeinvariant perturbations to the gate set .
We then optimize this set numerically, by adding and removing germs (taken from an exhaustive list of all short sequences), and only keeping a modification if it lowers a certain score function. (For singlequbit GST, we find it convenient to search over all germs of length ≤6. However, this set of candidates need not be exhaustive. (A larger gate set, for example, would generate a prohibitively large exhaustive candidate set). We have used randomly chosen subsets as candidate sets and gotten similar results). The score function is
This score estimates the mean squared error of estimation if a fixed number of counts are spread over the k distinct germs. Running this algorithm until it cannot improve the germ set any further produces the following set of 11 germs used in the final (March 2015) GST runs (see Fig. 7):
The GST gauge, and how to set it
A gate set comprises: an initial density matrix ρ (represented as a Hilbert–Schmidt vector), a measurement effect E (represented as a Hilbert–Schmidt dual vector) and one or more gates G_{i} (represented as superoperators). But not every parameter in this representation is physically observable. A gate set has intrinsic ‘gauge’ degrees of freedom, because two distinct gate sets (or an entire manifold of them) can yield identical probabilities for all possible experiments. Gauge transformations alter a gate set’s elements without changing any observable probability. They take the form
where M is any invertible superoperator. If (as usual) we consider only tracepreserving (TP) gate sets, then the corresponding necessary and sufficient condition for a gauge transformation to preserve this constraint is that M be itself TP (that is, its first row should be (1, 0, 0,…).
This gauge freedom makes it difficult to compare two gate sets, since two apparently distinct gate sets may actually be equivalent. Most of the metrics used to measure distance between two gates are not gaugeinvariant (for example, fidelity, tracenorm distance and diamond norm distance are all gaugevariant). So, while it would be ideal to work only with gaugeinvariant metrics, we have very few metrics (and developing them and championing their adoption to the scientific community is beyond the scope of this work). Instead, to generate meaningful metrics, we gauge optimize gate sets to make them as ‘close’ as possible before computing metrics.
Given a gate set and a target , we transform by M (as above) where M is chosen to optimize some criterion of ‘closeness’ between and . This is ‘gauge optimization’. The final output of GST is thus the gate set that is most similar to the target, according to some gaugevariant quantity, among a class of gaugeequivalent gate sets. In the work reported here, we minimize (for convenience) a weighted Frobenius distance:
where denotes the Frobenius norm, G_{i} ranges over all gates in the set, and w_{g} and w_{s} are weighting factors. The weight ratio w_{s}/w_{g} allows us to finetune the relative contributions of discrepancies in logic gates and in SPAM. This is important because their respective natural uncertainties are usually quite different; gates can be probed far more accurately than SPAM. Thus, typically, ; we weight the gate matrix elements more highly because they are known more precisely. We use an iterative numerical method to find an M that minimizes this quantity.
Each gate set is gaugeoptimized as a whole; we report all metrics using gates in a single gauge. It would be incorrect to separately optimize the gauge for different reported quantities (for example, gaugeoptimizing for the fidelity of a single gate and reporting each such bestfidelity separately). Finally, we note that the process of gauge optimization against a reference gate set is sufficient to solve gauge ambiguity issues. That is, any quantity of interest that is not inherently gaugeinvariant (for example, diamond norm) becomes so when this gauge optimization is performed. This numerical optimization process is not physically elegant, but is adequate for the practical applications we consider here.
Error bars
In interpreting the GST analyses (and in particular confirming the claim that we have demonstrated FT), it is necessary to assign error bars to gate set estimates (and derived quantities thereof). For most GSTderived quantities, we use Hessianbased likelihood ratio (LR) confidence regions, while for RBrelated quantities, we use parametric and nonparametric bootstrapping. We also use parametric bootstrapped error bars as a sanity check on our Hessianbased LR confidence regions, and find them to be in good agreement. Unless otherwise stated, all error bars indicate ∼95% (2σ) confidence intervals.
We employ two flavours of bootstrapping: parametric and nonparametric. Both derive statistical quantities of interest from ensembles of simulated data sets, but these data sets are generated in different ways.
For the parametric bootstrap, ensembles of data sets are generated by first computing the GST estimate of the experimental data set in question, and then using this estimate to generate an ensemble of new data sets, each of which has the same experiments and number of shots per experiment as the actual experimental data set.
For the nonparametric bootstrap, ensembles of data sets are generated by simply resampling the experimental data set with replacement. In both parametric and nonparametric bootstraps, we typically generate an ensemble of 100 data sets, to ensure good statistics.
GST is used to map each resampled data set to a gate set estimate. Each gate set is gaugeoptimized to match the experimental GST estimate as closely as possible. Then, from this ensemble of gaugeoptimized gate sets, any statistical quantity of interest (such as s.d.) may be calculated for process matrix elements or for derived quantities such as diamond norm.
We use bootstrapped error bars for two purposes. First, they serve as a sanity check on the more rigorous (but tricky to implement) LR confidence regions described in the following subsection. In Fig. 8, we compare the performance of parametric bootstrapping to the LR method, and see good agreement. Second, we use bootstrapping to put error bars on experimental RB decay rates. These are modelfree and therefore not amenable to LR confidence regions. Error bars on experimental RB decay rate error bars were calculated via nonparametric bootstrapping, while error bars on simulated RB decay rates were calculated via parametric bootstrapping on the underlying GST estimate used to generate the RB data.
Bootstrapping is a very general method for generating error bars, but it is (1) not always reliable, (2) subject to smallsample errors unless very many Monte Carlo samples are generated and (3) quite timeconsuming (up to 24 h of computer time were required to generate the 100 samples used for this paper). Likelihood ratio (LR) confidence regions^{59} are preferable in most ways, and we use them as our primary source for ‘error bars’.
The basic theory for LR confidence regions, as applied to quantum tomography, can be found in ref. 59. Confidence regions have a solid (if often misunderstood) statistical meaning: if an estimator generates confidence regions with a confidence level of 1−α, then with probability at least 1−α (taken over the ensemble of all possible data sets), the confidence region assigned by the estimator will contain the true parameter value. This does not mean ‘Given particular error bars, the probability that they contain the truth is 1−α,’ as there is no random variable to take a probability with respect to once the estimate has been assigned.
As implemented here, GST has two convenient properties. First, it yields a likelihood function that is well approximated by a Gaussian (because the total number of samples is quite large). Second, it involves no explicit constraints, meaning that the MLE is never squashed against a boundary (as it often is in standard state and process tomography, where the positivity constraint is critical to ensuring a physically valid estimate). These properties mean that we can approximate the loglikelihood function by a quadratic function, whose shape is given by the Hessian (matrix of second derivatives) of log at the MLE. This Hessian defines a covariance tensor in gate set space, which (when scaled by an appropriate factor) defines an ellipsoid that is a valid 1−α confidence region.
Writing down this ellipsoid explicitly (as a covariance tensor) is possible, but not useful in practice. Instead, we use it to define error bars (confidence intervals) for all relevant scalar quantities (including fidelities, diamond norms, gate matrix elements and so on).
Let be a scalar function of a gate set. We define a 95% confidence interval around the bestestimate value of by computing
where P(H) is the Hessian projected onto the (local) space of nongauge gate set parameters, and we have linearized . C_{1} is a scalar constant which satisfies CDF_{1}(C_{1})=0.95, where CDF_{1} is the cumulative density function of the probability distribution. With δf so defined, f*±δf specifies the 95% confidence interval for f. Within the linear approximation to f, which is valid for small δf, this interval corresponds to minimizing and maximizing the value of f over the contour of the loglikelihood corresponding to a 95% confidence interval if the loglikelihood had a single parameter.
We emphasize that this does not construct a 95% confidence region. There are roughly 34 gaugeinvariant parameters in a gate set; the threshold used here implies 95% confidence intervals for each of them. The resulting region contains the truth only if every one of the intervals contains its parameter, which occurs with probability at least 0.95^{34}≈17%.
We believe this is a more meaningful way to report ‘error bars’ than to report a 95% confidence region for the entire gate set. For one thing, it is consistent with the error bars reported by the bootstrap (which yields standard errors for each parameter independently, and would have to be expanded significantly to represent a joint confidence region). Empirically, we find that definition 27 correlates closely with the 2σ error bars on gate elements computed by parametric bootstrapping (see Fig. 8). Furthermore, we use the confidence region primarily to report uncertainties on single quantities (for example, diamond norms), independent of the others.
Data availability
The GST and RB analysis in this paper was performed using the opensource software pyGSTi (python GST implementation)^{60}, which was developed for this work. All data sets and analysis scripts necessary to reproduce the results presented here are available online as supplemental information at https://doi.org/10.5281/zenodo.231329.
Additional information
How to cite this article: BlumeKohout, R. et al. Demonstration of qubit operations below a rigorous fault tolerance threshold with gate set tomography. Nat. Commun. 8, 14485 doi: 10.1038/ncomms14485 (2017).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
25 June 2018
The original HTML version of this Article omitted the article number; it should have been ‘14485’. This has now been corrected in the HTML version of the Article. The PDF version was correct from the time of publication.
References
 1
Nakamura, Y., Pashkin, Y. A. & Tsai, J. Coherent control of macroscopic quantum states in a singlecooperpair box. Nature 398, 786–788 (1999).
 2
Wallraff, A. et al. Strong coupling of a single photon to a superconducting qubit using circuit quantum electrodynamics. Nature 431, 162–167 (2004).
 3
Barends, R. et al. Superconducting quantum circuits at the surface code threshold for fault tolerance. Nature 508, 500–503 (2014).
 4
Petta, J. R. et al. Coherent manipulation of coupled electron spins in semiconductor quantum dots. Science 309, 2180–2184 (2005).
 5
Shulman, M. D. et al. Demonstration of entanglement of electrostatically coupled singlettriplet qubits. Science 336, 202–205 (2012).
 6
Veldhorst, M. et al. A twoqubit logic gate in silicon. Nature 526, 410–414 (2015).
 7
Leibfried, D. et al. Experimental demonstration of a robust, highfidelity geometric two ionqubit phase gate. Nature 422, 412–415 (2003).
 8
Barrett, M. et al. Deterministic quantum teleportation of atomic qubits. Nature 429, 737–739 (2004).
 9
Reiserer, A., Kalb, N., Rempe, G. & Ritter, S. A quantum gate between a flying optical photon and a single trapped atom. Nature 508, 237–240 (2014).
 10
O’Brien, J. L., Pryde, G. J., White, A. G., Ralph, T. C. & Branning, D. Demonstration of an alloptical quantum controllednot gate. Nature 426, 264–267 (2003).
 11
Peruzzo, A. et al. Quantum walks of correlated photons. Science 329, 1500–1503 (2010).
 12
Politi, A., Matthews, J. C. & O’Brien, J. L. Shor’s quantum factoring algorithm on a photonic chip. Science 325, 1221–1221 (2009).
 13
Monz, T. et al. 14qubit entanglement: creation and coherence. Phys. Rev. Lett. 106, 130506 (2011).
 14
Debnath, S. et al. Demonstration of a small programmable quantum computer with atomic qubits. Nature 536, 63–66 (2016).
 15
Harty, T. P. et al. Highfidelity preparation, gates, memory, and readout of a trappedion quantum bit. Phys. Rev. Lett. 113, 220501 (2014).
 16
Brown, K. R. et al. Singlequbitgate error below 10^{−4} in a trapped ion. Phys. Rev. A 84, 030303 (2011).
 17
Mount, E. et al. Error compensation of singlequbit gates in a surfaceelectrode ion trap using composite pulses. Phys. Rev. A 92, 060301 (2015).
 18
Lidar D., Brunn T. (eds) Quantum Error Correction Cambridge University Press (2013).
 19
Knill, E. Quantum computing with realistically noisy devices. Nature 434, 39–44 (2005).
 20
Aliferis, P., Gottesman, D. & Preskill, J. Quantum accuracy threshold for concatenated distance3 codes. Quantum Info. Comput. 6, 97–165 (2006).
 21
Aliferis, P. & Cross, A. W. Subsystem fault tolerance with the baconshor code. Phys. Rev. Lett. 98, 220502 (2007).
 22
Aliferis, P. & Preskill, J. Fibonacci scheme for faulttolerant quantum computation. Phys. Rev. A 79, 012332 (2009).
 23
Aliferis, P. et al. Faulttolerant computing with biasednoise superconducting qubits: a case study. New J. Phys. 11, 013061 (2009).
 24
Aharonov, D. & BenOr, M. Faulttolerant quantum computation with constant error rate. SIAM J. Comput. 38, 1207–1282 (2008).
 25
Emerson, J. et al. Symmetrized characterization of noisy quantum processes. Science 317, 1893–1896 (2007).
 26
Knill, E. et al. Randomized benchmarking of quantum gates. Phys. Rev. A 77, 012307 (2008).
 27
Sanders, Y. R., Wallman, J. J. & Sanders, B. C. Bounding quantum gate error rate based on reported average fidelity. New J. Phys. 18, 012002 (2016).
 28
Kueng, R., Long, D. M., Doherty, A. C. & Flammia, S. T. Comparing experiments to the faulttolerance threshold. Phys. Rev. Lett. 117, 170502 (2016).
 29
Magesan, E. et al. Efficient measurement of quantum gate error by interleaved randomized benchmarking. Phys. Rev. Lett. 109, 080505 (2012).
 30
Kimmel, S., da Silva, M. P., Ryan, C. A., Johnson, B. R. & Ohki, T. Robust extraction of tomographic information via randomized benchmarking. Phys. Rev. X 4, 011050 (2014).
 31
Wallman, J., Granade, C., Harper, R. & Flammia, S. T. Estimating the coherence of noise. New J. Phys. 17, 113020 (2015).
 32
BlumeKohout, R. et al. Robust, selfconsistent, closedform tomography of quantum logic gates on a trapped ion qubit. Preprint available at http://arxiv.org/abs/1310.4492 (2013).
 33
Greenbaum, D. Introduction to quantum gate set tomography. Preprint available at http://arxiv.org/abs/1509.02921 (2015).
 34
Kim, D. et al. Microwavedriven coherent operation of a semiconductor quantum dot charge qubit. Nat. Nanotechnol. 10, 243–247 (2015).
 35
Chuang, I. L. & Nielsen, M. Prescription for experimental determination of the dynamics of a quantum black box. J. Mod. Opt. 44, 2455–2467 (1997).
 36
Weinstein, Y. S. et al. Quantum process tomography of the quantum fourier transform. J. Chem. Phys. 121, 6117–6133 (2004).
 37
O’Brien, J. L. et al. Quantum process tomography of a controllednot gate. Phys. Rev. Lett. 93, 080502 (2004).
 38
Merkel, S. T. et al. Selfconsistent quantum process tomography. Phys. Rev. A 87, 062119 (2013).
 39
Olmschenk, S. et al. Manipulation and detection of a trapped Yb^{+} hyperfine qubit. Phys. Rev. A 76, 052314 (2007).
 40
Wimperis, S. Broadband, narrowband, and passband composite pulses for use in advanced NMR experiments. J. Magn. Reson. Ser. A 109, 221–231 (1994).
 41
Merrill, J. T. & Brown, K. R. Progress in Compensating Pulse Sequences for Quantum Computation 241–294John Wiley & Sons, Inc. (2014).
 42
Gustavsson, S. et al. Improving quantum gate fidelities by using a qubit to measure microwave pulse distortions. Phys. Rev. Lett. 110, 040502 (2013).
 43
Morton, J. J. L. et al. Measuring errors in singlequbit rotations by pulsed electron paramagnetic resonance. Phys. Rev. A 71, 012332 (2005).
 44
O’Malley, P. J. J. et al. Qubit metrology of ultralow phase noise using randomized benchmarking. Phys. Rev. Appl. 3, 044009 (2015).
 45
Kelly, J. et al. Optimal quantum control using randomized benchmarking. Phys. Rev. Lett. 112, 240504 (2014).
 46
Stick, D. et al. Demonstration of a microfabricated surface electrode ion trap. Preprint available at http://arxiv.org/abs/1008.0990 (2010).
 47
Magesan, E., BlumeKohout, R. & Emerson, J. Gate fidelity fluctuations and quantum process invariants. Phys. Rev. A 84, 012309 (2011).
 48
Magesan, E., Gambetta, J. M. & Emerson, J. Characterizing quantum gates via randomized benchmarking. Phys. Rev. A 85, 042311 (2012).
 49
Khodjasteh, K. & Viola, L. Dynamical quantum error correction of unitary operations with bounded controls. Phys. Rev. A 80, 032314 (2009).
 50
Aharonov, D., Kitaev, A. & Nisan, N. in Proceedings of the thirtieth annual ACM symposium on theory of computing. 20–30ACM (1998).
 51
Benenti, G. & Strini, G. Computing the distance between quantum channels: usefulness of the fano representation. J. Phys. B 43, 215508 (2010).
 52
Watrous, J. Simpler semidefinite programs for completely bounded norms. Chicago J. Theor. Comp. Sci. 8, 11–14 (2013).
 53
Wasserman, L. All of Statistics: A Concise Course in Statistical Inference Springer (2013).
 54
Wallman, J. J. & Flammia, S. T. Randomized benchmarking with confidence. New J. Phys. 16, 103032 (2014).
 55
Ball, H., Stace, T. M., Flammia, S. T. & Biercuk, M. J. Effect of noise correlations on randomized benchmarking. Phys. Rev. A 93, 022303 (2016).
 56
Wallman, J. J. Bounding experimental quantum error rates relative to faulttolerant thresholds. Preprint available at arXiv:1511.00727 (2015).
 57
Wallman, J. J. & Emerson, J. Noise tailoring for scalable quantum computation via randomized compiling. Phys. Rev. A 94, 052325 (2016).
 58
Jones, E. et al. Scipy : Open source scientific tools for python. Available at http://www.scipy.org/ (accessed online 12012017) (2001).
 59
BlumeKohout, R. Robust error bars for quantum tomography. Preprint available at http://arxiv.org/abs/1202.5270 (2012).
 60
Nielsen, E., Rudinger, K., Gamble, J. K. & BlumeKohout, R. pyGSTi: A python implementation of gate set tomography. Available at http://github.com/pyGSTio (2016).
Acknowledgements
Sandia National Laboratories is a multiprogramme laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DEAC0494AL85000. The authors thank Travis Scholten and Jonathan Gross for assistance with data visualization and Kevin Young for providing code for diamond norm computation. J.K.G. gratefully acknowledges support from the Sandia National Laboratories Truman Fellowship Programme, which is funded by the Laboratory Directed Research and Development (LDRD) programme. This research was funded, in part, by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI or the U.S. Government.
Author information
Affiliations
Contributions
R.B.K., J.K.G., E.N. and K.R. contributed to the theoretical development of GST and the code implementation. J.M., K.F. and P.M. performed the experiments.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
BlumeKohout, R., Gamble, J., Nielsen, E. et al. Demonstration of qubit operations below a rigorous fault tolerance threshold with gate set tomography. Nat Commun 8, 14485 (2017). https://doi.org/10.1038/ncomms14485
Received:
Accepted:
Published:
Further reading

NonMarkovian Noise Characterization with the Transfer Tensor Method
Physical Review Applied (2020)

Experimentally testing quantum critical dynamics beyond the Kibble–Zurek mechanism
Communications Physics (2020)

Bootstrapping quantum process tomography via a perturbative ansatz
Nature Communications (2020)

Detecting crosstalk errors in quantum information processors
Quantum (2020)

Certified quantum gates
Physical Review A (2020)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.