Experimental quantum verification in the presence of temporally correlated noise

Growth in the capabilities of quantum information hardware mandates access to techniques for performance verification that function under realistic laboratory conditions. Here we experimentally characterise the impact of common temporally correlated noise processes on both randomised benchmarking (RB) and gate-set tomography (GST). Our analysis highlights the role of sequence structure in enhancing or suppressing the sensitivity of quantum verification protocols to either slowly or rapidly varying noise, which we treat in the limiting cases of quasi-DC miscalibration and white noise power spectra. We perform experiments with a single trapped $^{171}$Yb$^+$ ion-qubit and inject engineered noise ($\propto \hat{\sigma}_z$) to probe protocol performance. Experiments on RB validate predictions that measured fidelities over sequences are described by a gamma distribution that varies between an approximately Gaussian form for rapidly varying noise and a broad, highly skewed form for slowly varying noise. Similarly we find a strong gate set dependence of default experimental GST procedures in the presence of correlated errors, leading to significant deviations between estimated and calculated diamond distances in the presence of correlated $\hat{\sigma}_z$ errors. Numerical simulations demonstrate that expansion of the gate set to include negative rotations can suppress these discrepancies and increase reported diamond distances by orders of magnitude for the same error processes. Similar effects do not occur for correlated $\hat{\sigma}_x$ or $\hat{\sigma}_y$ errors or depolarising noise processes, highlighting the impact of the critical interplay of selected gate set and the gauge optimisation process on the meaning of the reported diamond norm in correlated noise environments.
Experiments reveal that the presence of correlated noise may compromise the interpretation of techniques for the validation of quantum hardware. A team led by Michael Biercuk at Australia’s University of Sydney and National Measurement Institute carried out experiments on a single trapped $^{171}$Yb$^+$ ion to test the reliability of widespread techniques for characterisation, validation and verification of quantum hardware. Although error processes are often assumed to be statistically independent, in practice slowly varying external fields may introduce temporal correlations in noise. The experiments revealed that the outcomes of randomised benchmarking and gate-set tomography differ substantially in the presence of correlated noise and exhibit an unexpected sequence-dependent behaviour. These results demonstrate that the reliability of standard performance benchmarking techniques is strongly influenced by the statistical properties of the noise affecting the hardware, complicating direct comparisons between experiments.


INTRODUCTION
Quantum characterisation, validation, and verification (QCVV) techniques are broadly used in the quantum information community in order to evaluate the performance of experimental hardware. A variety of techniques have emerged including randomised benchmarking (RB), 1,2 purity benchmarking, 3 process tomography, 4-7 adaptive methods, 8 and gate-set tomography (GST). 9,10 Each protocol has relative strengths and weaknesses; for instance, RB has low experimental overhead but only provides average information about gate performance, while process tomography provides more information at the cost of unfavourable scaling in measurement overhead. 11 Despite their differences, these protocols share the common theme that they were originally developed and mathematically formalised assuming that error processes are statistically independent and do not exhibit strong correlations in time. 1,2,10 Even in highly controlled laboratory environments there are a range of noise sources that, when applied to a qubit concurrent with logical gate operations, produce effective error models that diverge significantly from the assumptions underlying most QCVV protocols. For example, slow variations in ambient magnetic fields or drifts in amplifier gain can produce temporally correlated noise processes, often characterised through a power spectral density possessing large weight at low frequencies. 12-14 Moreover, these error processes may exhibit gate-dependent behaviour. So far such processes have been largely ignored in experimental QCVV, with predominantly phenomenological attempts used to explain deviations from ideal outputs. 15 Understanding that such an approach is untenable when attempting to rigorously compare QCVV results to metrics relevant to quantum error correction has recently led to an expansion of theoretical activity in this space. 3,16-20
In this work our objectives are to experimentally characterise and explain the impact of temporally correlated noise processes on the outputs of QCVV protocols, and to identify potential modifications enabling users to improve the utility of the information returned. We perform QCVV experiments using a single trapped $^{171}$Yb$^+$ ion as a long-lived, high-stability qubit. Our study implements engineered frequency noise ($\propto \hat{\sigma}_z$) in the control system in order to study the impact of different temporal noise correlations on QCVV results. We apply noise in the two extremes, either quasi-DC offsets or noise with an effective white power spectrum, to approximate slowly and rapidly varying noise, respectively. Measurements reveal that QCVV outputs diverge significantly when subject to these different types of noise, highlighting potential circumstances where the information extracted from a given protocol may no longer accurately represent the true error processes experienced by individual gates. Our experiments are compared against analytic calculations linking the underlying structure of the QCVV sequences with the manifestation of specific characteristics associated with the presence of noise correlations.
We examine two common QCVV protocols in the experimental quantum information community: RB and GST. The construction of these protocols follows a similar pattern: a series of unitary quantum operations is applied to one or more qubits sequentially in time, followed by a projective measurement (Fig. 1a). Experimental measurements are acquired and combined, then experimental parameters are changed according to some prescription (e.g. changing the sequence length J) and further data are collected. The variation in QCVV protocols predominantly comes from the different constituent operations that are applied and the analysis techniques by which measurement results are post-processed to extract information.
In RB, sequences are constructed by concatenating unitary operations $U_l$ selected at random from the 24 Clifford operations $C_l$. The final operation in a sequence of length J is selected to invert the net rotation, $U_J = \left(\prod_{l=1}^{J-1} C_l\right)^{-1}$, such that the sequence implements a net identity, $\prod_{l=1}^{J} C_l = \hat{I}$. In experimental GST as defined by the pyGSTi package, by contrast, operations are selected deterministically according to a tabulated routine comprising specifically crafted sequences that are designed to maximise overall sensitivity to all detectable error types. These operations are constructed by concatenating so-called "germs", short sequences implementing predefined unitary rotations, which, in our case, are constructed from a subset of Clifford gates. The first and last unitaries $U_{1,J} \in \{F_\alpha, F_\beta\}$, termed the "fiducial" operations, effectively set the reference frame for state preparation and measurement (Fig. 1a); see "Methods" for further detail.
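The RB prescription described above can be sketched in a few lines. This is a minimal illustration, not the control code used in the experiment: the helper names (`canonical`, `clifford_group`, `rb_sequence`) are hypothetical, and generating the 24 Cliffords from the Hadamard and phase gates is one convenient choice assumed here rather than the physical gate decomposition used in the laboratory.

```python
import numpy as np

def canonical(U):
    """Strip the global phase so unitaries equal up to phase compare equal."""
    flat = U.flatten()
    ref = flat[np.argmax(np.abs(flat) > 1e-9)]  # first non-zero entry
    return U * (np.conj(ref) / abs(ref))

def clifford_group():
    """Build the 24 single-qubit Cliffords (modulo phase) from H and S.

    Assumption: H and S are used only as abstract generators here.
    """
    H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    S = np.array([[1, 0], [0, 1j]], dtype=complex)
    group = [np.eye(2, dtype=complex)]
    frontier = list(group)
    while frontier:
        found = []
        for U in frontier:
            for g in (H, S):
                V = canonical(g @ U)
                if not any(np.allclose(V, W) for W in group):
                    group.append(V)
                    found.append(V)
        frontier = found
    return group

def rb_sequence(J, rng, cliffords):
    """J-gate RB sequence: J-1 random Cliffords plus the inverting gate U_J."""
    picks = rng.integers(len(cliffords), size=J - 1)
    gates = [cliffords[k] for k in picks]
    net = np.eye(2, dtype=complex)
    for U in gates:              # later gates multiply from the left
        net = U @ net
    gates.append(net.conj().T)   # U_J = (C_{J-1} ... C_1)^{-1}
    return gates

cliffords = clifford_group()
sequence = rb_sequence(100, np.random.default_rng(1), cliffords)
```

The final gate is obtained as the Hermitian conjugate of the accumulated product, so the full sequence multiplies to the identity by construction.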
In our experiments we engineer noise in order to permit quantitative analysis of QCVV outputs under known conditions. We compare the outputs obtained from both RB and GST for two distinct noise-correlation regimes. In the first, the engineered noise is implemented as a constant miscalibration over the entire sequence, which is the extreme case for slowly varying noise and produces temporally correlated errors. In the second, the engineered noise is rapidly varying (yielding an approximately white power spectrum), which leads to errors that are uncorrelated between gates (Fig. 1b). We now introduce a framework for interpreting the impact of sequence structure and noise correlations on measurement outcomes to facilitate an analysis of our results.

Mapping noise to measured error in RB
The key analytic tool for our study is a formalism mapping an applied noise model to an output error for a given Clifford sequence, following a procedure derived in ref. 17. Error accumulation over a given Clifford sequence maps to a "random walk" in a three-dimensional vector space representing the action of sequential error unitaries in the operator space spanned by the Pauli operators, $\hat{\sigma}_{\{x,y,z\}}$ (Fig. 1c). For $\hat{\sigma}_z$ noise, the lth step of the walk is calculated by conjugating $\hat{\sigma}_z$ with the entire operator subsequence $K_{l-1} \equiv \prod_{q=1}^{l-1} U_q$ up to the (l − 1)th gate, with multiplication performed from the left. This conjugation always results in a member of the Pauli group, allowing us to compactly write $P_l \equiv K_{l-1}^{\dagger} \hat{\sigma}_z K_{l-1} = \hat{r}_l \cdot \hat{\boldsymbol{\sigma}}$, where $\hat{\boldsymbol{\sigma}} = (\hat{\sigma}_x, \hat{\sigma}_y, \hat{\sigma}_z)$ and $\hat{r}_l \in \{\pm\hat{x}, \pm\hat{y}, \pm\hat{z}\}$. The direction of $P_l$ in Pauli space therefore maps to the Cartesian unit vector $\hat{r}_l$ associated with the lth step of a J-step walk $\vec{R} \equiv \sum_{l=1}^{J} \delta_l \hat{r}_l$. For our chosen error model, the step length, $\delta_l$, captures the integrated phase between the driving field and qubit during execution of the single gate $U_l$. In terms of experimental parameters, $\delta_l = \Delta/\Omega$, where Δ/Ω is the detuning expressed in units of the experimental Rabi frequency Ω (see "Methods").
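The conjugation rule above translates directly into code. The sketch below is illustrative only: the function names are hypothetical, the gates are ideal rotations built from a closed-form expression, and the example sequences are chosen for clarity rather than taken from the experiment.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = (SX, SY, SZ)

def rot(pauli, angle):
    """Ideal rotation exp(-i * angle/2 * pauli)."""
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * pauli

def walk_steps(gates):
    """Unit steps r_l for sigma_z noise: P_l = K_{l-1}^dag sigma_z K_{l-1}."""
    K = I2.copy()                      # K_0 is the identity
    steps = []
    for U in gates:
        P = K.conj().T @ SZ @ K
        # Components of P along each Pauli operator: r_i = Tr(sigma_i P) / 2
        r = [0.5 * np.trace(s @ P).real for s in PAULIS]
        steps.append(np.round(r))
        K = U @ K                      # multiplication from the left
    return np.array(steps)

def walk_vector(gates, deltas=None):
    """Net walk: sum_l delta_l r_l (unit steps if deltas is None)."""
    steps = walk_steps(gates)
    if deltas is None:
        deltas = np.ones(len(gates))
    return (np.asarray(deltas)[:, None] * steps).sum(axis=0)

X90 = rot(SX, np.pi / 2)
V_suppressed = walk_vector([X90] * 4)   # steps +z, +y, -z, -y
V_amplified = walk_vector([I2] * 4)     # four steps along +z
```

Four consecutive π/2 rotations about x return the unit-step walk to the origin, whereas four identity operations accumulate a walk of length 4 along z, which is the simplest illustration of sequence-dependent error suppression and amplification.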
The overall form of the walk is a statistical measure of how the sequence itself interacts with the noise process to produce a net, measurable accumulation of error. Sequences that are highly susceptible to error accumulation produce walks that migrate far from the origin, while sequences exhibiting error suppression produce walks that meander back towards the origin. The net walk length is captured in the mean-squared distance from the origin, $\langle \|\vec{R}\|^2 \rangle$, averaged over noise realisations. This links to the trace fidelity, $F_{\mathrm{trace}}$, which is defined through a trace over modified unitary operations that take into account the effect of the $\hat{\sigma}_z$ noise, and from which we then define the infidelity. Appropriately linking this picture of error accumulation to standard laboratory measurements requires consideration of the measurement routine itself. In typical measurements the qubit Bloch vector at the end of the sequence is projected onto the quantisation axis, z, with basis states |0〉 and |1〉. A measurement of this type is therefore insensitive to net rotations around that axis, so the measurable quantity is the 2D projection of the walk, obtained by discounting the mean-squared walk length along the quantisation axis (see Supplementary Material for details).
Fig. 1 QCVV sequence construction and mapping to accumulated error. a Overview of unitary sequence construction for RB and GST, using Clifford gates, $C_l$, or fiducial operations, $F_{\alpha,\beta}$, and repeated germs $(G)^n$, respectively. b Schematic representation of slowly and rapidly varying noise with relevant time scales defined by the sequence, where δ represents the instantaneous noise values drawn from a normal distribution with variance σ². Grey lines are other possible noise realisations. For RB, the noise is sampled from this distribution and varies shot-to-shot between noise realisations, while in GST a single value is selected for the entire set of experiments. c Sequence-dependent "random walk" calculated for an arbitrary QCVV sequence (here according to the RB prescription) with J = 100 in Pauli space. Green dot indicates origin and black triangle indicates sequence terminus. Blue line represents the 3D walk, which can be used to calculate the trace infidelity, while the grey line represents the 2D projection, measurable in a standard projective measurement. The green arrow indicates the net walk vector, $\vec{V}_{2D}$, given unit step size.
At this stage we must link the correlation properties of the noise to the form of the walk for a specific sequence. Considering only the underlying properties of the sequence, we may assume unit-length steps, resulting in a deterministic sequence-dependent walk with length $\vec{V} \equiv \sum_{l=1}^{J} \hat{r}_l$. The presence or absence of temporal noise correlations is now captured through a rescaling of the individual steps in the deterministic walk for a specific sequence. In the case of slowly varying noise, and to first-order approximation, the net error can be separated into two independent parts, $\|\vec{R}\|^2 = \delta^2 \|\vec{V}\|^2$, where δ is the value of the noise and $\|\vec{V}\|$ is the net unit-step walk specific to a particular sequence. 17 However, in the case of rapidly varying noise these two terms are no longer separable and the net error must be calculated as the convolution of the noise value at each timestep and each individual step in the random walk, $\|\vec{R}\|^2 = \|\sum_{l=1}^{J} \delta_l \hat{r}_l\|^2$.

Experimental platform and engineered noise
We perform experiments using the hyperfine qubit in a single trapped $^{171}$Yb$^+$ ion driven by microwaves near 12.64 GHz, with basis states |0〉 and |1〉. Our calibration process permits accurate determination of the (first-order magnetic-field-insensitive) qubit transition frequency to within approximately 1 Hz. In our laboratory, this qubit and the associated control system have been demonstrated to possess a coherence time of $T_2 \sim 1$ s, measurement fidelity of 99.7% limited by photon collection efficiency, and error rates from intrinsic system noise of $p_{\mathrm{RB}} \approx 6 \times 10^{-5}$ using "baseline" RB experiments (see Supplementary Figures).
Details of the control system and experimental protocols for QCVV techniques used here are presented in the Methods, and information about various detection procedures in use for estimating P (including a Bayesian method) are found in the Supplementary Materials.
We engineer $\hat{\sigma}_z$ noise applied concurrently with Clifford operations through the application of a detuning, Δ, of the qubit driving field from resonance using an externally modulated vector signal generator. As the detuning is applied concurrently with driven qubit rotations about x and y axes, rotation errors arise along multiple directions on the Bloch sphere, rather than being purely $\hat{\sigma}_z$ in character. An additional violation of typical assumptions employed in RB is that different Clifford gates are physically decomposed into base rotations with different durations, which means that our formal error model will also be gate-dependent. 19 For each of our two limiting noise cases we engineer N different noise "realisations" in order to average over an appropriate ensemble. In our experiments, we set the distribution of noise $\Delta/\Omega \sim \mathcal{N}(0, \sigma^2)$, where σ² is the variance of the distribution, such that the root-mean-square value is approximately equivalent in both cases once averaged over all noise realisations. The specific implementation of noise engineering and its impact on the conduct of RB and GST is described in "Methods", and additional details on the error model are provided in the Supplementary Materials.
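To make the two limiting noise cases concrete, the following sketch draws noise realisations from a normal distribution with variance σ² and applies them to a fixed unit-step walk, either as one constant value per realisation (quasi-DC) or as an independent value per gate (white). It is a simplified numerical illustration under stated assumptions: the function name, the example walk and the value of σ are hypothetical, and gate-dependent durations are ignored.

```python
import numpy as np

def mean_sq_walk(steps, sigma, n_real, rng, correlated):
    """Noise-averaged ||R||^2 for a fixed sequence of unit steps r_l.

    correlated=True  : one delta per realisation, constant over the sequence
    correlated=False : independent delta_l for every gate
    """
    J = len(steps)
    if correlated:
        delta = np.repeat(rng.normal(0.0, sigma, size=(n_real, 1)), J, axis=1)
    else:
        delta = rng.normal(0.0, sigma, size=(n_real, J))
    R = delta @ steps                  # R = sum_l delta_l r_l, shape (n_real, 3)
    return np.mean(np.sum(R**2, axis=1))

# Hypothetical 15-step walk: ten steps along z, then five along x.
steps = np.array([[0, 0, 1]] * 10 + [[1, 0, 0]] * 5, dtype=float)
V_sq = np.sum(steps.sum(axis=0) ** 2)  # ||V||^2 for unit steps

rng = np.random.default_rng(7)
sigma = 0.01                           # assumed noise strength, Delta/Omega
slow = mean_sq_walk(steps, sigma, 20000, rng, correlated=True)
rapid = mean_sq_walk(steps, sigma, 20000, rng, correlated=False)
```

For constant noise the average approaches σ²‖V‖², so it depends on the sequence through the net walk. For independent per-gate noise the cross terms average to zero and the result approaches σ²J for any sequence of J unit steps.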
Experiments involve state preparation in the |0〉 state, application of a unitary sequence appropriate for a QCVV protocol while subject to noise, and projective measurement of the qubit along the quantisation axis. The sequence of operations applied and the measurement procedure are determined by the protocol in use.

RB survival probability distributions
In the limit of rapidly varying noise, all sequences of randomly ordered Clifford gates with length J are equivalent under noise averaging, and all sequence survival probabilities tend towards the mean. Recent theoretical studies have demonstrated that measurements on RB sequences in the presence of temporal noise correlations can produce a divergence between average and worst-case reported trace fidelities. 17,20 Thus we find that measurement outcomes for different RB sequences are characterised by distributions with distinctly different shapes depending on the temporal correlations in the noise. The standard practice of combining all measurements to extract an RB error rate, $p_{\mathrm{RB}}$, from the decay of the mean over all J-gate sequences as a function of J, results in a global ensemble average and does not take advantage of this information (formally, as the noise we implement exhibits temporal correlations, the value of $p_{\mathrm{RB}}$ one extracts may not be meaningful as a measure of average Clifford gate error). Our analysis takes advantage of the additional information which is always present in an RB experiment in order to evaluate the impact of noise correlations and deduce useful information about the underlying error process.
In our experimental study we measure the noise-averaged survival probabilities for a set of sequences $\{\eta_i\}_J$, indexed by i and of length J, for different lengths 25 ≤ J ≤ 200 (Fig. 2a), where we implement the same set of J-gate sequences under application of either slowly or rapidly varying detuning noise. For an arbitrary individual sequence, $\eta_i$, and a single noise realisation, n, we perform r nominally identical repetitions of the experiment. We combine the information from the outcomes of these individual repetitions to produce a maximum-likelihood estimate of survival probability, $P_{i,n}$ (see Supplementary Materials). The use of multiple repetitions under identical conditions reduces quantum projection noise in the qubit measurement and assists in isolating specific quantitative contributions to the distribution of survival probabilities, though this is not possible without noise engineering. In general, we average measured outcomes over a fixed number of noise realisations to yield $\langle P_{i,\cdot} \rangle$ for a fixed sequence $\eta_i$. From here on, we will refer to this noise-averaged survival probability as P.
In the case of rapidly varying noise we observe the distribution of sequence outcomes is symmetrically spread around the sequence-averaged mean survival probability, $P(J)$, and the entire distribution shifts away from zero error with increasing J (red data, Fig. 2a). The presence of slowly varying noise, by contrast, produces a broad distribution of measured P over each set $\{\eta_i\}_J$, demonstrating a positively skewed set of outcomes and the persistence of a long tail at higher error rates (lower survival probabilities). In this case, as J increases the distribution broadens but remains skewed. Under both noise correlation cases, the measured $P(J)$ remain approximately the same. The differences in the distribution of measured survival probabilities over sequences under these two noise models reproduce the central predictions of ref. 17. We compare the characteristics of the distributions themselves against analytic predictions for both slowly and rapidly varying noise, beginning with the measured expectation, $\mathbb{E}(\mathcal{I})$, and variance, $\mathbb{V}(\mathcal{I})$ (Fig. 2b, c), finding good agreement by taking only the applied noise strength as an input into a theoretical model (see Supplementary Materials). More specifically, theoretical predictions suggest that the distribution of outcomes under both noise models, as well as intermediate models described by coloured power spectra, should be well described by a gamma distribution. 17 The general gamma distribution probability density function is given by $f(x; \alpha, \beta) = \frac{x^{\alpha - 1} e^{-x/\beta}}{\beta^{\alpha}\, \Gamma(\alpha)}$, where α and β are the shape and scale parameters and Γ(x) is the gamma function. The form of the gamma distribution will vary significantly between the limiting noise cases treated here, tending towards a symmetric Gaussian for rapidly varying noise and a broader positively skewed distribution in the presence of slowly varying noise, as determined by the values of α and β.
Figure 2d-g shows histograms of RB sequence survival probabilities in the presence of the extreme case of slowly varying noise, quasi-DC miscalibration. We overlay gamma distributions calculated from first principles using no free parameters (black lines) as $\Gamma\left(1, (2J\sigma^2/3)(1/2 + \pi^2/96)\right)$, and fixing α = 1 while allowing β to vary as a fit parameter (green lines). The theoretical prediction captures both the measured skew towards high survival probabilities and the approximate "length" of the tail at low survival probabilities. We believe that residual disagreement between data and first-principles calculations arises due to both limited sequence sampling and contributions from higher-order analytic error terms when the approximation $J\sigma^2 \ll 1$ is no longer valid. Importantly, data and theory show the mode of the distribution is close to unity survival probability ($P = 1$) and therefore corresponds to a lower error than the mean. For details on modifications to the theory presented in ref. 17 accounting for the specific noise and gate-dependent error model employed in our experiments, contributions from higher-order terms, and expanded data sets including larger sequence numbers, see Supplementary Material.
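The first-principles overlay quoted above can be evaluated numerically with a few lines of code. The sketch below assumes the shape-scale convention for the gamma distribution and treats its argument as the sequence infidelity 1 − P; the function names and the chosen values of J and σ are hypothetical and serve only to illustrate the shape of the distribution.

```python
import math
import numpy as np

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and scale beta."""
    x = np.asarray(x, dtype=float)
    return x ** (alpha - 1) * np.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))

def slow_noise_scale(J, sigma):
    """Scale parameter quoted for quasi-DC noise: (2 J sigma^2 / 3)(1/2 + pi^2/96)."""
    return (2 * J * sigma ** 2 / 3) * (0.5 + math.pi ** 2 / 96)

J, sigma = 200, 0.02                 # assumed values for illustration only
alpha, beta = 1.0, slow_noise_scale(J, sigma)

x = np.linspace(0.0, 40 * beta, 200001)
dx = x[1] - x[0]
pdf = gamma_pdf(x, alpha, beta)

norm = np.sum(pdf) * dx              # simple Riemann sum, close to 1
mean_infidelity = np.sum(x * pdf) * dx
mode_infidelity = x[np.argmax(pdf)]  # alpha = 1 puts the mode at zero error
```

With α = 1 the density is a decaying exponential, so the most probable infidelity is zero while the mean equals β, which matches the observation that the mode corresponds to a lower error than the mean.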

Modification of RB for identification of model violation
The fact that the distribution of sequence survival probabilities under slowly varying noise does not converge to the mean indicates sequence-dependence in the resulting error accumulation. The emergence of this phenomenology is elucidated through an examination of the walks for different sequences. Under this type of noise certain sequences possess walks with large $\|\vec{V}_{2D}\|^2$, hence amplifying the accumulation of error, while others tend back towards the origin and show reduced accumulated error (Fig. 3a, b). We classify sequences as "long-walk" if they possess a 2D projection beyond the diffusive mean-squared limit for an unbiased random walk, $\|\vec{V}_{2D}\|^2 > \frac{2}{3}J$. We link the sequence walk in Pauli space to the noise-averaged survival probability by displaying the experimentally measured $1 - P$ for sequences of fixed length J = 200 against the calculated 2D walk length, $\|\vec{V}_{2D}\|^2$ (Fig. 3c). Data are presented for both rapidly varying (red open markers) and slowly varying (grey solid markers) noise, where the same set of sequences is used between the noise models. Measurements for rapidly varying noise are fit with a line possessing a slope approximately consistent with zero, while for the same sequences under slowly varying noise, the measurements show a positive dependence on $\|\vec{V}_{2D}\|^2$ as expected. We believe the significant scatter in the plot is partially due to a concurrently acting noise source and higher-order contributions to error, neither of which are incorporated in the first-principles calculation of the walk, $\|\vec{V}_{2D}\|^2$ (see Supplementary Material and Appendix C of ref. 17). Nonetheless, the effect of sequence structure on measured survival probability is clearly visible for the case of slowly varying noise.
In aggregate, this phenomenology gives rise to the skewed gamma distribution under slowly varying noise described above, and the convergence of all noise-averaged survival probabilities for individual sequences to the ensemble average when the noise is rapidly varying. However, preselection of RB sequences possessing large calculated, unit-step walks also provides a mechanism to both identify the presence of temporally correlated errors and extract an RB outcome that more closely approximates worst-case errors. In Fig. 3d we plot $1 - P$ vs. J for a subset of sequences preselected to possess long walks as in Fig. 3a, whose error rates we denote $p^{\mathrm{LW}}_{\mathrm{RB}}(J)$. We base the preselection of long walks on the condition $\|\vec{V}_{2D}\|^2 > 2\left(\frac{2}{3}J\right)$. When these long-walk sequences are subjected to rapidly varying noise, the distribution of survival probabilities over sequences remains approximately Gaussian about the mean, and the expectation value over this subset closely approximates the expectation value over an unbiased random sampling of the $24^J$ possible J-gate sequences, $P^{\mathrm{rapid}}_{\mathrm{LW}} \approx P^{\mathrm{rapid}}$ (Fig. 3d, red solid line and blue dashed line). However, in the presence of slowly varying noise we observe a larger spread in $P^{\mathrm{slow}}_{\mathrm{LW}}$ than that achieved with unbiased sampling. The difference between the sequence-averaged survival probabilities in these noise cases arises solely because of the intrinsic properties of the sequences in use. Extracting an RB gate-error-rate, $p^{(\mathrm{LW})}_{\mathrm{RB}}$, from $P_{\mathrm{LW}}(J)$ in the presence of slowly varying noise, we typically find an increase $p^{(\mathrm{LW})}_{\mathrm{RB}} \sim 2$-$5\, p_{\mathrm{RB}}$ relative to standard sequence sampling, depending on the number of long-walk sequences employed, and the threshold value of $\|\vec{V}_{2D}\|^2$ used to define a "long walk" (Fig. 3c). This approach effectively constitutes the construction of an RB protocol that increases the reported error rate by enhancing sensitivity to a particular noise type, which in our case is $\propto \hat{\sigma}_z$.
Alternative sequences may also be calculated that are more sensitive to $\hat{\sigma}_x$ or $\hat{\sigma}_y$ noise than randomly selected RB sequences. These error-enhancing sequences give a clear, qualitative signature of the violation of the assumption that the error process is uncorrelated in time, although we do not claim that such a signature is in general uniquely associated with the presence of temporal noise correlations. Furthermore, because calculation of $\|\vec{V}_{2D}\|^2$ and sequence preselection is performed numerically in advance, this approach alleviates the requirement to average extensively in experiment over sequences in order to reveal the skewed fidelity distribution.
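Because the preselection step is purely numerical, it is easy to prototype. The sketch below makes two simplifying assumptions that are not taken from the source: each unit step is drawn uniformly from the six directions ±x, ±y, ±z as a stand-in for walks computed from actual Clifford sequences, and the 2D projection is taken to be the x-y components of the walk. The function names and the `factor` parameter are hypothetical.

```python
import numpy as np

AXES = np.array([[1, 0, 0], [-1, 0, 0],
                 [0, 1, 0], [0, -1, 0],
                 [0, 0, 1], [0, 0, -1]], dtype=float)

def walk_2d_sq(J, n_seq, rng):
    """||V_2D||^2 for n_seq random unit-step walks of length J.

    Assumption: uniformly random steps approximate random Clifford sequences.
    """
    steps = AXES[rng.integers(6, size=(n_seq, J))]   # shape (n_seq, J, 3)
    V = steps.sum(axis=1)
    return V[:, 0] ** 2 + V[:, 1] ** 2               # drop the z component

def preselect_long_walks(v2d_sq, J, factor=1.0):
    """Indices of sequences whose 2D walk exceeds factor * (2/3) J."""
    return np.flatnonzero(v2d_sq > factor * 2.0 * J / 3.0)

J = 200
rng = np.random.default_rng(3)
v2d_sq = walk_2d_sq(J, 4000, rng)
long_walks = preselect_long_walks(v2d_sq, J)               # diffusive limit
longer_walks = preselect_long_walks(v2d_sq, J, factor=2.0) # stricter cut
```

Under these assumptions the sample mean of ‖V_2D‖² sits near the diffusive value 2J/3, and raising `factor` trades the number of retained sequences against their expected sensitivity to slowly varying noise.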

Experimental GST in the presence of correlated noise
We now apply the sequence-dependent Pauli walk framework to the default experimental GST gate set in order to understand the interplay of sequence structure and temporal noise correlations in the experimental GST estimation procedures. We begin by collating all standard experimental GST sequences up to 256 gates in length using gates $G_I \equiv \hat{I}$, the identity, $G_x$, a π/2 $\hat{\sigma}_x$ rotation, and $G_y$ defined similarly. We define sequences to include fiducial operations and germs (see "Methods" and ref. 10), and calculate the corresponding walk lengths. Here we assume unit step size under application of either a constant $\hat{\sigma}_z$ or $\hat{\sigma}_x$ unitary error process (Fig. 4a, b) such that $\|\vec{R}\|^2 = \delta^2 \|\vec{V}\|^2$, and plot $\|\vec{V}_{2D}\|^2$ as a proxy for projected sequence error vs. J. We overlay the results on the calculated probability distribution of unit-step walks for RB sequences, presented as a colour scale for comparison. Points appear clumped due to the experimental GST prescription using different fiducials (leading to different sequence lengths) surrounding a reported germ, as highlighted in Fig. 4b.
Examining these data indicates that GST sequences used in the default package broadly sample the range of expected fidelities in the presence of strongly correlated $\hat{\sigma}_x$ errors, more effectively so than RB. However, their structure appears to systematically suppress measured errors in the presence of correlated $\hat{\sigma}_z$ errors. This mimics the positive skew of RB sequence survival probabilities in the presence of slowly varying noise, as observed in the colour scale. In the presence of correlated $\hat{\sigma}_z$ errors, only GST sequences consisting of repeated $G_I$ germs, formally equivalent to Ramsey experiments, 21 show sensitivity to this kind of error. We now explore the impact of these observations in further detail by both numerical investigations and experiments involving engineered unitary $\hat{\sigma}_z$ errors.
Given measurement outcomes (experimental or simulated) for the prescribed sequences, the open-source analysis package pyGSTi 22 is used to extract a large set of results characterising the performance of the gate set. One important metric calculated by the protocol for each gate is the diamond distance, $\|G_{\mathrm{ideal}} - G\|_\diamond$, which is meant to provide a worst-case bound on the distance to the ideal gate operation. Experimental GST has found wide adoption in part because of its ability to calculate this metric, which is postulated to be important for formal analyses of fault tolerance in the context of quantum error correction.
In our first test, we numerically probe the sensitivity of the experimental GST analysis procedure to correlated error using the aforementioned pyGSTi toolkit. We introduce constant $\hat{\sigma}_x$, $\hat{\sigma}_y$, or $\hat{\sigma}_z$ errors via concurrent unitary rotations added to the formerly ideal operations. Therefore the exact mathematical representation of each gate ($G_{I,x,y}$) is known from analytical transformations and we have two paths to evaluate gate performance (Fig. 4c). First, we directly calculate the diamond distance $\|G_{\mathrm{ideal}} - G_{\mathrm{err}}\|_\diamond$ using the matrix representation of $G_{\mathrm{err}}$, maintaining the initial frame of reference. Second, we estimate it by employing pyGSTi to simulate data using $G_{\mathrm{err}}$ and determine the diamond distance of the estimate $G^{(\mathrm{est})}_{\mathrm{err}}$ obtained by the toolkit's fitting routines.
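The first of these two paths, the direct calculation, can be illustrated without any GST software when both gates are unitary. For two single-qubit unitaries U and V the diamond distance has the closed form $\|U - V\|_\diamond = 2\sqrt{1 - |\mathrm{Tr}(U^\dagger V)/2|^2}$. The sketch below uses this result and returns half of it, so values lie between 0 and 1; whether a given toolkit reports the full or the halved norm should be checked against its documentation. The error model (a detuning applied concurrently with a π/2 drive) and all function names are assumptions made for illustration and do not reproduce pyGSTi's routines.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

def su2(hx, hy, hz):
    """exp(-i (hx SX + hy SY + hz SZ)) from the closed form for Pauli vectors."""
    h = np.sqrt(hx ** 2 + hy ** 2 + hz ** 2)
    if h == 0:
        return I2.copy()
    return np.cos(h) * I2 - 1j * np.sin(h) / h * (hx * SX + hy * SY + hz * SZ)

def half_diamond_unitary(U, V):
    """(1/2) ||U - V||_diamond for two single-qubit unitary channels."""
    t = abs(np.trace(U.conj().T @ V)) / 2
    return np.sqrt(max(0.0, 1.0 - t ** 2))

def gx_with_detuning(delta):
    """pi/2 x-rotation with concurrent detuning delta = Delta/Omega (assumed model)."""
    return su2(np.pi / 4, 0.0, delta * np.pi / 4)

G_ideal = su2(np.pi / 4, 0.0, 0.0)
d_small = half_diamond_unitary(G_ideal, gx_with_detuning(0.01))
d_large = half_diamond_unitary(G_ideal, gx_with_detuning(0.02))
```

In this model the distance grows linearly with the detuning for small δ, approximately as δ/√2, because the dominant effect is a tilt of the rotation axis rather than a change in rotation angle.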
As a self-consistent QCVV implementation, the experimental GST estimation procedure incorporates a gauge optimisation by construction, as it makes no assumptions in regard to the qubit and its measurement basis. It performs two rounds of gauge optimisation, allowing identification of a frame in which to minimise the distance of the entire set of estimated gates in relation to the target gates. The relevance of this gauge freedom on RB-derived estimates of gate performance was highlighted recently in ref. 23. To illustrate how gauge freedom affects the results, we separately calculate the diamond distance with and without gauge optimising our analytic gate set $G_{\mathrm{err}}$ using routines included in the pyGSTi toolkit.
Fig. 3 a, b Schematic representations of long (a) and short (b) walks in 3D (coloured lines) and 2D (black lines), defined relative to a limit deduced from diffusive behaviour, as indicated by the blue circle. c Noise-averaged fidelity distributions of the same sequences as a function of walk length in the 2D plane. Measured infidelity vs. 2D walk length, $\|\vec{V}_{2D}\|^2$, when subject to slowly varying (grey) and rapidly varying (red) noise with linear fit overlaid. The slope of this fit is $(0.8 \pm 1) \times 10^{-5}$, consistent with zero. d RB using long-walk sequences. Solid red line corresponds to RB performed using 20 long-walk sequences and rapidly varying noise. Extracted $p^{(\mathrm{LW})}_{\mathrm{RB}}$ matches that extracted under the same conditions using an unbiased sampling of all sequences (dashed line). Grey line corresponds to RB using the same long-walk sequences and slowly varying noise. For the exponential fits, state-preparation and measurement error, κ, is fixed to $3 \times 10^{-3}$.
We plot the calculated and estimated diamond norms for $G_{I,x,y}$, subject to processes similar to either a constant overrotation (i.e. proportional to $\hat{\sigma}_x$ or $\hat{\sigma}_y$ depending on the gate in question, with no error on $G_I$ operations), or a constant detuning error (i.e. proportional to $\hat{\sigma}_z$), as shown in Fig. 4d, e. Here we see that the estimated diamond distance for operators $G_{I,x,y}$ closely matches the calculated value in the presence of numerical overrotation errors. When used with its standard gate set $\{G_x, G_y, G_I\}$, pyGSTi's estimate of $G_x$ and $G_y$ errors arising from constant unitary $\hat{\sigma}_z$ errors differs significantly, however, and only the diamond norm estimate for $G_I$ appears similar to the directly calculated value. Other estimated quantities such as process infidelity and the associated Choi matrices are affected in a similar way (see Supplementary Material).
However, performing gauge optimisation on the analytically calculated matrices $G_{\mathrm{err}}$ as well (within the pyGSTi package) reduces the difference in the reported diamond distance for $\hat{\sigma}_z$ errors, and produces agreement with the much lower $G_{x,y}$ diamond distance reported by the GST estimation procedure (Fig. 4e). Among the error models we have tested, for this gate set such behaviour is only manifested in the presence of temporally correlated $\hat{\sigma}_z$ errors and does not appear using various other error processes built into the pyGSTi analysis package (see Supplementary Material for details). We note that full gauge optimisation is a requirement for self-consistency of results within GST.
To further investigate the influence of the gauge degree of freedom, we repeat our numerical analysis under the application of identical unitary errors, but extend the gate set by adding negative rotations $-G_x$, $-G_y$, corresponding to $-\pi/2$ rotations about $\hat{\sigma}_x$ and $\hat{\sigma}_y$, and by incorporating a number of associated compound germs (Fig. 4f, g). The resulting gauge-optimised calculated and estimated diamond-distance values now increase, moving closer to the analytic calculation obtained without gauge optimisation. The behaviour of the estimated diamond distances for operations $-G_x$ and $-G_y$ is indistinguishable from those presented, to within numerical uncertainty. This simple change in the gate set directly reveals the role of gauge optimisation in the discrepancies we noted above. The additional information now available to experimental GST via the extended gate and germ set effectively constrains the optimisation procedure, allowing it to detect errors that could previously be absorbed in a gauge transformation.
We follow up on these numerical investigations by performing experiments using experimental GST sequences subjected to engineered unitary $\hat{\sigma}_z$ errors of varying strength. As before, we generate an operation with known error magnitude and form, allowing us to directly produce a matrix representation for the gate and hence calculate the diamond distance for the (deliberately) imperfect gates we apply to our trapped-ion system. Again the experimental GST procedure produces an estimate of the diamond distance that matches the calculation for $G_I$, but yields estimates of the diamond distance from experimental data approximately an order of magnitude below the (unoptimised) calculated value for $G_{x,y}$ (Fig. 4h). Allowing gauge optimisation on the calculated diamond distance changes its scaling with error magnitude, as in the simulations above. We do not find strong agreement between data for $G_{x,y}$ and this gauge-optimised scaling, but cannot exclude the possibility that other finite sampling effects may cause saturation of small reported diamond distances.
In addition to the cases presented above we have also performed experimental GST with a wide variety of engineered, time-varying errors. These include detuning and amplitude noise exhibiting 50 Hz fluctuations and slow drifts (i.e. varying in time during individual sequences), constant overrotations, and added state-preparation and measurement (SPAM) errors. While these do not form part of this manuscript, they might help inform further studies by other authors in the future. All data sets, corresponding pyGSTi analysis files and resultant reports are included as part of the Supplementary Material.

DISCUSSION
In our studies we have employed a simple analytic framework, a formalism mapping noise to error accumulation in sequences of Clifford operations, to explore the sensitivity of RB and GST to slowly varying noise processes. Theoretical predictions derived from this framework match RB experiments employing engineered noise with known characteristics: either slowly varying or rapidly varying on the sequence timescale. This highlights the utility of the random-walk analysis in determining sequence-dependent sensitivities of QCVV protocols in the presence of temporally correlated noise.
We have compared RB survival probabilities over sequences to a gamma distribution $\Gamma(\alpha = 1, \beta)$, where $\beta$ is determined by the type of error model employed in the experiment, and shown good agreement using no free parameters. In addition we have demonstrated that in the presence of slowly varying noise, the mode of the distribution of survival probabilities over sequences is shifted towards lower error rates than the mean, and that a long tail of high-error outcomes appears, as predicted in ref. 17. Overall, the experiments reported here give a clear experimental signature of the violation of the assumption that errors between gates are independent. While we do not claim that the features we observe are in general uniquely derived from this interpretation, we hope these results may help experimentalists seeking to interpret complex RB data sets. We believe that more detailed reporting of RB outcomes, including the publication of distributions of the survival probability $P$ as well as the sequences employed, will facilitate more meaningful comparisons between RB data sets derived from different physical systems, as the relevance of $p_{\mathrm{RB}}$ is diminished when error processes exhibit temporal correlations.
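The origin of a skewed, long-tailed distribution can be sketched with a toy model. It assumes, purely for illustration, that under quasi-DC noise each gate contributes a unit step along ±x or ±y in a 2D plane and that the sequence infidelity is proportional to the squared length of the resulting walk; the step rule and all names are hypothetical simplifications rather than the mapping used in the analysis. For long sequences the two walk components are approximately Gaussian, so the squared length is approximately exponentially distributed, which is a gamma distribution with shape parameter 1.

```python
import random
import statistics

# Assumed step set: unit steps along the +/- x and +/- y directions.
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def squared_walk_length(num_gates, rng):
    """Squared end-to-end length of one random 2D walk of num_gates steps."""
    x = y = 0
    for _ in range(num_gates):
        dx, dy = rng.choice(STEPS)
        x += dx
        y += dy
    return x * x + y * y

def sample_squared_lengths(num_sequences, num_gates, seed=0):
    """Squared walk lengths for an ensemble of random sequences."""
    rng = random.Random(seed)
    return [squared_walk_length(num_gates, rng) for _ in range(num_sequences)]

NUM_GATES = 100
samples = sample_squared_lengths(5000, NUM_GATES)
mean = statistics.mean(samples)
median = statistics.median(samples)
spread = statistics.pstdev(samples)

if __name__ == "__main__":
    # For an exponential distribution the standard deviation equals the mean
    # and the median lies below the mean.
    print(f"mean={mean:.1f} median={median:.1f} std={spread:.1f}")
```

In this toy model most sequences accumulate less error than the ensemble mean while a minority of long walks populate the high-error tail, which is qualitatively the pattern described above for slowly varying noise.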
Through a combination of analytic calculations, numerics, and experiments with engineered errors we have found a similar bias towards lower estimates of diamond distance in experimental GST procedures when using the standard $G_{I,x,y}$ gate set subjected to strongly correlated, unitary $\hat{\sigma}_z$ errors. The asymmetry we observe between the manifestation of correlated $\hat{\sigma}_x$/$\hat{\sigma}_y$ and $\hat{\sigma}_z$ error sensitivity has previously only been reported in the context of RB. 23 We have shown explicitly how the low diamond-distance estimates under this kind of noise are related to the gauge optimisation performed as part of the protocol; limiting the gauge freedom by extending the gate set under application of an identical error process dramatically changed the estimated diamond distance of the very same gates in numerical simulations. This highlights that estimates are always reported up to an implicit gauge degree of freedom, making absolute comparisons of diamond norms challenging.
These observations are commensurate with a simple physical interpretation of the effect of an optimised gauge transformation in the circumstances we examine. In the presence of correlated $\hat{\sigma}_z$ errors, when the gate set is limited to $G_{I,x,y}$ gates, the reconstructed operator includes an extra error component along the z-axis. The effect of gauge optimisation is to rotate the axis of rotation of the $G_x$ and $G_y$ operators back to the equatorial plane, effectively cancelling this error. Under this circumstance the magnitude of rotation of these gates is smaller than expected in a fixed lab frame, and the second-order nature of the residual errors results in a steeper gradient of the dotted line in Fig. 4e. In contrast, the $G_I$ operation should have no net rotation, and therefore this error will not be cancelled by a simple gauge transformation.
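The scaling argument can be made concrete with a simplified sketch. It assumes that the detuned gate has generator (θ/2)σ_x + (δ/2)σ_z for target angle θ (an illustrative parameterisation), and it restricts the gauge to a unitary change of frame, which can reorient a rotation axis but cannot alter a rotation angle. The angle mismatch is then a floor on the error that any such frame choice leaves on a single gate. The actual GST gauge optimisation acts jointly on the whole gate set and on state preparation and measurement, so this is not a reproduction of it; all function names are hypothetical.

```python
import math

def detuned_angle(target_angle, delta):
    """Rotation angle when a z-term delta adds in quadrature (assumed model)."""
    return math.hypot(target_angle, delta)

def fixed_frame_error(target_angle, delta):
    """|sin(phi/2)| for the relative rotation phi between ideal and detuned gate."""
    theta = detuned_angle(target_angle, delta)
    if theta == 0.0:
        return 0.0
    axis_overlap = target_angle / theta  # cosine of the axis tilt towards z
    overlap = (math.cos(target_angle / 2) * math.cos(theta / 2)
               + math.sin(target_angle / 2) * math.sin(theta / 2) * axis_overlap)
    return math.sqrt(max(0.0, 1.0 - overlap * overlap))

def frame_optimised_floor(target_angle, delta):
    """Error remaining once the axis tilt is removed: angle mismatch only."""
    theta = detuned_angle(target_angle, delta)
    return abs(math.sin((theta - target_angle) / 2))

def log_slope(error_fn, target_angle, d1=1e-3, d2=1e-2):
    """Power-law exponent of the error as a function of delta."""
    e1 = error_fn(target_angle, d1)
    e2 = error_fn(target_angle, d2)
    return (math.log(e2) - math.log(e1)) / (math.log(d2) - math.log(d1))

if __name__ == "__main__":
    for name, angle in (("G_x", math.pi / 2), ("G_I", 0.0)):
        print(name,
              round(log_slope(fixed_frame_error, angle), 3),
              round(log_slope(frame_optimised_floor, angle), 3))
```

Under these assumptions the π/2 gate changes from a first-order to a second-order dependence on δ once the axis tilt is removed, whereas the idle operation keeps its first-order error because its target is the identity, which no change of frame can alter.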
Gauge optimisation is designed to produce the best estimate for errors over the entire gate set in relation to a given target, and in a sense acts to "distribute" nominal errors over all constituent rotations in the gate set. The validity of such a gauge transformation in the presence of independent protocols for establishing a measurement basis remains an open question and has been highlighted recently by Rudnicki et al. 24 The variation of calculated and estimated diamond distances under correlated $\hat{\sigma}_z$ errors when subjected to seemingly small modifications of the default gate set has again not been reported previously in the context of GST, and indicates an important dependence of its output on the specific gate set employed, the characteristics of the underlying error source, and the gauge optimisation procedure.
Clearly the observed performance of experimental GST in the presence of correlated $\hat{\sigma}_x$ noise, such as that resulting from experimental overrotations, can make GST a valuable tool in debugging an experimental system, 25 although precise calibrations can also be carried out efficiently using a subset of the full experimental GST protocol. 26 The effect of gauge optimisation in the presence of $\hat{\sigma}_z$ errors and with use of the default gate set, however, is concerning, as a key implied benefit of experimental GST is its ability to provide a rigorous upper bound on gate errors using a fully self-contained analysis package. Recent experimental work 10 on the topic claimed such upper bounds on gate errors using experimental GST and compared these to the fault-tolerance threshold with high reported confidence and tight uncertainties. The results above and the observations made in ref. 24 suggest that there may be residual uncertainty in the interpretation of such data due to the potential unresolved conflict between full gauge freedom and the nominal existence of a measurement basis constraining that freedom. Furthermore, when acquiring and evaluating data, care must be taken to suppress any form of model violation reported by the GST toolkit in its likelihood analysis, as otherwise the extracted performance metrics may become unreliable. These deviations are currently not reflected in the uncertainties (i.e. error bars) calculated for those metrics by the toolkit, and discussions with its authors suggest that establishing a connection between the two is non-trivial.
In light of the investigations reported here, we believe that there is a need for greater awareness of the subtleties of the use of both RB and experimental GST in the presence of temporally correlated noise environments. In order to enhance the meaning and utility of reported results we advocate that QCVV benchmarks such as $p_{\mathrm{RB}}$ and experimental GST diamond distances should be reported together with a quantitative measure of deviation from a purely Markovian, temporally uncorrelated model. In the case of RB, this could be the difference between the extracted $p_{\mathrm{RB}}$ of long- and short-walk sequences; in experimental GST the deviation is already being reported as part of the routine, yet the question about the impact of gauge optimisation that we identified remains. Similarly, if using experimental GST as a standalone gate evaluation procedure, one cannot know a priori the form of the underlying noise, and hence any associated experimental GST insensitivities. Increasing the rigour of resultant upper bounds on diamond distances could require performing experimental GST using multiple different gate sets in order to identify potential "blind spots" owing to the implicitly required gauge transformations. Given the experimental overhead, however, this brute-force approach is not necessarily attractive, and further modifications to experimental GST could resolve the issue with considerably greater efficiency. Overall, we hope that these observations will assist in both the interpretation of QCVV experiments when model violation may occur, and the development of new techniques with improved rigour and efficiency for larger scale systems.

Experimental gate implementation
Quantum gates are implemented on a single $^{171}$Yb$^{+}$ ion by driving its qubit transition at 12.6 GHz with microwave pulses produced by a vector signal generator (VSG, model Keysight E8267D). The phase of the driving field is adjustable via I-Q modulation, allowing us to implement rotations around any axis lying in the xy-plane of the Bloch sphere. Rotations around the z-axis are carried out as frame updates, i.e. pre-calculated, instantaneous changes of the generator I-Q values. Identity operations are realised as idle periods, whereby no signals are applied for a time equivalent to that of a π or π/2 rotation. We additionally implement pulse modulation (RF blanking) to suppress transients in microwave power at pulse edges. In this way, we implement the full set of Clifford gates as listed in the Supplementary Material.
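The frame-update approach to z rotations can be sketched as simple phase bookkeeping. The class and method names below are hypothetical and do not correspond to the programming interface of the VSG; the sketch assumes unit-amplitude I-Q values (cos φ, sin φ) for a drive about an axis at angle φ in the xy-plane, and one particular sign convention for the frame offset.

```python
import math

class FrameTracker:
    """Tracks an accumulated phase offset so that z rotations cost no pulse time."""

    def __init__(self):
        self.frame_phase = 0.0  # accumulated z-rotation angle in radians

    def z_rotation(self, angle):
        """Apply a z rotation as a frame update (assumed sign convention)."""
        self.frame_phase = (self.frame_phase + angle) % (2 * math.pi)

    def iq_for_axis(self, axis_phase):
        """I-Q values for a drive about the xy-plane axis at angle axis_phase."""
        phi = axis_phase + self.frame_phase
        return math.cos(phi), math.sin(phi)

if __name__ == "__main__":
    tracker = FrameTracker()
    print(tracker.iq_for_axis(0.0))   # drive about x before any frame update
    tracker.z_rotation(math.pi / 2)   # virtual z rotation: no microwave pulse
    print(tracker.iq_for_axis(0.0))   # the same nominal axis now maps to shifted I-Q values
```

Because the update only changes the I-Q values of subsequent pulses, it takes no time on the qubit, consistent with the description of z rotations as instantaneous.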