Experimental quantum verification in the presence of temporally correlated noise

Growth in the complexity and capabilities of quantum information hardware mandates access to practical techniques for performance verification that function under realistic laboratory conditions. Here we experimentally characterise the impact of common temporally correlated noise processes on both randomised benchmarking (RB) and gate-set tomography (GST). We study these using an analytic toolkit based on a formalism mapping noise to errors for arbitrary sequences of unitary operations. This analysis highlights the role of sequence structure in enhancing or suppressing the sensitivity of quantum verification protocols to either slowly or rapidly varying noise, which we treat in the limiting cases of quasi-DC miscalibration and white noise power spectra. We perform experiments with a single trapped $^{171}$Yb$^{+}$ ion as a qubit and inject engineered noise ($\propto \sigma^z$) to probe protocol performance. Experiments on RB validate predictions that the distribution of measured fidelities over sequences is described by a gamma distribution varying between approximately Gaussian for rapidly varying noise, and a broad, highly skewed distribution for the slowly varying case. Similarly we find a strong gate set dependence of GST in the presence of correlated errors, leading to significant deviations between estimated and calculated diamond distances in the presence of correlated $\sigma^z$ errors. Numerical simulations demonstrate that expansion of the gate set to include negative rotations can suppress these discrepancies and increase reported diamond distances by orders of magnitude for the same error processes. Similar effects do not occur for correlated $\sigma^x$ or $\sigma^y$ errors or rapidly varying noise processes, highlighting the critical interplay of selected gate set and the gauge optimisation process on the meaning of the reported diamond norm in correlated noise environments.

Quantum characterisation, validation, and verification (QCVV) techniques are broadly used in the quantum information community in order to evaluate the performance of experimental hardware. A variety of techniques have emerged including randomised benchmarking (RB) [1,2], purity benchmarking [3], process tomography [4][5][6][7], adaptive methods [8], and gate-set tomography (GST) [9,10]. Each protocol has relative strengths and weaknesses; for instance, RB has low experimental overhead but only provides average information about gate performance, while process tomography provides more information at the cost of unfavourable scaling in measurement overhead [11]. Despite their differences, these protocols share the common theme that they were originally developed and mathematically formalised assuming that error processes are statistically independent and do not exhibit strong correlations in time [1,2,10].
Even in highly controlled laboratory environments there are a range of noise sources that, when applied to a qubit concurrent with logical gate operations, produce effective error models that diverge significantly from the assumptions underlying most QCVV protocols. For example, slow variations in ambient magnetic fields or drifts in amplifier gain can produce temporally correlated noise processes, often characterised through a power spectral density possessing large weight at low frequencies [12][13][14]. Moreover, these error processes may exhibit gate-dependent behavior. So far such processes have been largely ignored in experimental QCVV, with predominantly phenomenological attempts used to explain deviations from ideal outputs [15]. Understanding that such an approach is untenable when attempting to rigorously compare QCVV results to metrics relevant to quantum error correction has recently led to an expansion of theoretical activity in this space [16][17][18][19][20][21].
In this work our objectives are to experimentally characterise and explain the impact of temporally correlated noise processes on the outputs of QCVV protocols, and to identify potential modifications enabling users to improve the utility of the information returned. We perform QCVV experiments using a single trapped $^{171}$Yb$^{+}$ ion as a long-lived, high-stability qubit. Our study implements engineered frequency noise ($\propto \hat{\sigma}_z$) in the control system in order to study the impact of different temporal noise correlations on QCVV results. We apply noise in the two extremes, either quasi-DC offsets or noise with an effective white power spectrum, to approximate slowly and rapidly varying noise, respectively. Measurements reveal that QCVV outputs diverge significantly when subject to these different types of noise, highlighting potential circumstances where the information extracted from a given protocol may no longer accurately represent the true error processes experienced by individual gates. Our experiments are compared against analytic calculations linking the underlying structure of the QCVV sequences with the manifestation of specific characteristics associated with the presence of noise correlations.
We examine two common QCVV protocols in the experimental quantum information community: RB and GST. The construction of these protocols follows a similar pattern: a series of unitary quantum operations is applied to one or more qubits sequentially in time, followed by a projective measurement (Fig. 1a). Experimental measurements are acquired and combined, experimental parameters are changed according to some prescription (e.g. changing the sequence length, $J$), and further data are collected. The variation in QCVV protocols predominantly comes from the different constituent operations that are applied and the analysis techniques by which measurement results are post-processed to extract information.
In RB, sequences are constructed by concatenating unitary operations $U_l$ selected at random from the 24 Clifford operations $C_l$. The final operation in a sequence of length $J$ is selected to invert the net rotation, $U_J = \left(\prod_{l=1}^{J-1} C_l\right)^{-1}$, such that the sequence implements a net identity, $\prod_{l=1}^{J} C_l = \hat{I}$. In GST, by contrast, operations are selected deterministically according to a tabulated routine comprising specifically crafted sequences that are designed to maximise overall sensitivity to all detectable error types. These operations are constructed by concatenating so-called "germs", short sequences implementing predefined unitary rotations, which, in our case, are constructed from a subset of Clifford gates. The first and last unitaries $U_{1,J} \in \{F_\alpha, F_\beta\}$, termed the "fiducial" operations, effectively set the reference frame for state preparation and measurement (Fig. 1a); see Methods for further detail.
In our experiments we engineer noise in order to permit quantitative analysis of QCVV outputs under known conditions. We compare the outputs obtained from both RB and GST for two distinct noise-correlation regimes. In the first, the engineered noise is implemented as a quasi-DC miscalibration over the entire sequence, which is the extreme case of slowly varying noise and produces temporally correlated errors. In the second, the engineered noise is rapidly varying (yielding an approximately white power spectrum), which leads to errors that are uncorrelated between gates (Fig. 1b). We now introduce a framework for interpreting the impact of sequence structure and noise correlations on measurement outcomes to facilitate an analysis of our results.

A. Mapping noise to measured error in RB
The key analytic tool for our study is a formalism mapping an applied noise model to an output error for a given Clifford sequence, following a procedure derived in [18]. Error accumulation over a given Clifford sequence maps to a "random walk" in a three-dimensional vector space representing the action of sequential error unitaries in the operator space spanned by the Pauli operators, $\hat{\sigma}_{\{x,y,z\}}$ (Fig. 1c). For $\hat{\sigma}_z$ noise, the $l$th step of the walk is calculated by conjugating $\hat{\sigma}_z$ with the entire operator subsequence $K_{l-1} \equiv \prod_{q=1}^{l-1} U_q$ up to the $(l-1)$th gate, with multiplication performed from the left. This conjugation always results in a member of the Pauli group, allowing us to compactly write $P_l \equiv K_{l-1}^{\dagger} \hat{\sigma}_z K_{l-1} = \hat{r}_l \cdot \boldsymbol{\sigma}$, where $\boldsymbol{\sigma} = (\hat{\sigma}_x, \hat{\sigma}_y, \hat{\sigma}_z)$ and $\hat{r}_l \in \{\pm\hat{x}, \pm\hat{y}, \pm\hat{z}\}$. The direction of $P_l$ in Pauli space therefore maps to the Cartesian unit vector $\hat{r}_l$ associated with the $l$th step of a $J$-step walk $\mathbf{R} \equiv \sum_{l=1}^{J} \delta_l \hat{r}_l$. For our chosen error model, the step length, $\delta_l$, captures the integrated phase between the driving field and qubit during execution of the single gate $U_l$. In terms of experimental parameters, $\delta_l = \Delta/\Omega$, where $\Delta/\Omega$ is the detuning expressed in terms of the experimental Rabi frequency, $\Omega$ (see Methods).
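The walk construction described above can be sketched numerically. The following is a minimal illustration, not the authors' code: it assumes each Clifford is represented only by its $3\times3$ signed-permutation action on the Pauli axes (global phases dropped), treats every gate as one unit-length step (ignoring the gate-dependent durations discussed later), and all function names are hypothetical.

```python
import itertools
import random

IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))

def matmul3(a, b):
    """3x3 matrix product a @ b using plain tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )

def transpose3(m):
    """Transpose; for a rotation matrix this is also the inverse."""
    return tuple(tuple(m[j][i] for j in range(3)) for i in range(3))

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def clifford_rotations():
    """The 24 proper rotations that permute the Pauli axes up to sign."""
    rots = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1, -1), repeat=3):
            m = tuple(
                tuple(signs[i] if perm[i] == j else 0 for j in range(3))
                for i in range(3)
            )
            if det3(m) == 1:  # keep proper rotations only
                rots.append(m)
    return rots

def rb_sequence(J, rng):
    """J-1 random Cliffords followed by the gate inverting their product."""
    cliffords = clifford_rotations()
    seq = [rng.choice(cliffords) for _ in range(J - 1)]
    net = IDENTITY
    for u in seq:
        net = matmul3(u, net)  # later gates multiply from the left
    seq.append(transpose3(net))
    return seq

def pauli_walk_steps(seq, noise_axis=2):
    """Unit step r_l for each gate: the noise axis conjugated by K_{l-1}.

    With U sigma_j U^dagger = sum_i R_ij sigma_i, the operator
    K^dagger sigma_j K has components given by row j of R_K.
    """
    steps = []
    k = IDENTITY
    for u in seq:
        steps.append(k[noise_axis])
        k = matmul3(u, k)
    return steps

def net_walk(steps):
    """Net unit-step walk vector V = sum_l r_l."""
    return tuple(sum(s[i] for s in steps) for i in range(3))

rng = random.Random(0)
seq = rb_sequence(100, rng)
V = net_walk(pauli_walk_steps(seq))
v2d_sq = V[0] ** 2 + V[1] ** 2  # squared length of the xy-projection
print("V =", V, " |V_2D|^2 =", v2d_sq)
```

Passing `noise_axis=0` or `noise_axis=1` gives the corresponding walk for $\hat{\sigma}_x$ or $\hat{\sigma}_y$ noise under the same assumptions.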
The overall form of the walk is a statistical measure of how the sequence itself interacts with the noise process to produce a net, measurable accumulation of error. Sequences that are highly susceptible to error accumulation produce walks that migrate far from the origin, while sequences exhibiting error suppression produce walks that meander back towards the origin. The net walk length is captured in the mean-squared distance from the origin, $\langle \|\mathbf{R}\|^2 \rangle$, averaged over noise realisations. This links to the "trace fidelity", defined as $F_{\mathrm{trace}} = \left|\mathrm{Tr}\left(\prod_{l=1}^{J} \tilde{U}_l\right)\right|^2/4$, where $\tilde{U}_l$ are unitary operations modified to take into account the effect of the $\hat{\sigma}_z$ noise. We then define the infidelity $I_{\mathrm{trace}} = 1 - F_{\mathrm{trace}} \approx \langle \|\mathbf{R}\|^2 \rangle$. Appropriately linking this picture of error accumulation to standard laboratory measurements requires consideration of the measurement routine itself. In typical measurements the qubit Bloch vector at the end of the sequence is projected onto the quantisation axis, $z$, with basis states $|0\rangle$ and $|1\rangle$. A measurement of this type is therefore insensitive to net rotations around that axis of the Bloch sphere, meaning that it only probes a 2D projection of the 3D walk onto the $xy$-plane. Our preferred metric is the survival probability, $F_{\mathrm{survival}}$, which may be linked directly to such a 2D projection (grey line, Fig. 1c), $F_{\mathrm{survival}} \approx 1 - \left(\langle \|\mathbf{R}\|^2 \rangle - \langle R_z^2 \rangle\right)$, where $\langle R_z^2 \rangle$ is the mean-squared walk length along the quantisation axis (see Supplementary Material for details). As all of our measurements are simply of the survival probability, we henceforth drop the subscripts for $F$ and $I$.
At this stage we must link the correlation properties of the noise to the form of the walk for a specific sequence. Considering only the underlying properties of the sequence, we may assume unit-length steps, resulting in a deterministic sequence-dependent walk $\mathbf{V} \equiv \sum_{l=1}^{J} \hat{r}_l$. The presence or absence of temporal noise correlations is now captured through a rescaling of the individual steps in the deterministic walk for a specific sequence. In the case of slowly varying noise, and to first-order approximation, the net error can be separated into two independent parts, $\langle \|\mathbf{R}\|^2 \rangle = \langle \delta^2 \rangle \|\mathbf{V}\|^2$, where $\delta$ is the value of the noise and $\mathbf{V}$ is the net unit-step walk specific to a particular sequence [18]. However, in the case of rapidly varying noise these two terms are no longer separable and the net error must be calculated as the convolution of the noise value at each timestep and each individual step in the random walk, $\langle \|\mathbf{R}\|^2 \rangle = \left\langle \left\| \sum_{l=1}^{J} \delta_l \hat{r}_l \right\|^2 \right\rangle$.
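As a worked illustration of the two limits, the noise average can be evaluated explicitly under simple assumptions that are ours rather than taken from the source: zero-mean step lengths with variance $\sigma^2$, unit vectors $\hat{r}_l$, and, in the rapidly varying case, no correlation between different gates.

```latex
% Assumptions (for illustration): <delta_l> = 0, <delta_l^2> = sigma^2, |r_l| = 1.
\begin{align}
\text{slowly varying } (\delta_l = \delta \ \text{for all } l):\quad
  \langle \|\mathbf{R}\|^2 \rangle
  &= \Big\langle \Big\| \delta \sum_{l=1}^{J} \hat{r}_l \Big\|^2 \Big\rangle
   = \langle \delta^2 \rangle \, \|\mathbf{V}\|^2
   = \sigma^2 \|\mathbf{V}\|^2 , \\
\text{rapidly varying } (\langle \delta_l \delta_m \rangle = 0 \ \text{for } l \neq m):\quad
  \langle \|\mathbf{R}\|^2 \rangle
  &= \sum_{l,m=1}^{J} \langle \delta_l \delta_m \rangle \, \hat{r}_l \cdot \hat{r}_m
   = \sigma^2 \sum_{l=1}^{J} \|\hat{r}_l\|^2
   = J \sigma^2 .
\end{align}
```

In this sketch the slowly varying result retains the sequence-specific factor $\|\mathbf{V}\|^2$, whereas the rapidly varying result depends only on $J$. If the steps of a random Clifford sequence are further assumed to be uncorrelated in direction, the sequence average of $\|\mathbf{V}\|^2$ is $J$, so the two cases share the same mean while differing in their spread over sequences.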

B. Experimental platform and engineered noise
We perform experiments using the hyperfine qubit in a single trapped $^{171}$Yb$^{+}$ ion driven by microwaves near 12.64 GHz, with basis states $|0\rangle \equiv {}^{2}S_{1/2}\,|F = 0, m_F = 0\rangle$ and $|1\rangle \equiv {}^{2}S_{1/2}\,|F = 1, m_F = 0\rangle$. Our calibration process permits accurate determination of the (first-order magnetic-field-insensitive) qubit transition frequency to within approximately 1 Hz. In our laboratory, this qubit and the associated control system have been demonstrated to possess a coherence time of $T_2 \sim 1$ s, measurement fidelity of $\sim 99.7\%$ limited by photon collection efficiency, and error rates from intrinsic system noise of $p_{RB} \approx 6 \times 10^{-5}$ using "baseline" RB experiments (see Supplementary Figures). Details of the control system and experimental protocols for QCVV techniques used here are presented in the Methods, and information about various detection procedures in use for estimating $F_{\mathrm{survival}}$ (including a Bayesian method) is found in the Supplementary Materials. We engineer $\hat{\sigma}_z$ noise applied concurrently with Clifford operations through the application of a detuning, $\Delta$, of the qubit driving field from resonance using an externally modulated vector signal generator (see Methods). As the detuning is applied concurrently with driven qubit rotations about the $x$ and $y$ axes, rotation errors arise along multiple directions on the Bloch sphere, rather than being purely $\hat{\sigma}_z$ in character. An additional violation of typical assumptions employed in RB is that different Clifford gates are physically decomposed into base rotations with different durations, which means that our formal error model will also be gate-dependent [21].
FIG. 1. QCVV sequence construction and mapping to accumulated error. a Overview of unitary sequence construction for RB and GST, using Clifford gates, $C_l$, or fiducial operations, $F_{\alpha,\beta}$, and repeated germs, $(G)^n$, respectively. b Schematic representation of slowly and rapidly varying noise with relevant time scales defined by the sequence, where $\delta$ represents the instantaneous noise values drawn from a normal distribution with variance $\sigma^2$. Grey lines are other possible noise realisations. For RB, the noise is sampled from this distribution and varies shot-to-shot between noise realisations, while in GST a single value is selected for the entire set of experiments. c Sequence-dependent "random walk" calculated for an arbitrary QCVV sequence (here according to the RB prescription) with $J = 100$ in Pauli space. Green dot indicates origin and black triangle indicates sequence terminus. Blue line represents the 3D walk, which can be used to calculate the trace infidelity, while grey represents the 2D projection, which is measurable in a standard projective measurement. The green arrow indicates the net walk vector, $\mathbf{V}_{2D}$, given unit step size.
For each of our two limiting noise cases we engineer $N$ different noise "realisations" in order to average over an appropriate ensemble. In our experiments we set the distribution of noise $\Delta/\Omega \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is the variance of the distribution, such that the root-mean-square value is approximately equivalent in both cases once averaged over all noise realisations. The specific implementation of noise engineering and its impact on the conduct of RB and GST is described in the Methods, and additional details on the error model are provided in the Supplementary Materials.
Experiments involve state preparation in the $|0\rangle$ state, application of a unitary sequence appropriate for a QCVV protocol while subject to noise, and projective measurement of the qubit along the quantisation axis. The sequence of operations applied and the measurement procedure are determined by the protocol in use.

C. RB fidelity distributions
In the limit of rapidly varying noise, all sequences of randomly ordered Clifford gates with length $J$ are equivalent under noise averaging, and all sequence survival probabilities tend towards the mean. Recent theoretical studies have demonstrated that measurements on RB sequences in the presence of temporal noise correlations can produce a divergence between average and worst-case reported trace fidelities [18,22]. Thus we find that measurement outcomes for different RB sequences are characterised by distributions with distinctly different shapes depending on the temporal correlations in the noise. The standard practice of combining all measurements to extract an RB error rate, $p_{RB}$, from the decay of the mean over all $J$-gate sequences as a function of $J$ results in a global ensemble average and does not take advantage of this information (formally, as the noise we implement exhibits temporal correlations, the value of $p_{RB}$ one extracts may not be meaningful as a measure of average Clifford gate error). Our analysis takes advantage of the additional information that is always present in an RB experiment in order to evaluate the impact of noise correlations and deduce useful information about the underlying error process.
In our experimental study we measure the noise-averaged survival probabilities for a set of sequences $\{\eta_i\}_J$, indexed by $i$ and of length $J$, for different lengths $25 \le J \le 200$ (Fig. 2a), where we implement the same set of $J$-gate sequences under application of either slowly or rapidly varying detuning noise. For an arbitrary individual sequence, $\eta_i$, and a single noise realisation, $n$, we perform $r$ nominally identical repetitions of the experiment. We combine the information from the outcomes of these individual repetitions to produce a maximum-likelihood estimate of survival probability, $F_{i,n}$ (see Supplementary Materials). The use of multiple repetitions under identical conditions reduces quantum projection noise in the qubit measurement and assists in isolating specific quantitative contributions to the distribution of survival probabilities, though this is not possible without noise engineering. In general, we average measured outcomes over a fixed number of noise realisations to yield $F_{i,\langle\cdot\rangle}$ for a fixed sequence $\eta_i$. From here on, we will refer to this noise-averaged survival probability as $F$.
In the case of rapidly varying noise we observe that the distribution of sequence outcomes is symmetrically spread around the sequence-averaged mean survival probability, $F(J)$, and the entire distribution shifts away from zero error with increasing $J$ (Fig. 2a). The presence of slowly varying noise, by contrast, produces a broad distribution of measured $F$ over each set $\{\eta_i\}_J$, demonstrating a positively skewed set of outcomes and the persistence of a long tail at higher error rates (lower survival probabilities). In this case, as $J$ increases the distribution broadens but remains skewed. Under both noise correlation cases, the measured $F(J)$ remain approximately the same. The differences in the distribution of measured survival probabilities over sequences under these two noise models reproduce the central predictions of Ref. [18].
We compare the characteristics of the distributions themselves against analytic predictions for both slowly and rapidly varying noise, beginning with the measured expectation, $\mathbb{E}(I)$, and variance, $\mathbb{V}(I)$ (Fig. 2b-c), finding good agreement by taking only the applied noise strength as an input into a theoretical model (see Supplementary Materials). More specifically, theoretical predictions suggest that the distribution of outcomes under both noise models (as well as intermediate models described by coloured power spectra) should be well described by a gamma distribution [18]. The general gamma distribution probability density function is given by $f(x; \alpha, \beta) = \dfrac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\, \beta^{\alpha}}$, where $\alpha$ and $\beta$ are the shape and scale parameters and $\Gamma(x)$ is the gamma function. The form of the gamma distribution will vary significantly between the limiting noise cases treated here, tending towards a symmetric Gaussian for rapidly varying noise and a broader positively skewed distribution in the presence of slowly varying noise, as determined by the values of $\alpha$ and $\beta$. Figures 2d-g show histograms of RB sequence survival probabilities in the presence of the extreme case of slowly varying noise, quasi-DC miscalibration. We overlay gamma distributions calculated from first principles using no free parameters (black lines) as $\Gamma(1,\, 2J\sigma^2/3\,(1/2+\pi^2/96))$, and fixing $\alpha = 1$ while allowing $\beta$ to vary as a fit parameter (green lines). The theoretical prediction captures both the measured skew towards high survival probabilities and the approximate "length" of the tail at low survival probabilities. We believe that residual disagreement between data and first-principles calculations arises due to both limited sequence sampling and contributions from higher-order analytic error terms when the approximation $J\sigma^2 \ll 1$ is no longer valid. Importantly, data and theory show the mode of the distribution is close to unity survival probability ($I = 0$) and therefore corresponds to a lower error than the mean.
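The qualitative behaviour of the gamma distribution in the two limits can be checked with a short numerical sketch. This is an illustration using the standard shape-scale parametrisation and its textbook moments; the function names and the example parameter values are hypothetical and are not taken from the experiments.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and scale beta, evaluated at x > 0."""
    if x <= 0:
        raise ValueError("this sketch only evaluates the density for x > 0")
    log_pdf = ((alpha - 1.0) * math.log(x) - x / beta
               - math.lgamma(alpha) - alpha * math.log(beta))
    return math.exp(log_pdf)

def gamma_summary(alpha, beta):
    """Mean, variance, skewness and mode of the gamma distribution."""
    return {
        "mean": alpha * beta,
        "variance": alpha * beta ** 2,
        "skewness": 2.0 / math.sqrt(alpha),
        # The mode sits at zero infidelity whenever alpha <= 1.
        "mode": (alpha - 1.0) * beta if alpha >= 1.0 else 0.0,
    }

# alpha = 1 (slowly varying case in the text): exponential shape,
# mode at I = 0 and mean at beta, hence a long high-error tail.
print(gamma_summary(1.0, 0.05))

# Large alpha at the same mean: skewness shrinks and the shape
# approaches a Gaussian, as described for rapidly varying noise.
print(gamma_summary(100.0, 0.0005))
```

With $\alpha = 1$ the mode lies at $I = 0$ while the mean equals $\beta$, which is the sense in which the most likely sequence outcome has lower error than the average.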
For details on modifications to the theory presented in [18] accounting for the specific noise and gate-dependent error model employed in our experiments, contributions from higher-order terms, and expanded data sets including larger sequence numbers, see Supplementary Material.

D. Modification of RB for identification of model violation
The fact that the distribution of sequence survival probabilities under slowly varying noise does not converge to the mean indicates sequence dependence in the resulting error accumulation. The emergence of this phenomenology is elucidated through an examination of the walks for different sequences. Under this type of noise certain sequences possess walks with large $\|\mathbf{V}_{2D}\|^2$, hence amplifying the accumulation of error, while others tend back towards the origin and show reduced accumulated error (Fig. 3a-b). We arbitrarily classify sequences as "long-walk" if they possess a 2D projection beyond the diffusive mean-squared limit for an unbiased random walk, $\|\mathbf{V}_{2D}\|^2 > \tfrac{2}{3}J$. We link the sequence walk in Pauli space to the noise-averaged survival probability by displaying the experimentally measured $I$ for sequences of fixed length $J = 200$ against the calculated 2D walk length, $\|\mathbf{V}_{2D}\|^2$ (Fig. 3c). Data are presented for both rapidly varying (red open markers) and slowly varying (grey solid markers) noise, where the same set of sequences is used between the noise models. Measurements for rapidly varying noise are fit with a line possessing slope approximately consistent with zero, while for the same sequences under slowly varying noise, the measurements show a positive dependence on $\|\mathbf{V}_{2D}\|^2$ as expected. We believe the significant scatter in the plot is partially due to a concurrently acting noise source and higher-order contributions to error, neither of which are incorporated in the first-principles calculation of the walk, $\|\mathbf{V}_{2D}\|^2$ (see Supplementary Materials and Appendix C of [18]). Nonetheless, the effect of sequence structure on measured survival probability is clearly visible for the case of slowly varying noise.
In aggregate, this phenomenology gives rise to the skewed gamma distribution under slowly varying noise described above, and the convergence of all noise-averaged survival probabilities for individual sequences to the ensemble average when the noise is rapidly varying. However, pre-selection of RB sequences possessing large calculated unit-step walks also provides a mechanism to both identify the presence of temporally correlated errors and extract an RB outcome that more closely approximates worst-case errors. In Fig. 3d we plot $I$ vs. $J$ for a subset of sequences preselected to possess long walks as in Fig. 3a-b, whose survival probabilities we denote $F_{LW}(J)$. We base the preselection of long walks on the condition $\|\mathbf{V}_{2D}\|^2 > 2 \times \tfrac{2}{3}J$. When these long-walk sequences are subjected to rapidly varying noise, the distribution of survival probabilities over sequences remains approximately Gaussian about the mean, and the expectation value over this subset closely approximates the expectation value over an unbiased random sampling of the $24^J$ possible $J$-gate sequences, $F^{\mathrm{rapid}}_{LW} \approx F^{\mathrm{rapid}}$ (Fig. 3d, red solid line and blue dashed line). However, in the presence of slowly varying noise we observe a larger error rate than that achieved with unbiased sampling, $F^{\mathrm{slow}}_{LW} > F^{\mathrm{slow}}$.
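The pre-selection step can be sketched as a simple numerical filter. The example below is a hedged illustration rather than the procedure used in the experiments: as a stand-in for true Clifford sequences it models each unit step as an independent, uniformly random choice among $\pm\hat{x}, \pm\hat{y}, \pm\hat{z}$, and the helper names and the `factor` parameter are hypothetical.

```python
import random

DIRECTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
              (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def random_unit_walk(J, rng):
    """Stand-in for the unit-step walk of a random J-gate sequence."""
    return [rng.choice(DIRECTIONS) for _ in range(J)]

def walk_length_2d_sq(steps):
    """Squared length of the net walk projected onto the xy-plane."""
    vx = sum(s[0] for s in steps)
    vy = sum(s[1] for s in steps)
    return vx * vx + vy * vy

def select_long_walks(walks, factor=2.0):
    """Keep walks exceeding factor * (2/3) * J, the threshold in the text."""
    selected = []
    for steps in walks:
        threshold = factor * (2.0 / 3.0) * len(steps)
        if walk_length_2d_sq(steps) > threshold:
            selected.append(steps)
    return selected

rng = random.Random(0)
candidates = [random_unit_walk(200, rng) for _ in range(2000)]
long_walks = select_long_walks(candidates)
mean_sq = sum(walk_length_2d_sq(w) for w in candidates) / len(candidates)
print("mean |V_2D|^2 over candidates:", mean_sq)
print("number of long-walk sequences kept:", len(long_walks))
```

Because the filter depends only on the calculated walk, it can be run entirely in advance of any measurement; replacing `random_unit_walk` with walks computed from actual gate sequences would be needed for a faithful implementation.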
The difference between the sequence-averaged expectation values in these noise cases arises solely because of the intrinsic properties of the sequences in use.
Extracting an RB gate-error rate, $p$ [...] employed, and the threshold value of $\|\mathbf{V}_{2D}\|^2$ used to define a "long walk" (Fig. 3c). This approach effectively constitutes construction of an RB protocol that increases the reported error rate by enhancing sensitivity to a particular noise type, which in our case is $\propto \hat{\sigma}_z$. Alternative sequences may also be calculated that are more sensitive to $\hat{\sigma}_x$ or $\hat{\sigma}_y$ noise than randomly selected RB sequences. These error-enhancing sequences give a clear, qualitative signature of the violation of the assumption that the error process is uncorrelated in time, although we do not claim that such a signature is in general uniquely associated with the presence of temporal noise correlations. Furthermore, because calculation of $\|\mathbf{V}_{2D}\|^2$ and sequence pre-selection is performed numerically in advance, this approach alleviates the requirement to average extensively in experiment over sequences in order to reveal the skewed fidelity distribution.
FIG. 3. [...] length walks under slowly varying noise in 3D (coloured lines) and 2D (black lines), defined relative to a limit deduced from diffusive behaviour, as indicated by the blue circle. c) Noise-averaged fidelity distributions of the same sequences as a function of walk length in the 2D plane. Measured infidelity vs. 2D walk length, $\|\mathbf{V}_{2D}\|^2$, when subject to slowly varying (grey) and rapidly varying (red) noise with linear fit overlaid. The slope of this fit is $(0.8 \pm 1) \times 10^{-5}$, consistent with zero. d) RB using long-walk sequences. Solid red line corresponds to RB performed using 20 long-walk sequences and rapidly varying noise. Extracted $p^{(LW)}_{RB}$ matches that extracted under the same conditions using an unbiased sampling of all sequences (dashed line). Grey line corresponds to RB using the same long-walk sequences and slowly varying noise. For the exponential fits, state-preparation and measurement error, $\kappa$, is fixed to $3 \times 10^{-3}$.

E. GST in the presence of correlated noise
We now apply the sequence-dependent Pauli walk framework to GST in order to understand the interplay of sequence structure and temporal noise correlations in the GST estimation procedures. We begin by collating all standard GST sequences up to 256 gates in length using gates $G_I \equiv \hat{I}$, the identity, $G_x$, a $\pi/2$ $\hat{\sigma}_x$ rotation, and $G_y$, defined similarly. We define sequences to include fiducial operations and germs (see Methods and Ref. [23]), and calculate the corresponding walk lengths. Here we assume unit step size under application of either a quasi-DC $\hat{\sigma}_z$ or $\hat{\sigma}_x$ unitary error process (Fig. 4a, b) such that $\langle \|\mathbf{R}\|^2 \rangle = \langle \delta^2 \rangle \|\mathbf{V}\|^2$, and plot $\|\mathbf{V}_{2D}\|^2$ as a proxy for projected sequence error vs. $J$. We overlay the results on the calculated probability distribution of unit-step walks for RB sequences, presented as a colour scale for comparison. Points appear clumped due to the GST prescription using different fiducials (leading to different sequence lengths) surrounding a reported germ, as highlighted in Fig. 4b.
Examining these data indicates that GST sequences broadly sample the range of expected fidelities in the presence of strongly correlated $\hat{\sigma}_x$ errors, more effectively so than RB. However, their structure appears to systematically suppress measured errors in the presence of correlated $\hat{\sigma}_z$ errors. This mimics the positive skew of RB sequence survival probabilities in the presence of slowly varying noise, as observed in the colour scale. In the presence of correlated $\hat{\sigma}_z$ errors, only GST sequences consisting of repeated $G_I$ germs, formally equivalent to Ramsey experiments [24], show sensitivity to this kind of error. We now explore the impact of these observations in further detail by both numerical investigations and experiments involving engineered unitary $\hat{\sigma}_z$ errors.
Given measurement outcomes (experimental or simulated) for the prescribed sequences, the open-source analysis package pyGSTi [25] is used to extract a large set of results characterising the performance of the gate set. One important metric calculated by the protocol for each gate is the diamond distance, $\|G_{\mathrm{ideal}} - G\|_\diamond$, which is meant to provide a worst-case bound on the distance to the ideal gate operation. GST has found wide adoption in part because of its ability to calculate this metric, which is postulated to be important for formal analyses of fault tolerance in the context of quantum error correction.
In our first test, we numerically probe the sensitivity of the GST analysis procedure to correlated error using the aforementioned pyGSTi toolkit [25]. We introduce constant $\hat{\sigma}_x$, $\hat{\sigma}_y$, or $\hat{\sigma}_z$ errors via concurrent unitary rotations added to the formerly ideal operations. Therefore the exact mathematical representation of each gate ($G_{I,x,y}$) is known from analytical transformations and we have two paths to evaluate gate performance (Fig. 4c). First, we directly calculate the diamond distance ($\|G_{\mathrm{ideal}} - G_{\mathrm{err}}\|_\diamond$) using the matrix representation of $G_{\mathrm{err}}$, maintaining the initial frame of reference. Second, we estimate it by employing pyGSTi to simulate data using $G_{\mathrm{err}}$ and determine the diamond distance ($\|G_{\mathrm{ideal}} - G_{\ldots}\|_\diamond$) [...] to the qubit and its measurement basis. It performs two rounds of gauge optimisation, allowing identification of a frame in which to minimise the distance to the entire set of target gates. The relevance of this gauge freedom on RB-derived estimates of gate performance was highlighted recently in [26]. To illustrate how gauge freedom affects the results, we separately calculate the diamond distance with and without gauge optimising our analytic gate set $G_{\mathrm{err}}$ using routines included in the pyGSTi toolkit. We plot the calculated and estimated diamond norms for $G_{I,x,y}$, subject to processes similar to either a constant overrotation (i.e. proportional to $\hat{\sigma}_x$ or $\hat{\sigma}_y$ depending on the germ, with no error on $G_I$ operations), or a constant detuning error (i.e. proportional to $\hat{\sigma}_z$), as shown in Fig. 4d-e. Here we see that while the estimated diamond distance for operators $G_{I,x,y}$ closely matches the calculated value in the presence of numerical overrotation errors, GST appears to significantly underestimate $G_x$ and $G_y$ errors arising from constant unitary $\hat{\sigma}_z$ errors, and only the diamond norm estimate for $G_I$ appears similar to the directly calculated value.
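The "directly calculated" path (no gauge optimisation) can be illustrated for a single gate without any GST software. The sketch below rests on several assumptions that are ours, not the source's: the detuned $G_x$ gate is modelled as $\exp[-i(\pi/4)(\hat{\sigma}_x + \epsilon\,\hat{\sigma}_z)]$ with $\epsilon = \Delta/\Omega$; the diamond distance is normalised with a factor $1/2$; and for two single-qubit unitary channels it is evaluated with the closed form $\sqrt{1 - |\mathrm{Tr}(U^\dagger V)/2|^2}$. It does not call pyGSTi, and all names are hypothetical.

```python
import math

def rotation_quaternion(angle, axis):
    """Unit quaternion (w, x, y, z) for U = exp(-i * angle/2 * n.sigma),
    i.e. U = w*I - i*(x*sx + y*sy + z*sz), with n = axis / |axis|."""
    norm = math.sqrt(sum(a * a for a in axis))
    half = angle / 2.0
    s = math.sin(half) / norm
    return (math.cos(half), s * axis[0], s * axis[1], s * axis[2])

def unitary_diamond_distance(q_ideal, q_actual):
    """Half diamond norm between two single-qubit unitary channels.

    For SU(2) elements written as quaternions, Tr(U^dagger V)/2 equals
    the quaternion dot product, so the distance is sqrt(1 - overlap^2).
    """
    overlap = sum(a * b for a, b in zip(q_ideal, q_actual))
    return math.sqrt(max(0.0, 1.0 - overlap * overlap))

def detuned_x_gate(epsilon):
    """Assumed model of a pi/2 x-rotation driven with detuning epsilon:
    total angle (pi/2)*sqrt(1 + eps^2) about the axis (1, 0, eps)."""
    angle = (math.pi / 2.0) * math.sqrt(1.0 + epsilon * epsilon)
    return rotation_quaternion(angle, (1.0, 0.0, epsilon))

ideal_gx = rotation_quaternion(math.pi / 2.0, (1.0, 0.0, 0.0))
for eps in (0.001, 0.01, 0.1):
    d = unitary_diamond_distance(ideal_gx, detuned_x_gate(eps))
    print(f"epsilon = {eps}: diamond distance = {d:.6f}")
```

Under these assumptions the distance grows approximately linearly with the detuning for small $\epsilon$ (as $\epsilon/\sqrt{2}$ to leading order), which gives a feel for the scale of the unoptimised calculated values; the gauge-optimised and GST-estimated quantities discussed in the text require the full gate set and are not reproduced by this sketch.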
Other estimated quantities such as process infidelity and the associated Choi matrices are affected in a similar way (see Supplementary Material). However, performing gauge optimisation on the analytically calculated matrices $G_{\mathrm{err}}$ as well (within the pyGSTi package) reduces the reported diamond distance for $\hat{\sigma}_z$ errors, and produces agreement with the much lower $G_{x,y}$ diamond distance reported by the GST estimation procedure (Fig. 4e). Among the error models we have tested, for this gate set such behaviour is only manifested in the presence of temporally correlated $\hat{\sigma}_z$ errors and does not appear using various other error processes built into the GST analysis package (see Supplementary Material for details).
To further investigate the influence of the gauge degree of freedom, we repeat our numerical analysis under the application of identical unitary errors, but extend the gate set by adding negative rotations $-G_x$, $-G_y$, corresponding to $-\pi/2$ $\hat{\sigma}_x$ and $\hat{\sigma}_y$ rotations, and incorporating a number of associated compound germs (Fig. 4f-g). The resulting gauge-optimised calculated and estimated diamond-distance values now increase, moving closer to the analytic calculation obtained without gauge optimisation. This simple change in the gate set directly reveals the role of gauge optimisation in the discrepancies we noted above, as the additional information provided to GST via the extended gate set effectively constrains the gauge optimisation procedure.
We follow up on these numerical investigations by performing experiments using GST gates subjected to engineered unitary $\hat{\sigma}_z$ errors of varying strength. As before, we generate an operation with known error magnitude and form, allowing us to directly produce a matrix representation for the gate and hence calculate the diamond distance for the (deliberately) imperfect gates we apply to our trapped-ion system. Again GST produces an estimate of the diamond distance that matches the calculation for $G_I$, but yields estimates of the diamond distance from experimental data approximately an order of magnitude below the (unoptimised) calculated value for $G_{x,y}$ (Fig. 4h). Allowing gauge optimisation on the calculated diamond distance changes its scaling with error magnitude, as in the simulations above. We do not find strong agreement between data for $G_{x,y}$ and this gauge-optimised scaling, but cannot exclude the possibility that other finite-sampling effects may cause saturation of small reported diamond distances.
In addition to the cases presented above we have also performed experimental GST with a wide variety of engineered, time-varying errors. These include detuning and amplitude noise exhibiting 50 Hz fluctuations and slow drifts (i.e. varying in time during individual sequences), constant overrotations, and added state-preparation and measurement (SPAM) errors. All data sets, corresponding pyGSTi analysis files and resultant reports are included as part of the Supplementary Material.

II. Discussion
In our studies we have employed a simple analytic framework, a formalism mapping noise to error accumulation in sequences of Clifford operations, to explore the sensitivity of RB and GST to slowly varying noise processes. Theoretical predictions derived from this framework match RB experiments employing engineered noise with known characteristics: either slowly varying or rapidly varying on the sequence timescale. This highlights the utility of the random-walk analysis in determining sequence-dependent sensitivities of QCVV protocols in the presence of temporally correlated noise.
We have compared RB survival probabilities over sequences to a gamma distribution $\Gamma(\alpha = 1, \beta)$, where $\beta$ is determined by the type of error model employed in the experiment, and shown good agreement using no free parameters. In addition we have demonstrated that in the presence of slowly varying noise, the mode of the distribution of survival probabilities over sequences is shifted towards lower error rates than the mean, and that a long tail of high-error outcomes appears as predicted in [18].
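The shape of such a distribution can be illustrated with a short numerical sketch. It assumes that the histogrammed quantity is the noise-averaged infidelity $1 - F$ of each sequence and that $\beta$ acts as a scale parameter; both conventions, together with the value of `beta`, the number of samples and the seed, are illustrative assumptions rather than values taken from the experiment.

```python
import random
import statistics


def sample_gamma_infidelities(beta, n_sequences, alpha=1.0, seed=1):
    """Draw hypothetical per-sequence infidelities from Gamma(shape=alpha, scale=beta)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, beta) for _ in range(n_sequences)]


beta = 0.01  # hypothetical scale; in the text beta is fixed by the error model
samples = sample_gamma_infidelities(beta, n_sequences=20000)

mean = statistics.mean(samples)
median = statistics.median(samples)
# Fraction of sequences whose error is below the mean error.
below_mean = sum(x < mean for x in samples) / len(samples)
# Fraction of sequences in the high-error tail (more than three times the mean).
tail = sum(x > 3 * mean for x in samples) / len(samples)

print(f"mean error          = {mean:.5f}")
print(f"median error        = {median:.5f}")
print(f"fraction below mean = {below_mean:.3f}")
print(f"fraction > 3 x mean = {tail:.3f}")
```

For $\alpha = 1$ the density peaks at zero error, so most sequences report an error below the mean while a few percent lie far out in the tail, which is the qualitative behaviour described above.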
Overall, the experiments reported here give a clear experimental signature of the violation of the assumption that errors between gates are independent. While we do not claim that the features we observe are in general uniquely derived from this interpretation, we hope these results may help experimentalists seeking to interpret complex RB data sets. We believe that more detailed reporting of RB outcomes, including the publication of distributions over $F$ as well as the sequences employed, will facilitate more meaningful comparisons between RB data sets derived from different physical systems, as the relevance of $p_{\mathrm{RB}}$ is diminished when error processes exhibit temporal correlations.
Through a combination of analytic calculations, numerics, and experiments with engineered errors we have found a similar bias towards high estimates of gate fidelity in GST (using a standard $G_{I,x,y}$ gate set) subjected to strongly correlated, unitary $\sigma^z$ errors. The asymmetry we observe between the manifestation of correlated $\sigma^x$/$\sigma^y$ and $\sigma^z$ error sensitivity in GST outputs has not been reported previously, to the best of our knowledge. We have shown explicitly how the low diamond-distance estimates under this kind of noise are related to the gauge optimisation performed as part of the protocol; limiting the gauge freedom by extending the gate set under application of an identical error process dramatically changed the estimated diamond distance of the very same gates in numerical simulations.
These observations are commensurate with a simple physical interpretation of the effect of an optimised gauge transformation in the circumstances we examine. In the presence of correlated $\sigma^z$ errors, when the gate set is limited to $G_{I,x,y}$ gates, the reconstructed operator includes an extra error component along the z-axis. The effect of gauge optimisation is to rotate the operators for $G_x$ and $G_y$ back to the equatorial plane, effectively cancelling this error. Under this circumstance the magnitude of rotation of these gates is smaller than expected in a fixed lab frame, and the second-order nature of the residual errors results in a steeper gradient of the dotted line in Fig. 4e. In contrast, the $G_I$ operation should have no net rotation and therefore this error will not be cancelled by a simple gauge transformation.
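A single-gate sketch makes the change in scaling concrete. It assumes the error enters as a constant detuning-like term during the pulse, so that the noisy gate is $\exp[-i(\tfrac{\pi}{4}\sigma^x + \epsilon\,\sigma^z)]$, and it uses the closed-form diamond distance between two single-qubit unitary channels, $\sqrt{1 - |\mathrm{Tr}(U^\dagger V)|^2/4}$, instead of pyGSTi. The function names and the parameter `eps` are illustrative. The frame rotation is chosen by hand for $G_x$ alone, whereas the actual gauge optimisation acts on the whole gate set together with SPAM, so this is only a cartoon of the mechanism.

```python
import math


def matmul(a, b):
    """Product of two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def rotation(nx, ny, nz, angle):
    """exp(-i * angle/2 * n.sigma) for a unit vector n = (nx, ny, nz)."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c - 1j * s * nz, (-1j * nx - ny) * s],
            [(-1j * nx + ny) * s, c + 1j * s * nz]]


def unitary_diamond_distance(u, v):
    """Half the diamond norm between two single-qubit unitary channels."""
    w = matmul(dagger(u), v)
    overlap = abs(w[0][0] + w[1][1]) / 2
    return math.sqrt(max(0.0, 1.0 - overlap ** 2))


IDEAL_GX = rotation(1.0, 0.0, 0.0, math.pi / 2)


def noisy_gx(eps):
    """Assumed error model: exp(-i (pi/4 sigma_x + eps sigma_z))."""
    r = math.hypot(math.pi / 4, eps)
    return rotation(math.pi / 4 / r, 0.0, eps / r, 2 * r)


def distances(eps):
    """Distance to the ideal G_x in the lab frame and in a hand-picked rotated frame."""
    u = noisy_gx(eps)
    lab = unitary_diamond_distance(IDEAL_GX, u)
    # Rotate about y by the tilt angle so the rotation axis returns to x.
    chi = math.atan2(eps, math.pi / 4)
    s = rotation(0.0, 1.0, 0.0, chi)
    rotated = matmul(matmul(s, u), dagger(s))
    return lab, unitary_diamond_distance(IDEAL_GX, rotated)


for eps in (0.04, 0.02, 0.01):
    lab, rot = distances(eps)
    print(f"eps = {eps:.2f}   lab frame: {lab:.3e}   rotated frame: {rot:.3e}")
```

Halving `eps` halves the lab-frame distance but reduces the rotated-frame distance roughly fourfold, so the slope on a log-log plot doubles once the first-order tilt has been absorbed into the frame.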
Gauge optimisation is designed to produce the best estimate for errors over the entire gate set, and in a sense acts to "distribute" nominal errors over all constituent rotations in the gate set. The validity of such a gauge transformation in the presence of independent protocols for establishing a measurement basis remains an open question and has been highlighted recently by Rudnicki et al. [27]. The variation of calculated and estimated diamond distances under correlated $\sigma^z$ errors when subjected to seemingly small modifications of the gate set has again not been reported previously, and indicates an important dependence of GST output on the specific gate set employed, the characteristics of the underlying error source, and the gauge optimisation procedure.
Clearly the observed performance of GST in the presence of correlated $\sigma^x$ noise, such as that resulting from experimental over-rotations, can make GST a valuable tool in debugging an experimental system [28], although precise calibrations can also be carried out efficiently using a subset of the full GST protocol [29]. The effect of gauge optimisation in the presence of $\sigma^z$ errors, however, is concerning, as a key implied benefit of GST is its ability to directly estimate the diamond distance and hence provide a rigorous upper bound on gate errors using a fully self-contained analysis package. Recent experimental work [10] on the topic claimed such upper bounds on gate errors using GST and compared these to the fault-tolerance threshold with high reported confidence and tight uncertainties. The results above call into question the relevance of the reported metrics without either additional independent verification and noise characterisation, or a more detailed discussion of the relationship between the gate set implemented in GST and the gates which could potentially be employed in a calculation using that system. Furthermore, when acquiring and evaluating data, care has to be taken to suppress any form of model violation reported by the GST toolkit in its likelihood analysis, as otherwise the extracted performance metrics may become unreliable. These deviations are currently not reflected in the uncertainties (i.e. error bars) calculated for those metrics by the toolkit, and discussions with its authors suggest that establishing a connection between the two is nontrivial.
In light of the investigations reported here, we believe that there is a need for greater awareness of the subtleties of the use of both RB and GST in the presence of temporally correlated noise environments. In order to enhance the meaning and utility of reported results we advocate that QCVV benchmarks such as $p_{\mathrm{RB}}$ and GST diamond distances should be reported together with a quantitative measure of deviation from a purely Markovian, temporally uncorrelated model. In the case of RB, this could be the difference between the extracted $p_{\mathrm{RB}}$ of long and short walk sequences; in GST the deviation is already being reported as part of the routine, yet the question about the impact of gauge optimisation that we identified remains. Similarly, if using GST as a standalone gate evaluation procedure one cannot know a priori the form of the underlying noise, and hence any associated GST insensitivities. Increasing the rigour of resultant upper bounds on diamond distances could require performing GST using multiple different gate sets in order to identify potential "blind spots." Given the experimental overhead, however, this brute-force approach is not necessarily attractive, and further modifications to GST could resolve the issue with considerably greater efficiency. Overall, we hope that these observations will assist in both the interpretation of QCVV experiments when model violation may occur, and the development of new techniques with improved rigour and efficiency for larger-scale systems.

Experimental gate implementation
Quantum gates are implemented on a single $^{171}$Yb$^{+}$ ion by driving its qubit transition at 12.6 GHz with microwave pulses produced by a vector signal generator (VSG, model Keysight E8267D). The phase of the driving field is adjustable via I-Q modulation, allowing us to implement rotations around any axis lying in the xy-plane of the Bloch sphere. Rotations around the z-axis are carried out as frame updates, i.e. precalculated, instantaneous changes of the generator I-Q values. Identity operations are realised as "idle" periods, whereby no signals are applied for a time equivalent to that of a $\pi$ or $\pi/2$ rotation. We additionally implement pulse modulation ("RF blanking") to suppress transients in microwave power at pulse edges. In this way, we implement the full set of Clifford gates as listed in the Supplementary Material.
All RB and GST sequences are uploaded to the VSG prior to the experiments and selected when required. When the number of implemented sequences is large, as is the case with GST, sequence selection is the bottleneck in our experiments: depending on the number of constituent gates $J$, it can take up to tens of seconds on our signal generator, because the in-built, high-suppression RF blanking switch adds significant overhead.

Experimental noise implementation
In RB experiments correlated noise is implemented by shifting the VSG drive frequency by a fixed amount, taken from a list of $N = 200$ samples from a Gaussian noise distribution (see Supplementary Material). The same list of noise realisations is repeated for each RB sequence in a set of given length $J$, yielding sets of noise-averaged fidelities. In GST experiments we implement constant noise of the same strength over all the sequences. Only four noise detunings are implemented due to the large overhead imposed by sequence selection prior to execution.
Rapidly varying noise in RB is implemented via the VSG's external frequency modulation, whereby the frequency offset is encoded as a series of calibrated offset voltages on an arbitrary waveform generator (Keysight 33622A) and supplied synchronously with each gate within a sequence. Again, $N = 200$ different realisations, each consisting of $J$ samples, are applied to each RB sequence to extract a noise-averaged fidelity. Further details can be found in the Supplementary Material.
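The two noise-injection schemes described above can be mimicked in a small single-qubit simulation. The sketch below is not the laboratory control code: the random $\pi/2$ rotations about $\pm x$ and $\pm y$ stand in for the Clifford sequences, the detuning is expressed as a $z$-rotation angle per gate, and the noise strength `SIGMA`, the sequence length and all function names are hypothetical. It only illustrates how one fixed list of $N = 200$ realisations is reused for every sequence, with a single value per run in the slowly varying case and $J$ independent values per run in the rapidly varying case.

```python
import math
import random

N_REALISATIONS = 200   # N in the text
SIGMA = 0.05           # hypothetical rms z-angle per gate (radians)
AXES = (0.0, math.pi / 2, math.pi, 3 * math.pi / 2)  # +x, +y, -x, -y


def apply_gate(state, phi, theta, z_angle):
    """Apply exp(-i[(theta/2)(cos(phi) X + sin(phi) Y) + (z_angle/2) Z]) to a state."""
    hx = 0.5 * theta * math.cos(phi)
    hy = 0.5 * theta * math.sin(phi)
    hz = 0.5 * z_angle
    r = math.sqrt(hx * hx + hy * hy + hz * hz)
    if r == 0.0:
        return state
    c, s = math.cos(r), math.sin(r) / r
    a, b = state
    return ((c - 1j * s * hz) * a - 1j * s * (hx - 1j * hy) * b,
            -1j * s * (hx + 1j * hy) * a + (c + 1j * s * hz) * b)


def sequence_fidelity(sequence, z_angles):
    """Overlap between the noisy and the ideal final state, starting from |0>."""
    ideal = noisy = (1 + 0j, 0j)
    for (phi, theta), z in zip(sequence, z_angles):
        ideal = apply_gate(ideal, phi, theta, 0.0)
        noisy = apply_gate(noisy, phi, theta, z)
    overlap = ideal[0].conjugate() * noisy[0] + ideal[1].conjugate() * noisy[1]
    return abs(overlap) ** 2


def quasi_dc_realisations(n, j, sigma, rng):
    """One Gaussian sample per realisation, held constant over all j gates."""
    return [[rng.gauss(0.0, sigma)] * j for _ in range(n)]


def white_realisations(n, j, sigma, rng):
    """j independent Gaussian samples per realisation, one per gate."""
    return [[rng.gauss(0.0, sigma) for _ in range(j)] for _ in range(n)]


def noise_averaged_fidelity(sequence, realisations):
    """Average the sequence fidelity over a fixed list of noise realisations."""
    total = sum(sequence_fidelity(sequence, zs) for zs in realisations)
    return total / len(realisations)


def random_sequence(j, rng):
    """Stand-in for an RB sequence: j random pi/2 rotations in the xy-plane."""
    return [(rng.choice(AXES), math.pi / 2) for _ in range(j)]


rng = random.Random(0)
J = 50
dc_noise = quasi_dc_realisations(N_REALISATIONS, J, SIGMA, rng)
white_noise = white_realisations(N_REALISATIONS, J, SIGMA, rng)
for index in range(5):
    seq = random_sequence(J, rng)
    # The same two noise lists are reused for every sequence.
    f_dc = noise_averaged_fidelity(seq, dc_noise)
    f_white = noise_averaged_fidelity(seq, white_noise)
    print(f"sequence {index}: quasi-DC F = {f_dc:.4f}, white F = {f_white:.4f}")
```

Collecting the noise-averaged fidelities over many sequences in each case gives the kind of distribution over sequences that the RB analysis examines.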