Detecting and tracking drift in quantum information processors

If quantum information processors are to fulfill their potential, the diverse errors that affect them must be understood and suppressed. But errors typically fluctuate over time, and the most widely used tools for characterizing them assume static error modes and rates. This mismatch can cause unheralded failures, misidentified error modes, and wasted experimental effort. Here, we demonstrate a spectral analysis technique for resolving time dependence in quantum processors. Our method is fast, simple, and statistically sound. It can be applied to time-series data from any quantum processor experiment. We use data from simulations and trapped-ion qubit experiments to show how our method can resolve time dependence when applied to popular characterization protocols, including randomized benchmarking, gate set tomography, and Ramsey spectroscopy. In the experiments, we detect instability and localize its source, implement drift control techniques to compensate for this instability, and then demonstrate that the instability has been suppressed.


SUPPLEMENTARY NOTE 1 - DRIFT CHARACTERIZATION USING TIME-SERIES DATA
Here we provide supporting theory and information for the methods that we propose for detecting and characterizing instability in the outcome distribution of arbitrary quantum circuits. These are the methods summarized in steps 1-3 of the flowchart in Fig. 1 of the main text.

A. Power spectra
To begin, we detail how a clickstream is transformed into the frequency domain and then converted into a power spectrum (step 1 in the flowchart of Fig. 1A of the main text). To transform into the frequency domain we apply a Fourier-like transform F to rescaled data. All of the data analysis in this article uses the Type-II discrete cosine transform (DCT), with an orthogonal normalization. It is defined by the matrix F^(dct) with the elements

F^(dct)_ωi = k_ω √(2/N) cos( ω(2i + 1)π / (2N) ),    (1)

where k_0 = 1/√2, k_ω = 1 for ω > 0, and ω, i = 0, . . . , N − 1 [1,2]. The Type-II DCT typically results in a sparser representation of signals with differing values at the boundaries than many of the alternative Fourier-like transforms [1,2]. Functions of this sort are typical in our problem, and so this is our motivation for using this transform. Although all our data analysis uses the Type-II DCT, our methods and our theory (except where stated) are directly applicable to any transform matrix F that satisfies the following criteria: (i) F is orthogonal, (ii) every matrix element of F satisfies F²_ωi ≤ b_F/N for a small constant b_F, and (iii) the top row of F is proportional to the vector of all ones. This includes, e.g., sine transforms, other cosine transforms, and the Walsh-Hadamard transform [1][2][3] (all with the appropriate normalizations). Note that the standard discrete Fourier transform does not satisfy these requirements (with an appropriate normalization it is unitary, but not orthogonal). However, the methods described here are compatible with the discrete Fourier transform as long as some minor adjustments are made to guarantee that, e.g., the probability trajectory estimates are real. Throughout the following theory, we refer to the indices of the F-domain vector as frequencies, but note that (i) when we report frequencies for experimental data these indices have been converted to Hertz, and (ii) for a generic transformation an index should only be interpreted as indicating the corresponding basis function of F, and it may or may not constitute a frequency.
As discussed in the main text, the motivation for using spectral analysis is that many physically plausible p(t) are sparse in the frequency domain. But note that the relevant quantity for the data analysis is the discrete-time probability trajectory p, where p_i = p(t_i) and t_i is the i-th data collection time for the circuit in question. Moreover, as we are concerned with deviations from time-independence, it is the "signal" vector

s ≡ p − p̄ 1

that is of most relevance, where v̄ denotes the mean of the vector v and 1 = (1, 1, . . . , 1)^T. It is sparsity of s̃ = Fs that is important for our data analysis. Sparsity in the frequency domain of the continuous-time probability p(t) implies that s̃ is a sparse vector if: (i) a sensible choice is made for the transformation F, such as the Type-II DCT, (ii) the number of sample times N is sufficiently high, and (iii) the sample times t_0, t_1, t_2, . . . , t_{N−1} for the circuit in question are spaced reasonably uniformly.

[Displaced figure caption: Outer DC electrodes, showing the antennae locations with respect to the dc control and rf electrodes. The leads for the microwave antennae attach directly to the flex cable waveguides. The antennae are located on the lowest metal layer underneath the trap electrodes. While the generated magnetic field is attenuated by the metal layers above it, the gaps between electrodes make this design viable, and the magnetic field above the trap is about half of the field strength that would be generated if the antennae were not covered by trap electrodes. In the Ramsey experiment the ion was situated at the center of the trap, above the loading slot. In both GST experiments the ion was located above the center of the left microwave antenna.]
All the experiments and simulations herein consist of rastering through a set of circuits, i.e., running each circuit once and then looping through this until each circuit has been run N times. So, in an idealized experiment, the time between sequential runs of a circuit is exactly constant (and the same for all the circuits). Violations of this ideal are unavoidable, but if they are small (as they are in our experiments) they only mildly degrade the effectiveness of our techniques. This can be understood by noting that small perturbations on equally-spaced times (or a small number of erroneously long time-steps of any size) will only slightly reduce the sparsity of s̃, for any Fourier-sparse p(t). Large deviations from the equally-spaced-times ideal can substantially degrade the sensitivity of the methods we presented in the main text (although this will not cause instability to be spuriously detected). In this case, we can instead generate power spectra using a technique that explicitly takes into account the sample times. One such option is the floating-mean Lomb-Scargle periodogram [4,5], and all our methods can be adapted to this spectral analysis technique. As there are technical differences in this case, and this method is not needed to analyze our experimental data, we do not include the details here.
Before transforming a clickstream x to the frequency domain, we first subtract its mean x̄ and divide by the square root of its variance. This simplifies the statistics of the power spectrum, but the spectrum is not well-defined if the clickstream consists of only zeros or only ones (because then the variance of x is zero). So we define the frequency-domain amplitudes by:

z̃ = F(x − x̄ 1) / √(x̄(1 − x̄))    if 0 < x̄ < 1,
z̃ = (0, 1, 1, . . . , 1)^T         otherwise.    (3)

The power at frequency ω is then given by |z̃_ω|². This convention for the constant clickstream case is convenient, particularly when averaging the power spectra from multiple clickstreams, as there are then fewer situations in which we need to explicitly take this special case into account. Moreover, (0, 1, 1, . . . ) roughly corresponds to the N → ∞ limit of z̃ for a clickstream containing a single 1 or 0 (the correspondence is not precise, because the spectrum of a delta function is not completely flat for a general transform F).

B. Statistical hypothesis testing
All of our methods are built upon statistical hypothesis testing [6,7]. We now review the relevant aspects of this field and formulate our hypothesis testing problem. Consider some set of random variables {A_1, A_2, . . . } with a joint distribution that is parameterized by θ ∈ H for some space H. A statistical hypothesis is the conjecture that θ ∈ H_0 ⊂ H. In this article, data consists of a clickstream of N bits for each of C ≥ 1 circuits. Generalizing the notation used in the main text so that we explicitly denote the circuit index, let x_{c,i} be the bit output by the i-th repetition of the c-th circuit, and denote the corresponding Bernoulli random variable from which that datum is drawn by X_{c,i}. The set of random variables about which we state statistical hypotheses is {X_{c,i}}. The random variable X_{c,i} has some unknown probability p_{c,i} of taking the value 1, so the set of random variables {X_{c,i}} is parameterized by the space H = [0, 1]^{NC}. Statistical hypothesis testing starts from some set of null hypotheses {H_{0,i}} and uses statistical hypothesis tests on data drawn from the random variables to attempt to reject one or more of these null hypotheses (and/or intersections of these null hypotheses). Our methods look for evidence that |p̃_{c,ω}| > 0 at each non-zero frequency ω and for each circuit c. This is formalized by the set of null hypotheses {H_{0,c,ω}}, where H_{0,c,ω} is the conjecture that p̃_{c,ω} = 0, with ω = 1, 2, . . . , N − 1 and c = 1, 2, . . . , C. For our purposes, testing a null hypothesis consists of the following four steps: (i) pick a significance level α ∈ (0, 1), with α = 5% a common choice; (ii) select a test statistic Λ, which is a function from data to R; (iii) find a threshold such that the probability of observing Λ larger than this threshold if the null hypothesis is true is at most α; (iv) collect data, evaluate Λ, and reject the null hypothesis if and only if Λ is larger than the threshold. This procedure guarantees that the null hypothesis is falsely rejected with probability at most α. Our test statistic for testing the null hypothesis H_{0,c,ω} is simply the power of the normalized data at the frequency ω, i.e., |z̃_{c,ω}|².
Implementing multiple hypothesis tests that use the above procedure will typically cause the probability of falsely rejecting at least one null hypothesis - the family-wise error rate (FWER) - to increase quickly with the number of tests (for T tests, FWER ≤ Tα). The standard approach when implementing multiple tests is to adapt the testing procedure to, at a minimum, maintain weak control of the FWER. Weak control of the FWER means that if all the null hypotheses are true then, for some pre-specified global significance α, the probability of falsely rejecting one or more null hypotheses is at most α. It is common to choose a testing procedure that also maintains strong control of the FWER. Strong control of the FWER means that the probability of falsely rejecting one or more true null hypotheses is at most α, and this holds even if one or more of the null hypotheses are false. We seek to maintain strong control of the FWER, so that whenever we reject H_{0,c,ω} we can be (1 − α)-confident that ω is a frequency with a non-zero amplitude in the Fourier decomposition of p_c.
Strong control of the FWER at a global significance of α can be maintained by implementing the i-th of T tests at a local significance of αw_i, for any w_i ≥ 0 satisfying Σ_i w_i = 1. This is known as the (generalized) Bonferroni correction [6,7]. The Bonferroni correction can be conservative; there are other methods for maintaining strong control of the FWER that are more powerful than the Bonferroni correction, such as the Holm procedure [6,7]. Moreover, strong control of the FWER is not the only way to ensure "confidence" in the results; e.g., we could instead choose to control the false discovery rate, which is the expected ratio of the number of falsely rejected null hypotheses to the total number of rejected null hypotheses [8]. Variants of our techniques that control the false discovery rate and/or use a different multi-test correction procedure are possible, and essentially involve setting a slightly lower power significance threshold in the analysis.

C. Statistics of the power spectra
We now derive the distribution of the power spectrum of the data, both when p(t) is and is not constant. We consider the frequency-domain data z̃_ω, defined in Supplementary Eq. (3), for an arbitrary non-zero frequency ω, and we derive an approximation to the distribution of the corresponding random variable

Z̃_ω = [F(X − X̄ 1)]_ω / √(X̄(1 − X̄)),

where ω > 0 (we are again dropping the circuit indexing, and we have ignored the edge-case of the constant clickstream instance in this equation). Note that X̄ = (1/N) Σ_i X_i is the average of random variables, and so it is itself a random variable. In the main text we asserted that Z̃²_ω is approximately χ²₁-distributed whenever the probability trajectory is a constant. This is a sub-case of what we derive below, but we also study the marginal distribution of Z̃²_ω when p̃_ω = 0 but the probability trajectory is not necessarily constant (i.e., there is perhaps power at one or more other non-zero frequencies). We do so in order to be able to claim that we are maintaining strong control of the FWER. By itself, the fact that Z̃²_ω is approximately χ²₁-distributed when the probability trajectory is constant can only be used to ensure weak control of the FWER.
The random variable Z̃_ω is rather complex to directly analyze, because it depends on both X̃_ω and X̄. So we first make an approximation. As X̄ is the mean of the X_i, we have that

E(X̄) = p̄,   V(X̄) = O(1/N),

where E(A) and V(A) denote the expectation value and variance of the random variable A, respectively. Therefore

Z̃_ω ≈ Ỹ_ω ≡ [F(X − p̄ 1)]_ω / √(p̄(1 − p̄)),

where we assume that 0 < p̄ < 1. So we now derive the distribution of Ỹ_ω, as an approximation to the distribution of Z̃_ω. The random variable Ỹ_ω is the sum of N independent random variables:

Ỹ_ω = Σ_i Q_ωi,   with Q_ωi = F_ωi (X_i − p̄) / √(p̄(1 − p̄)).

A heuristic appeal to the central limit theorem therefore suggests that Ỹ_ω is normally distributed. That is, Ỹ_ω is approximately N(µ_ω, ν_ω)-distributed for N ≫ 1, with some mean µ_ω and variance ν_ω. Later we formally apply the central limit theorem to Ỹ_ω, but first we derive formulae for µ_ω and ν_ω. For any non-zero ω, µ_ω is simply a rescaling of the component of the probability trajectory at this frequency:

µ_ω = [F(p − p̄ 1)]_ω / √(p̄(1 − p̄)) = p̃_ω / √(p̄(1 − p̄)).
So, µ_ω = 0 under the null hypothesis that p̃_ω = 0. The variance ν_ω has a more subtle dependence on the probability trajectory. By noting that Ỹ_ω is a summation of independent Bernoulli random variables multiplied by constants, and by noting that the variance of a Bernoulli random variable with bias b is b(1 − b), we obtain

ν_ω = Σ_i F²_ωi p_i(1 − p_i) / (p̄(1 − p̄)).

It then follows from the orthogonality of F that ν_ω = 1 when the probability trajectory is constant. So Z̃_ω is approximately N(0, 1)-distributed when the probability trajectory is constant. Therefore, calculating significance thresholds using the model that Z̃²_ω is χ²₁-distributed under the null hypothesis will maintain weak control of the FWER.
Using the relation p_i = s_i + p̄, where s is the signal vector, we can rewrite the variance as

ν_ω = 1 + Σ_i F²_ωi [ s_i(1 − 2p̄) − s²_i ] / (p̄(1 − p̄)).    (11)

So far, F is a general Fourier-like transformation that is an orthogonal matrix (see earlier). For some transformations, such as the Walsh-Hadamard transformation (but not the DCT), F²_ωi = 1/N, and so Supplementary Eq. (11) simplifies to

ν_ω = 1 − ||s||² / (N p̄(1 − p̄)),

because Σ_i s_i = 0. Here ||·|| is the 2-norm. That is, the variance is upper-bounded by 1, and it decreases in proportion to the signal power per timestep (which is equal to the variance of p). So the maximum variance over the null hypothesis space of p̃_ω = 0 is ν_ω = 1. This means that modeling Z̃²_ω with a χ²₁ distribution is sufficient for strong control of the FWER if the power spectra are generated using, e.g., the Walsh-Hadamard transformation.
For a general transformation the situation is more complicated. In particular, even with p̃_ω = 0 it is possible to obtain ν_ω > 1 for a matrix F containing elements of magnitude greater than 1/√N. Because all of our data analysis uses the Type-II DCT, defined in Supplementary Eq. (1), at this point we specialize to this transformation. In this case, simple algebra can be used to show that

ν_ω = 1 + [ (1 − 2p̄) s̃_2ω − q̃_2ω ] / (√(2N) p̄(1 − p̄)) − ||s||² / (N p̄(1 − p̄)),

where q = (s²_0, s²_1, s²_2, . . . ) is the vector of time-domain signal powers, q̃ = Fq, and we are allowing the frequency index to extend outside its range as specified in the definition of the DCT. This equation implies that the variance at ω is increased by signal power at 2ω if the oscillations are not centered on 1/2. This can only result in a variance above 1 if the amplitude at 2ω outweighs the decrease in the variance caused by the total power per time-step in s. So the variance at ω is maximized by a pure-tone probability trajectory oscillating between 1/2 and 1 or 0 at a frequency of 2ω, resulting in the saturable bound ν_ω ≤ 1 + 1/6. Therefore the maximal variance over the null hypothesis space of p̃_ω = 0 is ν_ω = 7/6. So, ignoring the two approximations made so far, there exists a probability trajectory within the null hypothesis subspace defined by p̃_ω = 0 such that Z̃_ω is N(0, 7/6)-distributed. This implies that calculating significance thresholds using a χ²₁ distribution does not strictly provide strong control of the FWER.
We could guarantee strong control of the FWER by calculating thresholds using a rescaled χ²₁ distribution (or, equivalently, by normalizing the data differently). But we choose not to do this, because for a broad class of probability trajectories ν_ω ≤ 1 for all ω. So accounting for the possibility of ν_ω > 1 would mean a reduction in test power - i.e., higher significance thresholds - to gain statistical rigor in some edge cases. In our opinion this is not a good tradeoff, and we instead settle for maintaining weak control of the FWER and "almost" maintaining strong control of the FWER. This is true in the sense that although there are some probability trajectories for which the FWER is slightly above the desired value, (i) the increase in the FWER is small, (ii) it can only occur for extremal points in the parameter space, and (iii) it is counteracted by the conservative nature of the Bonferroni correction that we use when calculating thresholds from a χ²₁ distribution (see later). All further statements about strong control of the FWER should be understood to implicitly contain this minor caveat.
The above reasoning was all based on the assumption that Ỹ_ω is approximately normally distributed for large N. We now show that Ỹ_ω indeed converges in distribution to a normal distribution as N → ∞, under a reasonable notion of the limit of a probability trajectory. In particular, we assume that ε₁ < p̄ < 1 − ε₁ and ν_ω > ε₂ for some constants ε₁, ε₂ > 0 and all N. The proof will use Lyapunov's central limit theorem [9], which we now state. Let {A_1, A_2, . . . , A_n} be a sequence of independent random variables where A_i has a finite expected value µ_i and variance ν_i, and define s²_n = Σ_{i=1}^n ν_i. If Lyapunov's condition

lim_{n→∞} (1/s_n^{2+δ}) Σ_{i=1}^n E( |A_i − µ_i|^{2+δ} ) = 0

holds for some δ > 0, the distribution of A = Σ_{i=1}^n (A_i − µ_i)/s_n converges to N(0, 1) as n → ∞. The random variable Ỹ_ω may be written in the format of this central limit theorem:

Ỹ_ω = Σ_{i=1}^N Q_ωi,   with Q_ωi = F_ωi Y_i and Y_i = (X_i − p̄)/√(p̄(1 − p̄)),

and the expected value and variance of every Q_ωi is bounded (due to our assumptions). It remains to prove that Lyapunov's condition holds. We have that

E( |Q_ωi − E(Q_ωi)|³ ) ≤ c b_F^{3/2} / N^{3/2}

for some constant c, because Q_ωi = F_ωi Y_i, F²_ωi ≤ b_F/N, and the moments of Y_i are bounded from above by a constant (as ε₁ < p̄ < 1 − ε₁). Therefore

(1/s³_N) Σ_{i=1}^N E( |Q_ωi − E(Q_ωi)|³ ) ≤ c b_F^{3/2} / (ν_ω^{3/2} √N) → 0,

with the equality s³_N = ν_ω^{3/2} holding by definition, and the limit holding because ν_ω > ε₂. So Lyapunov's condition holds for δ = 1.
In addition to performing hypothesis testing on the power spectrum for each circuit, we also test the single power spectrum obtained by averaging over the power spectra for the different circuits, e.g., see Fig. 3 in the main text. The power at frequency ω in this spectrum is given by

z̃²_avg,ω = (1/C) Σ_{c=1}^C z̃²_{c,ω},

where C is the number of circuits. The clickstreams for different circuits are independent, and so the corresponding random variable Z̃²_avg,ω is the average of independent random variables that are approximately χ²₁-distributed under the (intersection) null hypothesis that p̃_{c,ω} = 0 for all c. Therefore Z̃²_avg,ω is approximately χ²_C/C-distributed under this null hypothesis.

D. Hypothesis testing of the power spectra
Here we use the statistical model developed above to calculate the power significance threshold. We argued above that Z̃²_{c,ω} can be modeled as a χ²₁ random variable under the null hypothesis that p̃_{c,ω} = 0 (up to minor caveats). So if we are only testing a single circuit c and a single frequency ω, then the α-significance power threshold is simply

T = CDF₁⁻¹(1 − α),

where CDF_k denotes the cumulative distribution function for the χ²_k distribution. However, we test every circuit at every non-zero frequency, and we also test the average power spectrum. So, to maintain strong control of the FWER with a global significance of α in this set of (N − 1)(C + 1) tests, we use a Bonferroni correction. In general, this means setting the local significance of the test on circuit c at frequency ω to w_{c,ω}α, and setting the local significance of the test on the average power spectrum at frequency ω to w_ω α, for some non-negative weightings w_{c,ω} and w_ω satisfying Σ_ω (w_ω + Σ_c w_{c,ω}) = 1. The test weightings can be optimized for different usages.
Often it is equally important to identify any drift at every frequency and in every circuit. A one-parameter subclass of weightings that is useful under this circumstance is given by w_{c,ω} = (1 − w)/(C(N − 1)) and w_ω = w/(N − 1) for some weighting factor w. This choice of weightings results in a single ω-independent significance threshold for the power in the individual power spectra for the C circuits of

T_circuit = CDF₁⁻¹( 1 − (1 − w)α/(C(N − 1)) ),    (22)

and an ω-independent significance threshold for the power in the averaged spectrum of

T_avg = CDF_C⁻¹( 1 − wα/(N − 1) ) / C.    (23)

The weighting factor w controls the proportion of the test statistical significance that is allocated to the test on the averaged power spectrum versus the significance allocated to the individual power spectra. The averaged power spectrum does not contain information about which circuits (if any) exhibit drift, but has (under most circumstances) increased sensitivity to any drift. So the best choice for this weighting depends on the relative importance of detecting any drift versus being able to assign it to a particular circuit. All our data analysis uses the thresholds given in Supplementary Eqs. (22-23), with α = 5% and w = 1/2, with the exception of time-resolved RB, wherein w = 1.

E. Estimating probability trajectories
We now explain our methods for estimating the circuit outcome probability trajectories (step 3, Fig. 1A of the main text). The methods are based on model selection, to avoid overfitting. The general form of the model that we use for a probability trajectory is:

p(γ, t) = γ_0 + Σ_{ω∈W} γ_ω f_ω(t),

where f_ω(t) is the ω-th basis function of the chosen Fourier transform, the summation is over some set of non-zero frequencies W that are to be chosen using model selection, and the γ_ω are parameters constrained only so that p(γ, t) is a valid probability at all data acquisition times (the condition that p(γ, t) is within [0, 1] for all real-valued t can cause overly severe compression of the amplitudes). In the case of the Type-II DCT the basis functions are

f_ω(t) = cos( ωπ(2(t − t_0) + Δ) / (2NΔ) ),

where, as throughout, t_i is the i-th data collection time for the circuit in question, and Δ is the size of the time-step, so that f_ω(t_i) = cos(ωπ(2i + 1)/(2N)). The first step is model selection to identify a set of frequencies W. These frequencies can be chosen using the results of the hypothesis testing on the data spectra. It is useful to allow flexibility in exactly how they are chosen from the hypothesis testing results, because this allows the techniques to be adapted to different applications. A good general-purpose option, which results in estimates with a simple statistical meaning, is to set W to contain those and only those frequencies that are found to be statistically significant in the power spectrum for the circuit in question. With this method we are only including frequencies that contribute a non-zero component to the true p(t) with confidence (1 − α), where α is the chosen global significance level. All of the data analysis in this article uses this method, with the exception of the model selection in time-resolved RB (see later).
It remains to estimate the model parameters, i.e., the amplitudes γ. Maximum likelihood estimation (MLE) is a statistically sound methodology. But MLE requires a numerical optimization, which can sometimes be slow. So we also provide an optimization-free method. The data amplitude x̃_ω is a well-motivated estimate of the probability trajectory amplitude at ω, since E(X̃_ω) = p̃_ω. In terms of the rescaled amplitudes z̃, this suggests setting γ_ω = S(x) z̃_ω, where S(x) is a rescaling factor that inverts the normalization of the z̃ amplitudes by the data standard deviation (and also rescales for the chosen normalization of the basis functions). This is intuitive but flawed: the resulting probability trajectory is not necessarily a valid probability at all times. To address this, we add a regularization to this estimator. We map each amplitude via

γ_ω → (1 − δ) γ_ω,

with δ the smallest constant such that the estimated probability trajectory p(γ, t) is within [ε, 1 − ε] at all data acquisition times t_i, for some chosen ε ≥ 0 (e.g., ε = 0 if estimates on the parameter space boundary are considered acceptable). We refer to this estimation method as the "Fourier filter," because it is a form of signal filtering. The Fourier filter returns very similar estimates to MLE under most conditions, although note that if ε = 0 it is possible for the Fourier filter to return estimates that have a likelihood of zero. All data analysis in this paper uses MLE, except the time-resolved RB analysis. The motivation for using the Fourier filter in time-resolved RB is that then the speed of the analysis is limited only by the fast Fourier transform and curve fitting at the times of interest. This is particularly relevant to RB, as one motivation for using RB is its speed and simplicity. Moreover, note that for all the data herein the difference between the MLE and the Fourier filter estimates is negligible.
F. Analyzing data from circuits with more than two outcomes

Throughout this article we have only considered analysis of circuits with two possible outcomes. This is sufficient for all the experiments and simulations herein. However, the techniques we present can be easily generalized to M-outcome circuits, as we now briefly outline. In this more general setting, the aim is still to detect and quantify time variation in the probability distribution over outcomes for each circuit. However, M can be very large in commonly encountered circumstances, as M = 2^n for an n-qubit circuit terminated with the standard measurement. Even accurate estimation of a time-independent distribution over a very large number of outcomes is infeasible with reasonable quantities of data, so it will not be possible to accurately estimate a time-dependent distribution with many outcomes. For this reason, it is useful to allow for a coarse-graining of the outcomes into M̃ ≤ M bins (e.g., we can marginalize over some subset of the qubits in an n-qubit circuit, or bin the outcomes into two sets). The analysis will then interrogate time-dependence in the coarse-grained probability distribution. So coarse-graining can make the analysis feasible with a reasonable amount of data, but the analysis will be insensitive to any time dependence that disappears under the chosen coarse-graining.
Each coarse-grained outcome has a bit-string time-series associated with it, corresponding to whether or not that measurement outcome was observed at each time. A power spectrum can be calculated for each such bit-string, and exactly the same hypothesis testing performed on each such power spectrum as in the 0/1 outcome case (with an appropriate correction for the increased number of hypothesis tests). Moreover, the spectrum obtained by averaging the spectra for a circuit over coarse-grained measurement outcomes can also be tested (it will be approximately χ²_{M̃−1}-distributed under the null hypothesis), or it can be tested instead of the per-outcome spectra, to avoid significance dilution caused by implementing too many hypothesis tests. The probability trajectory estimation (step 3, Fig. 1A of the main text) then proceeds exactly as in the 0/1 outcome case, except that now the estimation is of M̃ trajectories that must sum to 1 at all times. Any time-resolved parameter estimation proceeds exactly as in the 0/1 outcome case.

SUPPLEMENTARY NOTE 2 - TIME-RESOLVED TOMOGRAPHY AND BENCHMARKING
Here we provide supporting theory and information for our time-resolved tomography and benchmarking methods. These are the methods summarized in steps 4 and 5 of the flowchart in Fig. 1 of the main text.

G. Model selection in time-resolved tomography
We now explain the model selection step in our "intrusive" approach to time-resolved tomography and benchmarking (step 5a, Fig. 1A of the main text). Our methods for time-resolved tomography and benchmarking start from a time-independent {γ_j}-parameterized model M that predicts circuit outcomes. In the intrusive approach we expand each parameter γ_j into the time-dependent form

γ_j(t) = γ_{j,0} + Σ_{ω∈S_j} γ_{j,ω} f_ω(t),

where the γ_{j,ω} are restricted only so that they satisfy any constraints on γ_j, the summation is over some set of non-zero frequencies S_j, and j indexes all the parameters in the time-independent model. So, we obtain a new model M_tr containing Σ_j (1 + |S_j|) free parameters, where |S| denotes the size of the set S. The parameterized model is entirely defined by the set M = {S_j}, and the model selection consists of choosing this set.
For the related problem of choosing the frequencies in the model for a probability trajectory (see above) there is a clear solution: our instability detection methods directly test hypotheses about the frequencies in p(t), and so there is a clear statistical justification for choosing the frequencies in our model for p(t) using the results of these tests. In contrast, the parameters in the model M do not necessarily have a simple relationship to the circuit probabilities - for an arbitrary model the relationship can be anything at all. So the relationship between the frequencies detected in the probability trajectories and the frequencies in each γ_j is also, for a general model, unconstrained. However, for typical models associated with tomographic methods and circuits, there is, under some approximation, a simple relationship between the model parameters and the predicted probabilities.
To illustrate this, we turn to the Ramsey model with a time-dependent phase that was introduced in the main text:

p_l(t) = a + b cos( l t_w Ω(t) ).

If we expand around l t_w Ω(t) = 0 we obtain

p_l(t) ≈ a + b − (b/2) l² t²_w Ω(t)² + O( [l t_w Ω(t)]⁴ ).

So, if l t_w Ω(t) is small then the dominant frequencies in p_l(t) are also necessarily frequencies in Ω(t). Equivalent expansions around the zero-error parameter values hold for the general process matrix model of GST - and most other tomographic techniques use a model that is a restriction of that in GST. Therefore, we can use the results of the hypothesis tests on the data from sufficiently short circuits to select the frequencies for the model parameters. But longer circuits are more sensitive to small variations in the model parameters, so we do not want to entirely discount that data in the model selection. To address this problem we can pick a small set of candidate models and then choose between them using the Akaike information criterion (AIC) [10]. For a parameterized model M with k free parameters, the AIC score for that model is

AIC(M) = 2k − 2 ln(L_max),

where L_max is the maximum of the likelihood function for that model. For a set of candidate parameterized models, the preferred model is the one with the minimum AIC score. Moreover, AIC(M_a) − AIC(M_b) is twice the log relative likelihood of model M_b with respect to model M_a, so this difference quantifies how much one model is preferred over another. This allows us to decide between multiple time-resolved models.
For the Ramsey experiment presented in the main text we select the expansion in which W = {1, 2, 3, 4, 5, 6, 7, 9, 11, 12, 13, 14, 15} is the set of all the frequency indices found to be significant in any of the power spectra. We then check that the data justifies including the higher frequencies in the model - because the lowest frequencies appear for circuits where l is small enough that l t_w Ω(t) ≈ 0, they are almost certainly components in the true Ω(t). In particular, we select the 10 candidate parameterized models defined by the frequency sets W_0 = {1, 2, 3, 4} through to W_9 = W, and compare their AIC scores, AIC_k for k = 0, 1, . . . , 9. The values of AIC_k − AIC_9 for k < 9 are all well above zero (ranging from around 10 to around 1000). So the model containing all the frequencies is clearly favored according to the AIC. The detuning estimate reported in the main text is the MLE over this parameterized model. There is no guarantee that W contains all of the frequencies that are components in the true Ω(t). We could also select some larger models to consider, but we do not do this because it would be ad hoc: there are N − 14 = 5986 remaining frequencies that we could also include, and there is no clear method for choosing the next frequency to add. We could calculate the AIC for every possible model, but this is clearly impractical. In general, within this framework of time-resolved models based on Fourier decompositions there are 2^{(N−1)k} possible time-resolved models, where k is the number of parameters in the time-independent model. Moreover, this would not be a reasonable use of the AIC: when adding an erroneous parameter to a model, sometimes the AIC score will decrease slightly due to random fluctuations in the data slightly favoring this model, and if we consider adding many different parameters this will almost certainly occur in at least one case.
In the main text we also presented the results of time-resolved Ramsey tomography after we had discarded 80% of our data (i.e., we keep every 5th click for each circuit). This analysis followed the same procedure as the analysis on all the data. In particular, note that we implemented independent model selection, i.e., we did not "cheat" by reusing the time-resolved model selection that was performed using all the data. Using only 20% of the data resulted in a smaller model, whereby W = {1, 2, 3, 5, 6, 7, 9}. We therefore would not expect the two estimates of Ω(t) to be equal up to statistical uncertainty (and they are not - see Supplementary Fig. 1E), as both of these statistical uncertainties are standard in-model uncertainties and the two estimates are fits to different models.

H. Time-resolved RB
RB is a set of protocols for benchmarking a set of gates on n qubits [11][12][13]. The core protocols all follow the same general procedure: (i) For a range of lengths m, run K circuits sampled from some distribution Φ_m, whereby (a) the average length of a sampled circuit scales linearly with m, and (b) each circuit will deterministically output some n-bit string if implemented perfectly.
(ii) Process the data to obtain the average success probability (P m ) at each length m, meaning the probability that the expected bit-string was successfully output.
(iii) Fit P_m to P_m = A + Bλ^m, and convert λ to an error rate via r = κ(1 − λ), where κ is a constant that we set to κ = (4^n − 1)/4^n, with n the number of qubits.
Within this framework there are a range of different RB protocols, e.g., those of Refs. [11][12][13], that vary in how the random circuits are chosen. Our methods can be used to upgrade any of these RB protocols to time-resolved techniques. We focus on the non-intrusive approach to time-resolved RB, which is the following method: 1. Obtain data by rastering through the sampled circuits.
2. Use the general stability analysis to estimate the timeresolved success probability p(t) for each circuit.
3. Average the time-resolved success probabilities for all circuits of the same RB length m to obtain time-resolved average success probabilities P m (t).
4. Apply the standard RB curve fitting analysis [step (iii) above] to P m (t i ) at every time of interest t i .
The general instability analysis step in this process has some useful flexibility. It has two aspects that can be optimized to a particular task: the exact statistical hypothesis testing on the power spectra, and the selection of the frequencies to include in the time-resolved p(t) models. Because the RB circuits are random, they are broadly all sensitive to the same physical parameters, with the exact sensitivities depending on the precise details of the random circuit. So the frequencies in p(t) will typically be the same in all the sampled circuits, even if the power in each frequency will generally vary between circuits. In this circumstance the averaged power spectrum is more powerful for detecting these frequencies, because the averaging suppresses measurement shot noise and amplifies signal if the same frequencies appear in most of the spectra. Moreover, here there is little penalty for erroneously including a frequency in the success probability trajectory in a few circuits -we are implementing a global analysis of the data, rather than trying to extract detailed diagnostic information. So we only perform hypothesis testing on the average power spectrum (i.e., we set w = 1 in Supplementary Eqs. (22-23)), and then we include every statistically significant frequency in the model for each success probability.

I. Time-resolved RB simulation
In the main text we demonstrate time-resolved RB on simulated data. This demonstration is of time-resolved direct randomized benchmarking (DRB) [13] on two qubits, with a gate set consisting of a cphase gate between the qubits (i.e., a controlled σ_z), rotations around the x̂ and ŷ axes by ±π/2 and π on each qubit, and the Hadamard and idle gates on each qubit. The details of how DRB circuits are sampled (i.e., the Φ_m of DRB) can be found in Ref. [13]; they are omitted here as they are not relevant to assessing the performance of time-resolved RB.
We simulated gates with time-dependent coherent phase errors and time-independent depolarizing errors. Each gate is modeled as the perfect gate followed by an error map on each qubit composed of a one-qubit depolarizing map

D_γ(ρ) = γρ + (1 − γ) 1/2

and a rotation by an angle θ_t. We set γ = 1 − 0.04/3, making the probability of a Pauli error 1%. The rotation angle θ_t is set to a combination of a sinusoidal oscillation and a random walk, with parameters a = 2π × 10⁻⁵, b = 1.5 × 10⁻² and φ = 2π × 10⁻², where c_t is a Brownian stochastic process given by c_0 = 0 and c_{t+1} = c_t + N(0, ν) with ν = 3/200. The parameters in this error model have been chosen to give physically plausible phase trajectories, and to not be artificially easy for our methods (e.g., they do not induce p(t) that are exceptionally sparse in the Type-II DCT basis). They have no further significance. The time t starts at 0 and increases by 1 after a single run of any circuit.
In Fig. 2A of the main text the estimated time-resolved error rate is compared to the true time-dependent r(t). The value of r(t) is obtained by fitting the exact instantaneous average success probabilities for the set of 100 sampled circuits. This r(t) is the target that time-resolved DRB with this particular set of random circuits is aiming for -it does not correspond to the large-sample-limit error rate that finite-sampling DRB is estimating. We choose this comparison because analysis of time-series data from one particular set of K randomly sampled circuits cannot correct for any finite sampling error introduced when sampling this set of circuits. By defining r(t) in this way, we avoid conflating this finite-circuit-sampling error with any inaccuracies that are intrinsic to time-resolved benchmarking.

J. Gate set tomography
GST is a method for jointly reconstructing a set of gates, a state preparation and a measurement [14,15]. GST is predicated on the standard Markovian model for errors in QIPs, whereby the prepared state is a fixed density operator, the gates are fixed completely positive and trace preserving (CPTP) maps - which are the linear transformations that map density operators to density operators - and the measurement is a fixed positive operator-valued measure. For our implementation of GST, we parameterize a CPTP gate by G = Λ ∘ G_target, where G_target is the fixed target map and Λ is a CPTP map encompassing the errors in the gate. This error map is then parameterized as Λ = exp(G), with G defined by

G = Σ_k θ_k H_k + Σ_{j,k} s_{jk} S_{jk},

where the summations are over x, y and z, and

H_k(ρ) = −(i/2)[σ_k, ρ],   S_{jk}(ρ) = σ_j ρ σ_k − (1/2){σ_k σ_j, ρ}.

Here, σ_x, σ_y and σ_z are the Pauli operators. The {H_k} maps generate all unitary errors. If (θ_x, θ_y, θ_z) = φ v̂, where v̂ is a unit vector, then the generated unitary is a rotation around the v̂ axis of the Bloch sphere by an angle φ. So θ_x, θ_y and θ_z are the components of the unitary error along the x̂, ŷ and ẑ axes in the standard sense. The {S_{jk}} maps generate all non-unitary Markovian errors, so the matrix s dictates the stochastic error rates and the size of any amplitude damping errors. In this parameterization, the CPTP constraint is that s is positive semi-definite. Finally, note that this parameterization of Λ includes only infinitesimally-generated CPTP maps; it does not strictly permit every possible CPTP map.

K. Time-resolved GST simulation
In the main text, time-resolved GST is demonstrated on simulated data. Here we detail the error model that was simulated, and use this example to explain how we apply our methods to implement time-resolved GST. The simulation is for GST circuits out to L = 128 containing the gates G i , G x and G y with an error model that, when expressed in the parameterization above, consists of: (i) Setting the s matrix for all three gates so that if there are no coherent errors Λ is a uniform depolarizing channel with a 1% error rate (this is s ∝ 1).
(ii) Setting all the coherent error parameters to zero, except θ_z(G_i), θ_x(G_x) and θ_y(G_y), which are given time-dependent trajectories built from a small number of Type-II DCT basis functions (low-frequency components for all three parameters, with additional high-frequency components for θ_x(G_x) and θ_y(G_y)). Here θ_k(G) is the H_k error rate for gate G, and f_k(t) is the k-th basis function of the Type-II DCT.
Time starts at 0 and increments by 1 after each raster. The trajectories of the time-dependent parameters over the 1000 simulated rasters are shown in Fig. 2D of the main text (and are also included in Fig. 2C).
To implement time-resolved GST on this simulated data we first apply the general stability analysis. The resulting λ_p statistic for every circuit is shown in Supplementary Fig. 2A, where p is the p-value of the largest peak in the power spectrum for that circuit. (Due to numerical accuracy limitations, the minimal p-value is p = 10⁻¹⁶.) The value of λ_p varies strongly between circuits of the same approximate length L. The strong variations in λ_p imply that the time-varying errors are coherent, as stochastic errors have a fairly uniform impact across the GST circuits. Moreover, by inspecting the values of λ_p for the simple germs G_i, G_x and G_y it becomes clear that θ_z(G_i), θ_x(G_x) and θ_y(G_y) are varying. For example, with the G_i germ it is the generalized Ramsey circuits for which λ_p is statistically significant, as highlighted in Supplementary Fig. 2B. These circuits amplify θ_z(G_i), so this parameter is likely varying over time. Similarly, it is the generalized Rabi sequences that exhibit statistically significant λ_p for the G_x and G_y germs, and these circuits amplify over/under-rotation errors in these gates, i.e., the parameters θ_x(G_x) and θ_y(G_y). So we conclude that θ_z(G_i), θ_x(G_x) and θ_y(G_y) should be expanded into a summation of Fourier coefficients.

[Supplementary Figure 2 caption: The evidence for instability in the outcomes of each circuit in simulated GST data, as quantified by the test statistic λ_p. A pixel is colored when λ_p is large enough to be statistically significant, otherwise it is greyscale. The arrangement of the data is as in Fig. 3 of the main text. The pattern of significant λ_p is used to decide which model parameters to expand into a sum of Fourier coefficients. C. The results of time-resolved GST when we select a time-resolved parameterized model using a simple frequency selection technique that erroneously includes high-frequency components in the ẑ-axis rotation angle of G_i. The parameter estimates still closely track the true values, demonstrating that time-resolved GST can be successful even with simple model selection methods.]

Having decided on the set of parameters that are to be expanded into time-dependent functions, we now use the short-hand θ_i ≡ θ_z(G_i), θ_x ≡ θ_x(G_x) and θ_y ≡ θ_y(G_y). Deciding on the frequencies to include in the expansion of each of these parameters is more involved. One simple option is to use the frequencies that are statistically significant in the global power spectrum - the set W = {1, 2, 3, 4, 97, 98, 99, 100, 101, 102} - as the frequency set for all three parameters. This parameterized model (M_1) includes all of the frequencies in the true model, but also sums and differences of these frequencies, as well as high frequencies in the expansion of G_i that are not present in the true model. Yet, implementing MLE over this parameterized model still provides an accurate time-resolved estimate of θ_i, θ_x and θ_y, as shown in Supplementary Fig. 2C.
A more sophisticated model selection strategy is to select the frequencies to include in the expansion for θ_i (the set W_i) to be those frequencies that are statistically significant in the circuits in which G_i is a germ, and similarly for θ_x and θ_y. This results in the sets W_i = {1, 2, 3}, W_x = {1, 2, 3, 100, 103} and W_y = {1, 2, 3, 5, 97, 100, 101, 103}, and constitutes a smaller parameterized model (M_2). Even this smaller model still contains frequencies that are not in the true model, but each frequency set now only contains sums and differences of the frequencies that appear in the corresponding parameter. The MLE over this parameterized model is what is reported in the main text. The difference in the AIC scores of these two candidate parameterized models is AIC(M_1) − AIC(M_2) ≫ 1. So the AIC favors discarding the erroneous parameters in M_1.

L. Time-resolved GST on experimental data
We now detail the exact approach used to implement time-resolved GST on the data from the first of our two GST experiments (as discussed in the main text, we implemented standard time-independent GST on the data from the second experiment). As discussed in the main text, we use the λ_p statistics to motivate expanding only the ẑ-axis rotation angle of the G_i gate - i.e., using the same notation as above, the parameter θ_z ≡ θ_z(G_i) - into a sum of Fourier components. Specifically, we expand θ_z into

θ_z(t) = θ_{z,0} + Σ_{ω∈W} θ_{z,ω} f_ω(t),

where W = {1, 2, 5, 7, 15} is the set of significant frequencies in the average power spectrum (shown in Fig. 3D). This method for the frequency selection is motivated by noting that, as we have decided that only θ_z exhibits time-dependence, all the frequencies in the averaged power spectrum must be attributable to variations in θ_z.

[Table 1 caption: Error rates for the three gates in the two GST experiments, obtained using time-resolved GST. In the first experiment the error rates of the G_i gate vary over time (see Fig. 3C of the main text); the values reported here are the average over the experiment duration. All other error rates are static over the course of each experiment. See the text for definitions of the three different error rates.]

So far this is exactly the same type of analysis as with the time-resolved GST on simulated data. However, fitting the model requires more care with experimental data. First we implemented standard time-independent GST on all the data. The predictions of this GST estimate are compared to the aggregated counts data in Supplementary Fig. 3, which summarizes the discrepancy between the data and the predictions of the model estimated by standard GST using (A) the log-likelihood ratio (LLR) and (B) the total variational distance (TVD). The LLR test statistic for a circuit is

λ_LLR = 2 ln( L_freq / L_gst ).

Here L_gst is the likelihood, given the data, of the outcome probabilities for that circuit that are predicted by the GST estimate. Conversely, L_freq is the likelihood of the outcome probabilities that are equal to the observed frequencies (i.e., for each outcome, set the corresponding probability equal to the number of times that particular outcome was observed divided by the total number of outcome counts for the circuit). Under the null hypothesis that the model of standard GST is true, Wilks' theorem tells us that the LLR is asymptotically χ²₁-distributed. This is because, since we have many circuits, the degrees of freedom per circuit is approximately zero for the GST model (used to compute L_gst) and one for the model maximized over to compute L_freq. The TVD for a circuit is

TVD = | p_model − x̄ |,

where p_model is the probability for the circuit to output 1 predicted by the estimated model and x̄ is the mean of the clickstream x for the circuit. We do not use the TVD as a test statistic (it has less convenient properties for hypothesis testing than the LLR), but the TVD is useful as - unlike the LLR - it quantifies the distance between the observed frequencies and the predicted probabilities. There are statistically significant LLRs in Supplementary Fig. 3A, so the data are inconsistent with the model of standard GST. From the general instability analysis, we know that there is some instability in the qubit (see Fig. 3A of the main text).
So we did not expect the time-independent model of standard GST to be consistent with the data. But if drift were the only effect outside of GST's model needed to explain the data, we would expect the predictions of the estimated GST model to be inconsistent with the data only for those circuits that exhibit drift. This is not the case: comparing Supplementary Fig. 3 to Fig. 3A of the main text shows that the inconsistency between the estimate of standard GST and the data cannot all be attributed to detectable drift. In particular, the data from many of the long circuits containing germs other than G_i are irreconcilable with the model of GST, but there is no evidence for drift in these circuits. So there is some other non-Markovian effect that our time-resolved parameterized model does not take into account.
This sort of "model violation" interferes with the model selection and data fitting techniques that we rely on. If we fit our time-resolved parameterized model to all the data, the data that are inconsistent with any set of parameter values can pollute the estimates and render them meaningless. So we first estimate most of the time-independent parameters in the model -specifically, the state preparation, measurement, G x , G y and the non-unitary errors matrix s for G i -by implementing ordinary GST on the data from circuits with L ≤ 64 (these circuits are short enough for the model of GST to be reasonably consistent with the data). Only the two time-independent coherent error parameters in G i and θ z -which is a summation of 6 Fourier coefficients -are not fixed by this initial fit. Using MLE, we then fit this 8-parameter time-resolved model to the time-series data for all circuits with L ≤ 64 and all the circuits where G i is a germ with L ≤ 1024. This allows us to extract information about the time-dependent parameter without polluting the results with data that does not fit this model.

M. Estimated gate error rates
In the main text, we report the diamond distance error rate (ε_⋄) of the estimated (time-resolved) gate process matrices in the two GST experiments. Table 1 includes additional information on the gate error rates. In addition to ε_⋄, this table reports the entanglement infidelity (ε_f) and the spectral entanglement infidelity (ε_spec,f) for each gate. The diamond distance error rate for a process matrix G is

ε_⋄(G) = (1/2) || G − G_target ||_⋄,

where ||·||_⋄ is the diamond norm [16] and G_target is the target process matrix that G ideally should be. The entanglement infidelity is

ε_f(G) = 1 − ⟨ϕ| (G ∘ G_target⁻¹ ⊗ I)( |ϕ⟩⟨ϕ| ) |ϕ⟩,

where ϕ is any maximally entangled state [17]. Although the entanglement infidelity is defined with respect to a maximally entangled state, note that it is well-defined for one-qubit gates and it does not imply that any entangling operations are being performed. Another commonly used gate infidelity is the average gate infidelity (ε̄). The entanglement infidelity is related to the average gate infidelity via a linear rescaling. Specifically, ε_f = (d + 1)ε̄/d [17], where d is the dimension, so here d = 2. We use the entanglement infidelity because ε_⋄ = ε_f for depolarizing errors, so the difference between ε_⋄ and ε_f is a rough measure of how structured the errors are; e.g., for coherent errors ε_⋄ > ε_f.
The spectral entanglement infidelity (ε_spec,f) is defined to be the entanglement infidelity ε_f with G and G_target replaced by Spec(G) and Spec(G_target), where Spec(A) is the diagonal matrix that has the eigenvalues of the matrix A on the diagonal (and the relative ordering of the eigenvalues in the two matrices is chosen to minimize ε_spec,f). The spectral entanglement infidelity is an alternative to the standard entanglement infidelity that is invariant under transformations of the "GST gauge" (see Ref. [14] for discussions of this gauge).
Looking at Table 1 we see that all three gates have a smaller diamond distance error rate in the second experiment, with a decrease from ε_⋄ ∼ 0.5% to ε_⋄ ∼ 0.05%. The infidelity also decreases for G_y but, in contrast, for G_i and G_x the infidelity increases from ε_f ∼ 0.003% to ε_f ∼ 0.006%. The diamond distance error rate is sensitive at order θ to a coherent error with a rotation angle of θ, whereas the infidelity is sensitive to coherent errors only at order θ². So, a decrease in ε_⋄ and an increase in ε_f is a sign that the magnitude of the coherent errors (i.e., calibration errors) has been reduced in the second experiment, but that the rate of stochastic errors has slightly increased.

SUPPLEMENTARY NOTE 3 - EXPERIMENT DESIGN
In this section we discuss how to design experiments to efficiently detect and characterize instability. This includes details on the relationship between the amount of data taken and the sensitivity of our methods.
N. How much data do we need?
Consider looking for instability in a given set of circuits. For example, we could be running an algorithm, or syndrome extraction circuits, or the circuits from some benchmarking routine. Furthermore, assume that we are going to collect data by rastering through this set of circuits N times. It then only remains to choose N, the length of the clickstream for each circuit. In this circumstance, it is important to understand how the sensitivity of our methods varies with N.
Each circuit has some discrete-time probability trajectory p = (p_0, p_1, . . . , p_{N−1}). We can always express this as a summation over the basis vectors of F. As we use the Type-II DCT for all of our explicit data analysis, we will use this basis for clarity. In this basis, we write

p_i = p̄ + Σ_{ω=1}^{N−1} γ_ω cos( ωπ(2i + 1)/(2N) ),    (48)

for some amplitudes {γ_ω}. We can use our statistical model for the frequency-domain data (see earlier) to approximately calculate the probability of successfully detecting that some γ_ω is non-zero, as a function of γ_ω, N and the test significance α.

[Supplementary Figure 4 caption: Instability detection sensitivity. The probability of detecting that p has a non-zero component at ω, from a clickstream of length N and with an amplitude of γ at any non-zero basis frequency ω, using tests of the power spectrum at a statistical significance of 5%. Here γ corresponds to the amplitude in an unnormalized DCT expansion of p (i.e., the expansion of Supplementary Eq. (48)), meaning that if this is the only non-zero frequency with a non-zero amplitude in this expansion of p then p oscillates between p̄ ± γ.]
In particular, the probability of detecting that p has a non-zero component at ω, from a clickstream of length N and with tests at a statistical significance of α, is approximately

P_det(N, γ, α) ≈ 1 − (1/2) erf( (√T(α, N) − µ)/√2 ) − (1/2) erf( (√T(α, N) + µ)/√2 ),

where µ = γ√(N/2)/√(p̄(1 − p̄)) is the mean of Z̃_ω under our statistical model. Here, erf(·) is the error function, and T(α, N) is the power significance threshold for the hypothesis test at that frequency. In the case of testing a single clickstream, and splitting the test significance equally over all N − 1 non-zero frequencies (as we do for all our data analysis, but which is not necessarily always desirable), this is given by

T(α, N) = CDF₁⁻¹( 1 − α/(N − 1) ),

where CDF_k is the χ²_k cumulative distribution function. If we are testing C clickstreams, we simply replace α → α/C in this equation. Supplementary Fig. 4 shows a heatmap of P_det(N, γ, 0.05) as N and γ are varied. From this plot we can read off the clickstream length required to reliably detect an instability of a given amplitude. It shows that only moderate amounts of data are required to detect relatively small instabilities, e.g., an instability of amplitude γ ≈ 0.1 can be reliably detected with N ≈ 1000. This quantity of data is routinely taken to estimate static circuit outcome distributions.
As well as detecting instabilities in probability trajectories, our methods also estimate p(t). To quantify the accuracy of our probability trajectory estimation we use simulations. We simulated clickstream data sampled from a pure-tone probability trajectory

p(γ, t) = 0.5 + γ cos(ωt + τ),

where ω and τ are chosen so that this is a pure tone in the DCT basis. Supplementary Fig. 5 shows both the instability detection probability and the error in our estimator. Supplementary Fig. 5B shows the root mean squared error (RMSE) of our estimator for three values of γ (unbroken lines), alongside the corresponding baselines set by estimating p(t) as a constant given by the mean of the clickstream (dotted lines). This is the mean RMSE, obtained in 10⁵ simulations, and note that this simulation used the "Fourier filter" variant of our estimator (as it is faster to calculate than the MLE version). In the regime where N is so low that instability is almost never detected, the RMSE of our estimator is essentially equal to the RMSE of the baseline clickstream-mean estimate. This is because our estimate is the mean of the clickstream unless instability is detected. In fact, when N is so low that instability is almost never detected (as shown in Supplementary Fig. 5A), our estimator actually performs worse than the baseline, on average. This is because in approximately 5% of instances (for testing at 5% statistical significance) our detection routine will spuriously detect that there is a component of p at some random frequency, and so we will spuriously add a component at that frequency into our estimate. Comparing Supplementary Figs. 5A and 5B, we see that the RMSE of our estimator rapidly improves once N moves into the regime where instability at frequency ω is successfully detected with high probability. Once the detection probability saturates at essentially 1, the RMSE then follows O(1/√N) scaling. This is because the RMSE in the estimate of each amplitude in our selected model is essentially shot-noise limited.
So far we have considered how long a single clickstream needs to be to reliably detect instabilities of a given magnitude. But our analysis can also jointly test the data from multiple circuits, by calculating and then testing an averaged power spectrum. This allows for much greater sensitivity for a fixed N, where N is the clickstream length for each circuit. Supplementary Fig. 6 demonstrates this enhancement, by plotting the minimum N required for a 50% detection probability (N min ) when averaging spectra from 1, 10, 100 and 1000 circuits. This is for the case of clickstreams that are all sampled from the same pure-tone probability trajectory p(γ). This case is chosen as it provides the maximal enhancement from averaging spectra, and -although it is unlikely that multiple circuits will have identical p(t) -there are many circumstances under which many or all of a set of circuits would contain drift at the same frequencies.
Supplementary Fig. 6 shows that increasing the number of circuits provides a substantial reduction in N_min. For example, if γ = 0.1 then N_min ≈ 100 for 100 clickstreams, whereas N_min ≈ 1000 for a single clickstream. But note that the sensitivity enhancement from increasing the number of circuits is smaller than that from increasing the clickstream length. As illustrated in the above example, to obtain a 50% detection probability at γ = 0.1 we need 10 times more bits in total in the 100-clickstream case than in the single-clickstream case. Essentially, this is due to the difference between the O(1/√C) shot-noise suppression obtained when averaging C spectra with independent statistical fluctuations versus the linear growth with N of the power in the spectrum of p (for p with an N-independent variance).

[Supplementary Figure 5 caption: Detection probability and estimation error. A. The detection probability versus clickstream length for data sampled from a pure-tone probability p(t) = 0.5 + γ cos(ωt + τ), for three values of the amplitude γ, where ω and τ are such that this is a pure tone in the DCT basis. Note that this is the probability of detecting significant power at any non-zero ω, so the low-N limit is the false detection probability of approximately the test significance α, set here to 5%. B. The root mean squared error (RMSE) of our estimate of p(t). The dotted lines show the RMSE of the constant-probability estimates, whereby we estimate p(t) to be the mean of the clickstream. All quantities are calculated by averaging over the analysis results from 10⁵ simulated clickstream instances.]

O. Choosing the circuit repetition rate
Our methods can be used both for dedicated drift characterization and for auxiliary data analysis. For dedicated drift characterization we need to carefully choose a set of quantum circuits that are as sensitive as possible to the specific parameters under study, as we discussed in the main text. Greater sensitivity results in larger fluctuations in the circuits' outcome probabilities for a given instability in the parameters of interest, meaning that fewer data are needed to reliably detect and characterize the drift, as quantified above (see Supplementary Figs. 4-6). But there is also another important aspect of designing dedicated drift characterization experiments: the rate at which the circuits need to be repeated. If the dedicated drift characterization circuits can be run sufficiently infrequently then it becomes possible to interleave them with application circuits. Furthermore, understanding the effect of reducing the repetition rate is important even in experiments run only for the purpose of drift characterization, as we will often have to decide between running more circuits or repeating fewer circuits at a faster rate. For example, in the case of both GST and Ramsey experiments, we must choose the maximum circuit length, and this directly impacts the rate at which we can raster through the full set of circuits.
Reducing the repetition rate of the drift characterization circuits - i.e., increasing the time between each run of a circuit, t_gap - will have two effects. Firstly, for a fixed experiment duration we will obtain fewer data points, i.e., the clickstreams will be shorter. The impact of the clickstream length on the sensitivity of our methods has already been discussed in detail earlier (see Supplementary Figs. 4-6). But note that, for dedicated drift characterization, we are free to choose the circuits that maximally amplify the parameters we expect might be drifting - so the oscillations in the p(t) need only be small if they are small in every possible circuit (under the assumption that we have correctly identified all the parameters that might be drifting). Therefore, relatively low N will typically still allow for sensitive drift characterization. In the main text we demonstrated this by discarding 80% of our Ramsey data (giving N = 1200) and showing that we still obtain a high-precision time-resolved estimate of the qubit detuning (see Supplementary Fig. 1E). Moreover, much more of that data could be discarded if we only wanted to detect drift and obtain a rough time-resolved estimate of the detuning.
The second impact of increasing t_gap is that the sensitivity to high-frequency drift is reduced. As with all signal analysis techniques that use regularly sampled data, there is a Nyquist limit: oscillations at a frequency of 1/(2t_gap) are undetectable, and oscillations at higher frequencies are either aliased, or they are undetectable if they are integer multiples of this frequency. The repetition rate should therefore be chosen to maintain sensitivity to the frequency range of interest. As illustrated by our analysis of the Ramsey experiment, this means that we can have a slow repetition rate if there is only low-frequency drift.