Optimizing quantum gates towards the scale of logical qubits

A foundational assumption of quantum error correction theory is that quantum gates can be scaled to large processors without exceeding the error threshold for fault tolerance. Two major challenges that could become fundamental roadblocks are manufacturing high-performance quantum hardware and engineering a control system that can reach its performance limits. The control challenge of scaling quantum gates from small to large processors without degrading performance often maps to non-convex, high-constraint, and time-dynamic control optimization over an exponentially expanding configuration space. Here we report on a control optimization strategy that can scalably overcome the complexity of such problems. We demonstrate it by choreographing the frequency trajectories of 68 frequency-tunable superconducting qubits to execute single- and two-qubit gates while mitigating computational errors. When combined with a comprehensive model of physical errors across our processor, the strategy suppresses physical error rates by ~3.7× compared with the case of no optimization. Furthermore, it is projected to achieve a similar performance advantage on a distance-23 surface code logical qubit with 1057 physical qubits. Our control optimization strategy solves a generic scaling challenge in a way that can be adapted to a variety of quantum operations, algorithms, and computing architectures.


I. INTRODUCTION
Superconducting quantum processors have demonstrated elements of surface code quantum error correction [4][5][6], establishing themselves as promising candidates for fault-tolerant quantum computing. Nonetheless, imperfections in hardware and control introduce physical errors that corrupt quantum information [7] and could limit scalability. Even if a large enough quantum processor with a high enough performance limit to implement error correction can be manufactured, there is no guarantee that a control strategy will be able to reach that limit.

*Corresponding author: pklimov@google.com
Choreographing frequency trajectories is a complex optimization problem due to engineered and parasitic interactions among computational elements [27] and their environment [29][30][31], hardware [32] and control [28] inhomogeneities, performance fluctuations [33], and competition between error mechanisms. Mathematically, the problem is non-convex, highly constrained, time-dynamic, and expands exponentially with processor size.
Past research into overcoming these complexities employed frequency partitioning strategies [34,35] that either faced difficulties scaling with realistic hardware imperfections or whose scalability is not well understood [4,13,17,36]. To overcome the limitations of these strategies, we proposed the "Snake" optimizer [37] and employed an early version in past reports [38][39][40][41][42][43][44][45][46][47]. However, an optimization strategy had not been developed around it, it had not been rigorously benchmarked, and processors large enough to investigate its scalability have only recently become available. Whether high-performance configurations exist at scale, and whether they can be quickly discovered and stabilized, are open questions.
Here we address these questions by developing a control optimization strategy around Snake that can scalably overcome the complexity of problems like frequency optimization within the high-performance, high-stability, and low-runtime requirements of an industrial system. The strategy introduces generic frameworks for building processor-scale optimization models, training them for various quantum algorithms, and adapting to their unique optimization landscapes via Snake. This flexible approach can be applied to a variety of quantum operations, algorithms, and architectures. We believe it will be an important element in scaling quantum control and realizing commercially valuable quantum computations.
We investigate the prospects of this strategy for optimizing quantum gates for error correction in superconducting qubits. We demonstrate that it strongly suppresses physical error rates, approaching the surface code threshold for fault tolerance on our processor with tens of qubits. To pave the way towards much larger processors, we demonstrate Snake "healing" and "stitching", which were designed to stabilize performance over long timescales and geometrically parallelize optimization. Finally, we introduce a simulation environment that emulates our quantum computing stack and combine it with optimization, healing, and stitching to project the scalability of our strategy towards thousands of qubits.

[Fig. 1 caption: Corresponding qubit frequency trajectories (F), parameterized by single-qubit idle (f_j for qubit q_j) and two-qubit interaction (f_ij for q_i and q_j) frequencies. Quantum computational errors depend strongly on frequency trajectories since most physical error mechanisms are frequency dependent (red dots are non-exhaustive examples). Namely, pulse distortion errors (1) happen during large frequency excursions. Relaxation errors (2) happen around relaxation hotspots, for example due to two-level-system defects (TLS, horizontal resonance). Stray coupling errors (3) happen during frequency collisions between coupled computational elements. Dephasing errors (4) happen towards lower frequencies, where qubit flux sensitivity grows. (d) We leverage our understanding of physical errors to estimate the algorithm's error (E) and then optimize it with respect to qubit frequency trajectories. (e) We employ the Snake optimizer, which can solve optimization problems at arbitrary dimension, controlled by the scope parameter (S). These graphs show possible idle (nodes) and interaction (edges) frequency optimization variables (blue) at one Snake optimization step, for scopes ranging from S = S_max (global limit) to S = 1 (local limit). (f) Snake optimization threads for three scopes (increasing downwards). Snake's high configurability enables it to scalably overcome frequency optimization complexity and be adapted to a variety of quantum operations, algorithms, and architectures.]

II. QUANTUM HARDWARE
Our hardware platform is a Sycamore processor with N = 68 frequency-tunable transmon qubits on a two-dimensional lattice. Engineered tunable coupling exists between 109 nearest-neighbor pairs [6,48,49]. We configure the processor to execute the surface code gate set, which includes single-qubit XY rotations (SQ) and two-qubit controlled-Z (CZ) gates [10]. SQ gates are implemented via microwave pulses resonant with qubits' respective |0⟩↔|1⟩ idle frequencies (f_i for qubit q_i executing SQ_i). CZ gates are implemented by sweeping neighboring qubits into |11⟩↔|02⟩ resonance near respective interaction frequencies (f_ij for qubits q_i and q_j executing CZ_ij) and actuating their couplers. The N = 68 idle and ∼2N = 109 interaction frequencies, which constitute one frequency configuration F with dimension |F| ∼ 3N = 177, parameterize qubit frequency trajectories, which we seek to optimize.

III. PERFORMANCE BENCHMARK
We evaluate the performance of frequency configurations via the parallel two-qubit cross-entropy benchmarking algorithm (CZXEB) [38,50]. CZXEB executes cycles of parallel SQ gates followed by parallel CZ gates (see SI), benchmarking them in a context representative of many quantum algorithms. Most relevant to this study is that CZXEB reflects the structure of the surface code's parity checks and has empirically served as a valuable performance proxy for logical error [6]. The processed output of CZXEB is the benchmark distribution e_c, in which each value is one qubit pair's average error per cycle e_{c,ij}, which includes error contributions from the respective SQ_i, SQ_j, and CZ_ij gates. Benchmarks are generally not normally distributed across a processor and are thus reported via percentiles, as the median (50th percentile) with superscript and subscript offsets to the 97.5th and 2.5th percentiles, and plotted as quantile boxplots. The wide range from the 2.5th to the 97.5th percentile is the distribution spread, which spans ±2σ standard deviations for normally distributed data.
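The percentile convention above can be reproduced with a few lines of NumPy; the cycle-error samples here are synthetic stand-ins, not measured data.

```python
import numpy as np

def report_benchmark(ec):
    """Summarize a benchmark distribution as its median with offsets to the
    97.5th and 2.5th percentiles, mirroring the paper's reporting convention."""
    p2_5, p50, p97_5 = np.percentile(ec, [2.5, 50.0, 97.5])
    # median, upper offset (positive), lower offset (negative)
    return p50, p97_5 - p50, p2_5 - p50

# Synthetic, heavy-tailed cycle errors for 109 qubit pairs (illustrative only).
rng = np.random.default_rng(0)
ec = np.exp(rng.normal(np.log(7e-3), 0.4, size=109))
median, up, down = report_benchmark(ec)
```

Such heavy-tailed distributions are exactly why the paper reports percentiles rather than a mean and standard deviation.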

IV. OPTIMIZATION MODEL
We approach frequency optimization as a model-based problem. In turn, we must define an algorithm error estimator E that is representative of the performance of the target quantum algorithm A at the optimizable frequency configuration F (Fig. 1d). This problem is hard because the estimator must be fast for scalability, predictive for scaling projections, and physical for metrology investigations. We introduce a flexible framework for reconciling these competing requirements that can be adapted to define the optimization landscapes of a variety of quantum operations, algorithms, and architectures (see SI).
Our framework corresponds to the decomposition E(F|A, D) = Σ_{g∈A} Σ_{m∈M} w_{g,m}(A) ϵ_{g,m}(F_{g,m}|D). The sums run over all gates g ∈ A and known physical error mechanisms m ∈ M. The ϵ_{g,m} are algorithm-independent error components that depend on some subset of frequencies F_{g,m} ⊆ F and can be computed from relevant characterization data D. The w_{g,m} are algorithm-dependent weights that capture algorithmic context via training on benchmarks that are sufficiently representative of A. Defining the estimator thus maps to defining the target quantum algorithm, defining the algorithm-independent error components, and then training the algorithm-dependent weights.
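A minimal sketch of this decomposition, with hypothetical mechanism names, weights, and characterization values (none of these numbers or component forms come from the paper's implementation):

```python
# Sketch of E(F|A,D) = sum over gates g and mechanisms m of w_{g,m} * eps_{g,m}.
def algorithm_error(gates, weights, components, F, D):
    """gates: gate identifiers in algorithm A.
    weights[(g, m)]: trained algorithm-dependent weight (0 if absent).
    components[m]: algorithm-independent error component, (g, F, D) -> error."""
    return sum(
        weights.get((g, m), 0.0) * eps(g, F, D)
        for g in gates
        for m, eps in components.items()
    )

# Toy example: a single CZ gate with relaxation and dephasing components.
components = {
    "relaxation": lambda g, F, D: D["gamma1"][g] * D["gate_len"][g],
    "dephasing":  lambda g, F, D: (D["flux_sens"][g] * D["flux_noise"]) ** 2,
}
F = {"cz_01": 6.0e9}  # interaction frequency (unused by these toy components)
D = {"gamma1": {"cz_01": 5e4}, "gate_len": {"cz_01": 30e-9},
     "flux_sens": {"cz_01": 0.5}, "flux_noise": 1e-2}
weights = {("cz_01", "relaxation"): 1.0, ("cz_01", "dephasing"): 1.0}
E = algorithm_error(["cz_01"], weights, components, F, D)  # 1.5e-3 + 2.5e-5
```

The separation matters: the lambdas encode physics and are reusable across algorithms, while the weights are retrained per target algorithm.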
We set our target quantum algorithm to CZXEB to gear the estimator towards the surface code's parity checks. Furthermore, since CZXEB is also our benchmarking algorithm, we can associate the performance of optimized frequency configurations with our optimization strategy. We then define error components corresponding to dephasing [22][23][24], relaxation [23,25,26], stray coupling [27], and frequency-pulse distortion [28] over qubit frequency trajectories. The relevant characterization data include qubit flux-sensitivity spectra, energy-relaxation rate spectra, parasitic stray coupling parameters, and pulse distortion parameters, which are measured prior to optimization. Finally, we train the weights via a protocol that we developed specifically to reduce the risk of overfitting [51] (see SI). It constrains weights via homogeneity and symmetry assumptions and then leverages the frequency tunability of our architecture to train them on single- and two-qubit gate benchmarks taken in configurations of variable complexity.
The resulting algorithm error estimator represents a comprehensive understanding of physical errors in our processor. It spans ∼4 × 10^4 error components, has only 16 trained weights for the full processor, and is trained and tested on ∼6500 benchmarks. Despite its scale, it can still be evaluated ∼100 times per second on a desktop. Furthermore, it can predict CZXEB cycle errors over the wide range ∼(3–40) × 10^-3 to within two factors of experimental uncertainty (see SI). In total, the estimator fulfills our speed, predictivity, and physicality requirements.

V. OPTIMIZATION STRATEGY
Finding an optimized frequency configuration from the algorithm error estimator maps to solving F* = argmin_F E. This problem is hard for several reasons. First, all |F| ∼ 3N idle and interaction frequencies are interdependent due to engineered and parasitic interactions between nearest- and next-nearest-neighbor qubits. Second, the estimator has numerous local minima, since most error mechanisms and hardware constraints compete and since it is built from noisy characterization data. Finally, there are ∼k^{|F|} ∼ k^{3N} possible configurations, where k is the number of options per frequency, as constrained by hardware and control specifications and inhomogeneities. In total, the problem is highly constrained, non-convex, and expands exponentially with processor size. We developed the Snake optimizer [37] to scalably overcome the complexity of control optimization problems like frequency optimization.
Snake implements a graph-based algorithm that maps the variable frequency configuration F onto a graph and then launches an optimization thread from some seed frequency (Fig. 1e-f). It then finds all unoptimized frequencies F_S within a neighborhood whose size is bounded by the scope parameter S and constructs the Snake estimator E_S. E_S contains all terms in E that depend only on F_S, which serve as optimization variables, and on previously optimized frequencies F* that are algorithmically relevant (F* ∩ A), which serve as fixed constraints. Snake then solves F*_S = argmin_{F_S} E_S, updates F*, traverses, and repeats until all frequencies have been optimized. Frequency configurations are typically optimized from multiple seeds in parallel, and the one that minimizes the algorithm error estimator is benchmarked.
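The traversal loop can be sketched as follows. This is a toy reconstruction under stated assumptions (a brute-force inner-loop optimizer, a collision-style cost standing in for stray-coupling terms, and a 2×2 qubit graph), not the Snake implementation:

```python
import itertools

def snake_optimize(nodes, neighbors, options, local_cost, scope=2):
    """Greedy scoped traversal: optimize a bounded block of unoptimized
    variables against already-fixed neighbors, then move on."""
    fixed = {}
    remaining = list(nodes)
    while remaining:
        seed = remaining[0]
        # Neighborhood of up to `scope` unoptimized nodes around the seed.
        block = [seed] + [n for n in neighbors[seed] if n in remaining][:scope - 1]
        # Brute-force the small subproblem (k^|block| candidate assignments).
        best = min(
            itertools.product(options, repeat=len(block)),
            key=lambda vals: local_cost(dict(zip(block, vals)), fixed),
        )
        fixed.update(zip(block, best))
        remaining = [n for n in remaining if n not in fixed]
    return fixed

# 2x2 qubit lattice; the cost penalizes frequency collisions between coupled
# neighbors, a stand-in for stray-coupling error components.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

def collision_cost(assign, fixed):
    env = {**fixed, **assign}
    return sum(
        1.0
        for n, f in assign.items()
        for nb in neighbors[n]
        if nb in env and env[nb] == f
    )

config = snake_optimize([0, 1, 2, 3], neighbors, [6.0, 6.1], collision_cost)
```

With two frequency options the collision-free solution is a checkerboard, which this greedy scoped sweep recovers despite never seeing the full 4-variable problem at once.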
Snake's favorable scaling properties derive from the scope S, which tunes the greediness of its optimization between the local and global limits [37]. By tuning the scope within 1 ≤ S ≤ S_max, we can bound the number of frequencies optimized at each traversal step to 1 ≤ |F_S| ≤ |F|, where |F_S| ∼ S^2 and S_max ∼ √(3N) in two dimensions (Fig. 1e). In turn, we can split one complex ∼3N-dimensional problem over ∼k^{3N} configurations into ∼3N/S^2 simpler ∼S^2-dimensional problems over ∼k^{S^2} configurations each. Such splitting terminates at S = 1, where Snake optimizes ∼3N one-dimensional problems over ∼k configurations each. Importantly, the intermediate-dimensional problems with S < S_max are exponentially smaller than the global problem and independent of processor size.
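The splitting arithmetic is easy to make concrete; k = 10 options per frequency is an illustrative number, not a measured one:

```python
# One global problem over k^(3N) configurations vs. ~3N/S^2 scoped problems
# over k^(S^2) configurations each, for an N = 68 processor.
N, k = 68, 10
F_dim = 3 * N  # ~ idle + interaction frequency count (204 here; 177 on hardware)
global_configs = k ** F_dim  # ~10^204: utterly intractable

def scoped_problems(S):
    """(number of subproblems, configurations per subproblem) at scope S."""
    return F_dim // S**2, k ** (S**2)
```

At S = 1 this gives 204 subproblems of 10 configurations each; at S = 2, 51 subproblems of 10^4 each. Either is trivially searchable, and the per-subproblem size never grows with N.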
Snake is not expected to discover globally optimal configurations. However, if it can find sufficiently performant configurations for the target quantum algorithm, for example with errors below the fault-tolerance threshold [3], it will solve the scaling complexity problem. Namely, we will not be faced with one exponentially expanding problem as our processors scale, but with linearly more problems with bounded configuration spaces. Furthermore, since Snake's seed strategy, traversal strategy, inner-loop optimizer, and scope are highly configurable, it should be adaptable to overcome similar scaling complexities in other control problems and hardware (see SI).

VI. VALIDATING PERFORMANCE
To experimentally investigate whether Snake can actually find performant frequency configurations at some intermediate dimension, we optimize our processor at scopes ranging from S = 1 (177 1D local problems) to S = S_max (one 177D global problem) and benchmark CZXEB. We evaluate configurations by comparing their benchmarks against three performance standards (Fig. 2a). First, the baseline standard references benchmarks taken in a random frequency configuration, which establishes the average performance of the hardware and control system without frequency optimization (e_c = 16.7^{+267.1}_{−10.6} × 10^-3 at N = 68). Second, the outlier standard references a constant cycle error, above which gates are considered performance outliers (e_c = 15.0 × 10^-3 for all N). Third, the crossover standard references published benchmarks from the same processor that reached the surface code's crossover regime, which approaches the error correction threshold (e_c = 6.2^{+7.6}_{−2.5} × 10^-3 at N = 49) [3,6]. This standard establishes what we consider high performance, while recognizing that much higher performance will be necessary to implement error correction in practice.
First, the fact that we see performance variations between configurations illustrates that poor frequency choices cannot be compensated for by other components of our control system (see SI) and that optimization is critical. Second, the fact that local optimization underperforms illustrates that frequency optimization is a nonlocal problem and that tradeoffs between gates must be considered. Third, the fact that global optimization underperforms even after an hour of searching illustrates the difficulty of navigating the configuration space, even on our relatively small processor. Finally, the fact that relatively low intermediate dimensions found the most performant configurations is consistent with relatively local engineered and parasitic interactions and suggests that Snake can navigate our architecture's configuration space in a way that should scale to larger processors.

VII. STABILIZING PERFORMANCE
Stabilizing performant configurations is as difficult and important as finding them. Namely, a processor's optimization landscape constantly evolves, and performance outliers emerge on timescales ranging from seconds to months, with the most catastrophic due to TLS defects fluctuating into the path of qubit frequency trajectories [33]. Unfortunately, even a low percentage of outliers can significantly degrade the performance of a quantum algorithm [6]. However, re-optimizing all gates of a processor when a low percentage of outliers is detected is unscalable from a runtime perspective and introduces the risk of degrading performant gates.
By design, Snake healing can surgically re-optimize outliers, nominally much faster than full re-optimization and without degrading performant gates [37]. To investigate the viability of healing, we heal all configurations generated by the variable-scope experiment described above, targeting gates as described in the SI (Fig. 2d-e). From the perspective of stability, the progressively worse configurations emulate the performance of our processor over progressively longer timescales following optimization. Healing suppresses outliers by ∼48% averaged over configurations, typically runs >10× faster than full re-optimization, and rarely degrades performant gates. Furthermore, heals can be applied repetitively and parallelized for sufficiently sparse outliers. These results demonstrate the viability of healing for scalably suppressing outliers to stabilize performance.

VIII. IMPACT OF METROLOGY
We now consider the impact of the algorithm error estimator's composition on Snake's performance. In particular, the dephasing, relaxation, stray coupling, and pulse distortion error components may be interpreted as distinct error mitigation strategies that can be activated independently. To isolate their impact and to understand their interplay, we progressively activate them in all combinations, optimize, and benchmark CZXEB (Fig. 3a).
To build intuition for the impact of each error mitigation strategy, we inspect frequency configurations optimized with only one mitigation strategy activated (Fig. 3b). Most are visually structured, with inhomogeneities arising from fabrication imperfections in the processor's parameters. Dephasing mitigation biases qubits towards their maximum frequencies, where flux sensitivity vanishes [9]. Relaxation mitigation biases qubits away from relaxation hotspots driven by coupling to the control [9] and readout circuitry [29,52], packaging environment [31], and random TLS defects [30]. Stray-coupling mitigation disperses qubits to avoid frequency collisions between parasitically coupled gates. Finally, pulse-distortion mitigation biases idles towards a multi-layered checkerboard, with neighbors at one of two symmetric |11⟩↔|02⟩ CZ resonances [10], and interactions towards resonance between the idles, to minimize frequency excursions. The inversion of frequencies at the eastern edges of the processor was triggered by fabrication imperfections that broke the symmetry between CZ resonances. This observation highlights the non-trivial interplay between error mitigation and hardware inhomogeneities.
Interestingly, while some of these mitigation strategies alone may find performant configurations at the scale of several qubits, none of them substantially outperforms the random baseline configuration at the scale of our processor. As we progressively activate mitigation strategies, competition between error mechanisms causes frequency configurations to lose visual structure, while performance approaches the crossover standard. Analyzing error contributions in optimized configurations, we confirm that activating mitigation strategies selectively and effectively suppresses their corresponding error components while only weakly impacting others (see SI). These results support our interpretation of error components as error mitigation strategies and show that our optimizer can effectively reconcile their competition and suppress them. More generally, they highlight the importance of error metrology for the performance of our optimization strategy.

IX. PERFORMANCE SCALABILITY
We are finally ready to investigate Snake's scalability. To do so, we conduct a scaling experiment that may be valuable for evaluating the prospects of any quantum hardware and control system. Namely, we optimize, heal, and benchmark hundreds of configurations of our processor ranging in size from N = 2 to 68 (Fig. 4a). As before, we reference the crossover standard. However, we now reference multiple baseline standards that correspond to unoptimized random configurations of variable size. Despite the irregular shapes of some configurations, we find surprisingly clear scaling trends.
CZXEB benchmarks grow and then saturate in both optimized and unoptimized configurations. Furthermore, mean cycle errors are well represented by the model ⟨e_c(N)⟩ = e_sat − e_scale exp(−N/N_sat), where N_sat is the qubit saturation constant, e_scale is the error penalty in scaling gates from small to large systems, and e_sat is the saturated error. Fitting this model to the empirical benchmarks, we find that optimized configurations saturate near the crossover standard, with best-fit parameters N_sat = 22 ± 10 (±1σ), e_scale = (3.1 ± 0.4) × 10^-3, and e_sat = (7.5 ± 0.4) × 10^-3.
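A fit of this saturation model can be sketched with SciPy; the data points below are synthetic, generated from the best-fit parameters quoted above rather than the measured benchmarks:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturation(N, e_sat, e_scale, N_sat):
    # <e_c(N)> = e_sat - e_scale * exp(-N / N_sat)
    return e_sat - e_scale * np.exp(-N / N_sat)

# Synthetic benchmarks from the quoted best-fit values plus small noise.
true_params = (7.5e-3, 3.1e-3, 22.0)
N = np.array([2, 5, 10, 17, 25, 40, 55, 68], dtype=float)
rng = np.random.default_rng(1)
ec = saturation(N, *true_params) + rng.normal(0.0, 5e-5, size=N.size)

popt, _ = curve_fit(saturation, N, ec, p0=(8e-3, 3e-3, 20.0))
e_sat_fit, e_scale_fit, N_sat_fit = popt
```

Because the three parameters are correlated over a limited N range, a reasonable initial guess (`p0`) helps the fit converge, and reported uncertainties should come from the covariance matrix that `curve_fit` also returns.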
To estimate Snake's performance advantage, we make several comparisons. From the empirical benchmarks, we compare the mean cycle errors ⟨e_c^base⟩/⟨e_c^Snake⟩ in isolation (N = 2) and in parallel at scale (N = 68), which are 3.1 ± 0.5 and 6.4 ± 1.0, respectively. Remarkably, the optimized N = 68 configuration outperforms unoptimized N = 2 configurations by 2.3 ± 0.4×. Finally, the optimized N = 68 configuration has a ∼40× narrower benchmark distribution spread. From the saturation model, we compare the scaling penalties e_scale^base/e_scale^Snake and the saturated cycle errors e_sat^base/e_sat^Snake, which are 5.6 ± 1.8 and 3.7 ± 0.7, respectively. These comparisons illustrate that Snake achieves a significant performance advantage up to N = 68.
To investigate Snake's future scalability, we simulate much larger processors than have been manufactured to date. To do so, we developed a generative model that can produce simulated processors of arbitrary size and connectivity, with simulated characterization data that are nearly indistinguishable from our processor's [53] (see SI). We generate simulated processors ranging in size from N = 17 to 1057, with connectivity corresponding to distance-3 to distance-23 surface code logical qubits (d = 3 to 23 with N = 2d^2 − 1) [54]. We optimize simulated processors exactly like our processor and predict CZXEB benchmarks via our estimator. Simulated benchmarks reproduce the saturation trends seen in experiment, building trust in our simulation environment and results (Fig. 4b). Furthermore, they project that Snake's performance advantage should scale to a d = 23 logical qubit with N = 1057.
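The simulated processor sizes follow directly from the quoted relation N = 2d^2 − 1 for a distance-d surface code logical qubit:

```python
def surface_code_qubits(d):
    """Physical qubits (data + measure) for a distance-d surface code
    logical qubit in this architecture: N = 2*d^2 - 1."""
    return 2 * d * d - 1

# Odd code distances from d = 3 to d = 23 span N = 17 to N = 1057.
sizes = {d: surface_code_qubits(d) for d in range(3, 24, 2)}
```

Note that d = 5 gives N = 49, matching the configuration size of the crossover standard referenced earlier.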

X. RUNTIME SCALABILITY
Despite the promising performance outlook, practically scaling to thousands of qubits will require Snake to be geometrically parallelized. Namely, even though optimization runtimes scale nearly linearly with processor size (∼3.6 ± 0.1 seconds added per qubit at S = 2), an N = 1057 thread takes ∼1.4 hours. This exceeds our runtime budget of 0.5 hours, which was chosen for compatibility with operating large surface codes (see SI).
By design, Snake stitching can split a processor into R disjoint regions, optimize them in parallel, and stitch the resulting configurations together [37]. Stitching leads optimization runtimes to scale sub-linearly with processor size, which in principle should enable scalability towards N ∼ 10^4 with R = 128 within our runtime budget. In practice, however, stitching risks amplifying outliers at seams, where Snake must reconcile constraints between independently optimized configurations.
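A back-of-envelope runtime model consistent with these numbers, assuming the measured ∼3.6 s/qubit slope, ideal parallelism over R regions, and a constant overhead (the ∼1230 s offset is inferred here solely to reconcile the quoted ∼1.4 h at N = 1057; seam-reconciliation costs are neglected):

```python
SECONDS_PER_QUBIT = 3.6   # measured slope at S = 2 (from the text)
OVERHEAD_S = 1230.0       # inferred constant offset; illustrative assumption

def runtime_hours(N, R=1):
    """Wall-clock estimate when N qubits are split over R parallel regions."""
    return (OVERHEAD_S + SECONDS_PER_QUBIT * N / R) / 3600.0

# Unstitched d = 23 logical qubit (N = 1057): ~1.4 h, over the 0.5 h budget.
# Stitched N ~ 10^4 with R = 128: back under the budget.
```

Under this model the per-region cost shrinks as N/R, so the runtime is dominated by the fixed overhead once R is large, which is the sub-linear scaling the text describes.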
To investigate the viability of stitching, we stitch and heal our N = 68 processor with R = 2 (Fig. 4c) as well as an N = 1057 (d = 23) simulated processor with R = 4 (Fig. 4d). We chose convenient stitch geometries but believe they will ultimately need to be optimized (see SI). Experimental data are limited, but outliers are not amplified at seams and stitched configurations perform as well as their unstitched counterparts (e_c = 6.4^{+4.4}_{−1.8} × 10^-3 for N = 68 and e_c = 6.3^{+4.3}_{−2.9} × 10^-3 for N = 1057). Finally, we note that stitching the d = 23 logical qubit with R = 4 is equivalent to stitching four d = 11 logical qubits into a 4-logical-qubit processor [54], which illustrates how larger surface codes may be optimized.

XI. OUTLOOK
We introduced a control optimization strategy that combines generic frameworks for building, training, and navigating the optimization landscapes presented by a variety of quantum operations, algorithms, and architectures. It offers a significant performance advantage for quantum gates on our superconducting quantum processor with tens of qubits, approaching the surface code threshold for fault tolerance, and shows promise for scalability towards logical qubits with thousands of physical qubits. A recent demonstration of error suppression in a scaled-up surface code logical qubit [6], enabled by this strategy, underscores its potential.
Looking towards commercially valuable quantum computations, significant challenges remain. The larger and more performant processors necessary to implement them will be susceptible to error mechanisms that are currently irrelevant or yet to be discovered. Furthermore, we expect that stabilizing performance over long computations that may span days [65] will present significant hurdles. Towards that end, Snake's model-based approach can leverage historical characterization data to forecast and optimize around failures before they happen [66,67]. Finally, even though we expect that model-based optimization will remain critical for injecting metrological discoveries into control optimization for the foreseeable future, Snake can also deploy model-free reinforcement learning agents [68][69][70], which may reduce the burden of developing performance estimators (see SI). The techniques presented here should complement the numerous other control, hardware, and algorithm advancements necessary to implement quantum error correction.

XIII. ACKNOWLEDGEMENTS
We thank the broader Google Quantum AI team for fabricating the processor, building and maintaining the cryogenic system, and general hardware and software infrastructure that enabled this experiment. We also thank Austin Fowler, Alexis Morvan, and Xiao Mi for feedback on the manuscript.

Paul V. Klimov

SUPPLEMENTARY INFORMATION

I. CONTROL SYSTEM

Here we overview our control system to offer context for where our optimization system exists within our quantum computing stack (Figure S1). The following sections describe most of these components in more detail. Our control system learns the control-electronics signals needed to execute quantum gates for some target quantum algorithm (A). It can be split into the characterization, optimization, calibration, and benchmarking systems. The characterization system measures data (D) that are believed to be necessary to operate and estimate the performance of quantum gates over qubits' operable frequency ranges, including qubits' flux sensitivities versus frequency (∼T_ϕ^{-1} spectra), energy-relaxation rates versus frequency (T_1^{-1} spectra), parasitic stray coupling parameters (χ parameters), and frequency-pulse distortion parameters (δ parameters). The optimization system then estimates the quantum algorithm's error (E) around that data and optimizes it over gate frequencies (F). The calibration system then learns the control parameters necessary to execute gates at the optimized frequency configuration (F*) (details in [1][2][3]). Finally, the benchmarking system evaluates the performance of the configuration via one or more benchmarking algorithms. If calibration failures were not encountered and if benchmarks exceed some standard of high performance, the quantum algorithm is executed.

II. CHARACTERIZATION SYSTEM
The characterization system interrogates our hardware's computational elements, including qubits, couplers, and readout resonators, and the control electronics (details in [1,2]). The output is the characterization data D, which includes:

• Gate trajectory parameters:
SQ gate length: t_SQ = 25 ns.

• Hardware and control parameters:
T_1 relaxation time versus frequency (i.e. T_1^{-1} spectra), from which we estimate relaxation errors.
df/dϕ flux sensitivity spectra, from which we estimate ∼T_ϕ dephasing errors.
χ parasitic stray coupling parameters, from which we estimate stray coupling errors due to parasitic coupling between nearest- and next-nearest-neighbor qubits.
δ frequency-pulse distortion parameters, from which we estimate errors due to large frequency excursions during CZ gates.
The characterization data are used to construct optimization hard bounds and the algorithm error estimator's error components, as discussed below.

III. THE ALGORITHM ERROR ESTIMATOR
The algorithm error estimator can be represented by a variety of models (e.g. a linear model, a neural network, or a quantum simulation). The only strict requirement for optimization is that the estimator be representative of the performance of the target quantum algorithm. For this study, we append several requirements: physicality, speed, and accuracy. These are competing interests that are difficult to satisfy simultaneously. In the following sections we describe how we construct an estimator that satisfies them.

A. Construction
We define the algorithm error estimator E(F|A, D) by making several simplifying assumptions:

1. A quantum algorithm's error can be decomposed into a sum over the gate error estimators E_g(F_g|A, D) of its constituent gates g ∈ A (i.e. CZXEB with 2 qubits and m cycles comprises 2m SQ and m CZ gate error estimators). F_g is the subset of F relevant to E_g. This assumption is motivated by the digital error model, which states that gate errors can be added when they are small and not correlated in space or time [5], and which was validated in the context of random circuits [1].
2. A gate's error can be decomposed into a sum over algorithm-independent error components ϵ_{g,m}(F_{g,m}|D), each of which corresponds to a distinct physical error mechanism m ∈ M. F_{g,m} is the subset of F_g relevant to estimating ϵ_{g,m}. This assumption should apply in the limit of small and uncorrelated error components. Error components should be defined for all known error mechanisms, including those that go beyond the assumptions of the digital error model (e.g. stray coupling errors, which are correlated in space). Furthermore, they should be defined for all possible algorithmic contexts (e.g. stray coupling components for all possible combinations of parasitically coupled gates executed concurrently).
3. The effect of implementing an arbitrary gate within an arbitrary quantum algorithm can be fully encompassed by algorithm-dependent weights w_{g,m}(A). One unique weight is assigned to, and multiplies, each error component. The weights filter error components by algorithmic relevance (e.g. weights corresponding to stray coupling errors are only non-zero for concurrent gates) and relative contribution (e.g. weights corresponding to dephasing errors are lower in algorithms with dynamical decoupling than without).
4. The weights can be trained on benchmarks that are sufficiently representative of the target quantum algorithm. Empirical training injects algorithmic context and can compensate for some of the simplifying assumptions made above and for inaccuracies in the error components.
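Assumption 1 can be illustrated with a toy CZXEB error budget; the gate error rates below are hypothetical, not measured values:

```python
def czxeb_total_error(e_sq, e_cz, cycles):
    """Digital-error-model estimate: each CZXEB cycle on one qubit pair
    contributes 2 SQ gate errors and 1 CZ gate error, summed under the
    small-and-uncorrelated-errors assumption."""
    return cycles * (2 * e_sq + e_cz)

# Hypothetical gate errors: e_SQ = 1e-3, e_CZ = 5e-3 -> 7e-3 per cycle.
per_cycle = czxeb_total_error(1e-3, 5e-3, cycles=1)
```

This additivity is exactly what breaks down for correlated mechanisms such as stray coupling, which is why those components are defined per algorithmic context rather than per isolated gate.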
These assumptions lead to the decomposition:

E(F|A, D) = Σ_{g∈A} Σ_{m∈M} w_{g,m}(A) ϵ_{g,m}(F_{g,m}|D).

This decomposition establishes a powerful and broadly applicable framework:

• Compatible with physicality, speed, and accuracy requirements.
• Error components can be added as new error mechanisms are discovered.
• Error components can be interchanged to transfer the estimator between hardware architectures (see Section III D).
• Weights can be re-trained to transfer the estimator between quantum algorithms.
• Optimization variables (F here) can be interchanged to adapt the estimator to other control variables, provided that the error components are defined in terms of those variables.
• Equivalent to the basis-expansion method in machine learning [6][7][8], bridging the domains and facilitating knowledge transfer.
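Since the decomposition is a linear basis expansion over error components, it can be sketched in a few lines. This is a minimal illustration with hypothetical gate/mechanism labels and made-up component values, not the production estimator:

```python
def algorithm_error_estimate(components, weights):
    """Estimate total algorithm error as a weighted sum of per-gate,
    per-mechanism error components (a linear basis expansion).

    components: dict mapping (gate, mechanism) -> error component value
    weights:    dict mapping (gate, mechanism) -> trained weight
    """
    return sum(weights[key] * eps for key, eps in components.items())

# Hypothetical 2-qubit, 1-cycle CZXEB-like circuit with made-up
# component values and unit weights.
components = {
    ("SQ_0", "dephasing"): 1e-4,
    ("SQ_1", "dephasing"): 2e-4,
    ("CZ_01", "relaxation"): 5e-4,
    ("CZ_01", "stray_coupling"): 1e-4,
}
weights = {key: 1.0 for key in components}
total = algorithm_error_estimate(components, weights)
```

Swapping in new `(gate, mechanism)` keys corresponds to adding error components or transferring the estimator between architectures, as described above.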
The assumption that errors can be added in this way is expected to break down in certain limits, particularly in highly structured quantum circuits. Fortunately, methods like randomized compiling [9][10][11] offer the potential to recompile those algorithms into compliance. Finally, even when our assumptions cannot be fulfilled, we expect that our algorithm error estimator can still serve as a valuable optimization proxy.

B. Error components
We define error components for the dephasing [12][13][14], relaxation [13][15][16][17], stray coupling [18], and pulse distortion [19,20] error mechanisms. We leverage textbook physics, published literature, and extensive metrology of our quantum and classical control hardware. Furthermore, since runtime is a critical scaling consideration, we engineer error components for speed via software optimizations, physical simplifications, and mathematical approximations, some of which trade against accuracy.
Here we provide key error components ϵ_g,m for gates g and mechanisms m, the relevant frequencies F_g,m and characterization data D, and how they scale with processor size N. We note that each error mechanism is segmented into multiple error components, as described in detail below. When referencing a mechanism, we consider all such components together.

Single qubit gate error components

Two qubit gate error components

(a) m = Dephasing error on CZ_ij due to q_i undergoing some state-dependent frequency trajectory F_i (F_g,m).

(b) m = Relaxation error on CZ_ij due to q_i undergoing some state-dependent frequency trajectory F_i (F_g,m).

(d) m = Frequency-pulse distortion error on CZ_ij due to q_i's excursion from …

In total, an N-qubit processor has ∼10^3 N error components. Our N = 68 processor has ∼4 × 10^4 error components. This number is large due to the high granularity with which we segment error components:

• Each error mechanism is generally segmented into single- and two-qubit error components (e.g. 1a and 2a above for dephasing).

• Two-qubit error components typically integrate errors over frequency trajectories F according to the hardware implementation and the local temporal order of gates within the quantum algorithm. For CZXEB, they may integrate over a trapezoidal trajectory that links idles and interactions.

• Each error component is segmented by gate. For example, we consider 68 (N) components for SQ dephasing (1a above) and 109 (∼2N) components for CZ pulse distortion (2d above).

C. Runtime
The majority of optimization runtime is spent evaluating error components. To estimate how quickly we can evaluate them, we consider evaluation runtimes for our N = 68 processor's 177-gate-frequency-variable algorithm error estimator on a high-performance desktop (Figure S2). We can evaluate the full estimator > 100 times per second, which translates into well under a millisecond, on average, for evaluating all error components of a single gate.

D. Extensions to other hardware
Extending the algorithm error estimator to control optimization problems in other hardware will require defining new error components. While the connection between error components in our transmon qubits [21] and other superconducting qubits [22] is reasonably clear, the connection to entirely different hardware is not. To illustrate the versatility of our framework, we connect our error components to control objectives encountered when choreographing the spatial trajectories of reconfigurable atom arrays [23,24]. The objective of minimizing atom loss and heating by avoiding crossing spatial trajectories is analogous to our stray coupling components, which penalize crossing frequency trajectories. The objective of minimizing atom move distance is analogous to our pulse distortion components, which penalize large frequency excursions. The objective of minimizing atoms' vertical extent is similar to our dephasing components, which squeeze frequencies towards their maxima. Similar connections exist to control problems in other hardware, for example shuttling electrons in quantum dots [25][26][27][28] or shuttling ions in ion traps [29][30][31][32].

IV. TRAINING THE ESTIMATOR
When training the estimator, we must ensure that it generalizes to unseen configurations of our processor. In turn, we take significant care in mitigating the risk of overfitting [6][7][8][33]. In traditional statistics and machine learning applications, overfitting is often combated by reducing the complexity of a model by discarding or constraining correlated features (e.g. via principal component analysis) or by suppressing them during training (e.g. via regularization). These approaches are most compatible with models that don't necessarily require interpretability, such as neural networks. However, they are incompatible with our requirement that the estimator be physical. Namely, we must keep all physically relevant error components, even though some of them may be correlated in some scenarios. Next we formalize the overfitting problem and then develop training-data sampling and model-training protocols to overcome it.

A. Bounding model capacity
Model capacity is an important concept in machine learning that measures the expressivity of a model and is often associated with the number of trainable parameters [33]. It is desirable to select a model for training that has high enough capacity to represent complex patterns, but low enough capacity to reduce the risk of overfitting and the amount of training data necessary.
For the algorithm error estimator, we equate the model capacity with the number of independent trainable weights. Within this definition, the capacity of our N = 68 algorithm error estimator is ∼4 × 10^4, since each error component generally has an independent trainable weight. Furthermore, given local engineered and parasitic coupling, we expect the capacity to scale linearly with processor size.
Given the considerations mentioned above, taking all weights to be independent trainable parameters is neither practical nor scalable. To bound the capacity of our estimator, we make several reasonable assumptions:

• Homogeneity: The parameters of our processor are homogeneous enough that weights for the same error components of the same gate types are equal. For example, for 1(a) above this assumption leads to w_g,m = w_g′,m for g = SQ_i and g′ = SQ_j.
• Symmetry: The weights of some error components must be equal due to symmetry. For example, for 2(a) above this assumption leads to w_g,m = w_g,m′, where m and m′ correspond to the respective trajectories F_i for the input states |0, 1⟩ and |1, 0⟩.
By applying these assumptions, we reduce our algorithm error estimator's capacity from ∼4 × 10^4 to 16. Furthermore, the capacity does not grow with processor size.
We thus believe this is a scalable strategy for bounding the capacity of our algorithm error estimator.
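The homogeneity and symmetry assumptions amount to tying many per-gate, per-mechanism weights to a small set of shared keys. A minimal sketch with illustrative gate/mechanism labels (the tying map here is a simplification of the rules above):

```python
def tied_weight_key(gate, mechanism):
    """Map a per-gate, per-mechanism weight to its shared (tied) key.
    Homogeneity: weights depend on gate *type* (SQ/CZ), not instance.
    Symmetry: state-symmetric trajectory mechanisms share one weight."""
    gate_type = gate.split("_")[0]  # e.g. "SQ_7" -> "SQ"
    mech = mechanism.replace("|01>", "|sym>").replace("|10>", "|sym>")
    return (gate_type, mech)

# Five raw (gate, mechanism) weights collapse to two tied weights:
raw_keys = [
    ("SQ_0", "dephasing"), ("SQ_1", "dephasing"),              # homogeneity
    ("CZ_01", "dephasing |01>"), ("CZ_01", "dephasing |10>"),  # symmetry
    ("CZ_12", "dephasing |01>"),
]
capacity = len({tied_weight_key(g, m) for g, m in raw_keys})
```

Because the tied keys depend only on gate type and mechanism, the number of trainable parameters stays fixed as qubits are added, which is the scaling property claimed above.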

B. Sampling training data
Building a training dataset that is sufficiently large and diverse for the trained estimator to generalize to complex and unseen scenarios is a complex problem in benchmarking and statistics [6,8]. Towards that end, we leverage the flexibility of our quantum processor architecture to accomplish the following:

• Multiple training targets: We train on both SQRB and CZXEB benchmarks.
• Isolate error components: We employ the frequency-tunability of our architecture to benchmark gates in configurations of variable complexity. For example, to isolate the relaxation and dephasing error components, we benchmark gates in sparse configurations with negligible stray coupling.
As another example, to controllably introduce stray coupling, we benchmark gates in denser configurations with only nearest-or next-nearest neighbor gates benchmarked simultaneously.
• Boost statistical leverage: We employ the frequency-tunability of our architecture to vary the relative strength of each error component. For example, we benchmark gates near and far from their respective qubits' maximum frequencies to generate data with high leverage in dephasing. As another example, we benchmark gates near and far from two-level-system (TLS) defects to generate data with high leverage in relaxation.
For this study, we sampled ∼6500 benchmarks over several weeks, averaging > 100 training benchmarks per independent trainable weight. Although time consuming, the trained estimator remains viable over months and is trivial to refine as benchmarks accumulate over time. Furthermore, the runtime cost of acquiring training data in parallel scales inversely with processor size (since the number of gates that can be benchmarked in parallel scales as ∼N), which is especially favorable from a scalability perspective. Finally, we believe that if a sufficiently complete understanding of the processor and control system is available, the training data may be augmented via generative modeling methods similar to those used to generate our simulated processors [34].

C. Training
To train the algorithm error estimator's weights, we developed an iterative supervised learning protocol [35]. It progressively trains and constrains distinct subsets of the estimator's weights over multiple training iterations (Figure S3). Early iterations train the estimator on isolated benchmarks, which isolate the weights corresponding to relaxation and dephasing. Later iterations train the estimator on parallel benchmarks, which introduce the weights corresponding to stray coupling. We note that this protocol was designed specifically to further suppress the capacity of the model trained at any given iteration. We implement training iterations within the standard machine learning framework [6][7][8][36], using ∼60% of the available benchmarks for training and the remaining ∼40% for computing accuracy metrics (below). The error components are the training features and the error benchmarks are the training targets. At each iteration, we train using the Adam optimizer [37] with a mean-absolute-error cost function, which is less susceptible to outliers than the more standard mean-squared-error. Since the trained weights are not generally useful, they will not be presented.
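One training iteration of this kind can be sketched on synthetic data, assuming a linear model over error components and a hand-rolled Adam update with a mean-absolute-error subgradient. The data, dimensions, and hyperparameters below are illustrative stand-ins for the real features and targets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: rows are benchmarks, columns are error components
# (training features); targets are the measured benchmark errors.
X = rng.uniform(0.0, 1e-3, size=(200, 16))      # 16 tied weights
w_true = rng.uniform(0.5, 2.0, size=16)
y = X @ w_true + rng.normal(0, 1e-5, size=200)  # noisy benchmarks

def train_mae_adam(X, y, steps=3000, lr=0.02, b1=0.9, b2=0.999, eps=1e-8):
    """Fit y ~ X @ w with a mean-absolute-error loss via Adam."""
    w = np.ones(X.shape[1])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        resid = X @ w - y
        g = X.T @ np.sign(resid) / len(y)        # MAE subgradient
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_fit = train_mae_adam(X, y)
mae = np.mean(np.abs(X @ w_fit - y))
```

The sign-based subgradient is what makes this objective robust to outlier benchmarks relative to a squared-error fit.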

D. Accuracy
We evaluate the accuracy of our trained estimator on the test data, which were not used during training. Our trained estimator can predict both SQRB and CZXEB benchmarks in arbitrary configurations. To evaluate the accuracy of these predictions, we consider several inaccuracy metrics and quote their medians (Figure S4). Although the medians are a concise statistic, they oversimplify the situation. Namely, inaccuracy metrics typically depend on and increase with the magnitude of the benchmark values. These dependencies suggest that defining a trust region, within which we trust our estimator's predictions, may be more useful.
We define the trust region to be the range of Isolated and Parallel CZXEB test data over which the estimator's inaccuracy is simultaneously ≲ 1/2× the measured error and ≲ 2× the experimental uncertainty (Figure S5). We interpret these as signal-to-noise metrics. Finally, we note that CZXEB benchmarks are the hardest to predict, since they require accurate SQ and CZ gate error predictions.
• Trust region: ∼3 to ∼40 × 10^−3.

Therefore, we can trust our estimator within a wide range that spans one order of magnitude.
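The trust-region computation from the two signal-to-noise criteria above can be sketched on made-up benchmark data; taking the min/max of the passing points, as done here, is a simplification of defining a contiguous range:

```python
import numpy as np

def trust_region(measured, predicted, uncertainty):
    """Return the (min, max) measured benchmark error over which
    |predicted - measured| <= 0.5 * measured          (signal)  and
    |predicted - measured| <= 2.0 * uncertainty       (noise)
    hold simultaneously. Inputs are matched 1D arrays."""
    inacc = np.abs(predicted - measured)
    ok = (inacc <= 0.5 * measured) & (inacc <= 2.0 * uncertainty)
    if not ok.any():
        return None
    return measured[ok].min(), measured[ok].max()

# Illustrative (made-up) benchmark values:
measured = np.array([2e-3, 5e-3, 1e-2, 4e-2, 8e-2])
predicted = np.array([3.5e-3, 5.5e-3, 1.1e-2, 4.5e-2, 4e-2])
uncertainty = np.array([1e-3, 1e-3, 1e-3, 3e-3, 3e-3])
lo, hi = trust_region(measured, predicted, uncertainty)
```

In this toy data the first point fails the signal criterion and the last fails the noise criterion, so the trusted range is bounded by the middle points.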

E. Inaccuracy sources
We expect our estimator's predictions to deviate from measurements for reasons including:

• Simplifications and/or approximations made to the error components to suppress runtimes.
• Less training data towards higher errors.
• Control and hardware parameter inhomogeneities.

For further insights, we first inspect the predicted versus measured data in Figure S4. The largest inaccuracies are seen towards high parallel CZXEB errors (∼5% of the training and test data). Since similar inaccuracies are not as prevalent in other benchmarks, we believe the primary culprits are:

• Inaccuracies in the CZ stray coupling error components, which are known to be complex.
• The impact of leakage, which can be driven via stray coupling, on CZXEB is not well understood.
Second, we inspect the isolated CZXEB data in Figure S6. Here the interaction frequencies are swept while the idles are fixed. These data should be interpreted as line cuts of the processor's much higher dimensional error landscape, where all idles and interactions are variables. Isolated CZXEB mostly isolates dephasing and relaxation errors, the latter of which dominate in our system and exhibit the most complex patterns. These data suggest that our estimator can reproduce complex error patterns, some of which span an order of magnitude and exceed the trust region defined above. However, we also observe clear inaccuracies, the most significant of which we believe are due to:

• TLS fluctuations [38] between characterization and benchmarking.

V. OPTIMIZATION SYSTEM
Our optimization system is based on the Snake optimizer, which we proposed [39,40] as a platform for deploying custom optimization strategies within the demands of an industrial control system. Snake leverages concepts in dynamic programming and graph optimization to offer several key functionalities, which to the best of our knowledge are not offered by any other optimizer:

• Flexibility: Can implement a wide array of optimization strategies via judicious selection of several parameters (see Section V C). Most notably, it can deploy virtually any inner-loop optimizer at any dimension between the local and global optimization limits. This customizability facilitates adapting Snake to a variety of optimization landscapes.
• Stability: Can locally re-optimize performance outliers ("healing") as they emerge over time to stabilize processors over long timescales (i.e.months).
Healing is much faster than optimization (see Section V E) and scales sub-linearly with the number of outliers parallelized.
In our past proposal, we established Snake's theoretical foundations and software abstractions. In this section, we map our optimization problem into Snake. We then provide broadly applicable implementations of its parameters and then explore tens of optimization strategies between the local and global optimization limits over thousands of simulated optimization threads. Finally, we promote the most promising strategies to the experiments in the main text.

A. Optimization variables
The qubit frequency trajectories used to implement gates can be parameterized by many optimization variables. We allocate one variable per gate:

• Interaction frequency (f_ij): q_i and q_j's average uncoupled |0⟩ ↔ |1⟩ transition frequency, where they execute CZ_ij [2].
We choose these variables because we have empirically verified that they have a significant impact on gate performance. Although we could add more trajectory variables, for example gate lengths, each such variable exponentially expands the configuration space and presents a scalability barrier. Instead, we either fix or locally optimize those variables within our calibration system [1].

FIG. S6. Isolated CZXEB cycle error benchmarks (e_c) versus interaction frequency (GHz), with idles fixed. These data can be interpreted as a line cut of the processor's much higher dimensional error landscape, where all idles and interactions are variables. The inset shows the common scale, with the data (blue) and algorithm error estimator predictions (orange) overlaid. Error bars correspond to 68% confidence intervals and are typically smaller than the data points. The error axis (vertical) is logarithmically spaced to highlight small inaccuracies. The grey boxes represent the qubits of our processor and map to the processor graph in Figure S1b. The algorithm error estimator can reproduce complex error patterns, some of which span an order of magnitude. However, there are also clear inaccuracies, some of which are discussed in the text.

B. Optimization bounds
Each frequency variable is subject to hard bounds set by the quantum and classical control hardware.The bounds are computed from the characterization data and embed a detailed understanding of our hardware and control systems.Furthermore, they are contracted as much as reasonably possible via physics intuition to reduce the configuration space and thus runtimes.
Here we list several hard bounds:

• Idle frequency bounds: Maximum detuning from qubits' maximum frequencies, to limit dephasing. Minimum and maximum detuning from the LO, to limit microwave pulse distortion due to the finite DAC bandwidth.

• Interaction frequency bounds: Maximum detuning from qubits' maximum frequencies, to limit dephasing.
For N = 68 configurations, these hard bounds constrain SQ and CZ gates to average operating bandwidths of ∼450 ± 120 (±1σ) MHz and ∼635 ± 112 MHz, respectively. From these operating bandwidths, together with a 2 MHz hardware discretization, we can estimate the average number of idle and interaction frequency options (k in Section V of the main text) and the total number of frequency configurations for the processor:

• Idle frequency options: ∼225 ± 60.

• Frequency configurations: ∼2^1437 total configurations for our processor. This number significantly exceeds our processor's 2^68 Hilbert space dimension, and even the ∼2^265 atoms in the visible universe, which is surpassed near N ∼ 10.
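The configuration count can be reproduced from the quoted bandwidths and the 2 MHz discretization. A quick check, taking the 68 idle and 109 interaction variables from the gate counts quoted in Section III B:

```python
import math

# Average operating bandwidths and hardware discretization from the text.
idle_bandwidth_mhz = 450.0         # SQ (idle) operating bandwidth
interaction_bandwidth_mhz = 635.0  # CZ (interaction) operating bandwidth
step_mhz = 2.0

k_idle = idle_bandwidth_mhz / step_mhz        # ~225 idle options
k_int = interaction_bandwidth_mhz / step_mhz  # ~318 interaction options

n_idle, n_int = 68, 109  # idle and interaction variables (177 total)
log2_configs = n_idle * math.log2(k_idle) + n_int * math.log2(k_int)
```

Evaluating `log2_configs` gives ≈ 1437, i.e. ∼2^1437 configurations, matching the number quoted above.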

C. Optimizer parameters
Snake can implement a wide array of optimization strategies through judicious selection of four parameters: the seed strategy, traversal strategy, scope, and inner-loop optimizer. These must be tuned to the problem of interest while balancing runtime and performance, which often compete. We interpret this tuning as adapting Snake to the optimization landscape presented by the algorithm error estimator, and expect it to be especially critical when applying our strategy beyond our hardware.
Below we overview each parameter's role and then tune them over ∼7,000 simulated optimization threads of our quantum processor. We evaluate their runtime and performance against an untrained but representative algorithm error estimator, quoting relative performance measures (Figure S7). By tuning the parameters against an estimator rather than hardware benchmarks, these results are not susceptible to drift or to the estimator's inaccuracy in predicting benchmarks. Furthermore, an experiment of this scale would take ∼1.5 years in hardware.
Once tuned, the parameters are remarkably robust.We have seen one set of tuned parameters remain reasonably effective for a variety of quantum algorithms and multiple generations of frequency tunable qubits, some of which underwent significant architectural modifications.

Seed strategy
Determines how Snake prioritizes the seed gates from which optimization threads are launched.
• Tuned value: For experimental configurations up to N = 68, we generated solutions from all seeds and selected the one that minimized the trained algorithm error estimator.For larger simulated processors, we selected a subset of all seeds randomly.
We note that when scaling beyond hundreds of qubits, generating solutions from all seeds will become prohibitively time consuming. Identifying a strategy for selecting seed gates that offer a performance advantage is thus expected to become important. One systematic approach for developing such a strategy is to correlate optimized estimator values against statistics computed from the corresponding seeds' characterization data, which are available prior to optimization.

Traversal strategy
Determines how Snake drives graph traversal within each optimization thread, which we characterize via a traversal rule and a traversal heuristic:

• Traversal heuristic: Sorts the candidate gates returned by the traversal rule to implement some desired traversal pattern. Here we tested textbook breadth-first (BFS), depth-first (DFS), and random (RND) heuristics [42].
• Tuned value: We use a scope-dependent traversal strategy (gray bars in Figure S7).However, the ARB traversal rule with the BFS traversal heuristic often outperformed other strategies.
We note that many algorithms exist for approaching graph-based problems [7,42] and can be adapted to Snake. For example, heuristics such as the fail-first heuristic (i.e. prioritize traversals between highly constrained gates) and/or techniques such as backtracking [7] and/or Monte Carlo Tree Search (MCTS) [7,43] may offer advantages beyond the textbook strategies tested. Multiple optimization traversals, which we consider an extension of healing, may also boost performance.
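The BFS/DFS/RND heuristics reorder the traversal frontier in a standard way; a toy sketch on a hypothetical 4-gate chain (the traversal rule that generates candidate gates is abstracted into a plain adjacency map here):

```python
import random
from collections import deque

def traverse(adjacency, seed, heuristic="BFS", rng=None):
    """Visit gates of a processor graph from a seed, ordering the
    frontier with a BFS, DFS, or RND heuristic. `adjacency` maps
    each gate to its neighbors."""
    rng = rng or random.Random(0)
    frontier, visited, order = deque([seed]), {seed}, []
    while frontier:
        if heuristic == "BFS":
            gate = frontier.popleft()   # oldest candidate first
        elif heuristic == "DFS":
            gate = frontier.pop()       # newest candidate first
        else:                           # RND: random candidate
            frontier.rotate(-rng.randrange(len(frontier)))
            gate = frontier.popleft()
        order.append(gate)
        for nbr in adjacency[gate]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return order

# Toy 4-gate chain: 0 - 1 - 2 - 3, seeded mid-chain.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
bfs_order = traverse(chain, seed=1, heuristic="BFS")
dfs_order = traverse(chain, seed=1, heuristic="DFS")
```

Seeded at gate 1, BFS expands outward in rings while DFS runs one arm of the chain to its end before returning, which is the qualitative difference the tuning in Figure S7 probes.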

Scope
Bounds the dimension of the optimization problem solved at each traversal step. We note that the scope S is referred to as the distance d_P in Reference [40]; we do not adopt that name here to avoid collision with the error correction distance d. Here we tested scopes S = 1, 2, 3, 4, 5, 6, S_max, which correspond to maximum optimization dimensions of 1D (local), 5D, 9D, 21D, 29D, 49D, and |F|D (global).
We note that the scope may be dynamically varied at each traversal step depending on the characterization data. For example, larger scopes may outperform lower scopes when optimizing highly constrained gates with anomalous circuit parameters and/or particularly detrimental TLS defects [44]. Furthermore, we expect the optimal scope to depend strongly on the hardware architecture and to scale with the spatial extent of engineered and parasitic interactions. Optimizing hardware with higher connectivity (e.g. with three-qubit gates) would likely benefit from a larger scope than hardware with lower connectivity (e.g. with two-qubit gates). Similarly, optimizing hardware with longer-range stray coupling would benefit from a larger scope than hardware with shorter-range stray coupling.
Inner-loop optimizer

• Tuned value: We use a dimension-dependent inner-loop optimizer. We exhaustively search < 3D problems for their globally optimal values. We stochastically search ≥ 3D problems via a tuned global optimizer.
We note that Snake can deploy nearly-arbitrary continuous or discrete inner-loop optimizers, treating the optimization variables accordingly.Of particular interest are model-free reinforcement-learning agents [7,43,51,52].
In the short term, model-free agents could refine configurations found via model-based optimization to compensate for inaccuracies in the algorithm error estimator, which are expected to increase with processor size due to increased control and hardware inhomogeneities. In the longer term, they could replace model-based optimization entirely [53,54] and eliminate the research burden of developing performance estimators.
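The dimension-dependent dispatch can be sketched as follows, with exhaustive search below 3D and naive random sampling standing in for the tuned global optimizer; the cost function and frequency domains are made up:

```python
import itertools
import random

def inner_loop_optimize(cost, domains, rng=None, samples=2000):
    """Minimize `cost` over discrete frequency options.
    < 3 dimensions: exhaustive search (guaranteed global optimum).
    >= 3 dimensions: stochastic search (a stand-in for a tuned
    global optimizer). `domains` lists per-variable option lists."""
    rng = rng or random.Random(0)
    if len(domains) < 3:
        candidates = itertools.product(*domains)
    else:
        candidates = (tuple(rng.choice(d) for d in domains)
                      for _ in range(samples))
    return min(candidates, key=cost)

# Toy cost: distance of each frequency from a "sweet spot" at 6.0 GHz.
cost = lambda f: sum((x - 6.0) ** 2 for x in f)
best_2d = inner_loop_optimize(cost, [[5.8, 6.0, 6.2], [5.9, 6.0, 6.1]])
best_3d = inner_loop_optimize(cost, [[5.8, 6.0, 6.2]] * 3)
```

The exhaustive branch is exact but its cost grows as k^S, which is why the stochastic branch takes over at higher scopes.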

D. Optimization runtime budget
We develop a runtime budget with the objective of operating a distance-23 surface code logical qubit with N = 1057 physical qubits. Since the surface code has a lenient qubit failure tolerance of > 1% [55], and since our outlier emergence probability is ∼0.01 / 24 hours / gate (see Section VIII B), we have ∼24 hours to characterize, optimize, calibrate, and benchmark our processor and finally execute the algorithm before restarting the process.
If we target 12 hours for executing the algorithm, we have 12 hours to go from characterization to benchmarking. Since characterization, calibration, and benchmarking take ∼2 hours and are nominally independent of processor size due to parallelization, we have ∼10 hours left. From this large window, we budget only 0.5 hours for optimization, leaving ∼9.5 hours for unforeseen scaling overhead.
The 0.5 hour runtime budget can be fulfilled via a combination of optimization, healing, and stitching (see Section V E). Furthermore, as the hardware evolves, we expect the outlier emergence probability to decrease and the control system to become faster through hardware and software advancements, which should relax all budgets and enable operating even larger surface codes.

E. Optimization runtime scalability
To understand how optimization thread runtimes scale with processor size, we optimize simulated processors of variable size at several scopes (Figure S8). The trends are complex, but runtimes generally increase with both scope and processor size. For our default scope S = 2 in particular, runtimes fit well to the heuristic scaling model r = a + bN + cN², where r is the runtime, N is the number of qubits, and a, b, and c are heuristic coefficients.
Fitting this model to our data, we find best-fit parameters a = −8.4 ± 15.5 seconds, b = 3.6 ± 0.1 seconds/qubit, and c = 0.0010 ± 0.0001 seconds/qubit². The linear term dominates up to a few thousand qubits (bN > cN² for N ≲ b/c ≈ 3600), as we would expect from Snake's algorithmic structure. This is the term quoted in the scaling section of the main text.
To estimate healing runtimes r h , we assume that outliers emerge with probability 0.01 / 24 hours / gate (see Section VIII B), that each outlier conservatively targets 2 qubits' idle and interaction frequencies for healing (see Section VIII C), and that healing occurs periodically every T days.In combination, r h = 0.01 × 2 × T × r.
To estimate stitching runtimes r_s, we assume that a processor can be split into R approximately equal-sized regions and that stitching overhead is negligible. We believe the latter assumption is valid, especially since stitching can itself be parallelized. In combination, r_s = r/R.
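Putting the fit coefficients and the healing/stitching formulas together gives a quick back-of-the-envelope check. The healing period T and region count R below are illustrative choices, while N = 1057 is the distance-23 target from the runtime budget above:

```python
# Quadratic scaling fit for the default scope S = 2 (coefficients as
# quoted in the text; uncertainties dropped for this sketch).
a, b, c = -8.4, 3.6, 0.0010  # seconds, s/qubit, s/qubit^2

def optimization_runtime(n_qubits):
    """Full-optimization runtime r = a + b*N + c*N^2, in seconds."""
    return a + b * n_qubits + c * n_qubits ** 2

N = 1057                     # distance-23 surface code physical qubits
r = optimization_runtime(N)  # ~4900 s, i.e. ~1.4 hours

# Healing: outliers at 0.01 / 24 h / gate, 2 qubits per outlier,
# healing every T days:  r_h = 0.01 * 2 * T * r.
T = 1.0  # illustrative healing period, days
r_h = 0.01 * 2 * T * r

# Stitching: processor split into R approximately equal regions.
R = 4    # illustrative region count
r_s = r / R
```

Since r alone exceeds the 0.5 hour budget at N = 1057, while r_h and r_s are far smaller, this arithmetic illustrates why the budget is fulfilled via a combination of optimization, healing, and stitching rather than repeated full optimization.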

VI. BENCHMARKING SYSTEM
The benchmarking system measures the performance of frequency configurations via benchmarking algorithms that should be representative of the target quantum algorithm but much cheaper to execute (Figure S9). Here we use:

• Two-qubit cross-entropy benchmarking (CZXEB): Measures the error e_c,ij of a cycle comprising SQ_i, SQ_j, and CZ_ij [1,5]. Four CZXEB benchmarking algorithms with distinct CZ "Layer" patterns are necessary to benchmark all CZ gates (Figure S9c). They are run back-to-back during benchmarking. When referencing "CZXEB", we reference all of these algorithms and their respective benchmarks in combination.
• Single-qubit randomized benchmarking (SQRB): Measures the error per gate e_i for SQ_i (Figure S9) [56]. These benchmarks were not presented in the main text, since CZXEB is considered a more holistic metric. SQRB benchmarks corresponding to the experiments in the main text are in Figures S10, S11, and S12. SQRB can be combined with CZXEB to infer CZ error contributions to the cycle errors (e_ij for CZ_ij).

FIG. S9(c). SQ and CZ gate layers used for the "Parallel" benchmarks in (a). Measuring CZXEB for all gates requires running four distinct algorithms, each of which is characterized by interleaving a distinct CZ layer with the SQ layer. When referencing CZXEB, we implicitly reference all four algorithms and corresponding benchmarks.
In some cases, we label benchmarks as follows:

• Isolated: Benchmarks taken in sparse configurations where stray coupling is negligible.

• Parallel: Benchmarks taken in configurations where stray coupling is non-negligible.

Unless otherwise noted, all benchmarks are Parallel, including those presented in the main text.
All SQRB and CZXEB benchmarks are reported as average errors. Conversion factors between average errors, depolarizing errors, and Pauli errors are defined in Table I of Section V of the Supplementary Information of Reference [1]. Statistics for the SQRB and CZXEB distributions presented in the main text and this Supplementary Information are in Tables II, III, IV, and V.

FIG. S12. Optimization scalability. SQRB benchmark distributions corresponding to the configurations presented in Figure 4. An outlier standard is not defined for SQRB. SQRB trends are consistent with CZXEB. Some boxes have been horizontally shifted to reduce overlap.
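For orientation, the average-to-Pauli conversion for a depolarizing channel is the textbook factor (d + 1)/d with d = 2^n; a sketch, which may differ in convention from Table I of Reference [1]:

```python
def avg_to_pauli_error(e_avg, n_qubits):
    """Convert average error to Pauli error for an n-qubit gate under a
    depolarizing-channel model: e_Pauli = e_avg * (d + 1) / d, d = 2**n.
    Standard textbook relation; conventions may differ elsewhere."""
    d = 2 ** n_qubits
    return e_avg * (d + 1) / d

sq_pauli = avg_to_pauli_error(1e-3, 1)  # factor 3/2 for a SQ gate
cz_pauli = avg_to_pauli_error(5e-3, 2)  # factor 5/4 for a CZ gate
```

The factor follows from writing both error measures in terms of the depolarizing parameter: the average error is (1 − p)(d − 1)/d while the total Pauli error is (1 − p)(d² − 1)/d².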

VII. SIMULATION ENVIRONMENT
Due to the complex interplay between hardware inhomogeneities, error components, hard bounds, and our control optimization strategy, trustworthy scaling simulations require us to emulate our quantum computing stack from the hardware to the control system. We emulate them as follows:

• Hardware: We developed [34] a generative model of our quantum processor architecture (more below). This approach enables us to embed the statistics of empirically measured characterization data into our simulations.
• Characterization: We sample [7] the generative model to generate simulated processor characterization data. Each random sample is distinct but should nominally follow the statistics of empirically measured characterization data.
• Optimization: We use the Snake optimizer on simulated characterization data exactly as we would on real characterization data.
• Calibration and Benchmarking: We use our trained algorithm error estimator on optimized configurations with the simulated characterization data to estimate benchmarks. We assume that the impact of calibration is implicitly embedded into the trained weights of the estimator.
Since the generative model is an important component of the simulation environment and has not yet been discussed in detail, we provide an overview below and direct the reader to Reference [34] for additional details.
FIG. S13. Relaxation rate spectra and flux-sensitivity spectra for real and simulated processors.

A. Simulated processor generative model
Our goal is to develop a statistical model that can be sampled for characterization data for simulated processors of arbitrary size and connectivity, such that the samples are statistically indistinguishable from our real quantum processor. Towards that end, we interpret an arbitrary quantum processor as a statistical sample from some quantum processor joint probability density P(D, P, N). Here D is the characterization data, P is a set of architectural parameters (e.g. qubit circuit parameters, coupler circuit parameters, TLS parameters), and N is the processor's size and connectivity. Within this statistical picture, D, P, and N are random variables.
Generating a simulated quantum processor and its corresponding characterization data amounts to sampling P. To do so, we consider the chain-rule decomposition P(D, P, N) = P(D|P, N) P(P|N) P(N), which can be represented as a Bayesian network, and employ prior sampling [7] (Figure S13a) as follows:

1. Select the simulated processor size and connectivity N.

2. Sample architectural parameters for all qubits of the simulated processor from P(P|N), under the naive assumption that they are conditionally independent amongst themselves and the qubits. The distributions of architectural parameters were determined by statistically analyzing our real quantum processor's characterization data.

3. Sample characterization data by propagating the architectural parameters through the conditional dependencies P(D|P, N), determined through textbook physics, published literature, and metrology.
If the generative model is accurate, the simulated processor's characterization data should be statistically indistinguishable from our real quantum processor.
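The prior-sampling chain above can be sketched with made-up distributions and a stand-in parameter-to-data map; the calibrated model of Reference [34] is far richer:

```python
import numpy as np

def sample_processor(n_qubits, seed=0):
    """Prior-sample the chain rule P(D, P, N) = P(D|P, N) P(P|N) P(N):
    fix N, sample architectural parameters P, then propagate them into
    characterization data D. All distributions and the parameter-to-data
    map below are illustrative stand-ins, not the calibrated model."""
    rng = np.random.default_rng(seed)
    # P(P|N): architectural parameters, naively conditionally independent.
    f_max = rng.normal(6.5, 0.1, n_qubits)      # max frequencies (GHz)
    anharm = rng.normal(-0.21, 0.01, n_qubits)  # anharmonicities (GHz)
    # P(D|P, N): toy propagation of parameters into characterization data.
    t1_us = rng.gamma(shape=20.0, scale=1.0, size=n_qubits) * (6.5 / f_max)
    return {"f_max": f_max, "anharmonicity": anharm, "t1_us": t1_us}

processor = sample_processor(68)
```

Each call with a new seed yields a distinct simulated processor whose parameters nominally follow the same statistics, which is the property the optimization simulations rely on.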

B. Simulated characterization data
We validate the accuracy of our generative model by comparing various statistics between the real and simulated architectural samples P and the real and simulated characterization data D (not shown). Here, we simply compare the relaxation and flux-sensitivity spectra sampled for a simulated processor against those of our real processor (Figure S13). We believe this is a holistic test of accuracy because these spectra have complex characteristics that require an accurate confluence of architectural parameters to reproduce. Apart from experimental noise, which is filtered during optimization, the simulated spectra are nearly indistinguishable from the real spectra. In turn, we believe the generative model is sufficiently accurate for trustworthy simulations.

VIII. ADDITIONAL EXPERIMENTAL DETAILS

A. Experimental controls
Due to the complexity of our quantum computing stack, developing good experimental controls for any subcomponent, including our frequency optimization system, is non-trivial. Our primary controls are the random baseline configurations. These are expected to sample the average performance of the hardware and calibration systems without frequency optimization. However, we note that all baseline configurations are generated within frequency hard bounds, which themselves embed some error mitigation (see below). Therefore, we believe that the performance advantage quoted in the main text for our optimization strategy is an underestimate.

B. Impact of drift
Both slow gradual drift (e.g. due to temperature fluctuations in the lab) and abrupt catastrophic drift (e.g. due to TLS fluctuating into gates) can generate outliers over time. To understand the extent to which drift impacts our experimental results, we compare the runtimes of the components of our control system against the estimated rate at which outliers emerge for N = 68 configurations.
• Total runtime: ∼2.5 hours / experiment.

We thus do not believe that our results are significantly impacted by outliers emerging from drift.

C. Healing experiment
For the healing experiment presented in Figure 2, frequencies were targeted for healing via thresholding and manual discretion.
• Interaction frequencies were typically targeted when their corresponding CZXEB benchmarks exceeded the outlier standard (e_c,ij ≥ 1.5 × 10⁻²). Re-optimizing interactions is relatively inexpensive and does not necessarily require the corresponding idles to be re-optimized.
• Idles were typically targeted when their corresponding SQRB benchmarks were anomalously high (e_i ≳ 1.5 × 10⁻³) and/or when they corresponded to multiple CZXEB benchmarks that exceeded the outlier standard. Re-optimizing idles is relatively expensive and also requires all hinged interactions to be re-optimized.
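The two targeting rules above can be expressed as a short selection routine. The thresholds follow the text, while the data structures and function name are hypothetical.

```python
# Thresholds from the text; all data structures here are illustrative.
CZ_OUTLIER = 1.5e-2  # CZXEB cycle-error outlier standard, e_c,ij
SQ_OUTLIER = 1.5e-3  # anomalously high SQRB idle error, e_i

def select_heal_targets(cz_errors, sq_errors, neighbors):
    """Select gates to re-optimize (heal).

    cz_errors: {(i, j): e_c_ij} with sorted qubit pairs as keys,
    sq_errors: {i: e_i}, neighbors: {i: [j, ...]} processor adjacency.
    """
    # Rule 1: interactions whose CZXEB benchmark exceeds the outlier standard.
    interactions = {pair for pair, e in cz_errors.items() if e >= CZ_OUTLIER}

    # Rule 2: idles with anomalous SQRB and/or multiple outlier CZXEB pairs.
    idles = set()
    for i, e in sq_errors.items():
        bad_cz = [p for p, ec in cz_errors.items() if i in p and ec >= CZ_OUTLIER]
        if e >= SQ_OUTLIER or len(bad_cz) >= 2:
            idles.add(i)

    # Re-optimizing an idle requires re-optimizing all hinged interactions.
    for i in idles:
        interactions.update(tuple(sorted((i, j))) for j in neighbors[i])
    return idles, interactions
```

For example, a qubit whose idle error crosses the SQRB threshold drags all of its hinged interactions into the heal, reflecting the relative expense of idle re-optimization noted above.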
Heals were conducted within ∼1 hour of benchmarking the optimized but unhealed configurations to mitigate the impact of drift. Furthermore, new characterization data were taken for targeted devices under the assumption that the original data were no longer valid. The primary concerns are TLS fluctuations [38, 44], which are represented in the T1 spectra.

D. Metrology experiment
In Figure S14, we show frequencies and cycle errors for all 16 configurations considered in the metrology experiment in Figure 3, each of which was optimized with distinct error mitigation strategies active. In the following sections, we qualitatively and quantitatively analyze these data to understand the impact of each error mitigation strategy and their interplay.

Qualitative analysis
To qualitatively understand the impact of and interplay between error mitigation strategies, we inspect CZXEB cycle error contributions e_m for all error mechanisms m across all 16 configurations, each of which corresponds to a distinct set of active error mitigation strategies s (Figure S15). Here e_m = ⟨e_m,ij⟩_ij, where ⟨.⟩_ij is the median over pairs ij. The median was chosen to expose typical trends while preventing outliers from biasing our analysis. Each contribution e_m,ij sums the corresponding weighted error components (e.g. SQ_i dephasing, SQ_j dephasing, and CZ_ij dephasing for m = dephasing) as computed from the corresponding configuration's optimized frequencies and characterization data.
All panels in Figure S15 suggest that activating a particular error mitigation strategy suppresses its corresponding error mechanism as intended. However, the panels with only one mitigation strategy active also highlight non-trivial interactions between them. Of particular interest are amplifications of an error mechanism when another mitigation strategy is active, which we interpret as competition. Next we describe noteworthy competition and how it arises from the underlying physics.
Dephasing mitigation squeezes frequencies into a narrow band near their respective maxima where flux-sensitivity vanishes, df/dϕ = 0 [21] (see orange curves in Figure S13b). In turn, it boosts frequency collisions and thus competes with stray coupling. Stray coupling mitigation disperses frequencies to reduce frequency collisions. In turn, it boosts frequency excursions during CZs and thus competes with pulse distortion. Pulse distortion mitigation biases idles into |11⟩ ↔ |02⟩ resonance (a checkerboard with neighbors q_i and q_j at f_i = f_j − |η_j| or f_j = f_i − |η_i|, where η_i and η_j are qubit anharmonicities [2, 57]) and interactions into resonance (at f_ij = (f_i + f_j)/2) to minimize frequency excursions during CZ gates. In turn, it boosts frequency collisions and thus competes with stray coupling. Finally, relaxation mitigation interacts non-trivially with all mechanisms since it avoids relaxation hotspots with complex frequency dependencies [21, 41, 58, 59] and randomness due to TLS [44] (see blue curves in Figure S13b). Despite non-trivial interactions between error mitigation strategies, the last panel of Figure S15b suggests that our optimizer can effectively reconcile their competition and suppress all error mechanisms simultaneously.

Quantitative Analysis
To quantitatively understand the impact of and interplay between error mitigation strategies, we analyze CZXEB cycle error contributions e_m,ij for all error mechanisms m and all pairs ij across all 16 configurations, each of which corresponds to a distinct set of active error mitigation strategies s (Figure S16a). Towards that end, we construct a database of error contributions with 1744 rows (109 pairs × 16 configurations) and 16 columns (4 mechanisms × 4 strategies), randomly sample ∼500 rows, compute a metric of interest, and repeat 100,000 times. This bootstrapping procedure [6, 8] generates a distribution for the metric, from which we report the median and 95% confidence interval.
Our first metric is the Spearman correlation coefficient (C_m,s) between each error mitigation strategy and mechanism (Figure S16c). The Spearman correlation identifies monotonic trends, making it more suitable for analyzing non-linear data with outliers than the more common Pearson correlation. We find that activating a strategy is moderately-to-strongly correlated with a suppression in the corresponding error mechanism (C_m,s as low as −0.67 for diagonals). Our second metric is the relative error contribution ∆_m,s = e_m,s=1/e_m,s=0 − 1 (Figure S16d). Here e_m,s=1(0) = ⟨e_m,ij⟩_ij,s=1(0) is the median over pairs in all configurations where strategy s is active (inactive). We find that activating a particular strategy moderately-to-strongly suppresses its corresponding error mechanism (−0.75 (+0.07/−0.07) ≤ ∆_m,s ≤ −0.33 (+0.03/−0.03) for diagonals), while interacting relatively weakly with others (−0.15 (+0.12/−0.13) ≤ ∆_m,s ≤ 0.26 (+0.11/−0.11) for off-diagonals). Interestingly, relaxation mitigation appears to be less effective than the other mitigation strategies (∆_m,s = −0.33 (+0.03/−0.03) for m = s = relaxation). However, the last panel of Figure S15b actually shows that relaxation approaches its suppression limit even when all mitigation strategies are active, suggesting that relaxation is limited by our processor's performance limits and not by our optimizer.
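The bootstrap and the two metrics can be sketched as below. The toy database, with a single strategy column and error column, stands in for the 1744 × 16 database described above, and the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks.

    No tie handling, which is adequate for this sketch;
    scipy.stats.spearmanr handles ties properly.
    """
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def bootstrap_ci(rows, metric, n_boot=2000, n_sample=500):
    """Resample n_sample rows with replacement, compute the metric,
    repeat, and report the median and 95% confidence interval."""
    stats = [metric(rows[rng.integers(0, len(rows), n_sample)])
             for _ in range(n_boot)]
    return np.percentile(stats, [50, 2.5, 97.5])

# Toy stand-in for the 1744-row database: column 0 flags whether a
# mitigation strategy s is active, column 1 is the error contribution.
s_active = rng.integers(0, 2, 1744)
e = np.where(s_active == 1, 1e-3, 4e-3) * rng.lognormal(0.0, 0.3, 1744)
rows = np.column_stack([s_active.astype(float), e])

# Relative error contribution Delta_{m,s} = e_{m,s=1} / e_{m,s=0} - 1,
# with medians taken over rows where the strategy is active (inactive).
def delta(r):
    on = np.median(r[r[:, 0] == 1.0, 1])
    off = np.median(r[r[:, 0] == 0.0, 1])
    return on / off - 1.0

med, ci_lo, ci_hi = bootstrap_ci(rows, delta)
```

With the toy 4× suppression built into the synthetic data, the bootstrapped ∆ concentrates near −0.75, mirroring the diagonal entries quoted above.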
These results suggest that our mitigation strategies are both selective and effective at suppressing their corresponding error mechanisms. In turn, they support our association of error components with mitigation strategies and show that our optimizer can effectively navigate them.

E. Scaling experiment
When building configurations for the scaling experiment presented in Figure 4, multiple configurations were used for smaller configuration sizes (N < 40) to boost statistics (see the "Configurations" column in Table V). Optimized configurations were generally healed one or more times to resolve calibration failures and to push the performance limits of our processor. However, frequencies were never selected manually. Unoptimized baseline configurations were never healed. As a result, ∼1% of all gates across all baseline configurations failed calibrations.
The best-fit parameters for the saturation model are presented in Table I. The parameters e_sat and e_scale are remarkably similar between experiment and simulation. However, there is a sizeable gap for N_sat. We trust the experimental value more since the simulated data are sparse towards smaller N (i.e. the d = 3, 5, and 7 logical qubits have N = 17, 49, and 97, respectively).
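A fit of this kind can be sketched with scipy. Since the closed form of the saturation model is not reproduced in this section, the saturating-exponential form, the synthetic benchmark values, and the parameter values below are purely illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical saturating-exponential form: e(N) -> e_sat as N grows,
# with N_sat setting the saturation scale and e_scale the depth of the
# small-N suppression. This is NOT necessarily the paper's model.
def saturation_model(N, e_sat, e_scale, N_sat):
    return e_sat - e_scale * np.exp(-N / N_sat)

# Synthetic benchmark means at logical-qubit-like sizes N = 2d^2 - 1.
N = np.array([17.0, 49.0, 97.0, 161.0, 241.0, 337.0, 577.0, 1057.0])
true_params = (6e-3, 3e-3, 40.0)
rng = np.random.default_rng(0)
e_c = saturation_model(N, *true_params) * (1 + 0.02 * rng.standard_normal(N.size))

popt, pcov = curve_fit(saturation_model, N, e_c, p0=(5e-3, 2e-3, 30.0))
# With few points below N_sat, the fitted N_sat carries large uncertainty,
# mirroring the sparsity issue noted in the text; e_sat, constrained by
# the many saturated points, is recovered much more tightly.
```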

F. Stitching experiment
The stitching demonstrations in Figure 4 employed convenient stitch geometries. Even though stitched-and-healed configurations performed as well as their unstitched counterparts, we expect that the number of stitched regions and the seam geometry will ultimately need to be optimized. First, we expect that progressively increasing the number of stitched regions, which would favorably lead to shorter optimization runtimes, will eventually start to degrade performance as more constraints between more independently optimized configurations will have to be reconciled. Second, we expect that optimizing the seam geometry will be necessary for applications like error correction, where the geometry of underperformant gates is particularly important [55, 60].

G. Algorithm specificity
Ideally a single frequency configuration could reach high performance on arbitrary quantum algorithms. However, that configuration would have to mitigate frequency collisions between all possible combinations of parasitically coupled gates. That stringent requirement may be infeasible from a constraint perspective.
To test this possibility, we optimize our processor for arbitrary quantum algorithms by configuring the stray coupling error components to penalize frequency collisions between all possible combinations of parasitically coupled SQ and CZ gates. We compare the performance of the resulting configuration against configurations optimized without stray coupling mitigation and with the default CZXEB-specific stray coupling mitigation, as well as against the baseline, outlier, and crossover standards. The configuration optimized for arbitrary quantum algorithms significantly underperforms both the default configuration optimized with CZXEB-specific stray coupling mitigation and the crossover standard. Furthermore, it only moderately outperforms the configuration without any stray coupling mitigation and the random baseline standard (Figure S17). This result suggests that the optimization problem is overconstrained by the added stray coupling error components. Therefore, at the current magnitude of stray coupling, algorithm-specific optimization is critical.

IX. DEPENDENCIES
Optimization was performed on a Lenovo ThinkStation P620 with 128 GB of RAM and an AMD Ryzen Threadripper PRO 3945WX. Data were analyzed using a combination of numpy [61], scipy [62], pandas [63], and matplotlib [64]. Boxplots were generated using matplotlib.boxplot. The percentiles in Tables II, III, IV, and V were generated using numpy.percentile. The tested inner-loop optimizers leveraged a mixture of homebrew code and scipy's built-in optimizers. The training pipeline was written in tensorflow [36].

Fig. 1. Frequency optimization. (a) Our quantum processor with N = 68 frequency-tunable superconducting transmon qubits represented as a graph. Nodes are qubits (e.g. black dot) and edges are engineered interactions between them (e.g. blue and green lines). (b) A quantum algorithm (A) comprising single- and two-qubit gates with one qubit (q_j) distinguished. (c) Corresponding qubit frequency trajectories (F), parameterized by single-qubit idle (f_j for qubit q_j) and two-qubit interaction (f_ij for q_i and q_j) frequencies. Quantum computational errors depend strongly on frequency trajectories since most physical error mechanisms are frequency dependent (red dots are non-exhaustive examples). Namely, pulse distortion errors (1) happen during large frequency excursions. Relaxation errors (2) happen around relaxation hotspots, for example due to two-level-system defects (TLS, horizontal resonance). Stray coupling errors (3) happen during frequency collisions between coupled computational elements. Dephasing errors (4) happen towards lower frequencies, where qubit flux-sensitivity grows. (d) We leverage our understanding of physical errors to estimate the algorithm's error (E) and then optimize it with respect to qubit frequency trajectories. (e) We employ the Snake optimizer, which can solve optimization problems at an arbitrary dimension, controlled by the scope parameter (S). These graphs show possible idle (nodes) and interaction (edges) frequency optimization variables (blue) at one Snake optimization step, for scopes ranging from S = S_max (global limit) to S = 1 (local limit). (f) Snake optimization threads for three scopes (increasing downwards). Snake's high configurability enables it to scalably overcome frequency optimization complexity and be adapted to a variety of quantum operations, algorithms, and architectures.

Fig. 2. Optimization and healing performance. (a) CZXEB cycle error benchmarks (e_c, boxes, left axis) and calibration failures (orange bars, right axis in (c)) for the random baseline (red), outlier (orange diamond), and crossover (green) performance standards used to evaluate frequency configurations and our optimization strategy. Each box shows the 2.5, 25, 50, 75, and 97.5th percentiles and mean (see annotations on the baseline). The standards' means are extended across panels for comparison. (b) Benchmarks for configurations optimized at different scopes (S) ranging from S = 1 (local limit) to S = S_max (global limit). Intermediate-dimensional optimization (2 ≤ S ≤ 4) outperforms both local and global optimization, finding configurations near the crossover standard. S = 4 performs best but S = 2 offers a better balance between performance and runtime (see SI) and is set as our default. (c) Benchmarks for each configuration in (b) after healing, which significantly suppresses performance outliers. In (a), (b), and (c), each box corresponds to one configuration. (d) Benchmark heatmaps illustrating optimization and (e) healing of targeted gates in the S = 5 configuration. Each hexagon corresponds to the cycle error for one pair (e_c,ij). Performant gates are blue, outliers are red, and unoptimized and targeted gates are grey.

Fig. 3. Optimization performance versus error mitigation strategy. (a) CZXEB cycle error benchmarks (e_c, black) and calibration failures (orange bars, rightmost axis) for configurations optimized with all combinations of dephasing, relaxation, stray coupling, and frequency-pulse distortion error mitigation strategies activated (see lower matrix). The random baseline (red), outlier (orange), and crossover (green) standards are shown for comparison. (b) (upper) Idle frequency (f_i), (center) interaction frequency (f_ij), and (lower) cycle error (e_c,ij) heatmaps for (leftmost) the baseline standard with no mitigation strategies activated, (central) configurations with only one strategy activated, and (rightmost) the default configuration with all strategies activated. As more mitigation strategies are progressively activated (from left to right in (a)), cycle errors and calibration failures trend downwards, highlighting the importance of metrology to the performance of our optimization strategy.

Fig. 4. Optimization scalability. (a) Experimental and (b) simulated CZXEB cycle error benchmarks (e_c) in optimized (black) and unoptimized baseline (red) configurations of variable size. Simulated processors have size and connectivity corresponding to surface code logical qubits with distance d. The crossover standard (green), outlier standard (orange), and stitched configurations (purple) are shown for comparison. The solid lines are fits of the saturation model to the optimized (black) and baseline (red) benchmark means. Some boxes have been horizontally shifted to reduce overlap. In (a), N < 40 boxes combine benchmarks from multiple configurations to boost statistics. In (b), the x axis is linear in d with N = 2d² − 1 and the shaded region illustrates the experimentally accessible regime of our processor. (c) Benchmark heatmaps illustrating stitching of our N = 68 processor and (d) N = 1057 simulated processor. Outliers are not substantially amplified at seams, which is our primary concern. We note that stitching the d = 23 logical qubit with R = 4 is equivalent to stitching four d = 11 logical qubits.
P.K. conceived, prototyped, and led the development of the Snake optimizer, algorithm error estimator, and simulated processor generative modeling frameworks. An.B. and C.Q. contributed to engineering the optimizer. Al.B. and A.D. contributed to engineering the generative model. C.Q., Al.B., A.D., K.J.S., M.Y.N., W.P.L., V.S., T.I.A., and Y.Z. contributed to error metrology research. S.H. led the development of parallel calibration infrastructure with engineering contributions from An.B. and Z.C. Finally, A.M., P.R., A.N.K., J.K., V.S., Y.C., and H.N. supported research and development.
FIG. S3. Iterative training protocol. We train the algorithm error estimator over multiple iterations on benchmarks of increasing complexity. At the start of the protocol (top), the estimator's weights are untrained (grey). The different segments separated by vertical lines correspond to the weights for distinct subsets of error components. At each iteration, distinct subsets of error components and their weights are isolated and trained on training data corresponding to judiciously chosen benchmarks (right). Namely, isolated benchmarks isolate relaxation, dephasing, and pulse distortion error components, while parallel benchmarks introduce stray coupling error components. Once trained, the weights are fixed (left) to their trained values (blue) and the training protocol proceeds onto the next iteration. At the end of the protocol (bottom), all of the estimator's weights have been trained.
FIG. S4. Algorithm error estimator predictions versus measurements. (a) Measured versus predicted SQRB and CZXEB benchmarks for the (upper) training and (lower) test data and for Isolated and Parallel benchmarks. N is the number of samples. The data were taken in frequency configurations of variable complexity. Deviations from the diagonal dashed line, where predictions match measurements, correspond to inaccuracies. (b) Cumulative distribution functions (CDF) of the Inaccuracy and (c) Relative Inaccuracy for the data in (a). The dashed vertical lines in (b) and (c) correspond to the distribution medians.

FIG. S7. Tuning Snake's parameters. The optimized untrained algorithm estimator value (black boxes) and runtime (red boxes) corresponding to N = 68 optimization threads versus Snake's scope, traversal heuristic, and traversal rule parameters. Each box corresponds to multiple seeds and shows the respective distribution's 0, 25, 50, 75, and 100th percentiles (horizontal notches) and mean (diamond). The gray vertical bars are the tuned parameters used in experiment for each scope.

FIG. S8. Optimization, healing, and stitching runtime scalability. (a) Optimization runtime versus simulated processor size N for several scopes S of interest. The default scope S = 2 (black) and the 30-minute optimization budget (orange) are reproduced on all panels. The "+" markers are points of interest for the largest experimental (N = 68) and simulated (N = 1057) configurations investigated. (b) Estimated healing runtimes for several heal periodicities T. The "+" markers correspond to daily healing. We extrapolate runtimes below the range of available data. (c) Estimated stitching runtimes for several numbers of stitched regions R. The "+" markers correspond to the stitching experiment and simulation presented in the main text.
FIG. S9. Benchmarking performance. (a) Benchmark types used to evaluate performance. The ellipsis indicates that gates are benchmarked over more qubits in parallel. Initialization into |0⟩ and measurement in the Z basis are implicit. (b) Errors measured by the benchmark types in (a). (c) SQ and CZ gate layers used for the "Parallel" benchmarks in (a). Measuring CZXEB for all gates requires running four distinct algorithms, each of which is characterized by interleaving a distinct CZ layer with the SQ layer. When referencing CZXEB, we implicitly reference all four algorithms and corresponding benchmarks.

FIG. S10. Optimization and healing performance. SQRB benchmark distributions corresponding to the configurations presented in Figure 2. An outlier standard is not defined for SQRB. SQRB trends are consistent with CZXEB.

FIG. S13. Generating simulated processors. (a) Schematic of the generative model that we use to generate simulated processor characterization data. (b) Comparison of the energy relaxation rate (T1⁻¹, blue) and flux-sensitivity spectra (df/dϕ, orange) for our N = 68 processor and (c) an N = 97 simulated processor. The inset shows the common scale. Our generative model produces simulated processor characterization data that are nearly indistinguishable from our processor's.
FIG. S14. Error metrology frequency configurations and benchmarks. (a) (upper) Idle frequencies (f_i), (center) interaction frequencies (f_ij), and (lower) CZXEB cycle errors (e_c,ij) for configurations optimized with all combinations of error mitigation strategies activated, to supplement the data shown in Figure 3. (b) Continuation of (a), with all annotations shared.

FIG. S15. Error metrology qualitative analysis. (a) CZXEB cycle error contributions e_m for each error mechanism m and each configuration shown in Figure S14. Each bar corresponds to one error mechanism and each panel corresponds to a distinct combination of active error mitigation strategies (grey). The baseline is reproduced on each panel for reference (white). (b) Error contributions from (a) relative to the baseline to highlight interactions between mitigation strategies. Amplification of an error mechanism when another mitigation strategy is active is interpreted as competition. For example, we show in the second panel that dephasing competes with stray coupling. Each error mechanism's suppression limit, which is achieved when only its respective mitigation strategy is active, is reproduced on each panel for reference (white). Ideally, our optimization strategy would reach each error mechanism's suppression limit when all mitigation strategies are active (last panel).
FIG. S16. Error metrology quantitative analysis. (a) CZXEB cycle error contributions e_m,ij for all error mechanisms m and all pairs ij across all 16 configurations, each of which corresponds to a distinct set of active error mitigation strategies s. One row is shown as an example. (b) Spearman correlation C_m,s between the activation of error mitigation strategy s and error contributions from mechanism m. C_m,s = −1 means that activating strategy s perfectly monotonically suppresses error mechanism m. Ideally, all diagonals would be −1 and all off-diagonals would be 0. (c) Relative error contribution ∆_m,s between the activation of error mitigation strategy s and error contributions from mechanism m. ∆_m,s = −1 means that activating strategy s fully suppresses mechanism m. Ideally, all diagonals would be −1 and all off-diagonals would be 0. For (b) and (c) we report medians and their 95% confidence intervals.
2. Two-qubit gate error components, g = CZ_ij. Stray coupling parameters between CZ_ij and CZ_kl: χ_ij,kl.
Algorithm error estimator runtime. Runtime versus the number of evaluations of our algorithm error estimator on a high-performance desktop. The estimator is defined over 177 gate frequency variables and comprises ∼40,000 error components. Despite its large scale, it can still be evaluated >100×/second.

TABLE II. Benchmarks (×10⁻³) corresponding to the distributions shown in Figure 2 of the main text.

TABLE III. Benchmarks (×10⁻³) corresponding to the distributions shown in Figure 3 of the main text.

TABLE IV. Benchmarks (×10⁻³) corresponding to the distributions shown in Figure 4 of the main text.

TABLE V. Benchmarks (×10⁻³) corresponding to the distributions shown in Figure 4 of the main text.

TABLE VI. Definitions for symbols used in the main text and supplemental information.

d: Distance of a surface-code logical qubit
f_i: Idle frequency for q_i
f_ij: Interaction frequency for q_i and q_j
F: Set of f_i and f_ij corresponding to a frequency configuration
F*: Set of optimized gate frequencies corresponding to F
F_g: Subset of F that are relevant to estimating the error of gate g
F_g,m: Subset of F_g that are relevant to estimating the error of gate g for physical error mechanism m
F_S: Subset of F selected for optimization by Snake based on scope S
k: Approximate number of frequency options per gate
e_i: Error per gate for SQ_i, measured by SQRB and reported as an average error
e_ij: Error per gate for CZ_ij, which can be inferred from SQRB and CZXEB benchmarks
e_c,ij: Error per cycle for SQ_i, SQ_j, and CZ_ij, measured by CZXEB and reported as an average error
e_c: Distribution of e_c,ij corresponding to some configuration(s)
N_sat: Qubit saturation constant in the heuristic saturation scaling model
e_sat: Saturated error in the heuristic saturation scaling model
e_scale: Scaling penalty in the heuristic saturation scaling model
ϵ_g,m: Algorithm-independent error component for gate g and physical error mechanism m
w_g,m: Algorithm-dependent weight corresponding to ϵ_g,m
D: Characterization data, used in the context of simulated processors
P: Probability density, used in the context of simulated processors
N: Processor size and connectivity, used in the context of simulated processors
P: Processor architectural parameters, used in the context of simulated processors
argmin_x(f(x)): Minimizer for some function f(x) with respect to its argument(s) x