Demonstrating multi-round subsystem quantum error correction using matching and maximum likelihood decoders

Quantum error correction offers a promising path for performing high-fidelity quantum computations. Although fully fault-tolerant executions of algorithms remain unrealized, recent improvements in control electronics and quantum hardware enable increasingly advanced demonstrations of the operations necessary for error correction. Here, we perform quantum error correction on superconducting qubits connected in a heavy-hexagon lattice. We encode a logical qubit with distance three and perform several rounds of fault-tolerant syndrome measurements that allow for the correction of any single fault in the circuitry. Using real-time feedback, we reset syndrome and flag qubits conditionally after each syndrome extraction cycle. We report decoder-dependent logical error rates, with an average logical error per syndrome measurement round in the Z (X) basis of ~0.040 (~0.088) and ~0.037 (~0.087) for matching and maximum likelihood decoders, respectively, on leakage-post-selected data.


I. INTRODUCTION
The outcomes of quantum computations can be faulty, in practice, due to noise in the hardware. To eliminate the resulting faults, quantum error correction (QEC) codes can be used to encode the quantum information into protected, logical degrees of freedom, and then, by correcting the faults faster than they accumulate, enable fault-tolerant (FT) computations. A complete execution of QEC will likely require: preparation of logical states; realization of a universal set of logical gates, which may require the preparation of magic states; repeated measurements of syndromes; and the decoding of the syndromes for correcting errors. If successful, the resulting logical error rates should be less than the underlying physical error rates, and decrease with increasing code distance down to negligible values.
The choice of a quantum error correcting code will require consideration of the underlying hardware and its noise properties. Specifically, for a heavy-hexagon lattice [1,2] of qubits, subsystem QEC codes [3] are attractive because they are well-suited for qubits with reduced connectivity. Other codes have shown promise due to their relatively high threshold for FT [4] or large number of transversal logical gates [5]. Although their space and time overhead may pose a significant hurdle for scalability, there exist encouraging approaches to reduce the most expensive resources by exploiting some form of error mitigation [6].
In the decoding process, successful correction depends not only on the performance of the quantum hardware, but also on the implementation of the control electronics used for acquiring and processing the classical information obtained from syndrome measurements. In our case, initializing both syndrome and flag qubits via real-time feedback between measurement cycles can help mitigate errors. At the decoding level, whereas some protocols exist to perform quantum error correction asynchronously within a FT formalism [7,8], the rate at which the error syndromes are received should be commensurate with their classical processing time to avoid an increasing backlog of syndrome data. Also, the efficient performance of some particular protocols, like using a magic state for a logical T-gate, requires the application of real-time feed-forward.
Thus, the long-term vision of quantum error correction does not gravitate around a single ultimate goal but should be seen as a continuum of deeply interrelated tasks. The experimental path in the development of this technology will comprise the demonstration of these tasks in isolation first and their progressive combination later, always while continuously improving their associated metrics. Some of this progress is reflected in numerous recent advances on quantum systems across different physical platforms, which have demonstrated or approximated several aspects of the desiderata for FT quantum computing. In particular, FT logical state preparation has been demonstrated in ions [9], nuclear spins in diamond [10], and superconducting qubits [11]. Repeated cycles of syndrome extraction have been shown in superconducting qubits in small error detecting codes [12,13], including partial error correction [14] as well as a universal (albeit not FT) set of single-qubit gates [15]. A FT demonstration of a universal gate set on two logical qubits has recently been reported in ions [16]. In the realm of error correction, there have been recent realizations of the distance-3 surface code on superconducting qubits with decoding [17] and post-selection [18], as well as a FT implementation of a dynamically protected quantum memory using the color code [19], and the FT state preparation, operation, and measurement, including its stabilizers, of a logical state in the Bacon-Shor code in ions [19,20].
In this work we combine the capability of real-time feedback on a superconducting qubit system with a maximum likelihood decoding protocol hitherto unexplored experimentally in order to improve the survivability of logical states. We demonstrate these tools as part of the FT operation of a subsystem code [21], the heavy-hexagon code [1], on a superconducting quantum processor. This code benefits from the use of flag qubits at small cost in terms of circuit depth. By conditionally resetting each flag and syndrome qubit after each syndrome measurement cycle, we protect our d = 3 system against errors arising from the noise asymmetry inherent to energy relaxation. We further exploit some recently described decoding strategies [14] and extend the decoding ideas to include maximum likelihood concepts [4,22,23].

II. THE HEAVY-HEXAGON CODE AND MULTI-ROUND CIRCUITS
The heavy-hexagon code example we consider is an n = 9 qubit code with minimum distance d = 3 [1]. The Z- and X-gauge groups (see Fig. 1a) and the stabilizer group are generated by the operators depicted there (explicit generators are given in [1]). For this work, we focus on a particular kind of FT circuit, but the same approach can be used more generally with different codes and circuits. Two sub-circuits, shown in Fig. 1(b), are constructed to measure the X- and Z-gauge operators. The Z-gauge measurement circuit also acquires useful information by measuring flag qubits.
We prepare code states in the logical |0⟩ (|+⟩) state by first preparing nine qubits in the |0⟩⊗9 (|+⟩⊗9) state and measuring the X-gauge (Z-gauge) operators. We then perform r rounds of syndrome measurement, where a round consists of a Z-gauge measurement followed by an X-gauge measurement (respectively, X-gauge followed by Z-gauge). Finally, we read out all nine code qubits in the Z (X) basis. We perform the same experiments for initial logical states |1⟩ and |−⟩ as well, by simply initializing the nine qubits in |1⟩⊗9 and |−⟩⊗9 instead.

III. DECODING ALGORITHMS
In the setting of FT quantum computing, a decoder is an algorithm that takes as input syndrome measurements from an error correcting code and outputs a correction to the qubits or measurement data.In this section we describe two decoding algorithms: perfect matching decoding and maximum likelihood decoding.

A. The decoding hypergraph
The decoding hypergraph [14] is a concise description of the information gathered by a FT circuit and made available to a decoding algorithm. It consists of a set of vertices, or error-sensitive events, V, and a set of hyperedges E, which encode the correlations between events caused by errors in the circuit. Fig. 2 depicts parts of the decoding hypergraph for our experiment.
Constructing the decoding hypergraph for stabilizer circuits with Pauli noise can be done using standard Gottesman-Knill simulations [24]. First, an error-sensitive event is created for each measurement that is deterministic in the error-free circuit. A deterministic measurement outcome m ∈ {0, 1} is determined by some function of previous measurement outcomes M, or m = F_m(M), where F_m can be found by simulation of the error-free circuit. The value of the associated error-sensitive event is defined to be m − F_m(M) (mod 2), which is zero (also called trivial) in the absence of errors. Thus, observing a non-zero (also called non-trivial) error-sensitive event implies the circuit suffered at least one error. In our circuits, error-sensitive events are either flag qubit measurements or the difference of subsequent measurements of the same stabilizer. Note that measurements of a stabilizer are found by adding together measurements of the constituent gauge operators [1].
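As a minimal illustration (our own sketch, not code from this work), the error-sensitive events for a single stabilizer can be formed by differencing its subsequent measurements mod 2, where each stabilizer measurement is itself the parity of its constituent gauge-operator outcomes:

```python
def stabilizer_values(gauge_outcomes):
    """Combine the gauge-operator outcomes of each round into a
    stabilizer outcome (their parity)."""
    return [sum(g) % 2 for g in gauge_outcomes]

def error_sensitive_events(stab_per_round, initial=0):
    """Difference (mod 2) of subsequent measurements of the same
    stabilizer; every event is 0 (trivial) in an error-free run."""
    events = []
    prev = initial
    for s in stab_per_round:
        events.append((s - prev) % 2)
        prev = s
    return events

# Hypothetical data: gauge pairs (g1, g2) for one Z stabilizer, rounds 0-2.
gauges = [(1, 1), (0, 1), (0, 1)]
stabs = stabilizer_values(gauges)        # [0, 1, 1]
events = error_sensitive_events(stabs)   # [0, 1, 0]: a fault flipped the
                                         # stabilizer between rounds 0 and 1
```

A non-trivial event thus localizes a fault in time without requiring knowledge of the absolute stabilizer eigenvalue.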
Next, hyperedges are added by considering circuit faults. Our model contains a fault probability p_C for each of several circuit components C ∈ {cx, h, id, idm, x, y, z, measure, initialize, reset} (5). Here we distinguish the identity operation id on qubits during a time when other qubits are undergoing unitary gates from the identity operation idm on qubits when others are undergoing measurement and reset. We reset qubits after they are measured, while we initialize qubits that have not been used in the experiment yet (see Supp. D for more detail). Numerical values for p_C are listed in Supp. D.
For initialization and reset errors, a Pauli X is applied with the respective probability after the ideal state preparation. For measurement errors, a Pauli X is applied with probability p_measure before the ideal measurement. A one-qubit unitary gate (two-qubit gate) C suffers, with probability p_C, one of the three (fifteen) non-identity one-qubit (two-qubit) Pauli errors following the ideal gate, with an equal chance of each of the three (fifteen) Pauli errors occurring.
When a single fault occurs in the circuit, it causes some subset of error-sensitive events to be non-trivial. This set of error-sensitive events becomes a hyperedge. The set of all hyperedges is E. Two different faults may lead to the same hyperedge, so each hyperedge may be viewed as representing a set of faults, each of which individually causes the events in the hyperedge to be non-trivial. Associated with each hyperedge is a probability, which, at first order, is the sum of the probabilities of faults in the set.
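The first-order rule can be made exact: a hyperedge fires when an odd number of its constituent faults occur. A small sketch (our own, with made-up fault probabilities):

```python
def hyperedge_probability(fault_probs):
    """Exact probability that an odd number of independent faults occur;
    at first order this reduces to sum(fault_probs)."""
    p_odd = 0.0
    for p in fault_probs:
        # the parity stays odd if the new fault is absent,
        # or flips from even to odd if the new fault occurs
        p_odd = p_odd * (1 - p) + (1 - p_odd) * p
    return p_odd

faults = [1e-3, 2e-3, 5e-4]            # assumed fault probabilities
exact = hyperedge_probability(faults)  # 0.003493004
first_order = sum(faults)              # 0.0035
```

At the fault rates relevant here the two agree to better than a part in a thousand, which is why the first-order sum suffices for building the hypergraph.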
A fault may also lead to an error which, propagated to the end of the circuit, anti-commutes with one or more of the code's logical operators (say the code has k logical qubits and a basis of 2k logical operators). We can keep track of which logical operators anti-commute with the error using a vector from Z_2^{2k}. Thus, each hyperedge h is also labeled by one of these vectors γ_h ∈ Z_2^{2k}, called a logical label. Note that if the code has distance at least three, each hyperedge has a unique logical label [25].

FIG. 1: (a) Code layout, with the flag and syndrome qubits used for Z stabilizers in blue, and the flag and syndrome qubits used in X stabilizers in white. The order and direction in which CX gates are applied within each sub-section (0 to 4) are denoted by the numbered arrows. (b) Circuit diagram of one syndrome measurement round, including both X and Z stabilizers. The circuit diagram illustrates the permitted parallelization of gate operations as set by barriers; as each two-qubit gate duration differs, the final gate scheduling is determined with a standard circuit transpilation pass.
Lastly, we note that a decoding algorithm can choose to simplify the decoding hypergraph in various ways. One that we always employ here is the process of deflagging [26]. This means that instead of including error-sensitive events from the flag qubit measurements directly, we use the flag information to immediately (before any more gates are applied) apply virtual Pauli Z corrections and adjust subsequent error-sensitive events accordingly. Hyperedges for the deflagged hypergraph can be found through stabilizer simulation incorporating the Z corrections.

B. Perfect Matching Decoding
Considering X and Z errors separately, the problem of finding a minimum weight error correction for the surface code can be reduced to finding a minimum weight perfect matching in a graph [4]. Matching decoders continue to be studied because of their practicality [27] and broad applicability [28,29]. In this section, we describe the matching decoder for our distance-3 heavy-hexagon code.
The decoding graphs for minimum weight perfect matching, one for the X errors (Fig. 2a) and one for the Z errors (Fig. 2b), are in fact subgraphs of the decoding hypergraph of the previous section. Let us focus here on the graph for correcting X errors, since the Z-error graph is analogous. In this case, from the decoding hypergraph we keep nodes V_Z corresponding to (the difference of subsequent) Z-stabilizer measurements and edges (i.e., hyperedges with size two) between them. Additionally, a boundary vertex b is created, and size-one hyperedges of the form {v} with v ∈ V_Z are represented by including edges {v, b}. All edges in the X-error graph inherit probabilities and logical labels from their corresponding hyperedges (see Table S1 (S2) for X (Z)-error edge data for the 2-round experiment).
A perfect matching algorithm takes a graph with weighted edges and an even-sized set of highlighted nodes, and returns a set of edges in the graph that connects all highlighted nodes in pairs and has minimum total weight among all such edge sets. In our case, highlighted nodes are the non-trivial error-sensitive events (if there are an odd number, the boundary node is also highlighted), and edge weights are either chosen to all be one (uniform method) or set as w_e = log((1 − p_e)/p_e), where p_e is the edge probability (analytic method). The latter choice means that the total weight of an edge set equals, up to an additive constant, the negative log-likelihood of that set, so minimum weight perfect matching maximizes this likelihood over the edges in the graph.
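For intuition, here is a brute-force sketch of the analytic method on a toy graph (our own illustration with assumed edge probabilities; production decoders use polynomial-time matching algorithms such as blossom rather than this exhaustive search):

```python
import math

def edge_weight(p):
    """Analytic weight w_e = log((1 - p_e)/p_e)."""
    return math.log((1 - p) / p)

def min_weight_perfect_matching(nodes, weight):
    """Exhaustive minimum weight perfect matching; only feasible for
    a handful of highlighted nodes."""
    nodes = list(nodes)
    if not nodes:
        return 0.0, []
    first, rest = nodes[0], nodes[1:]
    best_total, best_pairs = math.inf, None
    for i, partner in enumerate(rest):
        sub_total, sub_pairs = min_weight_perfect_matching(
            rest[:i] + rest[i + 1:], weight)
        total = sub_total + weight(first, partner)
        if total < best_total:
            best_total, best_pairs = total, [(first, partner)] + sub_pairs
    return best_total, best_pairs

# Four highlighted events; assumed edge probabilities (complete graph).
p = {("a", "b"): 0.001, ("a", "c"): 0.02, ("a", "d"): 0.001,
     ("b", "c"): 0.001, ("b", "d"): 0.02, ("c", "d"): 0.001}

def w(u, v):
    return edge_weight(p.get((u, v), p.get((v, u))))

total, pairs = min_weight_perfect_matching(["a", "b", "c", "d"], w)
# The high-probability edges (a, c) and (b, d) form the cheapest pairing.
```

Note how the likelier an edge's error, the lower its weight, so the matching prefers pairings explained by probable faults.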
Given a minimum weight perfect matching, one can use the logical labels of the edges in the matching to decide on a correction to the logical state. Alternatively, the X-error (Z-error) graph for the matching decoder is such that each edge can be associated to a code qubit (or a measurement error), such that including an edge in the matching implies an X (Z) correction should be applied to the corresponding qubit.

C. Maximum Likelihood Decoding
Maximum likelihood decoding (MLD) is an optimal, albeit non-scalable, method for decoding quantum error-correcting codes. In its original conception, MLD was applied to phenomenological noise models where errors occur only just before syndromes are measured [23,30]. This of course ignores the more realistic case where errors can propagate through the syndrome measurement circuitry. More recently, MLD has been extended to include circuit noise [22,31]. Here, we describe how MLD corrects circuit noise using the decoding hypergraph.
MLD deduces the most likely logical correction given an observation of the error-sensitive events. This is done by calculating the probability distribution Pr[βγ], where β ∈ Z_2^{|V|} records the values of the error-sensitive events and γ ∈ Z_2^{2k} is a logical label.
We can calculate Pr[βγ] by including each hyperedge from the decoding hypergraph, starting from the zero-error distribution, i.e., Pr[0^{|V|} 0^{2k}] = 1. If hyperedge h has some probability p_h of occurring, independent of any other hyperedge, we include h by performing the update

Pr'[βγ] = (1 − p_h) Pr[βγ] + p_h Pr[(βγ) ⊕ h̃],    (6)

where h̃ is just a binary vector representation of the hyperedge (its set of error-sensitive events concatenated with its logical label). This update should be applied once for every hyperedge in E.
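A minimal sketch of this update and the subsequent correction step (our own illustration; the bit layout and hyperedges below are made up for a toy model with |V| = 2 events and k = 1 logical qubit):

```python
def include_hyperedge(dist, h_vec, p_h):
    """Update of Eq. (6): Pr'[x] = (1 - p_h) Pr[x] + p_h Pr[x XOR h_vec],
    where the integer x encodes the bitstring beta,gamma."""
    new = {}
    for x, pr in dist.items():
        new[x] = new.get(x, 0.0) + (1 - p_h) * pr
        y = x ^ h_vec
        new[y] = new.get(y, 0.0) + p_h * pr
    return new

def best_correction(dist, beta_obs, n_events):
    """Argmax over logical labels gamma of Pr[beta_obs, gamma]."""
    best_gamma, best_pr = None, -1.0
    mask = (1 << n_events) - 1
    for x, pr in dist.items():
        if x & mask == beta_obs and pr > best_pr:
            best_gamma, best_pr = x >> n_events, pr
    return best_gamma

# Bit layout (LSB first): [event0, event1, logZ, logX]; assumed hyperedges.
dist = {0b0000: 1.0}                          # zero-error distribution
dist = include_hyperedge(dist, 0b0011, 0.01)  # flips both events, no logical
dist = include_hyperedge(dist, 0b0101, 0.02)  # flips event0 and logical Z
# Observing only event0 (beta = 0b01) points at the second hyperedge:
gamma = best_correction(dist, 0b01, n_events=2)   # logical Z label
```

The dictionary grows toward 2^{|V|+2k} entries in the worst case, which is exactly the scaling obstacle discussed below.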
Once Pr[βγ] is calculated, we can use it to deduce the best logical correction. If β* is observed in a run of the experiment, then

γ* = argmax_γ Pr[β*γ]    (7)

indicates how measurements of the logical operators should be corrected. For more details on specific implementations of MLD, refer to Supp. B.

IV. EXPERIMENTAL DISCUSSION
For this demonstration we use ibm_peekskill, a 27-qubit IBM Quantum Falcon processor [32] whose coupling map enables a distance-3 heavy-hexagon code; see Fig. 1. The total time for qubit measurement and subsequent real-time conditional reset is 768 ns per round and is the same for all qubits. All syndrome measurements and resets occur simultaneously for improved performance. As a final step, a simple Xπ-Xπ dynamical decoupling sequence is added to all code qubits during their respective idling periods.
Qubit leakage is a significant reason why the Pauli depolarizing error model assumed by the decoder design might be inaccurate. It is sometimes possible to detect whether a qubit has leaked when it is measured. Thus, we can post-select on runs of the experiment in which leakage has not been detected, similar to [17]. See Supp. F for more information on the post-selection method.
In Fig. 3a, we initialize the logical state |0⟩ (|+⟩) and apply r syndrome measurement rounds, where one round includes both X and Z stabilizers (a total time of approximately 5.3 µs per round, Fig. 1b). Using analytical perfect matching decoding on the full data set (500,000 shots per run), we extract the logical errors in Fig. 3a, red (blue) triangles. Details of the optimized parameters used in analytical perfect matching decoding can be found in Supp. D. Fitting the full decay curves (Eq. S.1) up to 10 rounds, we extract a logical error per round without post-selection (Fig. 3a-inset) of 0.059(2) (0.058(3)) for |0⟩ (|1⟩) and 0.113(5) (0.107(4)) for |+⟩ (|−⟩).
In Fig. 3b, we compare the logical error per round from the post-selected data sets using the three decoders described previously in Section III. To match the current limitation of MLD to r = 4 rounds, all error rates in Fig. 3b were fit using only rounds r = 0 to 4, with error bars here denoting one standard deviation of the fitting parameter. We observe a consistent improvement in decoding moving from matching uniform (pink), to matching analytical (green), to maximum likelihood (grey). A quantitative comparison between the three decoders for all four logical states at r = 2 rounds is provided in Supp. G.

V. CONCLUSIONS AND OUTLOOK
The results presented in this work highlight the importance of the joint progress of quantum hardware, both in size and quality, and classical information processing, both concurrent with circuit execution and asynchronous to it, as described with the studied decoders. Our experiments incorporate mid-circuit measurements and conditional operations as part of a quantum error correction protocol. These technical capabilities will serve as foundational elements for further enhancement of the role of dynamic circuits in quantum error correction, for example towards real-time correction and other feed-forward operations that could be critical for large-scale FT computations. We also show how experimental platforms for quantum error correction of this size and these capabilities can trigger new ideas towards more robust decoders. Our comparison between a perfect matching and a maximum likelihood decoder sets a promising starting point towards understanding the trade-off between decoder scalability and performance in the presence of experimental noise. All these key components will play a crucial role in larger distance codes, where the quality of the real-time operations (qubit conditional reset and leakage removal, teleportation protocols for logical gates, and decoding), along with device noise levels, will determine the performance of the code, potentially enabling the demonstration of logical error suppression with increased code distance.

FIG. 3: (a) Logical error, from matching analytical decoding, vs. number of syndrome measurement rounds r, where one round includes both Z- and X-basis measurements. Error bars denote the sampling error of each run (500,000 shots). Dashed-line fits of the error (including all rounds) yield the error per round plotted in (a-inset). Applying the same decoding method on leakage-post-selected data (rejection rate per round quoted above the post-selected error rates in the inset) shows a substantial reduction in overall error. See Supp. F for details. (b) Comparison of fitted error per round (including up to 4 rounds in the fit) for all four logical states using matching uniform, matching analytical, and maximum likelihood decoders. Error bars here represent one standard deviation on the fitted rate.

SUPPLEMENT A. Minimum weight perfect matching edge probabilities
Here we list the edge probabilities for the decoder graphs used in minimum weight perfect matching. Note that for experiments on logical |0⟩ and |1⟩ we need only correct X errors, and so just use the Z stabilizers, Fig. 2a. For experiments on logical |+⟩ and |−⟩, we need only correct Z errors with the graph in Fig. 2b. Edge weights are given for the logical |0⟩ and |+⟩ 2-round experiments in the respective Tables S1 and S2.

B. Maximum likelihood implementations
There are at least two different ways to implement maximum likelihood decoding (MLD), which we call the offline and online usages of the decoder. They can differ significantly in time complexity depending on the specific application.
In the offline case, one calculates and stores the entire distribution Pr[βγ] and queries it to determine the correction for each run of the circuit. The calculation takes O(|E| 2^{|V|+2k}) time, since we must apply the update of Eq. (6) to the distribution for each hyperedge in E. Determining a correction using Eq. (7) takes O(2^{2k}) time per run.
Alternatively, one can forgo storing the whole distribution, and instead calculate sparse distributions specific to each observation string β* in a data set. Online MLD achieves this by pruning the distribution as updates are performed, keeping only entries consistent with β*. We imagine receiving one bit of β* at a time. For the j-th bit, updates are made using Eq. (6) for all hyperedges that contain bit j and have not already been included. In fact, all updates for a given bit can be combined into a pre-calculated transition matrix. Since no further updates will be made to bit j, we can now truncate the distribution by keeping only entries Pr[βγ] where β_j = β*_j. During the course of online MLD, there is some maximum instantaneous size of the probability distribution, say S_max, and the total time to determine a correction is O(|V| S_max) per run. Note that S_max depends on the decoding hypergraph and also on the order in which error-sensitive events are incorporated. It can be argued that for [[n, k]] codes, repeated rounds of syndrome measurements, and events incorporated chronologically, 2^{n+k} ≤ S_max ≤ 2^{2n}, because hyperedges do not span more than two rounds of error-sensitive events. The online decoder is also amenable to dynamic programming, storing partially calculated probability distributions up to some moderately sized j. For instance, in our analysis of three-round experiments, we store distributions up to j = 15, while for four rounds we keep up to j = 21.
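The truncation step can be sketched as follows (our own minimal version; the transition-matrix optimization is omitted, and the distribution values are made up):

```python
def truncate(dist, j, beta_star_j):
    """Drop entries whose j-th event bit disagrees with the observed
    bit; safe once all hyperedges containing event j are included."""
    return {x: pr for x, pr in dist.items() if (x >> j) & 1 == beta_star_j}

# Toy distribution over two event bits (LSB = event 0):
dist = {0b00: 0.90, 0b01: 0.06, 0b10: 0.03, 0b11: 0.01}
dist = truncate(dist, 0, 1)   # observed event 0 fired
# Remaining entries: {0b01: 0.06, 0b11: 0.01}
```

Repeating this after each event bit is what keeps the instantaneous distribution size at S_max rather than 2^{|V|+2k}.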
Since online MLD takes time per run that is exponential in n (the number of physical qubits in the code), if |V| is small enough the offline MLD is preferable. If |V| is large but n and k are small (perhaps a small-code experiment performing many rounds of syndrome measurements), the online decoder becomes the only feasible option. In the experiments here, online MLD becomes preferable over offline MLD for three rounds and greater.

TABLES S1 and S2: Edge e, associated qubit Q(e), and first-order edge flip probability p_e for the matching graphs; for example, the edge (z_0^1, z_1^1) is associated to qubit 2 with p_e = 44/15 p_cx + 14/3 p_id + 3 p_init + 2 p_idm. Here x_s^t indicates the s-th X-stabilizer at time t, as in Fig. 2b, and b is the boundary node. If an edge e is chosen by the matching algorithm, a Pauli Z correction is applied to qubit Q(e) if it is not ∅. [Full table data omitted.]

C. Simulation Details
We obtain theoretical simulation results using stabilizer simulations from the Qiskit software stack [33]. In order to faithfully estimate the performance of quantum error correction circuits on IBM Quantum Falcon systems, we performed simulations of the quantum circuits with qubits mapped onto the Falcon devices, using customized error models that capture the realistic noise behavior of the experimental hardware.
Circuit errors in our simulation are modeled as depolarizing errors, so that the effect of different error sources of varying strength can be captured. Noise models were built following the error locations and error channels described in Section III A: a depolarizing error model for each single- and two-qubit operation in the quantum circuit, with error rates obtained from simultaneous randomized benchmarking (RB); measurement, initialize, and reset errors in the form of a bit-flip error for each of those operations; and idling errors in the form of depolarizing noise.
Using the above-described error model, we define a realistic depolarizing error model where simulations are carried out with noise parameters directly exported from ibm_peekskill (Tables S3 and S4), including:
• specific error rates for each single- and two-qubit quantum operation, with the depolarizing quantum channel parameter obtained from simultaneous RB according to the relation ε_gate = (2^n − 1)/2^n (1 − α_gate), where ε_gate, n, and α_gate represent the error per gate, the number of qubits in the gate, and the depolarizing quantum channel parameter, respectively;
• initialization, measurement, and reset errors obtained as described in Table S3;
• idling errors with noise strength proportional to the coherence limit of the gate, where the coherence limit is computed using T1, T2, and the idle time of each qubit during the execution of each quantum operation in the circuit.
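The RB relation in the first bullet can be written as a small helper (our own sketch; the example α values are illustrative, not device data):

```python
def epg_from_depolarizing(alpha, n):
    """Error per gate from the depolarizing channel parameter:
    eps = (2^n - 1)/2^n * (1 - alpha), for an n-qubit gate."""
    d = 2 ** n
    return (d - 1) / d * (1 - alpha)

eps_1q = epg_from_depolarizing(0.999, n=1)   # 0.0005
eps_2q = epg_from_depolarizing(0.996, n=2)   # 0.003
```

Inverting this relation is how the simulator converts measured EPG values back into depolarizing channel strengths.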
Furthermore, to demonstrate the average performance of the circuit in a relatively uniform depolarizing error model, we define an average depolarizing error model where, instead of the specific error rates for different gates and qubits stated above, we use average error rates throughout the entire device to define the depolarizing error channels.

D. ibm peekskill and experimental details
Data in this section uses the qubit numbering notation (Q_N^F, contrasting with Q_N in Fig. 1) presented in Fig. S1a, matching standard IBM Quantum Falcon systems. Summarized in Table S3 are single-qubit benchmarks for ibm_peekskill, where single-qubit gates for all qubits (excluding virtual Z gates) are identically 35.55 ns. While the Falcon layout has 27 qubits, for the d = 3 circuits presented in this paper we only needed to use 23 of those qubits, as shown in Fig. S1a, excluding the qubits indicated there.

TABLE S3: Single-qubit device parameters, using the IBM-Falcon qubit numbering presented in Fig. S1a for ibm_peekskill. [Table data omitted.]
Single-qubit error per gate (EPG) is obtained by performing randomized benchmarking (RB) on a given qubit with all coupled/spectator qubits idling. In contrast, simultaneous single-qubit EPG (EPG simul) is obtained by performing one-qubit RB concurrently with two-qubit RB on neighboring gates (presented in Table S4). This is done to realistically approximate the simultaneous application of gates in the X and Z stabilizers. To separate readout error from initialization and reset error, the readout error is extracted from the overlap of Gaussian fits to the ground and excited state histograms. The initialization sequence for the data presented in this paper was three rounds of conditional reset, an Xπ pulse on the e-f transition, then three more rounds of conditional reset, in an effort to reduce the f-state population while maintaining a fast experiment repetition rate. The initialization error is that measured, on average, after this sequence is applied simultaneously on all qubits. The reset error is the average non-zero state population after a single round of conditional reset (done simultaneously across all qubits) after preparing all qubits with an Xπ/2, to capture the mid-circuit reset needed for each syndrome measurement round.
The always-on coupling between connected qubits on ibm_peekskill results in an undesirable static ZZ interaction, plotted in Fig. S1b as a function of qubit-qubit detuning. To mitigate some of these effects, a simple Xπ-Xπ dynamical decoupling sequence is added to code qubits throughout the circuit. Furthermore, by introducing mixed-dimensionality simultaneous RB [34], we can further capture the undesired side effects of this coupling by comparing one- and two-qubit gate errors taken with standard RB, with spectator qubits/gates fully idling, against those taken with the spectators simultaneously driven as set by the scheduling requirements of the Z and X checks. Simultaneous gate errors for gates and qubits not part of these measurements (always idling during the experiments presented in the main text) are thus not included in this extra characterization (listed in the tables as NaN). These results are presented in Tables S3 and S4. Optimization of two-qubit gates was undertaken on ibm_peekskill to ensure that no significant degradation in gate error or increase in leakage out of the computational manifold occurred in simultaneous benchmarking.
Using the same methodology presented in [14], reset operations conditioned on the preceding measurement result are used for the mid-circuit reset operations shown in Fig. 1b. The total time of the measurement + reset cycle is 768 ns, and includes an approximately 400 ns measurement pulse, cavity ring-down time overlapping with classical control path delays, and application of the conditional Xπ. For consistency, all qubits are calibrated to use the same pulse duration and delays, with the pulse amplitude calibrated individually to optimize the QND-ness of the readout.
To optimize the performance of the analytical perfect matching decoding on experimental data, an optimization algorithm was run to find a set of input error parameters that minimizes the decoder output logical error rates. Here we chose the L-BFGS-B algorithm [36] due to its efficiency and its ability to work with simple linear constraints. The optimization resulted in the following set of input error parameters for the analytical perfect matching decoding algorithm: p_C = [0.01, 0.0028, 0.0, 0.001, 0.002, 0.0028, 0.0028, 0.0, 0.0005, 0.0, 0.00001], following the error locations C = {cx, h, s, id, idm, x, y, z, measure, initialize, reset} as defined in Section III B. We use the following equation to fit the logical error at syndrome measurement round r:

P_L(r) = 1/2 [1 − (1 − 2A) e^{−r/τ}],    (S.1)

where A is the SPAM error, τ = −1/ln(1 − 2ε), and ε is the logical error rate per syndrome measurement round (Fig. 3a-inset and 3b).

TABLE S4: Two-qubit gate benchmarks for ibm_peekskill [35], with lengths and gate directions optimized for overall device performance. EPG is measured with spectator qubits idling, while simultaneous EPG is taken with spectator qubits undergoing single-qubit RB. [Table data omitted.]
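As a sanity check of the fit model (our own sketch, assuming the decay form P_L(r) = 1/2 [1 − (1 − 2A) e^{−r/τ}] implied by the stated τ; the parameter values below are illustrative, not fitted data), note that the distance from 1/2 shrinks by a factor (1 − 2ε) each round, so A and ε can be read off noiseless model values directly:

```python
import math

def logical_error(r, A, eps):
    """P_L(r) = 1/2 [1 - (1 - 2A) exp(-r/tau)],
    with tau = -1 / ln(1 - 2*eps), so P_L(0) = A."""
    tau = -1.0 / math.log(1 - 2 * eps)
    return 0.5 * (1 - (1 - 2 * A) * math.exp(-r / tau))

def extract_rates(p_by_round):
    """Recover (A, eps) from noiseless model values at rounds 0, 1, 2."""
    A = p_by_round[0]                                      # P_L(0) = A
    ratio = (0.5 - p_by_round[2]) / (0.5 - p_by_round[1])  # = 1 - 2*eps
    return A, (1 - ratio) / 2

data = [logical_error(r, A=0.01, eps=0.059) for r in range(5)]
A_fit, eps_fit = extract_rates(data)   # recovers A = 0.01, eps = 0.059
```

On real, noisy data a least-squares fit over all rounds replaces this closed-form extraction, but the recovered parameters have the same meaning.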

E. Leakage in the system
Leakage errors out of the computational space, comprising the states |0⟩ (g-state) and |1⟩ (e-state), into |2⟩ (f-state) or higher states cannot be corrected by our quantum error correction code and thus pose a serious threat to fault-tolerant computing. For fixed-frequency superconducting qubits, a certain set of qubit frequency assignments may lead to frequency collisions during the cross-resonance gate operation [2]. For example, when the target qubit frequency is close to the e → f transition frequency of the control qubit, a leakage error is induced during the two-qubit gate operation. Another example is the simultaneous operation of a two-qubit gate with a spectator single-qubit gate, where the spectator qubit frequency together with the target qubit frequency matches the e → f transition of the control qubit. This can result in leakage errors, which can be characterized by randomized benchmarking of the corresponding single- and two-qubit gates [37].
Leakage errors can also occur during measurements [38]. As we speed up the measurement time by increasing the measurement power, qubits become more prone to leakage. We characterize this measurement-induced leakage by repeatedly measuring the qubit and extracting the leakage rate. The experiment is described in Fig. S2a, where the sequence consists of an Xπ/2 pulse followed by a measurement tone. The Xπ/2 pulse will map either |0⟩ or |1⟩ to the equator of the Bloch sphere, so the sequence randomly samples either |0⟩ or |1⟩ during the subsequent measurement. The measurement leakage rate thus obtained is an average of the leakage rates from the |0⟩ and |1⟩ states. The outcomes obtained from the sequence in Fig. S2a are classified according to calibration data obtained by preparing the |0⟩, |1⟩, and |2⟩ states, using the closest distribution mean for each outcome, and then applying readout error mitigation by constraining the formalism described in [39] for multi-qubit readout to our single-qubit three-state subspace. This single-qubit readout error mitigation is applied to the ensemble of measurements obtained for each iteration of the pulse sequence. The measurement sequence is repeated m = 70 times, and we average over the 10,000 shots for each m to compute the averaged probability that the qubit is binned in the |2⟩ state. Fig.
S2b shows the measurement leakage probability, p meas leak , where the qubit leaks to the |2 state per measurement.Eventually a steady state population in the |2 state, determined by the measurement leakage and seepage rates, is reached.We extract the leakage and seepage rates using the equation where the leakage rate Γ L is the probability of the qubit leaking during a measurement, the seepage rate Γ S is the probability of a leaked state returning to the qubit subspace during a measurement.Here, Γ L,S measures rate per measurement, therefore it is a unitless quantity.The obtained average and median value of Γ L are 6.54 × 10 −3 and 4.86 × 10 −3 per measurement, respectively.We extract the two-qubit gate leakage and seepage rate of the |2 state from simultaneous randomized benchmarking, with the simultaneity chosen to match the Z− and X−stabilizer sequences as illustrated in Fig. 1.Similarly, we extract the leakage/seepage rate from repeated measurement described in Fig. S2a.In these estimations, we account for the number of gate operations and measurements for each syndrome/flag qubits as well as the code qubits measured at the end.For instance, a two round experiment for the logical Z−basis consists of an X−check for state preparation, two rounds of X− and Z−checks, and a final measurement of the code qubits.Each check consists of two-qubit gates and measurements.As a result, there are three sets of two-qubit gates and measurements on X−check qubits, two sets of two-qubit gates and measurements on Z−check qubits, and one measurement of the code qubits.The post-selection procedure discards the result if any of the qubit is leaked from the computational subspace.Therefore, we sum all the leakage probabilities to compute p tot leak for each syndrome measurement round.Fig. 
S2c, S2d shows p tot leak as a function of the number of rounds for the Z− and X− logical bases, respectively.Each bar represents p tot leak from two-qubit gates (blue) and measurement (red) operations.The leakage error caused by measurement contributes the most for early rounds, then tends to saturate.The leakage contribution from two-qubit gates becomes significant for later rounds.This analysis shows that reducing leakage error from both two-qubit gates and measurements is important.Decreasing leakage induced by two-qubit gates in our architecture will be associated with slower gates.With respect to measurement, as noted above, it is well known that a strong drive on a superconducting qubit system can lead to transitions both beyond the computational space [38] and beyond the confinement of the Josephson cosine potential [40].There is therefore a trade-off to be considered between readout error and measurement length and leakage probability.Slower readout impacts the system by increasing the idle time of the qubits not being measured.There have been proposals to deal with leakage in superconducting qubit systems by moving all the qubit excitations to the readout resonator, from which they decay to the environment [41], or by designing readout resonator leakage reduction units (LRU) [42] which exploit particular transition levels of the qubit-resonator system and which transform leakage errors into Pauli errors.LRU have also been proposed at the code level [43].These options, as well as higher branching capabilities in readout and control electronics to conditionally reset qubits to the ground state from higher excitation levels, could be explored in experimental systems demonstrating quantum error correction in the near future.
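As an illustration, the leakage and seepage rates can be extracted by fitting the saturating |2⟩-state population to the model of Eq. S.2. The sketch below uses synthetic data with hypothetical per-measurement rates (not our experimental values) and scipy.optimize.curve_fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def p2_model(m, gamma_L, gamma_S):
    """|2>-state population after m measurements (Eq. S.2): a saturating
    exponential with steady-state value gamma_L / (gamma_L + gamma_S)."""
    g = gamma_L + gamma_S
    return (gamma_L / g) * (1.0 - np.exp(-g * m))

# Synthetic data for m = 1..70 repeated measurements (illustrative rates only)
rng = np.random.default_rng(0)
m = np.arange(1, 71)
true_L, true_S = 6.5e-3, 0.12
p2 = p2_model(m, true_L, true_S) + rng.normal(0.0, 2e-4, m.size)

# Fit both rates simultaneously; they are unitless (per measurement)
(fit_L, fit_S), _ = curve_fit(p2_model, m, p2, p0=[1e-3, 0.1],
                              bounds=([0.0, 0.0], [1.0, 1.0]))
print(fit_L, fit_S)
```

The steady-state plateau fixes the ratio Γ_L/(Γ_L + Γ_S), while the approach to saturation fixes the sum, so both rates are identifiable from a single decay curve.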

F. Post-selection method
We post-select all our results to remove detected leakage events in any of the qubits in our system. To do this, we collect 5,000 integrated outputs for each qubit prepared in each of the states |0⟩, |1⟩, and |2⟩. We show this calibration for Q_F12 (see Fig. S1a) in Fig. S3a. The overlap between the |1⟩ and |2⟩ states, which is significant in all 23 qubits used in this work, makes the classification of these states challenging. Furthermore, the presence of decay events (|1⟩ to |0⟩, |2⟩ to |1⟩, or |2⟩ to |0⟩) may impair results that use this training data within a supervised learning protocol. We instead apply clustering methods to our calibration data using a Gaussian Mixture Model (GMM) with three clusters, each with an independent diagonal covariance matrix. The diagonal entries of the covariance matrices yield the standard deviations of the distribution for each qubit state, which offers a convenient way to define more flexible classification rules than simpler clustering algorithms such as K-means. Once the centroids and standard deviations (σ_x and σ_y) are determined from the calibration data, we define regions for each state within the I/Q plane, given by a radius of 3σ on each axis around the corresponding centroid (see Fig. S3). For any given measurement on any of the qubits, if the integrated outcome is within the |0⟩-state region and the I-quadrature is negative, we classify that outcome as |0⟩. Otherwise, if it is within the |1⟩-state region we classify it as |1⟩, and if it is within the |2⟩-state region but not within the |1⟩-state region, we classify it as |2⟩. All other outcomes are classified according to the closest centroid.
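A minimal sketch of such a GMM-based three-state classifier is shown below. The I/Q centroids, widths, and synthetic calibration data are hypothetical (our device calibration is shown in Fig. S3a), but the precedence rules follow the description above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Hypothetical I/Q centroids for |0>, |1>, |2> (illustrative only)
prep_means = {0: (-3.0, 0.0), 1: (1.5, 2.0), 2: (1.2, -1.8)}
cal = np.vstack([rng.normal(mu, 0.4, size=(5000, 2))
                 for mu in prep_means.values()])

# Three clusters, each with an independent diagonal covariance matrix
gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(cal)
# Map each GMM component to the state whose preparation mean it sits closest to
comp_of = {s: int(np.argmin(np.linalg.norm(gmm.means_ - np.array(mu), axis=1)))
           for s, mu in prep_means.items()}
centroid = {s: gmm.means_[c] for s, c in comp_of.items()}
sigma = {s: np.sqrt(gmm.covariances_[c]) for s, c in comp_of.items()}

def in_region(iq, s):
    # Within 3 sigma of the state-s centroid on each I/Q axis
    return bool(np.all(np.abs(iq - centroid[s]) < 3.0 * sigma[s]))

def classify(iq):
    iq = np.asarray(iq, dtype=float)
    if in_region(iq, 0) and iq[0] < 0:   # |0> region with negative I-quadrature
        return 0
    if in_region(iq, 1):                 # then the |1> region takes precedence
        return 1
    if in_region(iq, 2):                 # |2> region but not |1>
        return 2
    # fall back to the closest centroid
    return min(centroid, key=lambda s: np.linalg.norm(iq - centroid[s]))
```

Giving the |1⟩ region precedence over the |2⟩ region makes the classifier conservative: ambiguous outcomes in the |1⟩/|2⟩ overlap are kept as correctable |1⟩ events rather than discarded.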
This classification method is applied to every qubit after every measurement, and the experimental runs in which any qubit is measured as |2⟩ are discarded. Fig. S3b shows the readout outcomes of Q_F12 after the last initialization measurement. We only discard uncorrectable errors (the |2⟩ state) and retain experimental shots in which a qubit is in the |1⟩ state after initialization, as that is an error correctable by the code. Fig. S3c shows the Q_F12 results after the first X−check for logical |0⟩ state preparation. Both the initialization and mid-circuit data sets contain the 500,000 shots used for each error correction run in our experiments. For the initialization classification we obtain populations of 0.9910, 0.0071, and 0.0019 for the |0⟩, |1⟩, and |2⟩ states, respectively. For the mid-circuit X−syndrome classification, these populations are 0.4972, 0.4962, and 0.0066.
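The post-selection step itself reduces to discarding any shot in which some measurement was classified as |2⟩. A sketch on hypothetical classified outcomes (the per-measurement leakage probability below is illustrative, not our measured value):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical classified outcomes: shots x measurement slots, values 0/1/2
outcomes = rng.choice([0, 1, 2], size=(500_000, 4), p=[0.70, 0.29, 0.01])

# Discard any shot where a qubit was binned in |2> at any measurement
keep = ~(outcomes == 2).any(axis=1)
p_accept = keep.mean()      # acceptance probability; p_exp_leak = 1 - p_accept
kept = outcomes[keep]

# Shots containing |1> outcomes survive: those errors are correctable
print(p_accept, (kept == 1).any(axis=1).mean())
```

With an independent |2⟩ probability p per slot, the acceptance probability falls as (1 − p)^n in the number of measurement slots n, which is why the accumulated p_leak^tot grows with the number of syndrome rounds.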

G. Error for r = 2 rounds
Table S5 shows a comparison across the decoders studied in this work for state preparation and two rounds of syndrome measurement for the logical states |+⟩_L, |−⟩_L, |0⟩_L, and |1⟩_L. The results for the matching decoder with analytical input for the states |0⟩_L and |+⟩_L correspond to the values shown in Fig. 3a.

TABLE S5: Comparison of logical error extracted using matching uniform, matching analytical, and maximum likelihood decoders on both full and leakage post-selected (PS) data sets for r = 2 rounds. The uncertainty corresponds to sampling noise, with each full data set comprising 500,000 shots; the number of shots retained in each post-selected data set is shown in the last column.

FIG. 2: Decoding graphs for three rounds of (a) Z and (b) X stabilizer measurements for correcting X and Z errors, respectively, on the d = 3 heavy-hexagon code with circuit-level noise. The blue (a) and red (b) nodes in the graphs correspond to stabilizers, and the black nodes correspond to the boundary. Node labels are defined by the stabilizer type (Z or X), with a subscript indexing the stabilizer and a superscript denoting the round. (c) Black edges, arising from Pauli Y errors, connect the two graphs in (a) and (b). The three size-4 hyperedges involving the top Z-stabilizer are shown with a gold outline. (d) The four size-4 hyperedges involving the bottom Z-stabilizer are shown with a gold outline.
FIG. S1: (a) Translation of the Fig. 1a qubit numbering (Q_N) to the standard IBM-Falcon numbering (Q_FN). (b) Static ZZ between all connected qubit pairs versus the detuning between the qubits. The median qubit anharmonicity (see Table S3 for a breakdown) is −345 MHz.

FIG. S2: (a) Repeated measurement sequence for extracting the leakage error during measurement. The X_π/2 pulse allows us to randomly sample leakage events from the |0⟩ or |1⟩ states. (b) The leakage probability (p_leak^meas) to the |2⟩ state measured at Q_F14. The leakage and seepage rates are obtained by fitting the data with Eq. S.2. (c, d) Qubit leakage in the system as a function of syndrome measurement rounds for the Z− and X−basis logical states. Bar plots show p_leak^tot as computed from the gate and measurement leakage rates, obtained from randomized benchmarking (two-qubit gates) and from the sequence shown in (a), respectively. Experimental results, p_leak^exp = 1 − p_accept, where p_accept is the acceptance probability calculated with the method outlined in Supp. F, are shown as black symbols for comparison. The experimental results plotted here do not include initialization leakage.
FIG. S3: (a) Readout calibration data for Q_F12 (see Fig. S1a). The qubit is prepared in its |0⟩, |1⟩, and |2⟩ states and measured. The collected statistics are shown as blue (|0⟩), red (|1⟩), and grey (|2⟩) points, where the dot-dashed lines represent the 3σ contour of each distribution. (b) Three-state classification results for Q_F12 after qubit initialization, and (c) after the first X−syndrome measurement.

TABLE S1: Edge data for the X-error decoding graph shown in Fig. 2a. Here z^t_s indicates the s-th Z-stabilizer at time t, as in Fig. 2a, and b is the boundary node. If an edge e is chosen by the matching algorithm, a Pauli X correction is applied to qubit Q(e) if it is not ∅.

TABLE S2: Edge data for the Z-error decoding graph shown in Fig. 2b.

TABLE S4: Two-qubit gates used in the X and Z stabilizers for ibm_peekskill. The CX gates are constructed from the echoed cross-resonance gate for r = 2 rounds.