Parallel window decoding enables scalable fault tolerant quantum computation

Large-scale quantum computers have the potential to hold computational capabilities beyond conventional computers. However, the physical qubits are prone to noise which must be corrected in order to perform fault-tolerant quantum computations. Quantum Error Correction (QEC) provides the path for realizing such computations. QEC generates a continuous stream of data that decoders must process at the rate it is received, which can be as fast as 1 μs per QEC round in superconducting quantum computers. If the decoder infrastructure cannot keep up, a data backlog problem is encountered and the computation runs exponentially slower. Today’s leading approaches to quantum error correction are not scalable as existing decoders typically run slower as the problem size is increased, inevitably hitting the backlog problem. Here, we show how to parallelize decoding to achieve almost arbitrary speed, removing this roadblock to scalability. Our parallelization requires some classical feed forward decisions to be delayed, slowing-down the logical clock speed. However, the slow-down is now only polynomial in the size of the QEC code, averting the exponential slowdown. We numerically demonstrate our parallel decoder for the surface code, showing no noticeable reduction in logical fidelity compared to previous decoders and demonstrating the predicted speedup.

Large-scale quantum computers have the potential to hold computational capabilities beyond conventional computers for certain problems. However, the physical qubits within a quantum computer are prone to noise and decoherence, which must be corrected in order to perform reliable, fault-tolerant quantum computations. Quantum Error Correction (QEC) provides the path for realizing such computations. QEC generates a continuous stream of data that decoders must process at the rate it is received, which can be as fast as 1 MHz in superconducting quantum computers. A little known fact of QEC is that if the decoder infrastructure cannot keep up, a data backlog problem [1] is encountered and the quantum computer runs exponentially slower. Today's leading approaches to quantum error correction are not scalable as existing decoders typically run slower as the problem size is increased, inevitably hitting the backlog problem. That is: the current leading proposal for fault-tolerant quantum computation is not scalable. Here, we show how to parallelize decoding to achieve almost arbitrary speed, removing this roadblock to scalability. Our parallelization requires some classical feed-forward decisions to be delayed, leading to a slow-down of the logical clock speed. However, the slow-down is now only polynomial in code size, averting the exponential slowdown. We numerically demonstrate our parallel decoder for the surface code, showing no noticeable reduction in logical fidelity compared to previous decoders and demonstrating the parallelization speedup.
Quantum error correction (QEC) generates a stream of syndrome data to be decoded. An offline decoder collects and stores all the syndrome data generated during a hardware run (often called a shot) and then performs decoding as a post-processing step. Offline decoding is sufficient for computations consisting solely of Clifford gates (e.g. CNOT and Hadamard gates). However, fault-tolerant quantum computations must adapt in response to certain logical measurement results, which must be decoded to be reliable. For instance, when performing $T := \mathrm{diag}(1, e^{i\pi/4})$ gates using teleportation and a magic state [2,3], we must decide whether to apply a Clifford $S := \mathrm{diag}(1, e^{i\pi/2})$ correction before performing the next non-Clifford operation (see Fig. 1). This logic branching decision can only be reliably made after we decode the syndrome data from the T gate teleportation [1,4,5]. Therefore, online, or real-time, decoding is necessary for useful quantum computation. Classical computation occurs at finite speed, so online decoders will have some latency, but they need only react fast enough to enable feed-forward and Clifford correction.
FIG. 1. A gate-teleportation circuit to perform a T gate using a magic state $|T\rangle := T|+\rangle$, including a classically controlled S gate depending on the measurement outcome. In fault-tolerant implementations with logical qubits, the logical Z measurement must be decoded before the S correction can be correctly applied. This leads to a response time $\tau$ that is largely determined by the decoding time but also includes communication and control latency.

How fast do decoders need to be? A fundamental requirement was first noted by Terhal [1] in her backlog argument: "Let $r_{proc}$ be the rate (in bauds) at which syndrome bits are processed and $r_{gen}$ be the rate at which these syndrome bits are generated. We can argue that if $r_{gen}/r_{proc} = f > 1$, a small initial backlog in processing syndrome data will lead to an exponential slow down during the computation, . . ." Terhal proved that quantum algorithms with T-depth $k$ have a running time lower bounded by $c f^k$ when $f > 1$, where $c$ is some constant. Refs. [6,7] provide more detailed reviews of this backlog argument. However, for all known decoders, as we scale the device decoding becomes more complex, the value of $f$ increases and inevitably we encounter the backlog problem.
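The backlog argument can be illustrated with a deliberately simplified toy model, which is our own sketch and not Terhal's proof. It assumes that every feed-forward decision must wait until all syndrome data generated since the previous decision has been decoded, and that decoding a stretch of data takes $f$ times as long as generating it. The function name `decision_times` and its parameters are hypothetical.

```python
def decision_times(f, first_interval, n_layers):
    """Wall-clock times at which successive feed-forward decisions become available.

    Toy model (an assumption, not the paper's derivation): the data generated
    during one waiting interval takes f times that interval to decode, and the
    quantum computer keeps generating syndrome data while it waits.
    """
    times = []
    elapsed = 0.0
    interval = first_interval
    for _ in range(n_layers):
        elapsed += interval
        times.append(elapsed)
        # If the decoder keeps up (f <= 1) the interval stays constant;
        # otherwise each wait is f times longer than the previous one.
        interval = max(first_interval, f * interval)
    return times


if __name__ == "__main__":
    print(decision_times(2.0, 1.0, 4))  # intervals 1, 2, 4, 8
    print(decision_times(0.5, 1.0, 4))  # decoder keeps up: constant intervals
```

For $f > 1$ the intervals form a geometric series, so the time to reach the $k$-th layer grows like $f^k$, in line with the quoted lower bound.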
Here we solve this problem, removing a fundamental roadblock to scalable fault-tolerant quantum computation. We propose parallelized window decoding that can be combined with any inner decoder that returns an (approximately) minimum weight solution, presenting results for minimum-weight perfect matching (MWPM) [8][9][10] and union-find (UF) [11,12].
The previous leading idea for modifying decoders to work online was proposed by Dennis et al. [8]: "take action to remove only these long-lived defects, leaving those of more recent vintage to be dealt with in the next recovery step." Here defects refer to observed changes in syndrome. Dennis et al. called this the overlapping recovery method [8,13]. Later, similar approaches were adopted for decoding classical LDPC codes [14], where this is known as sliding window decoding. Roughly speaking, given a sequence of defects proceeding in time, one decodes over some contiguous subset, or window. The decoder output gives only tentative error assignments, and from these only a subset (those of an older vintage) are 'committed'. Here, committing means making a final correction decision for potential error locations, with all corrections performed in software. One then slides the window up and the process repeats.
Sliding window decoding is inherently sequential. Let us consider a single code block (e.g. a surface code patch) with each QEC round taking $\tau_{rd}$ seconds. If each window is responsible for committing error corrections over $n_{com}$ rounds of syndrome data, then it takes time $n_{com}\tau_{rd}$ to generate all this data. If the time to decode each window is $\tau_W$, including any communication latency, then avoiding Terhal's backlog problem requires that $\tau_W < n_{com}\tau_{rd}$. Since $\tau_W$ typically grows superlinearly with the decoding volume, this leads to a hard upper bound on the achievable distance $d$. For example, a distance $d$ surface code has $\tau_W = \Omega(n_{com} d^2)$ and therefore we are restricted to $d^2 \leq O(\tau_{rd})$. Scaling hardware based on a fixed device physics means $\tau_{rd}$ is fixed. This imposes a hard limit on code distance. The reader should pause to reflect how remarkable it is that the current leading proposal for fault-tolerant quantum computation is not scalable.
As with sliding window decoding, our parallel window decoder breaks the problem up into sets of overlapping windows. Rather than solving these sequentially, some windows are decoded in parallel by adapting how overlapping windows are reconciled. Through numerical simulations, we find that sliding, parallelized and global approaches differ in logical error rates by less than the error bars in our simulations. We show that, by scaling classical resources, parallel window decoding can achieve almost arbitrarily high $r_{proc}$ regardless of the decoding time per window $\tau_W$. Furthermore, we show that while there is still an inherent latency determined by $\tau_W$, leading to a slowdown of the logical clock speed, this is only linear in $\tau_W$, rather than the exponential slowdown resulting from Terhal's backlog argument. We conclude with a discussion of the implications of this work for practical decoder requirements and extensions to a number of other decoding problems. After making this work public, similar results were posted by the Alibaba team [15]. The Alibaba numerics present the logical fidelity of the decoder, but do not include numerical results on decoding speed or on improvements from increasing the number of processors used.

A. Matching decoders
Windowing techniques, both sliding and parallel, can be combined with most decoders acting internally on individual windows. We will refer to these as the "inner decoders". However, for brevity, in the main text we will describe the procedure for the case of matching decoders, such as MWPM and union-find. A matching decoder is applicable when any error triggers either a pair of defects or a single defect. For example, in the surface code X errors lead to pairs of defects (when occurring in the bulk) or a single defect (when occurring at so-called rough boundaries of the code). To fully formulate a matching problem, all errors must lead to a pair of defects. Therefore, errors triggering a single defect are connected to a virtual defect commonly called the boundary defect. We then have a graph where the vertices are potential defects (real or boundary) and edges represent potential errors. Given an actual error configuration, we get a set of triggered defects and we can enforce that this is an even number by appropriately triggering the boundary defect. A matching decoder takes as input this set of triggered defects and then outputs a subset of edges (representing a correction) that pair up the triggered defects. Running a decoder on our entire defect data set at once (no windowing) will be referred to as global decoding, but global decoding is not compatible with the real-time feedback required for non-Clifford gates.
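As a small illustration of how the boundary defect keeps the number of triggered defects even, the following sketch uses a single round of a repetition code with perfect measurements. This setting is our simplifying assumption (the paper's numerics use the surface code), and the function name `defects_with_boundary` is hypothetical.

```python
def defects_with_boundary(errors):
    """Triggered defects for one noiseless round of a repetition code.

    errors: list of 0/1 bit flips on the data qubits. Check i compares
    qubits i and i+1, so a flip in the bulk triggers two checks and a flip
    on an end qubit triggers only one.
    Returns (list of triggered check indices, whether the virtual boundary
    defect must be triggered to make the total even).
    """
    checks = [errors[i] ^ errors[i + 1] for i in range(len(errors) - 1)]
    defects = [i for i, fired in enumerate(checks) if fired]
    boundary_triggered = len(defects) % 2 == 1
    return defects, boundary_triggered


if __name__ == "__main__":
    print(defects_with_boundary([0, 1, 0, 0]))  # bulk error: a pair of defects
    print(defects_with_boundary([1, 0, 0, 0]))  # end error: one defect plus boundary
```

A matching decoder would then pair up the returned defects, treating the boundary defect as one more vertex whenever it is triggered.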

B. Sliding window decoding
Instead of decoding a full history of syndrome data after the computation is complete, sliding window decoding starts decoding the data in sequential steps while the algorithm is running. At each step, a subset (window) of $n_W$ rounds of syndrome extraction is processed. The window correction graph is acquired by taking all the vertices and edges containing defects in the selected rounds. The measurement errors in the final round of a window only trigger a single defect within the window. Therefore, all final round defects are additionally connected to the boundary defect, referred to as the rough top time boundary.
Following the overlapping recovery method [8,13], a window can be divided into two regions: a commit region consisting of the "long-lived" defects in the first $n_{com}$ rounds, and a buffer region containing the last $n_{buf}$ rounds ($n_W = n_{com} + n_{buf}$). An inner decoder (e.g. MWPM or UF) outputs a subset of tentative correction edges within the window. Only the correction edges in the commit region are taken as final. Sometimes, the chains of tentative correction edges will cross from the commit to the buffer region. Applying only the part of the chain in the commit region will introduce new defects, referred to as "artificial defects", along the boundary between the commit and buffer regions. The window is then moved up by $n_{com}$ for the next decoding step, which now includes the artificial defects along with the unresolved defects from the buffer region of the preceding step and new defects in the following rounds. Fig. 2 illustrates sliding window decoding for the simple example of a repetition code, naturally extending to surface codes by adding another spatial dimension. Notice in Fig. 2 the creation of artificial defects where tentative corrections cross between commit and buffer regions.

FIG. 2. At each decoding step a number of syndrome rounds (window) is selected for decoding (orange region in left columns), and tentative corrections acquired. The corrections in the older part of the window (green region in right columns) are of high confidence and are committed to. The window is then moved up to the edge of the commit region and the process repeated. We decide to commit to the edges going from the commit region out of it, producing artificial defects defined by nodes outside of the region belonging to such an edge.
Due to these artificial defects, sliding window decoding (and also parallel window decoding, described below) requires an inner decoder which returns an approximately low weight correction, such as UF or MWPM. Other decoders, such as those based on tensor network contractions, identify the optimal homology class (all errors differing by stabilizers are in the same class) that contains a low-weight correction. Once a homology class has been identified, we can always efficiently select a representative correction from the class, but this could be a high weight correction (e.g. containing many stabilizer loops), leading to additional artificial defects at the boundary of the committed region, and then to logical errors when the next window is decoded. Therefore, additional modifications beyond those discussed in this work would be needed to use homology-based inner decoders.
Processing only a subset of the syndrome data at a time inevitably reduces the logical fidelity of the decoder. However, a logical fidelity close to that of the global decoder can be retained by making the unaccounted failure mechanisms negligible compared to the global failure rate. In particular, the error chains beginning in the committed region need to be unlikely (compared to the global failure rate) to span the buffer region and extend beyond the window. If the measurement and qubit error rates are comparable, to achieve this for distance $d$ codes it suffices to make the buffer region of the same size, $n_{buf} = d$ [8]. In Appendix C, we demonstrate numerically that by choosing $n_{buf} = n_{com} = d$ there is no noticeable increase in logical error rate when applying the sliding window algorithm.
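The sliding-window schedule described in this section can be sketched as a generator of round ranges. This is only an index bookkeeping sketch, not a decoder: the name `sliding_windows` is hypothetical, ranges are half-open, and the rule that the final window commits all remaining rounds is our assumption about how the end of a run is handled.

```python
def sliding_windows(n_rounds, n_com, n_buf):
    """Yield (window_start, commit_end, window_end) as half-open round ranges.

    Rounds [window_start, commit_end) are committed; rounds
    [commit_end, window_end) form the buffer, whose corrections are only
    tentative and are revisited by the next window.
    """
    start = 0
    while start < n_rounds:
        window_end = min(start + n_com + n_buf, n_rounds)
        # Assumption: once the window reaches the last round there is no
        # future data to wait for, so everything remaining is committed.
        commit_end = n_rounds if window_end == n_rounds else start + n_com
        yield start, commit_end, window_end
        start = commit_end


if __name__ == "__main__":
    d = 3
    for window in sliding_windows(10, n_com=d, n_buf=d):
        print(window)
```

Each window can only be decoded after the previous one has produced its artificial defects, which is what makes this schedule sequential.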

C. Parallel window decoding
Here we present our main innovation to overcome the backlog problem, which we call parallel window decoding. We illustrate the method in Fig. 3. As in Fig. 2, our illustration is for a repetition code example, naturally extending to a surface code, with further extensions discussed in Section I E.
Parallel window decoding proceeds in two layers. First, we process a number of non-overlapping windows in decode layer A concurrently. As opposed to the sliding window approach, there are potentially unprocessed defects preceding the rounds in an A window. We thus need to include a buffer region both preceding and following the commit regions. Additionally, we set both time boundaries to be rough, connecting the first and last round of defects to the boundary node. We set $n_{buf} = n_{com} = w$, giving a total of $n_W = 3w$ per window for some constant $w$. Using the same reasoning as with the sliding window we set $w = d$. Note that in Fig. 3 we use $w < d$ to keep the illustration compact.
Having committed to corrections in adjacent windows and computed the resulting artificial defects, in layer B we fill in the corrections in the rounds between the neighbouring A commit regions. For convenience, we separate A windows by $d$ rounds, so that B windows also have $n_W = 3d$ rounds. As the corrections preceding and succeeding the rounds in B windows have been resolved in layer A, the B windows have smooth time boundaries and do not require buffers.
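The two-layer layout can be written down as round ranges. The sketch below covers only the bulk pattern with $w = d$, where each layer-A window spans $3d$ rounds, neighbouring A windows are separated by $d$ rounds, and each B window fills the gap between consecutive A commit regions. The function name `parallel_windows` is hypothetical, ranges are half-open, and the special handling of the first and last rounds of a run is deliberately omitted.

```python
def parallel_windows(n_periods, d):
    """Return (layer_a, layer_b) window layouts as half-open round ranges.

    Layer-A window k spans 3d rounds (buffer, commit, buffer) and starts at
    round 4*d*k, so consecutive A windows are separated by d rounds.
    Layer-B window k covers the rounds between the commit regions of A
    windows k and k+1 and commits to all of them.
    """
    layer_a = []
    for k in range(n_periods):
        start = 4 * d * k
        layer_a.append({"window": (start, start + 3 * d),
                        "commit": (start + d, start + 2 * d)})
    layer_b = []
    for k in range(n_periods - 1):
        lo = layer_a[k]["commit"][1]      # end of the earlier A commit region
        hi = layer_a[k + 1]["commit"][0]  # start of the later A commit region
        layer_b.append({"window": (lo, hi), "commit": (lo, hi)})
    return layer_a, layer_b


if __name__ == "__main__":
    a, b = parallel_windows(n_periods=2, d=3)
    print(a)
    print(b)
```

With this layout every round belongs to exactly one commit region, and each B window has length $3d$, matching the window size used in layer A.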
Crucially, if the size of windows and the commit region in layer A are chosen appropriately, we expect no significant drop in logical fidelity compared to the global decoder. As with sliding windows, this is because each error chain of length $\leq d$ is guaranteed to be fully captured within one of the windows. In Fig. 4a we verify this by simulating the decoding process. We find that the logical error rates of rotated planar codes using the global MWPM and parallel window MWPM are within the numerical error of each other across a range of code sizes and number of measurement rounds. The same holds for UF-based decoders, with data presented in Appendix C. This approach is highly parallelizable: as soon as the last round of window $A_n$ has been measured, the data can be given to a worker process to decode it. However, as the window $B_n$ requires the artificial defects generated by windows $A_n$ and $A_{n+1}$ adjacent to it (see Fig. 3), it can only start once both processes have completed. In Appendix D, we sketch a schematic defining how the data pipelining could be implemented in an online parallel window decoder to achieve a high utilization of available decoding cores.
Assuming no parallelization overhead, the syndrome throughput will scale linearly with the number of parallel processes $N_{par}$. In this case, $N_{par} n_{com}$ rounds are committed to in layer A, and $N_{par} n_W$ in layer B. Each round takes $\tau_{rd}$ to acquire and the two layers of decoding take $2\tau_W$. To avoid the backlog problem, we need the acquisition time to be greater than the decoding time:

$$N_{par} (n_{com} + n_W)\, \tau_{rd} \geq 2\tau_W. \tag{1}$$

Therefore, the number of processes needs to be at least:

$$N_{par} \geq \frac{2\tau_W}{(n_{com} + n_W)\, \tau_{rd}}. \tag{2}$$

In practice, the overhead of data communication among worker processes needs to be considered. In the parallel window algorithm, each process only needs to receive defect data before it is started, and return the artificial defects and the overall effect of the committed correction on the logical operators (see Appendix D).
Thus, we expect the data communication overhead to be negligible compared to the window decoding time. Indeed, in Fig. 4b we demonstrate this by simulating parallel window decoding in Python using MWPM as the inner decoder, showing how using $N_{par} = 16$ leads to greater than an order-of-magnitude increase in decoding speed. Some sub-linearity can be seen due to parallelization overheads in software, particularly for low-distance codes where the decoding problem is relatively simple. In Appendix C, we repeat these simulations using the UF decoder, where the overhead is more noticeable due to faster decoding of individual windows. However, hardware decoders such as FPGAs (Field Programmable Gate Arrays) and ASICs (Application-Specific Integrated Circuits) are more suited to parallel data processing, allowing a large number of processes without being bottlenecked by the communication overheads (discussed further in Appendix D). Lastly, even with some sub-linearity, the backlog can be averted, since we only need arbitrary decoding speed to be achievable with a polynomial number of processors.

D. Resulting resource overheads
While we can achieve almost arbitrarily high syndrome processing rates, there is still an inherent latency determined by the time to decode each window $\tau_W$. If $\tau_W$ is large compared to the physical QEC round time $\tau_{rd}$, we may slow down the logical clock of the quantum computer to compensate for this latency. This slowdown is achieved simply by extending the delay time $\tau$ as shown in Fig. 1. If we pick $N_{par}$ as described in Eq. (2), at every instance a block of $n_{lag} = N_{par}(n_{com} + n_W)$ rounds are being decoded at once. The last round for which the full syndrome history has been decoded is therefore going to be $n_{lag}$ rounds behind the most recently measured syndrome data. Therefore, we can set the response time after each T-gate (as defined in Fig. 1) to

$$\tau = n_{lag}\, \tau_{rd} = N_{par} (n_{com} + n_W)\, \tau_{rd}. \tag{3}$$

Combining Eq. (2) and Eq. (3), the response time is $\tau \approx 2\tau_W$. That is, for an algorithm with $k$ layers of T gates, the total response time is $\tau k \approx 2k\tau_W$. This is in stark contrast to the exponential in $k$ response time observed by Terhal [1]. Furthermore, using an efficient decoder for each window, the average window decode time $\tau_W$ scales polynomially with code size $d$, so $\tau_W = O(d^{\alpha})$ for some constant $\alpha$. Since code size is poly-logarithmic in algorithm depth $k$ and width $W$, $d = O(\log(kW)^{\beta})$ for some constant $\beta$. The response time per layer of T-gates is a poly-logarithmic factor, so $\tau = O(\log(kW)^{\alpha\beta})$. Strictly speaking, this additional overhead increases the decoding volume $kW$ by a logarithmic factor, but overall still gives a poly-logarithmic complexity.
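The resource counting in this section can be checked with a few lines of arithmetic. The sketch below assumes the backlog condition takes the form stated in the text: the time to acquire $N_{par}(n_{com} + n_W)$ rounds must be at least the $2\tau_W$ spent on the two decoding layers. The function names and the example numbers are hypothetical and are not taken from the paper's data.

```python
import math


def min_parallel_processes(tau_w, tau_rd, n_com, n_w):
    """Smallest N_par with N_par * (n_com + n_w) * tau_rd >= 2 * tau_w."""
    return math.ceil(2 * tau_w / ((n_com + n_w) * tau_rd))


def response_time(n_par, tau_rd, n_com, n_w):
    """Lag between the newest syndrome round and the last fully decoded round."""
    n_lag = n_par * (n_com + n_w)
    return n_lag * tau_rd


if __name__ == "__main__":
    # Illustrative numbers only: times in units of one QEC round, d = 11,
    # n_com = d and n_w = 3d, window decode time of 1000 rounds.
    d = 11
    n_par = min_parallel_processes(tau_w=1000, tau_rd=1, n_com=d, n_w=3 * d)
    print(n_par, response_time(n_par, tau_rd=1, n_com=d, n_w=3 * d))
```

Because $N_{par}$ is rounded up to an integer, the resulting response time is slightly above $2\tau_W$, consistent with the approximate equality quoted above.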
We define logical clock time as how long it takes to execute one logical non-Clifford gate. Performing T-teleportation by lattice surgery, and assuming no bias between measurement and physical errors, takes $d\tau_{rd}$ time for lattice surgery plus the response time $\tau$. This gives a logical clock time of $\tau_{clock} := d\tau_{rd} + \tau$. Alternatively, this time overhead can be converted into a qubit overhead by moving Clifford corrections into an auxiliary portion of the quantum computer [16], for example using auto-corrected T-gate teleportation [3,17]. In algorithm resource analysis, a common assumption is that T gates are performed sequentially [3,[18][19][20][21][22][23][24][25], as then only a few magic-state factories are needed to keep pace. Auto-correction gadgets enable us to perform the next T-gate before the response time has elapsed. The price is that an auxiliary logical qubit must instead be preserved for time $\tau$, after which it is measured in a Pauli basis depending on the outcome of the decoding problem. Therefore, instead of a time overhead we can add $\tau/(d\tau_{rd})$ auxiliary logical qubits. If we have an algorithm with 100 logical qubits and $\tau_{clock} = 10\,d\tau_{rd}$, then without auto-correction we incur a 10× time cost, and with auto-correction we instead require 9 auxiliary logical qubits and so a 1.09× qubit cost. Under these common algorithm resource assumptions, we find that seemingly large time overheads from parallel window decoding can be exchanged for modest qubit overheads. Indeed, the auto-correction strategies trade time for space resource, but the overall space-time volume is preferable under these resource estimation assumptions (1.09× instead of 10×). Note that the additional space-time volume required for magic state distillation will depend only on the number of magic states produced and not on whether we use auto-corrected teleportation.
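The worked example in this paragraph can be reproduced directly. In the sketch below the logical clock time is expressed in units of $d\tau_{rd}$, so a value of 10 corresponds to $\tau_{clock} = 10\,d\tau_{rd}$; the function name `decoding_overheads` is hypothetical.

```python
def decoding_overheads(n_logical, clock_units):
    """Time versus qubit overhead for a clock time of clock_units * d * tau_rd.

    Without auto-correction every T gate takes clock_units times longer than
    the bare lattice-surgery time d * tau_rd. With auto-correction the
    response time tau = (clock_units - 1) * d * tau_rd is instead paid for
    with tau / (d * tau_rd) auxiliary logical qubits.
    """
    time_cost = clock_units
    aux_qubits = clock_units - 1
    qubit_cost = (n_logical + aux_qubits) / n_logical
    return time_cost, aux_qubits, qubit_cost


if __name__ == "__main__":
    print(decoding_overheads(n_logical=100, clock_units=10))
```

The qubit cost depends on the size of the algorithm: the same 9 auxiliary qubits would be a much larger relative overhead for a computation with only a handful of logical qubits.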

E. Extensions
Error mechanisms (e.g. Y errors in the bulk of the surface code) sometimes trigger more than a pair of defects, but reasonable heuristics can often be used to approximately decorrelate these errors to produce a graphical decoding problem. This decorrelation works well for the surface code. However, many codes cannot be decorrelated and require a non-matching decoder. Even when decorrelation approximations are possible, logical fidelities can be improved by using a non-matching decoder that accounts for this correlation information [26][27][28][29]. Extensions of parallel window decoding to non-matching inner decoders are outlined in Appendix B.
By judicious choice of window shapes and boundaries, one could consider 3D-shaped windows that divide the decoding problem in both space and time directions. Similarly, we can construct 3D-shaped windows for parallel execution with only a constant number of layers. When slicing in the time direction we only needed 2 layers of windows, but when constraining window size in $D$ dimensions a $D + 1$ layer construction is possible, with the minimum number of layers being determined by the colorability of some tiling (see Appendix A for details).

FIG. 4(b). The decoding frequency (number of rounds decoded per second) as a function of the number of decoding processes for the parallel window algorithm. The decoding frequency increases approximately linearly with the number of processes, achieving an order of magnitude faster decoding when using 16 processes. The sub-linearity most noticeable on small decoding problems is due to the parallelization overhead in the software implementation. Where the error bars are not visible, they are smaller than the marker size. Here we plot the decoding frequency $r_{dec}$, therefore the rate of syndrome processing is $r_{proc} = r_{dec}(d^2 - 1)$.
When performing computation by lattice surgery, during merge operations the code temporarily has an extended size [3,21,30,31], and windowing in the spatial direction will become necessary to prevent the window decode time $\tau_W$ from significantly increasing. One may also wish to spatially window for a single logical qubit with windows smaller than the code distance, since the decoder running time $\tau_W$ reduces with window size, and therefore the logical clock time may decrease (alternatively, the auto-correction qubit overhead may reduce). But there are subtle tradeoffs. Firstly, for windows of size $\omega < d$ in either the space or time direction, there may be adversarial failure mechanisms of weight $(\omega + 1)/2 < (d + 1)/2$ that are no longer correctly decoded. One may speculate that this reduces the effective code distance to $\omega$. However, in practice, percolation theory arguments [32] show that for a distance $d$ code, the largest error clusters are typically of size $O(\mathrm{polylog}(d))$. This leaves open the possibility that windows of size $O(\mathrm{polylog}(d)) < \omega < d$ will suffice and be of practical value for stochastic (even if not adversarial) noise, though substantial further investigation is required. We remark that this discussion assumes that measurement errors (that create vertical error chains) have a comparable probability to physical Pauli errors. If there is a large measurement error bias, then we must appropriately scale the duration of lattice surgery operations and the vertical extent of our windows.

II. CONCLUSIONS
Parallel window decoding avoids the exponential backlog growth that is unavoidable (for large enough computations) with sliding window decoders. For many leading hardware platforms, such as superconducting devices, syndrome backlog can be a severe practical obstacle, even for modest code sizes. In recent superconducting experiments a QEC round was performed every 1.1 µs by Krinner et al. [33] and every 921 ns by the Google Quantum AI team [34]. Our results are applicable to all hardware platforms, but the speed of superconducting quantum computers means these are amongst the most challenging systems for real-time decoding. Indeed, both aforementioned teams instead performed offline decoding, omitting a crucial aspect of scalable error correction.
To meet this challenge, improving the speed of decoders is currently an area of intense research. For example, LILLIPUT [35] is a recently proposed fast online sliding window decoder, implemented as an FPGA-based look-up table. For $d \leq 5$ surface codes, the authors reported that a round of syndrome data could be processed every 300 ns, fast enough even for superconducting qubits. However, the memory requirements of lookup tables scale exponentially in qubit number, making this decoder impractical for all but the smallest code sizes. The UF decoder scales favourably, and modelling of it on a dedicated microarchitecture [12] suggested it would be fast enough for distance 11 surface codes. However, the authors acknowledged "further study is necessary to confirm the validity of our model in a real device". Riverlane has recently released performance data, showing that real-time FPGA decoding should be possible on superconducting hardware with up to distance 9 codes [36]. There have been other approaches to accelerating decoders. A parallelized version of minimum weight perfect matching (MWPM) has been proposed [37] but never implemented, and its performance is unclear. Adding a predecoding stage has also been identified as a way to further accelerate decoding and potentially boost logical fidelity [7,[38][39][40][41][42], but this has not been tested in an online setting. As such, even for modest code distances such as $d = 11$, it is unclear whether conventional decoding approaches will be fast enough.
On the other hand, a parallel window decoder, as introduced here, can achieve almost arbitrarily high decoding speed given enough classical resources and some (polynomially scaling) quantum resource overheads. Therefore, this approach resolves both fundamental scalability issues and practical obstacles for hardware with rapid QEC cycle times.

III. METHODS
All simulations were performed on an AMD EPYC 7742 processor. We used the PyMatching package [10] to perform MWPM. For UF we used a custom Python implementation of the algorithm described in Ref. [11].
In all experiments, phenomenological Pauli noise with physical error rate $p$ was used, meaning that there is a probability $p$ for a data error on every qubit at each round. Further, every syndrome measurement had an error with probability $p$.
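To make the phenomenological noise model concrete, the following sketch samples defects for a repetition code rather than the surface code used in our simulations; this substitution, the function name `sample_defects` and the absence of a final noiseless readout round are all simplifying assumptions. It uses only the Python standard library.

```python
import random


def sample_defects(n_qubits, n_rounds, p, rng):
    """Sample defects of a repetition code under phenomenological noise.

    Each round, every data qubit flips with probability p and every parity
    check outcome is misreported with probability p. A defect (round, check)
    is recorded whenever a check outcome differs from the previous round.
    """
    data = [0] * n_qubits
    previous = [0] * (n_qubits - 1)
    defects = []
    for t in range(n_rounds):
        for q in range(n_qubits):
            if rng.random() < p:          # data error on qubit q
                data[q] ^= 1
        syndrome = [
            data[i] ^ data[i + 1] ^ int(rng.random() < p)  # measurement error
            for i in range(n_qubits - 1)
        ]
        defects.extend((t, i) for i in range(n_qubits - 1)
                       if syndrome[i] != previous[i])
        previous = syndrome
    return defects


if __name__ == "__main__":
    print(sample_defects(n_qubits=5, n_rounds=10, p=0.05, rng=random.Random(1)))
```

A measurement error in a single round produces a pair of defects in consecutive rounds on the same check, which is the origin of the vertical error chains discussed in the main text.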
To compute the timing for Fig. 4b and additional results in Appendix C, we perform the decoding on $8(N_{par} + 1)d$ rounds to ensure a full two cycles of parallel decoding, averaging over 5000 repetitions. We assume initialisation and readout in the Z basis, meaning that the initial and final rounds of defects are smooth. Moreover, in parallel window decoding, we take the first round to always "belong" to layer A, and the first $2d$ rounds of the first window are committed to. The last round belongs to layer B if the total number of rounds $n_{tot}$ satisfies $n_{tot} \bmod 4d \in (-d, d]$, in which case the decoding is performed normally with the last B window potentially being of reduced size. Otherwise, the last window belongs to layer A and the commit region of the last window is from the bottom of the regular commit region to the last round.

Our main argument has centered around how to perform parallel window decoding over windows defined by time intervals. However, as motivated in the main text, we may also want to parallelize with respect to spatial directions. This is required to support long range lattice surgery operations [3,21,30,31,43], and may also be desirable within single patches. Here we outline how this works, with a guiding example given in Fig. 5. First, given some space (e.g. a decoding graph or hypergraph) we divide the space up into non-overlapping commit regions. We regard each vertex in the decoding problem as having a space-time coordinate in $\mathbb{R}^D$ (with $D = 2$ for the surface code). Each edge in the decoding graph is assigned a space-time coordinate corresponding to the mid-point between the vertices it connects. For edges connecting to the boundary, we can just equate the non-boundary vertex coordinate with the edge coordinate. Then for any space-time region, we can associate a set of vertices and edges residing within this region. Assuming a topological code that has local stabilizers, there will always be a maximum distance $R$ between any pair of vertices connected by an edge.
Therefore, to find a valid ordering of layers, it suffices to solve a colouring problem.That is, we define collections of commit regions and seek to assign them colours, such that (i) no two regions of the same colour are adjacent; (ii) length scales are set so that regions of the same colour are always separated by distance R. Given such a colouring, we can map colours to decoding layers, for example red → A, green → B and blue → C. Any permutation of layers remains a valid choice.
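Condition (i) of the colouring problem is easy to check programmatically once the adjacency of commit regions is known. The sketch below assumes the adjacency is supplied as a dictionary and does not check the separation condition (ii), which depends on the geometry of the tiling; the names `is_valid_colouring` and `layers_from_colouring` are hypothetical.

```python
def is_valid_colouring(adjacent, colour):
    """Check that no two adjacent commit regions share a colour.

    adjacent: dict mapping each region to the set of regions it touches.
    colour:   dict mapping each region to its colour.
    """
    return all(colour[a] != colour[b]
               for a, neighbours in adjacent.items()
               for b in neighbours)


def layers_from_colouring(colour, order):
    """Group regions into decoding layers following the given colour order."""
    return [sorted(r for r, c in colour.items() if c == wanted)
            for wanted in order]


if __name__ == "__main__":
    # Three mutually adjacent regions, as around a vertex of a hexagonal tiling.
    adjacent = {"h0": {"h1", "h2"}, "h1": {"h0", "h2"}, "h2": {"h0", "h1"}}
    colour = {"h0": "red", "h1": "green", "h2": "blue"}
    print(is_valid_colouring(adjacent, colour))
    print(layers_from_colouring(colour, ["red", "green", "blue"]))
```

All regions within one returned layer can be decoded concurrently, while the layers themselves are processed in order.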
We can regard commit regions A and B of Fig. 3 as representing a 2-colouring of a 2D space.This is extended to 3D (and thereby the surface code decoding problem) by extruding into a 3rd dimension.Fig. 5-i shows a hexagonal 3-colouring of a 2D space, and Fig. 5-iii shows the extruded 3D version of this tiling.For a D dimensional space there exist tilings that can be coloured using D + 1 colours with each tile of bounded size, which for instance has been proved in the context of colour codes [44,45].In Fig. 5-iii, we tile a D = 3 space using only 3 colours, but the regions are unbounded size with respect to depth in the 3rd dimension.If we desire constant size tiles, then a tiling of 3D space could be achieved using 4 colours.
Our examples show the minimum number of colours.Given a limited number of processors N par , we may choose to use more colours so that for each colour there are no more than N par regions.
Next, we consider the buffer regions required to provide confidence in the corrections in the commit regions. In Fig. 3, the buffer windows are placed above and below the commit region of layer A. In higher dimensions, the buffer regions must include all possible error locations (edges) within a distance w of the commit region. However, previously committed regions must not be included in the construction of buffers. Additionally, we do not want artificial defects pushed into a previously resolved region. Therefore, where a window meets a previously committed region, the boundary must be set to smooth (no artificial defects allowed).
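The buffer and boundary rules can be sketched for windows cut along the time axis only, as in Fig. 3. This is a simplified illustration rather than the construction used in the numerics: the function name, the dictionary layout and the use of integer round indices are all assumptions, and the true start and end of the experiment are ignored.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open range of syndrome rounds [start, end)


def build_window(
    commit: Interval, w: int, committed: List[Interval]
) -> Dict[str, object]:
    """Window span and boundary types for one commit region.

    A side that touches a previously committed region gets no buffer and a
    smooth boundary; any other side gets a buffer of width w and a rough
    boundary, where artificial defects are allowed.
    """
    start, end = commit
    lower_resolved = any(prev_end == start for (_, prev_end) in committed)
    upper_resolved = any(prev_start == end for (prev_start, _) in committed)
    lo = start if lower_resolved else start - w
    hi = end if upper_resolved else end + w
    return {
        "span": (lo, hi),
        "lower": "smooth" if lower_resolved else "rough",
        "upper": "smooth" if upper_resolved else "rough",
    }
```

With no committed neighbours, as in layer A, the window is buffered on both sides. A window whose neighbours on both sides are already committed, as in layer B of Fig. 3, reduces to its commit region with smooth boundaries.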
For example, Fig. 5-ii shows buffer regions and boundaries for a hexagonal tiling. In layer A, the buffer region extends in every direction from the commit region. All the boundaries in A are rough. In layer B, the buffer extends in all directions except those already resolved in layer A. Furthermore, the layer B window boundaries are set rough except where they meet the resolved layer A commit regions (where they are instead smooth, as illustrated). The final layer C will only have smooth boundaries and no buffer regions.
space-time co-ordinate based on the mid-point of its associated vertices. Note that for non-topological codes the decoding hypergraph may not be localized in Euclidean space, though repeated syndrome extraction means that there will be a time axis such that hyperedges contain vertices that are contained within a constant range on the time axis.
For buffer regions, we follow the same recipe as in the matching case. The difference between rough and smooth boundaries needs additional care. Wherever we have a rough boundary (extremal hyperedges in a buffer region that are not adjacent to any previously corrected/committed regions), we need to allow for the possibility of creating artificial defects. This can be achieved by connecting every hyperedge on a rough boundary to the boundary vertex.
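The last step can be sketched as follows, under the assumption that each hyperedge is stored as a set of vertex labels and that the indices of the hyperedges lying on a rough boundary are already known. The `BOUNDARY` label and the function name are hypothetical.

```python
from typing import FrozenSet, Hashable, Iterable, List, Set

BOUNDARY = "boundary"  # hypothetical label for the boundary vertex


def attach_rough_boundary(
    hyperedges: List[Iterable[Hashable]], rough: Set[int]
) -> List[FrozenSet[Hashable]]:
    """Add the boundary vertex to every hyperedge on a rough boundary.

    Hyperedges that are not listed in `rough` are returned unchanged.
    """
    result = []
    for index, hyperedge in enumerate(hyperedges):
        vertices = frozenset(hyperedge)
        if index in rough:
            vertices = vertices | {BOUNDARY}
        result.append(vertices)
    return result
```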

Appendix C: Numerical validation of decoder performance
In the main text, we presented numerical results for parallel window decoding using a MWPM inner decoder.
Here we present and discuss some additional numerical results: the performance of sliding window decoders with a MWPM inner decoder; and parallel window decoding with a UF inner decoder.
In Fig. 6a we confirm that the sliding window decoding has a negligible drop in logical fidelity for $n_W = 2d$, $n_{\mathrm{com}} = d$ when compared to the global MWPM decoder. Furthermore, in Fig. 6b we measure the decoding frequency as a function of code size for square rotated planar codes. As the code size grows, the decoding frequency is expected to reduce as O(1/poly(d)) for both MWPM and UF, which is consistent with our data. Therefore, using sliding window decoding combined with any of the leading inner decoding algorithms, there will always be a code distance for which $\tau_W > n_{\mathrm{com}} \tau_{\mathrm{rd}}$. This sets a limit on the distance up to which error correction codes can scale using sliding window decoding.
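To make the scaling argument concrete, the sketch below searches for the first odd code distance at which $\tau_W > n_{\mathrm{com}} \tau_{\mathrm{rd}}$ with $n_{\mathrm{com}} = d$, reading $\tau_W$ as the time to decode one window and $\tau_{\mathrm{rd}}$ as the duration of one QEC round. The window-time model passed in is a placeholder supplied by the caller. The cubic model used in the example below is purely hypothetical and is not fitted to the data in Fig. 6b.

```python
from typing import Callable, Optional


def first_backlogged_distance(
    tau_w_ns: Callable[[int], int], tau_rd_ns: int, d_max: int = 1001
) -> Optional[int]:
    """Smallest odd d in [3, d_max] with tau_W(d) > n_com * tau_rd.

    Uses n_com = d, as in the sliding window experiments. All times are in
    nanoseconds. Returns None if no such distance exists up to d_max.
    """
    for d in range(3, d_max + 1, 2):
        n_com = d
        if tau_w_ns(d) > n_com * tau_rd_ns:
            return d
    return None
```

For instance, with a hypothetical window time of `10 * d**3` ns and a 1000 ns round, the condition reads 10 d^3 > 1000 d, that is d > 10, so `first_backlogged_distance(lambda d: 10 * d**3, 1000)` returns 11. Any window time that grows faster than linearly in d eventually crosses the threshold in the same way.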
Next, we discuss parallel window decoding when UF replaces MWPM as the inner decoder. As with MWPM, we see no significant increase of the logical error rate when using parallel window decoding (Fig. 6c), and a roughly linear increase of the decoding frequency with the number of processes $N_{\mathrm{par}}$ for large codes (Fig. 6d). However, for smaller codes the decoding problem is relatively easy and we see diminishing returns with increased parallelism, as the parallelization overheads in Python become comparable with the decoding time of individual windows.
Sending data to a worker process, starting the decoding of a window and receiving the resulting data takes a finite amount of time $\tau_0$. Therefore, if $N_{\mathrm{par}} \tau_0 > \tau_W$, the parallel processes will never all be fully utilized and the processing will be bottlenecked by these overheads. However, in a hardware decoder, we expect $\tau_0$ to be below 10 ns using modern hardware and syndrome compression techniques [12], allowing us to scale to over 100 processes. As separate processes do not need to share data, further parallelization of data communication is possible, allowing for even higher bandwidths.
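The overhead condition translates directly into a bound on the useful number of processes, roughly $\tau_W / \tau_0$. The sketch below evaluates it in integer nanoseconds. The function names are hypothetical, and the window time used in the example is an illustrative value, not a measured one.

```python
def max_useful_processes(tau_w_ns: int, tau_0_ns: int) -> int:
    """Largest N_par for which N_par * tau_0 does not exceed tau_W."""
    if tau_0_ns <= 0:
        raise ValueError("tau_0 must be positive")
    return tau_w_ns // tau_0_ns


def is_overhead_bound(n_par: int, tau_w_ns: int, tau_0_ns: int) -> bool:
    """True when N_par * tau_0 > tau_W, so workers cannot all stay busy."""
    return n_par * tau_0_ns > tau_w_ns
```

With an overhead of 10 ns and an assumed window decoding time of 2000 ns, `max_useful_processes(2000, 10)` gives 200 processes before the dispatch overhead becomes the bottleneck.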

FIG. 3. Parallel window decoding schematic for the repetition code, with an extra spatial dimension added for surface codes. The decoding proceeds in two layers. In layer A, a number of non-overlapping windows are decoded in parallel. The high-confidence corrections in the middle of each window are committed to, and the artificial defects are passed on to layer B. Windows in layer B are fully committed to, resolving all the defects between the committed regions of layer A and completing the correction.

FIG. 4. Logical error rate and decoding frequency on a rotated planar code using Minimum Weight Perfect Matching (MWPM) under phenomenological Pauli noise with 2% physical error rate. (a) Logical error rates as a function of the number of rounds of syndrome extraction for different code sizes, for both the global offline MWPM (shaded bands) and the parallel window algorithm (points). The parallel window decoder has no numerically significant drop in logical fidelity compared to the global decoder. (b) The decoding frequency (number of rounds decoded per second) as a function of the number of decoding processes for the parallel window algorithm. The decoding frequency increases approximately linearly with the number of processes, achieving an order of magnitude faster decoding when using 16 processes. The sub-linearity, most noticeable on small decoding problems, is due to the parallelization overhead in the software implementation. Where the error bars are not visible, they are smaller than the marker size. Here we plot the decoding frequency $r_{\mathrm{dec}}$; the rate of syndrome processing is therefore $r_{\mathrm{proc}} = r_{\mathrm{dec}}(d^2 - 1)$.
Appendix A: Parallel window decoding in time and space

FIG. 5. Parallel window decoding in both time and one spatial dimension, and the relationship to colourability of tessellations. (i) A 3-colour hexagonal tessellation of a 2D space, with each colour assigned a layer label A, B or C. Note that hexagons of the same colour never touch. (ii) A protocol (in 2D) based on the hexagonal tiling. The colours here match those used in Fig. 3. That is, dark orange indicates a commit region and light orange shows the buffer region. Zig-zag boundaries represent rough boundaries. Green indicates regions where all the defects have been resolved. (iii) The hexagonal pattern of (i) extruded into the 3rd dimension, so it is suitable for surface code decoding (e.g. 2D+1 decoding problems).

FIG. 6. Logical error rate and decoding frequency on a rotated planar code using a sliding window MWPM decoder, and a parallel window decoder with union-find (UF), under phenomenological Pauli noise with 2% physical error rate. (a) Logical error rates as a function of the number of rounds of syndrome extraction for different code sizes, for the global MWPM (lines) and the sliding window MWPM decoder (points). (b) The decoding frequency as a function of the code size d for square rotated planar codes using a sliding window MWPM decoder. (c) Logical error rates as a function of the number of rounds for global UF (lines) and the parallel window algorithm with a UF inner decoder (points). (d) The decoding frequency as a function of the number of decoding processes for the parallel window UF algorithm. Where the error bars are not visible, they are smaller than the marker size. Here we plot the decoding frequency $r_{\mathrm{dec}}$; the rate of syndrome processing is therefore $r_{\mathrm{proc}} = r_{\mathrm{dec}}(d^2 - 1)$.