Exploiting Degeneracy in Belief Propagation Decoding of Quantum Codes

Quantum information needs to be protected by quantum error-correcting codes due to imperfect physical devices and operations. One would like to have an efficient and high-performance decoding procedure for the class of quantum stabilizer codes. A potential candidate is Pearl's belief propagation (BP), but its performance suffers from the many short cycles inherent in a quantum stabilizer code, especially for highly-degenerate codes. A general impression exists that BP is not effective for topological codes. In this paper, we propose a decoding algorithm for quantum codes based on quaternary BP with additional memory effects (called MBP). This MBP is like a recursive neural network with inhibitions between neurons (edges with negative weights), which enhance the perception capability of a network. Moreover, MBP exploits the degeneracy of a quantum code so that the most probable error or its degenerate errors can be found with high probability. The decoding performance is significantly improved over the conventional BP for various quantum codes, including quantum bicycle, hypergraph-product, surface, and toric codes. For MBP on the surface and toric codes over depolarizing errors, we observe error thresholds of 16% and 17.5%, respectively.


INTRODUCTION
To demonstrate an interesting quantum algorithm, such as Shor's factoring algorithm [1], a quantum computer needs to implement more than 10^10 logical operations, which means that the error rate of each logical operation must be much less than 10^-10 [2]. With limited quantum devices and imperfect operations [3], [4], quantum information needs to be protected by quantum error-correcting codes to achieve fault-tolerant quantum computation [5]. If a quantum state is encoded in a stabilizer code [6], [7], the error syndrome of an error that occurred can be measured without disturbing the quantum information of the state. A quantum stabilizer code constructed from a sparse graph is favorable since it affords a two-dimensional layout or simple quantum error-correction procedures. This includes the families of surface and toric codes [8], color codes [9], bicycle codes [10], and generalized hypergraph-product (GHP) codes [11], [12].
For a general stabilizer code, the decoding problem of finding the most probable coset of degenerate errors with a given error syndrome is hard [13], [14], and an efficient decoding procedure with good performance is desired. The complexity of a decoding algorithm is usually a function of code length N. Edmonds' minimum-weight perfect matching (MWPM) [15] can be used to decode a surface or toric code [16]–[19]. The complexity of MWPM is O(N^3), and can be reduced to O(N^2) if local matching is used with minor performance loss [19]–[21]. Duclos-Cianci and Poulin proposed a renormalization group (RG) decoder, which uses a strategy analogous to the decoding of a concatenated code, to decode a toric (or surface) code with complexity proportional to N log(√N) [22]. Both MWPM and RG can be generalized for color codes (see Appendix D).
On the other hand, most sparse-graph quantum codes can be decoded by belief propagation (BP) [10], [23]–[25] or its variants.¹ BP is an iterative algorithm with complexity O(Nj) per iteration [25], [36], where j is the mean column-weight of the check matrix of a quantum code. In general, an average number of iterations τ ≈ log log N is sufficient for BP to converge [37], [38]. In practice, a maximum number of iterations T_max, proportional to τ up to a large enough constant, will be chosen. So the complexity of BP is O(Nj T_max), or more precisely O(Njτ) = O(Nj log log N).
Although BP seems to have the lowest complexity, a long-standing problem is that BP does not perform well on quantum codes with high degeneracy, unless additional complex processes are included [29]–[31]. (We say that a code has high degeneracy or is highly-degenerate if it has many stabilizers of weight lower than its minimum distance.) The Tanner graph of a stabilizer code inevitably contains many short cycles, which deteriorate the message-passing process in BP [10], [26], especially for codes with high degeneracy [30], [31], [39]. Any message-passing or neural network decoder may suffer from this issue. One may consider variants of BP with additional efforts in pre-training by neural networks [40]–[43] or post-processing [30], [31] such as ordered statistics decoding (OSD) [44], but these methods may not be practical for large codes. In this paper we will address this long-standing BP problem by devising an efficient quaternary BP decoding algorithm with additional memory effects (cf. (15)), abbreviated MBP, so that the degeneracy of quantum codes can be exploited.
1. See: BP with random perturbation (tested on a bicycle code) [26], RG-BP on toric codes [22], RG-BP on color codes [27], [28], BP-MWPM on surface/toric codes [29], BP-OSD on (generalized) bicycle, HP, and topological codes [30], [31], and BP with small set flipping (BP-SSF) on HP codes with expander graphs [32] (where SSF [33], [34] is a generalization of bit-flipping BP for classical expander codes [35]).
Many known decoders in the literature treat Pauli X and Z errors separately as binary errors, which may incur additional computation overhead or performance loss. MBP directly handles the quaternary errors.
The problem of a hard-decision decoding of a classical code is like an energy-minimization problem in a neural network [45], where an energy function measures the parity-check satisfaction (denoted by J_S). It is known that BP has been used for energy minimization in statistical physics [26], [38], [46]. Moreover, an iterative decoder based on the gradient descent optimization of the energy function has been proposed [47]. These motivate us to consider a soft-decision generalization of the energy function with variables that are log-likelihood ratios (LLRs) of Pauli errors and to make connections between BP and the gradient descent algorithm. We define an energy function with an additional term (denoted by J_D) that measures the distance between a recovery operator and the initial channel statistics. Then we show that BP in log domain is like a gradient descent optimization of this generalized energy function (cf. (7)) but with more elegant step updates. This explains why conventional BP usually works on a nondegenerate quantum code, since the energy topology is similar to the classical case (cf. Sec. 3.3).
For a highly-degenerate quantum code, there are many low-weight stabilizers corresponding to local minima in the energy topology, so conventional BP easily gets trapped in these local minima near the origin (see Fig. 2). This suggests that we should use a larger step (which can be controlled by message normalization) [25]. However, this alone is not enough, since the energy minimization process may not converge if large steps are made. An observation from Hopfield nets is that inhibitions (edges with negative weights) between neurons can enhance the perception capability of a network and improve the pattern-recognition accuracy [48]–[52]. MBP is mathematically formulated to have this inhibition functionality, which helps to resist wrong beliefs passing on the Tanner graph (due to short cycles [37]) or to effectively accelerate the search in a gradient descent optimization to avoid getting trapped. An important feature of MBP is that no additional computation is required and thus the complexity of MBP remains the same as the conventional BP with message normalization.
The performance of MBP can be further improved by choosing an appropriate step-size for each error syndrome. However, it is difficult to precisely determine the step-size. If the step-size is too large, MBP may return incorrect solutions or diverge. We propose to adaptively choose the step-size using an ε-net. This adaptive scheme will be called AMBP, and its complexity is still O(N j log log N ) since the chosen ε is independent of N .
An optional technique that can be adopted in MBP is to use fixed initialization [25], [53]. The energy function and energy topology are defined according to the channel statistics. If MBP performs well on a certain channel statistics (say, at a certain depolarizing rate ε_0), it means that MBP can correctly determine most syndrome-and-error pairs on that topology. Thus it is better to decode using this energy topology, regardless of the true channel statistics. This technique works for any quantum code.
Computer simulations of MBP (or AMBP) on various quantum codes are performed. Note that MBP naturally extends to a model of simultaneous data and measurement errors [54]; however, perfect syndrome measurements are assumed in this paper. We will first do a case study on the five-qubit code [55] to show how the memory effects help the gradient descent optimization. Then we decode quantum bicycle codes [10], a GHP code in [30], and (rotated) surface and toric codes [56], [57].
Bicycle codes have good error-correction performance and low decoding complexity, but they may have a high error-floor if there are many low-weight stabilizers [10, Fig. 6], which occurs if the generator vector of a bicycle code has low weight. This vector generates a bicycle check matrix with fixed stabilizer weight, called the row-weight. A bicycle code has its minimum distance upper bounded by its row-weight. However, the performance of BP depends more on the weight distribution of codewords [37], [38], rather than just the code distance. We simulate the cases in [10, Fig. 6] and each error-floor is significantly improved using MBP. In particular, the convergence behavior can be improved, especially when the row-weight is small (i.e., when the code is more degenerate). If AMBP is used, the achieved performance is close to the quantum Gilbert-Varshamov rate (see Fig. 12 and its discussion). Next, we consider the [[882, 48, 16]] GHP code constructed in [30], which has row-weight 8. Since the row-weight 8 < D, the code is highly-degenerate, and the energy topology makes it hard for conventional BP to converge well. In [30], a higher-order BP-OSD-ω with order ω is proposed: BP is used with post-processing by OSD, together with a post-selection over 2^ω possible errors. The considered GHP code needs BP-OSD-ω with ω = 15 so that degenerate errors can be found with high probability. OSD has subroutines of sorting, Gaussian elimination, and classical re-encoding [44]. Thus, further with the post-selection over 2^ω errors, the complexity of BP-OSD-ω is quite high. On the other hand, AMBP efficiently finds degenerate errors and can outperform BP-OSD-15 (see Fig. 13).
Finally we consider the decoding threshold on surface codes (ε_surf) or toric codes (ε_toric) as a benchmark [16]. Theoretical estimation suggests that surface or toric codes have a threshold of 18.9% over depolarizing errors [58]–[60], which agrees with the hashing bound in the limit of increasing overhead (vanishing code rate) [61]. In the following, we compare decoders with practical complexity. MWPM achieves a threshold of 15.5% [17], [19]. RG combined with BP (RG-BP) achieves ε_toric = 16.4% [22]. A decoder based on matrix product states (MPS) achieves 17% ≤ ε_surf ≤ 18.5% [62] (which has complexity O(N^2) like MWPM but is more complex in practice since MPS needs many matrix operations). Union-find (UF) has complexity almost linear in N and a threshold of 9.9% on toric codes over bit-flip errors [63] (which, after rescaling by 3/2, is 14.85% over depolarizing errors). BP-assisted MWPM (BP-MWPM) has a threshold of 17.76% with complexity O(N^2.5) [29]. AMBP roughly achieves ε_surf = 16% and ε_toric = 17.5%. The surface and toric codes have mean column-weight j ≤ 4 [8], [57]. Hence the complexity of (A)MBP is O(Njτ) = O(N log log N), since τ = O(log log N) is good enough in our simulations, which agrees with the classical expectation mentioned earlier, although we need to consider short cycles and degeneracy. The thresholds and complexities of various decoders are provided in Table I.
Table I notes: a. If only the threshold of a decoder over bit-flip errors is provided, we rescale it by a factor 3/2 as in [10, Eq. (40)]. b. It is usually considered that ε_surf ≤ ε_toric, since a toric code does not have boundary qubits. Figure 10 in [19] seems to suggest ε_toric = 15% < ε_surf = 15.5% according to the intersection points. However, the MWPM threshold on toric codes over bit-flip errors is estimated to be 0.1031 [17], which, after rescaling by 3/2, matches the ε_toric = 15.5% finally claimed in [19]. Figures 10 and 12 in [29] seem to suggest ε_toric = 17.76% < ε_surf = 17.84%, which may cause confusion, so a single threshold value 17.76% was concluded in [29].
Our simulation results show that MBP significantly improves the decoding performance of conventional BP. In particular, degeneracy is exploited so that degenerate errors may be returned by the decoder. It is known that BP can be treated as a recurrent neural network (RNN) [42], [64]. Similarly, our MBP induces an RNN with inhibition, without the pre-training process. This may provide an explanation why RNN decoders (which may contain many negative-weight edges after training) work well on degenerate codes.

This paper is organized as follows. In Sec. 2, we introduce stabilizer codes and the decoding problem. In Sec. 3, we interpret the decoding problem as energy minimization and introduce our MBP algorithm. We also show the RNN induced by MBP and link it with the inhibition technique in Hopfield nets. In Sec. 4, we provide the simulation results of MBP decoding on various quantum codes. Finally we conclude and discuss some future research topics in Sec. 5.

Stabilizer codes
We consider errors that are tensor products of the Pauli matrices I, X, Y, and Z. In particular, we will simulate independent depolarizing errors with rate ε ∈ (0, 3/4) so that each qubit independently suffers a Pauli error I, X, Y, or Z with respective probabilities 1 − ε, ε/3, ε/3, and ε/3. The weight of an N-fold Pauli operator in {I, X, Y, Z}^⊗N, regardless of the global phase, is the number of its nonidentity components. For small ε, Pauli errors of lower weight occur with higher probability and we would like to mitigate their effects. In the following, the notation of tensor product ⊗ may be omitted if no confusion arises in our discussions.

A stabilizer group S is an Abelian subgroup of {1, −1} × {I, X, Y, Z}^N such that −I^⊗N ∉ S. Assume that S has a set of N − K independent generators {S_m}_{m=1}^{N−K}. For simplicity, consider S_m ∈ {I, X, Y, Z}^N, though a generator may still have a negative phase. A binary [[N, K, D]] stabilizer code defined by S is the 2^K-dimensional subspace of C^{2^N} that is the joint (+1)-eigenspace of S [6], [7]. The parameter D is called the minimum distance of the code and will be defined below. The elements in S are called stabilizers. Two Pauli operators either commute or anticommute with each other. If a Pauli error anticommutes with certain stabilizers, measuring those stabilizers will return eigenvalues −1. Consider the measurement result ±1 to be mapped by +1 → 0 and −1 → 1. Hence the measured eigenvalues of {S_m}_{m=1}^{N−K}, after mapping, are called the error syndrome of the error, and a nonzero error syndrome indicates a detected error. Since a stabilizer has no effect on the code space, an error that is a stabilizer is harmless. Thus the minimum distance of the stabilizer code defined by S is the minimum weight of a Pauli error that is harmful but cannot be detected. Let N(S) denote the set of N-fold Pauli operators that commute with every element of S; the harmful but undetectable errors are then the elements of N(S) \ ±S, called logical errors (or logical operators). For F ∈ S and an N-fold Pauli error E, the Pauli errors EF and E are equivalent on the code space and hence EF is called a degenerate error of E. A quantum code is said to be degenerate if there are stabilizers of weight less than its minimum distance; otherwise, the code is nondegenerate. An [[N, K, D]] stabilizer code has N − K independent stabilizer generators. We judge how degenerate a code is by the percentage of its maximum number of independent stabilizer generators of weight less than D. Roughly speaking, a code is highly-degenerate or has high degeneracy if this percentage is high. For example, a surface or toric code of D ≥ 5 is highly-degenerate since it is straightforward to find a set of N − K independent generators of weight ≤ 4 [8], [57].
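To make the syndrome and degeneracy notions concrete, the following is a minimal Python sketch (not from the paper; the helper names are ours) that computes the binary syndrome of a Pauli error with respect to a list of stabilizers given as Pauli strings, and checks that an error and a degenerate error of it (the error multiplied by a stabilizer) produce the same syndrome.

```python
# Minimal sketch (our own helper names); single-qubit Paulis are the characters 'I','X','Y','Z'.

def anticommute(p, q):
    """Return 1 if single-qubit Paulis p and q anticommute, else 0."""
    return 0 if p == 'I' or q == 'I' or p == q else 1

def syndrome(error, stabilizers):
    """Binary syndrome of an N-fold Pauli error w.r.t. a list of N-fold stabilizers."""
    return [sum(anticommute(e, s) for e, s in zip(error, S)) % 2 for S in stabilizers]

def pauli_product(a, b):
    """Product of two N-fold Pauli strings, ignoring the global phase."""
    table = {frozenset('XY'): 'Z', frozenset('YZ'): 'X', frozenset('XZ'): 'Y'}
    out = []
    for p, q in zip(a, b):
        if p == 'I':
            out.append(q)
        elif q == 'I' or p == q:
            out.append('I' if p == q else p)
        else:
            out.append(table[frozenset((p, q))])
    return ''.join(out)

if __name__ == '__main__':
    stabilizers = ['XY', 'ZZ']        # a toy 2-qubit check matrix (rows are stabilizers)
    E = 'IZ'                          # an error
    F = 'ZZ'                          # a stabilizer
    EF = pauli_product(E, F)          # 'ZI', a degenerate error of E
    assert syndrome(E, stabilizers) == syndrome(EF, stabilizers) == [1, 0]
```

As the assertion shows, a degenerate error is indistinguishable from the original error at the syndrome level, which is exactly what a degeneracy-exploiting decoder can take advantage of.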
For better decoding performance of BP, M ≥ N − K stabilizers {S_m}_{m=1}^M will be measured.² We write S_m = S_m1 S_m2 · · · S_mN ∈ {I, X, Y, Z}^N for m = 1, 2, . . . , M, and define an M × N check matrix S = [S_mn] ∈ {I, X, Y, Z}^{M×N}. The row-weight of the m-th row of S is referred to as the weight of the stabilizer S_m. For an error E = E_1 E_2 · · · E_N ∈ {I, X, Y, Z}^N, define its binary syndrome vector z = (z_1, z_2, . . . , z_M) ∈ {0, 1}^M by z_m = ⟨E, S_m⟩, where the bilinear form ⟨F_1, F_2⟩ = 0 if two Pauli operators F_1 and F_2 commute, and ⟨F_1, F_2⟩ = 1 otherwise.

Definition 1. For a code of length N and minimum distance D, let r×BDD denote the bounded distance decoding (BDD) of the code so that any Pauli errors of weight no larger than t = ⌊(rD − 1)/2⌋ are correctable. We say that r×BDD has correction radius t, and we refer to its decoding error rate at depolarizing error rate ε.

2. Additional stabilizers (M > N − K) may provide stronger protection, e.g., a toric code has every qubit equally protected by four stabilizers due to additional stabilizers. In the case that the syndrome measurement operations are faulty, we may need to measure additional stabilizers to obtain reliable syndrome information [54], [65].
We use BDD to denote 1×BDD for simplicity. Usually a good decoding procedure on a classical code has performance between BDD and 2×BDD. However, the degeneracy of a quantum code is not considered in BDD; we may have decoding performance much better than 2×BDD in the quantum case. In addition, since we do not know how to estimate the exact channel fidelity for large quantum codes, r×BDD serves as a good benchmark.

BP decoding of quantum codes
Codes based on low-density parity-check (LDPC) matrices (also known as sparse-graph codes) achieve near channel-capacity performance in classical coding theory [37], [38]. The parity-check matrix of a code can be depicted as a Tanner graph, which is a bipartite graph containing variable nodes and check nodes, connected properly by edges [66], [67]. Belief propagation (BP) on the Tanner graph iteratively passes messages between the nodes so that an estimate of the marginal distribution of the error at each coordinate can be approximated [68], [69].
Consider an M × N check matrix S ∈ {I, X, Y, Z}^{M×N}. The relations between a Pauli error operator and its syndrome bits can be depicted by a Tanner graph with N variable nodes (corresponding to the N-fold Pauli error to be estimated) and M check nodes (corresponding to the given binary syndrome vector), such that an edge (m, n) of type S_mn connects variable node n and check node m whenever S_mn ≠ I, i.e., S_mn ∈ {X, Y, Z} [23], [26]. Figure 1 illustrates the Tanner graph of an example check matrix. Given a syndrome vector z ∈ {0, 1}^M, the decoding problem is to find the most probable Pauli error, or one of its degenerate errors, in {I, X, Y, Z}^N. A quaternary BP (BP_4) algorithm computes an approximated marginal distribution P̂(E_n = W | z) = q_n^W for W ∈ {I, X, Y, Z} and n = 1, 2, . . . , N in linear domain [23] and outputs Ê = (Ê_1, Ê_2, . . . , Ê_N) such that Ê_n = arg max_{W ∈ {I, X, Y, Z}} q_n^W. The computation can also be done in log domain [25] by using log-likelihood ratios (LLRs) Γ_n^W = ln(q_n^I / q_n^W) for W ∈ {X, Y, Z} and n = 1, 2, . . . , N. If the syndrome of Ê matches z, then the decoder will output Ê. Since z is binary, Ê can be efficiently estimated by passing scalar messages on the Tanner graph in linear domain [23], [24] or log domain [25].
In this paper we will discuss BP_4 in log domain with an LLR vector Γ = (Γ_1, Γ_2, . . . , Γ_N) ∈ R^{3N}, where Γ_n = (Γ_n^X, Γ_n^Y, Γ_n^Z) ∈ R^3 for n = 1, 2, . . . , N. (This will be defined in Algorithm 1, referred to as MBP_4. A conventional LLR-BP_4 is a special case of Algorithm 1 with α = 1.) There are two types of messages iteratively passed on each edge (m, n) connecting variable node n and check node m. Let M(n) denote the set of neighboring check nodes of variable node n, and N(m) denote the set of neighboring variable nodes of check node m. (In other words, N(m) is the support of S_m.) We will simplify the notation M(n) \ {m} as M(n) \ m. A variable-to-check message λ_{S_mn}(Γ_{n→m}) from variable node n to check node m carries the log-likelihood ratio of whether E_n commutes or anticommutes with S_mn, where, for W ∈ {X, Y, Z} and γ = (γ^X, γ^Y, γ^Z), λ_W(γ) = ln((1 + e^{−γ^W}) / (e^{−γ^{W′}} + e^{−γ^{W″}})) with {W′, W″} = {X, Y, Z} \ {W}. On the other hand, suppose that S_mn ≠ I; then by (3), we have a check-bit relation z_m = Σ_{n′∈N(m)} ⟨E_{n′}, S_{mn′}⟩ mod 2. Consequently a check-to-variable message ∆_{m→n} from check node m to variable node n will tell us the log-likelihood ratio of whether E_n commutes or anticommutes with S_mn. More precisely, as shown in [25], we have ∆_{m→n} = (−1)^{z_m} ⊞_{n′∈N(m)\n} λ_{S_{mn′}}(Γ_{n′→m}), where for a set of k real scalars a_1, a_2, . . . , a_k ∈ R, the operation ⊞ is defined by ⊞_{i=1}^k a_i = 2 tanh^{−1}(∏_{i=1}^k tanh(a_i/2)). Then Γ_n is updated according to ∆_{m→n} for all m ∈ M(n) and the initial distribution of E_n (cf. (17)). The BP algorithm iteratively updates the LLRs of the marginal distributions {Γ_n}_{n=1}^N according to the passed messages on the Tanner graph. If the Tanner graph has no cycles, BP will compute the exact marginal distributions [67]–[71]. If there are not many short cycles, the approximation is usually very good [67], [70], [71].
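As a concreteness check (our own illustrative code, not the paper's), the following Python snippet evaluates the message quantities just defined: the variable-to-check LLR λ_W and the ⊞ combination used for the check-to-variable message ∆_{m→n}.

```python
import math

def lam(gamma, W):
    """LLR that the error on a qubit commutes (vs. anticommutes) with Pauli W,
    given gamma = {'X': ..., 'Y': ..., 'Z': ...} with gamma[W] = ln(q_I / q_W)."""
    others = [P for P in 'XYZ' if P != W]
    return math.log((1.0 + math.exp(-gamma[W])) /
                    (math.exp(-gamma[others[0]]) + math.exp(-gamma[others[1]])))

def boxplus(values):
    """The boxplus operation: 2 * atanh(prod_i tanh(a_i / 2))."""
    prod = 1.0
    for a in values:
        prod *= math.tanh(a / 2.0)
    return 2.0 * math.atanh(prod)

def check_to_variable(z_m, incoming_lambdas):
    """Check-to-variable message Delta_{m->n}: the sign is flipped when the syndrome bit is 1."""
    return (-1) ** z_m * boxplus(incoming_lambdas)

if __name__ == '__main__':
    # Example: a check with syndrome bit 1 and two other neighboring variable nodes.
    gamma = {'X': 2.0, 'Y': 2.0, 'Z': 2.0}      # prior LLRs of a depolarizing channel
    msgs = [lam(gamma, 'X'), lam(gamma, 'Z')]   # variable-to-check messages of the other neighbors
    print(check_to_variable(1, msgs))
```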
Here we discuss a four-cycle example. Consider the 2 × 2 check matrix S whose rows are XY and ZZ, and the error E = IZ. Then z = (1, 0), and it is difficult for a parallel-scheduled BP to determine whether the solution is ZI or IZ, so BP will oscillate between II and ZZ. Poulin and Chung suggested heuristic processes such as random perturbation between BP iterations to have the opportunity to find ZI or IZ [26]. Actually, a serial-scheduled BP with proper message magnitude can easily infer a degenerate error Ê = ZI without additional processes. We refer the readers to [23]–[25] for more details about refined BP decoding algorithms for quantum codes, where the techniques of message normalization and message scheduling are adopted.

BP DECODING AS ENERGY MINIMIZATION
A classical decoding can be considered as an energy minimization problem [45]. Given the error syndrome, the energy function is defined with respect to the parity checks so that each error vector matching the syndrome can be considered as a local minimum of the energy function. An iterative decoding algorithm can be used to find a local minimum of the energy function with an initial point defined by the channel statistics [47]. In particular, a decoding algorithm based on gradient optimization was proposed in [47].
We would like to characterize the energy function minimization problem for BP decoding of quantum codes. The energy function is defined on R 3N , where each point represents the LLRs in (4). Instead of using only the energy defined by parity checks [45], [47], we introduce an additional term regarding the distance between a data point and the channel statistics. Then we explain why conventional BP fails to solve this energy function minimization problem for quantum codes with high degeneracy. To solve the problem, we introduce our MBP 4 .

Energy function of BP decoding
Assume that qubit n undergoes a Pauli error according to a distribution (p_n^I, p_n^X, p_n^Y, p_n^Z). The channel statistics vector Λ = (Λ_1, Λ_2, . . . , Λ_N) ∈ R^{3N}, with Λ_n = (Λ_n^X, Λ_n^Y, Λ_n^Z) and Λ_n^W = ln(p_n^I / p_n^W), is the channel information we have before decoding. For depolarizing errors with rate ε, we have (p_n^I, p_n^X, p_n^Y, p_n^Z) = (1 − ε, ε/3, ε/3, ε/3) and hence Λ_n^X = Λ_n^Y = Λ_n^Z = ln((1 − ε)/(ε/3)). The energy function J(Γ) in (7) is defined as a sum of two terms, J_D(Γ) and J_S(Γ), where η > 0 is a real scalar appearing in J_S. The value of η does not matter in the following discussion and its choice is postponed to Appendix C. The second term J_S(Γ) measures the satisfaction of each check S_m, similar to the case in [47, Sec. IV]. The additional term J_D(Γ), which is convex in Γ, measures the distance between a point Γ ∈ R^{3N} and the channel statistics Λ. This is critical since the initial channel statistics Λ affects the performance of BP [25], [53] and this should be reflected in the energy function. An interpretation is that a point close to a logical error in N(S) \ ±S has small J_S but large J_D.

Gradient descent optimization
Next we show that BP decoding is like a gradient descent minimization of the corresponding energy function (7). This provides a better understanding of how BP works.
One can easily verify the partial derivatives of J_D(Γ) and J_S(Γ) with respect to each Γ_n^W; see (9)–(11). Notice that the term playing the role of ∆_{m→n} in (11) and ∆_{m→n} in (5) are similar but differ in two aspects. First, they have the same sign, but ∆_{m→n} is resized by tanh^{−1}. Second, Γ_n used in (11) and Γ_{n→m} used in (5) differ by a term ⟨W, S_mn⟩ ∆_{m→n} (cf. (18)). The gradient descent method will update Γ_n^W by Γ_n^W ← Γ_n^W − ζ ∂J(Γ)/∂Γ_n^W (cf. (12)), where ζ > 0 is the descent step-size. Herein, we consider a fixed descent step-size and simply assume ζ = 1. Collecting terms, Γ_n^W is updated as in (13). Since η g_mn(Γ) > 0, the signs of the corresponding terms are unchanged. Consequently, the step update in (13) depends on the terms ∆_{m→n}, especially those with large magnitude.
BP has a similar computation, but with a more elegant step update determined by the terms ∆_{m→n} (cf. (17)), because the magnitude of ∆_{m→n} (which is resized by tanh^{−1}) provides adequate (large enough) strength for the update. Evidence for this is that Γ_n tends to have the correct marginal distribution if updated by ∆_{m→n}, as shown in [25]. We conducted pre-simulations for several codes and different (α, β), and observed that updating Γ_n (for decoding) is better done using the rule given in Sec. 3.4 (cf. (14)).

Energy topology
Before we develop our MBP_4 (in Sec. 3.4), let us first analyze the difficulties that BP suffers in decoding quantum codes with high degeneracy. One may also consider the energy function J in (7) as a sum of the convex distance term and the non-convex parity-check satisfaction term. Since the non-convex term is the more difficult part in the optimization, for simplicity, we analyze the energy topology with only the parity-check satisfaction term J_S(Γ). We illustrate J_S as a topography over Γ, for either a classical or a quantum decoding problem, in Fig. 2. Note that mathematically the function J_S has several singular points corresponding to operators that match the given syndrome. However, BP will not reach these singular points since the values of the LLRs are numerically protected in simulation.
First, consider a classical code with minimum distance d. In Fig. 2 (a), the small circle in the center denotes the all-zero vector, which will be called the origin of the topography. The other black solid circles denote certain low-weight codewords. The (blue) dashed circle is a classical Hamming ball with diameter d centered at the origin. The target error is denoted by a cross in the (blue) Hamming ball. The other crosses have the same error syndrome as the target error. We also draw a (purple) dotted circle with diameter d centered at the target error. There will be energy barriers around this circle's boundary, as shown in Fig. 2 (c), which is the energy profile along the diagonal dashed line in Fig. 2 (a). A syndrome-based BP starts from the origin and its goal is to find the target error, which corresponds to a global minimum of the energy function. In our case, the starting point lies in a local convex hull of the target error. Since BP is like a gradient descent optimization, as shown in the previous subsection, the target error can be found by BP. Sometimes, the shape of a local minimum is very narrow and BP may converge better using a smaller step-size. This is usually done by message normalization [72], [73].
Next, consider a nondegenerate quantum code, i.e., one with no stabilizer of weight smaller than its minimum distance D. The decoding topography is like Fig. 2 (a) and, indeed, BP usually works well with a small step-size (see examples in [23]–[25]).
Finally, consider a degenerate quantum code with minimum distance D much larger than the minimum stabilizer weight d. Figure 2 (b) illustrates the energy topography of this case. A set of small circles in the center denotes the low-weight stabilizers. Each group of black solid circles denotes a set of equivalent logical operators in N(S) \ ±S. A classical Hamming ball becomes very small, shown as the (blue) dashed circle with diameter d centered at the origin (I^⊗N). This is because the quantum code is degenerate and there are low-weight stabilizers closer to I^⊗N than the other operators in N(S) \ ±S. The target error and its low-weight degenerate errors are the crosses in the (purple) dotted circle with diameter D. The other groups of crosses are logical errors with the same error syndrome as the target error. Similarly, a syndrome-based BP starts from the origin. The energy profile along the diagonal dashed line in Fig. 2 (b) is shown in Fig. 2 (d). Because low-weight stabilizers are like low-weight classical codewords, there are many ripples in the shape of the topography, causing conventional BP to get trapped or wander around this region (or sometimes oscillate).
To help BP escape these local traps, one should use a large enough step-size, unlike the classical strategy, which favors a small step-size for convergence. On the other hand, due to the degeneracy of the quantum code, there are many equivalent solutions (such as the crosses in the (purple) dotted circle in Fig. 2 (b)) and it suffices to find a degenerate error of the target. Exploiting this degeneracy improves the decoding performance. Using a large step-size helps BP to approach any of the degenerate errors. However, using only larger steps in BP may deteriorate the convergence behavior. In the next subsection we will show that this issue can be mitigated by introducing a mechanism of inhibition.

BP with additional memory effects (MBP)
Motivated by the analysis of the energy topology of a degenerate quantum code and the gradient descent optimization process in the previous subsections, we propose a quaternary BP in log domain with additional memory effects as follows. To do an iterative update, consider an equation inspired by (13), with the gradient terms replaced by ∆_{m→n} in (5) and with fixed inhibition (inspired by BP), as given in (14). We notice that the term −β Σ_{m′∈M(n): S_{m′n}=W} ∆_{m′→n} comes from the gradient descent optimization, but is not present in BP when approximating the marginal distribution [25]. Furthermore, we find that in most cases, β = 0 has better performance.³ Thus, instead of using (14), we consider the update with β = 0 (cf. (15)). To compute the required quaternary distributions, we propose Algorithm 1, referred to as MBP_4. Our update rule for Γ_n in (17) parallels (13), as suggested by the gradient of the energy topology, but with β = 0 according to the previous discussion. (Algorithm 1 iterates a horizontal step, which computes the check-to-variable messages ∆_{m→n}, and a vertical step, which computes, for n ∈ {1, 2, . . . , N} and W ∈ {X, Y, Z}, the updated Γ_n^W and the outgoing variable-to-check messages, and then repeats from the horizontal step until the estimated error matches the syndrome or a maximum number of iterations is reached.)

Remark 2.
The presentation of Algorithm 1 differs from that in [25] in that the term −⟨W, S_mn⟩ ∆_{m→n}, called the inhibition, is separated out. Unlike [23], [25], where the corresponding inhibition is scaled by 1/α, we suggest keeping the inhibition strength fixed, since this part is the belief inherited from check node m and should not be altered when we update the outgoing belief of variable node n; this makes the decoding less affected by the short cycles.
How to choose the factor α is intriguing. The gradient optimization step suggests that α ∝ 1/g_mn(Γ) by (13). Since Γ is initialized as the channel statistics vector Λ at the first step and α is fixed in BP for simplicity, we plot 1/g_mn(Λ) as a function of the channel depolarizing rate ε in Fig. 3 for various stabilizer weights k = |N(m)|. The figure suggests two things. First, α should be larger as ε gets smaller. Second, for a larger k, the maximum required α seems to saturate at a larger value. The saturation suggests that the energy function becomes similar when ε gets small enough, which means that there might be an error-floor in the region of small ε. However, this occurs at a smaller ε for a larger k. These observations are consistent with our simulation results in Appendix B.1, in which we simulate as many values of α as possible for investigation.

Remark 3. A refined computation in Algorithm 1 is λ_{S_mn}(Γ_{n→m}) = λ_{S_mn}(Γ_n) − ∆_{m→n} (cf. (19)). It is more efficient to update λ_{S_mn}(Γ_{n→m}) in this way since, for each n, computing λ_{S_mn}(Γ_n) needs at most three computations of λ_{S_mn}(·) for S_mn ∈ {X, Y, Z}; on the other hand, directly computing λ_{S_mn}(Γ_{n→m}) needs |M(n)| (usually ≥ 3) computations of λ_{S_mn}(·).
The computation in the horizontal step can be simplified as in Remarks 1 and 4 of [25]. Then the MBP 4 complexity is O(N j) per iteration. This is verified in Appendix A.
For reference, we provide the MBP_4 in linear domain in Appendix E, to be compared with [23, Algorithm 3].
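The following Python sketch (ours, not the paper's Algorithm 1; function and variable names are illustrative) implements one parallel-schedule iteration along the lines described above: check-to-variable messages via ⊞, a vertical update of Γ_n with the incoming beliefs scaled by 1/α, and variable-to-check messages obtained by subtracting the fixed (unscaled) inhibition as in Remark 3.

```python
import math

PAULIS = 'XYZ'

def anticommute(p, q):
    """1 if single-qubit Paulis p and q anticommute, else 0 (I commutes with everything)."""
    return 0 if p == 'I' or q == 'I' or p == q else 1

def lam(gamma_n, w):
    """LLR that the error on this qubit commutes with Pauli w, given gamma_n[W] = ln(q_I/q_W)."""
    others = [p for p in PAULIS if p != w]
    return math.log((1.0 + math.exp(-gamma_n[w])) /
                    (math.exp(-gamma_n[others[0]]) + math.exp(-gamma_n[others[1]])))

def boxplus(values):
    """2 * atanh(prod tanh(a_i/2)), clipped for numerical protection."""
    prod = 1.0
    for a in values:
        prod *= math.tanh(a / 2.0)
    prod = max(min(prod, 1.0 - 1e-12), -1.0 + 1e-12)
    return 2.0 * math.atanh(prod)

def mbp_iteration(S, z, Lambda, lam_msgs, alpha):
    """One parallel-schedule MBP_4 iteration (a sketch, not the paper's exact Algorithm 1).

    S: list of stabilizer strings (rows of the check matrix); z: syndrome bits;
    Lambda: per-qubit prior LLR dicts; lam_msgs[(m, n)]: current variable-to-check messages;
    alpha: normalization parameter (alpha = 1 mimics the conventional BP_4 scaling)."""
    M, N = len(S), len(S[0])
    nbrs = {m: [n for n in range(N) if S[m][n] != 'I'] for m in range(M)}
    # Horizontal step: check-to-variable messages.
    delta = {(m, n): (-1) ** z[m] * boxplus([lam_msgs[(m, k)] for k in nbrs[m] if k != n])
             for m in range(M) for n in nbrs[m]}
    # Vertical step: update Gamma_n with the 1/alpha scaling on the incoming beliefs ...
    Gamma = [{w: Lambda[n][w] + (1.0 / alpha) * sum(
                  delta[(m, n)] for m in range(M)
                  if S[m][n] != 'I' and anticommute(w, S[m][n]))
              for w in PAULIS} for n in range(N)]
    # ... and outgoing variable-to-check messages with the inhibition kept at full strength
    # (lambda(Gamma_{n->m}) = lambda(Gamma_n) - Delta_{m->n}, cf. Remark 3).
    new_msgs = {(m, n): lam(Gamma[n], S[m][n]) - delta[(m, n)]
                for m in range(M) for n in nbrs[m]}
    # Hard decision: I if all LLRs are positive, otherwise the Pauli with the smallest LLR.
    E_hat = ''.join('I' if min(Gamma[n].values()) > 0 else min(Gamma[n], key=Gamma[n].get)
                    for n in range(N))
    return Gamma, new_msgs, E_hat
```

Initially one would set lam_msgs[(m, n)] = lam(Lambda[n], S[m][n]) and iterate until the syndrome of E_hat matches z or T_max iterations are reached; with a serial schedule, the messages would be updated sequentially rather than all at once.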

MBP decoding as an RNN
It was known that a BP decoding process can be modeled as an RNN [42], [64]. Similarly, we can derive an RNN from Algorithm 1. The RNN usually represents the BP with a parallel schedule. In the Tanner graph induced by the M × N check matrix S, two types of messages are iteratively updated: variable-to-check and check-to-variable messages. Hence, there will be two hidden neuron layers computing the messages ∆_{m→n} or λ_{S_mn}(Γ_{n→m}) per edge, alternating between layers, and there are N input neurons {Λ_n}_{n=1}^N and N output neurons {Γ_n}_{n=1}^N. (z_m can be considered embedded in ∆_{m→n} or separated as an additional input neuron.) An estimated error Ê will be inferred from {Γ_n}_{n=1}^N. The RNN will iterate until a valid Ê matching the syndrome z is found or a maximum number of iterations T_max is reached. Figure 4 (a) illustrates the RNN derived from the BP on the Tanner graph in Fig. 1. A neuron denoted by Γ_{n→m} computes λ_{S_mn}(Γ_{n→m}), although the symbol λ_{S_mn} is not explicitly shown. Γ_n is updated also by Λ_n, although this is not shown in the figure. Note that there are additional edges (dotted curves) from ∆_{m→n} to Γ_{n→m}, which are not considered in the previous BP methods [10], [23]–[26], [30]–[32], nor in the BP-based neural networks [42], [64]. These (dotted) edges appear due to the fixed inhibition and provide additional memory effects, which contribute to the major improvement. This agrees with the known result that using proper inhibition between neurons enhances a network's perception capability in Hopfield nets [48]–[52]. Figure 4 (b) shows another (equivalent) way of updating Γ_{n→m} by (18). According to the BP update rules, every solid edge in Fig. 4 (a) is an excitation (with positive weight); in Fig. 4 (b), every dashed line is an excitation, but a solid line is an inhibition (with negative weight −1).
Fig. 4 caption: (a) The RNN derived from the Tanner graph in Fig. 1, with the first two iterations unrolled according to (15) and (16). Since there are five edges in Fig. 1, the RNN has five neurons per hidden layer. (b) Another (equivalent) way to update Γ_{n→m} between the two iterations in (a), according to (18) (or (19)). In both subfigures, small circles are input/output neurons, and large circles are hidden neurons. A solid line is an edge connecting hidden neurons. A dashed line is an edge connecting a hidden neuron and an input/output neuron. A dotted line in (a) is a special edge with weight −(1 − 1/α) so that ∆_{m→n} is rescaled and added to Γ_{n→m}^W when ⟨W, S_mn⟩ = 1, as in (15). Such an edge is implicitly embedded in (b) by (18).
To sum up, we have shown that MBP 4 can be extended to a neural network decoder, which may be improved by using appropriate weight for each edge.
Adaptive MBP (AMBP)

In this subsection, we propose a variation of MBP with α chosen adaptively, as shown in Algorithm 2. The value of α controls the search radius of MBP_4. Typically, a fixed α is chosen so that BP focuses on an error correction region, e.g., 1×BDD to 2×BDD. For codes with high degeneracy, we intend to correct errors of higher weight and consequently we need to consider variations in α.
Generating a solution by referring to multiple decoding instances is an important technique in Monte Carlo sampling methods (cf. parallel tempering in [58]) and neural networks (cf. [64, Fig. 4]). This is like an ε-net. Thus we conduct multiple instances of MBP_4, each with a different value of α, and choose the solution from the instance with the largest α that yields a valid solution. This largest α is the most conservative α with a valid solution and will be denoted by α*. This adaptive scheme (Algorithm 2) is referred to as AMBP_4.
Note that the procedure in Algorithm 2 tests each value of α i in a sequential manner; these α i 's can be tested in parallel if the physical resources for implementation are available, followed by a final check to determine α * .
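Building on the mbp_iteration sketch above (again our illustrative code, not the paper's Algorithm 2), the adaptive wrapper can be sketched as follows.

```python
def mbp_decode(S, z, Lambda, alpha, t_max):
    """Run MBP_4 (using mbp_iteration, lam, anticommute from the earlier sketch)
    until the estimated error matches the syndrome or t_max iterations elapse."""
    lam_msgs = {(m, n): lam(Lambda[n], S[m][n])
                for m in range(len(S)) for n in range(len(S[0])) if S[m][n] != 'I'}
    e_hat = 'I' * len(S[0])
    for _ in range(t_max):
        _, lam_msgs, e_hat = mbp_iteration(S, z, Lambda, lam_msgs, alpha)
        if [sum(anticommute(e, s) for e, s in zip(e_hat, row)) % 2 for row in S] == list(z):
            return True, e_hat
    return False, e_hat

def ambp_decode(S, z, Lambda, alphas, t_max):
    """AMBP_4 sketch: try each alpha from the largest (most conservative) down and
    return the first (i.e., largest) alpha whose MBP_4 instance outputs a valid error."""
    for alpha in sorted(alphas, reverse=True):
        ok, e_hat = mbp_decode(S, z, Lambda, alpha, t_max)
        if ok:
            return alpha, e_hat          # alpha* and the estimated error
    return None, None                    # all tested instances failed
```

As noted above, the different α values could also be tested in parallel, followed by a final selection of α*.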

SIMULATION RESULTS
In this section, we first study MBP_4 on the well-known [[5, 1, 3]] code [55]. Then we simulate the decoding performance of (A)MBP_4 with various values of α on quantum bicycle codes, a GHP code, and surface codes. (The results for toric codes are shown in Appendix B.3.)
When evaluating the logical error rate or other error rates, an error bar between two crosses shows a 95% confidence interval if fewer than 100 error events are collected.
In the following discussions, we will discard the global phase; or equivalently, when we refer to S, it includes ±S.
In simulations, a decoding is successful if it outputs an Ê ∈ ES. Thus it is important to determine whether Ê is a degenerate error of the target error E. For convenience, some literature may consider a decoding to be successful only when Ê = E (such as in [10], [74]); however, this is not accurate, especially for highly-degenerate codes. A simple observation is that Ê ∈ ES if and only if ÊE ∈ S.

Lemma 4.
Suppose that {S_m}_{m=1}^{N−K} together with the logical operators {X_1, Z_1, X_2, Z_2, . . . , X_K, Z_K} form a set of independent generators for N(S). Then Ê ∈ ES if and only if ÊE commutes with all elements in {S_m}_{m=1}^{N−K} ∪ {X_j, Z_j}_{j=1}^K.
An efficient method to find {X_j, Z_j}_{j=1}^K is by using the standard form discussed in [6], [75].
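A sketch of this success check in Python (our illustrative code; the stabilizer generators and logical operators are assumed to be given as Pauli strings, e.g., obtained via the standard form):

```python
def pauli_inner(P, Q):
    """Binary symplectic inner product: 0 if the N-fold Pauli strings commute, 1 otherwise."""
    return sum(0 if p == 'I' or q == 'I' or p == q else 1 for p, q in zip(P, Q)) % 2

def decoding_success(E, E_hat, stabilizer_gens, logical_ops):
    """Lemma 4: E_hat lies in the coset E*S iff E_hat*E commutes with every stabilizer
    generator and every logical operator; since the inner product is additive over a
    Pauli product, this holds iff E and E_hat have identical inner products with all of them."""
    return all(pauli_inner(E, G) == pauli_inner(E_hat, G)
               for G in stabilizer_gens + logical_ops)
```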
Next we discuss how degeneracy is exploited. Let n_tot be the number of tested error samples for a data point in simulations. Suppose that E^(i) and Ê^(i) are the tested and estimated errors, respectively, for i = 1, 2, . . . , n_tot. Let n_0 be the number of pairs (E^(i), Ê^(i)) with Ê^(i) ≠ E^(i), n_e the number of pairs with Ê^(i) ∉ E^(i)S, and n_u the number of pairs with Ê^(i)E^(i) ∈ N(S) \ S. Empirically, we have the classical block error rate P(Ê ≠ E) = n_0/n_tot, the quantum logical error rate P(Ê ∉ ES) = n_e/n_tot, and the undetected error rate P(ÊE ∈ N(S) \ S) = n_u/n_tot.
Since the event (Ê ∉ ES) is contained in (Ê ≠ E), by the Bayes rule we have, as in (23), P(Ê ∉ ES) = P(Ê ∉ ES | Ê ≠ E) · P(Ê ≠ E), which is empirically (n_e/n_0) · (n_0/n_tot). A classical strategy for improvement is to try to lower n_0/n_tot, which means that a target error needs to be accurately located for a given syndrome. Such a strategy has a limit in performance due to short cycles or strong code degeneracy. A better strategy should also try to lower n_e/n_0, which will be called the error suppression ratio, by exploiting degeneracy. For example, if the decoder converges to any of the degenerate errors, the decoding is a success (i.e., n_e does not increase although n_0 may increase by one).
Recall that Algorithm 1 with α = 1 is equivalent to the conventional quaternary BP (conventional BP 4 , or in short, BP 4 ). We will demonstrate that MBP 4 outperforms BP 4 on various kinds of quantum codes.
For comparison, we will also consider BP_4 with typical message normalization (denoted by α_c) as in [25], which is like the classical message normalization [72], [73]. Given some α_c > 0, consider that (18) is replaced by (24), in which the inhibition is scaled by 1/α_c as well (so that Γ_{n→m} is formed only from the extrinsic messages, each normalized by 1/α_c); this decoding will be referred to as normalized BP_4. Notice how this is different from the rule (15) for MBP_4. BP can run with the parallel or serial schedule. (For the conversion of the schedules, see [23].) Algorithm 1 using each of these schedules will be referred to as parallel MBP_4 or serial MBP_4. Similarly, we have parallel BP_4 or serial BP_4 for conventional BP, and parallel AMBP_4 or serial AMBP_4 for Algorithm 2. We may also consider parallel normalized BP_4 or serial normalized BP_4. If a BP algorithm is referred to without the prefix parallel or serial, the parallel schedule is assumed.

The [[5,1,3]] code
The [[5, 1, 3]] code [55] has a check matrix whose four rows are the cyclic shifts XZZXI, IXZZX, XIXZZ, and ZXIXZ. This code is worth investigation because there are many four-cycles in a small check matrix. We remark that using the serial schedule enables conventional BP_4 to decode the [[5, 1, 3]] code [23]. Herein we use the parallel schedule to investigate the effect of α, as well as β, in (14).

Definition 5.
To understand the effect of β in (14), we consider an extended Algorithm 1 with an additional term −β Σ_{m∈M(n): S_mn=W} ∆_{m→n} at the end of (17).
We simulate Algorithm 1 (i.e., β = 0) with various α to decode the [[5, 1, 3]] code. The performance curves are plotted in Fig. 6, including the BDD performance. As can be seen, α ≈ 1.5 has the best performance, comparable with BDD at small ε. Next we simulate the extended Algorithm 1 with α = 1.5 and various β, and the results are plotted in Fig. 7. Observe that large ε needs large β and small ε needs small β, matching the observation in Fig. 5.
The [[5, 1, 3]] code can correct any single-qubit error, but conventional BP_4 (the case with α = 1) fails to decode the error IIIYI, no matter what value β takes.
In the following, we simply consider β = 0. We plot Γ_n at each iteration in decoding IIIYI at ε = 0.003 in Fig. 8, where Γ_n = Λ_n at iteration 0. With α = 1, the state of Γ_n oscillates between IIIII and YYYYY continuously. If α = 1.5 is used instead, MBP_4 has a resistance to wrong beliefs, and the decoding converges correctly.
We plot the change of J_S in Fig. 9. Using α = 1.5, MBP_4 enlarges the swing and lowers J_S, which finally reaches a small value < 0. This is because wrong beliefs are resisted and there are additional memory effects between iterations. We also consider normalized BP_4 with α_c = 1.5 as in (24). In this case, the swing is large but there are no memory effects, so J_S continuously oscillates between 8 and 19 and cannot reach J_S < 0, as shown in Fig. 9 (c). Next we consider BP decoding with a fixed ε_0 = 0.003 to initialize Λ_n, regardless of the actual depolarizing rate. We find that the performance curve of MBP_4 with α = 1.5 matches the BDD curve for small ε. Even more, it matches the Reimpell-Werner bound for the [[5, 1, 3]] code [76], which outperforms the BDD at large ε due to degeneracy. The curves of BDD, the Reimpell-Werner bound, and MBP_4 with α = 1.5 are compared in Fig. 10.

Remark 6.
Initializing {Λ_n} with respect to a fixed ε_0, instead of the actual depolarizing rate, is a strategy to maintain the decoding stability without curve fluctuation [53].⁴ This also reflects the importance of choosing the network initial state [77]. This technique of fixed initialization works for any quantum code and is useful when the channel parameter is hard to estimate.
4. Sometimes we encounter curve fluctuation, as in Figs. 6 or 7. Using a fixed ε_0 for initialization can improve this. This can be proved by the theory in [53]. Put more simply: if BP has r×BDD performance at some physical error rate ε_0, then most errors within the corresponding correction radius would be correctable by BP. Using an initialization with fixed ε_0 enables BP to have the performance curve interpolated or extrapolated without fluctuation for different ε.

Bicycle code
Bicycle codes are a kind of sparse-graph quantum codes with flexible length, code rate, and row-weight [10]. Let k be the row-weight of a bicycle code. A smaller k means the code has many low-weight stabilizers and the code is more degenerate. However, since the minimum distance D ≤ k due to the construction, if k is too small, the code may have a high error-floor.
MacKay et al. showed that, for a bicycle code with parameters [[N, K]] = [[3786, 946]], a row-weight k ≥ 24 is required to have good binary BP (BP_2) performance at a target block error rate 10^−4 (see [10, Fig. 6]). We construct bicycle codes with the same parameters. Figure 11 (a) shows the performance of conventional BP_4 on these codes; consistent with [10, Fig. 6], k = 24 is required to achieve the logical error rate of 10^−4 before hitting the error-floor. Also shown in Fig. 11 (a) are the BP_2 performance curves from [10]. It can be seen that BP_4 performs better than BP_2, because the correlations between X errors and Z errors are considered in BP_4. (The same phenomenon is discussed in [10, Fig. 11].) Now we use MBP_4 with α > 1, and the performance is significantly improved, as shown in Fig. 11 (b). (Note that different ε need different α, as shown in Appendix B.1.) It shows that a bicycle code with row-weight k = 16 is able to achieve the logical error rate of 10^−6, which is less affected by the error-floor. Thus we have greatly improved the BP performance on bicycle codes.
The minimum distance D of a bicycle code is unknown, so it is hard to compare with r×BDD as in Definition 1. Instead, we directly specify the correction radius t and plot some BDD curves in Fig. 11 (b). For k = 16, depending on the logical error rate, the performance of MBP_4 is close to BDD with t between 140 and 200, despite the fact that D ≤ k = 16 is far smaller. If t = 170 is considered, since D/2 ≤ 8, it is better than 20×BDD.
The average numbers of iterations are shown in Fig. 11 (c). Fewer iterations are required if the decoder has better convergence behavior. After applying α > 1, the case of k = 12 shows an obvious improvement, since there are more low-weight stabilizers, i.e., there are more low-weight degenerate errors for MBP_4 to converge to. On the other hand, the cases of k > 12 may require slightly more iterations for MBP_4 to improve over BP_4.
Although it is not shown, normalized BP_4 (cf. (24)) also improves over BP_4, but MBP_4 outperforms normalized BP_4, especially for small k. Normalized BP_4 can compete with the neural BP [42] on [[256, 32]] bicycle codes, as shown in [25]. It would be difficult to train a neural BP on large code sizes such as [[3786, 946]] here.
Next we study whether MBP_4 has a good error suppression ratio n_e/n_0 (cf. (23)). This ratio reduces if the decoder often finds degenerate errors when the actual error is not located. (M)BP_4 can have n_e/n_0 < 1 for k = 16 and 12 in the region of ε ≤ 0.049. Detailed event counts n_0, n_e, n_u, and n_tot in this region are provided in Table II. Observe that a decoder (especially MBP_4) exploits the degeneracy more for a code with smaller k (stronger degeneracy). When ε gets smaller, the ratio n_e/n_0 reduces significantly for MBP_4 on both codes, especially for k = 12, although the minimum distance of this case is too small to have a low error-floor. We remark that conventional BP_4 has n_e/n_0 ≈ 1 for k ≥ 16. Also listed in Table II are the numbers of undetected errors, which are nonzero for k = 12. However, the ratio n_u/n_tot tends to be small.
To further improve the decoding performance, we apply AMBP 4 with α * ∈ {2.4, 2.39, . . . , 0.5}. Herein we consider the serial schedule because it accelerates the message update and enlarges the error-correction radius in finite iterations. The performance curves in Fig. 11 (b) are significantly improved as shown in Fig. 12.
For quantum communication, we may consider a target logical error rate 10^−4 [10], and quantum retransmission is possible if necessary [78]. Consider that ε ≈ t/N for large N. The quantum Gilbert-Varshamov rate [6], [79] states that there exist codes of rate 1/4 that achieve arbitrarily low logical error rate at ε = 0.063 for asymptotically large N. As shown in Fig. 12, the [[3786, 946]] bicycle code with k = 16 achieves logical error rate 10^−4 at ε = 0.057, which is close to the quantum Gilbert-Varshamov rate.

Generalized hypergraph-product code
Herein we consider GHP codes [12], in particular, the [[882, 48, 16]] GHP code constructed in [30], which has row-weight 8 < D and is thus highly-degenerate. The performance of this code is shown in Fig. 13. Using conventional BP_4 does not give good enough performance. Using MBP_4, we find that most errors can be decoded with α ≈ 1.2 to 1.5. However, since this code exhibits high degeneracy, a smaller α (larger step-size) may be needed. Thus we apply AMBP_4 with α* ∈ {1.5, 1.49, . . . , 0.5}.
We use r×BDD for reference. Observe that AMBP 4 has slope roughly aligned with 1×BDD or 2×BDD, but its performance is close to 8×BDD at logical error rate 10 −6 , since more low-weight errors are corrected. We also draw the curve of classical block error rate P (Ê = E) = n 0 /n tot , which shows that the improvement of AMBP 4 from BP 4 mostly comes from exploiting degeneracy.
For reference, we also plot the performance curves given in [30], where the improved one is based on a layered-scheduled BP with post-processing by OSD-ω, i.e., OSD combined with a post-selection with a parameter ω to sort out 2^ω errors in ω unreliable coordinates (so the overall complexity is high). For most codes in [44], using ω = 0 is sufficient, but this GHP code needs a large ω = 15 to achieve better performance. As can be seen, AMBP_4 can outperform BP-OSD-15 on this code. The complexity of AMBP_4 is low enough that we can simulate down to lower logical error rates.

Surface code
In this subsection we consider the surface codes with a 45° rotation [57]. A parallel study of rotated toric codes is provided in Appendix B.3.
An [[L^2, 1, L]] surface code can be defined on an L × L square lattice for L ≥ 3. Figure 14 (a) shows an example, and Fig. 14 (b) shows a toric code lattice for comparison (see Appendix B.3).
Fig. 14 caption: In both subfigures, a qubit is represented by a yellow box numbered from 1 to N. Since the toric code is defined on a torus, there are orange boxes on the right and bottom in (b), each representing the qubit of the same number. An X- or Z-type stabilizer is indicated by a label W ∈ {X, Z} between its neighboring qubits. For example, in (a), the label X between qubits 1, 2 denotes X_1X_2 and the label Z between qubits 1, 2, 6, 7 denotes Z_1Z_2Z_6Z_7.
The performances of (M)BP_4 on surface codes are shown in Fig. 15. It can be seen that conventional BP_4 does not work on surface codes; even worse, the logical error rate is higher for larger distance. We remark that normalized BP_4 (cf. (24)) is not effective either. On the other hand, using serial MBP_4 with α < 1 significantly improves the performance.
Several BDD performance curves for L = 17 are also provided in Fig. 15. Gallager expected BP to have performance around 1×BDD to 2×BDD [37]. Serial MBP_4 with α = 0.65 achieves this. However, this is not enough for surface codes. In Fig. 16, we consider r×BDD performances for [[L^2, 1, L]] codes at r = 1, 2, 3. The depolarizing rate at which the curves of the same r intersect is called the "r×BDD error threshold" for the [[L^2, 1, L]] codes. The theoretical error threshold for a two-dimensional code is 18.9% [58]–[60], and thus we would need a decoder with correction radius up to 0.189N, which needs r×BDD with non-fixed r scaling as 0.189√N × 2. In other words, as L increases, an optimum decoder needs to correct most errors within a radius roughly equal to 0.189N, even though the minimum distance only scales with √N. The (0.189√N × 2)×BDD curves are also drawn in Fig. 16 (bold lines) and their intersection suggests an error threshold of roughly 18.9%.
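As a short check of this scaling, using Definition 1 with D = L = √N and a target correction radius t ≈ 0.189N:

```latex
t \;\approx\; \frac{rD}{2} \;=\; \frac{rL}{2} \;\stackrel{!}{=}\; 0.189\,N
\quad\Longrightarrow\quad
r \;\approx\; \frac{2\times 0.189\,N}{L} \;=\; 0.189\sqrt{N}\times 2 .
```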
We observed that the performance of MBP 4 on surface codes saturates for large L, i.e., the slope of the performance curve does not increase when L gets larger. (A similar phenomenon is observed in the neural BP [42, Fig. 4(b)].) Using AMBP 4 with α * ∈ {1.0, 0.99, . . . , 0.5}, we have much improved performance for L = 17 as shown in Fig. 15, at the cost of higher computation complexity.
Note that a maximum number of iterations T max = 150 is allowed in the simulations so that serial MBP 4 with a fixed α can perform better. If proper α * can be selected, the value of T max can be reduced a lot (e.g., to 30) without significant performance loss. In fact, sometimes it is possible to use a smaller value of α instead of the α * above, and the convergence can be even faster and remain correct. We discuss this possibility in Appendix B.2 (see (25)).
Next we explain how MBP_4 improves over BP_4. Again consider serial MBP_4 with α = 0.65. We examine the types of corrected errors as in (23). The ratio n_e/n_0 becomes smaller if the decoder can exploit the degeneracy, which is indeed the case for MBP_4, as shown in Fig. 17 (a). Although it is not shown, BP_4 has (bad) n_e/n_0 ≈ 1, but (good) undetected error rate ≈ 0 for L > 7. The improvement of serial MBP_4 over BP_4 comes at the cost of a non-negligible undetected error rate, as shown in Fig. 17 (b). (A similar phenomenon is also observed in the neural BP [42, Fig. 2(d)].) Undetected error events occur because we use a larger step-size, so BP may jump far to a logical error whose syndrome falsely matches. However, this is not a random search, or otherwise the ratio n_e/n_0 would be as large as 1 − 1/2^{2K} = 3/4. Also observe that, in Fig. 17 (b), the undetected error rate becomes smaller as L increases.
The average numbers of iterations are shown in Fig. 17 (c). Serial MBP_4 spends much fewer iterations (which means much lower complexity), while BP_4 spends many more. It means that MBP_4 improves over BP_4 by exhibiting better convergence behavior, rather than relying on increased complexity from performing more iterations. Recall that better performance does not necessarily imply better convergence in the average number of iterations (Fig. 11 (c)). Figure 18 is provided for observing the decoding threshold of AMBP_4 on surface codes, which is roughly 16%.
The analysis of (A)MBP 4 on toric codes is similar and given in Appendix B.3. A threshold of roughly 17.5% is observed for AMBP 4 on toric codes (Fig. 24).

CONCLUSION AND FUTURE RESEARCH
We analyzed the energy topology of BP for decoding quantum codes, and proposed (A)MBP_4, which efficiently exploits the degeneracy and significantly improves the performance over conventional BP, especially for highly-degenerate codes. (A)MBP_4 has complexity almost linear in the code length. To further improve (A)MBP_4, one may try to analyze the behavior of those uncorrected errors.
Considering BP as an RNN, our proposed scheme has adjustable edge weights (gradient descent step-size) scaled by 1/α and fixed inhibition strength. The network introduces proper memory effects, allowing BP to resist wrong beliefs or accelerate the search.
By using AMBP 4 , the achieved thresholds are roughly 16% and 17.5% on surface and toric codes, respectively, at the cost of complexity to choose α * .
A method to choose α * at a lower cost would be desired. To determine α * , the information of the error syndrome or channel statistics could be useful. For example, a syndrome vector of high weight may correspond to an error of high weight and a large step-size may be needed.
We note that α may be estimated by density evolution (DE) [72], [73], [80], [81] or extrinsic information transfer (EXIT) charts [82]- [84]. Some other methods are mentioned in [82]. However, cycles in the Tanner graph may affect the accuracy of these methods [73], [84]. With cycles, [73] suggested to consider a method in [85], which estimates the initial step-size and emphasizes its importance. What we did in Sec. 3.2 (and Fig. 3) also serves for this purpose.
One may introduce parameters α mn,i and β mn,i for each edge (m, n) at each iteration i, but this would require us to develop a strategy of choosing these parameters. In Algorithm 1, there are more useful edges (e.g., the dotted lines in Fig. 4 (a)) not considered in BP-based neural networks [42], [64]. With these added edges, an MBP-based neural network may be considered. The energy function can be defined as in (7) or any appropriate one. For training the edge weights, their initial values are important [77]. Our simulation results suggest that (α mn,i , β mn,i ) can be initialized as (α, 0), with possible disturbance if needed.
The values of (α_mn,i, β_mn,i) have some derivable trend as in (9)–(13). For example, in Algorithm 1, we may replace 1/α by g_mn derived for each (m, n) at each iteration by (10), up to a constant such that g_mn = 1 at the first iteration; then the decoder performance improves (see the arXiv version of [86]).
We remark that surface (or toric) codes can be decoded by MBP_4 with good parallelism. For example, consider the [[16, 2, 4]] toric code in Fig. 14 (b). The qubits can be divided into four groups {1, 2, 5, 6}, {3, 4, 7, 8}, {9, 10, 13, 14}, and {11, 12, 15, 16}. Then the qubits in each group can be decoded in a serial order, while all groups can be run simultaneously. Keeping a group size of four, we have N/4 groups, which, if run with parallelism, have O(N/(N/4)) = O(1) decoding time. We simulate this decoding order for surface codes and a threshold of roughly 15.5% is observed.
Our scalar-based approach could be extended to the case of fault-tolerant circuits. We have an initial study in the data and syndrome error model [54]. Extending this to the fully fault-tolerant model is our ongoing work.

APPENDIX A

We verify that the runtime of MBP_4 is O(Nj) per iteration. First, consider toric codes, which have fixed column-weight j = 4. We test serial MBP_4 with α = 0.75 at depolarizing rate ε = 0.32 on one core (4.9 GHz) of an Intel i9-9900K machine. The average runtime per iteration is shown in Fig. 19, which is clearly linear in N. Then we consider surface codes, which have mean column-weight slightly smaller than 4. As expected, the average runtime per iteration is again linear in N and the slope is smaller than that for the toric codes, as shown in Fig. 19.
B.1 Bicycle codes

The results match the expectations (Fig. 3): a smaller ε needs a larger α, and a larger k needs a larger α, before the performance saturates at a smaller ε.
A code of larger row-weight k has a larger mean column-weight kM/N. Since bicycle codes tend to be nondegenerate, we recall some classical expectations [37], [38]: a code of smaller column-weight has its curve rolling off earlier (at larger ε); a code of larger row-weight has a lower error floor. Our results match these expectations.

B.2 Surface codes

The first example is an error of weight 7 as in Fig. 21 (a). The error X at qubit 4 anticommutes with the stabilizer Z3Z4Z10Z11, so the corresponding syndrome bit is 1. Conventional BP 4 (parallel BP 4) cannot decide whether an error X is at qubit 3 or qubit 4. To break this symmetry, possible methods include post-processing (random perturbation [26], OSD [30], [31], etc.) or pre-processing (like training [40]–[43]). On the other hand, serial MBP 4 with α = 0.65 quickly decides on a degenerate error X at qubit 3, without additional processes.

[Fig. 21: (a) the physical error pattern; (b) parallel BP 4 (failure); (c) serial MBP 4 with α = 0.65 (success). Table III shows the estimated error for Fig. 21 (a) at each iteration (iter.) using the decoders in Fig. 21 (b) and (c), as well as various decoder configurations in between.]

Now we consider the complete error in Fig. 21 (a) on the surface code. The update of the estimated error at each iteration is shown in Table III for various combinations of BP decoding. Conventional BP 4 (no matter parallel or serial) gets trapped at the same small-weight error in Fig. 21 (b). Normalized BP 4 with α = 0.65 (no matter parallel or serial) diverges. Parallel MBP 4 with α = 0.65 can approach the solution but oscillates. Serial MBP 4 with α = 0.65 effectively converges to a degenerate error, shown in Fig. 21 (c).
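The symmetry between qubits 3 and 4 can be checked directly in the binary symplectic representation, where a Pauli on n qubits is a vector (x | z) over GF(2) and two Paulis anticommute exactly when x1 · z2 + z1 · x2 = 1 (mod 2). The sketch below uses our own helper functions; only the qubit indices come from the example.

import numpy as np

n = 11   # only qubits 1..11 enter this particular check

def pauli(xs=(), zs=()):
    # Binary symplectic vector (x | z) with 1-based qubit indices.
    v = np.zeros(2 * n, dtype=np.uint8)
    for q in xs:
        v[q - 1] = 1          # X part
    for q in zs:
        v[n + q - 1] = 1      # Z part
    return v

def anticommute(p1, p2):
    x1, z1 = p1[:n], p1[n:]
    x2, z2 = p2[:n], p2[n:]
    return int((x1 @ z2 + z1 @ x2) % 2)

stab = pauli(zs=(3, 4, 10, 11))              # stabilizer Z3Z4Z10Z11
print(anticommute(pauli(xs=(4,)), stab))     # X at qubit 4: syndrome bit 1
print(anticommute(pauli(xs=(3,)), stab))     # X at qubit 3: same syndrome bit 1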
The reason that BP 4 easily gets trapped around a small-weight error is explained in Sec. 3.3. Serial MBP 4 with α = 0.65 performs a fast asymmetric update with a large step-size and a fixed inhibition (which provides memory effects for convergence, cf. Fig. 9 (b)). As shown in Table III, it performs an aggressive search without jumping wrongly, and converges to a degenerate error, Fig. 21 (c), equivalent to the actual error in Fig. 21 (a) up to three stabilizers X3X4, Z15Z16Z22Z23, and X32X33X39X40.
In the second example, two more X errors at qubits 6 and 7 are added. The overall error is now as in Fig. 21 (d) and has weight 9 > D = 7. Similar to the previous example, conventional BP 4 gets trapped around the small-weight error in Fig. 21 (e). Serial MBP 4 with α = 0.65 finds a degenerate error (of weight 11) in Fig. 21 (f), equivalent to the actual error in Fig. 21 (d) up to four stabilizers X3X4, X5X6, Z15Z16Z22Z23, and X26X27X33X34.
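In general, whether an estimated error is degenerate with the actual error can be verified by checking that their binary-symplectic sum lies in the GF(2) row space of the stabilizer generators. The sketch below is ours and uses a toy three-qubit example rather than the patterns of Fig. 21.

import numpy as np

# Membership test over GF(2): is v a linear combination of the generator rows?
# Two Pauli errors (written as binary (x | z) vectors) are degenerate exactly
# when the XOR of their vectors passes this test for the stabilizer generators.
def in_gf2_span(generators, v):
    basis = []                                    # (pivot column, row), kept in RREF
    for g in generators:
        g = np.array(g, dtype=np.uint8) % 2
        for pivot, b in basis:
            if g[pivot]:
                g = g ^ b
        nz = np.flatnonzero(g)
        if nz.size == 0:
            continue
        p = nz[0]
        basis = [(piv, b ^ g if b[p] else b) for piv, b in basis]
        basis.append((p, g))
    v = np.array(v, dtype=np.uint8) % 2
    for pivot, b in basis:
        if v[pivot]:
            v = v ^ b
    return not v.any()

# Toy example: 3 qubits, stabilizers Z1Z2 and Z2Z3 (format x1 x2 x3 | z1 z2 z3).
S = [[0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 1, 1]]
e1 = [0, 0, 0, 1, 0, 0]                           # Z1
e2 = [0, 0, 0, 0, 0, 1]                           # Z3
diff = [a ^ b for a, b in zip(e1, e2)]
print(in_gf2_span(S, diff))                       # True: Z1 and Z3 are degenerate here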
To demonstrate the effect of an even smaller α, we reduce α from 0.65 to 0.5 in the two examples. In each case, the decoding correctly converges in two iterations; the converged estimates for serial MBP 4 with α = 0.5 on the errors in Fig. 21 (a) and Fig. 21 (d) are given in (25). The first result in (25) is equivalent to the actual error in Fig. 21 (a). Note that, as shown in Table III, using normalized BP 4 with α_c < 1 makes the decoding diverge. This can also be observed by monitoring the variation of the energy function, which we discuss in Appendix C.

B.3 Toric codes
Now we give the decoding performance for [[L^2, 2, L]] toric codes with even L. An example of the toric code lattice is shown in Fig. 14 (b). Note that a toric code has every stabilizer generator associated with four qubits and every qubit involved in four stabilizer measurements. The corresponding Tanner graph is thus regular, and every qubit is equally protected. This makes the toric code more suitable for using the same α on all edges.

Figure 22 provides the simulation results for toric codes of various distances. Since every qubit is equally protected, the performance of each toric code is generally better than that of the surface code of comparable size in Fig. 15. Note that the performance curve of BP shows no fluctuation, with or without fixed initialization.
Our simulations suggest using α = 0.75 in MBP 4 for better performance, which is larger than the α = 0.65 used for surface codes (Fig. 15). This agrees with the observation in Fig. 3 that a code with larger row-weight should use a larger α, since a toric code has fixed row-weight 4, larger than the mean row-weight (between 2 and 4) of a surface code. Similar to Fig. 15, the performance of MBP 4 saturates as L gets larger. This can be improved by AMBP 4 with α* ∈ {1.0, 0.99, . . . , 0.5}. We show a case "L = 18, AMBP 4" and several BDD curves in Fig. 22. As can be seen, AMBP 4 corrects most errors within a 4×BDD correction radius in this case, which is better than the comparable case "L = 17, AMBP 4" in Fig. 15, achieving roughly 3×BDD.
By initializing Λ with respect to a fixed ε_0 = 0.001, we get slightly better interpolation performance at large ε, so we apply this technique when choosing α*. Similar to Fig. 17, we provide some statistics for decoding toric codes in Fig. 23. Again, serial MBP 4 with α < 1 significantly improves on conventional BP 4 by exploiting degeneracy (Fig. 23 (a)), at the cost of some undetected errors (Fig. 23 (b)), and MBP 4 has better algorithm convergence (Fig. 23 (c)).
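For concreteness, a minimal sketch of such a fixed initialization, assuming the common quaternary-LLR convention Λ^W = ln(p_I / p_W) for W ∈ {X, Y, Z} under a depolarizing channel with rate ε_0 (the exact vector convention used in Algorithm 1 may differ):

import numpy as np

# Initial LLR vector per qubit for a depolarizing channel with rate eps0,
# i.e. p_I = 1 - eps0 and p_X = p_Y = p_Z = eps0 / 3.
def initial_llrs(N, eps0=0.001):
    lam = np.log((1.0 - eps0) / (eps0 / 3.0))
    return np.full((N, 3), lam)          # columns: (Lambda^X, Lambda^Y, Lambda^Z)

print(initial_llrs(4, 0.001)[0])         # approx. [8.01, 8.01, 8.01]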
By using serial AMBP 4 on toric codes, we observe a threshold of roughly 17.5%, as shown in Fig. 24.

... the hard-decision information of the variable nodes. (Bit-flipping BP can also be used in practice; e.g., it was used in decoding expander codes [35].) To see the difference between the energy functions discussed above, we provide examples in Fig. 25.
For decoding the error pattern in Fig. 21 (a), we plot the change of the energy function for the six configurations in Table III. First, we consider the approximation of J_S given in (26) with B = 6; the results are plotted in Fig. 25 (a). Conventional BP 4 (whether parallel or serial) reaches low energy, even though the decoding is not successful. If a larger step-size is used, as in normalized BP 4 with α_c = 0.65, the decoding jumps randomly and diverges, resulting in high energy. Using MBP 4 with α = 0.65, the decoding converges to the lowest energy. In this example, serial MBP 4 successfully converges to a degenerate error at iteration 10.
Using (30) or (31) results in a figure similar to Fig. 25 (a). Thus we consider the energy function in (32), and the results are plotted in Fig. 25 (b). Clearly, only serial MBP 4 finally converges without getting trapped in a local minimum.
Note that, in Table III, the hard-decision pattern of parallel BP 4 does not change after iteration 5, but the energies in (26) and (32) can still change, as shown in Fig. 25. For example, suppose the output distribution (q_n^I, q_n^X, q_n^Y, q_n^Z) oscillates between the two points (0.4, 0.3, 0.25, 0.05) and (0.4, 0.05, 0.25, 0.3); then the hard decision is the same, Ê_n = I, but, e.g., for an edge of type X, the probability q_n^I + q_n^X = 0.7 > 0.5 at the first point and q_n^I + q_n^X = 0.45 < 0.5 at the second point, which can cause different energy levels.
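This arithmetic can be checked in a few lines (a plain illustration, not code from the paper):

# Two oscillating output distributions with the same hard decision but
# different X-type aggregates q_I + q_X, hence different energy contributions.
points = [
    {"I": 0.4, "X": 0.3, "Y": 0.25, "Z": 0.05},
    {"I": 0.4, "X": 0.05, "Y": 0.25, "Z": 0.3},
]
for q in points:
    hard_decision = max(q, key=q.get)        # "I" in both cases
    x_type_belief = q["I"] + q["X"]          # 0.7, then approx. 0.45
    print(hard_decision, x_type_belief, x_type_belief > 0.5)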
There are two further notes. First, although parallel MBP 4 gets trapped, it achieves low energy in both figures. This explains why BP with post-processing usually works.
Second, when we try to plot J = J_D + ηJ_S in (7) rather than J_S in (8), a quite large η ≈ 10^6 is needed to obtain the correct trend; otherwise, a successful decoding may result in a high energy level. This matches the following expectation: when decoding a highly-degenerate code, η should be large so as to focus more on J_S (since any degenerate error can lower it) than on J_D (since a single low-weight error can dominate it, even if that error has an incorrect syndrome).

APPENDIX D COLOR CODES
Compared to the results for surface and toric codes (Table I), the results for color codes are as follows. The decoding problem of color codes can be cast as a hypergraph matching problem and approximately solved by MWPM, with a threshold of 13.3% over depolarizing errors [88]. In addition, a color code can be projected onto two surface codes and decoded by RG-BP, with a threshold of 8.7% over bit-flip errors [27]. A color code can also be projected onto three surface codes and decoded by MWPM with a threshold of 8.7% [89], or 8.4% if decoded by UF [63], both over bit-flip errors. Alternatively, without the projection, color codes can be decoded by RG-BP with a threshold of 7.8% over bit-flip errors [28]. Theoretical estimates suggest that a color code family can have a threshold of roughly 10.9% over bit-flip errors [90], [91]. For more information on the thresholds of various decoders, see [92], [93].
For reference, AMBP 4 on color codes has a threshold of roughly 14.5% over depolarizing errors [86].

APPENDIX E LINEAR-DOMAIN MBP
Algorithm 3 provides MBP 4 in the linear domain. Its practical complexity can be improved as described in Remark 3.