High-throughput GPU layered decoder of quasi-cyclic multi-edge type low density parity check codes in continuous-variable quantum key distribution systems

The decoding throughput during post-processing is one of the major bottlenecks that occur in a continuous-variable quantum key distribution (CV-QKD) system. In this paper, we propose a layered decoder to decode quasi-cyclic multi-edge type LDPC (QC-MET-LDPC) codes using a graphics processing unit (GPU) in continuous-variable quantum key distribution (CV-QKD) systems. As described herein, we optimize the storage methods related to the parity check matrix, merge the sub-matrices which are unrelated, and decode multiple codewords in parallel on the GPU. Simulation results demonstrate that the average decoding speed of LDPC codes with three typical code rates, i.e., 0.1, 0.05 and 0.02, is up to 64.11 Mbits/s, 48.65 Mbits/s and 39.51 Mbits/s, respectively, when decoding 128 codewords of length \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${10}^{{6}}$$\end{document}106 simultaneously without early termination.


Results
implementation of the layered Bp decoding algorithm. Given that the messages can be updated at variable/check nodes and can be performed in parallel, the layered BP decoding algorithm is deployed on GPU. This section optimizes the GPU implementation of the layered BP decoding algorithm.
The decoder implementation is optimized in such a way that the message is stored in global memory for coalesced access. For memory access in a warp, coalesced access means that the data address of a thread always keeps the same as the thread index, instead of the unordered access. Since the GPU kernel is executed by a warp consisting of 32 threads, the decoding latency can be hidden well for a code with length being a multiple of 32. The layered BP decoder has a coalesced global memory access and stores the parity-check matrix in one file for indexing the corresponding messages. Such a file denoted by H_compact1, will be applied in calculating the messages related to the check nodes. Each element in file H_compact1 contains three pieces of information: the amount of the shift, the position of the element after row rearrangement in the base matrix and the position of the column where the non-negative element located in the base matrix. For example, Fig. 1 displays a 4-by-8 base matrix with an expansion factor of 100. Each non-negative element of the base matrix H in Fig. 1a indicates the amount of shift and '−1' represents an all-zero matrix. The second information indicating the position of the element after row rearrangement is demonstrated in Fig. 1b. Then, one sub-matrices shown in Fig. 1c are used for indexing the needed messages. Accordingly, the one-dimensional matrix on the right side in Fig. 1c represent the degrees of the base matrix (i.e., each element of the one-dimensional matrix represents the number of elements that are not '-1' in the corresponding column of the base matrix).
In our decoder, only one kernel is required to complete the iterative process. In this single kernel, one thread maps a kind of information in one codeword, and the same kind of information of all codewords is then stored sequentially. There are three kinds of information, which include: the log-likelihood ratios (LLRs) of variable nodes, the message of check nodes to variable nodes, and the message of variable nodes to check nodes. We store the same kind of information of different codewords sequentially, e.g., represents the LLR of the VN v i of the k-th codeword denoted by CW k . The coalesced access to the message of variable nodes is illustrated in Fig. 2. Therein, the threads (th 0 , th 1 , ..., th K ) first map the LLR of v 0 in codewords (CW 0 , CW 1 , ..., CW K ) one by one. Once the LLR of v 0 is stored, then the thread group maps the LLR of v 1 in codewords (CW 0 , CW 1 , ..., CW K ) until the LLRs of all variable nodes are stored. The message of check nodes to variable nodes or variable nodes to check nodes is also stored in this manner. The GPU-based layered BP decoder updates the messages of variable nodes and check nodes simultaneously, while enabling multiple codewords to be decoded in parallel. For each individual codeword, the required number of GPU threads is the same as the number of check nodes in a sub-matrix. Each thread computes the messages received from the neighboring variable nodes while also calculating the LLR messages for each adjacent variable node. This procedure is illustrated in Fig. 3 by taking an LDPC code with 4 check nodes and 6 variable nodes as an example. Note that, at each iteration, one thread corresponds to a check node. If the expansion factor Z is equal to 100, 1 × 100 threads corresponding to check nodes of a sub-matrix send messages to neighboring variable nodes and also calculate the messages from variable nodes. Next the 1 × 100 threads are reused to update messages at the second group of check nodes and their neighboring variable nodes. The number of threads that are reused is equal to that of rows of the base matrix. Nonetheless, the layered BP decoder consumes less thread resources and the number of threads assigned to each sub-matrix is only 1 × 64 × Z (recall that Z is the expansion factor) Scientific RepoRtS | (2020) 10:14561 | https://doi.org/10.1038/s41598-020-71534-5 www.nature.com/scientificreports/  The layered BP decoder only involves one file H_compact1 for indexing. This leads to one GPU kernel implementation of the layered BP decoder for each decoding iteration, whose structure is demonstrated in Fig. 4 (This figure is derived from Fig. 4.8 of the thesis 24 ). In the unique kernel, the amount of computation in one thread to calculate the message from a variable node to a check node or from a check node to a variable node, defined by the number of edges on which messages are computed, is equal to the degree of the corresponding check node. The message L  q nm . There are Z threads in total that are performed at the same time. The decoder accesses message readily by using H_compact1 and the LLRs of variable nodes are delivered to the next layer, i.e., the (l + 1)-th layer. The above process is a complete iteration.
The original layered decoder decomposes the H matrix into multiple sub-matrices on the basis of layers, which is equivalent to treating each layer as a sub-code. Each sub-matrix utilizes 1 × Z threads and the serial calculation is conducted among the sub-matrices. In order to increase the layered decoder's thread utilization, we merge the unrelated sub-matrices into a new sub-matrix. For example, a 3-by-3 base matrix shown in Eq. (1) with the expansion factor Z can be divided into three sub-matrices and the degree of any variable node in each sub-matrix is equal to one or zero. Herein, a non-negative integer a in Eq. (1) such as '1' , '0' and '2' corresponds to a matrix obtained by cyclically shifting the Z × Z identity matrix to the right by a bits and '-1′ indicates the all-zero matrix of Z × Z.
If a base matrix is of the form given in Eq. (2), we can combine its first two rows into one layer. In other words, the first two rows form a sub-matrix in which the degree of any variable node is equal to one or zero and the third row separately forms a sub-matrix. Two sub-matrices wok in a serial manner by using 2 × Z and 1 × Z threads, respectively.
Given a base matrix of the form shown in Eq. (3), the first and the third rows of this matrix can be combined into one sub-matrix and the second row forms a sub-matrix.
The thread utilization rate η is computed by where T 1 is the number of layers in each sub-matrix, T 2 is the number of codewords, and T 3 is the total number of threads.
There will be much time needed for the system to call external functions as it is done frequently in a kernel function when using CUDA. Moreover, there will also be some additional waiting time as the warp divergence increases the waiting time when warp threads encounter control flow statements and enter different branches, which means that the remaining branches are currently blocked except for the branch being executed. In this www.nature.com/scientificreports/ work, the kernel function distinguishes the sign of the input data by calling the application programming interface (API) provided by CUDA, thereby avoiding warp divergence and reducing the calls of external functions. An infinite value or invalid value may appear because of the iterative running of the kernel function. To avoid this, a clipping function included in the CUDA Math API, i.e., device float fminf(), is utilized. With the aid of clipping, the decoding throughput is increased from 60.29 to 64.11 Mbit/s. Another optimization that is being performed is directed to reducing the branches since the branch structure has great drawbacks, especially when different threads utilize different branches with a high probability. For instance, each thread has different calculation amounts and computation time and thus the finished threads need to wait for other unfinished ones. Based on this, we can transform the branch structure to an arithmetic operation when parity checks are used and thereby reduce the decoding time. where the expansion factor is 2,500. The codeword constructed by a cycle elimination algorithm is applied in our work 25 . Figure 5 demonstrates the error correction speed, when different number of codewords are decoded simultaneously. The speed grows steadily from 1 to 128 codewords and it does not converge even if the number of codewords reaches 128. Note that, due to the shortage of storage space, the decoding speed is not considered when the number of codewords decoded simultaneously exceeds 128. Thus, the proposed layered BP decoder in this paper decodes 128 codewords simultaneously and its thread utilization rate is computed by 1 × 128 × 2500 ÷ 67108864 = 0.00477 , where each sub-matrix consists of one layer of the base matrix and the maximum number of threads for NVIDIA TITIAN Xp GPU is 67,108,864. Table 1 compares the performance of the layered BP decoder with two types of sub-matrices. One type consists of a single layer and the other consists of multiple layers. When decoding 128 codewords of length 10 6 simultaneously, the layered BP decoder with sub-matrices formed by multiple layers performs better than its counterpart in terms of decoding throughput. The improvement is 3.2 Mbits/s, 1.9 Mbits/s and 1.6 Mbits/s, respectively, when three code rates 0.1, 0.05 and 0.02 are considered. Based on this, it appears that combining uncorrelated sub-matrices could be further improved that would then speed up the decoding while also promoting thread utilization.

Discussion
As described in Ref. 17 , the early termination scheme can be used as an efficient way to reduce the complexity of LDPC decoder, which avoids unnecessary iterations at high SNR. However, this scheme may not be efficient at low SNR since the decoding often fails after multiple iterations. Table 2 illustrates the performance comparison between the flooding and the layered BP decoders without early termination, when three code rates 0.1 0.05 and 0.02 are considered, respectively. The former, as illustrated in Table 1, can simultaneously decode 64 codewords of length 10 6 while the latter, as illustrated in Table, can simultaneously decode 128 codewords. In the decoding process, the flooding decoder uses the whole matrix whereas the layered BP decoder uses the sub-matrices which consists of unrelated layers of the base matrix. In Table 2, the first row represents code rates and the second row represents is SNR under the BIAWGNC. The third row β indicates the reconciliation efficiency related to the code rate and the number of discarded parity bits, which has an influence on the reconciliation distance and the secret key rate. The sixth row represents the number of edges of Tanner graph involved in the decoding and the ninth row represents the average latency of one decoding iteration. The raw throughput K raw in the last row is given by  Herein, N c is the number of codewords that are decoded simultaneously, L c is the code length, T p is the latency per iteration, and T is the number of iterations. The total decoding latency T all consists of the latency of initialization, iterative decoding, hard decision and memory copy between CPU and GPU. It can be observed that the required number of decoding iterations for the layered decoder is half of that for the flooding decoder. Accordingly, the decoding speed of the layered BP decoder is more than twice that of the flooding decoder and the values for three tested codes is 64.11 Mbits/s, 48.65 Mbits/s and 39.51 Mbits/s, respectively. Table 3 demonstrates the performance comparison of the layered BP decoders with or without early termination at different SNRs, where the code length is 10 6 and the code rate is 0.1, respectively. When SNR = 0.161 and 0.171, the layered BP decoder without early termination performs a little faster than that with early termination since less calculations are required to determine whether a valid codeword is obtained in the former. Nevertheless, when SNR = 0.181, the layered BP decoder with early termination is better and the corresponding decoding speed is up to 93.49 Mbits/s. This improvement is attributed to the fact that the introduction of the early termination condition reduces the number of iterations significantly at high SNR.
As we know, the total secret key rate K t of a CV-QKD system is given by where f is the repetition rate, γ is a constant representing the ratio that the part of the repetition rate is utilized to generate secret key (the remaining part of the repetition rate is used for parameter estimation, signal synchronization, parameter monitoring, etc.), K r is the secret key rate per pulse, β = R C(s) is the reconciliation efficiency, I is the mutual information between two participants, χ is the Holevo bound, R is the code rate and C(s) is the channel's capacity. Therefore, a better code, which achieves a lower FER for a given SNR or requires a lower SNR (corresponding to a higher β) to achieve a given FER, will bring a higher secret key rate under the condition of the same repetition rate. Moreover, supporting a higher repetition rate by improving the decoding throughput may also lead to a higher secret key rate despite a little higher FER.
In our work, we use a large expansion factor since it often brings a large decoding throughput while the FER performance is not necessarily degraded. Figure 6 shows the FER vs SNR curves of the layered and the flooding decoders for three code rates when different expansion factors are considered. From observations of Fig. 6, it can be noted that the codes with Z = 2,500 have the best FER performance among four expansion factors for  Number of iterations  100  50  150  75  200  100   Decoding method  flooding  layered  flooding  layered  flooding  layered   Total number of edges  3,767,500 3,767,500 3,480,000 3,480,000 3,337,  www.nature.com/scientificreports/ rates 0.05 and 0.02. Moreover, the layered decoders always have the comparable performance as the flooding decoders for given rates and expansion factors. We also compare the performance of the layered BP decoder with other decoders described in previous works. As can be seen from Table 4, the author of Ref. 18 reported a random MET-LDPC code with rate 0.1, where the decoding speed was up to 7.1Mbits/s using a GPU implementation at SNR = 0.161. The decoder was implemented by the flooding BP algorithm without early termination while using internal parallelism and external parallelism. Internal parallelism means several messages are propagated concurrently for a single BP execution corresponding to one message being decoded and external parallelism indicates several vectors are decoded at the same time. In Ref. 17 , the GPU-based decoder decodes a QC-MET-LDPC code with an expansion factor of 512 and achieves 9.17Mbits/s under an early termination condition. Such a decoder consists of four kernels. In the VN to CN message passing kernel, the edge messages are stored in terms of the indices of VNs sequentially, and in the CN to VN message passing kernel, they are stored in terms of the indices of CNs sequentially. Moreover, the authors of Ref. 17 re-ordered the messages in order to avoid the access to unordered memory.
Note that if the messages of the degree-1 variable nodes are updated only once at the end of the iterative decoding procedure, the computational complexity and the consumption of the thread will be reduced. As a consequence, the decoding speed of the MET-LDPC code of length 10 6 in Ref. 19 is up to 30.39 Mbits/s, where the parity check matrix is divided into two files that can then be stored. One stores the CNs adjacent to VNs sequentially in terms of the indices of VNs, the other stores the mapping relations of VNs to CNs. In our work, the proposed layered BP decoder maps the threads to check nodes in each sub-matrix which consists of unrelated rows, and the amount of computation for each thread is in proportion to the degree of a check node. Moreover, the decoder combines three kinds of information related to each nonnegative element of the base matrix into one integer and stores all such integers in a file, which reduces the consumption of GPU memory and the copy time thereby obtaining a high decoding throughput up to 64.11 Mbits/s with no performance degradation. Figure 7 investigates how the FER relates to the total secret key rate since the key rate is a core index of a QKD system (we assume that the maximum supportable repetition rate is equal to the decoding throughput since  www.nature.com/scientificreports/ the error-correction is usually the most complicated step). Together with the increase of SNR, both FER and β decrease. As a consequence, the total secret key rate is not monotonically increasing over the whole FER ranges. One observes that our GPU-based layered decoder outperforms those given in Ref. [17][18][19] from the total secret key rate point of view. According to Eq. (7), although our GPU decoder has a higher FER, the improvement on the supportable repetition rate due to the higher decoding throughput is sufficient to compensate the resultant loss of the total secret key rate. Despite the high throughput of the GPU-based layered BP decoder, there are some shortcomings. For example, most of threads are not used and thus, there exists much space to increase the number of codewords that are decoded simultaneously. In addition, memory shortage may limit the number of parallel decoding. Our future work will focus on reducing memory consumption when decoding one codeword. As a result, thread utilization will then be increased by decoding more codewords simultaneously.

Methods
In this paper, the three-edge-type QC-LDPC codes are always used 26 . The degree distribution of a MET-LDPC code is specified by a pair of multivariate polynomials ν(r,x) and μ(x), where ν(r,x) is related to variable nodes and μ(x) is related to check nodes. The multivariate polynomial pair (ν(r,x), μ(x)) is defined as follows: where b represents different types of channels (Typically, b only has two values, i.e., 0 or 1, corresponding to two channels which transmit bits and puncture bits, respectively), d denotes the degrees of different edge types, r represents variables corresponding to the different types of channels, and x denotes variables related to edge types. Moreover, ν b,d and μ d denote the probabilities of variable nodes of type (b, d) and check nodes of type d, and the code rate is computed by The construction of MET-QC-LDPC codes used in our simulations is described as follows.