High-speed and Large-scale Privacy Amplification Scheme for Quantum Key Distribution

State-of-the-art quantum key distribution (QKD) systems operate at pulse rates of several GHz, while privacy amplification (PA) must be performed on large-scale inputs to generate final secure keys with quantified security. In this paper, we propose a fast Fourier transform (FFT) enhanced high-speed and large-scale (HiLS) PA scheme that runs on a commercial CPU platform without additional dedicated computational devices. The long input weak secure key is divided into many blocks, and the random seed for constructing the Toeplitz matrix is shuffled into multiple sub-sequences; the PA procedures are then executed in parallel for all sub-key blocks with their corresponding sub-sequences, and the outcomes are merged into the final secure key. When the input scale is 128 Mb, the proposed HiLS PA scheme reaches 71.16 Mbps, 54.08 Mbps and 39.15 Mbps at compression ratios of 0.125, 0.25 and 0.375, respectively, resulting in achievable secure key generation rates close to the asymptotic limit. The HiLS PA scheme can be applied to 10 GHz QKD systems with even larger input scales: the evaluated throughput is around 32.49 Mbps at a compression ratio of 0.125 and an input scale of 1 Gb, which is ten times larger than in previous works for QKD systems. Furthermore, with limited computational resources, the achieved throughput of the HiLS PA scheme is 0.44 Mbps at a compression ratio of 0.125 when the input scale is as large as 128 Gb. In theory, the randomness extraction in quantum random number generation (QRNG) is the same as the PA procedure in QKD, so our work can also be applied efficiently to high-speed QRNG.

Quantum key distribution (QKD), which is based on the fundamental principles of quantum mechanics, can generate information-theoretically secure (ITS) keys for distant communicating parties [1][2][3]. Practical QKD systems mainly consist of two phases: the quantum communication phase and the post-processing phase [4,5]. In the post-processing phase, partial information about the secure key may still be leaked to the eavesdropper Eve after the key/basis sifting and error correction procedures. Privacy amplification (PA), the most significant post-processing procedure, converts the weak secure correlated key into a key that is uniform and ITS with respect to Eve [6][7][8].
Given the input weak secure key W of length n and the security level ε, the optimal PA scheme can in theory be achieved with (dual) universal hash functions using a Toeplitz-type matrix T with computational complexity $O(n \log n)$ [9]; the length of the random seed consumed in PA is αn, with min-entropy αn + O(1), α ∈ (0,1] [10][11][12][13]. Nowadays, state-of-the-art academic QKD experiments operate at pulse rates of several GHz [14][15][16][17][18], use advanced multiplexing technologies [19,20] and extract secure keys even in high-dimensional scenarios [21][22][23]. Meanwhile, a rigorous statistical fluctuation analysis has to be performed to remove finite-size key effects from the final secure key [24,25]. Therefore, a high-throughput and large-scale (usually larger than several megabits) PA scheme has to be implemented to extract the secure key in real time with an achievable generation rate close to the asymptotic (infinite-key) limit.
The simplest way to implement a large-scale PA scheme is to multiply W by T directly, which has computational complexity $O(n^2)$. Such matrix-vector multiplication is nevertheless well suited to implementation on Field-Programmable Gate Array (FPGA) platforms. H. Zhang et al. first divided T into many smaller blocks and proposed a block-parallel PA scheme to speed up the Toeplitz hashing procedure [26]. S. Yang et al. [27] and J. Constantin et al. [28] each proposed advanced block partition strategies to reduce the overhead of the multiplication operations, resulting in a throughput of around 64 Mbps with an input scale of 1 megabit [27].
In practice, the majority of optimized PA schemes are performed using the fast Fourier transform (FFT), which reduces the complexity to $O(n \log n)$ [8,29,30]. Given a fixed security level ε (i.e., $10^{-10}$), the longer the communication distance, the larger the input length the PA scheme must handle. For example, in entanglement-based QKD systems, the input length n should be increased to at least the order of $10^8$. It is a huge challenge to implement large-scale FFT-based PA schemes on FPGA platforms because of the limited resources and the highly complicated hardware design. Implementing PA schemes on MIC, GPU or other dedicated computational devices consumes very high power and volume and significantly increases the design complexity. Improving the throughput of FFT-enhanced PA schemes on CPU platforms is a conventional option, since a CPU implementation can be efficiently integrated into the whole QKD system. However, CPU implementations are feasible only for small input scales ($\leq 10^6$), and PA rapidly becomes the performance bottleneck at larger input scales. Therefore, in this article, we propose a fast Fourier transform (FFT) enhanced high-speed and large-scale (HiLS) PA scheme on a commercial multi-core CPU platform. In the HiLS PA scheme, W is divided into many blocks and the random seed for constructing the Toeplitz matrix T is shuffled into multiple sub-sequences; the PA procedures are then executed in parallel for all sub-key blocks with their corresponding sub-sequences, and the outcomes are merged into the final secure key. When the input scale is 128 Mb, our HiLS PA scheme reaches 71.16 Mbps, 54.08 Mbps and 39.15 Mbps at compression ratios of 0.125, 0.25 and 0.375, respectively. The HiLS PA scheme can therefore be applied to 10 GHz QKD systems with even larger input scales: the evaluated throughput is around 32.49 Mbps at a compression ratio of 0.125 and an input scale of 1 Gb, which is ten times larger than in previous works for QKD systems.
Furthermore, with limited computational resources (128 GB of memory, 1 TB of storage and 16 CPU cores in total), the achieved throughput of the HiLS PA scheme is 0.44 Mbps at a compression ratio of 0.125 when the input scale is as large as 128 Gb. In theory, the randomness extraction in quantum random number generation (QRNG) is the same as the PA procedure in QKD [32][33][34]. Thus, the HiLS PA scheme can also be applied efficiently to high-speed QRNG.

Related Work
Privacy amplification was first proposed in the context of quantum key distribution by Bennett et al. [6], where a channel with perfect authenticity but no privacy (the public classical channel) can be used to repair the defects of a channel with imperfect privacy but no authenticity (the quantum channel). The schematic diagram of PA in QKD is shown in Fig. 1. Alice and Bob first distribute quantum signals via a noisy and lossy quantum channel (fiber or free space), and then share a correlated weak secure key W after the basis/key sifting and error correction procedures via a public channel. The min-entropy of the shared weak secure key W is n. Let the random variable E summarize Eve's entire knowledge about W, where H(W|E) ≤ t, t < n. In PA, Alice and Bob publicly agree on an extractor function $G: \{0,1\}^n \to \{0,1\}^r$ that reduces Eve's information about the final secure key $K_f$ from t to at most ε [6,7,35,36]. Nowadays, most practical extractors are based on universal hash functions, especially the (modified) Toeplitz matrix, defined as [13]

$$G = [\, I_r \mid T(A) \,],$$

where $I_r$ is the r × r identity matrix, T(A) is an r × (n − r) Toeplitz matrix, A is a random seed, $A = (a_0, a_1, \ldots, a_{n-2}) \in \{0,1\}^{n-1}$, and $T(A)_{i,j} = a_{j-i+r-1}$. We also define $W_I = (w_0, w_1, \ldots, w_{r-1})$ and $W_{TA} = (w_r, w_{r+1}, \ldots, w_{n-1})$. Therefore, the final secure key can be calculated as

$$K_f = G\,W = W_I \oplus T(A)\,W_{TA}.$$

In order to efficiently implement the calculation of $T(A)\,W_{TA}$ using the fast Fourier transform (FFT), we have to extend T(A) to a special circulant Toeplitz matrix of size (n − 1) × (n − 1) and extend $W_{TA}$ to a vector of length n − 1 by padding zeros. The optimized multiplication of a circulant matrix and a vector is

$$H X = F^{-1}\big(F(h) * F(X)\big),$$

where "*" denotes the Hadamard product operator, F denotes the Fourier transform operator, $F^{-1}$ is the inverse Fourier transform operator, X is a vector and H is a circulant Toeplitz matrix with first row h.
Since the complexity of the F and $F^{-1}$ operations is $O(n \log n)$ and the complexity of the Hadamard product operation is $O(n)$, the computational complexity of the optimized PA algorithm is $O(n \log n)$ [8,12].
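The FFT route can be checked against the direct definition on a small example. The sketch below is illustrative only: it assumes the modified Toeplitz hash takes the form K_f = W_I ⊕ T(A) W_TA with T(A)_{i,j} = a_{j−i+r−1}, evaluates the product as a circular convolution of the zero-padded seed with the reversed key tail, and uses a small pure-Python radix-2 FFT whose length is rounded up to a power of two (an assumption of this sketch). All function names are hypothetical.

```python
import cmath
import random


def fft(x, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def circular_convolve_mod2(c, x):
    """Compute F^-1(F(c) * F(x)), round to integers and reduce modulo 2."""
    size = len(c)
    prod = [p * q for p, q in zip(fft(c), fft(x))]
    y = fft(prod, invert=True)
    return [round(v.real / size) % 2 for v in y]


def toeplitz_hash_naive(seed, w, r):
    """Direct O(n^2) evaluation of K_f = W_I xor T(A) W_TA over GF(2)."""
    n = len(w)
    out = []
    for i in range(r):
        acc = w[i]
        for j in range(n - r):
            acc ^= seed[j - i + r - 1] & w[r + j]
        out.append(acc)
    return out


def toeplitz_hash_fft(seed, w, r):
    """Same hash, with T(A) W_TA read off a circular convolution."""
    n = len(w)
    size = 1
    while size < n - 1:          # power-of-two length >= n - 1 (sketch assumption)
        size *= 2
    a_pad = seed + [0] * (size - len(seed))
    tail_rev = w[r:][::-1] + [0] * (size - (n - r))
    conv = circular_convolve_mod2(a_pad, tail_rev)
    # Row i of T(A) W_TA sits at convolution index n - 2 - i.
    return [w[i] ^ conv[n - 2 - i] for i in range(r)]


random.seed(1)
n, r = 64, 16
seed = [random.getrandbits(1) for _ in range(n - 1)]
weak_key = [random.getrandbits(1) for _ in range(n)]
print(toeplitz_hash_fft(seed, weak_key, r) == toeplitz_hash_naive(seed, weak_key, r))
```

The rounding step matters: the convolution values are integers in exact arithmetic, so rounding before the modulo-2 reduction absorbs small floating-point errors.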
In theory, QKD can generate ITS keys for the communicating parties even when the quantum channel is under the control of the eavesdropper Eve. Imperfect implementations and active attacks leak some information about W to Eve. Alice and Bob can quantify the bound on the leaked information accurately with an infinite post-processing block size. In this paper, we take entanglement-based QKD as an example; the secure key rate can be calculated as [37]

$$R = q\, Q_\mu\, \nu_s \left[ 1 - f(e_b)\, H_2(e_b) - H_2(e_p^U) \right],$$

where q is the basis sifting factor, $Q_\mu$ is the gain of detected entangled photon pairs, $\nu_s$ is the repetition rate of the entangled source, $e_b$ is the measured quantum bit error rate, $e_p^U$ is the estimated upper bound of the phase error rate, f(x) is the error correction efficiency, and $H_2(x)$ is the binary Shannon entropy.
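As a rough numerical illustration, the rate can be evaluated with a few lines of code. The sketch assumes the standard form R = q Q_μ ν_s [1 − f(e_b) H_2(e_b) − H_2(e_p^U)] built from the quantities defined above, treats the error correction efficiency as a constant, and clamps negative rates to zero; the numbers in the usage lines are hypothetical placeholders, not the parameters of Table 1.

```python
import math


def binary_entropy(x):
    """Binary Shannon entropy H_2(x) in bits, with H_2(0) = H_2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def secure_key_rate(q, gain, rep_rate_hz, e_bit, e_phase_upper, f_ec):
    """Secure key rate in bits per second under the assumed rate formula.

    q             : basis sifting factor
    gain          : gain of detected entangled photon pairs (Q_mu)
    rep_rate_hz   : repetition rate of the entangled source (nu_s)
    e_bit         : measured quantum bit error rate (e_b)
    e_phase_upper : upper bound of the phase error rate (e_p^U)
    f_ec          : error correction efficiency, taken as a constant here
    """
    fraction = 1.0 - f_ec * binary_entropy(e_bit) - binary_entropy(e_phase_upper)
    # A negative fraction means no secure key can be extracted.
    return max(0.0, q * gain * rep_rate_hz * fraction)


# Hypothetical values chosen only to exercise the function.
rate = secure_key_rate(q=0.5, gain=1e-2, rep_rate_hz=10e9,
                       e_bit=0.01, e_phase_upper=0.012, f_ec=1.16)
print(f"{rate / 1e6:.2f} Mbit/s")
```

A looser phase error bound (larger e_p^U, as produced by small block sizes) lowers the bracketed fraction, which is why the finite-size analysis pushes toward large PA inputs.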
In practice, $e_p^U$ cannot be measured directly and cannot be accurately estimated because of statistical fluctuations with finite post-processing block sizes. Here, we simulate the required throughput of the PA algorithm in a 10 GHz entanglement-based QKD system with the parameters shown in Table 1. The entangled photon source is placed midway between the communicating parties, the finite-size effect on the final secure key $K_f$ is considered for post-processing block sizes from the order of $10^4$ to infinity, and the failure probability for estimating $e_p^U$ is $\varepsilon_{ph} = 10^{-10}$ [4]. The analyzed results are shown in Fig. 2: the post-processing block size should be at least of the order of $10^8$ to achieve a secure key rate close to the asymptotic limit. Directly implementing PA algorithms with such ultra-large-scale inputs would limit the performance of the full QKD system. Meanwhile, the required throughput of the PA algorithm is around 40 Mbps without any channel loss.

High-speed and Large-scale Privacy Amplification Scheme
The schematic diagram of the proposed high-speed and large-scale (HiLS) privacy amplification scheme for QKD is shown in Fig. 3. The weak secure key W of length n is obtained after the basis/key sifting and error correction procedures are applied to the measured raw key string at Alice's (Bob's) side. Then, Alice and Bob estimate the final secure key length r with a rigorous statistical fluctuation analysis procedure. Afterwards, Alice and Bob publicly agree on a random seed of length n − 1 bits to construct the universal hash function. The proposed HiLS PA scheme mainly consists of three steps: splitting and shuffling, sub-PA, and secure-key merging.
Step 1: Splitting and shuffling. In this step, we divide W into several sub-vectors and divide the Toeplitz matrix T(A) into k × t sub-matrices of size m × m, where km ≥ r and tm ≥ n − r. First of all, we construct a vector A by padding km − r (tm − n + r) zeros to the head (tail) of the exchanged random seed of length n − 1 bits. Then, we shuffle A into k + t − 1 sub-vectors, defined as $A_i := [a_{im}, a_{im+1}, \ldots, a_{(2+i)m-2}]$, 0 ≤ i < k + t − 1, so that each sub-vector holds the 2m − 1 bits that define one m × m Toeplitz sub-matrix. Therefore, the divided sub-matrices can be constructed as $H_{i,j} = T(A_{i+j})$, i ∈ [0, k) and j ∈ [0, t). For W, we first pad tm − n + r zeros to the tail and take the first r bits and the remaining bits to construct the sub-vectors $W_I$ and $W_{TA}$. Then, we divide $W_{TA}$ into t sub-vectors of length m, defined as $W_j := [w_{r+jm}, w_{r+jm+1}, \ldots, w_{r+(j+1)m-1}]$, j ∈ [0, t).
Step 2: Sub-PA. In this step, the multiplication $Y_{i,j} = H_{i,j} W_j$ is implemented efficiently using the FFT for each sub-vector $W_j$ and sub-matrix $H_{i,j}$, where i ∈ [0, k) and j ∈ [0, t).
Step 3: Secure-key merging. The outcomes of the sub-PA procedures are merged as the final secure key.
Notation: the weak secure key length is n, the final secure key length is r, and the sub-block size is m.
Table 2. Specifications of server computer.
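The three steps can be sketched on a toy scale and compared with the unsplit hash. Several details below are assumptions of the sketch rather than statements from the text: k and t are taken as the smallest integers with km ≥ r and tm ≥ n − r, each shuffled sub-vector holds the 2m − 1 seed bits that define one m × m Toeplitz block, the sub-PA products are computed by direct multiplication instead of the FFT, and the block rows are stacked from i = k − 1 down to i = 0, an order that follows from the head-padding convention used here. All function names are hypothetical.

```python
import random


def toeplitz_block_mul(a_sub, x):
    """Multiply the m x m block T(a_sub) by x over GF(2).

    T(a_sub)[u][v] = a_sub[v - u + m - 1]; a direct O(m^2) stand-in
    for the FFT-based sub-PA step.
    """
    m = len(x)
    return [sum(a_sub[v - u + m - 1] & x[v] for v in range(m)) % 2
            for u in range(m)]


def hils_pa(seed, w, r, m):
    """Block-wise evaluation of K_f = W_I xor T(A) W_TA."""
    n = len(w)
    k = -(-r // m)               # ceil(r / m)
    t = -(-(n - r) // m)         # ceil((n - r) / m)

    # Step 1: pad the seed, shuffle it into k + t - 1 overlapping sub-vectors,
    # and split the padded key tail into t sub-vectors of length m.
    a = [0] * (k * m - r) + seed + [0] * (t * m - n + r)
    a_subs = [a[i * m:(i + 2) * m - 1] for i in range(k + t - 1)]
    w_i = w[:r]
    w_ta = w[r:] + [0] * (t * m - n + r)
    w_subs = [w_ta[j * m:(j + 1) * m] for j in range(t)]

    # Step 2: sub-PA, Y[i][j] = H[i][j] W[j] with H[i][j] = T(A_{i+j}).
    y = [[toeplitz_block_mul(a_subs[i + j], w_subs[j]) for j in range(t)]
         for i in range(k)]

    # Step 3: XOR the products of each block row, stack the block rows,
    # keep the first r bits and XOR them with W_I.
    stacked = []
    for i in reversed(range(k)):
        acc = [0] * m
        for j in range(t):
            acc = [p ^ q for p, q in zip(acc, y[i][j])]
        stacked.extend(acc)
    return [w_i[u] ^ stacked[u] for u in range(r)]


def toeplitz_hash_direct(seed, w, r):
    """Unsplit reference: T(A)[i][j] = seed[j - i + r - 1]."""
    n = len(w)
    return [w[i] ^ (sum(seed[j - i + r - 1] & w[r + j]
                        for j in range(n - r)) % 2)
            for i in range(r)]


random.seed(2)
n, r, m = 50, 13, 8
seed = [random.getrandbits(1) for _ in range(n - 1)]
weak_key = [random.getrandbits(1) for _ in range(n)]
print(hils_pa(seed, weak_key, r, m) == toeplitz_hash_direct(seed, weak_key, r))
```

Because the k × t products in Step 2 are independent of one another, they are the natural unit of work to distribute across CPU cores.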

Results
The implementation of the HiLS PA scheme is evaluated on a multi-core server computer whose specifications are shown in Table 2. Because FFT operations may suffer from errors caused by finite-precision floating-point arithmetic, we suggest keeping the scale of each FFT operation below the order of $10^8$. Meanwhile, considering thread synchronization and thread safety issues, the calculations of the (inverse) Fourier transforms and the Hadamard products are parallelized in a shared-memory multi-process architecture.
We evaluate the throughput of the HiLS PA scheme with different input scales (n) and various sub-block sizes (m). The result is shown in Fig. 4, where the input weak secure key length n ranges from 16 Mb to 512 Mb and the splitting factor, defined as m/n, varies from 1/32 to 1/2. Figure 4 shows that, for a given n (up to 1 Gb in our implementation), the HiLS PA scheme always achieves its optimized throughput when the splitting factor = … . …, the number of split sub-matrices stays at the level of the case with a splitting factor of 0.125, but the FFT operating scale is the same as in the case with a splitting factor of 0.25, which results in even worse throughput for the HiLS PA scheme. This situation also occurs when the splitting factor satisfies … < m/n < … . When the input scale is 1 Gb, the throughput of the HiLS PA scheme reaches 32.49 Mbps and 15.0 Mbps at compression ratios of 0.125 and 0.25, respectively, which supports a much more rigorous statistical fluctuation analysis and is remarkably higher than the required throughput when the total channel loss is expected to be larger than 87.6 dB.
With limited resources (128 GB of memory, 1 TB of storage and 16 CPU cores in total), the HiLS PA scheme with an input scale of 128 Gb and a compression ratio of 0.125 runs for around 83 hours, resulting in a throughput of 0.44 Mbps. Implementing PA with such large inputs on a GPU platform is very difficult because of the complicated computation and memory scheduling strategies. Meanwhile, the throughput of the HiLS PA scheme can be easily improved on high-speed multi-core CPU platforms configured with much larger memory.
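The quoted throughput can be cross-checked from the input scale and the run time. The short calculation below assumes that throughput is measured over the input length and that the prefixes are binary (1 Gb = 2^30 bits, 1 Mb = 2^20 bits); both are assumptions of this sketch, and the helper name is hypothetical.

```python
def throughput_mbps(input_bits, seconds):
    """Input bits processed per second, expressed in units of 2**20 bits."""
    return input_bits / seconds / 2**20


input_bits = 128 * 2**30     # 128 Gb input, binary prefix assumed
seconds = 83 * 3600          # "around 83 hours" of run time
rate = throughput_mbps(input_bits, seconds)
print(f"{rate:.2f} Mbps")    # prints "0.44 Mbps"
```

Under these assumptions the figure agrees with the reported 0.44 Mbps; with decimal prefixes the same run time would give a slightly higher value.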