True Randomness from Big Data

Generating random bits is a difficult task that is important for simulating physical systems, for cryptography, and for many other applications that rely on high-quality random bits. Our contribution is to show how to generate provably random bits from uncertain events whose outcomes are routinely recorded in the form of massive data sets. These include scientific data sets, such as those in astronomy and genomics, as well as data produced by individuals, such as internet search logs, sensor networks, and social network feeds. We view the generation of such data as a sampling process from a big source, which is a random variable of size at least a few gigabytes. Our view initiates the study of big sources in the randomness extraction literature. Previous approaches for big sources rely on statistical assumptions about the samples. We introduce a general method that provably extracts almost-uniform random bits from big sources and extensively validate it empirically on real data sets. The experimental findings indicate that our method is efficient enough to handle large sources, while previous extractor constructions are not efficient enough to be practical. Quality-wise, our method at least matches quantum randomness expanders and classical-world empirical extractors as measured by standardized tests.

x {0,1} 2 1 n , an issue that we will elaborate more on later. To extract randomness from a source X, we need: (i) a sample from X, (ii) a small uniform random seed Y, (iii) a lower bound k for H_∞[X], and (iv) a fixed error tolerance ε > 0. Formally, a (k, ε)-extractor Ext: {0, 1}^n × {0, 1}^d → {0, 1}^m outputs Ext(X, Y) that is ε-close to the uniform distribution when taking input from any (n, k)-source X together with a uniform random seed Y. In typical settings d = polylog(n) = (log_2 n)^c for a constant c > 0, and m > d. The seed is necessary since otherwise it is impossible to extract even a single random bit from one source [2]. We note that other notions of the output being random, other than closeness to the uniform distribution, are possible and have been studied in a number of general science journal articles [13][14][15][16]. These are based on measures of randomness such as approximate entropy. Since our measure is total variation distance to the uniform distribution, our generated output provably appears random to every other specific measure, including, e.g., approximate entropy.
What does it mean to extract randomness from big sources? Computation over big data is commonly formalized through the multi-stream model of computation, where, in practice, each stream corresponds to a hard disk [17]. Algorithms in this model can be turned into executable programs that can process large inputs. Formally, a streaming extractor uses a local memory and a constant number (e.g., two) of streams, which are tapes the algorithm reads and writes sequentially from left to right. Initially, the sample from X is written on the first stream. The seed of length d = polylog(n) resides permanently in local memory. In each computation step, the extractor operates on one stream by writing over the current bit, then moving on to the next bit, and finally updating its local memory, while staying put on all other streams. The total number p of passes over all streams is constant or slightly above constant, and the local memory size is s = polylog(n).
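The two quantities that the extractor definition relies on, min-entropy and closeness to uniform in total variation distance, can be made concrete on a toy distribution. The following is a minimal sketch that uses the standard textbook definitions; the function names and the example distribution are illustrative assumptions and do not come from our implementation.

```python
import math

def min_entropy(dist):
    """H_inf[X] = -log2(max_x Pr[X = x]); dist maps outcomes to probabilities."""
    return -math.log2(max(dist.values()))

def statistical_distance(dist, num_outcomes):
    """Total variation distance between dist and the uniform distribution
    over num_outcomes outcomes (outcomes missing from dist have probability 0)."""
    u = 1.0 / num_outcomes
    listed = sum(abs(p - u) for p in dist.values())
    missing = (num_outcomes - len(dist)) * u
    return 0.5 * (listed + missing)

# Hypothetical 2-bit source: the most likely outcome has probability 1/2.
biased = {"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}
print(min_entropy(biased))              # 1.0
print(statistical_distance(biased, 4))  # 0.25
```

In this toy case the source has 1 bit of min-entropy out of 2 bits, and it is 0.25-far from uniform; an extractor output is called ε-close to uniform when the second quantity is at most ε.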
The limitations of streaming processing (tiny p and s) pose challenges for randomness extraction. For example, a big source X could be controlled by an adversary who outputs n-bit samples , for some t_1 ≠ t_2 and large integer Δ > 0. Besides such simple dependencies, an extractor must eliminate all possible determinacies without knowing any of the specifics of X. To do that, it should spread the input information over the output, a task fundamentally limited in streaming algorithms. This idea was previously formalized [18], where it was shown that an extractor with only one stream needs either polynomially many in n, denoted poly(n), passes or poly(n)-size local memory; i.e., no single-stream extractor exists. Even if we add a constant number of streams to the model, the so-far known extractors [19,20] cannot be realized with o(log_2 n) many passes (a corollary of [17]), nor do they have a known implementation with tractable stream size.
An effective study on the limitations of every possible streaming extractor goes hand in hand with a concrete construction we provide. The main purpose of this article is to explain why such a construction is at all possible, and our focus here is on the empirical findings. The following theorem relies on mathematical techniques that could be of independent interest (Supplementary Information pp. 26-30) and states that Ω(log_2 log_2 n) many passes are necessary for all multi-stream extractors. This constitutes our main impossibility result. This unusual, slightly-above-constant number of passes is also sufficient, as witnessed by the two-stream extractor presented below.
Theorem. Fix an arbitrary multi-stream extractor Ext: {0, 1}^n × {0, 1}^d → {0, 1}^m with error tolerance ε = 1/poly(n), such that for every input source X where H_∞[X] ≥ κn, for any constant κ > 0, and uniform random seed Y, the output Ext(X, Y) is ε-close to uniform. If Ext uses sub-polynomial n^o(1) local memory then it must make p = Ω(log_2 log_2 n) passes. Furthermore, the same holds for every constant number λ ∈ ℤ^+ of input sources.

Our RRB Extractor
We propose and validate a new empirical method for true randomness extraction from big sources. This method consists of a novel extractor and empirical methods to both estimate the min-entropy and generate the initial random seed. Figure 1 depicts a high-level view of the complete extraction method. This is the first complete general extraction method, not only for big sources but for every statistical source.
We propose what we call the Random Re-Bucketing (RRB) extractor. For our RRB extractor we prove (Supplementary Information pp. 11-25) that it outputs almost-uniform random bits when given as input a single sample from an arbitrary weak source, as long as the source has enough min-entropy. Mathematical guarantees are indispensable for extractors, since testing uniformity and estimating the entropy of an unknown distribution, even approximately, is computationally intractable [21]. A key feature of the RRB extractor in Fig. 2 is its simplicity, with the technical difficulty being in proving its correctness, which requires a novel, non-trivial analysis. RRB is the first extractor capable of handling big sources without additional assumptions. Previous works require either (i) unrealistic running times or (ii) ad hoc assumptions about the source. In particular, local extractors such as von Neumann [22] and Local Hash fail significantly in terms of output quality, whereas the throughput of Trevisan's extractor [19] and its follow-ups degrades significantly (see Fig. 3) with the size of the sample [12], even when practical optimizations are considered; e.g., 103,407 years of computing time for a 20 GB input sample and ε = 10^−3, κ = 1/2. We choose to compare to the Local Hash and von Neumann extractors since these are the only extractors experimented upon in previous work (see ref. 23 for empirical work using von Neumann's extractor, and see refs 12, 24 and 25 for empirical work using Local Hash) and, importantly, both happen to be streaming extractors. Thus, given the special attention they received in previous work, they are two ideal candidates for comparison. We refer the reader to Table 2.
Figure 2 (caption fragment). In Stage III the same local extractor h is used for the first γ fraction of blocks. The number of super-blocks b also depends on an error tolerance ε and the empirically estimated min-entropy rate κ.
In the main body, we explain how to realize this description as an algorithm that uses two streams. The RRB extractor consists of the following three stages.
Table 2. Comparative extraction quality performance. The raw data consists of 12 files, each of size 1000 MB, from the 12 data categories, and the adversarial data are generated by simply replacing 10 MB in each file with fixed values. NIST tests are applied to the raw data and to the extraction output of the von Neumann, Local Hash, and RRB extractors on raw data and adversarial data. The second column is the total number of NIST tests per setting. The third column is the number of NIST tests that fail because of proportion, and the fourth column is the number of NIST tests that fail because of the second-order P-value. All are compared with the number expected for ideal uniform random bits. Except for RRB and "RRB on adversary", all other test results indicate non-uniform output (i.e., noticeably different from ideal uniform).
The implementation is scalable since it uses bit random seed. For example, 44 passes and 57 KB of a random seed suffice to extract 1 GB of randomness from a 20 GB input. (This is for min-entropy k ≥ 0.2n = 4 GB and error rate ε = 10^−27 < 1/n^2. With a total of 50 passes, the error rate can be as small as ε = 10^−100. Most of the seed is used to sample a random Toeplitz hash h.) Stage III with γ = 1 has been used before [26][27][28] in randomness extraction from sources of guaranteed next-block-min-entropy. This guarantee means that every block, as a random variable, has enough min-entropy left even after revealing all the blocks preceding it, i.e., it presumes strong inter-block independence. Such a precondition restricts the applicability of Stage III since it appears too strong for common big sources, especially when there is an adversary. However, by introducing Stages I and II we can provably fulfill the precondition for a theoretically lower-bounded constant γ.
In practice, a larger (i.e., better than in theory) γ can be empirically found and validated.
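For reference, the von Neumann extractor used as a baseline in the comparison is simple enough to state in a few lines. The following is a sketch of the classical textbook procedure (the function name is ours, not taken from our experimental code): it reads the input in non-overlapping pairs, outputs the first bit of every unequal pair, and discards equal pairs. Its guarantee presumes independent, identically biased bits, which is exactly the kind of assumption that a big source need not satisfy.

```python
def von_neumann_extract(bits):
    """Classical von Neumann procedure on a list of 0/1 values:
    pair (0,1) -> 0, pair (1,0) -> 1, pairs (0,0) and (1,1) are discarded."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)
    return out

# Independent-looking input: two unequal pairs survive.
print(von_neumann_extract([0, 1, 1, 0, 0, 0, 1, 1]))  # [0, 1]

# A fixed (adversarial) alternating pattern passes straight through as a
# constant output, so the extracted bits are far from uniform.
print(von_neumann_extract([0, 1] * 4))                # [0, 0, 0, 0]
```

The second call illustrates why a purely local rule cannot remove dependencies between distant or adjacent positions of the sample.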

I. Partition the n-bit long input into
Stage I equalizes entropy within each super-block and, subsequently, Stage II distributes entropy globally. After Stage II, the following property holds. Let   , γ = Ω(1) and b_O = Ω(n/b). Therefore, Stage III extracts ε-close to uniform random bits.
To invoke the extractor, it is necessary to find an initial random seed and estimate the min-entropy rate κ of the source. The proposed method includes an empirical realization of a multi-source extractor to obtain 4 MB initial randomness from 144 audio samples each of 4 MB. We also propose and validate an empirical protocol that estimates both κ and γ simultaneously by combining RRB itself with standardized uniformity tests.
Finally, we note that the RRB extractor bears some superficial similarities to the Advanced Encryption Standard (AES) block cipher, which is an encryption scheme widely used in practice. That is, at a high level both schemes efficiently mix information, though they do so in very different ways, e.g., in AES this is done on a much more local scale, whereas we mix information globally. Moreover, unlike the RRB extractor that we propose, the AES block cipher cannot have provable guarantees without proving that P ≠ NP.

Methods
The proposed method is validated in terms of efficiency and quality, measured by the standard quality test suites NIST [29] and DIEHARD [30]. The results strongly support our new extractor construction on real-world samples. The empirical study compares multiple extraction methods on many real-world data sets, and demonstrates that our extractor is the only one that works in practice on sufficiently large sources.
Our experiments are explained in more detail below and we summarize them here. Our samples range in size from 1.5 GB to 20 GB and they are from 12 data categories: compressed/uncompressed text, video, images, audio, DNA sequenced data, and social network data. The empirical extraction is for ε = 10^−20 and estimated min-entropy rate ranging from 1/64 to 1/2, with extraction time from 0.85 hours to 11.06 hours on a desktop PC (Fig. 3). The extracted outputs of our method pass all quality tests, whereas the data sets before extraction fail almost everywhere (Tables 1 and 2). Under these tests, the output of RRB is statistically indistinguishable from the uniform distribution. Such test results provide further evidence that the extraction quality is close to the ideal uniform distribution, in addition to the necessary rigorous mathematical treatment [31].
Extraction method. The complete empirical method consists of: (i) initial randomness generation, (ii) parameter estimation, and (iii) streaming extraction. Components (ii) and (iii) rely on initial randomness.
We first extract randomness from multiple independent sources without using any seed. Then, we use RRB to expand this initial randomness further.
Parameter estimation determines a suitable pair (κ, γ) of min-entropy rate κ = k/n and effectiveness factor γ.
Experimental set-up. We empirically evaluate the quality and the efficiency of our RRB extractor.
Quality evaluation is performed on big samples from twelve semantic data categories: compressed/uncompressed audio, video, images, text, DNA sequenced data, and social network data (for audio, video, and images the compression is lossy). The initial randomness used in our experiments consists of 9.375 × 10^8 bits ≈ 117 MB generated from 144 pieces of 4 MB compressed audio and one piece of 15 GB compressed video. The produced randomness is used for parameter estimation on samples ranging in size from 1 GB to 16 GB from each of the 12 categories. The estimated κ and γ vary within [1/64, 1/2] and [1/32, 1/2] respectively, cross-validated (i.e., excluding previously used samples) on samples of size 1.5 GB to 20 GB with error tolerance ε = 10^−20. Final extraction quality is measured on all 12 categories by the standard NIST and DIEHARD batteries of statistical tests.
For comparison, we measure quality and efficiency for three of the most popular representatives of extractors. The quality of the Local Hash and von Neumann extractors is evaluated on 12 GB of raw data (from the 12 categories) and on 12 GB of adversarial synthetic data. The efficiency is measured for the von Neumann extractor, Local Hash, and Trevisan's extractor. See the Supplementary Information for tables.
Empirical initial randomness generation. Seeded extraction, as in RRB, needs uniform random bits to start. All the randomness for the seeds in our experiments is obtained by the following two-phase method, which we call randomness bootstrapping: (i) obtain initial randomness ρ through (seedless) multiple-independent-source extraction, and (ii) use ρ for parameter estimation and run RRB to extract a longer string ρ_long , ρ | | = ρ| − 2 long /54 7 , where ρ is the part of ρ used as the seed of RRB during bootstrapping. By elementary information theory, ρ_long can be used instead of a uniformly random string.
Phase (ii) uses the 4 MB extracted by BIWZ, out of which 3.99 MB are used in parameter estimation for compressed video. The remaining 10 KB are used to run RRB on 15 GB of compressed video, which is generated and compressed privately, i.e., without adversarial control. Our hypothesis is that the estimated parameters are valid for RRB, i.e., n bits of compressed video contain min-entropy n/2 that can be extracted by RRB with effectiveness factor γ = 1/32. This hypothesis is verified experimentally. With the given seed and κ = 1/2, γ = 1/32, and ε = 10^−100, RRB extracts the final 9.375 × 10^8 random bits.
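The figure of 9.375 × 10^8 bits can be checked with a short calculation. The sketch below assumes that the RRB output length is m = γκn/2 and that 1 GB denotes 10^9 bytes; both are assumptions of this illustration, and the helper name is hypothetical.

```python
from fractions import Fraction

def output_length_bits(n_bits, kappa, gamma):
    """Assumed RRB output length m = gamma * kappa * n / 2, in bits."""
    return int(Fraction(n_bits) * kappa * gamma / 2)

# 15 GB of compressed video, taking 1 GB = 10**9 bytes (assumption).
n_bits = 15 * 10**9 * 8
m = output_length_bits(n_bits, Fraction(1, 2), Fraction(1, 32))
print(m)              # 937500000
print(m / 8 / 10**6)  # 117.1875
```

Under these assumptions the result is exactly 9.375 × 10^8 bits, i.e., about 117 MB, in agreement with the amount of initial randomness reported in the experimental set-up.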
Empirical parameter estimation protocol. There are two crucial parameters for RRB: the min-entropy rate κ and the effectiveness factor γ. In theory, γ is determined by κ, n, ε. In practice, better, empirically validated values are estimated simultaneously for κ and γ. This works because in addition to min-entropy, κ induces the next-block-min-entropy guarantee for a fraction of γ blocks.
For every semantic data category, the following protocol estimates a pair of (κ, γ).
First, obtain a bit sequence s of size 1 GB by concatenating sampled < 1 MB segments from the target data category. Then, compress s into s′ using LZ77 [34] (s′ = s if s is already compressed). Since the ideal compression has |s′| equal to the Shannon entropy of s, the compression rate ′ and effectiveness factor γ γ , extract from s using RRB, with parameters κ, γ, and ε = 10^−20 and seed from the initial randomness. Apply NIST tests on the extracted bits for every (κ, γ) pair. If the amount of extracted bits is insufficient for NIST tests, then start over with an s twice as long. We call a pair (κ_0, γ_0) acceptable if NIST fails with frequency at most 0.25% for every run of RRB with parameters κ ≤ κ_0 and γ ≤ γ_0. This 0.25% threshold is conservatively set slightly below the expected failure probability of NIST on ideal random inputs, which is 0.27%. If (κ_0, γ_0) is a correctly estimated lower bound, then every estimate (κ, γ) with κ ≤ κ_0 and γ ≤ γ_0 is also a correct lower bound. Hence, the extraction with (κ, γ) should be random and pass the NIST tests. We choose the acceptable pair (if any) that maximizes the output length.
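The selection rule at the end of the protocol can be sketched as follows. The sketch assumes that the NIST failure frequency has already been measured for every (κ, γ) pair on a grid, and that the output length grows with the product κγ (as in m = γκn/2); the function names and the failure frequencies in the example are made up for illustration and are not measurements.

```python
def is_acceptable(failure_rate, kappa0, gamma0, threshold=0.0025):
    """(kappa0, gamma0) is acceptable if every tested pair that it dominates
    fails NIST with frequency at most the threshold (0.25%)."""
    return all(rate <= threshold
               for (kappa, gamma), rate in failure_rate.items()
               if kappa <= kappa0 and gamma <= gamma0)

def choose_pair(failure_rate, threshold=0.0025):
    """Return the acceptable pair maximizing kappa * gamma, or None."""
    acceptable = [pair for pair in failure_rate
                  if is_acceptable(failure_rate, pair[0], pair[1], threshold)]
    return max(acceptable, key=lambda pair: pair[0] * pair[1], default=None)

# Made-up failure frequencies on a 2 x 2 grid of (kappa, gamma) pairs.
rates = {
    (0.25, 1 / 32): 0.001,
    (0.25, 1 / 8): 0.002,
    (0.5, 1 / 32): 0.002,
    (0.5, 1 / 8): 0.010,   # above the 0.25% threshold
}
print(choose_pair(rates))  # (0.25, 0.125)
```

The dominance check mirrors the monotonicity argument in the text: a correct lower bound stays correct when either parameter is decreased, so a single failing pair below (κ_0, γ_0) disqualifies it.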
There is strong intuition in support of the correct operation of this protocol. First, the random sampling for s preserves the min-entropy rate with high probability [35]. Second, an extractor cannot extract almost-uniform randomness if the source has min-entropy much lower than the estimated one. Finally, NIST tests exhibit a certain ability to detect non-uniformity. Verification of the estimated parameters is done by cross-validation.
passes over two streams, for input length n, min-entropy rate κ, error tolerance ε and seed length d. RRB is also parametrized by the effectiveness factor γ as shown below.
Given n, ε, and the estimated κ, γ, we initially set k = κn, the output length m = γκn/2, and the number of super-blocks = from the initially generated randomness, and store it in local memory. We interpret y as
In Stage II, we compute the re-bucketing of shift(x_1, y_1), …, shift(x_b, y_b), which is stored on σ_2. The re-bucketing output is denoted by (z_1, …, z_{n/b}), where every z_j collects the j-th bit from all shifted super-blocks, i.e., z_j = (shift(x_1, y_1)_j, …, shift(x_b, y_b)_j). The re-bucketing of b super-blocks can be done with ⌈log_2 b⌉ iterations, where every iteration reduces the number of super-blocks by a factor of two by interlacing (with the help of σ_1) the first and second half of σ_2. In particular, the first iteration merges every pair of super-blocks into one sequence, which consists of n/b blocks (i.e., for j = 1, 2, …, n/b) each of length 2. During the ⌈log_2 b⌉ iterations, RRB spends 3⌈log_2 b⌉ passes to compute (z_1, …, z_{n/b}) and store it on σ_1.
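The re-bucketing step can be illustrated in memory on a toy input. The sketch below makes several assumptions that are not part of the description above: shift is modelled as a cyclic shift, b is a power of two, the whole input fits in a Python list, and all function names are hypothetical. It contrasts the direct definition of z_j with the iterative interlacing of the first and second half of the tape.

```python
def cyclic_shift(block, offset):
    """Assumed form of shift(x, y): rotate the super-block left by offset."""
    offset %= len(block)
    return block[offset:] + block[:offset]

def rebucket(super_blocks):
    """Direct definition: z_j collects the j-th item of every super-block."""
    length = len(super_blocks[0])
    return [[sb[j] for sb in super_blocks] for j in range(length)]

def rebucket_by_interlacing(super_blocks):
    """Same buckets via log2(b) interlacing rounds (b a power of two).
    Each round interleaves chunks of the first and second half of the tape,
    doubling the chunk size; items inside a bucket end up in permuted order."""
    b, length = len(super_blocks), len(super_blocks[0])
    tape = [item for sb in super_blocks for item in sb]
    chunk = 1
    while chunk < b:
        half = len(tape) // 2
        first, second = tape[:half], tape[half:]
        merged = []
        for start in range(0, half, chunk):
            merged += first[start:start + chunk]
            merged += second[start:start + chunk]
        tape, chunk = merged, 2 * chunk
    return [tape[j * b:(j + 1) * b] for j in range(length)]

# Toy input: 4 super-blocks of length 2, labelled by (super-block, position).
blocks = [[(i, j) for j in range(2)] for i in range(4)]
print(rebucket(blocks)[0])                 # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(rebucket_by_interlacing(blocks)[0])  # [(0, 0), (2, 0), (1, 0), (3, 0)]
```

Both routines produce the same buckets up to a fixed reordering inside each bucket. In the streaming setting each interlacing round is realized with a constant number of sequential passes over the two streams, which is where the 3⌈log_2 b⌉ pass count comes from.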
In the final stage, we output h(z_1), …, h(z_{b_O}), where h is a hash function realized through a Toeplitz matrix specified by y_0 from the seed and b_O = γn/b is the number of blocks used for the output. This m-bit-long output can be locally extracted with 2 passes. Therefore, RRB extracts m bits with κ 3 log  6 3 log log  2log  2log 3 1   n  2  2  2  2  2 passes. The local memory size is dominated by Stage I, which requires d + 2 log_2 n bits to store the seed and two counters for head positions.
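A Toeplitz hash of the kind used as the local extractor h can be sketched as a matrix-vector product over GF(2). The indexing convention, the function name, and the toy sizes below are assumptions of this illustration; the only property relied upon is that a Toeplitz matrix with in_len columns and out_len rows is constant along its diagonals and is therefore specified by in_len + out_len − 1 seed bits.

```python
import random

def toeplitz_hash(seed_bits, block, out_len):
    """Multiply the out_len x in_len Toeplitz matrix T over GF(2) by block,
    where T[i][j] = seed_bits[i - j + in_len - 1] (constant on diagonals)."""
    in_len = len(block)
    assert len(seed_bits) == in_len + out_len - 1
    out = []
    for i in range(out_len):
        acc = 0
        for j in range(in_len):
            acc ^= seed_bits[i - j + in_len - 1] & block[j]
        out.append(acc)
    return out

# Toy sizes: hash an 8-bit block down to 3 bits with a 10-bit seed.
rng = random.Random(0)
seed = [rng.randrange(2) for _ in range(8 + 3 - 1)]
z = [1, 0, 1, 1, 0, 0, 1, 0]
digest = toeplitz_hash(seed, z, 3)
print(len(digest))  # 3
```

Because the seed length grows only additively with the block and output lengths, the same h can be kept in local memory and applied to every output block during a sequential pass.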
The above description is for the estimated (κ, γ). If there is theoretical knowledge for κ and the error tolerance ε is given, then RRB provably extracts m = Ω(n) bits that are ε-close to uniform with =
Empirical statistical tests. Each statistical test measures one property of the uniform distribution by computing a P-value, which on ideal random inputs is uniformly distributed in [0, 1]. For each NIST test, subsequences are derived from the input sequence and P-values are computed for each subsequence. A significance level α ∈ [0.0001, 0.01] is chosen such that a subsequence passes the test whenever P-value ≥ α and fails otherwise. If NIST is testing ideal random inputs, then the proportion of passing subsequences has expectation 1 − α, and the acceptable range of proportions is the confidence interval chosen within 3 standard deviations. Furthermore, a second-order P-value is calculated on the P-values of all subsequences via a χ^2-test. An input passes one NIST test if (i) the input induces an acceptable proportion and (ii) the second-order P-value ≥ 0.0001. An input passes one DIEHARD test if its P-value is in [α, 1 − α].
We compare the statistical behavior of bits produced by our method with ideal random bits. For ideal random bit-sequences, α is the ideal failure rate. Anything significantly lower or higher than this indicates non-uniform input. In our tests, we choose the largest suggested significance level α = 0.01, i.e., the setting in which the tests are hardest to pass. On all tests, our extracted bits appear statistically identical to ideal randomness. See the Supplementary Information for details.
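The proportion criterion described above can be written out explicitly. The sketch below assumes the usual normal approximation, in which the acceptable range is (1 − α) ± 3·sqrt(α(1 − α)/s) for s tested subsequences; the function names and the choice s = 1000 are illustrative. It also evaluates the two-sided 3-standard-deviation tail of a normal distribution, which is about 0.27% and matches the expected failure probability used as a reference in the parameter estimation protocol, assuming that figure refers to this tail.

```python
import math

def proportion_interval(alpha, num_sequences):
    """Acceptable range for the fraction of passing subsequences:
    (1 - alpha) +/- 3 standard deviations under the normal approximation."""
    p = 1 - alpha
    half_width = 3 * math.sqrt(p * alpha / num_sequences)
    return p - half_width, p + half_width

def three_sigma_tail():
    """Probability that a normal variable deviates by more than 3 sigma."""
    return 1 - math.erf(3 / math.sqrt(2))

low, high = proportion_interval(0.01, 1000)
print(round(low, 4), round(high, 4))  # 0.9806 0.9994
print(round(three_sigma_tail(), 4))   # 0.0027
```

With α = 0.01 and 1000 subsequences, an ideal random input is expected to have between roughly 98.06% and 99.94% of its subsequences pass each test.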

Experimental platform details. The performance of the streaming RRB, von Neumann extractor, and Local Hash is measured on a desktop PC with an Intel Core i5 3.2 GHz CPU, 8 GB RAM, two 1 terabyte (TB) hard drives, and kernel version Darwin 14.0.0. The performance of Trevisan's extractor is measured on the same PC with the entire input and intermediate results stored in main memory. We use the following software platforms and libraries. TPIE [36] is the C++ library on top of which we implement all streaming algorithms; TPIE provides an application-level streaming I/O interface to hard disks. For arbitrary-precision integer and Galois field arithmetic we use GMP [37] and FGFAL [38]. Mathematica [39] is used for data processing, polynomial fitting, and plots. Source code is available upon request.