Design of a cryptographically secure pseudo random number generator with grammatical evolution

This work investigates the potential for using Grammatical Evolution (GE) to generate an initial seed for the construction of a pseudo-random number generator (PRNG) and cryptographically secure (CS) PRNG. We demonstrate the suitability of GE as an entropy source and show that the initial seeds exhibit an average entropy value of 7.940560934 for 8-bit entropy, which is close to the ideal value of 8. We then construct two random number generators, GE-PRNG and GE-CSPRNG, both of which employ these initial seeds. We use Monte Carlo simulations to establish the efficacy of the GE-PRNG using an experimental setup designed to estimate the value for pi, in which 100,000,000 random numbers were generated by our system. This returned the value of pi of 3.146564000, which is precise up to six decimal digits for the actual value of pi. We propose a new approach called control_flow_incrementor to generate cryptographically secure random numbers. The random numbers generated with CSPRNG meet the prescribed National Institute of Standards and Technology SP800-22 and the Diehard statistical test requirements. We also present a computational performance analysis of GE-CSPRNG demonstrating its potential to be used in industrial applications.


Scientific Reports
| (2022) 12:8602 | https://doi.org/10.1038/s41598-022-11613-x www.nature.com/scientificreports/ Grammatical Evolution (GE) [14][15][16][17][18] is a bio-inspired population-based Machine Learning (ML) tool that makes use of Backus-Naur Form (BNF) grammars 19 to generate legal structures for various problem domains. In this research work, we investigate GE as a potential entropy source to generate high entropy initial seed, as every evolutionary run is capable of producing a different seed. We propose a GE-based PRNG and GE-based CSPRNG, both of which can generate random sequences which adheres to meeting all standards of an excellent pseudo randomness.

Results
Generation of random sequences with control_flow_incrementor. The approach for generating random numbers for constructing CSPRNG is analogous to an incremental counter, where we define a loop to act as an incrementor counter and hence we name this approach control_flow_incrementor. This method uses a variable initialized to 0. To generate a new pseudo-random sequence, the current value of the variable is incremented and appended to the least significant bit (LSB) position of the initial seed. The resulting string is processed using a cryptographically secure hash algorithm such as SHA3-512 20 , depending on user requirement to make it secure against brute-force attacks. We can easily extend the size of the strings making it suitable for different applications with simple concatenation functions. For example, if we need a 1024-bit key, two 512-bit strings can be concatenated. As statistically inferred from our NIST experiments, we reseed our CSPRNG after generating every 4000 sequences for our 4096-bit CSPRNG to comply with the CSPRNG properties.
GE has the capability to generate n useful initial seeds depending upon the choices in production rules and the definition of production rule for pattern formation in the BNF grammar. We store all best individuals obtained during each run of GE in a repository. This can be useful in scenarios where we may not be able to obtain the best individuals in the first generation itself, given the inherent randomness of evolutionary computation. In such situations, we employ the mechanism of last in first out (LIFO) for seed extraction from the repository. We conducted an experiment to verify the potential of updating our entropy pool with unique seeds. In this experiment, 1,000 evolutionary runs resulted in 5076 unique seeds, in a time of 46.73 s, where each evolutionary run took an average time of 0.046 s. Randomness  In all 15 cases, α is defined to be 0.01, which indicates 99% probability of sequences to be random if the test is passed. The random sequences obtained by the control_flow_incrementor approach with GE were generated for 4096-bits as modern systems that use RSA 22 for encryption standards usually require keys of 4096 bits. The longest data stream of 947,200 random bits passed all the 15 tests after which point the reseeding was done to obtain new samples. In both the cases, i.e. single objective CSPRNG which fulfils entropy requirements and the many objective CSPRNG which fulfils all three desired characteristics, viz, entropy, randomness and autocorrelation the difference in results is observed due to the fact of different seeds used to obtain the random sequences. The results of the tests are given in Table 1. The randomness and entropy of the strings are validated as all the tests are passed by the generated output sequences.  Table 2. The 2-D spheres test was integrated with 3-D spheres and thus we tested our random sequences on 17 tests. On observing the results, out of 17 tests, all tests have been passed to meet out the diehard criteria, which ensures the randomness of the proposed PRNG.

Statistical test for GE based CSPRNG.
Security analysis of GE-CSPRNG. Reverse engineering is a crucial attack that exploits the inherent features of any design. It helps the attacker gain knowledge about the design and to analyse the weaker points of it. It leads to cloning of the design/device, fault injections, trojans, cryptanalysis etc. It has three fundamental phases namely information extraction, modelling and analysis. The proposed PRNG integrates a novel random seed generation with a SHA3-512 to achieve adequate randomness. As the proposed BNF grammar for GE-PRNG is flexible, the production rules can be extended by merely adding additional choices for symbols, thus having the potential to generate ∑2 n number of unique seeds, where n is the total number of choices for each of the production rules. The BNF grammar proposed in the article has the potential to generate 1,514,240 (as explained in Eq. (1)) unique seeds with the permutations of the production rules and available choices. Hashing the initial seed with SHA3-512 ensures it against session replay attacks and brute force attacks.
where ch(1)is the number of choices for rule1, ch(2)is the number of choices for rule2, etc.
The number of choices for each of the eight rules in this grammar are 1, 1, 1, 8, 26, 26, 28 and 10. This gives the possible permutations for the grammar as 1,514,240, which equates to the number of unique seeds for our grammar. Similarly, with this logic, we can generate a larger number of unique initial seeds simply by adding more combinations of choices in the BNF grammar by the inclusion of additional production rules. This feature shows the flexibility of GE-PRNG to be scalable for applications that cater to a large number of users needing unique initial seeds for generating random numbers.
The length of random numbers generated by a PRNG after which the sequences start to repeat themselves is called the period of the PRNG. The period of the proposed PRNG depends upon the limit of variables for the incrementor loop in the system program. For example, if we use long long int in C, the period for the PRNG would be 9,223,372,036,854,775,807, as that is the limit for a long long int in C.
Also, the proposed CSPRNG meets the Avalanche criteria in which the PRNG produces a new set of random values even for a single bit change in the inputs. Since the SHA3-512 has a strong foundation of irreversibility, it helps the PRNG to produce high quality irreversible random numbers. Hence, the problem of reverse engineering can be alleviated.

Analysis of computational performance of GE-CSPRNG.
To analyse the Encryption throughput (ET) and Number of cycles required for encrypting per byte (NpCB) for GE-CSPRNG we followed the same procedure as used by the authors 24 . By using the functionality available in Python's OpenCV library, we generated 10 plain www.nature.com/scientificreports/ RGB images of varying sizes. We then encrypted the images, by using the encryption method 25 , which is basic shuffling of pixels in the RGB images. The encryption was done with the keys obtained from GE-CSPRNG and using Python's rand() function for comparative analysis. Then we calculate ET and NPCB using Eqs. (2) and (3). Table 3 shows the comparative analysis of ET for ten RGB images with GE-CSPRNG and the Python rand() function, that is Mersenne Twister (MT). The comparative ET from both the approaches show that GE based PRNG is performing well in terms of speed and surpassing the one with Python's rand() function. The best ET and NpCB for GE-PRNG was found for the image with size 100 × 100, followed by 800 × 800.
The experiments were performed by employing Linux version 16.04, the GE tool, libGE-version 0.32 in C/ C++ from Biocomputing and Developmental Systems Research Group at University of Limerick, and Python 3.7 with the following configurations: Intel(R) Core (TM), i5-1035G1 CPU 1 GHz 8 GB RAM.
Simulation based tests for GE based PRNG. Random numbers find their applicability in various simulations and for validating ML models. We evaluated the potential of GE based PRNG for usage in simulations and sampling applications with two different methods, viz, Monte Carlo and fitting ML regression models.
Monte Carlo simulation for the estimation of pi. Monte Carlo 26 simulations are widely used for ensuring the quality of random numbers generated. They are used to validate whether a given functionality of PRNG successfully achieves its target goal by calculating the value of pi from the random numbers generated by the PRNG. The value of pi is estimated 27 with the help of Eq. (4) for calculating the radius of a circle.
where x and y are random numbers generated by GE-PRNG. 1,000,000 or more random numbers are generated in the range from 0 to 1 and the value for Eq. (5) is calculated. If the value of the equation is less than 1, the point is placed inside the circle, i.e., the random numbers pair is valid. If the value of the equation is greater than 1, the point is placed outside the circle and the point is discarded while calculating the value of pi. At the end of the prescribed 1,000,000 runs, the value is obtained by Eq. (5) and then it is compared with the actual value of pi. The closer the estimated value to the actual value of pi, the better is the performance of the PRNG. The value of pi noted with our observation over 1,000,000 runs is 3.146564000, while the actual value of pi is 3.1415926535 28 up to 6 decimal precision which gives a strong validation of randomness.
Coverage analysis. To validate the suitability of GE-PRNG for its potential use of generating samples for ML datasets, we generated samples across 16 real-world benchmark regression datasets with varying numbers of instances. The test evaluates the quality of random samples generated in terms of mean and standard deviation with respect to ground truth (entire data). The samples were generated for columns of varying lengths representing different data types such as categorical, integers and float. The samples were visualised by plotting histograms to analyse their distribution. A comparative analysis was then performed against the samples obtained from  www.nature.com/scientificreports/ Python's rand() function-based MT. We observed that the samples from GE-PRNG followed the same type of data distribution 29 as observed in the original datasets. This implied that in future GEPRNG can play a crucial role in synthetic data generation. The results of the analysis are presented in Supplementary Table S1.

Discussion
There have been several PRNGs proposed in the literature that incorporate bio-inspired methods, such as GP and Genetic Algorithms (GA) 30 . For example, Koza 31 used GA to transform a seed J into a PRNG using a set of standard arithmetic operators and a parameter set of {population size: 500, crossover: 0.9 and mutation: 0}.
The resulting PRNG has been tested digitally on software and passed both Gap-Measure and Chi-square tests. Poorghanand 32 employed a GA using 16 Linear Feedback Shift Registers with XOR and inverse-XOR gates to generate high entropy 128-bit random numbers and successfully passed the NIST suite of tests. A Hebbian neural network initialised by a GA was proposed by Jhanjharia 33 . With a parameter set of {population size: 50, crossover: 0. and mutation: 0.05} this was capable of producing 192-bit output and successfully passed each of the Cumulative Frequency, Gap-Measure and Chi-square tests. More recent work by Kosemen 34 , extended Koza's work and resulted in a design that passed the NIST suite of tests, producing output within 0.24960 s. This used the following parameters: {population size: 50, crossover: 0 and mutation: 0.5}. Due to GE's modular structure, it is highly convenient as an alternative search strategy, whether evolutionary, deterministic or any other. Such a flexible approach to Genetic Programming (GP) makes GE a robust machine learning tool that can be applied to a diverse set of problem domains. Moreover, GE's potential as an entropy source for a secure CSPRNG, using the basic BNF grammar, makes it possible to arm GE with problem-specific constructs as demonstrated with our approach. Further, GE-PRNG holds the promise of a potential rich source for generating synthetic datasets for fitting ML regression models. Due to the ability to customize the production rules of BNF grammar, GE-PRNG can serve applications requiring one-time password (OTP) generation and the keys obtained with GE-CSPRNG can be employed for storage encryption, network encryption, confidential compute, etc. We illustrate in Supplementary Table S2, a comparative analysis of GE-based CSPRNG with respect to existing software and hardware.

Conclusion
We have presented a combined application of GE and our novel approach, control_flow_incrementor, for the design of a basic and cryptographically secure pseudo random number generator. We used control_flow_incrementor to generate random numbers and GE as the entropy source to return an initial seed. The validations of the CSPRNG with NIST SP800-22 and Diehard battery of tests indicate the feasibility of GE as a source of initial seeds leading to the efficient construction of CSPRNG. By utilizing the seed repository for reseeding the initial seed, our CSPRNG is able to generate highly uncorrelated random sequences at a faster rate with minimal computational costs, making it highly efficient for securing sensitive data. Monte Carlo simulations were performed to validate the quality of random numbers for sampling with GE-PRNG. Furthermore, extending production rules with additional choices will make it adaptable across a wide range of industrial applications.

Methods
This section discusses the procedure of obtaining initial seeds for the design of PRNG and CSPRNG with GE. The functionality of GE is based upon the generation of structures described by a formal language grammar, typically in the form of a BNF. At its heart is a GA that searches through the space of syntactically legal structures. The GA individuals, or genomes, are variable length binary strings that get mapped onto a phenotype, the actual structure being created, which is typically a program or some other, complex structure. The mapping process is guided by the grammar, which, in BNF can be represented as the tuple {S, N, T, P}, where S is a start symbol, T is the set of terminals, that is, items which can appear in the language. N is the set of non-terminals, temporary items to facilitate the mapping process, and P, a set of production rules that expand S into a legal structure consisting of only terminals. Figure 1 shows a set of production rules that map the start symbol < expr > to a seed, which is made up entirely from terminals. GE operates by evolving populations of these individuals, each of which is tested for fitness, i.e. how well they perform at the task in hand, entropy in this case. These individuals are then probabilistically selected for crossover and mutation, based on their fitness, and a new population is created.
As shown in Fig. 1, the GE-based PRNG takes a 512-bit initial seed and, combined with the proposed approach, control_flow_incrementor, generates random numbers. Our control_flow_incrementor follows the logic of an incrementor counter initialized at 0. With each iteration, the value of the counter is incremented by any predecided number (we have used 1, although this can be any value) and appended to the initial seed obtained from the entropy source. This intermediate string is then hashed with SHA3_512 to generate secure random numbers.
The evolutionary process that leads to initial seeds with GE can be summarized in the following steps: 1. Initialise evolutionary parameters {population, generations, crossover, mutation}; 2. Obtain binary string from BNF grammar; 3. Evaluate the individuals with a fitness function 35 using Shannon's entropy 36 equation (7) to get the fitness scores; 4. Generate new population for next generation using crossover and mutation; 5. Repeat until there is no improvement in fitness. We fulfill all the requirements of a desirable PRNG by mapping each of characteristics with a fitness function as seen in Table 4. We use the Shannon's entropy as a fitness function F1 to obtain high entropy initial seeds. The fitness function Hamming Distance, F2, ensures that unique seeds are generated with each evolutionary run. These seeds are subsequently stored in a repository to reseed the PRNG after a defined period. Period of a PRNG is the number of pseudorandom sequences after which the PRNG needs to be reseeded to avoid repetition in the output sequences, thus maintaining randomness. Moreover, these seeds from the repository can also be used as an entropy source thus avoiding evolutionary runs for seed generation. Similarly, the fitness function, F3, Autocorrelation, ensures no repetition of patterns within the same seed.
A fitness function, in the context of evolutionary computation, is the measure of how good a solution is for a given problem. The fitness function, F1, defined in Eq. (7) is based upon Shannon's entropy function as defined in Eq. (6).
w h e r e N is the total number of permutations possible in the binary sequence of the initial seed(e.g.for 1 bit , N is 2 {0, 1}and for 2 bit, N is 4 {00, 01, 10, 11}) . pi is the frequency of the patterns of 0 s and 1s in the binary sequence. n is the number of bits in the range [1 − 8].
The fitness function in Eq. (7) is the summation of all the n-bit entropy values H 1 …..H 8 , divided by n, i.e. number of bits 30 . The division by n ensures equal contribution of all n-bits entropies in the final calculation of fitness. The sample example of an initial seed returned from GE, below, depicts the procedure followed to calculate the entropy. 1001110101101100100110001110011011111111111011001001000110001111101000001100001001011111 1010110101001110000111101001111000010100100010010000101101010001101110100001001010001101110
Single objective fitness function. To generate high entropy initial seeds, we first performed preliminary experiments with GE on a variety of different BNF grammars with fitness function F1. We obtained an entropy of 7.31 in version 1 with the evolutionary parameters {population size: 10, generations: 15, crossover: 0.9, mutation: 0.04} and subsequent modifications to the production rules and evolutionary parameters led to the optimal parameters for our fifteenth version as seen in Fig. 2a and details of entropy improvements depicted in Supplementary Table S3.
The optimal evolutionary parameters obtained were as follows: {population size: 2, generations: 5, crossover: 0.0001, mutation: 0.01}. The initial seed obtained from GE is subsequently used to generate random sequences with the control_flow_incrementor approach. This strongly supports the use of GE as a high-quality entropy source for initial seeds which is independent of hardware or software requirements. The production rules in grammar can be used to yield arbitrarily large output random sequences, thus making it easily adaptable across diverse applications. Hence, we propose GE based PRNG to be suitable for domains such as sampling Machine Learning (ML) models, generating One-time Passwords (OTP), and to be used as a CSPRNG. The genetic operators, mutation and crossover, were derived after performing a series of experiments over 100 runs. Figure 2b Figure 3a, To achieve all the characteristics of a CSPRNG, we used the combination of all the three fitness functions as a single setup combining F1, F2 and F3. For experimentation purposes, we used NSGA-II with pymoo 37 to achieve many-objective fitness functions. The seed obtained from a many-objective fitness function will be the one with the maximum entropy, the maximum hamming distance and the minimum autocorrelation. The optimal solution obtained for three objectives was evaluated with a hypervolume indicator from a reference point [7.95, 298, 0.06]. The hypervolume indicator 38 is a measure for many-objective optimization where the volume enclosed between a reference point and a Pareto front is calculated. A reference point 39 is the maximum/minimum point obtained in the series or runs for an evolutionary experiment. The smaller the volume, the more optimal are the solutions obtained from the runs. Figure 3c illustrates Figure 2. Tuning of BNF grammar and genetic operators, mutation and crossover, to obtain the optimal entropy. In all combinations of the BNF Grammar as depicted in (a) where 20 generations were sufficient to obtain the optimal entropy. The initial seed leading to optimal entropy using the 15th version of BNF Grammar was secured with SHA3-512, whereas in the previous 14 versions, the hashing algorithm SHA2-512 was used to secure the initial seed. In all of the cases, the entropy remained in the range 7.92 to 7.94. (b) Values for crossover were varied from 0.0001 to 0.1 while keeping mutation fixed at 0.01. With the value of 0.0001, the optimal entropy remained consistent and the time taken to obtain the entropy was also consistent for that value. Similarly, in (c), we vary the mutation rate from 0.01 to 0.5 while keeping crossover fixed at 0.001. In this case, the most consistent value for time taken was exhibited by a mutation rate of 0.01. Finally, these two values were used with the 15th BNF grammar version to produce an optimal entropy value of 7.940560934 from 100 runs as illustrated in (d).
Reference point (7.95, 288, 0.061) (a) (b) (c) Figure 3. Results of the fitness evaluations with multi-objective and many-objective fitness functions for a 512-bit seed generated by GE-PRNG. (a) The Pareto fronts for a multi-objective fitness function made up of F1, entropy, and F2, Hamming distance. This indicates the maximum dissimilarity between any pair of strings generated by GE-PRNG while maintaining the optimal achieved entropy for a particular generation; (b) shows the Pareto fronts for a multi-objective fitness function made up of F1, entropy, and F3, Autocorrelation. This experiment validates minimal autocorrelation within a string without impacting the optimal achieved entropy.
Finally, (c) shows the hypervolume for a many-objective fitness function made up of all three, F1, F2, F3. The combination of all three objectives produces a solution that meets all the prescribed characteristics of an PRNG.

Data availability
All data generated or analysed during this study are included in this published article and its Supplementary Information file. The tool libGE -version 0.32 used in the experimentation will be made available upon request to the corresponding authors.