Inference of Large-scale Time-delayed Gene Regulatory Network with Parallel MapReduce Cloud Platform

Inference of gene regulatory network (GRN) is crucial to understand intracellular physiological activity and function of biology. The identification of large-scale GRN has been a difficult and hot topic of system biology in recent years. In order to reduce the computation load for large-scale GRN identification, a parallel algorithm based on restricted gene expression programming (RGEP), namely MPRGEP, is proposed to infer instantaneous and time-delayed regulatory relationships between transcription factors and target genes. In MPRGEP, the structure and parameters of time-delayed S-system (TDSS) model are encoded into one chromosome. An original hybrid optimization approach based on genetic algorithm (GA) and gene expression programming (GEP) is proposed to optimize TDSS model with MapReduce framework. Time-delayed GRNs (TDGRN) with hundreds of genes are utilized to test the performance of MPRGEP. The experiment results reveal that MPRGEP could infer more accurately gene regulatory network than other state-of-art methods, and obtain the convincing speedup.

Inferring gene regulatory network (GRN) is the primary and important biochemical network, which contains the regulatory relationships among genes, proteins and small molecules 1,2 . To infer and analyze gene regulatory network could understand the intracellular physiological activity and function of biology, interaction in the pathway and how to make the organism change [3][4][5] . Time delay is a very important characteristic in biological regulation mechanism, especially for regulation process 6,7 . The proteins translated by transcription factor (TF) regulate the target gene. This regulation process requires a time lag, which involves the regulation of protein translation, folding, nuclear transport, turnover, and the extension of the target mRNA 8,9 . Thus time-delayed factor is critical to gene regulation process. Inferring time-delayed GRN (TDGRN) is one of the major hotspots in system biology 10 .
To design gene regulatory network modeling methods need to consider time-delayed factor. The time-delayed versions of GRN modeling methods have been proposed. Li et al. proposed a unified approach based on time-delayed correlation algorithm for design of time-delayed gene expression matrix and inference of TDGRN 11 . Ngom et al. proposed a new extending version of Bayesian network, namely Max-Min high-order dynamic Bayesian network, to model the time lags between TFs and target genes 12 . Chueh and Lu presented a new method based on time-delay Boolean networks to infer biological pathways 13 . Kordmahalleh et al. proposed a hierarchical recurrent neural network (HRNN) to identify TDGRN with time series gene expression data 14 .
To understand deeply the specific mathematical relationships between TFs and target genes, differential equation model was proposed to infer GRN [15][16][17][18] . Some research added time-delayed factor into differential equation model for TDGRN inference. Chowdhury et al. presented time-delayed S-System (TDSS) model to identify simultaneously both instantaneous and time-delayed interactions of TDGRN 19 . But in Chowdhury's method, differential evolution (DE) algorithm was utilized to optimize all parameters in a TDSS model, and the computing load is very large for the large-scale GRN. In order to reduce computing load, we proposed restricted gene expression programming (RGEP) and particle swarm optimization (PSO) to evolve the TDSS model 20 35 .
In order to decrease the computing cost of large-scale TDGRN identification, this paper proposes a novel MapReduce-based parallel restricted gene expression programming (MPRGEP) algorithm for TDSS model identification. In order to evolve the structure and parameters of TDSS model simultaneously, the structure and parameters are encoded as a chromosome in MPRGEP algorithm. According to partition number, split chromosome population over a cloud computing system's nodes. At each cloud computing node, sub population is optimized iteratively by a novel hybrid evolutionary method based on gene expression programming and genetic algorithm. Then merge them to save as offsprings.

Method
Mapreduce overview. Storage, pretreatment and analysis of biological high-throughput sequencing data have gradually become the main bottleneck of system biology research [36][37][38] . Hadoop has provided a new solution for big data processing. Hadoop is open-source distributed computing system based on Hadoop Distributed File System (HDFS) and MapReduce framework, and applied to the storage, management and analysis of massive data [39][40][41] . HDFS is distributed file system, which is utilized to store massive data. MapReduce model is a software framework for big data processing in parallel. MapReduce framework is completed by Map and Reduce operation units, which is described in Fig. 1. In Map phase, input data could be divided into m data blocks. Computing nodes calculate Map function in parallel. The pair output <key, value> of Map function is stored in each computing node. In Reduce phase, all the intermediate results are combined according to key values and generate the final output, which are stored in HDFS.

MapReduce-based restricted gene expression programming algorithm. Time-Delayed S-system.
Due that time-delayed S-system has high accuracy and flexibility, and contains time-delayed factors, which is suitable for modeling time-delayed systems. t-th nonlinear time-delayed differential equation in TDSS model is described as followed 42 .
time point, τ g ij and h ij τ are the time-delayed factors, N is the total number of genes in TDGRN, α i and β i are rate constants of production function and consumption function, g ij and h ij are kinetic orders.
Chromosome of restricted gene expression programming. In order to better represent and evolve TDSS model, the restricted version of GEP (RGEP) was presented 43 . In RGEP each chromosome of RGEP contains only two genes. An example of RGEP chromosome is described in Fig. 2. The subtraction operator (−) is utilized to connect two genes. Each gene contains head part and tail part, which are created randomly using function set (F) and variable set (T).
Where *n represents the multiplication of n operands, represents the input variable and R denotes constant. In each gene, the symbols of head part can be selected from set I 1 randomly. The tail part is created randomly with variable set T only. In advance, the head length (h) is specified for the problems solved. The tail length (t) is calculated according to h.
Where n represents the largest number of the operands of functions in set F. TDSS model has three kinds of parameters: rate constants (α i and β i ), kinetic orders (g ij and h ij ) and time-delayed factor ( g ij τ and τ h ij ), so we add these parameters into the chromosome in RGEP. As shown in Fig. 2, gene1 and gene2 are given α i and β i , respectively. In each gene, kinetic order (g ij or h ij ) and time-delayed factor (τ g ij or τ h ij ) need to be allocated to each terminal node. Figure 3 describes the arithmetic expression trees (ETs) of Fig. 2. Its decoding differential equation expression is shown as follows.
Hybrid evolutionary method. In order to search the optimal TDSS model, an original hybrid optimization approach based genetic algorithm [44][45][46] and gene expression programming [47][48][49] is proposed in REGP. The structure and parameters in a TDSS model need to be optimized, which are shown in Fig. 3. In our hybrid evolutionary method, two genes of RGEP and parameters are encoded into one chromosome, which is depicted in Fig. 4. One chromosome contains three kinds of encoding forms. In Fig. 4, gene1 and gene2 are structure-based encoding, rate constants (α i and β i ) and kinetic orders (g ij and h ij ) are real-based encoding, and time-delayed factors (τ g ij and τ h ij ) are binary-based encoding. Single evolution strategy could not reach the optimization purpose, so a hybrid evolutionary method is utilized to reproduce the chromosomes.
(1) Mutation. Mutation probability p m is defined in advance. According to the encoding case, three mutation strategies are utilized, which are introduced as followed.
(1) Structure-based mutation • Single-point mutation. The symbols in the head part could be changed to any symbol, which is selected from set I 1 randomly. The symbols in the tail part can only be changed into a symbol from variable set T. Therefore, single-based mutation could create the legal offspring. • Single-gene mutation. One gene in a chromosome is selected by random, which is replaced by the new gene.
• Change all the variables. All terminal symbols in the structure-coding region are replaced with another terminal symbols.
(2) Real-based mutation For each real value X in the real-coding region, create a real value r in the interval [0, 1] randomly. If r < p m , real value X could be mutated with the following Equation.

(3) Binary-based mutation
For each binary value in the binary-coding region, generate a real value r in the interval [0, 1] randomly. If r < p m , the corresponding binary value is inverted.
(2) Crossover. According to the encoding case, three crossover strategies are utilized. First two parents (X and Y) are chosen with the crossover probability p c , which is defined in advance.
(1) Structure-based crossover • Single-point recombination. A random point is selected from the structure coding region. Exchange the symbol operators of two parents, which are after this point. • Single-gene recombination. Two random genes chosen from two parents are swapped.
(2) Real-based crossover Two parents (X and Y) implement crossover operator with following Equation.
. 0 99 t γ is a variable related to iteration number t. This strategy can change the individuals with a wide range in the early stage of optimization, and protect the better individuals in the later stage.   is depicted in Fig. 5. Decomposition strategy is utilized. From G 1 to G n , regulatory relationships of each gene are identified by optimizing the TDSS models.
containing structure and parameters. The chromosome structure is described in Fig. 5. (2) The fitness values of all the chromosomes are calculated. If the optimal model is found, stop; otherwise go to (3). (3) The hybrid evolutionary method is utilized to create the offsprings, which is introduced in Section 2.2.3.
According to encoding type, select different crossover and mutation strategies. Go to (2).
Through the optimized TDSS model, gain the regulatory relationships of each gene. Finally the regulatory relationships of all genes constitute gene regulatory network.

MapReduce-based hybrid evolutionary method.
To infer large-scale gene regulatory network and reduce high computation load, our hybrid evolutionary method based on Hadoop MapReduce framework is proposed. This framework distributes evolutionary tasks to Map and Reduce modules. Figure 6 shows the hybrid evolutionary framework with the Hadoop MapReduce model.
(1) Input data. The input data are stored on the HDFS, which contain two types of data. The first type of data is chromosome information including the structure and parameters. The second type of data is the fitness value of the corresponding chromosome.  file, the framework divides the chromosome population into computation nodes (Mappers) in order to achieve parallel computing. In order to realize the crossover operation between chromosomes, we randomly divide the population into different partitions. The chromosomes with the same partition id could implement crossover operator. The number of partitions k is defined in advance. The partition id of chromosome partition id _ is generated randomly, which is set as the key output of Map phase. The chromosome, fitness f i and total fitness value sum_f are set as the value output of Map stage.
(3) Reduce phase. The input data in Reduce phase are from Map phase. After the complete execution of the corresponding Map nodes, the Reduce phase can be executed. In the Reduce phase, the chromosomes with the same partition id _ are collected into a group, obtaining a sub population. The optimization tasks of sub population are distributed to the same computational node (Reducer). With f i and sum_f, roulette sampling algorithm is utilized to create the offsprings. The individuals in sub population could implement crossover and mutation operator. The gained sub offsprings and fitness values are written to output file of the Reduce phase in order to update the input data on the HDFS. If the number of iterations reaches the termination condition, the algorithm is terminated; otherwise, go to the Map phase.

Experiments
Our proposed parallel algorithm MPRGEP is implemented on MapReduce framework. The hadoop version is 2.6.2 and hadoop cluster consists of one master and 30 slaves. The infrastructure hardware of all nodes is comprised of 3.5 GHz Intel Xeon E5-1620 CPU, 4GB DDR2, and Linux CentOS 6.4 (64-bits). The nodes are connected by local area network with transmission speed of 1,000 Mbps. Three criterions are utilized to evaluate the performance of MPRGEP.
Where TP, FN, FP and TN are presented in Fig. 7.
Artificial dataset. In this part, the parameters of experiments are shown in Table 1, which are selected empirically. The used function set is {*2, *3, *4, *5}. The first artificial dataset is from a 30-gene time-delayed GRN, which is shown in Fig. 8 19,20 . Kimura's method (S-system model based on decomposition strategy and a cooperative coevolutionary algorithm) 21 , DBN (dynamic Bayesian network learned by the likelihood maximization) 22 and TDSS (time-delayed S-system model based on PSO) 23 are also applied for 30-gene artificial TDGRN identification. The averaged performance results of four inferred algorithms are represented in Table 2. From Table 2, it could be seen that MPRGEP has a higher sensitivity (S n ) than other three methods, which reveals that our method can infer more true-positive regulatory relationship. MPRGEP could identify less false-positive regulatory relationships.   The open-source software GeneNetWeaver 3.1 is utilized to generate three yeast S.cerevisiae sub gene regulatory networks with 50 genes, 100 genes and 150 genes, respectively. Time-delayed regulatory relationships are created randomly and time-delayed values are selected from [0, 3]. Three time-delayed gene regulatory networks are described in Table 3. Initial conditions are randomly generated. For each network, 10 time-series datasets are generated and each dataset contains 21 time points from 0 to 20.
Our method is executed in the single machine and computing clusters with 20 computing nodes, respectively. Through several runs, the averaged performances are listed in Table 4. From the inference results, we know that MPRGEP not only can solve large-scale time-delayed gene regulatory network, but also perform well in terms of S n and S p . Table 4 also reveals that MapReduce framework could reduce running time of GRN inference, which makes it possible to identify large-scale GRN with more genes.
In order to validate the parallel computing performance, MPRGEP algorithm is utilized to identify three above time-delayed GRNs in three computing clusters with 10, 20 and 30 nodes, respectively. The runtime and speedup performance are depicted in Figs 9 and 10. From Fig. 9, it could be seen clearly that as the number of genes rises, the running time also rises. With the increment of computing nodes, the running time decreases. Figure 10 shows that as the number of computing nodes increases, our proposed parallel algorithm accelerates significantly. The best speedup performance of MPRGEP is the case that MPRGEP is run on 30 computing nodes to infer GRN with 150 genes. The speedup curve is not linear because of serial bottlenecks and infrastructure barriers in MapReduce framework.   In MPRGEP, the computational tasks of hybrid evolutionary algorithm are mainly concentrated in the Reduce phase. The sub population with the same partition id will be assigned to the same Reduce for optimization. If the number of Reducers is fixed in advance, the number of partitions can affect the speed of parallel computation.   We make the experiments with three partition numbers, 1, 200 and 1000. Node number in the computing cluster is set as 20. The running time is depicted in Fig. 11. From the result, we can see that the hybrid evolutionary algorithm performs best when partition number is set as 200. When the partition number is 1, the sub population contains all the population and is optimized in one Reducer. Parallel strategy doesn't work. When the number of partition number is given to 1000, the number of sub populations is too large. In this case, more Reducers are needed. The allocation and merging of resources could waste more time.
Real biology dataset. In this part, the dataset is from the Gene Expression Omnibus (GEO) at http://www. ncbi.nlm.nih.gov/geo/ (GEO accession: GSE30052) 34,50 . This dataset contains 5,744 probe sets, 10,928 genes and 49 time points. In order to validate the parallel performance of MPRGEP, one subset from this dataset is extracted, containing 500 genes. The experiment is executed in the computing clusters with 20 nodes. The parameters are also from Table 1. The running results are described in Fig. 12. From Fig. 12, it can be seen that our method could be accelerated evidently.

Discussion and Conclusion
With the rapid development of biotechnology, gene regulatory networks inferred contain more genes, so there is necessity for developing advanced computational algorithm to infer gene regulatory network with gene expression data. This paper proposes time-delayed S-system model to model instantaneous and time-delayed regulation interactions in time-delayed gene regulatory network. A novel MapReduce-based parallel restricted gene expression programming (MPRGEP) algorithm is utilized for TDSS model identification. The experiment results reveal that our parallel algorithm is promising in terms of accuracy and speedup when used to infer large-scale TDGRN.