Bacterial Foraging Optimization-Genetic Algorithm for Multiple Sequence Alignment with Multi-Objectives

This research work focuses on multiple sequence alignment (MSA), because developing an exact multiple sequence alignment for different protein sequences is a difficult computational task. A hybrid algorithm named Bacterial Foraging Optimization-Genetic Algorithm (BFO-GA) is proposed to improve the multi-objectives and the performance measures of multiple sequence alignment. The proposed algorithm employs three objectives: minimization of the variable gap penalty, maximization of similarity and maximization of the non-gap percentage. The proposed BFO-GA algorithm is compared with various MSA methods, namely T-Coffee, Clustal Omega, Muscle, K-Align, MAFFT, GA, ACO, ABC and PSO. The experiments were carried out on four benchmark databases, BAliBASE 3.0, Prefab 4.0, SABmark 1.65 and Oxbench 1.3, and the outcomes show that the proposed BFO-GA algorithm obtains statistically significantly better results than the other well-known methods. This study also evaluates the practicability of the BFO-GA alignments by using the resulting alignments to predict phylogenetic trees with the ClustalW2 Phylogeny tool and comparing them with those of the existing algorithms using the Robinson-Foulds (RF) distance metric. Lastly, the statistical significance of the proposed algorithm is computed using the Wilcoxon Matched-Pair Signed-Rank test, which also indicates better results.

In bioinformatics, sequence alignments are used to show evolutionary relationships by constructing phylogenetic trees. Sequence alignment and phylogenetic analysis are strongly related because both measure the relatedness of homologous sequences. A protein sequence consists of amino acids that are linked to each other. Sequence alignment describes the arrangement of protein sequences in order to identify the regions of similarity among them 1. Aligning refers to matching as many characters as possible from each sequence. Primarily, sequence alignment is applied to infer functional, morphological and evolutionary relationships between protein sequences. The alignment is also used to find the level of similarity between a query sequence and different database sequences.
Today, several sequence alignment techniques are available, and this study concentrates on multiple sequence alignment. One of the fundamental problems in computational biology is the alignment of multiple DNA or protein sequences. The computational approaches used to align protein/DNA sequences generally fall into two categories: global and local alignments 2. Multiple sequence alignment comes under the category of global alignment and is an extension of pairwise alignment that incorporates more than two sequences at a time. Various methods have been implemented for MSA, and these approaches fall into three major classes: dynamic programming, progressive and iterative methods. The proposed BFO-GA algorithm comes under the category of iterative-progressive methods, so as to incorporate the advantages of both. The remainder of this study is organized as follows: Section 2 reviews the background of several methods for solving multiple sequence alignment, Section 3 describes the MSA multi-objectives and optimization methodology, Section 4 presents the proposed algorithm, Section 5 reports the experimental outcomes for the benchmark databases and finally Section 6 gives the conclusion and the scope for further enhancement.
The major performance measures used for multiple sequence alignment are the Sum of Pairs (SP) and Total Column (TC) scores. The SP is calculated from the matched residues of the protein/DNA sequences, the gap penalties are determined by mismatched residues or occurrences of gaps, and the similarity is assessed by the substitution matrix score. The substitution matrix is 20 × 20 for protein sequences and 4 × 4 for DNA sequences, representing all possible transitions between residues. Two common families of substitution matrices are available: Percent Accepted Mutation (PAM) and BLOcks Substitution Matrix (BLOSUM). Different versions of these matrices exist, such as BLOSUM 30, BLOSUM 45, BLOSUM 62, BLOSUM 80, PAM100 and PAM200. In this study, the similarity value differs from the substitution matrix score, which gives a numerical score for matches and mismatches of residues 35.
Multiple sequence alignment is a complicated problem that involves three distinct difficulties: the choice of the sequences, the choice of an objective function and the optimization of that function. In the proposed BFO-GA algorithm, the sequences are chosen based on the non-dominated optimal solution using the crowding distance measure, and the optimization of the function is performed by the BFO-GA algorithm.

Multi-Objectives and Optimization
In this research work, a multi-objective hybrid algorithm named Bacterial Foraging Optimization-Genetic Algorithm is proposed for multiple sequence alignment problems. Typically, the Sum of Pairs (SP) and Total Column Score (TCS) performance measures are used to find the optimal solution of the MSA problem. This study concentrates on three objective functions to determine the optimal solution: maximization of similarity, minimization of the variable gap penalty and maximization of the non-gap percentage.
Similarity. The position weight matrix is computed from the resulting alignment solution. The dominance value (ce) of the dominant amino acid or nucleotide in each column is defined as follows:
$$ce(y) = \max_{x} \{ f(x, y) \}, \quad 1 \le y \le h \qquad (2)$$
where f(x, y) is the score of amino acid or nucleotide x in column y of the position weight matrix regardless of the presence of gaps, h is the length of the sequence alignment and ce(y) is the dominance value of the dominant amino acid or nucleotide in column y. The similarity SM of the alignment is defined as the average of the dominance values of all columns in the position weight matrix, as expressed in Eq. 3.
$$SM = \frac{1}{h} \sum_{y=1}^{h} ce(y) \qquad (3)$$
The candidate alignment with the greatest similarity SM is identified as the best alignment; the closer the similarity value is to 1, the better the alignment. The similarity is computed over all the sequences in an alignment.
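The similarity objective can be illustrated with a minimal sketch. It assumes that f(x, y) is the relative frequency of residue x in column y, with gaps counted in the denominator; the function name `similarity` and the use of '-' as the gap symbol are illustrative choices, not taken from the original implementation.

```python
from collections import Counter


def similarity(alignment):
    """Average dominance value over all columns of an alignment.

    alignment: list of equal-length strings, '-' marks a gap.
    Assumption: f(x, y) is the relative frequency of residue x in
    column y, with gapped rows still counted in the denominator.
    """
    if not alignment or not alignment[0]:
        raise ValueError("alignment is empty")
    h = len(alignment[0])  # alignment length (number of columns)
    if any(len(row) != h for row in alignment):
        raise ValueError("all rows must have the same length")
    n = len(alignment)
    total = 0.0
    for y in range(h):
        # Count residues in column y, ignoring gap symbols.
        counts = Counter(row[y] for row in alignment if row[y] != "-")
        # Dominance value of the most frequent residue in this column.
        ce = max(counts.values()) / n if counts else 0.0
        total += ce
    return total / h
```

Under this assumption a fully conserved alignment scores 1, which matches the statement that values nearer to 1 indicate a better alignment.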
Gap penalty. A gap is an artificial insertion or deletion (indel) placed into a sequence to move similar segments of residues into alignment. Gaps that occupy the same column carry no information and are not taken into account. Different types of gap penalty scores are available, such as constant, linear, convex, affine and profile-based variable gap penalties. In this work, affine and variable gap penalty scores are calculated for the existing and proposed algorithms (Genetic Algorithm, Ant Colony Optimization, Artificial Bee Colony, Particle Swarm Optimization and BFO-GA) to obtain better outcomes.
Affine gap penalty. Insertions and deletions are scored using an affine gap penalty that penalizes a gap once for opening and then in proportion to its length. Two parameters are applied, namely gap opening and gap extension 40. The affine gap penalty in the pairwise alignment of rows x and y is given by
$$Gap_{xy} = Gap_{open} + g \times Gap_{extend}$$
where Gap_open is the cost of opening a gap, Gap_extend is the cost of extending a gap by one more space and g is the length of the gap string. Optimizing the affine gap penalty groups the gaps together, which minimizes the penalty.
For the variable gap penalty, the gap opening penalty (GOP) is calculated as
$$GOP = \{ GOP + \log[\min(R, T)] \} \times n \times m$$
where R and T are the lengths of the two sequences, n is the average residue mismatch score and m is the percent identity scaling factor. The gap extension penalties (GEP) are recalculated based on the following elements.
• Depending on the difference in the lengths of the sequences:
$$GEP = GEP \times [1.0 + |\log(R/T)|]$$
where R and T are the lengths of the two sequences.
• Position-specific gap penalties (counting the frequency of each residue at either end of gaps in alignments, store in
• Increased gap penalties near existing gaps.
• Hydrophilic positions: a value of 0.5 is used if there is a hydrophilic residue at the xth position (9).
• Residue-specific penalties (where there is no hydrophilic stretch and no gap, the GOP is multiplied by one of the 20 numbers), where S_x is the value of the residue located at the xth position of sequence S in the residue table.
Finally, the GOP and GEP are calculated based on Eqs. 7, 8, 9 and 10.
Based on these factors, the variable gap penalty is applied to the input protein/DNA sequences.
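As an illustration of the affine scheme described above, the following sketch scores one pair of aligned rows. It assumes that each maximal run of g gaps costs the opening penalty plus g times the extension penalty, and that columns in which both rows hold a gap are ignored; the function name and the default penalty values are hypothetical and are not the settings used in the experiments.

```python
import re


def affine_gap_penalty(row_x, row_y, gap_open=10.0, gap_extend=0.5):
    """Affine gap penalty for the pairwise alignment of two rows.

    Assumption: each maximal run of g gaps costs
    gap_open + g * gap_extend. Columns where both rows hold a gap
    are dropped first, since they carry no information.
    """
    if len(row_x) != len(row_y):
        raise ValueError("rows must have the same length")
    # Remove columns that are gaps in both rows.
    pairs = [(a, b) for a, b in zip(row_x, row_y) if not (a == "-" and b == "-")]
    reduced_x = "".join(a for a, _ in pairs)
    reduced_y = "".join(b for _, b in pairs)
    total = 0.0
    for row in (reduced_x, reduced_y):
        for run in re.finditer(r"-+", row):
            g = len(run.group())  # length of this gap string
            total += gap_open + g * gap_extend
    return total
```

With these hypothetical values, two adjacent gaps cost less than two isolated gaps, which reflects the remark that grouping gaps together minimizes the affine penalty.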
Non-gap percentage. The statistical significance of an alignment score usually depends on a theoretical model of non-gapped alignments. Some methods use too many gaps in order to raise the number of identities in the alignment.
The non-gap percentage is defined as the total number of amino acids in the sequences with respect to the number of gaps in the sequences 41.
A solution d_m is said to dominate a solution d_n if both of the following conditions hold:
• The solution d_m is no worse than d_n in all objective functions.
• The solution d_m is strictly better than d_n in at least one objective function.
This leads to the characterization of the Pareto-optimal solution 37. The complexity of the non-dominated sorting based multi-objective evolutionary algorithm is $O(MN^2)$, where M is the number of objective functions and N is the total number of individuals in the population. Once the non-dominated sorting algorithm has been applied, diversity among the non-dominated individuals is introduced using the crowding distance, and selection is carried out by crowded tournament selection. This approach is able to find a much better spread of solutions and better convergence near the true Pareto-optimal front 42.
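The dominance test and the crowding distance can be sketched as follows. The sketch assumes that every objective is expressed in maximization form, so the gap penalty would be negated before use; the function names are illustrative, and only the first non-dominated front is extracted rather than the full ranking.

```python
def dominates(dm, dn):
    """True if dm dominates dn (all objectives are maximized)."""
    no_worse = all(a >= b for a, b in zip(dm, dn))
    strictly_better = any(a > b for a, b in zip(dm, dn))
    return no_worse and strictly_better


def non_dominated_front(points):
    """Indices of points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def crowding_distance(points):
    """Crowding distance of each point within a single front."""
    n = len(points)
    if n == 0:
        return []
    dist = [0.0] * n
    for m in range(len(points[0])):
        order = sorted(range(n), key=lambda i: points[i][m])
        low, high = points[order[0]][m], points[order[-1]][m]
        # Boundary solutions are always kept.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if high == low:
            continue
        for k in range(1, n - 1):
            gap = points[order[k + 1]][m] - points[order[k - 1]][m]
            dist[order[k]] += gap / (high - low)
    return dist
```

Each point here would be a tuple such as (similarity, negated gap penalty, non-gap percentage). The pairwise comparison in `non_dominated_front` performs on the order of M N^2 objective comparisons, in line with the complexity quoted above.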

Proposed BFO-GA Algorithm
The non-dominated optimal solution of the multiple sequence alignment problem is predicted using the proposed BFO-GA algorithm. The proposed BFO-GA algorithm is a distributed optimization process based on the individual and group behavior of E. coli bacteria. It consists of the chemotaxis, swarming, reproduction, selection, crossover, mutation, and elimination and dispersal phases. Chemotaxis is the central step of the BFO-GA algorithm, in which a bacterium takes steps over the foraging site in order to reach an alignment with a higher fitness value. All of the above phases of the BFO-GA algorithm are iterated until the maximum number of cycles is reached. The pseudocode of the proposed algorithm is given in Fig. 1.
The parameters employed in the pseudocode of the proposed BFO-GA algorithm appear in Fig. 1. The overall framework of the proposed BFO-GA algorithm is shown in Fig. 2. Figure 2(a) shows the major steps involved in the proposed BFO-GA algorithm.
Initialization of bacterium in employing phase. The set of unaligned protein/DNA sequences is presented as input. The input sequences have different lengths, and in order to align them they must be brought to the same length. Gaps are inserted randomly to shuffle the residues in order to approach the optimal alignment. The percentage of gaps added to the longest sequence should be less than 20% of the longest sequence length 35. The other sequences are then adjusted to the length of the longest sequence so that all sequences have the same length. The population is then evaluated using the employed bacteria to calculate new food sources. Figure 2(b) shows the initial population for the BFO-GA algorithm.
Chemotaxis. The swimming and tumbling behavior that bacteria use to search for food is known as chemotaxis. A bacterium that is 'swimming' moves in a pre-defined direction, whereas a bacterium that is 'tumbling' moves in an entirely different direction. The movement of the ith bacterium in the Pth chemotaxis step can be represented by the following equation.
$$S_{i,P+1,F,E}(x_1, x_2, x_3) = S_{i,P,F,E}(x_1, x_2, x_3) + C(i) \times dir(i)$$
where dir(i) is the direction vector, a random number lying in [−1, 1]; S_{i,P+1,F,E}(x1, x2, x3) is the position of the ith bacterium at a point in the x1, x2, x3 coordinate system in the Pth chemotaxis, Fth reproduction and Eth elimination and dispersal step; and C(i) is the unit run-length of a bacterium. In the proposed BFO-GA algorithm, the swimming length of the bacteria in multiple sequence alignment is chosen by the user. In this work, however, the proposed algorithm gives better outcomes when the bacterium moves in the forward direction with a swimming length of 5.
Swarming. For the bacteria to reach the richest food location, it is desirable that the best bacterium found so far in the search attracts the other bacteria, so that together they converge on the desired location more quickly. To accomplish this, a penalty function based upon the non-dominated sorting algorithm is applied to determine the fittest bacterium, which has a higher crowding distance and a lower rank. The relative distances of each bacterium from the fittest bacterium found so far are added to the original cost function. Figure 2(c) illustrates the chemotaxis and swarming length of 5 in the forward direction for the initial population.
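The chemotactic move can be sketched in its generic continuous form. This is an assumption-laden illustration: positions are plain coordinate lists, fitness is maximized, and the bacterium keeps swimming in the tumbled direction while fitness improves, up to the swimming length of 5 mentioned above. How a position maps onto an alignment is not specified here, and the names `chemotaxis_step`, `step` and `rng` are hypothetical.

```python
import random


def chemotaxis_step(position, fitness, step=1.0, swim_length=5, rng=random):
    """One tumble followed by up to swim_length swims.

    position: list of coordinates; fitness: callable to maximize.
    step plays the role of the run-length C(i); the random direction
    plays the role of dir(i), with components drawn from [-1, 1].
    """
    direction = [rng.uniform(-1.0, 1.0) for _ in position]  # tumble
    best = list(position)
    best_fit = fitness(best)
    for _ in range(swim_length):
        candidate = [s + step * d for s, d in zip(best, direction)]
        cand_fit = fitness(candidate)
        if cand_fit <= best_fit:
            break  # stop swimming once fitness no longer improves
        best, best_fit = candidate, cand_fit
    return best, best_fit
```

By construction the returned fitness is never worse than that of the starting position, because a swim is accepted only when it improves the fitness.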
Reproduction. After passing through several chemotactic stages, the set of bacteria reaches the reproduction phase. At this stage, the set of bacteria is divided into two groups. The healthier half replaces the other half of the bacteria, which is eliminated owing to its poorer foraging ability. This keeps the population of bacteria constant during the evolution process. The reproduction of the initial population for the protein sequences is shown in Fig. 2(d).
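The reproduction step can be sketched as below, assuming an even population size and a health value per bacterium where larger means healthier. The function name `reproduce` and the `health` argument are illustrative.

```python
import copy


def reproduce(population, health):
    """Keep the healthier half and duplicate it.

    population: list of bacteria (any objects).
    health: one value per bacterium, higher means healthier.
    The population size is unchanged after the step.
    """
    if len(population) != len(health):
        raise ValueError("population and health must have the same length")
    if len(population) % 2:
        raise ValueError("this sketch assumes an even population size")
    order = sorted(range(len(population)), key=lambda i: health[i], reverse=True)
    survivors = [population[i] for i in order[: len(population) // 2]]
    # Each survivor splits into two; the weaker half is discarded.
    return survivors + [copy.deepcopy(b) for b in survivors]
```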
Selection Phase. In the selection phase, the individuals in the mating pool are sorted according to their fitness, and every two best individuals are then selected for crossover. The fitness of each bacterium is calculated by scoring each alignment according to the multi-objectives (Equations 2-13). The fast non-dominated sorting algorithm is executed to identify the best bacterium, which has a higher crowding distance and a lower rank 42. The best bacterium is chosen by crowded tournament selection. Based on the fitness value, every two best individuals are selected for crossover, as shown in Fig. 2(e).
Crossover Phase. Single-point crossover is applied to generate new offspring from the parents. The fitness is then calculated again and the best bacterium is selected. For every two best individuals, the crossover parameter of BFO-GA is set to 0.3, as shown in Fig. 2(f).
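The crowded comparison and a single-point crossover on alignments can be sketched as follows. The crossover shown is one common way of keeping every residue exactly once: the first parent is cut at a column, the second parent is cut row by row after the same number of residues, and the junction is padded with gaps. Whether BFO-GA cuts alignments in exactly this way is an assumption, and the dictionary keys `rank` and `crowding` are hypothetical.

```python
def crowded_better(a, b):
    """True if a beats b: lower rank first, then larger crowding distance."""
    if a["rank"] != b["rank"]:
        return a["rank"] < b["rank"]
    return a["crowding"] > b["crowding"]


def _split_after_residues(row, k):
    """Index just past the k-th residue of row (0 when k is 0)."""
    seen = 0
    for idx, ch in enumerate(row):
        if seen == k:
            return idx
        if ch != "-":
            seen += 1
    return len(row)


def single_point_crossover(parent_a, parent_b, cut):
    """Child with columns [0, cut) of parent_a and the rest from parent_b."""
    if not 0 <= cut <= len(parent_a[0]):
        raise ValueError("cut is outside the alignment")
    lefts, rights = [], []
    for row_a, row_b in zip(parent_a, parent_b):
        left = row_a[:cut]
        k = sum(ch != "-" for ch in left)  # residues already used
        lefts.append(left)
        rights.append(row_b[_split_after_residues(row_b, k):])
    width = max(len(r) for r in rights)
    # Pad at the junction so that all child rows have equal length.
    return [l + "-" * (width - len(r)) + r for l, r in zip(lefts, rights)]
```

The child may contain columns made only of gaps; a real implementation would probably remove them before scoring.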

Mutation Phase.
The mutation operation is applied to the final best bacterium to generate new offspring, introducing modifications that provide diversity among the offspring alignments. It avoids premature convergence of the alignment. The fitness value is then calculated and the best bacterium is identified. For every two best individuals, the mutation parameter of the BFO-GA algorithm is set to 0.8, as shown in Fig. 2(g).
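A simple mutation that respects the sequences is to move one gap within a row, which changes the alignment without adding or removing residues. The sketch below assumes this gap-shift operator and interprets the value 0.8 as the probability of applying it; both points are assumptions, and the function name is hypothetical.

```python
import random


def mutate_gap_shift(alignment, rate=0.8, rng=random):
    """With probability `rate`, move one gap in one random row.

    Row lengths and residue order are preserved, so the result is
    still a valid alignment of the same sequences.
    """
    rows = list(alignment)
    if rng.random() >= rate:
        return rows  # no mutation this time
    candidates = [i for i, row in enumerate(rows) if "-" in row]
    if not candidates:
        return rows  # nothing to shift
    r = rng.choice(candidates)
    chars = list(rows[r])
    gap_positions = [i for i, ch in enumerate(chars) if ch == "-"]
    chars.pop(rng.choice(gap_positions))  # remove one gap
    chars.insert(rng.randrange(len(chars) + 1), "-")  # reinsert elsewhere
    rows[r] = "".join(chars)
    return rows
```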

Elimination and dispersal.
In the evolutionary process, an unexpected event can take place that drastically alters the course of evolution, eliminating the set of bacteria or dispersing them to a new environment. Rather than disrupting the usual chemotactic progress of the set of bacteria, this unknown event may place a new set of bacteria nearer to the food location. From a broader perspective, elimination and dispersal are part of the population-level long-distance motile behavior. In optimization, it helps to reduce the stagnation that commonly occurs in parallel search algorithms. The worst bacterium is replaced by the best developed offspring if the fitness value of the offspring is better than that of the worst bacterium. The best bacterium is selected for reproduction (Fig. 2(h)), and the remaining bacteria are eliminated.

Experimental Results
In this research study, the proposed algorithm is tested on well-known benchmark datasets to analyze its performance. In addition, the performance of the proposed algorithm has been assessed by comparison with several optimization techniques, namely Genetic Algorithm (GA), Ant Colony Optimization (ACO), Artificial Bee Colony (ABC) and Particle Swarm Optimization (PSO), and with existing online tools, namely T-Coffee, Muscle, K-Align, MAFFT and Clustal Omega.
Performance Measures. This research focuses on the following performance measures: the ratio of correctly aligned pairs, namely the Sum of Pairs (SP); the ratio of correctly aligned columns, namely the Total Column Score (TCS); and the multi-objectives, namely maximization of similarity, gap penalty and non-gap percentage. The experiments were carried out on 2 × Intel Xeon E5-2670 V2 (2.5 GHz/10-core) CPUs with 64 GB of memory, running CentOS, and the proposed BFO-GA algorithm was implemented in Java.
The first performance measure used in this work is the Sum of Pairs (SP) score, defined as the number of correctly aligned residue pairs with respect to the total number of residue pairs in the reference alignment. Consider a test alignment of size R × T and a reference alignment of size R × Tr, where R is the number of sequences and T and Tr are the total numbers of columns in the test and reference alignments. Here F_iab = 1 if residues a and b of the ith column are aligned with each other in the reference alignment, and F_iab = 0 otherwise. The score SP_i for the ith column is the sum of F_iab over all pairs of residues in this column, as represented in Eq. 15.
$$SP_i = \sum_{a=1}^{R} \sum_{b=1,\, b \neq a}^{R} F_{iab} \qquad (15)$$
Similarly, SP_ri is the score SP_i for the ith column in the reference alignment. The sum-of-pairs score for the test alignment is defined in Eq. 16.
$$SP = \frac{\sum_{i=1}^{T} SP_i}{\sum_{i=1}^{Tr} SP_{ri}} \qquad (16)$$
The second most common scoring scheme for multiple sequence alignment is the Total Column Score (TCS). Generally, TCS is defined as the number of correctly aligned columns with respect to the total number of columns in the reference alignment. Consider a test alignment of size R × T and a reference alignment of size R × Tr, where R is the number of sequences and T and Tr are the total numbers of columns in the test and reference alignments. Here Col_i = 1 if all the residues in the ith column are aligned in the reference alignment, and Col_i = 0 otherwise. The total column score for the test alignment is represented in Eq. 17.
The Sabmark database contains 3280 protein sequences in Twilight Zone families; that is, the sequence similarity lies between 0 and 25% identity, and a common evolutionary origin cannot be established between most pairs of sequences. The Prefab benchmark database contains 1682 reference alignments. Finally, the Oxbench database consists of reference alignments in the master reference set and 605 sequences in the full reference set. The gap value for the benchmark datasets used in this study was chosen by testing different gap percentages, namely 2%, 5%, 10%, 15% and 19%. The 19% gap value gave better results than the other percentages and was therefore adopted.
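The SP and TCS measures defined above can be sketched by labelling every residue with its row and its index within that row, so that columns of the test and reference alignments can be compared directly. This is a simplified sketch, not the scoring program used in the experiments: it assumes both alignments hold the same sequences in the same row order, it counts a test column as correct only when it is identical to a reference column, and it divides by the number of non-empty reference columns, following the verbal definition.

```python
def _columns(alignment):
    """Per column, the list of (row, residue_index) for non-gap cells."""
    counters = [0] * len(alignment)
    cols = []
    for y in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[y] != "-":
                col.append((r, counters[r]))
                counters[r] += 1
        cols.append(col)
    return cols


def _pairs(cols):
    """All residue pairs that share a column."""
    return {(a, b) for col in cols for i, a in enumerate(col) for b in col[i + 1:]}


def sp_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = _pairs(_columns(reference))
    test_pairs = _pairs(_columns(test))
    return len(test_pairs & ref_pairs) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of reference columns reproduced exactly by the test alignment."""
    ref_cols = {frozenset(c) for c in _columns(reference) if c}
    test_cols = {frozenset(c) for c in _columns(test) if c}
    return len(test_cols & ref_cols) / len(ref_cols)
```

Both scores equal 1 when the test alignment is identical to the reference, and shifting a single gap lowers them because the pairs and columns around the shift no longer match.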
Figure 3 shows the average results for the 19% gap value and 500 generations. In this study two sets of experimental results are obtained. The first computes the values of the objective functions, namely similarity, gap penalty and non-gap percentage, for five algorithms (GA, ACO, ABC, PSO and the proposed BFO-GA algorithm). The second calculates the performance measures, namely the SP and TCS values. The proposed algorithm was executed for 25 runs and the average results are reported.
From Fig. 3, it is inferred that the proposed BFO-GA algorithm achieves higher performance than the existing algorithms for all multi-objective values. For all the datasets, the proposed algorithm provides better results for similarity, gap values and non-gap percentage. It is also found that the similarity and non-gap percentage values obtained with the variable gap penalty are better than those obtained with the affine gap penalty. The comparisons of similarity using affine and variable gap penalties on five reference BAliBASE datasets for the proposed and existing algorithms are shown in Fig. 3. The comparisons of affine and variable gap penalties on five reference BAliBASE 3.0 datasets for the proposed and existing algorithms are shown in Fig. 4. The comparisons of non-gap percentage for the alignment of five reference BAliBASE 3.0 datasets for the proposed and existing algorithms are shown in Fig. 5. The comparisons of similarity using affine and variable gap penalties on the well-known benchmark datasets Sabmark, Prefab and Oxbench for the proposed and existing algorithms are shown in Fig. 6. Likewise, the comparisons of affine and variable gap penalties on the same benchmark datasets for the proposed and existing algorithms are shown in Fig. 7. Finally, the comparisons of non-gap percentage on these benchmark datasets for the proposed and existing algorithms are shown in Fig. 8.
The performance measures, namely the Sum-of-Pairs (SP) and Total Column Score (TCS), of the proposed algorithm are compared with those of the existing algorithms (GA, ACO, ABC, PSO) and also with various online MSA tools. The final stage evaluates the statistical significance of the proposed algorithm, which is estimated using a non-parametric test, namely the Wilcoxon Matched-Pair Signed-Rank test, between each pair of methods at a significance level of 5% (P-value < 0.05). Each entry in Table 1 is the P-value assigned by the Wilcoxon Matched-Pair Signed-Rank test for the difference between the pair of methods. The upper-right corner of the matrix is obtained from the SP score and the lower-left corner from the TCS score. The execution time of the proposed BFO-GA algorithm with respect to affine and variable gap penalties is shown in Figs 11 and 12.
Phylogenetic Tree Construction. In this research, four well-known benchmark datasets, BAliBASE 3.0, Prefab 4.0, SABmark 1.65 and Oxbench 1.3, are used for comparing the proposed BFO-GA algorithm with the other existing algorithms. After performing the MSA, the resulting alignments are passed to the online tool ClustalW2 46 to reconstruct the phylogenetic trees of the families. Supplementary Fig. 1 shows the reference phylogeny for a subset of one reference family in BAliBASE 3.0 named RV 3, as well as the corresponding phylogenetic trees reconstructed from the alignments obtained from the other four algorithms. The Robinson-Foulds (RF) distance 47 is employed to assess the quality of the trees by comparing the inferred trees with the reference trees. The smaller the RF distance between the trees, the better the inferred tree. Table 2 summarizes the RF distances for the trees predicted by ClustalW2. The minimum distance in each row is indicated in bold.
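The Robinson-Foulds distance used above counts the bipartitions (splits) that appear in one tree but not in the other. The sketch below is an illustration only, not the tool used in the experiments: it assumes that trees are given as nested tuples of leaf names, treats them as unrooted, and ignores branch lengths. All function names are hypothetical.

```python
def _leaves(tree):
    """Set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(_leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree.

    Each split is stored as the side that does not contain a fixed
    anchor leaf, so the same split always has the same representation.
    """
    all_leaves = _leaves(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if _leaves(tree_a) != _leaves(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))
```

Identical topologies give a distance of 0, so an alignment whose inferred tree is closer to the reference tree receives a smaller value.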
The results show that the phylogenetic trees inferred from the BFO-GA alignments have the smallest distances in five of the eight databases. One of the common performance metrics for … the SP and TCS values fluctuate at 5% and 10%, and at the later 15% and 19% iterations the SP and TCS values increase. Across all the iterations, the proposed algorithm has the best average results, and the 19% gap penalty value gives better results than the other values. For all the BAliBASE datasets, the proposed algorithm provides more dependable outcomes with respect to affine and variable gap penalty values. Based on the experimental results and discussion, this research work concludes that the proposed BFO-GA algorithm improves both the multi-objectives and the performance measures compared with the existing algorithms.

Conclusion and Future Enhancement
Today, the multiple sequence alignment problem remains an unresolved issue for researchers. The alignment methods used to solve this problem should be continually enhanced, as they are important in the analysis of the enormous amount of data provided by next-generation sequencing and high-throughput experiments. The primary objective of this study is to assess evolutionary algorithms such as GA, ACO, ABC and PSO and to explore ways of further improving their performance in order to arrive at an optimal solution. After careful analysis of the existing algorithms, this research work proposed the BFO-GA algorithm to perform multiple sequence alignment and to direct the result towards an optimal solution. The multi-objective optimization technique is used to solve the problem by maximizing the similarity and the non-gap percentage and minimizing the gap penalty, which leads to the Pareto-optimal result. The statistical significance of the proposed algorithm relative to the other existing methods is computed using the Wilcoxon Matched-Pair Signed-Rank test. The experimental results show that the proposed BFO-GA algorithm outperforms the other existing algorithms in terms of all multi-objectives and performance measures. In addition, the proposed algorithm achieves good outcomes even for sequences with low similarity. When conserved blocks are not obtained while performing the multiple sequence alignment, it is concluded that the sequences are not homologous. Finally, the phylogenetic tree is constructed for the RV3 reference family in BAliBASE 3.0 using the alignments produced by the proposed BFO-GA algorithm. Based on the RF distance values, it is inferred that the proposed algorithm achieves better results than the other methods.
In future, the proposed algorithm can be combined or run with other evolutionary algorithms to obtain better results. Different objectives may be introduced to find better solutions to multiple sequence alignment and to obtain more conserved blocks. This algorithm can also be utilized for secondary and tertiary structure prediction of these sequences.