Niche harmony search algorithm for detecting complex disease associated high-order SNP combinations

Genome-wide association study is especially challenging in detecting high-order disease-causing models due to model diversity, possible low or even no marginal effect of the model, and extraordinary search and computations. In this paper, we propose a niche harmony search algorithm where joint entropy is utilized as a heuristic factor to guide the search for low or no marginal effect model, and two computationally lightweight scores are selected to evaluate and adapt to diverse of disease models. In order to obtain all possible suspected pathogenic models, niche technique merges with HS, which serves as a taboo region to avoid HS trapping into local search. From the resultant set of candidate SNP-combinations, we use G-test statistic for testing true positives. Experiments were performed on twenty typical simulation datasets in which 12 models are with marginal effect and eight ones are with no marginal effect. Our results indicate that the proposed algorithm has very high detection power for searching suspected disease models in the first stage and it is superior to some typical existing approaches in both detection power and CPU runtime for all these datasets. Application to age-related macular degeneration (AMD) demonstrates our method is promising in detecting high-order disease-causing models.

The control parameters of harmony search algorithm are specified, which include HMS, harmony memory considering rate (HMCR), pitch-adjusting rate (PAR), fret width (fw) (fret width is called formerly bandwidth: bw) and the termination criterion (i.e., the maximum function evaluation times(MaxFEs) ).
Step 2. Initializing the harmony memory (HM) and calculating the fitness value of each harmony. End where, rand(0,1) represents a uniformly distributed random number between 0 and 1.
The harmony memory (HM) consists of HMS harmonies, as follow,

End
where idworst is the index of the worst harmony in HM.
Step 5. Checking the stopping criterion. If stopping criterion (MaxFEs) is meet, computation is terminated. Otherwise, Step 3 and Step 4 are repeated.

Niche technique
Niche technique, inspired by the way organisms evolve in nature [3], can be employed to enhance the population diversity and maintain multiple different optima in the search landscape [4]. For example, continual function schwefel 2.26 has many local optimal solutions, whose unique globally optimal solution (420.9867, 420.9867,…, 420.9867) is near the boundary (see Fig.A-1); the swarm intelligent search algorithm is very easy to trap into one of local optimal regions, which might cause the global optimal solution lost. In niche technique, when many candidate solutions are gathered together in a small region of search space, the niche algorithm is triggered to recognize the niche region, and then the recognized niche region is set as forbidden zone for forcing search algorithm to explore new solutions in unexplored regions. In Fig.A-1, red ellipse denotes a niche region, where c represents the center of this niche, two dimensional vector  

Niche technique for discovering more disease factors
For a new complex disease, we previously have not known biological information, whose risk factors may be some single SNP locus, combinations of multi-SNP loci, or hybrid models composed of the additive models, epistasis models and so on.
To be able to detect various risk factors, we need consider as many kinds of SNP combination models as possible. For the disease models with strong marginal effects, it can be found easily and quickly using swarm intelligent algorithm, however, for datasets including multiple disease models with marginal effect and disease models without marginal effect, it is extremely difficult to detect all disease-causing models due to the interference with one another seriously.
To tackle this problem, in this study, a new niche technique is merged into harmony search algorithm for exploring all disease-causing SNP combinations efficiently. The goal of the niche technique is mainly to discover some SNP-combinations with strong marginal effect, where the marginal effects do not only come from single SNP markers, but also may be synergistic effects of multi-SNP makers. In the search process of HS, the position and size of each niche region are recorded into a tabu table for forcing the HS algorithm to search new solutions in unexplored regions. In this way, all possible k-way SNP combination models having strong association with phenotype can be extracted one by one. to jump out of the local region for exploring new disease-causing models due to the strong attraction of the marginal effect. At this point, the niche algorithm is triggered automatically to identify niche region whose center c has the highest score ) and radius ris defined as the number of common SNP loci of all k-way models in HM, (in this example, there are two the common SNP loci 1 x and 2 x , thus When niche region ) , ( r c have been identified using the niche algorithm (see the identification algorithm(1)), it will be stored into tabu table (TT) that is used for avoiding evaluating some models repeatedly and forces the HS algorithm to search other regions for finding new possible risk factors. Then some best solutions (having strongest causal association with phenotype status) which are chosen from HM1, HM2 and HM3 are put into elite solution set Es1, Es2 and Es3, respectively. After the process HS search is end, the Es1, Es2 and Es3 are combined into a candidate set (CS).  In the search process of NHSA, the niche identification algorithm is automatically triggered to identify a niche region when the harmony memories HM1, HM2 and HM3 cannot be updated during several iterations. Within a niche region, the radius of the niche is recorded for preventing generating new solution in the niche, which can effectively avoid the search algorithm trapping into a local region.

Algorithm (1): Niche identification algorithm:
During several iterations (T times), if there is no improvement for harmony memory (HM), we think all the solutions in HM might have been aggregated into a local region. And at this moment, niche identification algorithm is triggered to identify the niche, its work as follows: (1) Select the best harmony H best in HM as the center of niche. There are more than ten kinds of BN models [5][6][7] that have been developed to find causal relationships, perform explanatory analysis , describe the causal influence and make predictions. In GWAS studies, BN model is also used to detect the interaction effect among SNP markers, which can represent the relationship between genetic variants and disease status.

Let a set of SNP variables
, , , J y y y  ; we represent the homozygous major allele, heterozygous allele and homozygous minor allele as 0, 1 and 2.
In a DAG of Bayesian network for representing the relationships of SNP markers and disease states, there are only directed edges linking from the SNP markers to diseases status, and there is no any edge linked from disease state to SNP markers and no edge connected among SNP markers. In the DAG, if and only if SNP i x is a direct cause of phenotype j y , there is a directed edge linking from node i x to node j y . Figure A-3 shows an example of k-SNPs epistasis Bayesian network model associated with phenotype j y , where 1 2 ( , , , ) j y j J   denotes the phenotype state.

Fig.A-3 k-SNPs epistasis Bayesian network model
A k-combinatorial model associated with phenotype can be expressed by the probability distribution [7] as follow: where, Y is the state variable of phenotype and ( There have been several Bayesian network structure learning models successfully used to identify the SNP interaction, such as score-based methods used to evaluate the association between genotypes and phenotype by scoring the model's fitness to the data.
where, M  denotes the parameters of the probability distributions of phenotype variable Y given the SNP variables According to the theorem 1 given in [8] , under the following assumptions: (1) The values of the phenotype variable are generated via independent identically distributed (i.i.d) sampling from P(Y |X).
(2) Prior belief about the distribution P(Y|X=x i ) is independent of prior belief about P(Y|X=x j ) for all values x i and x j of X (x i ≠x j ).
(3) For all values x j , prior belief about the distribution P(Y|X=x j ) is modeled using Dirichlet dirtribution with hyperparameters i  and ij  .
The equation (5) has a closed-form solution which is as following expression:  [7,9] which define the prior probability over the the parameters M  , and We can view the hyperparameters as prior belief numbers of cases from a previous hypothetical study which belongs to the disease state j where the genotypes is given by i-th combination. When parameters 0 ij   and 0 ij n  , the Equation (6) can be expressed as follow: According to the analysis in [7], for the BN model M, it is unknown for us to prior belief of the model M. Thus, we can consider all BN models to be equally plausible and think the prior probability   P M is a constant for all BN models. Thereby, the Equation (4) can be expressed as : According to Equation (7).
, which means that all possible distributions of Y given i X x  is equally likely. Thus the Equation (9) can be written as The evaluation criterion of BN model turns to the K2-Score as follow formula

Gini Index
For a binary classification case-control problem, the Gini index is a diversity index [11] which is defined as Equation (12) where, , i j p is the estimated probability that the i-th genotype combination actually associated with phenotype j y .
 means the estimated probability that genotype combination is misclassified as phenotype j y . i P is the percentage of i-th genotype combination in sample set.
For example, in Table A    We compared joint entropy with logistic regression based AIC-score [9] and Bayesian network based K2-score by calculating the association between each 2-way SNP genotype combinations and phenotype outcome. The evaluation scores of three methods for all 2-way genotype combinations are displayed respectively with surface graph in Fig.A-4(a-c), and the corresponding contour plots are shown in Fig.A-4(d-f).

Comparison joint entropy with other scoring functions
As shown in Fig.A-4 (a-b) and Fig.A-4 (d-e), the logistic-regression-based AIC-score criterion and Bayesian-network-based K2-score criterion have hardly any discriminability for different 2-way genotype combinations. Yet, in Fig.A-4 (c) and Fig.A- . Fig (a), (b) and (c) are the scoring surface graphs for all 2-way SNP combination models evaluated using logistic regression based AIC criterion, Bayesian network based K2-score and joint entropy, respectively. The smaller the score is, the higher the association with phenotype. In Fig (a) and (b), the distribution of scores is disorderly and unsystematic; it is very difficult to discover heuristic information from them to explore the disease-causing SNP combination models. Yet, the joint entropy, in Fig. c, Fig (d), (e) and (f) are the contour graphs for all 2-way SNP combination models evaluated using logistic regression based AIC criterion, Bayesian network based K2-score and joint entropy, respectively.

Screening Stage: step (1)~step (8).
(1) There harmony memories (HM1, HM2 and HM3) are initialized randomly by selecting 3-way SNP combination from the set of all SNP loci, where each 3-way SNP combination is referred to as a harmony. Each harmony of HM1 is evaluated using K2-score, the harmony in HM2 is scored by Gini function and the method of joint entropy is used to evaluate the harmony of HM3.
(3) If the new generated 3-way model is superior to the worst harmony (with high K2-score, yellow color) in HM1, the worst harmony in HM1 is replaced by the new model. If the new model is better than the worst harmony in HM2 or HM3, it will replace the worst harmony. For example, in the HM1, the worst harmony is (X 8 , X 9 , X 10 ) whose K2-score=5.1, it is worse than (X 4 , X 5 , X 6 ), therefore, it will be replaced by the new harmony (X 4 , X 5 , X 6 ). Similarly, because harmony(X 4 , X 5 , X 6 ) is better than the worst harmony (X 6 , X 7 , X 9 ) of HM3, it will instead of the worst harmony in HM3. However, in HM2, the worst harmony (X 1 , X 6 , X 10 ) is better than new generated harmony(X 4 , X 5 , X 6 ), thus the HM2 cannot be updated.
(10) Testing the solutions using G-test statistical method.    In Table E-3 (see supplementary info file 2), some FDR values on the columns of SNPHarvester, Boost and MACOED equal "NaN", which means that no k-way SNP combinations in all datasets were found to be a disease-causing model (Both the FP and TP are equal to 0, FP/(FP+TP) cannot be calculated) . MACOED  CSE  SNPHarvester  Beam  Boost   Sample  size  Model  TPR  SPC  ACC  FDR  TPR  SPC  FDR  ACC  TPR SPC ACC FDR  TPR  SPC  ACC  FDR  TPR  SPC ACC  FDR  TPR  SPC  ACC  FDR   800   DME-1  1%  100%  84%  0%  6%  100%  0%  68%  24%  24%  24%  76%  0%  100%      It can be seen from Figure E-7(a) that our method is superior to other five algorithms for DME-4~DME-12 on index ACC, and it is better than CSE, Beam, SNPHarvester and Boost on DME-1~DME-3. In Figure E-7(b), our method is superior to or equal to all other approaches on the SPC metric, and it shows very high true negative rate in Figure E-7(c) for DME-5~DME-12, but it is inferior to MACOED and CSE for DME-1~DME-4, which is because many true disease models obtained in the first stage are filtered out due to the failure to pass the threshold of G-test p-value. In Figure E-7(d), the NHSA-DHSC shows very low FDR, which demonstrates our method is effective for decreasing the type I error.

NHSA-DHSC
We can see from Figure E-8 that the G-test has higher precision on SPC and FDR than chi-square test. For DME-1~DME-4, the TPR of G-test is inferior to that of chi-square test, which is because the G-test is more stringent than chi-square. As we known, statistical method must balance the increase in type I errors and the increase in type II errors. A stringent significant threshold can decrease the type I errors for some models, however, which also can decrease the power (type II errors) for other models, such as DME models 1-4.Therefore, we believe that the modified G-test in this work is more effective than chi-square for testing the candidate solutions. In this figure, we respectively employ G-test and chi-square test to verify the solutions of candidate set (CS) in the 1st stage of NHSA-DHSC, where 12 DME datasets with 1000 SNP markers and 4000 samples are tested.