Abstract
Genome-wide association studies face particular difficulty in detecting high-order disease-causing models because of model diversity, the possibly low or even absent marginal effects of the model, and the extraordinary search and computation effort required. In this paper, we propose a niche harmony search algorithm in which joint entropy is used as a heuristic factor to guide the search for models with low or no marginal effect, and two computationally lightweight scores are selected to evaluate and adapt to diverse disease models. To obtain all possible suspected pathogenic models, a niche technique is merged with harmony search (HS); the identified niches serve as taboo regions that keep HS from becoming trapped in local search. From the resulting set of candidate SNP-combinations, we use the G-test statistic to test for true positives. Experiments were performed on twenty typical simulation datasets, of which twelve models have marginal effects and eight have no marginal effect. Our results indicate that the proposed algorithm has very high detection power for finding suspected disease models in the first stage and that it is superior to some typical existing approaches in both detection power and CPU runtime on all these datasets. Application to age-related macular degeneration (AMD) data demonstrates that our method is promising for detecting high-order disease-causing models.
Introduction
With the rapid development of high-throughput genotyping technology, single-nucleotide polymorphism (SNP) data are increasing explosively, which creates favorable conditions for researchers to detect the causes of disease. Although genome-wide association studies (GWAS) have successfully identified many single-SNP genetic variants associated with disease status or phenotypic traits^{1,2,3,4}, it is widely acknowledged that they generally fail to detect high-order SNP-combinations, which may be important pathogenic factors that synergistically affect disease status^{5}. Detecting such models in a dataset with hundreds of thousands of SNPs faces the following two challenges^{6}.
The first challenge is the enormous computational burden imposed by the combinatorial explosion of genotypes. The number of candidate k-way SNP-combinations for a dataset with n SNP markers equals \(\binom{n}{k}=\frac{n(n-1)\cdots (n-k+1)}{k!}\propto {n}^{k}\). Obviously, it is impractical to test all k-way SNP combinations at whole-genome scale when k > 3, even with the high-performance computers available at present. The second challenge arises from the diverse nature of SNP interaction models, such as additive effect models, non-additive effect models and statistical epistasis models. Furthermore, some spurious multi-locus combination models may also appear to be associated with the phenotype because of statistics with high degrees of freedom, the huge number of hypotheses tested and limited sample sizes^{7, 8}, all of which can result in a high false discovery rate (FDR).
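To make the combinatorial growth concrete, the short sketch below counts the k-way combinations for a dataset with 100 SNP markers. The function name `count_k_way` is a hypothetical helper introduced only for this illustration.

```python
from math import comb


def count_k_way(n_snps: int, k: int) -> int:
    """Number of distinct k-way SNP combinations among n_snps markers."""
    return comb(n_snps, k)


if __name__ == "__main__":
    # The count grows roughly like n^k, so each extra order multiplies the work.
    for k in (2, 3, 4, 5):
        print(f"k = {k}: {count_k_way(100, k)} combinations")
```

For 100 SNPs this gives 161700 three-way and 75287520 five-way combinations, the same search-space sizes quoted for the simulated datasets in the results.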
For the first challenge, several multi-locus detection algorithms^{5, 6, 9,10,11,12,13,14,15,16,17,18,19,20,21,22} have been proposed to improve detection speed. The SNPHarvester algorithm^{9} uses a stochastic strategy to generate multiple paths for identifying k-way SNP interaction models. BEAM^{23} introduces a Bayesian partition model and employs a Markov chain Monte Carlo sampling strategy to discover the model with maximum posterior probability. In BOOST^{5}, Boolean operations are adopted to examine all pairwise SNP interactions by exhaustive search. Sangseob Leem et al.^{11} introduce a fast algorithm for detecting high-order epistatic interactions that clusters all SNPs with the k-means algorithm and selects the k-way candidates from the k clusters, reducing the number of combinations. Collins RL et al. use multifactor dimensionality reduction (MDR) to detect three-locus epistatic interactions^{12}; the ReliefF algorithm is first used to select a small candidate set to reduce the computational burden. Dynamic clustering and cloud computing^{10} have also been employed to detect high-order genome-wide epistatic interactions, with forty virtual machines constructed to speed up the detection of multi-locus epistasis. Jonathan et al. present a multipoint method for genome-wide association studies by imputation of genotypes^{13}, a model-based imputation method for inferring genotypes at observed or unobserved SNPs.
The main problems of these algorithms are their huge computational cost and their preference for some types of disease models, e.g., models with obvious marginal effects.
Recently, swarm-based intelligent optimization algorithms have attracted much attention for reducing the computational burden, owing to their ability to find good solutions to NP-hard problems within polynomial time. M Aflakparast et al. propose a Cuckoo search (CS) algorithm^{14} to explore multi-locus epistasis. In the CS, SNP sites are divided into M groups according to the correlation among SNPs, and only k-way (k ≤ M) SNP combinations are selected out of the M groups. Ant colony optimization (ACO) is adopted in AntEpiSeeker^{15} and MACOED^{17}. The former employs the chi-square test (χ^{2}) score to evaluate the association between SNP combinations and phenotype, while the latter adopts the Bayesian-based K2-score and the logistic-regression-based AIC to screen SNP combinations in the first stage, and uses χ^{2} to test the significance of the difference between controls and cases in the second stage. Shang J. et al. use particle swarm optimization (PSO) to discover SNP-SNP interactions^{18}, using opposition-based learning, a dynamic inertia weight and a post-procedure to enhance the search ability of PSO. Although swarm intelligence search can speed up the detection process, for high-order disease models it is unfortunately easily trapped in local search. Table 1 presents the characteristics of the five state-of-the-art algorithms for detecting multi-locus disease-causing models.
As to the second challenge, which relates to the diversity of disease models, logistic regression, linear regression, LD and haplotype-based methods, and Bayesian network scoring^{23,24,25,26,27} have been proposed. None of these methods is universally better; each suffers from either low statistical power or a preference for some types of disease models.
To reduce the computational burden and adapt to the diversity of disease models, we propose a Niche Harmony Search Algorithm to detect high-order SNP combinations (NHSADHSC) associated with complex diseases. It consists of two stages: screening and testing. In the screening stage, a new niche technique is merged into the harmony search algorithm to explore all suspected disease-causing SNP combinations. To quickly find as many types of disease-causing models as possible, we employ three computationally lightweight and functionally complementary evaluation functions, the Bayesian-network-based K2-score, the Gini-score and joint entropy, to calculate the association between SNP combinations and disease status. The suspected SNP combination models obtained in the screening stage are stored in a candidate set (CS). In the testing stage, a modified G-test method is used to test the authenticity of the candidate SNP combination models in CS.
Our experiments indicate that the proposed NHSADHSC is superior to current intelligent algorithms in detection power, running speed and the ability to identify diverse disease models.
Outline
Figure 1 presents the outline of the NHSADHSC algorithm. The goal of the first stage is to quickly find all suspected k-way disease-causing models among all k-way SNP combinations, where a k-way disease-causing model denotes a combination of k SNPs that has a joint effect on disease status (k is the number of SNPs). Such models are responsible for significantly increasing the risk of complex diseases^{28, 29}.
As shown in Fig. 1, there are \({C}_{n}^{k}\) k-way SNP combinations, of which only a few are disease-causing models. The niche technique is used to obtain as many types of pathogenic models as possible from this large number of SNP combinations. To search for them efficiently, the harmony search (HS) algorithm is used, and three harmony memories HM1, HM2 and HM3 are employed to store candidate solutions: HM1 stores candidate SNP combinations screened by the Bayesian-network-based K2-score, and HM2 stores those screened by the Gini-score. The two scores are complementary in that the K2-score is superior in identifying models with low marginal effects, whereas the Gini-score is more capable than the K2-score of identifying high-order models when their genetic heritability (H^{2}) is very low^{30}. Joint entropy is adopted to evaluate the harmonies in HM3. Unlike the K2-score and Gini-score, joint entropy is a heuristic factor for guiding the HS algorithm to quickly explore high-order disease models with very low or even no marginal effects (DNME). We propose joint entropy as a heuristic factor because our data experiments show that it is powerful in identifying some high-order disease models with no marginal effect at all, for which the K2-score and Gini-score are powerless.
The screening process is as follows.

(a) Initialize HM1, HM2 and HM3 by randomly selecting k-way SNP-combinations from all k-way SNP-combinations.

(b) Repeatedly generate a new solution H_{new} using the HS rules and update HM1, HM2 and HM3 according to the evolutionary principle of "survival of the fittest" (see algorithm (1) in the Methods section).

(c) When HM1, HM2 and HM3 can no longer be updated, the niche algorithm is automatically triggered to identify a new niche (see algorithm (2) in the Methods section). Meanwhile, some of the best solutions are chosen from HM1, HM2 and HM3 separately to replace some of the worse solutions in the elite sets Es1, Es2 and Es3.

(d) Reinitialize HM1, HM2 and HM3 randomly; solutions from the identified niche regions can no longer be visited or evaluated, which avoids repeated search in those regions.

(e) When the termination condition is satisfied, the solutions in Es1, Es2 and Es3 are merged and stored in the candidate set (CS).
(Testing Stage) Some spurious disease models may be included in the CS after the first stage. In the second stage, we adopt a modified G-test method^{31} to further verify the authenticity of the candidate models in CS.
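As a rough illustration of the testing stage, the sketch below computes the standard (unmodified) G statistic, G = 2 Σ O ln(O/E), for a genotype-combination by phenotype contingency table. The modification used in ref. 31 is not described in this section, so it is not reproduced here; the function name `g_statistic` and the table layout (one row per genotype combination, one column per phenotype state) are assumptions made for illustration.

```python
import math


def g_statistic(table):
    """Return (G, degrees of freedom) for a contingency table of counts.

    table[i][j] is the number of samples with genotype combination i
    and phenotype state j. Empty rows and columns are ignored.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    rows_used = sum(1 for t in row_totals if t > 0)
    cols_used = sum(1 for t in col_totals if t > 0)
    return 2.0 * g, (rows_used - 1) * (cols_used - 1)


if __name__ == "__main__":
    # Toy 2 x 2 table: genotype combinations (rows) versus control/case (columns).
    g, df = g_statistic([[30, 10], [10, 30]])
    print(f"G = {g:.3f}, df = {df}")
```

A p-value would then be obtained from the chi-square distribution with the returned degrees of freedom and compared against a Bonferroni-corrected threshold.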
In this study, the niche technique is mainly used to discover SNP-combinations with strong marginal effects and to let the HS find all suspected disease-causing models, where the marginal effects may come not only from single SNP markers but also from the synergistic effects of multiple SNP markers. During the HS search, the position and size of each niche region are recorded in a taboo table, which forces the HS algorithm to search for new solutions in unexplored regions. In this way, all possible k-way SNP combination models that are strongly associated with the phenotype can be extracted one by one. The detailed process of the NHSADHSC algorithm is introduced in the Methods section, and the related techniques are presented in the supplementary info file.
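The interplay between harmony search and the taboo table can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' algorithm: it uses a single harmony memory instead of HM1 to HM3, treats lower scores as better, and approximates a "niche" by simply tabooing every SNP of the best harmony once the memory stalls. All names and parameter values (`improvise`, `screen`, `hmcr`, `par`, `stall`) are assumptions made for this example.

```python
import random


def improvise(hm, n_snps, k, hmcr, par, taboo, rng):
    """Build one new k-way combination from the harmony memory."""
    new = []
    while len(new) < k:
        if rng.random() < hmcr:
            # Memory consideration: copy a SNP from a stored harmony.
            snp = rng.choice(hm)[len(new)]
            if rng.random() < par:
                # Pitch adjustment: move to a neighbouring SNP index.
                snp = min(n_snps - 1, max(0, snp + rng.choice((-1, 1))))
        else:
            # Random selection over all SNPs.
            snp = rng.randrange(n_snps)
        if snp not in new and snp not in taboo:
            new.append(snp)
    return tuple(sorted(new))


def screen(score, n_snps, k, hms=10, hmcr=0.9, par=0.3,
           iters=2000, stall=200, seed=0):
    """Return {combination: score} for the best harmony of each niche."""
    rng = random.Random(seed)
    taboo, elite = set(), {}

    def fresh_memory():
        free = [s for s in range(n_snps) if s not in taboo]
        return [tuple(sorted(rng.sample(free, k))) for _ in range(hms)]

    hm = fresh_memory()
    since_update = 0
    for _ in range(iters):
        cand = improvise(hm, n_snps, k, hmcr, par, taboo, rng)
        worst = max(hm, key=score)
        if cand not in hm and score(cand) < score(worst):
            hm[hm.index(worst)] = cand  # survival of the fittest
            since_update = 0
        else:
            since_update += 1
        if since_update >= stall:
            best = min(hm, key=score)
            elite[best] = score(best)
            taboo.update(best)  # crude stand-in for a niche region
            if n_snps - len(taboo) < k:
                break
            hm = fresh_memory()
            since_update = 0
    best = min(hm, key=score)
    elite[best] = score(best)
    return elite


if __name__ == "__main__":
    target = {3, 17, 42}  # hypothetical disease-causing SNP indices
    found = screen(lambda c: -len(target & set(c)), n_snps=50, k=3)
    print(sorted(found.items(), key=lambda item: item[1])[:3])
```

In this toy setting every elite combination would still need to pass a second-stage significance test, since niches found after the informative SNPs are tabooed carry no signal.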
Experimental Results
Twenty simulated disease models, comprising twelve 2-locus Disease models with Marginal Effects (DME) and eight high-order Disease models with No Marginal Effect (DNME), are employed to investigate the performance of the NHSADHSC algorithm. The simulation datasets for the 20 disease models, the real age-related macular degeneration (AMD) data, and the seven evaluation metrics are introduced in detail in the Methods section. We compare the experimental results of NHSADHSC with those obtained from five state-of-the-art algorithms (BEAM^{23}, SNPHarvester^{9}, BOOST^{5}, CSE^{14} and MACOED^{17}).
Experimental results for simulation datasets
Detecting DME disease models
We first investigate the proposed algorithm on 12 DME datasets with 100 SNP markers, with the aim of testing the sensitivity of our method to datasets of different sample sizes. The bar charts in Fig. 2 present the power of the six algorithms on the 12 DME models with sample sizes of 800, 1600, 2000, 3200, 4000 and 5000. Figure 3 displays the runtime, mean evaluation times (MEs) and ACC values of all methods. The other three statistical metrics, TPR, SPC and FDR, are summarized in Table E3 (see supplementary info file), and TPR and SPC are plotted as point-line charts in Fig. E4 (see supplementary info file).
(Analysis of Detection Power) As can be observed from Fig. 2, for most DME models the detection power of NHSADHSC in the first search stage exceeds that of the other five methods. NHSADHSC has particularly clear advantages for DME1~DME4. In the second stage, the power of NHSADHSC decreases noticeably on DME1~DME4 because some disease-causing models fail to pass the threshold p-value of the G-test statistic (with Bonferroni correction), which lowers the true positive rate (TPR). MACOED is similar to the NHSADHSC algorithm: it employs the ACO algorithm to search for candidate solutions in the first phase and uses the chi-square test to further test their authenticity in the second phase. On almost all DME models, the power of our approach is higher than that of MACOED in the two corresponding stages. NHSADHSC also has a clear advantage in power over BEAM, SNPHarvester and BOOST, and its first-stage power clearly exceeds that of CSE.
(Ability for reducing the computational burden) Fig. 3(a) indicates that BOOST has the shortest runtime among the six methods, but NHSADHSC takes clearly less runtime than MACOED, CSE, BEAM and SNPHarvester. Moreover, the runtime of NHSADHSC increases very slowly with increasing sample size, whereas the runtimes of MACOED, CSE, BEAM and SNPHarvester do not. In Fig. 3(b), NHSADHSC uses a very small number of evaluations to detect pathogenic models, significantly fewer than MACOED and CSE, which demonstrates that our approach reduces the computational burden effectively.
(Performance on TPR, FDR, SPC and ACC) As shown in Fig. 3(c), our method outperforms the other five algorithms in ACC on the majority of models. Table E3 indicates that, for DME2~DME4, all algorithms perform poorly on TPR and FDR when the sample size is less than 2000. There, the TPR of MACOED is higher than that of NHSADHSC, but the FDR of NHSADHSC is much lower than that of MACOED, which demonstrates that the G-test in our method is more stringent in avoiding false positives than the chi-square test of MACOED. Fig. E4 (see supplementary info file) shows that NHSADHSC and BOOST have the highest SPC values among the six algorithms, but NHSADHSC is clearly superior to BOOST on TPR. NHSADHSC performs very well on all datasets with different sample sizes except DME2~DME4 with small sample sizes (≤ 2000) (see Fig. E5 in the supplementary info file).
In the supplementary info file, we also present the experimental results and analysis for datasets with 1000 SNPs.
To sum up, for DME models our method decreases the computational burden effectively, and its detection power, SPC, ACC and FDR are superior to those of most compared algorithms, which demonstrates that NHSADHSC is promising for detecting DME models.
Detecting DNME disease models
For the eight high-order DNME models, we compare NHSADHSC with three state-of-the-art heuristic search algorithms that can detect high-order SNP combinations associated with complex diseases. The detailed experimental results are summarized in Table 2.
Table 2 shows that NHSADHSC is clearly superior to the other algorithms in terms of power and runtime except on DNME7, and that it finds the high-order disease-causing models successfully while evaluating very few SNP combination models. For DNME1 with 100 SNP sites, there are 161700 3-way combination models (\({C}_{100}^{3}=161700\)), and the NHSADHSC algorithm identifies the disease-causing model by evaluating only 1093 of them. For DNME8, the disease-causing model can be identified from 75287520 5-way SNP combinations by evaluating 3849 5-way models. Therefore, we believe that the NHSADHSC algorithm is promising for detecting high-order SNP combinations associated with complex diseases.
Experiments on AMD data
Detection on all 103611 SNP loci of AMD dataset
The NHSADHSC algorithm is employed to detect k-way (k = 2, …, 4) SNP combinations associated with AMD. The corresponding results are listed in sheets S2~S4 of Supplementary Dataset File 1.
In Fig. 4(a), a 2-way SNP combination network is created using the software Cytoscape 3.3^{32} (http://www.cytoscape.org/). There are 571 nodes and 565 edges in the network. Each edge denotes a 2-way SNP-combination whose G-test p-value (case versus control) is less than 1e-8. Each node represents a SNP locus that has a joint effect on the phenotype with its adjacent nodes. Figure 4(b) and (c) are subnetworks of Fig. 4(a): the SNP nodes of Fig. 4(b) have more than 5 adjacent nodes in the network of Fig. 4(a), and the nodes of Fig. 4(c) have more than 10.
Figure 5(a) is the gene interaction network mapped from the SNP network in Fig. 4(a), in which each edge connecting two genes indicates that two SNP loci, one in each gene, are jointly associated with the phenotype. Figure 5(b) presents the interaction relationships of the six important genes (CFH, BBS9, NA, INPP4B, ABL1 and ANKS1B). In Fig. 5, the thicker an edge, the more SNP-pairs link the two genes. In Fig. 5(a), there are 260 adjacent edges between CFH and NA.
Detection of the remaining SNP loci of AMD dataset (removed five important SNPs)
To further investigate unknown disease models in the AMD dataset, we first remove five widely reported SNPs ('rs380390', 'rs10272438', 'rs1329428', 'rs1363688', 'rs1394608') from the AMD data. The proposed NHSADHSC is then used to detect high-order SNP-combinations associated with AMD among the remaining SNP loci. The results are listed in sheet 2 of Supplementary Dataset File 2.
Figure 5(c) shows the 2-way SNP interaction network, which contains 78 edges; each edge denotes a SNP-pair associated with disease status (the p-value cutoff is 1e-8). Figure 5(d) is the gene interaction network mapped from the SNP network in Fig. 5(c).
Experimental results analyses
As shown in Fig. 4, many SNP nodes are connected to three important nodes, rs380390, rs1329428 and rs10272438, whose degrees are 421, 32 and 34, respectively. rs380390 and rs1329428 (both in an intron of the CFH gene) are widely believed to be associated with AMD^{10, 18, 30, 33,34,35,36,37,38}, and rs10272438 (in the BBS9 gene) has also been reported in refs 36, 37, 43 and 44. In addition, rs1363688 (degree = 12) and rs7104698 (degree = 9), which are not in known gene regions, have been reported in refs 10, 18, 30 and refs 30 and 39, respectively. To the best of our knowledge, rs3775652 (degree = 36 in the network of Fig. 4(a)) has not been reported by other researchers; it lies in the gene INPP4B, which is an oncogenic regulator in human colon cancer^{40}. Another SNP, rs1394608 in the SGCD gene, has also been reported recently in refs 30, 39, 41,42,43. Figure 4(c) shows that there is no edge between rs380390 and rs1329428: the G-test p-value of the SNP pair (rs380390, rs1329428) is 3.24 × 10^{−6}, which is larger than the threshold p-value of 1.0 × 10^{−8}. However, rs10272438 and rs3775652 are connected to all other SNP nodes. Consequently, we speculate that rs380390 and rs1329428 may be disease-causing variant loci that are each independently associated with AMD, but that there is very little synergistic effect on AMD between them. The two SNPs may be potential driver variant loci for AMD, and CFH may be the potential driver gene on the basis of single-nucleotide variations.
In the gene network of Fig. 5, there are a total of 188 gene nodes, of which six nodes (CFH, BBS9, NA, INPP4B, ABL1 and ANKS1B) have the most adjacent edges. There are 177 SNPs mapped to the CFH gene, 16 SNPs mapped to the BBS9 gene, 17 SNPs mapped to the INPP4B gene, 5 SNPs mapped to the ABL1 gene and 5 SNPs mapped to the ANKS1B gene. NA (to which 375 SNPs are mapped) denotes non-gene-coding regions. Among these genes, CFH is widely believed to be associated with age-related macular degeneration (AMD)^{44}. Although the other genes have not been reported to be related to AMD, they are all associated with other complex diseases. For example, it has been reported that the gene BBS9 is associated with Bardet-Biedl syndrome in a consanguineous Pakistani family^{45}, and INPP4B is an oncogenic regulator in human colon cancer^{40} and is upregulated and functions as an oncogenic driver through SGK3 in a subset of melanomas^{46}. Four other important genes (MPP7^{47}, ABL1^{48}, ANKS1B^{49, 50} and IMMP2L^{51, 52}) are also associated with somatic mutations in cancers.
As shown in Fig. 5(c), five SNPs ('rs3775652', 'rs7104698', 'rs10511467', 'rs10512413', 'rs1930022') have more adjacent nodes than the others; SNP 'rs3775652' is in gene INPP4B, SNP 'rs10512413' is in gene ABL1, and the other three SNPs are in non-gene-coding regions (NA). Fig. 5(d) clearly shows that the gene INPP4B is related to many genes. Therefore, it can be speculated that SNP 'rs3775652' and gene INPP4B also have an important effect on AMD.
Tables 3 and 4 list the top eight 3-way SNP-combinations with G-test p-values less than 1e-11 and the top eight 4-way SNP-combinations with G-test p-values less than 1e-12, respectively.
Table 3 shows that all eight 3-way SNP-combinations contain SNP loci that have a strong marginal effect on AMD, such as rs380390, rs10272438 and rs1363688.
In Table 4, three 4-way SNP-combinations, (rs2157998, rs1334722, rs1740752, rs7104698), (rs10489581, rs10517007, rs10511467, rs10507949) and (rs10489581, rs10511467, rs10507949, rs1025758), do not contain any SNP locus known to be associated with AMD, and each of their SNP loci has a low marginal effect. The last column shows that the SVM prediction accuracies of these SNP combinations are all above 75%, so they may be worth further study by biologists.
In addition, all chi-square p-values in Tables 3 and 4 equal zero, which demonstrates that the chi-square test loses efficacy for testing the association of high-order SNP-combinations when the number of samples is not large enough.
Discussion
In this work, we propose the NHSADHSC algorithm to detect high-order SNP combinations associated with the phenotype. The experimental results demonstrate that it has strong global exploration power for detecting high-order disease-causing models among thousands of SNPs. Compared with existing algorithms, it has the following advantages.

(1) For high-order disease-causing models in which some individual SNP loci have strong marginal effects, existing intelligent optimization algorithms are easily trapped in local search, which results in repeated search within a small region (a subset of the SNPs) and the loss of the optimal solution. To tackle this problem, our method adopts a niche technique to dynamically identify the SNP loci with marginal effects and then stores the identified loci in a taboo table, which effectively avoids repeated search in a local region and finds all possible disease-causing models in a short time.

(2) For diverse disease-causing models, to the best of our knowledge, many existing algorithms employ a single scoring method (e.g. a statistical test, a Bayesian network or a regression method) to identify disease models, which usually results in a preference for some specific types of disease model and a failure to identify other types. To address this problem, NHSADHSC employs two lightweight identification methods (the Bayesian-network-based K2-score and the Gini-score), which have been shown in the literature to be complementary to each other^{30}.

(3) For some high-order disease-causing models with very low or even no marginal effect, existing intelligent optimization algorithms (e.g. MACOED, CSE) and heuristic algorithms (BEAM) are usually powerless, because traditional scoring methods provide no suitable heuristic factor for detecting such disease models. After a large number of experiments, we found that joint entropy can provide heuristic clues for guiding the HS in its search for high-order disease models. Therefore, in this work, we employ three lightweight and complementary evaluation methods to identify disease-causing models, with joint entropy used as the heuristic factor for exploring disease models with very low or even no marginal effects.

(4) The experimental results on the simulated DME datasets demonstrate that the proposed NHSADHSC algorithm has very high detection power in the first stage and very low FDR values. Compared with two outstanding intelligent search algorithms, MACOED and CSE, our method has clear advantages in runtime, power, MEs and FDR.

(5) For the eight high-order DNME models, our method quickly finds the disease-causing models except for DNME7, which demonstrates that our method is able to detect some high-order disease models.

(6) In the real AMD experiment, NHSADHSC successfully found some widely reported SNP loci associated with AMD (e.g. 'rs380390', 'rs1329428', 'rs10272438', 'rs1363688') and also found some new SNP markers (e.g. 'rs3775652') that are associated with AMD. We notice that two SNPs widely believed to be associated with AMD ('rs380390', 'rs1329428') may be different driver factors for AMD, since they have no strong synergistic effect on AMD. In the analysis of the gene network, many SNP-combinations are mapped to the genes CFH, BBS9, ABL1, ANKS1B, IMMP2L, INPP4B and SGCD and to non-coding regions; CFH is believed to be associated with AMD, and the other genes have been associated with other complex diseases. Therefore, we speculate that the SNPs ('rs380390', 'rs10272438') and gene CFH are directly associated with AMD, while the others may be indirectly associated with AMD. In other words, the SNPs ('rs380390', 'rs10272438') and gene CFH may be driver loci for AMD, and the others passengers.
However, NHSADHSC is not a flawless method; it also has some shortcomings:

(1) For multiplicative models with small sample sizes, it performs poorly on TPR because some true disease-causing models cannot pass the G-test, which demonstrates that the G-test is also not good enough to adapt to all disease models.

(2) It is also powerless for some high-order DNME models, such as DNME7.
Methods
In this section, we first define a mathematical model for detecting k-way SNP combinations in section 5.1. In section 5.2, the proposed NHSADHSC algorithm, including the niche identification algorithm, is introduced in detail. In section 5.3, we introduce three scoring functions for calculating the association between a k-way SNP combination and disease status, and in section 5.4 we analyze the simulation datasets. In section 5.5, we present seven evaluation metrics for comparing the performance of the six algorithms. The parameter settings for the six algorithms are introduced in section 5.6.
Mathematical model for detecting k-way SNP combination
Let X = {X_{1}, X_{2}, …, X_{N}} be a set of SNP variables indicating N SNP markers for n individuals (samples), and let Y be the phenotype variable with values in {y_{1}, y_{2}, …, y_{J}}. We represent the homozygous major allele, heterozygous allele and homozygous minor allele as 0, 1 and 2, respectively. For a k-way combination model, I denotes the number of genotype combinations (there are 3^{k} genotype combinations for k SNP variables) and J is the number of states of the phenotype Y (which equals 2 for a case-control dataset). n_{i} is the number of samples in the dataset whose SNP loci take the i^{th} genotype combination, and n_{ij} is the number of samples with the i^{th} genotype combination that are associated with phenotype y_{j}.
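The counts n_{i} and n_{ij} can be collected in a single pass over the samples. The sketch below is only an illustration under stated assumptions: genotypes are coded 0, 1, 2 as above, phenotype states are coded as integers 0 to J−1, and the base-3 indexing of genotype combinations as well as the names `genotype_index` and `contingency_table` are choices made for this example.

```python
def genotype_index(genotypes):
    """Map k genotypes (each 0, 1 or 2) to an index in 0 .. 3^k - 1."""
    index = 0
    for g in genotypes:
        index = index * 3 + g
    return index


def contingency_table(snp_matrix, phenotype, combo, n_states=2):
    """Return table[i][j] = n_ij for the SNP columns listed in combo.

    snp_matrix: one list of genotypes per sample.
    phenotype:  one integer state per sample (0 .. n_states - 1).
    combo:      tuple of k SNP column indices.
    """
    table = [[0] * n_states for _ in range(3 ** len(combo))]
    for row, y in zip(snp_matrix, phenotype):
        i = genotype_index(row[s] for s in combo)
        table[i][y] += 1
    return table


if __name__ == "__main__":
    snps = [[0, 1, 2], [0, 1, 0], [2, 2, 1], [0, 1, 2]]  # 4 samples, 3 SNPs
    status = [1, 0, 1, 1]                                # 0 = control, 1 = case
    table = contingency_table(snps, status, combo=(0, 1))
    n_i = [sum(row) for row in table]                    # row sums give n_i
    print(table[1], n_i[1])
```

Because every score discussed later depends only on this table, the genotype frequencies need to be counted once per k-way combination.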
Definition (high-order association). Let S = {s_{1}, s_{2}, …, s_{k}} (1 < k < N, s_{i} ∈ X) be a set of k SNP loci, and let \(\varphi (S,Y)\) be a score function for evaluating the association between S and phenotype Y. The k-way SNP combination S is said to be synergistically associated with phenotype Y if and only if \(\varphi (S,Y)\succ \varphi ({S}^{\prime },Y)\) for all \({S}^{\prime }\subset S\) (\(\succ \) is a binary operator for comparing association strength), and it is said to be strongly associated with \(Y\) if \(\varphi (S,Y) > \theta \) (\(\theta \) is a threshold value).
The optimization problem of finding a k-way disease-causing combination model can be expressed as
$$\mathop{\max }\limits_{X}\,f(X,Y),\quad X=({X}_{{s}_{1}},{X}_{{s}_{2}},\cdots ,{X}_{{s}_{k}})$$
where \({s}_{i}\ (i=1,2,\ldots ,k)\) is the location of the i^{th} SNP site, \({X}_{{s}_{i}}\) denotes the value of the \({s}_{i}\)-th SNP marker, and \({X}_{{s}_{i}}\ne {X}_{{s}_{j}}\ (i\ne j)\). \(f(X,Y)\) denotes the objective function for evaluating the association between genotype \(X\) and phenotype \(Y\).
Niche Harmony Search Algorithm
The harmony search (HS) algorithm is a swarm intelligence optimization algorithm^{53}. It mimics the process of improvising a musical harmony when an orchestra aims to compose the most harmonious melody (see the standard HS algorithm in the supplementary info file). The HS algorithm does not depend on gradient information or on an initial search point, and it can solve both continuous and discrete combinatorial optimization problems efficiently. However, it is still not good enough for complex multimodal optimization problems in which multiple candidate solutions are required. Recently, some strategies have been adopted in intelligent optimization algorithms to find multiple solutions. For example, in refs 14,15,17 and 18 elite sets are employed to store optimal and suboptimal solutions, but the solutions in an elite set are likely to come from only one local region, so other solutions may be lost if the search algorithm has become trapped in local search. In recent years, niche techniques have received extensive attention as a way of obtaining all possible candidate solutions; they can effectively enhance the search capability of HS by avoiding repeated search in a small region.
For an optimization task, each harmony corresponds to a vector consisting of k decision variables. Some good harmonies form a harmony memory (HM), which is later used to create better harmonies. The harmony memory size (HMS) is defined as the number of harmonies in HM.
In this study, each harmony in HM denotes a k-way SNP combination, and the goal of optimization is to find the best harmonies (k-way models) that are associated with disease status Y.
To explore as many suspected k-way genetic variations as possible, we propose a niche HS algorithm (named NHSADHSC) for detecting k-way SNP combinations associated with the phenotype, in which a niche strategy^{54, 55} is merged into the HS algorithm to enhance its global exploration power.
The flowchart of NHSADHSC for the first (screening) phase is shown in Fig. 6, where algorithm (1) describes the process of improvising a new harmony and algorithm (2) presents the method for identifying a niche region.
Figure A–5 in the supplementary info file shows an example that explains how the NHSADHSC algorithm detects 3-way disease-causing models among a total of 10 SNPs.
During the search process of NHSADHSC, the niche identification algorithm is automatically triggered to identify a niche region when the harmony memories HM1, HM2 and HM3 cannot be updated for several iterations. For each niche region, the radius of the niche is recorded to prevent new solutions from being generated inside the niche, which effectively keeps the search algorithm from becoming trapped in a local region. The niche technique is introduced in detail in the supplementary info file.
Evaluation functions for calculating the association of SNP combinations with the phenotype
Three lightweight evaluation methods, the Bayesian-network-based K2-score, the Gini-score and joint entropy, are adopted to improve the detection power and speed of the HS algorithm; the former two provide adaptability to the diversity of disease models, and the latter targets disease models with low marginal effect.
K2-score and GI-score
For calculating the association of SNPs with the phenotype, the Bayesian-network-based K2-score and the Gini index (see supplementary info file) are the first choices. The two scoring methods are complementary in identifying diverse disease models^{30}: some models that cannot be identified correctly by the K2-score can be identified correctly by the Gini-score, and vice versa. Furthermore, as can be seen in equations (1–2), both the K2-score and the GI-score require the genotype combinations to be counted only once for each k-way SNP-combination, so the frequencies of genotype combinations do not have to be counted repeatedly.
Where \({p}_{i,j}\)(p _{ i,j } = n _{ ij }/n _{ i }) is the estimated probability that the i ^{th} genotype combination actually associated with phenotype \({y}_{j}\). \((1\sum _{j=1}^{J}{p}_{i,j}^{2})\) is the estimated probability that genotype combination is misclassified as phenotype \({y}_{j}\). \({P}_{i}\)(P _{ i } = n _{ i }/n) is the percentage of i ^{th} genotype combination in sample set.
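As a minimal sketch of how these quantities combine, the following Python function computes a Gini score from one pass over the samples. It assumes that the score is the \({P}_{i}\)-weighted sum of the misclassification term defined above; the function name `gini_score` and the input layout (one genotype tuple and one phenotype label per sample) are illustrative choices, not taken from the paper.

```python
from collections import Counter


def gini_score(genotypes, phenotypes):
    """Assumed form: GI = sum_i P_i * (1 - sum_j p_ij^2).

    genotypes  : list of tuples, one k-way genotype combination per sample
    phenotypes : list of class labels, e.g. 0 = control, 1 = case
    """
    n = len(genotypes)
    n_i = Counter(genotypes)                    # samples per genotype combination
    n_ij = Counter(zip(genotypes, phenotypes))  # samples per (genotype, phenotype) pair
    classes = set(phenotypes)
    score = 0.0
    for g, count in n_i.items():
        # 1 - sum_j p_ij^2 with p_ij = n_ij / n_i
        impurity = 1.0 - sum((n_ij[(g, y)] / count) ** 2 for y in classes)
        score += (count / n) * impurity         # weight by P_i = n_i / n
    return score
```

A combination whose genotypes separate cases from controls perfectly scores 0, while one that carries no information about a balanced two-class phenotype scores 0.5, so under this assumed form smaller values indicate stronger association.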
Joint entropy as a heuristic factor for guiding HS to explore disease-causing SNP combinations with very low or even no marginal effect
In general, an intelligent search algorithm (e.g. HS) can reach the global optimum while evaluating only a very small number of trial solutions, because heuristic factors are available to speed up the search. Heuristic factors are equally important for detecting disease causes; for example, the marginal effects of individual SNPs on the phenotype can guide HS to the disease-causing loci quickly. Sometimes, however, the individual SNPs in a high-order disease-causing model have no marginal effect on the phenotype, and available evaluation methods, such as Bayesian networks, logistic regression and mutual information, cannot tell whether a k-way SNP combination contains some of the disease-causing SNP loci. In this scenario, a k-way disease-causing model is like an isolated point in a very large search space, and no heuristic factor (clue) can guide the HS algorithm to it quickly.
To address this problem, we ran a large number of experiments and found that a k-way (k > 2) SNP combination containing some of the disease-causing SNP markers often has smaller joint entropy than one containing no SNPs of the causative model, because the genotypes of disease-causing SNP combinations often have smaller divergence than those of non-disease-causing SNP combinations. In other words, joint entropy can differentiate SNP combinations containing some causative SNPs from those containing none. By contrast, for disease data with very low marginal effect, existing scoring methods, such as the Bayesian-network-based K2-score, logistic regression and the Gini score, usually cannot distinguish SNP combinations that include some of the risk factors from other SNP combinations.
In the supplementary info file, we compare joint entropy with the logistic-regression-based AIC-score and the Bayesian-network-based K2-score^{17}, two scoring methods that are very effective for detecting disease models with marginal effect. The results show that joint entropy is more effective than both for detecting high-order disease-causing models with very low marginal effect.
For a k-way SNP combination \(X=({X}_{1},{X}_{2},\cdots ,{X}_{k})\), its joint entropy can be calculated as follows:
$$H(X)=-\sum _{i=1}^{I}\frac{{n}_{i}}{n}\,\log \frac{{n}_{i}}{n}\quad (3)$$
where n_{i} is the number of samples in the dataset taking the i^{th} genotype combination, n is the total number of samples in the dataset, and I is the number of genotype combinations.
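A minimal Python sketch of this computation is given below. It uses only the genotype-combination counts n_{i} and the sample size n, as in the definition above; the function name `joint_entropy`, the tuple-per-sample input and the choice of base-2 logarithm are assumptions made for illustration.

```python
import math
from collections import Counter


def joint_entropy(genotypes):
    """Joint entropy (in bits, an assumed unit) of a k-way SNP combination.

    genotypes : list of tuples, one k-way genotype combination per sample
    """
    n = len(genotypes)
    counts = Counter(genotypes)  # n_i for every observed genotype combination
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Because the phenotype never enters the computation, the value can be obtained from the same genotype counts that the phenotype-aware scores need, which is what makes it cheap to use as a search heuristic.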
It can be seen from equations (1), (2) and (3) that the K2-score, GI-score and joint entropy can all be obtained from a single count of the genotype frequencies of a k-way SNP combination. The computation cost of the three scoring functions is therefore not additive: the number of samples taking the i^{th} genotype combination need not be counted three times.
It is important to note that equation (3) considers only the genotype of the SNP combination, without regard to the phenotype, whereas equations (1) and (2) consider the genotype X and the phenotype Y simultaneously. In this work, joint entropy is not used as a method for identifying disease-causing models; rather, it serves as a heuristic factor that guides the harmony search algorithm to rapidly locate disease-causing SNP markers with no or very low marginal effect.
G-test
The G-test is a maximum likelihood statistical significance test^{31}. For samples of reasonable size, the G-test gives the same results as the chi-squared test. For some cell counts, however, it performs better than the chi-squared test^{56}, and for testing goodness of fit the G-test is more efficient than the Pearson χ^{2} test^{56,57,58}.
For a k-way SNP combination model, the G value is calculated as follows:
$$G=2\sum _{i=1}^{I}\sum _{j=1}^{J}{O}_{ij}\cdot \mathrm{ln}\,\frac{{O}_{ij}}{{E}_{ij}}$$
where O_{ij} and E_{ij} are, respectively, the observed and expected numbers of samples whose phenotype takes the state y_{j} and whose genotype takes the i^{th} k-way combination, and ln denotes the natural logarithm. The observed number O_{ij} is obtained from the dataset by simple counting. The expected number E_{ij} can be obtained according to the Hardy-Weinberg principle^{59}.
For high-order SNP combinations, the counts of some genotype combinations are often zero or very small. For example, a 2-way SNP combination has nine 2-way genotypes, as shown in Table 5.
As shown in Table 5, there are very few samples with genotypes “0–2”, “1–1” and “1–2”. The conventional G-test considers all nine genotypes, and the degrees of freedom equal (2–1) × (9–1) = 8. In this work, we consider that the three columns (“0–2”, “1–1” and “1–2”) should be excluded, so that the degrees of freedom equal (2–1) × (6–1) = 5, which is more precise than the standard G-test.
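The idea of excluding sparse genotype combinations can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `modified_g_test` and the default threshold `xi=5` are hypothetical, a genotype combination is treated as sparse when its total count does not exceed `xi`, and the expected counts are taken from the usual independence model on the retained combinations, which is a stand-in for however the expected counts are obtained in the paper.

```python
import math


def modified_g_test(observed, xi=5):
    """Return (G, degrees_of_freedom) after dropping sparse genotype combinations.

    observed : list of rows; observed[i][j] is the count of samples with the
               i-th genotype combination and the j-th phenotype state
    xi       : hypothetical sparsity threshold on the row total
    """
    n_geno, n_pheno = len(observed), len(observed[0])
    kept = [row for row in observed if sum(row) > xi]
    # Start from (I - 1)(J - 1) and subtract 1 for every excluded genotype combination.
    df = (n_geno - 1) * (n_pheno - 1) - (n_geno - len(kept))
    n = sum(sum(row) for row in kept)
    if n == 0:
        return 0.0, df
    pheno_tot = [sum(row[j] for row in kept) for j in range(n_pheno)]
    g = 0.0
    for row in kept:
        geno_tot = sum(row)
        for j, o in enumerate(row):
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = geno_tot * pheno_tot[j] / n  # assumed independence model
                g += o * math.log(o / expected)
    return 2.0 * g, df
```

For a 9 × 2 table in which three genotype combinations have at most `xi` samples, the sketch returns 5 degrees of freedom, matching the worked example above.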
To enhance the statistical precision of the G-test, we make a minor modification to the calculation of the G value, as follows:
$$G=2\sum _{i=1}^{I}\sum _{j=1}^{J}{O}_{ij}\cdot {P}_{ij}$$
$${P}_{ij}=\begin{cases}\mathrm{ln}\,\dfrac{{O}_{ij}}{{E}_{ij}}, & \sum _{j=1}^{J}{O}_{ij} > \xi \\ 0, & \text{otherwise}\end{cases}$$
where ξ is a threshold on the number of samples taking the i^{th} genotype combination. The degree of freedom d (d = (I − 1)(J − 1)) is modified correspondingly, as follows:
$$\begin{array}{l}d=(I-1)(J-1)\\ {\rm{for}}\,i=1\to I\\ \quad {\rm{if}}\,\sum _{j=1}^{J}{O}_{ij} < \xi \\ \quad \quad d=d-1\\ \quad {\rm{endif}}\\ {\rm{endfor}}\end{array}$$
Simulation Datasets
Twelve disease models with marginal effects (DME)
The 12 DME models^{17} have both marginal effects and interaction effects. They comprise four multiplicative models (DME1~DME4), four threshold models (DME5~DME8) and four concrete models (DME9~DME12).
DME1~DME4 (H^{2} = 0.005, MAF = 0.05, 0.1, 0.2 and 0.5) are multiplicative models with two disease loci, in which the disease prevalence of a genotype combination increases multiplicatively with the increasing presence of disease alleles. The genetic heritability (H^{2}) of DME1~DME4 is 0.005 in all four models, and their minor allele frequencies (MAF) are 0.05, 0.1, 0.2 and 0.5, respectively. Because the genetic heritability is very low, it is very difficult to identify the disease loci in these four models.
DME5~DME8 (H^{2} = 0.02, MAF = 0.05, 0.1, 0.2 and 0.5) are threshold models, in which the disease prevalence does not increase until the number of disease alleles passes a threshold. These four models have strong marginal and interaction effects, so a SNP marker with a strong marginal effect can form many false disease models with other SNP markers that are not truly associated with the phenotype.
DME9~DME12 (H^{2} = 0.02, MAF = 0.05, 0.1, 0.2 and 0.5) are concrete models with low marginal effect and strong interaction effect. The characteristics of the twelve DME models are compared in Figs. E1~E3 of the supplementary info file, and their parameters are presented in Table E1 (see supplementary info file). For each DME model, 100 simulation datasets were generated using GAMETES_2.0^{60} (https://sourceforge.net/projects/gametes/).
Eight high-order disease models with no marginal effects (DNME)
The DNME models are not constrained to specific predetermined models^{61}. They are generated using a multi-objective optimization algorithm that aims to maximize the joint effects of the k SNPs, minimize the marginal effects of the individual SNPs and, for some models, satisfy Hardy-Weinberg equilibrium (HWE) constraints. The datasets of the DNME models were downloaded from http://discovery.dartmouth.edu/model_free_data/ and comprise 8 DNME data models (see Table E2 in the supplementary info file) with three to five functional SNPs. For each data model, there are 100 datasets, each with 1500 controls and 1500 cases. DNME2, DNME4 and DNME6 are constrained by HWE; the other five have no HWE constraint.
Real AMD data
We use the NHSADHSC algorithm to conduct a high-order SNP association study on age-related macular degeneration (AMD) data^{33}. The AMD data contain 103611 SNPs genotyped for 50 controls and 96 cases. The experiment aims to find all suspected high-order SNP combinations strongly associated with the phenotype.
Evaluation metrics
In the simulation experiments, we adopt seven indices (Runtime, Power, MEs, TPR, SPC, ACC, and FDR) to evaluate the performance of the algorithms. The seven indices are defined as follows:
(1) Runtime: the time taken to find a disease-causing model, from the beginning of the search to its end.
(2) Power = #S/#T: a measure of the capability to detect the disease-causing models across all datasets, where #S is the number of datasets in which the disease-causing model is found and #T is the total number of datasets (in the experiments, there are 100 data matrices for each disease model).
(3) MEs: the mean number of SNP combinations whose association with the phenotype must be evaluated by the scoring methods before the disease-causing model is found. In the experiments, the search algorithm terminates automatically as soon as the known disease-causing model has been found, before the termination condition is met; the number of k-way SNP combinations evaluated up to that point is counted towards the mean evaluation times (MEs), and the elapsed time from start to end is the computation time (Runtime). The search algorithm also terminates if the number of SNP combinations evaluated exceeds the maximum allowable number.
(4) True positive rate: \(\mathrm{TPR}=\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})\)
(5) Specificity: \(\mathrm{SPC}=\mathrm{TN}/(\mathrm{FP}+\mathrm{TN})\) (if FP + TN = 0, then SPC = 0)
(6) Accuracy: \(\mathrm{ACC}=(\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{FN}+\mathrm{FP})\)
(7) False discovery rate: \(\mathrm{FDR}=\mathrm{FP}/(\mathrm{TP}+\mathrm{FP})\) (if TP + FP = 0, then FDR = 0)
TPR, SPC and ACC are employed in this study to measure the statistical precision of the hypothesis testing method applied to the disease models found in the screening stage. TP is the number of disease models that pass the threshold of the testing method (Bonferroni correction, p-value = 0.05/N, where N is the number of combinations), and FN is the number of disease models that fail to pass it. FP is the number of non-disease models that pass the threshold, and TN is the number of non-disease models that fail to pass it.
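The count-based indices can be written down directly, as in the following Python sketch. The function names are illustrative. The zero-denominator convention is stated in the text only for SPC and FDR, and applying it to TPR and ACC as well is an assumption; so is reading N as the number of k-way combinations among all SNPs in the dataset.

```python
import math


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """TPR, SPC, ACC and FDR from the four confusion counts."""
    return {
        "TPR": ratio(tp, tp + fn),
        "SPC": ratio(tn, fp + tn),
        "ACC": ratio(tp + tn, tp + tn + fn + fp),
        "FDR": ratio(fp, tp + fp),
    }


def power(found_flags):
    """Power = #S / #T, with one boolean per dataset (True if the model was found)."""
    return sum(found_flags) / len(found_flags)


def bonferroni_threshold(n_snps, k, alpha=0.05):
    """p-value threshold alpha / N, assuming N = C(n_snps, k) k-way combinations."""
    return alpha / math.comb(n_snps, k)
```

Under this reading, a 2-way analysis of a 100-SNP dataset involves 4950 combinations, so the threshold would be 0.05/4950.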
Parameters setting of NHSADHSC
Experiments for simulation datasets
The parameters of NHSADHSC are set as follows:
The sizes of HM1, HM2, HM3, Es1, Es2 and Es3 are all set to 50 for datasets with 100 SNPs and to 100 for datasets with more than 100 SNPs; the maximal size of the candidate set (CS) is 10. HMCR = 0.9 and PAR = 0.35. In the second stage, the p-value threshold is 0.05/N (Bonferroni correction, where N is the number of combinations). To avoid any location preference, the disease-causing SNPs are embedded at random positions in the simulation data.
For CSE, the fraction of eggs discarded in each generation is set to 0.25, the maximum number of steps in a Lévy flight is set to 1, the number of groups is 10, and the number of nests is set to 100.
The parameters of MACOED are set as follows: the number of ants is 500 for datasets with 1000 SNPs and 50 for datasets with 100 SNPs.
For SNPHarvester, the maximal and minimal orders of interaction are both equal to k (for k-way models).
The parameters of Boost and Beam are set to the default values given in the original papers.
To make a fair comparison, the three intelligent search algorithms NHSADHSC, CSE and MACOED use the same termination condition, the maximum number of SNP combinations evaluated (T_{max}): T_{max} = 4500 for datasets with 100 SNPs and T_{max} = 60000 for datasets with 1000 SNPs.
The six compared algorithms were run in the same computing environment: all experiments were performed on the Windows 7 operating system with an Intel(R) Core(TM) i3–3470 CPU@3.2 GHz and 8 GB memory, and all program codes were written in MATLAB R2014b.
Experiments for AMD data
The sizes of HM1, HM2, and HM3 are set to 500.
The sizes of Es1, Es2, and Es3 are all equal to 2000.
HMCR = 0.9 and PAR = 0.35.
T_{max} = k × 3E+6, where k is the number of SNPs in a high-order SNP combination.
The experimental environment is the same as for the simulation datasets.
Future work
It is widely acknowledged that multiple SNP loci may be important contributors to the pathogenesis of complex diseases; at present, however, there is still no effective approach for detecting multi-locus disease-causing models in GWAS, owing to the enormous computational burden. There is therefore much room to explore the detection of high-order disease models using high-performance and cloud computing. In addition, with the rapid development of new gene sequencing techniques, detecting epistatic interactions in non-coding genomic regions^{62, 63} and making sense of rare variants in GWAS are worth studying in the future.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
17 April 2018
A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper.
References
1. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences 106, 9362–9367 (2009).
2. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
3. Easton, D. F. et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447, 1087–1093 (2007).
4. Fellay, J. et al. A whole-genome association study of major determinants for host control of HIV-1. Science 317, 944–947 (2007).
5. Wan, X. et al. BOOST: a fast approach to detecting gene–gene interactions in genome-wide case–control studies. Am. J. Hum. Genet 87, 325–340 (2010).
6. Fang, G. et al. High-Order SNP Combinations Associated with Complex Diseases: Efficient Discovery, Statistical Power and Functional Interactions. PLoS one 7, 362–366, doi:10.1371/journal.pone.0033531 (2012).
7. Lehár, J., Krueger, A., Zimmermann, G. & Borisy, A. High-order combination effects and biological robustness. Mol Syst Biol 4, 215–215 (2008).
8. Storey, J. D. & Tibshirani, R. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences 100, 9440–5 (2003).
9. Yang, C. et al. SNPHarvester: A Filtering-based Approach for Detecting Epistatic Interactions in Genome-wide Association Studies. Bioinformatics 25, 504–511 (2009).
10. Guo et al. Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering. BMC Bioinformatics 15, 102, doi:10.1186/1471210515102 (2014).
11. Sangseob Leem et al. Fast detection of high-order epistatic interactions in genome-wide association studies using information theoretic measure. Computational Biology and Chemistry 50, 19–28 (2014).
12. Collins, R. L., Hu, T., Wejse, C., Sirugo, G., Williams, S. M. & Moore, J. H. Multifactor dimensionality reduction reveals a three-locus epistatic interaction associated with susceptibility to pulmonary tuberculosis. BioData Mining 6, 4, doi:10.1186/1756038164 (2013).
13. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39, 906–913 (2007).
14. Aflakparast, M. et al. Cuckoo search epistasis: a new method for exploring significant genetic interactions. Heredity 112, 666–674 (2014).
15. Wang, Y. et al. AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm. BMC Res. Notes 3, 117, doi:10.1186/175605003117 (2010).
16. Moore, J. H. et al. Bioinformatics challenges for genome-wide association studies. Bioinformatics 26, 445–455 (2010).
17. Jing, P.-J. & Shen, H.-B. MACOED: A multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics 31, 634–641 (2015).
18. Shang, J. et al. An improved opposition-based learning particle swarm optimization for the detection of SNP-SNP interactions. BioMed research international, doi:10.1155/2015/524821 (2015).
19. Jan Christian, K. et al. High-speed exhaustive 3-locus interaction epistasis analysis on FPGAs. Journal of Computational Science 9, 131–136 (2015).
20. Yang, G., Jiang, W., Yang, Q. & Yu, W. PBOOST: A GPU based tool for parallel permutation tests in genome-wide association studies. Bioinformatics 31(9), 1460–2 (2015).
21. Yosef, N., Yakhini, Z., Tsalenko, A., Kristensen, V. & Børresen-Dale, A. et al. A supervised approach for identifying discriminating genotype patterns and its application to breast cancer data. Bioinformatics 23, 91–98 (2007).
22. Wang, Z., Liu, T., Lin, Z., Hegarty, J., Koltun, W. et al. A general model for multilocus epistatic interactions in case-control studies. PloS One 5, doi:10.1371/journal.pone.0011384 (2010).
23. Zhang, Y. & Liu, J. S. Bayesian inference of epistatic interactions in case–control studies. Nature Genet 39, 1167–1173 (2007).
24. Cordell, H. J. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Hum. Mol. Genet 11, 2463–2468 (2002).
25. Cordell, H. J. Detecting gene–gene interactions that underlie human diseases. Nature Rev. Genet. 10, 392–404 (2009).
26. Wei, W. H., Hemani, G. & Haley, C. S. Detecting epistasis in human complex traits. Nat Rev Genet 15, 722–33 (2014).
27. Zhao, J., Jin, L. & Xiong, M. Test for interaction between two unlinked loci. Am. J. Hum. Genet 79, 831–845 (2006).
28. Zhang, Y., Zhang, J. & Liu, J. S. Block-based bayesian epistasis association mapping with application to WTCCC type 1 diabetes data. Ann Appl Stat 5, 2052–2077 (2011).
29. Wang, J. et al. A Bayesian model for detection of high-order interactions among genetic variants in genome-wide association studies. BMC Genomics 16, 1011 (2015).
30. Tuo, S., Zhang, J., Yuan, X., Zhang, Y. & Liu, Z. FHSASED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm. PloS one 11, doi:10.1371/journal.pone.0150669 (2016).
31. McDonald, J. H. G–test of goodness-of-fit. Handbook of Biological Statistics (Third ed.). Baltimore, Maryland: Sparky House Publishing. 53–58 (2014).
32. Shannon, P., Markiel, A. & Ozier, O. et al. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research 13, 2498–2504, doi:10.1101/gr.1239303 (2003).
33. Klein, R. J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389 (2005).
34. Lin, W.-Y. & Lee, W.-C. Incorporating prior knowledge to facilitate discoveries in a genome-wide association study on age-related macular degeneration. BMC Research Notes 3, 26, doi:10.1186/17560500326 (2010).
35. Tuo, J., Ning, B. & Bojanowski, C. M. et al. Synergic effect of polymorphisms in ERCC6 5′ flanking region and complement factor H on age-related macular degeneration predisposition. Proceedings of the National Academy of Sciences of the United States of America 103, 9256–9261 (2006).
36. Han, B., Chen, X. & Talebizadeh, Z. FEPIMB: identifying SNPs-disease association using a Markov Blanket-based approach. BMC Bioinformatics 12(Suppl 12), S3, doi:10.1186/1471210512S12S3 (2011).
37. Sivakumaran, T. A. et al. A 32 kb Critical Region Excluding Y402H in CFH Mediates Risk for Age-Related Macular Degeneration. Urtti A, ed. PLoS ONE 6, doi:10.1371/journal.pone.0025598 (2011).
38. Kwon, M. S., Park, M. & Park, T. IGENT: efficient entropy based algorithm for genome-wide gene-gene interaction analysis. BMC Medical Genomics 7(Suppl 1), doi:10.1186/175587947S1S6 (2014).
39. Jiang, R. et al. A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics 10, 1, doi:10.1186/1471210510S1S65 (2009).
40. Guo, S. T. et al. INPP4B is an oncogenic regulator in human colon cancer. Oncogene 35, 3049–3061 (2016).
41. Chen, X., Liu, C.-T., Zhang, M. & Zhang, H. A forest-based approach to identifying gene and gene–gene interactions. Proceedings of the National Academy of Sciences of the United States of America 104, 19199–19203, doi:10.1073/pnas.0709868104 (2007).
42. Wang, M., Zhang, M., Chen, X. & Zhang, H. Detecting Genes and Gene-gene Interactions for Age-related Macular Degeneration with a Forest-based Approach. Statistics in biopharmaceutical research 1, 424–430, doi:10.1198/sbr.2009.0046 (2009).
43. Shang, J. et al. CINOEDV: a co-information based method for detecting and visualizing n-order epistatic interactions. BMC Bioinformatics 17, 1, doi:10.1186/s1285901610768 (2016).
44. Toomey, C. B. et al. Regulation of age-related macular degeneration-like pathology by complement factor H. Proceedings of the National Academy of Sciences of the United States of America 112, E3040–E3049 (2015).
45. Khan, M. A. et al. Homozygosity mapping identified a novel protein truncating mutation (p. Ser100Leufs* 24) of the BBS9 gene in a consanguineous Pakistani family with Bardet Biedl syndrome. BMC medical genetics 17, 1, doi:10.1186/s1288101602719 (2016).
46. Chi, M. N. et al. INPP4B is upregulated and functions as an oncogenic driver through SGK3 in a subset of melanomas. Oncotarget 6, 39891–39907 (2015).
47. Vishal, M., Sharma, A. & Kaurani, L. et al. Genetic association and stress mediated downregulation in trabecular meshwork implicates MPP7 as a novel candidate gene in primary open angle glaucoma. BMC medical genomics 9(1), 1, doi:10.1186/s1292001601776 (2016).
48. Testoni, E. et al. Somatically mutated ABL1 is an actionable and essential NSCLC survival gene. EMBO molecular medicine 8, 105–116 (2016).
49. Eckel-Passow, J. E. et al. ANKS1B is a smoking-related molecular alteration in clear cell renal cell carcinoma. BMC urology 14, 1 (2014).
50. Herberich, S. E. et al. ANKS1B Interacts with the Cerebral Cavernous Malformation Protein-1 and Controls Endothelial Permeability but Not Sprouting Angiogenesis. PloS one 10(12), e0145304, doi:10.1371/journal.pone.0145304 (2015).
51. Bertelsen, B. et al. Intragenic deletions affecting two alternative transcripts of the IMMP2L gene in patients with Tourette syndrome. European Journal of Human Genetics 22, 1283–1289 (2014).
52. George, S. K., Jiao, Y., Bishop, C. E. & Lu, B. Mitochondrial peptidase IMMP2L mutation causes early onset of age-associated disorders and impairs adult stem cell self-renewal. Aging cell 10, 584–594 (2011).
53. Geem, Z. W., Kim, J. & Loganathan, G. Music-inspired optimization algorithm harmony search. Simulation 76, 60–8 (2001).
54. Yu, E. L. & Suganthan, P. N. Ensemble of niching algorithms. Information sciences 180, 2815–2833 (2010).
55. Ali, M. Z. & Awad, N. H. A novel class of niche hybrid Cultural Algorithms for continuous engineering optimization. Information sciences 267, 158–190 (2014).
56. Harremoës, P. & Tusnády, G. Information divergence is more chi squared distributed than the chi squared statistic. Proceedings ISIT 2012, 538–543 (2012).
57. Quine, M. P. & Robinson, J. Efficiencies of chi-square and likelihood ratio goodness-of-fit tests. Annals of Statistics 13, 727–742 (1985).
58. Harremoës, P. & Vajda, I. On the Bahadur-efficient testing of uniformity by means of the entropy. IEEE Transactions on Information Theory 54, 321–331 (2008).
59. Crow, J. Hardy, Weinberg and language impediments. Genetics 152, 821–825 (1999).
60. Urbanowicz, R. J., Kiralis, J., Sinnott-Armstrong, N. A., Heberling, T., Fisher, J. M. & Moore, J. H. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData mining 5, 1–14 (2012).
61. Himmelstein et al. Evolving hard problems: Generating human genetics datasets with a complex etiology. BioData Mining 4, 21, doi:10.1186/17560381421. http://discovery.dartmouth.edu/model_free_data/ (2011).
62. Jing, L., Horstman, B. & Chen, Y. Detecting epistatic effects in association studies at a genomic level based on an ensemble approach. Bioinformatics 27, i222–i229, doi:10.1093/bioinformatics/btr227 (2011).
63. Upton, A., Trelles, O. & Cornejo-García, J. A. et al. Review: High-performance computing to detect epistasis in genome scale data sets. Briefings in Bioinformatics 17(3), 368–379 (2016).
Acknowledgements
This work was supported by the Natural Science Foundation of China under Grants 61571341, 61201312, 91530113 and 11401357, Research Fund for the Doctoral Program of Higher Education of China (No. 2013 0203110017), the Fundamental Research Funds for the Central Universities of China (Nos BDY171416 and JB140306), the Natural Science Foundation of Shaanxi Province in China (2015JM6275), Free exploration projects for 2017 basic researchrelated expenses.
Author information
Affiliations
School of Computer Science and Technology, Xidian University, Xi’an, 710071, P.R. China
Shouheng Tuo, Junying Zhang, Xiguo Yuan, Zongzhen He, Yajun Liu & Zhaowen Liu
School of Mathematics and Computer Science, Shaanxi University of Technology, Hanzhong, 723000, P.R. China
 Shouheng Tuo
Contributions
Shouheng Tuo first proposed the NHSADHSC algorithm and performed all experiments; Junying Zhang contributed many constructive ideas and guidance. Shouheng Tuo wrote the manuscript and Junying Zhang revised it in detail; Xiguo Yuan, Zongzhen He, Yajun Liu and Zhaowen Liu also contributed many good ideas to this work.
Competing Interests
The authors declare that they have no competing interests.
Corresponding authors
Correspondence to Shouheng Tuo or Junying Zhang.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.