Introduction

Population-based association studies are among the most important approaches for discovering disease–genotype relationships (Freimer and Sabatti 2004). The disease susceptibility of an individual may be predicted once the disease–genotype relationship is found. Genetic association analyses have been performed on either unphased single nucleotide polymorphisms (SNPs) or phase-resolved haplotypes (Schaid 2004). A haplotype block spans a chromosomal region where the allelic variants are tightly linked to one another [i.e., in linkage disequilibrium (LD)] (Schaid 2005). A haplotype is a combination of allelic variants located within a haplotype block along a single chromosome (Epstein and Satten 2003). Lengths of typical human haplotype blocks range from a few kilobases to several hundred kilobases (Meng et al. 2003). The average length of human genes is 27 kb (Carlson et al. 2004), which is approximately on the same scale as haplotype blocks. Haplotype-based associations can detect chromosomal regions that harbor disease-causing variants, even when the variants themselves are not genotyped (Evans et al. 2004; Fallin et al. 2001). In addition, haplotypes are more polymorphic than SNPs, offering a more flexible stratification of the population. Haplotype-based associations have been employed in many disease association studies (Epstein and Satten 2003).

The International HapMap project produced a valuable reference of haplotype blocks of the human genome (The International HapMap Consortium 2003). Haplotypes of an entire block can be represented by a smaller set of SNPs referred to as tagging SNPs (Meng et al. 2003). It has been demonstrated empirically that uncommon polymorphisms of drug-related genes can be well represented by haplotypes constructed using tagging SNPs (Kamatani et al. 2004). Therefore, a proper selection of tagging SNPs can reduce the cost, effort and complexity of a study while maintaining statistical power (Carlson et al. 2004; Goldstein and Cavalleri 2005; Meng et al. 2003). Haplotypes of each individual can be derived from unphased SNPs using a variety of well-tested algorithms such as PHASE (Stephens et al. 2001) or HAPLOTYPER (Niu et al. 2002), among others. The derived haplotypes are then utilized for haplotype-based associations (Fallin et al. 2001).

A complex trait is unlikely to associate prominently with a single allelic variant. On the contrary, it could be the consequence of complex biological mechanisms involving multiple genes in multiple genomic regions. Analysis of epistasis has been advocated for deciphering such mechanisms, particularly when each involved gene demonstrates only a minor marginal effect (Bell et al. 2006; Carlborg and Haley 2004). Interaction-based strategies have been demonstrated to outperform locus-by-locus search methods for complex traits (Marchini et al. 2005). Recently, an interaction-based method, GABA, has been proposed for detecting epistasis among unphased SNPs (Liang et al. 2006). GABA has also been applied in research on diabetic nephropathy (Hsieh et al. 2006). In this paper, we address the issue of epistasis among haplotypes in multiple genomic regions. The proposed methodology, referred to as the Haplotype Association based on Boolean Algebra (HABA), is an extension of GABA that aims to overcome the challenges posed by epistasis among haplotypes. HABA can be used in conjunction with GABA, as well as traditional locus-by-locus methods, for assessing associations from both SNPs and haplotypes.

Materials and methods

Haplotype association based on boolean algebra

The proposed methodology is designed to discover epistatic effects among phase-resolved haplotypes in multiple genomic regions. The epistatic effects are expressed as prediction models that indicate the susceptibility of a person to a dichotomous phenotypic trait.

Denote L as the number of genomic regions. Linkage equilibrium is assumed among regions. Each region accommodates a multi-allelic haplotype profile H_l = {h_{l,0}, h_{l,1}, h_{l,2}, …, h_{l,i}}, where i is the index of haplotypes at a particular genomic site l, 0 ≤ l < L. When x bi-allelic SNPs occur within a region, the haplotype profile may have up to 2^x different haplotypes. The number of haplotypes actually observed in H_l, however, is generally smaller.

Denote B_l = {b_ln1, b_ln2 | 0 ≤ n < N; b_ln1, b_ln2 ∈ H_l} as the collection of haplotype pairs at site l, where b_ln1 and b_ln2 are the two haplotypes carried by the nth individual and N is the total number of individuals. Denote T = {B_l | 0 ≤ l < L} as the entire dataset, including both case and control individuals. In practice, T is formatted as a two-dimensional table, with each row representing an individual and each column representing B_l. Each cell of T corresponds to the pair of haplotypes {b_ln1, b_ln2} of a particular individual n at a particular site l.

A prediction model M comprises a chain of haplotype markers (m_k) joined together by the Boolean operators multiplication (*, logical AND) and addition (+, logical OR). It is denoted succinctly as M(m_k | 0 ≤ k < K, K ≤ L), where K is the number of haplotype markers in the model. Each haplotype marker defines an assessment of a single genomic region B_l, the result of which is either true or false. A haplotype marker is denoted succinctly by the region index l followed by a binary-valued vector with one entry per haplotype in H_l: m_k = l 〈h_{l,0}, h_{l,1}, h_{l,2}, …, h_{l,i}〉. For example, m_k = l 〈0, 0, 1, 0, 1, 0〉 defines the following assessment on the nth individual:

$$ m_{k} = \begin{cases} \text{True} & \text{if } \mathbf{any} \text{ of } b_{ln1} \text{ or } b_{ln2} \text{ equals } h_{l,2} \text{ or } h_{l,4} \\ \text{False} & \text{otherwise} \end{cases} $$
(1)

The complement marker of m_k, denoted as m_k^C, is defined as

$$ m^{C}_{k} = \begin{cases} \text{True} & \text{if } \mathbf{both} \text{ } b_{ln1} \text{ and } b_{ln2} \text{ equal } h_{l,0},\ h_{l,1},\ h_{l,3} \text{ or } h_{l,5} \\ \text{False} & \text{otherwise} \end{cases} $$
(2)

The intersection of a marker (m_k) and its complement marker (m_k^C) is an empty set. Their union contains all possible haplotypes in H_l. In this way, H_l is partitioned into two mutually exclusive groups defined by m_k and m_k^C, respectively. One group of haplotypes is associated with disease susceptibility while the other is associated with non-susceptibility. A typical haplotype-based association is performed on each specific haplotype h_{l,i}, reflecting the differences in haplotype frequencies between case and control groups. In comparison, the omnibus haplotype-profile test assesses the entire haplotype frequency profile of H_l and reports an overall P-value (Fallin et al. 2001). Our method, on the other hand, finds the optimum partition of H_l = {h_{l,0}, h_{l,1}, h_{l,2}, …, h_{l,i}}, which provides valuable information in addition to the above two methods.
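The marker and complement-marker assessments of Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names and the encoding of haplotypes as integer indices into H_l are assumptions made here.

```python
from typing import Sequence, Tuple

def marker_true(mask: Sequence[int], pair: Tuple[int, int]) -> bool:
    """Eq. (1): true if either haplotype of the pair is flagged in the mask."""
    return any(mask[h] == 1 for h in pair)

def complement_true(mask: Sequence[int], pair: Tuple[int, int]) -> bool:
    """Eq. (2): true if both haplotypes of the pair are unflagged in the mask."""
    return all(mask[h] == 0 for h in pair)

# The example marker from the text: flags h_{l,2} and h_{l,4}.
mask = [0, 0, 1, 0, 1, 0]

# Hypothetical haplotype pairs (indices into H_l) for four individuals.
for pair in [(2, 0), (1, 3), (4, 4), (5, 0)]:
    # Exactly one of the marker and its complement holds for every pair.
    assert marker_true(mask, pair) != complement_true(mask, pair)
    print(pair, marker_true(mask, pair), complement_true(mask, pair))
```

Because the complement marker is the logical negation of the marker, every individual falls into exactly one of the two groups, which is the partition described above.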

A Boolean variable indicates the result of assessment of a haplotype marker. The variables are linked together by Boolean operators to construct M, a Boolean statement. The prediction result of an individual (which is either true or false) is computed from the values of Boolean variables. A Boolean statement can accommodate various types of relationships between variables, including the exclusive OR relationship (see “Discussion”).
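As a minimal sketch of how such a Boolean statement might be evaluated for one individual, the snippet below stores a model as a sum (+, OR) of products (*, AND) of markers. The sum-of-products layout, the assumption that * binds more tightly than +, the type names and the two example individuals are all illustrative choices, not part of the published method.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]                      # two haplotype indices at one region
Literal = Tuple[int, Sequence[int], bool]   # (region l, binary mask, complemented?)

def literal_true(lit: Literal, person: Dict[int, Pair]) -> bool:
    """Evaluate one marker (or its complement) on one individual."""
    region, mask, complemented = lit
    hit = any(mask[h] == 1 for h in person[region])
    return (not hit) if complemented else hit

def model_true(model: List[List[Literal]], person: Dict[int, Pair]) -> bool:
    """Outer list is '+' (OR); each inner list is a '*' (AND) term."""
    return any(all(literal_true(lit, person) for lit in term) for term in model)

# M = 3<0, 1> + 2<0, 0, 1> * 4<0, 0, 1>
model = [
    [(3, [0, 1], False)],
    [(2, [0, 0, 1], False), (4, [0, 0, 1], False)],
]

# Hypothetical individuals: region index -> pair of haplotype indices.
person_a = {2: (0, 2), 3: (0, 0), 4: (2, 1)}
person_b = {2: (0, 2), 3: (0, 0), 4: (0, 1)}

print(model_true(model, person_a))  # True: regions 2 and 4 both carry haplotype 2
print(model_true(model, person_b))  # False: region 4 lacks haplotype 2
```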

Model optimization

We aimed to find a model which has the highest prediction performance on the dataset T,

$$ M_{{{\text{optimum}}}} = \arg {\mathop {\max }\limits_M }F{\left( T \right)} $$
(3)

where F(T) is the Fitness score indicating the prediction performance of M on T. Similar to GABA, HABA adopts a genetic algorithm for the optimization process (Liang et al. 2006), in which candidate models are repeatedly altered by either mutation or cross-over operations to find adequate combinations of haplotype markers. Denote R as the case population and R^C as the control population, so that T = R + R^C. The Sensitivity of a model M on T is Pr(M = 1 | R), the probability of M being true within the case population; the Specificity is Pr(M^C = 1 | R^C), the probability of M^C being true (M being false) within the control population. In this paper, we defined F as

$$ F = {\text{sensitivity}} + {\text{specificity}}. $$
(4)

Sensitivity and specificity are two important clinical indexes of prediction performance. In comparison, positive predictive values (Pr(R|M)) and negative predictive values (Pr(R^C|M^C)) are posterior probabilities, which are only adequate when prior probabilities (such as Pr(R)) are accurately estimated in the population (Yang et al. 2003). The likelihood ratio (LR(M)) is another commonly used performance index (Yang et al. 2003):

$$ LR(M) = \frac{\Pr(M \mid R)}{\Pr(M \mid R^{C})} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} $$
(5)
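The indexes in Eqs. (4) and (5) can be computed directly from a vector of model predictions and a vector of case/control labels. The sketch below assumes that a true prediction indicates a case; the function name, the dictionary keys and the eight-person toy input are hypothetical.

```python
from typing import Dict, Sequence

def performance(pred: Sequence[bool], is_case: Sequence[bool]) -> Dict[str, float]:
    """Sensitivity, specificity, Fitness (Eq. 4) and likelihood ratio (Eq. 5)."""
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    true_pos = sum(p and c for p, c in zip(pred, is_case))
    true_neg = sum((not p) and (not c) for p, c in zip(pred, is_case))
    sens = true_pos / n_case
    spec = true_neg / n_ctrl
    # The ratio is undefined when specificity is 1; report infinity in that case.
    lr = sens / (1 - spec) if spec < 1 else float("inf")
    return {"sensitivity": sens, "specificity": spec,
            "fitness": sens + spec, "lr": lr}

# Toy input: four cases followed by four controls.
pred = [True, True, True, False, True, False, False, False]
is_case = [True, True, True, True, False, False, False, False]
print(performance(pred, is_case))
# sensitivity 0.75, specificity 0.75, fitness 1.5, lr 3.0
```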

Simulation

The penetrance rate of the underlying model is an indication of the difficulty of detecting the model accurately. The penetrance rate of a single genetic marker m is defined as a conditional probability (Zhao et al. 2003)

$$ \text{Penetrance rate of } m = \Pr(R \mid m). $$
(6)

The penetrance rate needs to be distinguished from prevalence, the proportion of affected individuals in a population, i.e. Pr(R). We investigated both the penetrance of individual haplotype markers at a single genomic region and the penetrance of the entire model involving several haplotype markers, Pr(R|M). The aim of HABA is to detect M under conditions where individual markers have only moderate marginal effects. The sum of Pr(R|M) and Pr(R^C|M^C) has been proved to be greater than 1 (Appendix). To reflect the level of difficulty of the simulation using both Pr(R|M) and Pr(R^C|M^C), we assume they are equal in this simulation, i.e. penetrance rate = Pr(R|M) = Pr(R^C|M^C). Under this assumption, the minimum penetrance rate is 50%.
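As a worked illustration, suppose in addition that cases and controls are equally frequent in the dataset, Pr(R) = Pr(R^C) = 1/2, as with 1,000 cases and 1,000 controls. This balance is an assumption introduced here for the derivation. Writing p for the common penetrance rate and q = Pr(M), the law of total probability gives

$$ \Pr(R) = pq + (1-p)(1-q) = \tfrac{1}{2} \;\Rightarrow\; (2p-1)\,q = p - \tfrac{1}{2} \;\Rightarrow\; q = \tfrac{1}{2} \quad (p \ne \tfrac{1}{2}), $$

and Bayes' rule then yields

$$ \text{Sensitivity} = \Pr(M \mid R) = \frac{\Pr(R \mid M)\,\Pr(M)}{\Pr(R)} = p, \qquad \text{Specificity} = \Pr(M^{C} \mid R^{C}) = \frac{p\,(1-q)}{1/2} = p. $$

Under these assumptions the expected Fitness of the underlying model would be 2p, for example 120% at a penetrance rate of 60%.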

Datasets for the simulation were generated randomly, according to the specified number of cases and controls, as well as L. For each region, a multi-allelic haplotype profile H_l = {h_{l,0}, h_{l,1}, h_{l,2}, …, h_{l,i}} was randomly generated. The number of haplotypes in each profile was a randomly generated value between 2 and 7, numbers commonly observed in the human genome. Each haplotype in H_l was assumed to have equal frequency for simplicity. The underlying models were also generated randomly according to the specified number of markers involved. The marker regions were randomly chosen from the L regions, and the markers themselves were then randomly determined based on the H_l of the simulation dataset. Finally, the haplotypes in the marker regions were randomly modified to meet the specified penetrance requirement of the underlying model.
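The first part of this procedure, drawing random haplotype profiles and a random table T, might look like the sketch below. It covers only the unconstrained random draw. The later step that modifies marker-region haplotypes to reach a target penetrance is not shown, because the text does not specify how it was done. The function name, the seed parameter and the inclusive reading of "between 2 and 7" are assumptions.

```python
import random
from typing import List, Tuple

def random_dataset(n_individuals: int, n_regions: int,
                   seed: int = 0) -> Tuple[List[int], List[List[Tuple[int, int]]]]:
    """Draw haplotype profile sizes and a table T of haplotype pairs."""
    rng = random.Random(seed)
    # Number of distinct haplotypes per region, assumed uniform on 2..7 inclusive.
    n_haps = [rng.randint(2, 7) for _ in range(n_regions)]
    # Each cell holds a pair of haplotype indices; equal haplotype frequencies
    # are modelled by drawing each index uniformly.
    table = [
        [(rng.randrange(k), rng.randrange(k)) for k in n_haps]
        for _ in range(n_individuals)
    ]
    return n_haps, table

n_haps, table = random_dataset(n_individuals=2000, n_regions=8, seed=1)
print(n_haps)      # profile sizes for the 8 regions
print(table[0])    # haplotype pairs of the first individual
```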

Three sets of simulations were conducted. The first set demonstrated the behavior of HABA when the underlying models involved various numbers of genomic regions and had either complete penetrance or various levels of incomplete penetrance. Twenty different models were generated randomly, one for each condition (number of markers = 1–4; penetrance = 60–100%). The models were then used to dictate the generation of individual genotypes for 1,000 cases and 1,000 controls. Therefore N = 2,000. The number of genomic regions (i.e. L) was 8. The program halted when the best model remained unchanged for 1,000 iterations.

The second set of simulations was designed to evaluate the average performance of HABA on 50 replicated tests. We employed datasets comprising five genomic regions and an embedded model comprising three haplotype markers under a variety of penetrance rates (60–100%). The number of iterations for halting in this test was 300.

The third set of simulations was a permutation test (Hirschhorn and Daly 2005) showing the empirical significance level of the detected model when the dataset contains 100 genomic regions. The number of iterations for halting in this test was 150.

The heuristic parameters of this algorithm were identical to those previously described (Liang et al. 2006). A total of 300 models were used within an iteration of the computation. The Fitness score is defined in Eq. (4). No parsimonious constraints were used, apart from the condition when an equal Fitness score occurs on two candidate models. In such a condition, the one involving fewer markers would be ranked higher.
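The ranking rule and the halting condition described above can be sketched as follows. The tuple layout, the function names and the toy candidates are hypothetical, and treating a tie as exact equality of Fitness scores is an assumption.

```python
from typing import List, Tuple

Candidate = Tuple[str, float, int]  # (label, Fitness score, number of markers)

def rank_candidates(cands: List[Candidate]) -> List[Candidate]:
    """Higher Fitness first; on equal Fitness, fewer markers first."""
    return sorted(cands, key=lambda c: (-c[1], c[2]))

def should_halt(best_history: List[str], patience: int) -> bool:
    """True once the best model has stayed unchanged for `patience` iterations."""
    if len(best_history) <= patience:
        return False
    recent = best_history[-(patience + 1):]
    return all(b == recent[0] for b in recent)

# Toy candidates: A and B tie on Fitness, so B (fewer markers) ranks first.
ranked = rank_candidates([("A", 1.62, 3), ("B", 1.62, 2), ("C", 1.58, 1)])
print([c[0] for c in ranked])                     # ['B', 'A', 'C']
print(should_halt(["A", "B", "B", "B"], patience=2))  # True
```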

Results

Table 1 presents examples of models which were employed in the first set of simulations. These models all comprise four haplotype markers (each at a particular genomic region) but have various penetrance rates in the datasets. The marginal haplotype frequency Pr(m) and penetrance Pr(R|m) vary because the models and datasets were randomly generated (Table 1). However, it can be seen that the penetrance of an individual haplotype marker was usually smaller than the model penetrance. Thus, the dataset simulates epistasis in which the associations reveal themselves in the combination of multiple haplotype markers, rather than in individual markers.

Table 1 Marginal penetrance of haplotype markers, as well as the marker frequencies, when four genomic regions were involved in the underlying model

The first set of simulations evaluated the performance of HABA when various numbers of genomic regions (between 1 and 4) were involved, and when various penetrance rates (between 60 and 100%) were observed in the data. The detected models were identical to the underlying models whenever the latter had complete penetrance, resulting in 100% sensitivity and specificity. The underlying models were also detected accurately when their penetrance rates were 90, 80 and 70%. When the penetrance rate of the models was further reduced to 60%, “over-fitting” models were detected instead of the accurate underlying models, resulting in Fitness scores higher than the expected value (Table 2). The detected models not only contain some of the correct markers, but also introduce several additional markers (Table 2). From these experiments, we observed that the prediction accuracy is independent of the number of regions involved. It is, however, dependent on the penetrance rate of the model.

Table 2 Comparisons of the underlying models and the detected models when the penetrance rate was 60%

We also compared the computation cost, expressed as the number of iterations, for detecting the optimum model under various conditions (Table 3). The numbers of iterations did not include the additional 1,000 iterations after the optimum models were achieved (see “Materials and methods”). Table 3 shows that the average number of iterations is dependent on the number of markers in the underlying model. The penetrance rate, on the other hand, did not affect the computation cost.

Table 3 The iterations required for finding the optimum model

Having observed the general performance of the algorithm under a variety of conditions, we conducted the second set of simulations to calculate the average performance, including the sensitivity, specificity and computational cost, at various penetrance rates between 60 and 100%. The underlying model comprises three haplotype markers and characterizes control samples:

$$ M = 3\,\langle 0, 1\rangle + 2\,\langle 0, 0, 1\rangle * 4\,\langle 0, 0, 1\rangle. $$

The results are presented in Table 4, where each value is an average of 50 tests. The error sum is defined as the average of the sum of absolute differences between the measured and expected sensitivity and specificity values. The error sum increases as the penetrance decreases, implying that the accuracy depends on the penetrance rate. This is consistent with the observations from the first set of simulation. The error count is defined as the number of tests when the underlying model was not detected among the 50 replicates. Although the underlying model was not always accurately detected, HABA detected approximate models, resulting in small error sum values. The computation cost, measured by the averaged numbers of iterations, does not depend on the penetrance rate.

Table 4 The averaged performance of 50 tests under various penetrance between 60 and 100%

The third set of simulations examined the distribution of performance indexes when the labels of phenotypes (i.e. R and R^C) were randomly permuted (Hirschhorn and Daly 2005). An underlying model of two haplotype markers was randomly generated. This model indicates control samples:

$$ M = 8\,\langle 1, 0, 1, 0, 0, 0, 0\rangle * 48\,\langle 0, 1, 0\rangle. $$

This model was then used to guide the generation of a dataset comprising 100 genomic regions, 1,000 cases and 1,000 controls. HABA detected the underlying model in two repeated runs, after 171 and 342 iterations of computation, showing its capability on datasets with L = 100. The phenotype labels were then randomly permuted, resulting in 116 permuted datasets. These datasets simulate the situation where no association occurs, so the null distribution can be derived from them. The histograms of sensitivity and specificity of the models detected by HABA are illustrated in Fig. 1. The distributions are approximately normal, with the bulk of sensitivity and specificity values occurring between 50 and 60%. The histogram of Fitness, illustrated in Fig. 2, represents the empirical null distribution of prediction performance. The bulk of the null distribution occurs around 110%. Although both sensitivity and specificity have wide distributions, their sum (i.e. the Fitness) is quite narrowly distributed. According to the null distribution of Fig. 2, the empirical type I error, or the P-value, is smaller than 0.0172 (which is 2/116) if the Fitness of a model is equal to or greater than 115%.
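The permutation procedure can be sketched generically as below. Running the model search on each permuted dataset is left out, since it depends on the full genetic algorithm. The helper names are hypothetical, and the null Fitness values in the usage example are made-up placeholders chosen only to show the arithmetic 2/116, not results from the paper.

```python
import random
from typing import List, Sequence

def permuted_labels(labels: Sequence[int], n_perm: int, seed: int = 0) -> List[List[int]]:
    """Return n_perm random shufflings of the case/control labels."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        out.append(shuffled)
    return out

def empirical_p(observed: float, null_scores: Sequence[float]) -> float:
    """Fraction of null Fitness scores at least as large as the observed one."""
    return sum(s >= observed for s in null_scores) / len(null_scores)

# Placeholder null scores: 116 permutations, two of them reaching 1.15 or more.
null_scores = [1.10] * 114 + [1.15, 1.16]
print(round(empirical_p(1.15, null_scores), 4))  # 0.0172, i.e. 2/116
```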

Fig. 1 The histograms of sensitivity and specificity when the labels of phenotypes (R and R^C) were permuted randomly

Fig. 2 The histogram of Fitness when the labels of phenotypes (R and R^C) were permuted randomly, simulating the situation where no association occurs in the dataset. Hence, this is an empirical null distribution of performance from which the empirical type I error can be estimated

Discussion

Analysis of epistasis and haplotype-based association are both important issues for genetic associations. The proposed HABA methodology addresses both issues at the same time. It can be used to construct prediction models involving multiple haplotypes in different genomic regions. HABA enables the discovery of relationships among these haplotypes, facilitating further interpretation of biological mechanisms.

De Morgan duality and mode of inheritance

A model can address both dominant and recessive modes of inheritance at a genomic region. At the level of single haplotype markers, m_k describes the dominant mode of inheritance, while m_k^C accommodates the recessive mode of inheritance (cf. Eqs. 1, 2). According to De Morgan’s law on duality (Liang et al. 2006), if a model is constructed from m_k for predicting either the case or the control group with a dominant mode of inheritance, then a corresponding model, consisting of m_k^C, is simultaneously determined for indicating the other group with the recessive mode of inheritance. In other words, M(m_k | 1 ≤ k ≤ K, K ≤ L) and M^C(m_k^C | 1 ≤ k ≤ K, K ≤ L) are a pair of models for a dichotomous trait, where Pr(M + M^C) = 1. Whether M or M^C is associated with cases or controls depends on model optimization. HABA cannot accommodate the additive mode of inheritance at this moment.
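The duality can be checked exhaustively on a small example. The sketch below uses a hypothetical three-marker model, M = m1 + m2 * m3. Its dual is obtained by swapping the two operators and replacing every marker with its complement. The function names are illustrative only.

```python
from itertools import product

def model(m1: bool, m2: bool, m3: bool) -> bool:
    """Hypothetical model M = m1 + m2 * m3."""
    return m1 or (m2 and m3)

def dual_model(c1: bool, c2: bool, c3: bool) -> bool:
    """Dual model M^C = m1^C * (m2^C + m3^C), taking complement markers as input."""
    return c1 and (c2 or c3)

def duality_holds() -> bool:
    """Check that M^C is true exactly when M is false, for all marker values."""
    return all(
        dual_model(not m1, not m2, not m3) == (not model(m1, m2, m3))
        for m1, m2, m3 in product([False, True], repeat=3)
    )

print(duality_holds())  # True
```

Since exactly one of M and M^C is true for every individual, the pair satisfies Pr(M + M^C) = 1.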

The epistasis between different genomic regions might appear in a more complex form known as the exclusive OR logic (XOR), apart from simple AND/OR relationships. The exclusive OR logic can also be achieved by the HABA structure because exclusive OR is equivalent to the following equation composed of {*, +} operators:

$$ m_{1} \;XOR\;m_{2} =m_{1} \;{\text{*}}\;m^{C}_{2} + m^{C}_{1} \;{\text{*}}\;m_{2} . $$
(7)

Therefore, our formation of models can accommodate the XOR relationship between genomic regions.
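Eq. (7) can be verified with a four-row truth table. The following sketch is a simple check of the identity; the function name is hypothetical.

```python
from itertools import product

def xor_via_and_or(m1: bool, m2: bool) -> bool:
    """Right-hand side of Eq. (7): m1 * m2^C + m1^C * m2."""
    return (m1 and not m2) or (not m1 and m2)

for m1, m2 in product([False, True], repeat=2):
    # Python's != on booleans is exclusive OR.
    assert xor_via_and_or(m1, m2) == (m1 != m2)
    print(m1, m2, xor_via_and_or(m1, m2))
```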

HABA is mainly designed for datasets where all the haplotypes of each individual have been unambiguously determined. However, it is almost impossible for all the haplotypes in all the regions to be unambiguously determined from phase unknown samples. If only a small portion of missing/ambiguous haplotypes occur among the entire dataset, then the algorithm can temporarily discard those samples with missing/ambiguous haplotypes occurring at the marker site of interest. However, further research on technologies for resolving phases unambiguously, as well as on the improvement of HABA for addressing the ambiguity of haplotypes, is required so as to facilitate the practical use of epistasis analysis on real haplotype datasets.

In conclusion, our simulation results show that this algorithm can detect or approximate the underlying models, provided that the underlying model has reasonably high penetrance (e.g., higher than 70%) in the dataset. The prediction accuracy of this method is dependent on the penetrance rate of the underlying model. The computation cost, on the other hand, is dependent on the number of genomic regions involved for the complex phenotypic trait. This methodology will facilitate the discovery of novel associations based on the epistasis of haplotypes, an important aspect of research on complex diseases.