Introduction

Polymorphisms of the human genome underlie many genetically linked phenotypes, including the traits of complex diseases and the efficacy of medications studied in pharmacogenomics. A model depicts the association between clinical phenotypes and multiple pieces of genetic information, such as single-nucleotide polymorphisms (SNPs) in either haplotype or diplotype form, or short tandem repeats (STRs) (Cordell and Clayton 2002; Yang et al. 2003). A model may even include physical information (such as age, weight, diet, lifestyle, and state of health) and clinical information (such as biochemical measurements, or the viral type for viral infectious diseases). Once the associations between the clinical endpoints and the multi-locus genotypes are found and validated, the model could serve as the basis of new clinical prediction or prognostic methods. This paper presents a methodology for constructing models for a case–control study by using multiple SNP information in diplotype form.

The rapid advance of genotyping techniques, exemplified by the recent Affymetrix GeneChip Human Mapping 500K Array, enables association studies with extensive genes and SNPs (Rabbee and Speed 2006). The immediate hurdle is the lack of an adequate strategy for multi-locus association and model construction from the vast amount of data. A case–control study is generally used for association studies, where the cases and controls refer to two distinct clinical groups (Cordell and Clayton 2002). Statistical tests based on contingency tables (e.g., the χ² test) are commonly applied to examine the association of each screened diplotype polymorphism (Pritchard and Rosenberg 1999). Polymorphisms with p values smaller than a predefined threshold (commonly set at 0.05 and then adjusted for multiple comparisons) are declared significantly associated, implying that the polymorphism is statistically associated with, or even functionally responsible for, a particular trait of interest.

Statistical tests of association are adequate particularly for single-gene diseases. There are many such diseases in the Online Mendelian Inheritance in Man database (OMIM 2000). However, the prediction of common multifactorial diseases is greatly improved by considering multiple alleles concurrently (Yang et al. 2003). Most pharmacogenomic studies also require an adequate combination of multi-loci information, given the complex pharmacokinetic and pharmacodynamic mechanisms involved. Gene–gene interactions need to be considered; therefore, a prediction model needs to characterize the complex roles of genes which lead to the phenotype.

Hence, we propose an algorithm which evaluates a set of SNPs simultaneously for model construction. A model is represented by a Boolean expression, facilitating biological interpretations. The amount of prior assumptions underlying the model is minimized. For example, the number of SNPs in the model is automatically determined by the algorithm, based on the dataset. A variety of models can be presented in Boolean expressions which reflect various forms of gene–gene interactions.

The search space for an adequate model grows linearly with the number of samples. It grows exponentially, however, with the number of screened SNPs when all combinations of SNPs need to be enumerated and evaluated. Hence, an exhaustive search is prohibitive, even for studies on hundreds of SNPs, let alone whole-genome screening studies. To address this, a genetic algorithm is employed to systematically explore the vast choice of models described by Boolean expressions. The proposed algorithm is, thus, referred to as the genetic algorithm with Boolean algebra (GABA).

The GABA algorithm

Boolean algebra

Boolean algebra is a bivalent algebraic system (i.e., false and true, commonly shown as 0 and 1, respectively). It is used as the mathematical framework for representing the model. The defined operations of Boolean algebra include addition (+), multiplication (×), and negation (−) (Whitesitt 1995). They correspond to the operations of union, intersection, and complement in the set theory, respectively. The addition operation is also equivalent to the logical operation ‘OR’ and the multiplication operation to ‘AND’. Boolean algebra has several basic algebraic properties, such as the commutative and associative laws for addition and multiplication, and so on (Whitesitt 1995).

A model, denoted as M, comprises a chain of polymorphisms joined together by Boolean operators. To construct a model, a training dataset T containing genotypes of cases and controls is required. T is presented as a master table, a two-dimensional table in which each row represents a sample of a subject (person) and each column represents an SNP. Denoting l as the number of screened SNPs, T={SNP_k|0≤k<l}. The SNPs in the genotype dataset are commonly arranged in consecutive order according to their chromosomal positions. The assessment of an SNP is defined by a model element (m_i). On a typical biallelic SNP, a model element could, for example, represent either a recessive mode of inheritance:

$$ m_{i} :\;{\text{SNP}}_{k} = '{\text{AA}}'$$
(1)

or a dominant mode of inheritance:

$$m_{i} :\;{\text{SNP}}_{k} = '{\text{AA}} + {\text{AT}}'$$
(2)

The intersection of the model element (m_i) and its complement element (−m_i) is an empty set. Their union is the set containing all possible diplotypes in the SNP, for example, {AA, AT, TT} at a biallelic A/T locus.

Hence, a model can be denoted succinctly as M(m_i|1≤i≤n, n≤l), where n is the number of SNPs in the model. The result of the assessment on an SNP locus could be true or false. The model elements are then joined together by either a multiplicative (×) or an additive (+) operator of Boolean algebra. A possible biological interpretation of the multiplicative (×) operator is that, for example, the concurrent appearance of two diplotypes activates a particular biological pathway. A possible biological interpretation for the additive (+) operator is that, for example, the mutations on any of the two SNPs in the same gene, joined by the additive operator, could result in the malfunction of this gene. It also represents the situation when the two SNPs reside in two genes of the same biological pathway, and the malfunction of either gene inactivates the pathway.

Given the genotypes of a particular subject, the computational result of M is either 0 or 1, indicating a control or a case, respectively. One negation operator (−) could be positioned at the starting position of M, converting those originally predicted cases to controls, and vice versa.

Hence, a legitimate model with four elements is exemplified as follows:

$$ \begin{aligned} & M{\left( {m_{1} ,\;m_{2} ,\;m_{3} ,\;m_{4} } \right)} = m_{1} \times m_{2} \times m_{3} + m_{4} \\ & \quad = {\left( {{\text{SNP}}_{3} = '{\text{AA}} + {\text{AT}}'} \right)} \times {\left( {{\text{SNP}}_{5} = '{\text{CC}}'} \right)} \times {\left( {{\text{SNP}}_{7} = '{\text{CC}}'} \right)} + {\left( {{\text{SNP}}_{8} = '{\text{TC}} + {\text{TT}}'} \right)} \\ \end{aligned} $$
(3)

M is a case model: if the computational result of M is true, then the subject is predicted as a case; otherwise, the subject is predicted as a control. The complement of M, denoted as M^C, is a control model. According to De Morgan’s law in Boolean algebra (Whitesitt 1995), M^C can be exemplified as:

$$ \begin{aligned} & M^{{\text{C}}} {\left( {{\left( { - m_{1} } \right)},\;{\left( { - m_{2} } \right)},\;{\left( { - m_{3} } \right)},\;{\left( { - m_{4} } \right)}} \right)} = {\left( {{\left( { - m_{1} } \right)} + {\left( { - m_{2} } \right)} + {\left( { - m_{3} } \right)}} \right)} \times {\left( { - m_{4} } \right)} \\ & \quad = {\left( {{\left( {{\text{SNP}}_{3} = '{\text{TT}}'} \right)} + {\left( {{\text{SNP}}_{5} = '{\text{AA}} + {\text{AC}}'} \right)} + {\left( {{\text{SNP}}_{7} = '{\text{GG}} + {\text{GC}}'} \right)}} \right)} \times {\left( {{\text{SNP}}_{8} = '{\text{CC}}'} \right)} \\ \end{aligned} $$
(4)
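As an illustration only, the following Python sketch evaluates the example model of Eq. 3 on a subject's genotypes and checks by enumeration that the control model of Eq. 4 is its complement. The data layout (a dictionary from SNP index to genotype string), the operator symbols 'x' and '+', and all function names are assumptions made for this sketch and are not part of the original implementation. The multiplicative operator is assumed to bind more tightly than the additive one, as implied by the grouping in Eq. 4.

```python
from itertools import product

def norm(genotype):
    """Order-independent genotype label, e.g. 'TC' -> 'CT'."""
    return "".join(sorted(genotype))

def element(snp, *genotypes):
    """Model element m_i: an SNP index and the genotypes that make it true."""
    return (snp, frozenset(norm(g) for g in genotypes))

def evaluate(elements, operators, subject, negate=False):
    """Evaluate m_1 op m_2 op ... m_n, with 'x' binding tighter than '+'."""
    values = [norm(subject[snp]) in allowed for snp, allowed in elements]
    result, term = False, values[0]
    for op, value in zip(operators, values[1:]):
        if op == "x":
            term = term and value
        else:  # '+': close the current product term and start a new one
            result = result or term
            term = value
    result = result or term
    return (not result) if negate else result

# Eq. 3: M = m1 x m2 x m3 + m4
M_ELEMENTS = [element(3, "AA", "AT"), element(5, "CC"),
              element(7, "CC"), element(8, "TC", "TT")]
M_OPERATORS = ["x", "x", "+"]

def control_model(subject):
    """Eq. 4 written out directly: ((-m1) + (-m2) + (-m3)) x (-m4)."""
    g = {k: norm(v) for k, v in subject.items()}
    return (g[3] == "TT" or g[5] in {"AA", "AC"} or g[7] in {"CG", "GG"}) and g[8] == "CC"

# All diplotypes at the four loci (A/T, A/C, G/C, and T/C polymorphisms).
GENOTYPES = {3: ["AA", "AT", "TT"], 5: ["AA", "AC", "CC"],
             7: ["GG", "GC", "CC"], 8: ["TT", "TC", "CC"]}

complementary = all(
    evaluate(M_ELEMENTS, M_OPERATORS, dict(zip(GENOTYPES, combo)))
    != control_model(dict(zip(GENOTYPES, combo)))
    for combo in product(*GENOTYPES.values())
)
```

Here `complementary` is true only if, for every one of the 81 genotype combinations, exactly one of M and M^C holds.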

Genetic algorithm

The genetic algorithm is a modern heuristic method for solving combinatorial optimization tasks (Holland 1998; Goldberg 1989). The task of model construction may be formulated as:

$$ M_{{{\text{opt}}}} = \arg {\mathop {\max }\limits_M }F{\left( T \right)} $$
(5)

where F is the fitness score reflecting the prediction performance of the model M on the dataset T. We denote R as the case population and R^C as the control population; thus, T=R+R^C. The sensitivity of a model M on T is Pr(M|R), i.e., the probability of M being true within the case population, and the specificity is Pr(M^C|R^C), i.e., the probability of M^C being true (M being false) within the control population. In this paper, we define F as:

$$ F = {\text{Sensitivity}} + {\text{Specificity}} + {\text{Sensitivity}} \times {\text{Specificity}} $$
(6)

The sensitivity and specificity are used in the fitness function because the purpose of a model is mainly clinical prediction, where the sensitivity and specificity are two important and commonly used indexes of performance. Different clinical applications may require different weightings on the sensitivity and specificity. Although positive predictive values (Pr(R|M)) and negative predictive values (Pr(R^C|M^C)) are also clinically important, they vary with the ratio of the numbers of cases to controls and, therefore, are not used. The last term of the fitness score, i.e., ‘Sensitivity×Specificity’, enables the algorithm to favor those models with high values in both sensitivity and specificity. The optimum fitness score in this definition is three.

The fitness score F is a heuristic parameter which should be defined according to the purpose of each clinical study. The likelihood ratio (LR(M)) has been used for measuring the performance of diagnostic testing (Yang et al. 2003). It is also a function of the sensitivity and specificity:

$$ {\text{LR}}{\left( M \right)} = \frac{{\Pr {\left( {M|R} \right)}}} {{\Pr {\left( {M|R^{{\text{C}}} } \right)}}} = \frac{{{\text{Sensitivity}}}} {1 - {\text{Specificity}}}$$
(7)
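The two scores of Eqs. 6 and 7 can be sketched in Python as follows. The function names and the representation of predictions as Boolean sequences (True for a case) are assumptions made for illustration; the toy labels are hypothetical and are not data from this study.

```python
def sensitivity_specificity(predicted, actual):
    """predicted, actual: equal-length sequences of booleans, True = case."""
    on_cases = [p for p, a in zip(predicted, actual) if a]
    on_controls = [p for p, a in zip(predicted, actual) if not a]
    sensitivity = sum(on_cases) / len(on_cases)                       # Pr(M | R)
    specificity = sum(not p for p in on_controls) / len(on_controls)  # Pr(M^C | R^C)
    return sensitivity, specificity

def fitness(sensitivity, specificity):
    """Eq. 6: F = Sensitivity + Specificity + Sensitivity x Specificity."""
    return sensitivity + specificity + sensitivity * specificity

def likelihood_ratio(sensitivity, specificity):
    """Eq. 7: LR(M) = Sensitivity / (1 - Specificity)."""
    if specificity == 1.0:
        return float("inf")  # no false positives among the controls
    return sensitivity / (1.0 - specificity)

# Hypothetical toy data: two cases followed by two controls.
actual = [True, True, False, False]
predicted = [True, True, True, False]
sens, spec = sensitivity_specificity(predicted, actual)
toy_fitness = fitness(sens, spec)
toy_lr = likelihood_ratio(sens, spec)
```

A model that classifies every sample correctly has a sensitivity and specificity of 1 and therefore reaches the optimum fitness score of three.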

It is worth noting that Akaike’s information criterion (AIC) has been widely used for various model selection tasks, including biological studies and spectral analysis (Gardner 1988). AIC is solidly based on information theory, in that it measures how well the model approximates the data using the likelihood function (Burnham and Anderson 2001). It also penalizes increases in model length, i.e., the number of SNPs and their interaction terms, based on the principle of parsimony. AIC is also an adequate choice for F, where F can be defined as 1/AIC.

A parsimony constraint may be introduced to F so that a model with a smaller number of SNPs will be preferred. The parsimony constraint is to avoid over-fitting of the model to the data. However, we found in our simulation that the GABA algorithm can find the model with an adequate number of SNPs automatically, without the parsimony constraint (see section on Performance evaluation, where no parsimony constraint is used). We observed that the addition of extra SNPs to the optimum model will result in poorer performance. Hence, the parsimony constraint is not adopted in F.

A random model generator is required to initiate the computation. To generate a random model, the number of elements n, n≤l, is first randomly determined. Then, a series of SNP_k, 0≤k<l, are randomly chosen for the model elements m_i, i={1,..., n}. Each SNP_k has four possible genotype assignments in our implementation, corresponding to the dominant and recessive modes of inheritance of each allele. For example, if SNP_k is a ‘C/G’ polymorphism, then m_i will be assigned as SNP_k=‘CC’, ‘CC+CG’, ‘GG’, or ‘GG+CG’ with equal probability. The additive (+) and multiplicative (×) Boolean operators are then randomly chosen between the model elements. Finally, it is randomly determined whether or not a negation (−) operator appears in front of the entire statement.
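A minimal Python sketch of such a random model generator is given below. The dictionary layout of a model, the operator symbols 'x' and '+', and the function names are hypothetical. The sketch also assumes that the SNPs of one model are drawn without replacement and that the negation appears with probability 0.5, neither of which is specified in the text.

```python
import random

def random_element(snp, alleles, rng):
    """Pick one of the four genotype patterns of a biallelic SNP with equal probability."""
    a, b = alleles  # e.g. ('C', 'G')
    patterns = [(a + a,), (a + a, a + b), (b + b,), (b + b, a + b)]
    return (snp, rng.choice(patterns))

def random_model(alleles_by_snp, rng):
    """alleles_by_snp[k] holds the two alleles of SNP_k, 0 <= k < l."""
    l = len(alleles_by_snp)
    n = rng.randint(1, l)                    # number of model elements, n <= l
    snps = rng.sample(range(l), n)           # assumption: distinct SNPs within a model
    elements = [random_element(k, alleles_by_snp[k], rng) for k in snps]
    operators = [rng.choice("x+") for _ in range(n - 1)]
    negate = rng.random() < 0.5              # assumption: negation with probability 0.5
    return {"elements": elements, "operators": operators, "negate": negate}

rng = random.Random(0)
example_alleles = [("C", "G")] * 10          # ten hypothetical C/G polymorphisms
example_model = random_model(example_alleles, rng)
```

In this layout, `operators[j]` is the operator placed between `elements[j]` and `elements[j + 1]`, so a model with n elements always carries n−1 operators.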

The GABA algorithm employs mutation and cross-over operations for altering an existing model.

Mutation operations

Five different types of mutations are employed in the GABA algorithm: (1) element insertion, (2) element deletion, (3) element substitution, (4) operators ×/+ swap, and (5) case/control swap. The element insertion operation introduces a new random element into the model, increasing the model length by 1. The element deletion operation removes an element from the model. The element substitution operation changes the specified genotypes in a model element, for example, from SNP_k=‘CC’ to ‘CC+CG’. The operators ×/+ swap converts a multiplication (×) into an addition (+), or vice versa. This operation changes the nonlinear relationship among the model elements. For example, if this operation modifies the model M=m_1+m_2×m_3×m_4 into m_1+m_2+m_3×m_4, then the relationship between the elements is changed. Finally, the case/control swap introduces a negation operator in front of the model. If there is already a negation operator, then this operation effectively removes the original negation operator.

A mutation rate (p) is required for a mutation operation: a proportion p of the model elements is mutated and the remaining proportion (1−p) is not. The mutated elements are subject to one of the four mutation methods (1)–(4) with equal probability. In addition, the entire model is subject to mutation (5) with 50% probability.
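The five mutation types can be sketched in Python as below. This is an illustrative reading of the description rather than the original implementation: the model layout, the function names, the choice of which adjacent operator is removed or flipped, and the rule that a single-element model is never emptied are all assumptions. An inserted element may also repeat an SNP already present in the model, which the text does not address.

```python
import random

def patterns(alleles):
    """Four genotype patterns of a biallelic SNP, e.g. CC, CC+CG, GG, GG+CG."""
    a, b = alleles
    return [(a + a,), (a + a, a + b), (b + b,), (b + b, a + b)]

def mutate(model, alleles_by_snp, p, rng):
    """Return a mutated copy of model; operators[j] joins elements j and j+1."""
    elements = list(model["elements"])
    operators = list(model["operators"])
    i = 0
    while i < len(elements):
        if rng.random() < p:
            kind = rng.choice(["insert", "delete", "substitute", "swap"])
            if kind == "insert":            # (1) new random element after element i
                k = rng.randrange(len(alleles_by_snp))
                elements.insert(i + 1, (k, rng.choice(patterns(alleles_by_snp[k]))))
                operators.insert(i, rng.choice("x+"))
                i += 1                      # do not mutate the element just added
            elif kind == "delete":          # (2) drop element i and one adjacent operator
                if len(elements) > 1:       # assumption: never empty the model
                    del operators[min(i, len(operators) - 1)]
                    del elements[i]
                    continue                # the next element has shifted into slot i
            elif kind == "substitute":      # (3) different genotype pattern, same SNP
                k, old = elements[i]
                choices = [g for g in patterns(alleles_by_snp[k]) if g != old]
                elements[i] = (k, rng.choice(choices))
            elif operators:                 # (4) flip an operator next to element i
                j = min(i, len(operators) - 1)
                operators[j] = "+" if operators[j] == "x" else "x"
        i += 1
    negate = model["negate"]
    if rng.random() < 0.5:                  # (5) case/control swap
        negate = not negate
    return {"elements": elements, "operators": operators, "negate": negate}

rng = random.Random(0)
alleles = [("A", "T"), ("C", "G"), ("A", "C"), ("C", "T")]  # hypothetical SNPs
parent = {"elements": [(0, ("AA",)), (2, ("CC", "AC"))],
          "operators": ["x"], "negate": False}
child = mutate(parent, alleles, p=0.2, rng=rng)
```

Returning a copy rather than modifying the parent in place keeps the preserved models intact when they are reused as templates.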

Cross-over operations

The cross-over operation is analogous to the chromosomal recombination events occurring during meiosis. Note that chromosomal recombination is a basic concept underlying the discipline of statistical genetics, particularly in linkage analysis, association analysis, and haplotype analysis (Cardon and Bell 2001; Schaid 2004). The rationale for the cross-over operation is that, if the good performances of two models are mainly due to parts of themselves, then a cross-over operation may combine these two parts, resulting in a model which outperforms the previous two models.
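The text does not state how the two parent models are cut and joined. The Python sketch below therefore assumes a one-point cross-over: the leading part of the first parent is joined to the trailing part of the second parent by a randomly chosen operator, and the negation flag is inherited from the first parent. The model layout and function name are hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng):
    """One-point cross-over (an assumption): head of parent_a + tail of parent_b."""
    cut_a = rng.randint(1, len(parent_a["elements"]))        # keep 1..n_a elements of A
    cut_b = rng.randint(0, len(parent_b["elements"]) - 1)    # keep at least 1 element of B
    elements = parent_a["elements"][:cut_a] + parent_b["elements"][cut_b:]
    operators = (parent_a["operators"][:cut_a - 1]
                 + [rng.choice("x+")]                        # operator joining the two parts
                 + parent_b["operators"][cut_b:])
    return {"elements": elements, "operators": operators,
            "negate": parent_a["negate"]}                    # assumption: inherited from A

rng = random.Random(0)
model_a = {"elements": [(0, ("AA",)), (1, ("CC",)), (2, ("GG",))],
           "operators": ["x", "+"], "negate": False}
model_b = {"elements": [(3, ("TT",)), (4, ("AA", "AT"))],
           "operators": ["+"], "negate": True}
offspring = crossover(model_a, model_b, rng)
```

Because each part keeps its own internal operators, a well-performing product term such as m_1×m_2 can be carried over unchanged into the offspring.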

Using the defined operations of the GABA algorithm, the models with higher fitness scores are randomly mutated and crossed over with one another so as to produce various candidate models, exploring the solution space in a systematic manner. Each of these models is used to predict the samples in the training dataset. The prediction performances of the models are then evaluated by their fitness scores. Models and their elements with higher fitness scores are preserved and also serve as the templates for constructing the models in the next iteration. In this way, a group of models goes through a natural selection process. As candidate models are derived from well-performing models in the previous iteration, they comprise advantageous components inherited from their parents. This type of algorithm has been demonstrated to avoid prolonged search according to the theory of schemata (Holland 1998).

The algorithm

The GABA algorithm is designed to select n SNPs, from a pool of l SNPs, for building a model. It is summarized as follows:

  1. Randomly generate a series of Boolean expressions as the set of candidate models, denoted as S.

  2. Use each candidate model in S to predict the samples in T.

  3. The fitness score of each model is calculated. Those models with better performances are defined as the set of preserved models S_p. The rest of the models, S_d=S−S_p, will be discarded in the next iteration.

  4. The preserved models in S_p are used as templates for producing a new set of candidate models S_d′. Each model of S_d′ is generated using one of the four methods:

     (a) Randomly select one model from S_p and then apply the mutation operation

     (b) Randomly select two different models from S_p and then proceed to apply a cross-over operation on them

     (c) Randomly select one model from S_p and have it crossed-over with a randomly generated model

     (d) Produce a new model using the random model generator

     The four methods are selected randomly with equal opportunities. A new set of candidate models S is thus produced where S=S_p+S_d′.

  5. Steps 2–4 are iterated until the optimum fitness score is achieved, or a maximum number of iterations (a user-defined value) is reached where the model with the highest fitness score stays unchanged.

The numbers of models in S, S_p, and S_d are heuristically determined by the user. They are constant in all iterations throughout the computation.
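The five steps above can be sketched as a generic Python loop. The function `gaba_search` and its parameters are hypothetical, and the model-specific operations are passed in as callables so that the sketch stays self-contained. The stopping rule is simplified to "optimum reached or a fixed number of iterations"; the variant in which the best model must stay unchanged is not reproduced. The toy problem at the end (bit tuples instead of Boolean SNP models) only demonstrates the control flow.

```python
import random

def gaba_search(random_model, mutate, crossover, fitness, rng,
                n_models=300, n_preserved=90, max_iter=1000, optimum=3.0):
    population = [random_model(rng) for _ in range(n_models)]       # step 1
    for _ in range(max_iter):
        ranked = sorted(population, key=fitness, reverse=True)      # steps 2 and 3
        preserved = ranked[:n_preserved]                            # S_p
        if fitness(preserved[0]) >= optimum:                        # step 5 (simplified)
            break
        offspring = []                                              # S_d'
        while len(offspring) < n_models - n_preserved:              # step 4
            method = rng.choice("abcd")                             # equal opportunities
            if method == "a":
                child = mutate(rng.choice(preserved), rng)
            elif method == "b":
                first, second = rng.sample(preserved, 2)            # two different models
                child = crossover(first, second, rng)
            elif method == "c":
                child = crossover(rng.choice(preserved), random_model(rng), rng)
            else:
                child = random_model(rng)
            offspring.append(child)
        population = preserved + offspring                          # S = S_p + S_d'
    return max(population, key=fitness)

# Toy stand-ins (hypothetical): a "model" is a tuple of 6 bits and the
# fitness is three times the fraction of ones, so the optimum is 3.0.
def toy_random(rng):
    return tuple(rng.randint(0, 1) for _ in range(6))

def toy_mutate(model, rng):
    i = rng.randrange(len(model))
    return model[:i] + (1 - model[i],) + model[i + 1:]

def toy_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def toy_fitness(model):
    return 3.0 * sum(model) / len(model)

best = gaba_search(toy_random, toy_mutate, toy_crossover, toy_fitness,
                   random.Random(0), n_models=30, n_preserved=9, max_iter=500)
```

Because the preserved models are carried over unchanged, the highest fitness score in the population never decreases from one iteration to the next.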

The search space for an adequate model grows linearly with the number of samples, yet exponentially with the number of screened SNPs. Hence, increasing the sample size is encouraged, as it will enhance the probability of detecting an adequate model at a reasonable expense of time. However, as the number of SNPs increases, more computation is expected, even for a heuristic search method such as the GABA algorithm.
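To give a feeling for this growth, the following Python sketch counts candidate models under simplifying assumptions that are ours and not part of the original analysis: n distinct SNPs are chosen from l, each element takes one of four genotype patterns, each of the n−1 operators is × or +, the negation is present or absent, and the order of the elements is ignored. The function name is hypothetical and the result is only a rough count.

```python
from math import comb

def count_models(l, n):
    """Rough number of n-element models from l SNPs (element order ignored)."""
    subsets = comb(l, n)          # which SNPs enter the model
    genotype_choices = 4 ** n     # four genotype patterns per element
    operator_choices = 2 ** (n - 1)
    negation_choices = 2
    return subsets * genotype_choices * operator_choices * negation_choices

# Growth of the count with model length for l = 50 screened SNPs.
counts_for_50_snps = {n: count_models(50, n) for n in range(1, 6)}
```

Under these assumptions, a five-element model drawn from 50 SNPs already has about 6.9×10^10 candidates, whereas evaluating one candidate costs only one pass over the samples.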

Comparison with other methods

Logistic regression is commonly used for the linear combination of multiple genotypes and their interactions (e.g., Cordell and Clayton 2002). However, this method usually requires an enumeration of various interaction terms, which grows rapidly as l increases. A model must reflect the underlying disease etiology or biological mechanism so as to achieve accurate prediction (Yang et al. 2003). Since the order of interaction is unknown, high-order interactions need to be considered, which will complicate the computation. In comparison, the GABA algorithm detects the order of interactions automatically. The relationship between two adjacent SNPs could be additive or multiplicative, depending on the dataset.

The multifactor dimensionality reduction (MDR) method has been proposed for the detection of high-order interactions among loci (Ritchie et al. 2001). It has been successfully used for the identification of interactions among four SNPs in the estrogen metabolism genes associated with sporadic breast cancer (Ritchie et al. 2001), as well as the three-locus epistasis model for atrial fibrillation (Tsai et al. 2004). Similar to logistic regression, the MDR method has to enumerate all combinations of genotypes. The MDR method has the following weaknesses: (1) it requires a dataset where the numbers of cases and controls are close to 1:1; (2) an MDR model is presented in a lookup table style (compare with Ritchie et al. 2001, p 144, Fig. 2), which makes its biological interpretation difficult. It also becomes difficult to show the model in 2D lookup tables if more than five SNPs are involved. Note that the MDR model is analogous to the truth table in Boolean algebra, and the Karnaugh map is a useful tool to convert a truth table into a Boolean expression (Whitesitt 1995). The GABA algorithm, on the other hand, presents the model in a Boolean statement, which is easier to comprehend, facilitating future biological investigations.

Performance evaluation

The GABA algorithm is tested on simulated genotype data, as well as real genotypes from chronic hepatitis C patients treated with interferon-combined therapy. In our implementation, the mutation rate (p) is set at 20%. S contains 300 models; S_p contains 90 models and S_d contains 210 models.

The first test is to examine the performance of the GABA algorithm using the simulated dataset. We employ datasets with 50 SNPs (i.e., l=50) and 400 samples (200 cases and 200 controls), a data size commonly used for a typical small project. The SNPs in the dataset are in Hardy–Weinberg equilibrium (where the genotype distribution is 1:2:1) and are also in linkage equilibrium.
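One way to produce genotypes with these properties is sketched below in Python; the function name, the A/T alleles, and the seed are hypothetical, and the sketch is not the simulation code used in this study. Drawing the two alleles of a genotype independently with a frequency of 0.5 gives the 1:2:1 genotype distribution of Hardy–Weinberg equilibrium, and drawing each SNP independently of the others gives linkage equilibrium. The assignment of case and control labels through an embedded model is not shown.

```python
import random
from collections import Counter

def simulate_genotypes(n_samples, n_snps, alleles=("A", "T"), freq=0.5, rng=None):
    """Return n_samples rows of n_snps genotypes such as 'AA', 'AT', or 'TT'."""
    rng = rng or random.Random()

    def draw_allele():
        return alleles[0] if rng.random() < freq else alleles[1]

    # Two independent allele draws per genotype; sorting merges 'TA' into 'AT'.
    return [["".join(sorted(draw_allele() + draw_allele())) for _ in range(n_snps)]
            for _ in range(n_samples)]

data = simulate_genotypes(400, 50, rng=random.Random(1))
genotype_counts = Counter(g for row in data for g in row)
```

With 400 samples and 50 SNPs, roughly half of the 20,000 simulated genotypes are expected to be heterozygous.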

Five distinct simulation datasets are produced. These datasets are embedded with models 1–5, representing a single-locus model, as well as 2-, 3-, 4-, and 5-locus interaction models (Table 1). These models are generated randomly. We can, therefore, evaluate whether the embedded models can be detected by the GABA and MDR methods. We tested the GABA algorithm three times (i.e., tests 1–3) on each dataset, and found that the algorithm always identified the embedded model accurately. The numbers of iterations spent for finding the embedded models are presented in Table 1. The average number of iterations increases when the embedded model involves more SNPs (Fig. 1).

Table 1 The models used for the simulation, as well as the number of iterations computed for each test of the GABA algorithm. The last column presents the cross-validation consistency (CVC) of the multifactor dimensionality reduction (MDR) method
Fig. 1 The average numbers of iterations computed for detecting the embedded models 1–5

The cases:controls ratio of the datasets is 1:1, fulfilling the assumption of the MDR method. Hence, the MDR method is also tested on models 1–5 for comparison purposes. The open-source MDR software v1.0.0rc1, downloaded from SourceForge.net, is used. The default setting of the MDR software is adopted, except that the attribute count range (i.e., the range of n in the MDR model) is changed from 1:4 to 1:5 for testing model 5. We found that the MDR method can also detect models equivalent to models 1–5 successfully. To illustrate the lookup table format of MDR models, the MDR result for model 3 is presented in Fig. 2. It is not difficult to envision that MDR models involving more SNPs are very difficult to present and interpret.

Fig. 2 The detected MDR model equivalent to model 3

The MDR method detected the five SNPs of model 5 correctly, with a cross-validation consistency of 0.9 (Table 1). In one of the cross-validation tests, the MDR method detected SNP0 instead of SNP43. The orders of appearance of SNPs in the various GABA tests for model 5 are summarized in Table 2. The computational process of test 1 is presented in terms of the highest fitness score (Fig. 3) and the number of SNPs detected (Fig. 4). The fitness score increases with the number of iterations, showing a continuous improvement. The length of the model varies between 2 and 5. The number of correctly detected SNPs increases gradually. At iteration 14,000, four SNPs have been correctly detected, which results in a fitness score of 2.99. The performance does not change until the fifth SNP is incorporated at iteration 88,182.

Table 2 The order of SNPs appearing in the model construction process of model 5
Fig. 3 The performance, shown as the fitness score, is improved gradually after more iterations during the detection of a five-SNP model. The fitness score for a correct model is 3.0, representing 100% sensitivity and specificity

Fig. 4 The number of SNPs per iteration during the detection of model 5. The dashed line is the number of SNPs in the model and the solid line represents the number of correct SNPs compared with model 5. It shows that the entire set of five SNPs is correctly detected after 88,182 iterations

The GABA algorithm is then applied to the construction of a prediction model for the interferon-combined therapy. Interferon-α combined with ribavirin is a standard treatment for patients with chronic hepatitis C virus (HCV) infection. The training dataset comprises genotypes of 381 chronic hepatitis C patients. These patients are from National Taiwan University Hospital, Kaohsiung Medical University Hospital, Kaohsiung Municipal Hsiaokang Hospital, and Tri-Service General Hospital in Taiwan, and the samples were collected between 2002 and 2004. Informed consent and the medical records related to the history of the disease were collected for each subject. All patients had received interferon-α (3–6 MU/dosage, three times per week) and ribavirin (1,000–1,200 mg/day) for 6 months and were then followed up for 6 months after the termination of treatment. Patients with concurrent hepatitis B or D infection were excluded from the study. The response to the treatment is determined by the detection of serum HCV RNA at the end of the follow-up period. Among the 381 patients, 243 are clinically diagnosed as responders (i.e., no HCV RNA detected) and 138 as non-responders.

The training dataset comprises 24 SNPs of eight genes involved in interferon signaling and immuno-modulating pathways (Table 3). The SNPs were genotyped using either direct sequencing or TaqMan methods. Direct sequencing was conducted with the ABI Prism 3700 instruments (Applied Biosystems) and the data were analyzed using the Phred and PolyPhred programs (Ewing et al. 1998; Nickerson et al. 1997). The TaqMan method was carried out on an ABI Prism 7900 instrument and genotypes were called using the SDS software (Applied Biosystems) supplemented with manual curation. The gene symbols were provided according to the HUGO gene nomenclature committee (Povey et al. 2001).

The eight genes of our study are hypothesized to influence the treatment efficacy of interferon-combined therapy (Hwang et al. 2006). ADAR is induced by interferon alpha or gamma for its antiviral effect. ICSBP1 is a member of the interferon regulatory factor family. It is a negative regulator of the interferon-stimulated response element (ISRE). IFI44 has an interferon-stimulated response element in its promoter region. TAP is a transporter involved in the transfer of antigenic peptides into the endoplasmic reticulum. TGFBRAP1 binds to Smad4 protein, which is involved in many signaling pathways. CASP5 has a central role in apoptosis. PIK3CG is the phosphoinositide-3-kinase, catalytic, gamma polypeptide. FGFs have mitogenic and cell survival activities and are involved in liver organogenesis. During the progression of chronic hepatitis C (CHC), the level of FGF is elevated.

Prior to model construction, the 24 SNPs are assessed individually for differences of allele and genotype frequencies between the case and control groups using standard χ² statistics for contingency tables (Schlesselman 1982). The level of significance is set at 0.05. The genotype comparison employs a 3×2 contingency table, comparing three diplotypes at two conditions (i.e., responders and non-responders). Since 24 SNPs are assessed simultaneously, the issue of multiple comparisons is considered and the threshold on the p value (after Bonferroni correction) is 0.0021. None of the SNPs could be declared as significant according to the allele and genotype tests (Table 3).
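For illustration, the genotypic test and the Bonferroni threshold can be sketched in Python as follows. The function name and the genotype counts are hypothetical and are not data from this study. A 3×2 table has (3−1)×(2−1)=2 degrees of freedom, for which the p value of the χ² statistic has the closed form exp(−χ²/2); the sketch assumes that no expected cell count is zero.

```python
from math import exp

def chi_square_3x2(table):
    """table: three genotype rows x two columns (responders, non-responders)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = sum((observed - r * c / total) ** 2 / (r * c / total)
                    for row, r in zip(table, row_totals)
                    for observed, c in zip(row, col_totals))
    p_value = exp(-statistic / 2.0)   # exact for 2 degrees of freedom
    return statistic, p_value

# Bonferroni correction for 24 SNPs at a significance level of 0.05.
bonferroni_threshold = 0.05 / 24

# Hypothetical genotype counts, for illustration only.
example_table = [[30, 20],
                 [50, 50],
                 [20, 30]]
statistic, p_value = chi_square_3x2(example_table)
is_significant = p_value < bonferroni_threshold
```

The corrected threshold 0.05/24 rounds to the value of 0.0021 quoted above.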

Table 3 The 24 SNPs and their association test results, including the allelic comparison, the genotypic comparison of three genotypes (GG/GC/CC), and the Hardy–Weinberg test, shown in p values

The GABA algorithm detects a model comprising eight SNPs in five genes: ADAR, IFI44, ICSBP1, PIK3CG, and CASP5. The non-responders are identified if the following statement is true:

$$ \begin{aligned} & {\left( {{\text{SNP}}_{7} = '{\text{CC}}/{\text{TC}}'} \right)} \times {\left( {{\text{SNP}}_{{12}} = '{\text{GG}}'} \right)} \times {\left( {{\text{SNP}}_{{14}} = '{\text{AA}}/{\text{AC}}'} \right)} \times {\left( {{\text{SNP}}_{{20}} = '{\text{CC}}/{\text{CT}}'} \right)} \\ & \quad + {\left( {{\text{SNP}}_{6} = '{\text{CC}}'} \right)} \times {\left( {{\text{SNP}}_{9} = '{\text{GG}}/{\text{GT}}'} \right)} \times {\left( {{\text{SNP}}_{{11}} = '{\text{AA}}/{\text{AG}}'} \right)} \times {\left( {{\text{SNP}}_{{16}} = '{\text{AA}}/{\text{AG}}'} \right)} \\ \end{aligned} $$
(8)

According to De Morgan’s theorem (Whitesitt 1995), the responders are identified if the following statement is true:

$$ \begin{aligned} & \left[ {\left( {{\text{SNP}}_{7} = '{\text{TT}}'} \right)} + {\left( {{\text{SNP}}_{{12}} = '{\text{AA}}/{\text{AG}}'} \right)} + {\left( {{\text{SNP}}_{{14}} = '{\text{CC}}'} \right)} + {\left( {{\text{SNP}}_{{20}} = '{\text{TT}}'} \right)} \right] \\ & \quad \times \left[ {\left( {{\text{SNP}}_{6} = '{\text{AA}}/{\text{AC}}'} \right)} + {\left( {{\text{SNP}}_{9} = '{\text{TT}}'} \right)} + {\left( {{\text{SNP}}_{{11}} = '{\text{GG}}'} \right)} + {\left( {{\text{SNP}}_{{16}} = '{\text{GG}}'} \right)} \right] \\ \end{aligned} $$
(9)

When the eight SNPs are combined for the classification of patients, the prediction result is as shown in Table 4. The p value of the chi-square test of Table 4 is 1.79×10⁻⁷, showing a strong association in the training dataset. The performance indexes are as follows: the sensitivity is 62.1%, the specificity is 70.0%, the positive predictive value (PPV) is 79.0%, and the negative predictive value (NPV) is 50.7%.
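The performance indexes and the chi-square test of a 2×2 prediction table can be computed as in the Python sketch below. The function names and the cell counts are hypothetical and are not the entries of Table 4. The sketch uses the Pearson statistic without continuity correction, which is an assumption, since the text does not state which variant was applied. For one degree of freedom, the p value equals erfc(√(χ²/2)).

```python
from math import erfc, sqrt

def performance_indexes(tp, fn, fp, tn):
    """tp/fn: cases predicted as case/control; fp/tn: controls predicted as case/control."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def chi_square_2x2(tp, fn, fp, tn):
    """Pearson chi-square test of a 2x2 table (no continuity correction)."""
    n = tp + fn + fp + tn
    statistic = (n * (tp * tn - fn * fp) ** 2
                 / ((tp + fn) * (fp + tn) * (tp + fp) * (fn + tn)))
    p_value = erfc(sqrt(statistic / 2.0))   # exact for 1 degree of freedom
    return statistic, p_value

# Hypothetical counts, for illustration only.
indexes = performance_indexes(tp=60, fn=40, fp=30, tn=70)
statistic_2x2, p_value_2x2 = chi_square_2x2(tp=60, fn=40, fp=30, tn=70)
```

Unlike the sensitivity and specificity, the PPV and NPV in this sketch change when the ratio of cases to controls changes, which is why they are reported but not used in the fitness score.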

Table 4 The prediction performance of the model on the training dataset (381 patients). Only 283 patients are shown in the table. The other 98 patients have missing data in the genotypes, and, thus, were not predictable

A prediction model needs to be validated using an independent dataset, from either prospective or retrospective studies, so as to demonstrate its capability of prediction. The above model is, thus, validated using samples of 159 persons (121 responders and 38 non-responders). These samples were collected during the years 2004–2005. The validation result is shown in Table 5, where the sensitivity is 54.7%, the specificity is 71.4%, the PPV is 86.4%, and the NPV is 32.1%. The p value of the chi-square test of Table 5 is 0.0067, which is below the 0.05 significance level and indicates that the association also holds in the validation dataset. The similarity of the sensitivity and specificity values between the training and validation datasets enhances our confidence that the model has a certain degree of consistency for predicting the efficacy of interferon-combined treatment.

Table 5 The model derived from the training dataset is then used to predict the samples in a validation dataset (159 patients). Only 152 patients are shown in the table. The other seven patients have missing data in the genotypes, and, thus, were not predictable

Conclusions

The genetic algorithm with Boolean algebra (GABA) systematically investigates multiple single-nucleotide polymorphisms (SNPs) and their adequate combinations for predicting phenotypic traits in studies of complex diseases or pharmacogenomics. The GABA algorithm shows promising capabilities in deriving a model from a large pool of SNP genotypes. This is demonstrated by experiments on the simulated datasets, as well as a real dataset of interferon-combined treatment. A Boolean expression model detected by the GABA algorithm is easily comprehensible, interpretable, and examinable by physicians and scientists. This is a merit of the GABA algorithm compared with other methods, such as multifactor dimensionality reduction (MDR) or logistic regression.

Although we use SNP diplotypes to demonstrate the algorithm, the GABA methodology should be able to incorporate haplotypes and other physical information, provided that this information can be represented adequately as model elements m_i. This remains our future research direction.