Introduction

Genetic association analyses need to address genetic and phenotypic heterogeneities for complex diseases. Investigating associations between marker genotypes and disease phenotypes for only one locus at a time without considering combinations of (unlinked) loci may capture only a small proportion of the total combined effect of all disease loci. Thus new methods are needed that allow the joint analysis of multiple loci (Hoh and Ott 2003; Kang et al. 2008). For a complex disease, individual loci with small effects may not be sufficient to identify a genetic association with a clinical syndrome (Ritchie et al. 2001). However, the combined effect of multiple loci with minor or modest effect sizes might confer additive or multiplicative genetic contributions (Ritchie et al. 2001). Genotypic combinations at some loci that contain several alleles may be specific to certain classes of cases. Multiple loci with minor or modest effect sizes that are located on different or the same chromosomes could form a joint functional unit for disease susceptibility. The problem is to identify these loci and quantify the interaction among them. Several algorithms are available to examine SNP combinations for complex diseases (Ritchie et al. 2001, 2003; Goodman et al. 2006; Onay et al. 2006; Hahn et al. 2003). These methods include dimension reduction, which combines all possible SNP–SNP interactions and chooses the set of SNPs that minimizes the classification error of cases and controls (Ritchie et al. 2003; Hahn et al. 2003). Some analyses use classification scoring functions to identify subsets of SNPs likely associated with disease risk (Goodman et al. 2006). Multivariate logistic regression and bootstrap analyses can be used to select SNP–SNP interactions via stepwise regression (Onay et al. 2006).

We use a two-stage approach that identifies multiple SNP patterns and evaluates their association with disease risk. The method allows the inclusion of any number of main effects together with the highest-order interaction term. At the first stage, synergistic blocks associated with a complex disease are identified. At the second stage, a logistic regression model is chosen to evaluate the interactive effects among different loci in a synergistic block. Because they involve fewer parameters, logistic regression models restricted to the SNPs in a synergistic block have better statistical power than a full model that includes all SNPs and their interactions.

This method is applied to a case–control schizophrenia study from a Chinese Han population to detect the effects of 17 loci of four candidate genes, the regulator of G-protein signaling-4 (RGS4, 1q21-q22), frizzled 3 (FZD3, 8p21), neuregulin 1 (NRG1, 8p22-p11), and G72 (13q34), on the susceptibility to this disease, since these have been reported in other studies to be possible candidate genes for schizophrenia (Yue et al. 2006, 2007; Zhang et al. 2004; Harrison and Owen 2003).

Harrison and Weinberger (2005) proposed that schizophrenia might be a genetic disorder of the synapses. There may be a putative common effect of schizophrenia susceptibility genes on the plasticity and functioning of synapses and other neurodevelopmental processes. We hypothesized that NRG1, G72, RGS4 and FZD3 may play a common role in the pathogenesis of schizophrenia through neurodevelopment and plasticity.

Methods

Identifying SNP combination patterns associated with the phenotype

We assume n marker loci with two alleles each, all of which are in Hardy–Weinberg equilibrium (HWE). The genotypes for one single nucleotide polymorphism (SNP) marker are coded by 0, 1, or 2. Denote a combination pattern of n SNPs (an SNP pattern) as an n-dimensional genotype vector \( G = (a_{1} a_{2} \ldots a_{n} ), \) where n is the number of marker loci, and \( a_{i} \) is the genotype of the ith SNP position. Denote \( \pi ^{A} (G) \) and \( \pi ^{U} (G) \) as the frequencies of the SNP pattern G in affected cases and unaffected controls, respectively. Then, the odds ratio of SNP pattern G (\( {\text{OR}}_{G} \)) is the ratio of the odds of SNP pattern G in the cases to that of SNP pattern G in the controls, i.e.,

$$ {\text{OR}}_{G} = {{\frac{{\pi ^{A} (G)}}{{1 - \pi ^{A} (G)}}} \mathord{\left/ {\vphantom {{\frac{{\pi ^{A} (G)}}{{1 - \pi ^{A} (G)}}} {\frac{{\pi ^{U} (G)}}{{1 - \pi ^{U} (G)}}}}} \right. \kern-\nulldelimiterspace} {\frac{{\pi ^{U} (G)}}{{1 - \pi ^{U} (G)}}}}, $$

when \( {\text{OR}}_{G} > 1 \), the SNP pattern G may be positively associated with the disease; when \( 0 < {\text{OR}}_{G} < 1 \), the SNP pattern G may be negatively associated with it; when \( {\text{OR}}_{G} \) is very close to 1, there is no evidence that the SNP pattern G is associated with the disease.

The hypotheses used to test whether SNP pattern G is associated with a disease are defined as follows:

$$ H_{0} :\ln OR_{G} = 0,\quad H_{1} :\ln OR_{G} \ne 0. $$

When the null hypothesis \( H_{0} \) is rejected, we can conclude that there is evidence for the association of G with the disease.

Let \( N^{A} \) and \( N^{U} \) be the numbers of cases and controls; \( N^{A} (G) \) and \( N^{U} (G) \) denote the number of subjects with the SNP pattern G among the cases and controls, respectively. Then, the log transform of the sample odds ratio, \( \displaystyle \ln \mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G} \) is given by

$$\displaystyle \ln \mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G} = \ln \left[ {{{\frac{{N^{A} (G)}}{{N^{A} - N^{A} (G)}}} \mathord{\left/ {\vphantom {{\frac{{N^{A} (G)}}{{N^{A} - N^{A} (G)}}} {\frac{{N^{U} (G)}}{{N^{U} - N^{U} (G)}}}}} \right. \kern-\nulldelimiterspace} {\frac{{N^{U} (G)}}{{N^{U} - N^{U} (G)}}}}} \right]. $$

This random variable has a large-sample approximate normal distribution with a mean of \( \ln {\text{OR}}_{G} \) and a standard deviation, referred to as the asymptotic standard error (ASE) (Agresti 1996), of

$$ \displaystyle {\text{ASE}}(\ln \mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G} ) = \sqrt {\frac{1}{{N^{A} (G)}} + \frac{1}{{N^{A} - N^{A} (G)}} + \frac{1}{{N^{U} (G)}} + \frac{1}{{N^{U} - N^{U} (G)}}} . $$

Therefore, the statistic

$$ Z(G) = \frac{{\ln {\displaystyle \mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G}} }}{{{\text{ASE}}(\ln {\displaystyle \mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G}})}} $$

can be used to test the above hypothesis. If \( \left| {Z(G)} \right| > u_{{\alpha /2}} , \) (where \( u_{{\alpha /2}} \) is the upper α/2 quantile of the standard normal distribution, i.e., the value exceeded with probability α/2) then \( H_{0} \) is rejected.
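The test above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the counts in the usage note are hypothetical, and the critical value is taken as the upper α/2 point of the standard normal distribution.

```python
import math
from statistics import NormalDist


def z_statistic(n_case_g, n_case, n_ctrl_g, n_ctrl):
    """Return (ln OR, ASE, Z) for one SNP pattern G from its 2x2 counts.

    n_case_g / n_ctrl_g: subjects carrying G among cases / controls.
    n_case / n_ctrl: total numbers of cases / controls.
    All four cells of the table must be non-zero here.
    """
    a = n_case_g               # cases with G
    b = n_case - n_case_g      # cases without G
    c = n_ctrl_g               # controls with G
    d = n_ctrl - n_ctrl_g      # controls without G
    log_or = math.log((a / b) / (c / d))
    ase = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, ase, log_or / ase


def reject_h0(z, alpha=0.05):
    """Two-sided test: reject H0 when |Z| exceeds the upper alpha/2 normal quantile."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical
```

For example, with hypothetical counts of 30 carriers among 120 cases and 20 carriers among 225 controls, `z_statistic(30, 120, 20, 225)` gives a sample odds ratio of 41/12 and a Z value that `reject_h0` flags as significant at α = 0.05.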

When the number of candidate SNPs is very large, a genetic algorithm (GA) (Goldberg 1989) can be applied to elucidate the associated SNP patterns quickly. In a GA, we examine every SNP pattern G as a candidate solution to the problem of associated SNP patterns. The fitness of an SNP pattern is defined as

$$ {\text{fitness}}_{G} = \left| {Z(G)} \right| = \left| {\frac{{\ln {\displaystyle \mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G}} }}{{{\text{ASE}}(\ln{\displaystyle \mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G} })}}} \right|. $$

The sample odds ratio \(\displaystyle \left( {\mathop {{\text{OR}}}^{ \wedge }\!\!\ _{G} } \right) \) of SNP pattern G is 0 (or ∞) if \( N^{A} (G) = 0 \) (or \( N^{U} (G) = 0 \)). The slightly amended estimator can be expressed as (Agresti 1996)

$$ {\text{fitness}}_{G}^{A} = \left| {\frac{{\ln \left[ {{{\frac{{N^{A} (G) + 0.5}}{{N^{A} - N^{A} (G) + 0.5}}} \mathord{\left/ {\vphantom {{\frac{{N^{A} (G) + 0.5}}{{N^{A} - N^{A} (G) + 0.5}}} {\frac{{N^{U} (G) + 0.5}}{{N^{U} - N^{U} (G) + 0.5}}}}} \right. \kern-\nulldelimiterspace} {\frac{{N^{U} (G) + 0.5}}{{N^{U} - N^{U} (G) + 0.5}}}}} \right]}}{{\sqrt {\frac{1}{{N^{A} (G) + 0.5}} + \frac{1}{{N^{A} - N^{A} (G) + 0.5}} + \frac{1}{{N^{U} (G) + 0.5}} + \frac{1}{{N^{U} - N^{U} (G) + 0.5}}} }}} \right|. $$
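As a hedged illustration, the amended fitness can be computed as follows. The function name is hypothetical; the only substantive step is adding 0.5 to each of the four cell counts, as in the formula above, so that the value stays finite when a pattern is absent from the cases or the controls.

```python
import math


def amended_fitness(n_case_g, n_case, n_ctrl_g, n_ctrl):
    """|Z(G)| with 0.5 added to every cell of the 2x2 table."""
    a = n_case_g + 0.5
    b = n_case - n_case_g + 0.5
    c = n_ctrl_g + 0.5
    d = n_ctrl - n_ctrl_g + 0.5
    log_or = math.log((a / b) / (c / d))
    ase = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return abs(log_or / ase)
```

A pattern seen in no cases, such as `amended_fitness(0, 120, 10, 225)`, now yields a finite positive fitness instead of a division by zero or an infinite log odds ratio.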

The parameters in the genetic algorithm are population sizes, crossover probabilities and mutation probabilities (Goldberg 1989). By assigning different values to the parameters, we can get different SNP patterns that are significantly associated with the disease.

Patterns associated with clustering

Cluster analysis is used to group similar SNP patterns associated with the disease. An SNP synergistic block is the loci in the SNP pattern cluster.

The clustering of SNP patterns is based on a similarity measure (Duda and Schafer 2001). A matched vector of the associated SNP pattern G may be denoted by \( Q^{G} = [q_{1} ,q_{2} , \ldots ,q_{N} ]^{\prime}, \) where \( q_{i} = 1 \) if sample i has the associated SNP pattern G and \( q_{i} = 0 \) otherwise, for 1 ≤ i ≤ N, with \( N = N^{A} + N^{U} \). The similarity distance \( d(G_{1} ,G_{2} ) \) between the associated SNP patterns \( G_{1} \) and \( G_{2} \) can be defined as

$$ d(G_{1} ,G_{2} ) = 1 - \left[ {\frac{{(Q^{{G_{1} }} )^{T} Q^{{G_{2} }} }}{{\left\| {Q^{{G_{1} }} } \right\|_{2}^{{1/2}} \left\| {Q^{{G_{2} }} } \right\|_{2}^{{1/2}} }}} \right]^{2} , $$

(where \( \left\| a \right\|_{2} \) represents the 2-norm of vector a), which is in fact the cosine measure between points. This distance measure is used to investigate the clustering of SNP patterns.
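The following sketch shows one way to compute this distance from two matched vectors. It assumes, following the description of the measure as a cosine, that the bracketed ratio is the ordinary cosine similarity (each vector normalized by its Euclidean length); the function name is hypothetical.

```python
import math


def pattern_distance(q1, q2):
    """1 - cos^2 between two 0/1 matched vectors of equal length.

    Assumption: the bracketed term in the distance is the usual cosine
    similarity, i.e. the dot product divided by both Euclidean norms.
    """
    if len(q1) != len(q2):
        raise ValueError("matched vectors must cover the same samples")
    dot = sum(x * y for x, y in zip(q1, q2))
    norm1 = math.sqrt(sum(x * x for x in q1))
    norm2 = math.sqrt(sum(y * y for y in q2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("each pattern must occur in at least one sample")
    cosine = dot / (norm1 * norm2)
    return 1 - cosine ** 2
```

Under this reading, two patterns carried by exactly the same samples are at distance 0, and two patterns that never co-occur in any sample are at distance 1.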

Grouping loci/SNPs into synergistic blocks

Based on the previous step, we cluster the SNP patterns into several SNP pattern clusters that include similar associated SNP patterns. Let R denote one SNP pattern cluster and \( N_{R} \) denote the number of associated SNP patterns in R. The set of loci considered in R is referred to as γ(R). Let \( G^{i} = \left( {a_{1}^{i} a_{2}^{i} \ldots a_{n}^{i} } \right) \) denote the ith associated SNP pattern in R, where \( 1 \le i \le N_{R} \). Then, the difference between the ith and the jth SNP pattern can be defined as \( G^{i} - G^{j} = \left( {a_{1}^{i} - a_{1}^{j} ,\; a_{2}^{i} - a_{2}^{j} ,\; \cdots ,\; a_{n}^{i} - a_{n}^{j} } \right), \) where, for each locus \( k \in \{ 1,2, \ldots ,n\} , \) we get \( a_{k}^{i} - a_{k}^{j} = \left\{ \begin{gathered} 1,\; a_{k}^{i} \ne a_{k}^{j} ; \hfill \\ 0,\; a_{k}^{i} = a_{k}^{j} . \hfill \\ \end{gathered} \right. \) The diversity of loci considered between the ith and jth SNP patterns is expressed as

$$ D_{{ij}} (W) = \left( {G^{i} - G^{j} } \right)W\left( {G^{i} - G^{j} } \right)^{\prime } , $$

where \( W = (w_{{ij}} ) \) is a weight matrix that can take any of several forms, depending on the intended use of the ancillary information. The sum of \( D_{{ij}} (W) \) can be used to describe the diversity of the loci considered in an SNP pattern cluster. In particular, the diversity of a block B can be defined as

$$ D_{B} = \frac{1}{{2N_{R} }}\sum\limits_{{i = 1}}^{{N_{R} }} {\sum\limits_{{j = 1}}^{{N_{R} }} {D_{{ij}} (W_{B} ) = \frac{1}{{2N_{R} }}\sum\limits_{{i = 1}}^{{N_{R} }} {\sum\limits_{{j = 1}}^{{N_{R} }} {\left( {G^{i} - G^{j} } \right)W_{B} \left( {G^{i} - G^{j} } \right)^{\prime } } } ,} } $$

where \( W_{B} = (w_{{ij}} ) \) with \( w_{{ij}} = \left\{ \begin{gathered} 1,\; i = j{\text{ and }}i \in \gamma (B); \hfill \\ 0,\; {\text{otherwise,}} \hfill \\ \end{gathered} \right. \) where γ(B) is the set of loci considered in block B.

Similarly, the diversity of all loci in γ(R) is defined as

$$ D_{R} = \frac{1}{{2N_{R} }}\sum\limits_{{i = 1}}^{{N_{R} }} {\sum\limits_{{j = 1}}^{{N_{R} }} {D_{{ij}} (W_{R} ) = \frac{1}{{2N_{R} }}\sum\limits_{{i = 1}}^{{N_{R} }} {\sum\limits_{{j = 1}}^{{N_{R} }} {\left( {G^{i} - G^{j} } \right)W_{R} \left( {G^{i} - G^{j} } \right)^{\prime } } } ,} } $$

where \( W_{R} = (w_{{ij}} ) \) with \( w_{{ij}} = \left\{ \begin{gathered} 1,\quad i = j{\text{ and }}i \in \gamma (R); \hfill \\ 0,\quad {\text{otherwise}} .\hfill \\ \end{gathered} \right. \)
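Because the weight matrices here are diagonal with 0/1 entries and each component of the difference vector is itself 0 or 1, the quadratic form simply counts the loci in the chosen set at which two patterns differ. The sketch below uses that observation; the function names and the toy patterns are hypothetical, loci are indexed from 0, and every pattern is written as a full genotype vector for simplicity.

```python
def locus_differences(g_i, g_j):
    """Component-wise difference: 1 where the genotypes differ, 0 where they match."""
    return [1 if a != b else 0 for a, b in zip(g_i, g_j)]


def diversity(patterns, loci):
    """Diversity of a set of loci within one SNP pattern cluster.

    patterns: the N_R associated SNP patterns in the cluster (genotype tuples).
    loci: 0-based indices of the loci with weight 1 (gamma(B) or gamma(R)).
    """
    n_r = len(patterns)
    total = 0
    for g_i in patterns:
        for g_j in patterns:
            diff = locus_differences(g_i, g_j)
            # diff * W * diff' with a diagonal 0/1 W reduces to this count
            total += sum(diff[k] for k in loci)
    return total / (2 * n_r)
```

For the toy cluster `[(0, 1, 2), (0, 1, 1), (0, 2, 1)]`, `diversity(cluster, [0, 1, 2])` plays the role of the cluster-wide diversity and `diversity(cluster, [0])` is 0, because the first locus has the same genotype in every pattern.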

Therefore, we choose one subset B of γ(R) as a synergistic block of this SNP pattern cluster. There are three conditions for choosing B:

  (1) \( B \in R_{g} \), where \( R_{g} \) represents all possible subsets of γ(R).

  (2) Only choose significant SNP patterns whose fitness satisfies \( {\text{Fitness}}(G_{B} ) > u_{{\alpha /2}} \).

  (3) Choose the B that minimizes \( \frac{{D_{B} }}{{D_{R} }} + \frac{1}{{Adapt(G_{B} )}} \), where \( G_{B} \) is one of the SNP patterns corresponding to block B.

The \( G_{B} \) that satisfies (2) and (3) is then an associated SNP pattern for this synergistic block B.
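The selection rule can be sketched as a filter followed by a minimization. This is an illustration under stated assumptions, not the authors' code: the candidate blocks, their diversities and their fitness values are supplied by the caller, the function name is hypothetical, and Adapt is treated as the same quantity as the fitness, which the text does not state explicitly.

```python
from statistics import NormalDist


def choose_block(candidates, d_r, alpha=0.05):
    """Pick the candidate block minimizing D_B / D_R + 1 / fitness.

    candidates: list of (block, d_b, fitness) tuples, one per candidate
        subset B of gamma(R) with its best SNP pattern G_B.
    d_r: diversity of all loci in the cluster (must be positive).
    Assumption: Adapt(G_B) in the criterion equals the fitness of G_B.
    Returns the chosen block, or None if no candidate is significant.
    """
    if d_r <= 0:
        raise ValueError("D_R must be positive")
    threshold = NormalDist().inv_cdf(1 - alpha / 2)
    significant = [c for c in candidates if c[2] > threshold]
    if not significant:
        return None
    best = min(significant, key=lambda c: c[1] / d_r + 1 / c[2])
    return best[0]
```

The two terms pull in opposite directions: a small block tends to have low diversity relative to the cluster, while the fitness term rewards blocks whose pattern is strongly associated with the disease.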

Suppose there are v loci in a synergistic block, i.e., v covariates. The multivariate logistic regression model (Agresti 1996) that includes all main effects and the highest order interaction term is

$$ \log {\text{it}}(p) = \beta _{0} + \beta _{1} {\text{SNP}}_{1} + \cdots + \beta _{v} {\text{SNP}}_{v} + \beta _{{v + 1}} {\text{SNP}}_{1} \times {\text{SNP}}_{2} \times \cdot \cdot \cdot \times {\text{SNP}}_{v} , $$

where \( {\text{SNP}}_{i} \in \{ 0,1\} \), 1 ≤ i ≤ v, and \( {\text{SNP}}_{i} = 1 \) when the genotype of the ith locus is the same as that of the same locus in the synergistic block and 0 otherwise. Furthermore, \( p = \Pr ({\text{Affected}}|({\text{SNP}}_{1} , \ldots ,{\text{SNP}}_{v} ) \in \{ 0,1\} ^{v} ). \) In this model, we only consider the main effects and the interactions of all loci in a synergistic block.
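To make the covariate coding concrete, the sketch below builds one row of the design matrix for a subject and evaluates the model for given coefficients. The function names are hypothetical and the coefficients would in practice come from a fitted logistic regression, which is not shown.

```python
import math


def design_row(genotypes, block_pattern):
    """Covariates [1, SNP_1, ..., SNP_v, SNP_1 x ... x SNP_v] for one subject.

    genotypes: the subject's genotype codes at the v loci of the block.
    block_pattern: the genotype codes that define the synergistic block.
    """
    indicators = [1 if g == b else 0 for g, b in zip(genotypes, block_pattern)]
    interaction = 1
    for s in indicators:
        interaction *= s  # equals 1 only if the subject matches at every locus
    return [1] + indicators + [interaction]


def predicted_probability(row, betas):
    """Inverse logit of the linear predictor beta_0 + ... + beta_{v+1} * interaction."""
    eta = sum(x * b for x, b in zip(row, betas))
    return 1 / (1 + math.exp(-eta))
```

A subject who matches the block at every locus has all indicators and the interaction term equal to 1, so the highest-order coefficient contributes only for subjects carrying the complete pattern.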

Results

Application of the synergistic block algorithm to simulated data

Two data sets containing 100 replicates of 200 cases and 200 controls for ten unlinked biallelic loci were simulated using two two-locus interaction models as examples. The first and second loci out of ten were chosen as the disease loci with interaction effects. This number of replicates was selected to provide method validation and to enable exhaustive computational searches of all possible fourth-order SNP combinations to be performed. Hardy–Weinberg equilibrium was assumed. For the two-locus interaction disease models, the interaction effect was simulated using penetrance functions via two models. Model 1: P(Disease|AAbb) = 0.02, P(Disease|AaBb) = 0.2, P(Disease|aaBB) = 0.02, and P(Disease|others) = 0; Model 2: P(Disease|AAbb) = 0.2, P(Disease|AaBb) = 0.2, P(Disease|aaBB) = 0.2, and P(Disease|others) = 0, where A, a, B and b represent the alleles for the disease loci (Frankel and Schork 1996), with a population allele frequency of 0.3 in all cases. For the other SNPs, their population allele frequencies are drawn from the uniform distribution [0.1, 0.9]. Here, we chose the SNP combinations from among all of the fourth-order ones with fitnesses >3 as clusters when searching for a synergistic block.
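As a rough check on the two penetrance models, the population prevalence they imply under Hardy–Weinberg equilibrium can be computed directly. The sketch assumes that the stated frequency of 0.3 refers to alleles A and B at the two unlinked disease loci; the text does not say which allele it applies to, and the constant and function names are hypothetical.

```python
def hwe_genotype_freqs(p):
    """HWE frequencies of carrying 2, 1 or 0 copies of an allele with frequency p."""
    q = 1 - p
    return {2: p * p, 1: 2 * p * q, 0: q * q}


# Keys are (copies of A at locus 1, copies of B at locus 2);
# genotypes not listed have penetrance 0.
MODEL_1 = {(2, 0): 0.02, (1, 1): 0.2, (0, 2): 0.02}  # AAbb, AaBb, aaBB
MODEL_2 = {(2, 0): 0.2, (1, 1): 0.2, (0, 2): 0.2}


def prevalence(model, freq_a=0.3, freq_b=0.3):
    """Population disease prevalence for two unlinked loci in HWE.

    Assumption: freq_a and freq_b are the frequencies of alleles A and B.
    """
    f1 = hwe_genotype_freqs(freq_a)
    f2 = hwe_genotype_freqs(freq_b)
    return sum(pen * f1[g1] * f2[g2] for (g1, g2), pen in model.items())
```

Under these assumed frequencies the implied prevalence is about 0.037 for Model 1 and about 0.053 for Model 2, with the double heterozygote AaBb accounting for most cases in both models.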

The results are presented in Table 1. It includes the percentage of the 100 replicates in which the synergistic block associated with the disease was identified, the average frequency in cases and controls, the average fitness with its standard deviation, and the average odds ratio for the SNP combination corresponding to the synergistic block. These results suggest that, for these two interaction disease models, the synergistic block method has reasonable power to identify high-order gene–gene interactions.

Table 1 Simulation results for 200 cases and 200 controls in two disease models*

We also evaluated the type I error rate by simulating 100 data sets under the assumption that there is no interaction effect between unlinked loci on the disease. If there is also no main genetic effect on the disease, then we should not find any SNP synergistic block; if there is one locus with a main genetic effect on the disease, we will find this locus 70 times out of 100, and there is no multi-locus synergistic block (data not shown).

Application of the synergistic block algorithm to schizophrenia data

The four candidate genes (RGS4, FZD3, NRG1 and G72) and the seventeen SNPs that were genotyped and analyzed in this study are listed in Table 2. The SNPs SNP8NRG221533 and SNP8NRG243177 (NRG1) and rs2323019 and rs352203 (FZD3) are in linkage disequilibrium (Table 3). It is generally speculated that these four genes functionally converge to act upon schizophrenia by influencing synaptic plasticity and cortical microcircuitry (Harrison and Weinberger 2005). The seventeen SNPs were genotyped across these four candidate genes in a Chinese Han sample of 120 schizophrenia cases and 225 healthy controls. Prior reports have suggested that variations in these genes may be associated with increased risk of paranoid schizophrenia (Yue et al. 2006, 2007; Zhang et al. 2004; Harrison and Owen 2003).

Table 2 Genes and SNP ID numbers
Table 3 The linkage disequilibrium information for four genes

Using a genetic algorithm, we identified 652 significantly associated SNP patterns using 120 cases and 225 controls. The P values for the permutation tests were at most 0.01 after adjusting for multiple testing using Benjamini–Hochberg’s algorithm (Benjamini and Hochberg 1995).
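For readers unfamiliar with the adjustment, the standard Benjamini–Hochberg step-up procedure can be sketched as below. This is a generic illustration of the procedure, not the code used in the study, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order.

    Each sorted P value p_(k) is scaled by m / k, and a running minimum
    taken from the largest rank downward keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A pattern is then declared significant when its adjusted P value falls at or below the chosen false discovery rate.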

These SNP patterns were then clustered, producing five (two positively and three negatively) associated SNP pattern clusters. Five synergistic blocks (Table 4) were identified. Logistic regression models were used to quantify the size of the effect. Synergistic block 1, including the polymorphisms NRG1 (rs3735774), NRG1 (rs2919390), and RGS4 (rs12753561), is associated with schizophrenia (OR 6.74, P < 0.001). The interaction between these three SNPs is statistically significant (P = 0.0014). Synergistic block 2, including the polymorphisms RGS4 (rs12753561), FZD3 (rs2241802) and FZD3 (rs2323019), is associated with schizophrenia (OR 2.0948, P < 0.001). The interaction between these three SNPs is statistically significant (P = 0.0221). Synergistic block 3, including the polymorphisms NRG1 (SNP8NRG221533), NRG1 (rs3735774) and NRG1 (rs6988339), is associated with schizophrenia (OR 0.2014, P < 0.001). The interaction between these three SNPs is statistically significant (P = 0.0017). Similar results were obtained for synergistic blocks 4 and 5.

Table 4 The synergistic blocks extracted from the 17 SNPs of NRG1, G72, RGS4 and FZD3

Information on the models is shown in Table 5. These results indicate that the interactions of alleles at different loci located on different or the same chromosomes may significantly influence complex human diseases.

Table 5 SNP interaction effects for logistic regression models

Discussion

In the present study, we identify synergistic blocks as a genetic factor in complex disease, and use a two-stage approach to detect the effects of synergistic blocks on paranoid schizophrenia. Cluster analysis and logistic regression models are used to identify genetic associations of synergistic blocks with the phenotype. The approach is applied to detect the individual and interactive effects of four candidate genes, NRG1, G72, RGS4, and FZD3, on paranoid schizophrenia. The results suggest associations between these four genes and schizophrenia and intergenic interaction effects among these four genes on schizophrenia. However, because of the potentially data-driven nature of these conclusions and the limited multi-locus interaction model used in the simulation part, additional studies are required to confirm the validity of the present method.

Screening individual and interactive effects of disease genes in complex diseases is a feasible approach using the synergistic block-detecting method. These results further support previous findings about the interactive effects among NRG1, G72, RGS4 and FZD3, especially via glutamatergic transmission mediated by the ErbB3 and N-methyl-D-aspartate (NMDA) receptors (Harrison and Weinberger 2005), the Wnt pathway, or other processes associated with neurodevelopment and plasticity.

In this study, we found that the target SNPs interacted, and we introduced the concept of synergistic blocks. Since there are several disease loci in every synergistic block, we can address the dimensionality problem in multi-locus association analysis, and evaluate the sizes of the interactive effects between loci and their contributions to the disease. For genes that may play a role via similar neuropathological mechanisms, such as the effect of NRG1, G72, RGS4 and FZD3 on schizophrenia via synaptic function or other neurodevelopmental processes, synergistic blocks can be used to identify groups of loci that are specific to the disease and to quantify the interaction between genes or loci.

The synergistic block method described in this paper considered only ten SNPs in the simulation part and 17 SNPs in the real data set. Further numerical investigations will be needed when the number of SNPs to be examined is large, for example 100 or 1000 SNPs.

Since genome-wide association studies (Risch and Merikangas 1996; Kang and Zuo 2007) are a priority, there is also the potential for synergistic blocks to be useful on a larger scale. However, we cannot use the present method to investigate genome-wide SNP data directly because of the large number of SNP combinations. One possible strategy is to break up large analyses into roughly independent modules of hundreds of tests (or SNPs) each (Seaman and Müller-Myhsok 2005). If we then detect a synergistic block for each group of SNPs, this synergistic block can be examined by our logistic model. As long as the correlation between the modules of SNPs is reasonably low, little power will be sacrificed by approximating in this way because the synergistic block has accounted for the correlation within the blocks.
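The scale of the problem can be illustrated with a simple count. The sketch assumes that a pattern of order k is a choice of k loci together with one of three genotypes at each of them; the function name is hypothetical.

```python
from math import comb


def n_patterns(n_snps, order, n_genotypes=3):
    """Number of SNP patterns of a given order among n_snps biallelic loci.

    Assumption: a pattern fixes one of n_genotypes genotypes at each of
    `order` loci chosen from the n_snps available.
    """
    return comb(n_snps, order) * n_genotypes ** order
```

With ten SNPs there are 17,010 fourth-order patterns, which is why an exhaustive search was feasible in the simulations, whereas `n_patterns(1000, 4)` is already on the order of 10^12, which motivates splitting a large analysis into modules.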

Electronic database information

deCODE genetics, http://www.decode.com/nrg1/markers for SNPs and microsatellite markers in NRG1.

GenBank, http://www.ncbi.nlm.nih.gov/SNP/ for NRG1, G72, RGS4, FZD3.