Introduction

A syndrome is the combination of signs and symptoms that are associated with a morbid process (Dirckx, 2001) and which co-occur frequently enough to indicate a common underlying disease mechanism; a prototypical example is metabolic syndrome, which is estimated to affect nearly 25% of the US population, and greatly increases the risk of cardiovascular disease (Ford et al., 2002). The constellation of metabolic syndrome symptoms includes fasting hyperglycemia, high blood pressure, central obesity, increased triglycerides and decreased high-density lipoprotein (HDL) levels. The causes of metabolic syndrome are not known but aging, genetics, environmental and lifestyle factors such as physical activity and diet are considered to be important contributors. Delineating the gene–gene (GGI) and gene–environmental interactions (GEI) for syndromes may potentially enable identification of the causative pathobiological mechanisms.

There are some unique problems and methodological gaps in the GGI and GEI analysis of syndromes. In particular, syndromes are a distinct class of clinical conditions that may not be appropriately characterized by any single phenotype. Even when the constituent syndrome-affected phenotypes are individually amenable to genetic analysis, such an approach may miss loci involved in regulating the constellation of phenotypes. We hypothesize that an integrated and unified analysis of the overall syndrome phenotype is preferable for identifying the master pathophysiological mechanism(s). The identification of a unifying pathophysiology for syndromes could lead to better and more targeted treatment strategies as the underlying causes could be addressed instead of separately treating each component. Effective analysis methods also need to address issues arising from linear, and potentially non-linear, dependencies or relationships between the constituent syndrome-affected phenotypes for particular genetic and environmental predictor combinations. Accounting for the confounding effects of dependencies among the genetic and environmental predictor variables caused by factors such as pairwise linkage disequilibrium (LD) and correlations among environmental variables is also a challenge common to GEI and GGI analysis. In this research, we develop a methodology for addressing these unique problems associated with the interaction analysis of syndromes.

Previous reports from our group (Chanda et al., 2008, 2009) have demonstrated the usefulness of the K-way interaction information (KWII) and the phenotype-associated information (PAI) for GGI and GEI analysis of discrete phenotypes and quantitative traits (QTs). This paper extends our methodology and focuses on the development and critical assessment of SYMPHONY, a unique information-theoretic method for GGI and GEI analysis of syndromes.

Materials and methods

Definitions, terminology and representation

GGI and GEI

The methods in this paper are applicable to both GEI and GGI analyses and henceforth, we will simply use the term GEI to refer to both. The term GGI will be used only when environmental variables are not present in a data set.

Entropy

The entropy (Shannon, 1997), H(X), of a discrete random variable X is obtained from the probabilities p(X=x) of each event, x, in X, using the formula:
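The formula itself is not reproduced in this text. The standard Shannon definition, which matches the description above, is given here as a reconstruction; the base of the logarithm only sets the unit of information and is left unspecified:

```latex
H(X) = -\sum_{x} p(X = x)\,\log p(X = x)
```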

K-way interaction information

The KWII and PAI were developed in our earlier publications (Chanda et al., 2007) and are recapitulated here. For the three-variable case, the KWII is defined in terms of entropies for the individual variables, H(A), H(B) and H(P) and the joint entropies for the combinations of the variables, H(A, B), H(A, P), H(B, P) and H(A, B, P):
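The equation is missing from this text. The form below is the usual three-variable interaction information built from exactly the entropies listed above, with the sign convention under which the bivariate case reduces to the (non-negative) mutual information; it should be read as a reconstruction rather than a quotation:

```latex
\mathrm{KWII}(A; B; P) = -H(A) - H(B) - H(P) + H(A,B) + H(A,P) + H(B,P) - H(A,B,P)
```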

For the K-variable case on the set v={X1, X2, ..., XK}, the KWII can be written succinctly as an alternating sum over all possible subsets T of {v, P} using the difference operator notation of Han (1980):
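The general expression is likewise absent here. A reconstruction that agrees term by term with the three-variable case (K = 2 predictors plus P) is sketched below, where |T| denotes the number of variables in the subset T and the sum runs over the non-empty subsets:

```latex
\mathrm{KWII}(\nu; P) = -\sum_{T \subseteq \{\nu, P\}} (-1)^{\,K + 1 - |T|}\, H(T)
```

For K = 1 this gives H(X1) + H(P) − H(X1, P), the mutual information between a single predictor and the phenotype.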

The number of variables K in a combination is called the order of the combination; P is the phenotype variable. The KWII quantifies interactions by representing the information that cannot be obtained without observing all K variables at the same time (McGill, 1954; Fano, 1961; Jakulin and Bratko, 2004; Jakulin, 2005). The KWII of a given combination of variables is a parsimonious interaction metric because it does not contain contributions arising from the KWII of other lower order combinations (subsets) of these variables.

In the bivariate case, the KWII is always non-negative but in the multivariate case, the KWII can be positive or negative.

Interaction definition

Operationally, we define positive KWII values to indicate interactions (or net synergy) between the variables and negative KWII values to indicate net redundancy between variables. A value of zero indicates the net absence of K-way interactions.
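The sign convention can be illustrated with a minimal Python sketch for fully discrete data. The function names `entropy` and `kwii`, the use of base-2 logarithms and the toy columns are illustrative assumptions and are not taken from the SYMPHONY implementation; the alternating sum is the standard subset expansion of interaction information.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(columns):
    """Joint Shannon entropy (bits) of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def kwii(columns):
    """KWII as the negated alternating sum of the entropies of all non-empty subsets."""
    k = len(columns)
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(columns, size):
            total += sign * entropy(subset)
    return -total


# Toy data (hypothetical): two binary predictors; the phenotype is the last column.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
synergy = kwii([a, b, [x ^ y for x, y in zip(a, b)]])  # phenotype = A XOR B
redundancy = kwii([a, a, a])                           # three identical columns
```

In this toy setting the XOR phenotype, which cannot be predicted from either predictor alone, yields a positive value (net synergy), whereas three copies of the same variable yield a negative value (net redundancy).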

Phenotype-associated information

The PAI is obtained from the total correlation information (TCI) (Watanabe, 1960), which is the difference between the sum of the entropies of the individual variables H(Xi) and the entropy of the joint distribution H(X1, X2, …, XK).

The TCI can be viewed as a general measure of dependency.

The PAI represents the overall dependency among the set of genetic and environmental variables with the phenotype by removing the interdependencies among the genetic and environmental variables. The interdependencies among variables can be caused by factors such as LD or by correlated source patterns of pollutant exposures. Accordingly, the PAI is defined by:
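The defining equations are not reproduced in this text. Based on the verbal definitions given here (the TCI as the sum of individual entropies minus the joint entropy, and the PAI as the difference of two TCI terms), they can be reconstructed as follows; the last line is an algebraic simplification added for illustration:

```latex
\mathrm{TCI}(X_1, \ldots, X_K) = \sum_{i=1}^{K} H(X_i) - H(X_1, \ldots, X_K)

\mathrm{PAI}(X_1, \ldots, X_K; P) = \mathrm{TCI}(X_1, \ldots, X_K, P) - \mathrm{TCI}(X_1, \ldots, X_K)

\mathrm{PAI}(X_1, \ldots, X_K; P) = H(P) + H(X_1, \ldots, X_K) - H(X_1, \ldots, X_K, P)
```

The simplified form is the mutual information between the joint predictor combination and the phenotype, which cannot decrease when a further predictor is added to the combination.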

In the above equation, the genetic and environmental variables (predictors) are denoted by the X1, X2, …, XK, and the quantitative trait is denoted by P.

In the PAI definition, the TCI(X1, X2, …, XK, P) term represents the overall dependency between the predictors and the phenotype, whereas the TCI(X1, X2, …, XK) term represents the interdependencies among the predictors in the absence of the phenotype.

Our approach utilizes the KWII as the principal measure of the GEI. However, we employ the PAI to facilitate efficient searching of the combinatorial space (Chanda et al., 2008, 2009). KWII computations require the entropies of all subsets and are computationally intensive. PAI calculations are computationally more tractable because only individual and joint entropies are needed. The PAI contains useful information regarding the KWII and also increases monotonically with increased combination size, which makes it a suitable search metric for hill climbing algorithms (Chanda et al., 2008, 2009).

Extension to syndrome multiple QT vector phenotypes

We have previously demonstrated that the general expressions for the KWII and PAI can be used for single QTs and categorical phenotypes (Chanda et al., 2008, 2009), as well as count/rate data (Knights and Ramanathan, 2012). However, specific expressions are necessary for the entropy of the multiple QT (MQT) vector phenotype and the subsequent joint entropies of the MQT vector phenotype with discrete variables resulting from the genetic variants, environmental variables and their combinations.

We assume that the MQT vector phenotype, P, is multivariate normally distributed. The entropy of a multivariate normal distribution containing M variables, NM(μ, Σ), with mean vector μ and covariance matrix Σ is (Gokhale et al., 1989):
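The expression is missing from this text. The well-known closed form for the entropy of an M-variate normal distribution, written with the natural logarithm and |Σ| denoting the determinant of the covariance matrix, is:

```latex
H(P) = \frac{1}{2} \ln\!\left[ (2\pi e)^{M} \, |\Sigma| \right]
```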

For GGI and GEI analysis in this report, we are interested in H(X, P), the entropy of the joint distribution of the MQT vector, P, and a discrete variable X, representing, for example, genetic variants or environment variables of interest or their combinations.

Therefore:

We assume that the MQT vector, P, given X=x, is also multivariate normally distributed NM(μx, Σx). By expanding, simplifying and substitution, we obtain:
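The intermediate and final expressions do not survive in this text. A reconstruction that follows from the chain rule for entropy and the multivariate normal entropy of each conditional distribution NM(μx, Σx) is sketched below; it is a derivation under the stated normality assumption, not a quotation of the original equations:

```latex
H(X, P) = H(X) + H(P \mid X) = H(X) + \sum_{x} p(X = x)\, H(P \mid X = x)

H(X, P) = H(X) + \frac{1}{2} \sum_{x} p(X = x) \ln\!\left[ (2\pi e)^{M} \, |\Sigma_x| \right]
```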

The H(X) term contains only discrete variables and as a result, this entropy can be computed using the usual Shannon entropy formula:

The p(X=x) are the probabilities of the event x (for example, the frequency of a particular genotype). These equations for the entropy of the MQT phenotype, the entropy of discrete variable combinations, and the joint entropy of the combinations composed of the MQT phenotype with discrete variables enable computation of the KWII and the PAI for GGI and GEI analysis of the syndrome phenotype.
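As a rough illustration of how these entropies combine, the following Python sketch estimates the first-order KWII({X, P}) = H(X) + H(P) − H(X, P) for one discrete predictor and an MQT phenotype. All function names, the maximum-likelihood covariance estimate, the use of natural logarithms and the simulated data are assumptions made for this example; it is not the SYMPHONY code.

```python
import math
import random
from collections import Counter, defaultdict


def covariance(rows):
    """Maximum-likelihood covariance matrix of a list of equal-length vectors."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
             for j in range(m)] for i in range(m)]


def log_det(matrix):
    """Log-determinant of a positive-definite matrix via Cholesky factorisation."""
    m = len(matrix)
    chol = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(m))


def mvn_entropy(rows):
    """Entropy (nats) of a multivariate normal fitted to the rows."""
    m = len(rows[0])
    return 0.5 * (m * math.log(2 * math.pi * math.e) + log_det(covariance(rows)))


def discrete_entropy(labels):
    """Shannon entropy (nats) of a discrete variable."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def joint_entropy(labels, pheno):
    """H(X, P) = H(X) + sum over x of p(x) * H(P | X = x)."""
    groups = defaultdict(list)
    for x, p in zip(labels, pheno):
        groups[x].append(p)
    n = len(labels)
    return discrete_entropy(labels) + sum(
        len(rows) / n * mvn_entropy(rows) for rows in groups.values())


def kwii_single(labels, pheno):
    """First-order KWII({X, P}) = H(X) + H(P) - H(X, P)."""
    return discrete_entropy(labels) + mvn_entropy(pheno) - joint_entropy(labels, pheno)


rng = random.Random(1)


def simulate(n, r):
    """Draw n bivariate-normal rows with zero means, unit variances, correlation r."""
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((z1, r * z1 + math.sqrt(1 - r * r) * z2))
    return rows


# Hypothetical two-level predictor whose levels differ only in trait correlation.
pheno = simulate(1000, 0.1) + simulate(1000, 0.9)
informative = kwii_single(["lo"] * 1000 + ["hi"] * 1000, pheno)
uninformative = kwii_single(["x", "y"] * 1000, pheno)
```

Because the two groups share means and variances and differ only in the correlation between the traits, the signal picked up by `informative` comes entirely from the covariance structure, whereas the alternating labels in `uninformative` carry essentially none.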

The SYMPHONY algorithm

We designed the SYMPHONY algorithm to mine GEI involving an MQT vector based on our previous work on discrete phenotypes and single QTs. The input of the algorithm is a quadruple (X, P, θ, τ): X={x1, x2, …, xn} is the set of genetic and environmental variables, P is the syndrome MQT vector, θ is the number of combinations to be retained in each iteration and τ is the highest order of interactions to be assessed.

The SYMPHONY algorithm employs the PAI iteratively to search for the set of candidate combinations of genetic and environmental variables that have strong interactions involving the MQT vector. In the ith iteration (i = 1 to τ), the θ combinations containing i genetic/environmental variables with the top-ranked PAI values are retained in Xi and passed to the (i+1)th iteration for the PAI computation of combinations with i+1 variables. After τ iterations, this yields the promising combinations {X1, X2, …, Xτ}, which contain up to τ variables. Finally, for each combination c that is a subset of one of the combinations in Xi, KWII(c, P) is calculated to measure the interaction for the combination (c, P).
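The search described above can be sketched in Python as follows. This is an illustrative reading of the text, not the published pseudo code: in particular, it assumes that candidates with i+1 variables are formed by adding one variable to each retained combination, and the `pai` and `kwii` arguments are hypothetical callables that score a combination of variable names against the phenotype.

```python
from itertools import combinations


def symphony_search(variables, pai, kwii, theta, tau):
    """Greedy PAI-guided search followed by KWII scoring (illustrative sketch)."""
    retained = []             # X_1, ..., X_tau
    frontier = [frozenset()]  # combinations to be extended in the current iteration
    for _ in range(tau):
        # Assumption: extend every retained combination by one additional variable.
        candidates = {combo | {v} for combo in frontier
                      for v in variables if v not in combo}
        # Keep the theta candidates with the highest PAI (ties broken by name).
        ranked = sorted(candidates, key=lambda c: (-pai(c), sorted(c)))[:theta]
        retained.append(ranked)
        frontier = ranked
    # Score every non-empty subset of every retained combination with the KWII.
    subsets = set()
    for level in retained:
        for combo in level:
            for size in range(1, len(combo) + 1):
                subsets.update(frozenset(c) for c in combinations(sorted(combo), size))
    return {c: kwii(c) for c in subsets}


# Toy scoring functions (hypothetical) standing in for PAI(c, P) and KWII(c, P).
weights = {"snp1": 3.0, "snp2": 2.0, "age": 1.0, "smoking": 0.0, "snp3": 0.0}


def toy_pai(combo):
    return sum(weights[v] for v in combo)


def toy_kwii(combo):
    return 1.0 if combo == frozenset({"snp1", "snp2"}) else 0.0


result = symphony_search(list(weights), toy_pai, toy_kwii, theta=2, tau=2)
```

With these toy scores, only subsets of the retained combinations are ever passed to the more expensive KWII step, which is the point of using the PAI as a search metric.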

The pseudo code for SYMPHONY is shown in Supplementary data.

Methods for the case studies

Case Study 1, two single-nucleotide polymorphism interactions with a bivariate phenotype

The model used for this prototypical example of GGI is shown in Figure 1a. The allele frequencies for all bi-allelic single-nucleotide polymorphisms (SNPs) were 0.5.

Figure 1

Case Study 1: two-SNP interactions with a bivariate phenotype. (a) The gene–gene interaction model for Case Study 1. The alleles of SNP 1 are A and a, whereas those of SNP 2 are B and b; the asterisk is a ‘wild card’ indicating either allele. For Case Study 1, the Pearson correlation (R2) value for the correlation between the traits P1 and P2 was 0.5 for the aa genotype at SNP 1, 0.9 for the {A*, bb} genotypes and 0.1 for all {A*, B*} genotypes; the means were all zero and the variances were all unity. (b) The relationship between the traits P1 and P2, which comprise the syndrome P, for various genotype combinations of SNP 1 and SNP 2; each plot in the panel represents the genotype combination indicated. Each data point corresponds to a single subject and the regression lines are shown. (c) The mean KWII values at an effect size of 0.4 for the combinations indicated in the y-axis. The error bars are standard deviations of the KWII. The P-values from permutation testing are shown for combinations with a P-value ≤0.05. (d) The KWII values of the combinations in (c), but with only phenotype P2 considered. (e) A plot of KWII vs effect size for the combinations {1, P}, {2, P} and {1, 2, P}, which are indicated as open circles, filled circles and triangles, respectively.

There were two traits P1 and P2 comprising the MQT vector, whose bivariate normal joint distribution N2(μ, Σ) had mean vector μ and covariance matrix Σ. The means for the P1 and P2 traits were zero for all genotype combinations and standard deviations of the P1 and P2 traits were both unity for all genotype combinations. However, the genotype combinations differed in the extent to which the P1 and P2 traits were correlated. The Pearson correlation (R2) value for the correlation between the traits P1 and P2 was 0.5 for all combinations containing the aa genotype at SNP 1, 0.9 for the {A*, bb} genotype combinations and 0.1 for all {A*, B*} genotype combinations (the asterisk in a genotype combination is a ‘wild card’ that denotes that any allele is suitable). In the nomenclature of Figure 1a, the covariance matrices are:

A total of 100 independent simulated data sets, each of sample size n=2000, were used for power calculations. The Pearson correlation (R2) value for the correlation between the traits P1 and P2 was 0.5 for all genotype combinations containing the aa genotype at SNP 1, (0.5+E) for the {A*, bb} genotype combinations and (0.5−E) for all {A*, B*} genotype combinations. In the nomenclature of Figure 1a, the covariance matrices are

The parameter E was treated as the effect size for this model; the range of E was 0<E<0.5. The distribution of the KWII for an effect size of zero was obtained and its 95th percentile value was computed. Power was defined as the fraction of the simulations whose KWII value exceeded the 95th percentile value of the KWII distribution for the zero effect size.
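The power definition can be written as a short Python sketch. The function names and the linear-interpolation rule used for the percentile are assumptions for illustration; the text does not state how the 95th percentile was computed.

```python
def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def power(null_kwii, alt_kwii, q=95.0):
    """Fraction of simulated KWII values exceeding the q-th percentile of the null."""
    threshold = percentile(null_kwii, q)
    return sum(k > threshold for k in alt_kwii) / len(alt_kwii)
```

Here `null_kwii` would hold the KWII values from the simulations at zero effect size and `alt_kwii` those from the simulations at the effect size of interest.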

Permutation-based KWII P-values were obtained by randomly permuting the phenotype MQT vectors; 10 000 independent permutations of the phenotype were conducted for each combination of interest.
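A minimal Python sketch of this permutation scheme is given below. The function name, the `statistic` callable (standing in for the KWII of a combination) and the choice to report the plain fraction of permutations that reach the observed value are assumptions; the text does not specify the exact estimator.

```python
import random


def permutation_p_value(predictors, phenotypes, statistic, n_perm=10_000, seed=0):
    """Estimate a P-value by shuffling whole phenotype vectors across subjects."""
    rng = random.Random(seed)
    observed = statistic(predictors, phenotypes)
    shuffled = list(phenotypes)  # each element is one subject's MQT vector
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)    # keeps each subject's traits together
        if statistic(predictors, shuffled) >= observed:
            exceed += 1
    return exceed / n_perm
```

Shuffling complete phenotype vectors rather than individual traits preserves the dependencies among the constituent traits while breaking their association with the predictors. With this estimator, a result of zero from 10 000 permutations would be reported as P < 0.0001.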

Case Study 2, two-SNP interactions with a three-variate phenotype

The model used for this Case Study is shown in Figure 2a. The allele frequencies for all bi-allelic SNPs were 0.5.

Figure 2

Case Study 2: two-SNP interactions with a three-variate phenotype. (a) The interaction model used to generate the data for Case Study 2. The model is similar to that in Figure 1 but involves a phenotype vector containing three quantitative trait variables. (b) The data distribution of the phenotype vector for the various genotype combinations involved in the interaction model. (c) The mean KWII values at an effect size of 0.4 for the combinations indicated in the y-axis. The error bars are standard deviations of the KWII. The P-values from permutation testing are shown for combinations with P-values ≤0.05. (d) A plot of KWII vs effect size for the combinations {1, P}, {2, P} and {1, 2, P}, which are indicated as open circles, filled circles and triangles, respectively.

The MQT vector phenotype P comprised three traits P1, P2 and P3, whose three-variate normal joint distribution N3(μ, Σ) had mean vector μ and covariance matrix Σ. The means for the P1, P2 and P3 traits were zero for all genotype combinations, and the standard deviations of the P1, P2 and P3 traits were each unity for all genotype combinations. However, the genotype combinations differed in their covariance matrices, that is, in the extent to which the P1, P2 and P3 traits were correlated. In the nomenclature of Figure 2a, the covariance matrices are

As in Case Study 1, a total of 100 independent simulated data sets each of sample size n=2000 were used for power calculations. Power was defined as the fraction of the simulations whose KWII value exceeded the 95th percentile value of the KWII distribution for the zero effect size. The P-values were calculated using the permutation-based approach in Case Study 1.

Case Study 3, bivariate phenotype with non-linear dependencies

The model used for this Case Study is shown in Figure 3a. The allele frequencies for all bi-allelic SNPs were 0.5.

Figure 3

Case Study 3: bivariate phenotype with non-linear dependencies. (a) The interaction model for Case Study 3. The model is similar to that in Figure 1 except for the phenotype with non-linear dependence between the two quantitative trait variables. (b) The data distribution of the phenotype vector for the various genotype combinations involved in the interaction model; the regression curves for each of the genotypes are also shown. (c) The mean KWII values at an effect size of 1.0 for the combinations indicated in the y-axis. The error bars are standard deviations of the KWII. The P-values from permutation testing are shown for combinations with P-values ≤0.05. (d) A plot of KWII vs effect size for the combinations {1, P}, {2, P} and {1, 2, P}, which are indicated as open circles, filled circles and triangles, respectively.

The phenotype P comprised two inter-dependent QTs, P1 and P2. The dependence between P1 and P2 was modeled by using a non-linear function that differed between the genotype combinations as shown in Figure 3a. For the genotype combinations containing the aa genotype at SNP 1, P2 was independent of P1 and defined by:

For the {A*, bb} genotype combination, the relationship between P1 and P2 was

For all {A*, B*} genotype combinations, the relationship between P1 and P2 was

For analysis, the relationship between P1 and P2 for all subjects was fitted to a second-order polynomial using linear regression. The fitted regression equation was used to calculate the residuals P2R between P2 and its predicted value. For KWII calculations, the phenotype P comprising P1 and P2R was assumed to follow a bivariate normal distribution.

The effect size E was varied by altering the standard deviation of the normal random variate as N1(0, 1/E). The P-values were calculated using the permutation-based approach in Case Study 1.

Analysis of public domain data sets

Hybrid simulations with Genetic Analysis Workshop 15 Genotypes

The data corresponding to Problem 3 of Genetic Analysis Workshop 15 (GAW15) were obtained from the GAW site (http://www.gaworkshop.org/) and used with permission.

These data consist of 100 replicates of simulated data that are modeled after the rheumatoid arthritis data and were generated by Miller et al. (2007). Each replicate includes 1500 nuclear families, each with 2 parents and an affected sib pair, and 2000 unrelated controls.

For evaluating the effectiveness of SYMPHONY on a data set with a larger number of predictors, we used a subset of data from the larger GAW15 Problem 3 containing 9187 SNPs distributed across the genome to mimic a 10-K SNP chip set. Additionally, sex, age and smoking status were included as covariates; the age variable was binned into three intervals of equal width. The analyses were conducted with unphased genotypes. We pooled sequential pairs of replicates from the GAW15 replicates to generate 50 replicate data sets containing 4000 subjects. We refer to this data set as the ‘10 K GAW15 Data set.’

The original GAW15 data set was not simulated to include MQT vector phenotypes. To enable structured evaluation of SYMPHONY, we simulated an MQT vector phenotype random variable against the background of GAW15 genotypes and covariates. The model used (Figure 4a) contained interactions among SNP C6-153 from the DR locus (spanning SNPs C6-152-C6-155) on Chromosome 6 and SNP C16-30 from Locus A (spanning SNPs 30-31 on Chromosome 16), age and smoking. The other SNPs were uninformative, that is, they were not associated with the vector phenotype in the simulations. The covariance matrices for the mixture of bivariate normal distributions arising from the interactions in Figure 4 were

Figure 4

Results for hybrid simulations with the GAW15 Problem 3 data set: The model used for simulating the vector phenotypes is shown in (a). The value of the covariance corresponding to the correlation coefficient R1 was 0.75 and that corresponding to R0 was 0.25. The mean KWII values for the five one- and two-predictor-containing combinations with the highest KWII values are shown in (b). The permutation P-values are shown against the bars and the error bars are standard deviations of the KWII.

The P-value of the KWII of each combination was obtained using 10 000 independent permutations of the phenotype MQT vector for each combination assessed. The permutation P-values reported are for the first replicate.

We analyzed all 50 replicates individually to obtain KWII values. The mean and standard deviations for the KWII of each combination of variables were calculated from these results.

All the analyses were performed with SYMPHONY input parameters values of θ=50 and τ=3.

The computational speed was assessed in the context of the hybrid GAW15 data set simulations on a computer with a 3.2-GHz Intel Xeon (Irwindale) processor and 4 GB of main memory. We created six data sets containing 30, 100, 300, 1000, 3000 or the entire 9187 SNPs from the 10-K GAW15 data set. All data sets contained 4000 subjects and the sex, smoking, age, C6-153 and C16-30 variables.

Analysis of HDL and Atherosclerosis Data

This data set from mice contains genotypes, HDL concentrations and size of aortic fatty streak measurements for 294 female F2 intercross progeny (derived from the C57BL/6 (B6) strain, which is susceptible to atherosclerosis and has low levels of HDL, and the 129 strain, which is relatively resistant to atherosclerosis and has high HDL levels) that were fed a high-fat diet for 14 weeks (Ishimori et al., 2004). The mice were genotyped with 88 simple sequence length polymorphic markers and subsequently 23 additional simple sequence length polymorphic markers in the QTL regions were added. The data were obtained from the Center for Genome Dynamics at the Jackson Laboratory (http://cgd.jax.org/nav/qtlarchive1.htm).

We analyzed the HDL and atherosclerotic aortic fatty streak lesion size as the MQT vector phenotype of interest. The atherosclerotic aortic fatty streak lesion size variable was logarithmically transformed using the function log10(1+Atherosclerotic Aortic Fatty Streak Lesion Size) before analysis; the 1 was added to accommodate individuals with no atherosclerotic streaks within the same transformation framework. The vector phenotype was obtained by normalizing the means and standard deviations of both the HDL and transformed atherosclerotic aortic fatty streak lesion size variable to zero and 1, respectively.
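The preprocessing described above can be sketched in Python as follows. The function names and the use of the sample standard deviation are assumptions for illustration; the text does not state which standard deviation estimator was used.

```python
import math
import statistics


def standardize(values):
    """Shift and scale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


def build_phenotype(hdl, lesion_size):
    """Return one (HDL, log-lesion) vector per subject after transformation."""
    # log10(1 + x) keeps subjects with no lesions (x = 0) at a finite value of 0.
    log_lesion = [math.log10(1.0 + s) for s in lesion_size]
    return list(zip(standardize(hdl), standardize(log_lesion)))
```

Standardizing both traits places them on a common scale before the covariance matrices of the vector phenotype are estimated.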

The P-value of the KWII of each combination was obtained using 10 000 independent permutations of the MQT vector phenotype for each combination assessed.

The results were compared with the findings on the same data set using regression presented in Tables 1–3 of Ishimori et al. (2004) and to the information-theoretic findings in Chanda et al. (2009).

Analysis of 1000 genomes data

Data from the 1000 Genomes project was downloaded from http://www.1000genomes.org/data. PLINK was used to compute minor allele frequency (MAF) and pairwise LD values.

Minor allele frequency

To investigate the effect of MAF on the KWII, we used the Southern Han Chinese (CHS) data set. Each autosome was divided into five segments of equal length and 20 contiguous SNPs were randomly selected from each segment. A data set containing 2200 SNPs and 4000 subjects was created by independently resampling each of the 22 autosomes. The Model in Case Study 1 was used to assign bivariate vector phenotypes. The informative SNP 1 in this Case Study was randomly selected from among the SNPs with MAF of 0.49±0.01. The informative SNP 2 in Case Study 1 was among the SNPs with MAF of 0.49±0.01, 0.33±0.01, 0.25±0.01 and 0.1±0.01.

Linkage disequilibrium

The dependence of KWII on LD was evaluated using the same general method for MAF evaluation; however, both informative SNPs (SNP 1, SNP 2) had MAFs between 0.48 and 0.5 and were randomly selected from among the SNP pairs in LD within the following ranges of R2: 0–0.3, 0.35–0.4, 0.45–0.5, 0.65–0.8 and 0.85–1.

Admixture/population stratification

To assess the effect of race admixture/population stratification on the KWII, the 1000 Genomes data from the Southern Han Chinese (CHS), Finnish (FIN) and Toscani (TSI) groups were filtered to obtain a single data set containing genotypes for all the SNPs that were shared between the three groups. From each autosome, 100 SNPs were randomly selected as described for the MAF and LD simulations. The informative SNP 1 and SNP 2 in Case Study 1 were selected from among the SNPs with MAFs between 0.48 and 0.5.

The Model in Case Study 1 was used to randomly assign bivariate vector phenotypes. The covariance matrices for different genotype combinations of the three groups, however, were different:

The covariance matrix of the bivariate phenotype for the FIN group was independent of the genotype. Saliently, the covariance matrices of the CHS and TSI groups for the {A*, bb} and the {aa, **} genotypes differed.

Data sets containing 4200 subjects were obtained by resampling the shared genotypes. A race variable was included for each subject based on the group (CHS, FIN or TSI) from which the resampled genotypes were obtained. To assess the effects of admixture, the percent of the FIN group in the population was varied (0, 10, 25, 33, 50 and 80%) and the relative percentages of subjects from the CHS and TSI groups were kept equal. The effects of admixture were assessed through the dependence of the KWII on the percentage of the FIN group in the population.

Results

Analysis of case studies

Case Study 1

To assess the usefulness of our information-theoretic approach, we used the GGI model in Figure 1a to generate simulated data. The interaction model was inspired by the genetics of coat color in Labrador retrievers, which involves interactions between two loci involved in skin pigmentation. The MQT phenotype vector consisted of two QTs.

To exclude the possibility that the information-theoretic approach was detecting interactions by identifying associations between the individual QT in the phenotype vector, we ensured that there were no differences in the mean values and standard deviations among the genotypes on either of the two QTs: the mean values and standard deviations for both QT were zero and unity for each of the nine genotypes. The strength of the associations between the QT, however, was dependent on the genotype. These relationships are summarized in Figure 1b.

Figure 1c shows the KWII values for each combination. We found three significant peaks {1, P}, {2, P} and {1, 2, P}. From permutation analysis, the P-value of the KWII for the {1, P} combination was 0.004 and the P-values for the KWII of the {2, P} and {1, 2, P} combinations were both <0.0001. The P-values for the KWII of all other combinations were >0.05. Figure 1d shows the KWII values for the same data when the quantitative trait P2 was analyzed individually without considering P1: there were no prominent peaks. The results for individual analysis of P1 without considering P2 were similar and are not shown.

Figure 1e shows the dependence of the KWII values of the {1, P}, {2, P} and {1, 2, P} combinations when the effect size (defined as the difference in the covariance between P1 and P2) between the three outcomes in Figure 1a was varied. The logarithm of KWII values for the {1, P}, {2, P} and {1, 2, P} combinations increased monotonically with increasing effect size.

The power of the KWII for detecting the interactions for the {1, P}, {2, P} and {1, 2, P} combinations was computed at a sample size of n=2000 subjects and effect sizes of 0.2 and 0.4. At the effect size of 0.2, the power to detect the {1, P}, {2, P} and {1, 2, P} was 32, 100 and 88%, respectively. At the higher effect size of 0.4, the power to detect the {1, P}, {2, P} and {1, 2, P} combinations was 96, 100 and 100%, respectively.

The results from this motivating example demonstrate that the information-theoretic approach is capable of detecting GGI associated with vector phenotypes representing syndromes. High KWII values were observed for SNPs involved in interactions compared with uninformative SNPs, which had low KWII values; the value of the KWII was also dependent on the effect size.

Case Study 2

Syndromes such as metabolic syndrome can contain more than two QTs. We therefore evaluated the ability of our method to identify interacting SNPs in a simulation framework similar to Case Study 1 but with an MQT phenotype comprising three QTs. Figure 2a shows the model used to generate the simulated data.

Figure 2b is a panel of three-dimensional scatter plots summarizing the generated data and shows that the mean values and standard deviations for the QT were identical for all the genotype combinations of SNP 1 and SNP 2. The KWII method therefore relies on the dependencies among the constituent phenotypes in the syndrome MQT vector to detect the underlying GGI.

Figure 2c shows the KWII values, their confidence intervals and P-values for various combinations in the model. As would be expected from the simulation model, the KWII values for the informative combinations {1, P}, {2, P} and {1, 2, P} had the highest peaks (all P-values ≤0.001). Figure 2d shows the dependence of the KWII values for the {1, P}, {2, P} and {1, 2, P} combinations when the effect size was varied. As for Case Study 1, the logarithm of KWII values for the {1, P}, {2, P} and {1, 2, P} combinations increased monotonically with increasing effect size.

The power of the KWII for detecting the interactions in the {1, P}, {2, P} and {1, 2, P} combinations was computed at a sample size of n=2000 subjects and effect sizes of 0.2 and 0.4. For both effect sizes, the power to detect the {1, P} and {2, P} combinations was 100%. The power to detect the {1, 2, P} combination was 100% at the effect size of 0.4 and 62% at the effect size of 0.2.

Thus, the KWII method is capable of detecting complex dependencies in three-variate phenotypes in the absence of main effects on the constituent QTs.

Case Study 3

A complexity with syndrome MQT vector data that is not present in discrete phenotypes or in the analysis of a single QT is the possibility of a non-linear inter-dependence between the constituent traits. We developed the strategy, described in Materials and methods, for identifying interactions in the presence of non-linearity by employing the residuals from regression.

Figure 3a summarizes the model. The non-linear dependence of the constituent QT for the various genotype combinations of SNP 1 and SNP 2 is highlighted in Figure 3b. Figure 3c shows that the KWII correctly identifies the main effects of SNP 1, SNP 2 and the interaction between SNP 1 and SNP 2 with the syndrome MQT vector phenotype. The variation of KWII with changes in effect size is summarized in Figure 3d. These results demonstrate that the proposed approach can be used even when non-linear dependence between the constituent phenotypes is present.

Performance on the hybrid 10 K GAW15 data set

The goal was to demonstrate the utility of our method on a larger data set with numerous genetic markers and environmental variables. We used the hybrid simulation approach so that the ground truth required for critical assessment of the method was known. The predictors from the GAW15 simulations conducted by Miller et al. (2007) were retained and the vector phenotypes were randomly assigned based on the model in Figure 4a.

Figure 4b summarizes the average KWII values, standard deviations and permutation-based P-values for the top five one- and two-predictor containing combinations with the highest mean KWII values in the GAW15 data set. The top-ranked one-predictor containing combinations were {Age, P}, {Smoking, P}, {C16-30, P}, {C6-153, P} and {C4-330, P}. Four of the top-five combinations were present in the model (Figure 4a); the {C4-330, P} was not present in the model but had a non-significant P-value of 0.11. The two-predictor containing combinations with the highest mean KWII were {Age, C6-153, P}, {Age, Smoking, P}, {Age, C6-154, P}, {Age, C16-30, P} and {Smoking, C16-30, P}. The C6-154 SNP in the {Age, C6-154, P} combination is part of the same locus (HLA DR locus) that contains the C6-153 SNP present in the model and is in LD with it.

Computational speed

To evaluate the computational speed, we utilized the GAW15 data set as a test bed. We generated subsets of the GAW15 data set containing either 1000 or 4000 subjects, and 30, 100, 300, 3000 or all 9192 predictors in the modified GAW15 data set. The runtime requirements for obtaining the KWII values of all one- and two-predictor combinations are summarized in Table 1. The results indicate that the analysis of the GAW15 data set for 4000 subjects and 9192 predictors required 11 h.

Table 1 Run time assessments of SYMPHONY

The SYMPHONY search algorithm is computationally much more efficient than an exhaustive search, which requires computation of all possible combinations and requires exponential time. Let m be the sample size of the data, n be the number of variables (excluding phenotype vector), τ be the number of iterations and θ be the number of combinations retained in each iteration of computing the PAI: using order notation from complexity theory, the running time of SYMPHONY can be estimated as O(τ n θ m^2 + τ θ 2^τ m^2).

Comparison with other methods: analysis of HDL-atherosclerosis data

The HDL-atherosclerosis data set was used to evaluate the results from SYMPHONY because this data set has been previously analyzed by two other methods for single QTs.

Ishimori et al. (2004) used multiple regression analysis to identify main effects and interactions associated with HDL and atherosclerosis individually. They found that mouse HDL concentrations were affected by six loci: two loci were located on chromosome 1 (Hdlq14 and Hdlq15) and one locus each on chromosomes 8, 9 and 12; there was also an interaction found between a locus on chromosome 2 with the Hdlq15 chromosome 1 locus. The five loci (and their combinations) associated with atherosclerosis susceptibility in mice were (1) Ath17 on chromosome 10, (2) Ath18 on chromosome 12, an interaction between Ath18 and (3) Ath19 on chromosome 11, and an interaction between (4) Ath20 on chromosome 10 and (5) Ath21 on chromosome 12 (Ishimori et al., 2004). The same HDL and atherosclerosis phenotypes were also analyzed by Chanda et al. (2009) using the CHORUS information-theoretic algorithm.

Table 2 summarizes the KWII and permutation-based P-values for the 12 one-marker and 10 two-marker combinations with the highest KWII values for the HDL-Atherosclerosis phenotype vector. Table 2 also compares the findings from SYMPHONY to the analyses of the individual HDL and Atherosclerosis phenotypes with multiple regression and CHORUS. Among the single marker associations identified by SYMPHONY, D10Mit28 was found to be associated with both HDL and Atherosclerosis phenotypes individually by CHORUS. The D10Mit213 marker was found to be associated with Atherosclerosis by both CHORUS and regression, whereas D1Mit159 was found to be associated with HDL by both CHORUS and regression. None of the other nine markers identified by SYMPHONY overlapped with CHORUS or regression.

Table 2 Comparison of the one- and two-marker results from SYMPHONY for the HDL-Atherosclerosis vector phenotype.

There were five two-marker pairs that contained D3Mit320 and two pairs that contained D3Mit320. Interestingly, the D3Mit320 had been identified as being associated with Atherosclerosis by CHORUS and four of the remaining markers (D12Mit54, D12Mit172, D12Mit84 and D12Mit243) were associated with HDL levels. This suggests that interactions between loci on mouse chromosomes 3 and 12 are involved in the HDL-atherosclerosis correlations. Four markers on chromosome 12 (D12Mit170, D12Mit243, D12Mit84, and D12Mit91) were also involved in interactions with the DXMit159 marker on the X-chromosome.

Overall, these comparisons suggest that the findings from SYMPHONY overlap to a greater extent with CHORUS, which is to be expected given that they share the same information-theoretic framework. These findings further demonstrate that the vector phenotype approach in SYMPHONY complements the results from analysis of single traits in important ways. Some of the results from SYMPHONY overlap partially with regression and CHORUS methods but importantly, SYMPHONY provides novel results that can direct further investigation.

Performance on the hybrid 1000 genomes data set

The goal of these simulations was to demonstrate the utility of SYMPHONY for analysis of genetic data from a genome-wide study and to assess the dependence of the KWII on LD, MAF and admixture. The SNP genotypes from the 1000 Genomes data were used and the vector phenotypes were simulated based on the model described for Case Study 1.

Minor allele frequency

Figure 5a shows the dependence of the KWII on the MAF of SNP 2 with the MAF of SNP 1 kept constant at 0.5. The KWII values of the first-order {2, P} combination and the second-order {1, 2, P} combination increase monotonically with increasing SNP 2 MAF.

Figure 5 Results for hybrid simulations with the 1000 Genomes data set. Case Study 1 was used to simulate the vector phenotype. (a) The dependence of the KWII of the informative {1, P}, {2, P} and {1, 2, P} SNP combinations on the minor allele frequency (MAF) of SNP 2. (b) Dependence of the KWII on the linkage disequilibrium as assessed by the R² value. (c) Dependence of the KWII on admixture as assessed by varying the percentage of the FIN group in the population while keeping the percentages of the CHS and TSI groups equal. The Race variable corresponds to the label indicating group membership (CHS, TSI or FIN). The KWII for the {S2, P} combination was similar to that for the {S1, P} combination and is not shown. The error bars in each plot are standard errors from five independent simulations.

Linkage disequilibrium

Figure 5b summarizes the effect of LD (as measured by R²) between SNP 1 and SNP 2 on the KWII. The KWII for the {1, P} and {2, P} first-order combinations converge at the highest LD value because the genotypes become highly correlated with each other. The KWII value of the second-order combination {1, 2, P} decreases monotonically with increasing LD and becomes negative, indicating that information present in this combination is rendered redundant as a result of increased LD.
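
The sign behavior described above can be reproduced on toy data. The sketch below is for discrete variables only and uses the alternating-sum form of the KWII over subset entropies; it does not implement the multivariate normal phenotype model used in SYMPHONY, and the function names and toy genotypes are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(columns):
    """Joint Shannon entropy (in nats) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * math.log(c / n) for c in Counter(rows).values())

def kwii(columns):
    """KWII as the negated alternating sum of entropies over all non-empty subsets."""
    k = len(columns)
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(columns, size):
            total += sign * entropy(subset)
    return -total

snp1 = [0, 0, 1, 1]
snp2 = [0, 1, 0, 1]

# Pure interaction: the phenotype is the XOR of two independent SNPs.
synergy = kwii([snp1, snp2, [a ^ b for a, b in zip(snp1, snp2)]])

# Perfect LD: SNP 2 duplicates SNP 1 and the phenotype tracks SNP 1.
redundancy = kwii([snp1, list(snp1), list(snp1)])
```

In this toy setting, `synergy` is positive and `redundancy` is negative, mirroring the shift of the {1, 2, P} combination toward negative KWII values as LD increases.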

Population stratification/admixture

Figure 5c shows the effects of population stratification/admixture, which was assessed by increasing the percentage of the sample from the FIN group in the population. The KWII patterns for the {1, P} and {2, P} combinations were similar (data for {2, P} not shown) and decreased monotonically as the percentage of the FIN group increased. This is a consequence of the absence of SNP 1 and SNP 2 effects in the FIN group. The {Race, P} combination reached a maximal value at a FIN group percentage of 33%, when the relative proportions of all three groups (CHS, TSI and FIN) were equal. In the admixture simulation, we found high KWII values for the second-order {Race, 1, P} and {Race, 2, P} combinations. At FIN group percentages >33%, the KWII values decreased because of the absence of SNP 1 and SNP 2 effects in this group.

Discussion

In this report, we have presented results for a novel information-theoretic GGI and GEI analysis method for disease syndromes. The salient contributions of the paper include the development of an MQT vector framework for describing the phenotype of the syndrome and the demonstration that the versatile and general framework previously developed in AMBIENCE and CHORUS for the genetic analysis of discrete and QTs can be deployed for vector traits.

Our approach is based on the assumption that the MQT vector phenotype is adequately modeled by the multivariate normal distribution. Several statistical tests for assessing whether or not a given data set follows a multivariate normal distribution are available (Cox and Small, 1978; Friedman and Rafsky, 1979; Smith and Jain, 1988; Henze, 2002). In addition to its statistical underpinnings in the Central Limit theorem, the normal distribution N(μ, σ) has maximum entropy among all real-valued distributions with specified mean μ and standard deviation σ (Srivastava and Gupta, 2008). The findings from the univariate normal distribution also generalize to the multivariate normal distribution N(μ, Σ), which has maximum entropy among all real-valued distributions with specified mean vector μ and covariance matrix Σ. Therefore, a multivariate normal distribution is reasonable if only the mean and covariance matrix of the MQT vector phenotype are known. A question that is outside the scope of the current research is the nature and number of the dimensions that should comprise the syndrome vector. The choice of the dimensions should be based on clinical and biometric characteristics that provide sensitivity and specificity for diagnosing and measuring the progression of the syndrome pathology objectively in the clinical or study setting. However, there may be a lack of agreement on whether a particular syndrome is a distinct clinical entity; in such situations, there may not be clinical consensus on the best ways to diagnose and measure it.
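
For reference, the differential entropy that the maximum-entropy statement refers to has a closed form. The expression below is the standard textbook result for a k-dimensional multivariate normal distribution; the symbols k, σ1, σ2 and ρ are introduced here for illustration and are not notation taken from this section.

```latex
h\bigl(N(\mu, \Sigma)\bigr) = \tfrac{1}{2} \ln\!\bigl[(2\pi e)^{k}\, |\Sigma|\bigr]

% Bivariate case (k = 2) with standard deviations \sigma_1, \sigma_2 and correlation \rho:
|\Sigma| = \sigma_1^{2} \sigma_2^{2} (1 - \rho^{2}), \qquad
h = \ln(2\pi e) + \ln(\sigma_1 \sigma_2) + \tfrac{1}{2} \ln(1 - \rho^{2})
```

The last term is never positive, so for fixed variances a stronger correlation between the constituent traits lowers the joint entropy. The entropy depends on the phenotype only through Σ and not through μ.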

We have provided results that demonstrate proof of concept that (with modifications) the KWII method may be generalizable to certain situations wherein there are non-linear relationships between the QT constituents comprising the vector phenotype. However, the method relies on regression. More research is necessary to define the range and applicability of the approach and its power to detect the potentially diverse range of non-linearities that could occur.

A range of GEI analysis methods based on the multi-factor dimensionality reduction (MDR) method are available for identifying and analyzing GEI (Ritchie et al., 2001; Hahn et al., 2003; Moore et al., 2006). MDR reduces the dimensionality of the multi-locus genotype systematically by pooling into high and low risk groups (Ritchie et al., 2001; Hahn et al., 2003; Ritchie et al., 2003; Bush et al., 2006; Moore et al., 2006). The combinatorial partitioning method and the restricted partition method (Culverhouse et al., 2004) share similarities to MDR, but identify multilocus genotypes capable of predicting QT levels (Nelson et al., 2001). The generalized MDR method employs the generalized linear model framework for scoring in conjunction with MDR for dimensionality reduction. Generalized MDR enables covariate corrections and handles both discrete phenotypes and continuous traits in population-based study designs. The Pedigree Disequilibrium Test (Martin et al., 2000) approach has been used to extend MDR, which was initially limited in its capacity to include potentially informative family data beyond single matched pairs in each family, to family-based study designs (Martin et al., 2006). MDR and its variants including generalized MDR are not capable of GEI analysis of syndromes.

Some machine learning methods, including random forests and decision trees, employ information gain metrics as an aid to decision making. Random forest methodologies are now being applied to GWAS (Kim et al., 2009; Meng et al., 2009; Wu, 2011); however, the design and role of the information-theory metrics are different from those in SYMPHONY. Additionally, random forest and decision tree algorithms have generally been limited to discrete phenotypes. The Markov blanket method has been used to analyze interactions at the genome-wide level with case–control study designs (Han et al., 2010). The extensions of these methods for analyzing vector phenotypes have not yet been evaluated systematically.

The KWII and PAI metrics that provide the underpinning of SYMPHONY are robust to LD. The second term on the right-hand side of the PAI definition contains the inter-dependence among the genetic (for example, pairwise LD) and environmental variables (for example, common source of multiple pollutants). Unlike traditional LD measures, which are pairwise in character (and typically are computed for contiguous SNPs), the PAI and KWII can assess the joint dependence between variables generalized to multiple, non-adjacent SNPs. We have extensively investigated the robustness of the PAI over a range of LD values for discrete and QTs (Chanda et al., 2008, 2009). The robustness of the PAI to LD does not compromise its ability to identify informative interactions.
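
The role of the second term can be illustrated on discrete data. The sketch below assumes the PAI definition used in the earlier AMBIENCE work, PAI(X1, ..., Xk; P) = TCI(X1, ..., Xk, P) - TCI(X1, ..., Xk), where TCI is the total correlation information (the sum of marginal entropies minus the joint entropy). That definition is not restated in this section, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(columns):
    """Joint Shannon entropy (in nats) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * math.log(c / n) for c in Counter(rows).values())

def tci(columns):
    """Total correlation: sum of marginal entropies minus the joint entropy."""
    return sum(entropy([col]) for col in columns) - entropy(columns)

def pai(predictors, phenotype):
    """Assumed PAI form: TCI(predictors, phenotype) - TCI(predictors)."""
    predictors = list(predictors)
    return tci(predictors + [phenotype]) - tci(predictors)

snp = [0, 0, 1, 1, 2, 2, 0, 1]
phenotype = [0, 0, 1, 1, 1, 1, 0, 1]  # discretized toy phenotype

single = pai([snp], phenotype)
# Add a second SNP in perfect LD with the first (an exact copy).
with_duplicate = pai([snp, list(snp)], phenotype)
```

Under this assumed definition, the dependence between the two copies is subtracted out by the second term, so `with_duplicate` equals `single` and the duplicated SNP does not inflate the score.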

In previous work, we have shown that the KWII is also effective at identifying two-locus interaction models that do not contain main effects (Chanda et al., 2009; Sucheston et al., 2010). Because SYMPHONY utilizes a marginal effect search strategy based on the PAI, its power is compromised when main effects are completely absent. However, interactions that lack main effects entirely are relatively rare in real data because of the stringent symmetry required to enforce absence of main effects (Chanda et al., 2008). Interactions in real data generally contain traces of main effects that can be leveraged by the SYMPHONY search. SYMPHONY can also easily be modified so that the entire space of second-order combinations is searched to identify interactions which could overcome the loss of power (Chanda et al., 2008).

In this report, we have focused on syndromes comprising multivariate normally distributed QTs. However, many clinical syndrome vector phenotypes may comprise discrete or continuous non-normal variables. For example, subjects with autism may be described by the severity of the social and communication symptoms, the severity of fixated behaviors as well as associated features such as presence of genetic risk factors, epilepsy and intellectual disability. The framework developed in the paper can potentially be extended to syndrome phenotype vectors containing discrete and multivariate normally distributed variables in a straightforward manner since the phenotype can be represented as a mixture of multinormal distributions. Data transformations such as the logarithm and the cube root, which are well-established approaches to obtain normal distributions from non-normal data, could also be useful if an individual QT is not normally distributed. However, there are numerous challenges related to computations of the entropy expression for more complex multivariate distributions.
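
As a small illustration of the transformation step, the sketch below compares the sample skewness of a right-skewed trait before and after the logarithm and cube-root transforms. The trait is simulated from a log-normal distribution purely for illustration; the sample size, seed and function name are assumptions and are not taken from the study.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by the variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
# Simulated strictly positive, right-skewed quantitative trait.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]

logged = [math.log(v) for v in raw]
cube_rooted = [v ** (1.0 / 3.0) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
skew_cbrt = skewness(cube_rooted)
```

For this simulated trait the logarithm removes the skew almost entirely, while the cube root only reduces it. Which transform is appropriate for a real QT would need to be checked against the data.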

In conclusion, our results suggest that the availability of versatile methods based on information-theoretic metrics could enhance the GGI and GEI analysis for complex disease phenotypes.

Data archiving

There were no data to deposit.