Introduction

The introduction of genome-wide association (GWA) represents a revolutionary advance in the genetic investigation of complex diseases. Yet, despite the early promise of GWA studies in cardiovascular and other complex diseases, reported effect sizes of single-nucleotide polymorphisms (SNPs) (both individual and cumulative) explain disappointingly small proportions of the estimated trait heritability.1, 2, 3 Nonetheless, the information gained from GWA studies continues to offer insight regarding the genetic and molecular mechanisms of disease.

The most common approaches to GWA studies focus on the analysis of individual SNPs and their neighboring genes; only the strongest evidence of association for top-ranked SNPs is typically reported. This approach is hampered by the large number of variables (ie, genotypes) considered, the vast majority of which will not meet criteria for genome-wide significance, and fewer still will ultimately prove functionally important. Thus, even in studies of large cohorts, true signals remain difficult to identify.

To increase the yield from GWA studies and to ultimately explain a greater proportion of the trait heritability, analysis approaches should be adapted to capitalize on available complementary data that allow testing for association on the basis of functional units such as genes, gene sets, and pathways.4 These approaches would decrease the number of statistical tests while taking advantage of known biology. The pathway-based gene-expression analysis approach called gene-set enrichment analysis (GSEA) was adapted for use in GWA studies.5, 6, 7 This adaptation uses the maximum single-SNP test statistic from a gene to score the strength of association between the gene and the trait of interest. It then applies the GSEA procedure to test whether a certain gene set is significantly ‘enriched’ with high-scoring genes.

We recently presented a further extension of the GSEA method, variable set enrichment analysis (VSEA), which normalizes the maximum SNP statistics based on permutation results so that signals are comparable for genes that have different numbers of SNPs and different linkage disequilibrium structures.8 In this report, we hypothesize that by applying VSEA to GWA data from the Framingham Heart Study, we will identify gene sets associated with coronary heart disease (CHD) that would otherwise be missed by conventional single-SNP analyses.

Materials and methods

Framingham Heart Study genome-wide data

The Framingham Heart Study is a large-scale population-based cardiovascular study based in Framingham, Massachusetts, USA, which started in 1948 and currently consists of three generations of cohorts.9 To find the pathways related to CHD, we used the genome-wide data (Affymetrix 5.0 GeneChip array with 500K SNPs, Santa Clara, CA, USA) from Caucasian individuals representing the three cohorts from the Framingham SHARe project (SNP Health Association Resource), downloaded from the National Center for Biotechnology Information database of Genotypes and Phenotypes website (http://www.ncbi.nlm.nih.gov/gap). The analysis was performed on the Framingham Cohort data, version 4 (embargo release date 4 December 2009). The primary phenotype, prevalent CHD, was defined by the Framingham Heart Study as a composite of recognized and unrecognized myocardial infarction, coronary insufficiency, and CHD death.10

Quality control

The original dataset included 6476 Caucasian subjects (2959 men and 3517 women) and 498 014 SNPs. Mendelian genotype errors were checked and those with errors were set to missing. The quality of the subjects’ dataset was then verified: subjects with missing rate >5% and either low (≤25%) or high (≥30%) mean heterozygosity were removed from the dataset, resulting in 6438 individuals. Next, the qualities of individual SNPs were checked: monomorphic SNPs, SNPs with missing rate >5% or a missing rate >1% combined with a minor allele frequency <0.05, those with a Hardy–Weinberg equilibrium test P-value <10⁻⁶, and those without an annotated geneID were removed, resulting in 404 467 SNPs for the association analysis. Finally, population substructure was examined by multidimensional scaling using information from the HapMap samples of European (CEU), East Asian (CHB and JPT), and African (YRI) origins. A total of 17 subjects were removed because of poor clustering with CEU subjects. The final analyzable dataset included 6421 subjects (2935 males and 3486 females).
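The SNP-level filters listed above can be summarized in a short sketch. This is only an illustration of the stated thresholds, not the code used in the study; the `SnpSummary` container, its field names, and the example records are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnpSummary:
    """Per-SNP summary statistics (hypothetical container, not from the study)."""
    snp_id: str
    missing_rate: float      # fraction of subjects without a genotype call
    maf: float               # minor allele frequency
    hwe_p: float             # Hardy-Weinberg equilibrium test P-value
    gene_id: Optional[str]   # annotated gene ID, or None if unannotated


def passes_snp_qc(snp: SnpSummary) -> bool:
    """Return True if the SNP survives the filters described in the text."""
    if snp.maf == 0.0:                               # monomorphic SNP
        return False
    if snp.missing_rate > 0.05:                      # missing rate >5%
        return False
    if snp.missing_rate > 0.01 and snp.maf < 0.05:   # >1% missing with MAF <0.05
        return False
    if snp.hwe_p < 1e-6:                             # HWE P-value <10^-6
        return False
    if snp.gene_id is None:                          # no annotated geneID
        return False
    return True


# Made-up example records, one per filter outcome.
snps = [
    SnpSummary("snp_a", 0.00, 0.20, 0.50, "GENE1"),  # passes every filter
    SnpSummary("snp_b", 0.02, 0.03, 0.50, "GENE1"),  # >1% missing with low MAF
    SnpSummary("snp_c", 0.00, 0.20, 1e-8, "GENE2"),  # fails the HWE test
    SnpSummary("snp_d", 0.00, 0.20, 0.50, None),     # unannotated
]
kept = [s.snp_id for s in snps if passes_snp_qc(s)]
```

In this sketch the filters are applied independently, so a SNP is dropped as soon as any single criterion fails.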

VSEA

VSEA is a novel GWA analysis method that tests for the aggregated effect of many genes linked by biological functions or statistical gene–gene interaction.8 It is based on GSEA, a method originally developed for differential gene expression analysis. GSEA derives an enrichment score to detect gene sets significantly enriched with differentially expressed genes.5 To facilitate analysis of SNP data in GWA studies, VSEA employs a permutation-based normalized gene score to aggregate the effects of multiple individual SNPs in each gene of a gene set. Significance was assessed by permutation: disease status was randomly shuffled 1000 times, and the enrichment scores were recalculated for each shuffled dataset. The P-values were then calculated as the frequency with which the enrichment score from a shuffled dataset exceeded the observed enrichment score.8
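As a rough illustration of the enrichment and permutation steps, the sketch below computes a simplified, unweighted Kolmogorov–Smirnov-style enrichment score over a ranked gene list and an empirical P-value from permuted scores. It is not the VSEA implementation: the function names are hypothetical, the gene-score normalization is omitted, and counting ties as exceedances is an assumption made here.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str], gene_set: Set[str]) -> float:
    """Maximum positive deviation of an unweighted running-sum statistic.

    ranked_genes is ordered from strongest to weakest association.
    """
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / hits if gene in gene_set else -1.0 / misses
        best = max(best, running)
    return best


def permutation_p_value(observed: float, permuted: Sequence[float]) -> float:
    """Fraction of permuted scores at least as large as the observed score.

    Counting ties (>=) is an assumption of this sketch.
    """
    if not permuted:
        raise ValueError("at least one permuted score is required")
    return sum(1 for s in permuted if s >= observed) / len(permuted)
```

In practice, each permuted score would come from repeating the whole scoring procedure on a dataset with shuffled disease status, so that the correlation structure among SNPs is preserved.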

For the VSEA analyses described in this paper, we used a library of gene sets compiled from the collections of the genetic pathways, molecular functions, and/or biological processes in the Kyoto Encyclopedia of Genes and Genomes (www.genome.ad.jp/kegg/pathway.html), BioCarta (http://www.biocarta.com/Default.aspx), and Gene Ontology (http://geneontology.org/) databases.8 Based on the manufacturer’s annotation of the Affymetrix 500K GeneChip array, we refined the gene-set library by removing genes that had no SNP included on the genotyping array. To reduce the impact of multiple testing and to avoid testing overly narrow or broad functional categories, our analysis only considered gene sets and pathways that contained at least 3 and at most 200 genes. The final panel included 1395 gene sets representing 207 120 SNPs, which were attributed to 8161 genes.
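The refinement of the gene-set library can be sketched as follows. This is a minimal illustration of the two stated rules (drop genes with no SNP on the array, then keep sets of 3 to 200 genes); the function name, argument names, and toy data are hypothetical.

```python
from typing import Dict, Iterable, Set


def refine_gene_sets(
    gene_sets: Dict[str, Iterable[str]],
    genes_with_snps: Set[str],
    min_size: int = 3,
    max_size: int = 200,
) -> Dict[str, Set[str]]:
    """Drop genes without array SNPs, then keep sets within the size bounds."""
    refined = {}
    for name, genes in gene_sets.items():
        kept = set(genes) & genes_with_snps       # genes covered by the array
        if min_size <= len(kept) <= max_size:     # neither too narrow nor too broad
            refined[name] = kept
    return refined


# Toy library: "set_small" falls below three genes once G4 is removed.
library = {
    "set_ok": ["G1", "G2", "G3", "G4"],
    "set_small": ["G1", "G2", "G4"],
}
refined = refine_gene_sets(library, genes_with_snps={"G1", "G2", "G3"})
```

Applying the size bounds after removing uncovered genes, as assumed here, means a set is judged on the genes that can actually contribute to the test.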

Single-SNP and pairwise SNP–SNP interactions

The aggregated effect of multiple genes in a gene set may reflect the sum of individual gene effects, interactions among two or more genes, or both. To allow comparison of VSEA with conventional analytical approaches, genome-wide single-SNP association with prevalent CHD was determined using the allelic χ² test in PLINK.11 Familial relationships in the sample were ignored based on a previous study that found very similar association test P-values in the Framingham GWAS data, whether familial relationships were considered or omitted.12 This enabled us to perform the large number of permutation tests in a practical time. An earlier simulation study evaluating the effect of such practice (ignoring familial relationships) in association analysis found that the effect-size estimates and power are not significantly affected, although Type I error rates increase as the disease heritability increases.13
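For readers unfamiliar with the allelic test, the sketch below computes a 1-df χ² statistic from a 2×2 table of allele counts in cases and controls. It is a textbook formulation written for illustration, not PLINK's code; the function names and the genotype counts in the test data are made up.

```python
import math
from typing import Tuple

Genotypes = Tuple[int, int, int]  # counts of (AA, Aa, aa) genotypes


def allele_counts(genotypes: Genotypes) -> Tuple[int, int]:
    """Convert genotype counts into counts of the A and a alleles."""
    n_hom_a, n_het, n_hom_b = genotypes
    return 2 * n_hom_a + n_het, 2 * n_hom_b + n_het


def allelic_chi2(cases: Genotypes, controls: Genotypes) -> Tuple[float, float]:
    """Return (chi-square statistic, P-value) for the 2x2 allele-count table."""
    a, b = allele_counts(cases)      # A and a alleles in cases
    c, d = allele_counts(controls)   # A and a alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the allele table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Because each subject contributes two alleles, this test treats alleles as independent observations, which is one reason the handling of familial relationships discussed above matters.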

The contribution of pairwise SNP–SNP interactions to the aggregated effect detected by VSEA was assessed by analyzing pairwise interactions between SNPs from genes in the highest-ranked gene sets after the VSEA test was performed. Pairwise SNP–SNP interactions were detected as a significant difference between the genotype correlation in cases and that in controls, using Fisher’s Z transformation.14, 15, 16, 17 A significant difference in correlation between cases and controls reflects altered pairing preferences of alleles at the two loci and may be the result of some underlying molecular mechanism that is active in CHD. To explore such underlying mechanisms, we performed interaction analysis of pairs of SNPs in top-ranked gene sets and organized all the significant pairwise interactions detected in a gene set into clusters (networks) of genes linked by SNP interactions.
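The comparison of two correlations via Fisher's Z transformation can be sketched as below. This is the standard two-sample form of the test, given here as an assumption about how such a comparison is typically done rather than as the study's exact procedure; the function name is hypothetical and the correlation values in the test data are invented.

```python
import math
from typing import Tuple


def fisher_z_difference(
    r_cases: float, n_cases: int, r_controls: float, n_controls: int
) -> Tuple[float, float]:
    """Test whether two genotype correlations differ.

    Returns (z statistic, two-sided P-value). Each correlation is
    transformed with atanh, whose sampling variance is about 1/(n - 3).
    """
    if n_cases <= 3 or n_controls <= 3:
        raise ValueError("each group needs more than three subjects")
    z_cases = math.atanh(r_cases)
    z_controls = math.atanh(r_controls)
    se = math.sqrt(1.0 / (n_cases - 3) + 1.0 / (n_controls - 3))
    z = (z_cases - z_controls) / se
    # Two-sided P-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

With a small case group and a large control group, the standard error is dominated by the cases, so only fairly large differences in correlation reach a stringent threshold.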

Multiple testing

There is no generally accepted method of adjustment for testing the large number of distinct pathways considered in this study. The VSEA procedure corrects for multiple testing due to genes shared by the different pathways/gene sets. However, adjustment of P-values for testing many distinct gene sets may lead to overly conservative results, especially when using gene sets derived from general-purpose databases (as was done in this study), because gene sets often contain many genes that are irrelevant to the disease trait of interest. Therefore, we used the unadjusted nominal P-values for the VSEA analyses. To prevent potential false positives in our validation analysis of pairwise SNP–SNP interactions among the genes in top-ranked gene sets, we imposed the stringent Bonferroni correction for multiple testing.

Results

After data quality control, the analysis sample consisted of 6421 subjects (2935 males and 3486 females) with 404 467 SNPs, of which 326 750 SNPs were mapped to 15 474 genes. After removing gene sets that were too small or too large, the final panel included 1395 gene sets, representing 207 120 SNPs, which were attributed to 8161 genes. Sample characteristics of several known risk factors of CHD are shown in Table 1. CHD events were identified in 221 individuals. Compared with subjects without CHD events (non-CHD controls), CHD cases were more likely to have higher systolic and diastolic blood pressure, total cholesterol and triglyceride, and lower high-density lipoprotein cholesterol.

Table 1 Population characteristics of the Framingham Heart Study population used in the current study

Among the 1395 gene sets tested, we identified 25 sets with a permutation P-value <0.01 (Table 2; top 100 gene sets available in Supplementary Table S1). Among the 25 gene sets, four (shown in bold) have been previously implicated in CHD by their participation in lipid metabolism and vasculogenesis: fatty-acid biosynthetic process (GO:0006633), fatty-acid metabolic process (GO:0006631), glycerolipid metabolic process (GO:0046486), and vasculogenesis (GO:0001570). The identification of these gene sets is supported by the existing body of literature linking these biological processes with atherosclerosis. Among the 170 genes represented by these four gene sets (Supplementary Table S2), only three contained any SNP ranked among the top 100 in the single-SNP scan (Supplementary Table S3).

Table 2 Top 25 gene sets from the pathway-based VSEA test

The pathways shown underlined in Table 2 are examples of novel gene sets. Although these gene sets are less known for their association with CHD, a pathophysiological role in cardiovascular diseases is plausible. For example, many of the genes in the Rac 1 cell-motility signaling pathway (h_rac1 Pathway) are myosin-/actin-associated genes that have been shown to have roles in left ventricular hypertrophy and hypertrophic cardiomyopathy (RAC1, MYL2, TRIO, and PPP1R12B). Other genes in this pathway have been shown to modulate cardiovascular risk traits including insulin sensitivity, glucose tolerance, and obesity (PIK3CB and RPS6KB1). Similarly, genes involved in the sulfur amino-acid metabolic process (GO:0000096) are related to cardiovascular diseases through roles in oxidative stress (GCLC, GCLM, and MSRA) and/or metabolism of homocysteine (GCLC, BHMT, MTHFR, MTR, MTRR, and CBS), a well-known risk factor for CHD. There are also a few genes related to oxidative stress (CDO1, ADI1, and SUOX), although their roles in cardiovascular disease are not well studied.

Performance of genes in single-SNP analysis

Genome-wide single-SNP association was determined to allow comparison of VSEA with conventional analytical approaches. As an example of the effectiveness of VSEA in identifying gene sets potentially relevant to CHD, the best-ranked single SNPs by the χ² test from genes in the vasculogenesis pathway are shown in Table 3. Except for QKI, HEY2, and WARS2, the genes in this pathway are not ranked highly; these genes would therefore likely be excluded from follow-up studies if selection were based solely on the significance of the single-SNP test. However, when the VSEA test considered the 17 genes as a unit, their small marginal effects were combined, thus allowing this gene set to be identified as one significantly ‘enriched’ for genetic association with CHD. Upon further examination, other top-ranked gene sets showed similar patterns of predominantly weak single-SNP rankings.

Table 3 Ranks of SNPs in the vasculogenesis pathway (GO:0001570)

Pairwise interactions among genes from enriched gene sets

Using the VSEA test, 1005 distinct genes were identified from among the top 25 gene sets. These genes contained a total of 15 960 SNPs after removing those in high LD (r²>0.8), which resulted in 119 209 805 pairwise interaction tests. Using a stringent Bonferroni-adjusted significance level (P<4.2 × 10⁻¹⁰), 439 of the 1005 genes were linked by cross-gene SNP–SNP interactions. When these cross-gene SNP–SNP interactions were superimposed over the top-ranked gene sets, we obtained clusters (subnetworks) of genes within these pathways that reflected concerted action of multiple genes that differentiated the CHD group from the non-CHD group (see Figure 1 for interaction subnetworks from the Rac 1 cell-motility signaling and sulfur amino-acid metabolic process pathways).
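The Bonferroni-adjusted level quoted above can be checked with a one-line calculation. The family-wise error rate of 0.05 is an assumption here, since the text reports only the resulting threshold; the number of tests is taken from the text.

```python
n_tests = 119_209_805   # pairwise interaction tests reported in the text
alpha = 0.05            # assumed family-wise error rate (not stated explicitly)

# Bonferroni correction: divide the family-wise level by the number of tests.
threshold = alpha / n_tests
print(f"Bonferroni threshold: {threshold:.1e}")  # prints "Bonferroni threshold: 4.2e-10"
```

Under that assumption the result agrees with the reported significance level to two significant figures.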

Figure 1

Subnetworks of genes from the Rac 1 cell-motility signaling (a) and sulfur amino-acid metabolic process (b) pathways that differentiated the CHD from the non-CHD group. Subnetworks are composed of pairwise SNP–SNP interactions where the number on each edge (line) represents the number of interactions between each pair of genes.

Genes participating in pairwise interactions were then ranked by the number of other gene interaction partners. Genes interacting with ≥30 partners are listed in Table 4. Once again, VSEA has identified many genes with important roles in cardiovascular diseases including two genes (CDH13 and PARD3), which have been recently associated with CHD risk traits.
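Ranking genes by their number of interaction partners amounts to computing node degree in the gene-level interaction network. The sketch below shows one way to do this from a list of significant cross-gene pairs; the function name and the gene labels are hypothetical placeholders, not results from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rank_by_partners(gene_pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank genes by the number of distinct genes they interact with.

    gene_pairs holds one (gene, gene) entry per significant SNP-SNP
    interaction, so the same gene pair may appear many times.
    """
    partners: Dict[str, Set[str]] = defaultdict(set)
    for gene_1, gene_2 in gene_pairs:
        if gene_1 == gene_2:          # within-gene interactions are not cross-gene
            continue
        partners[gene_1].add(gene_2)
        partners[gene_2].add(gene_1)
    # Sort by descending partner count, then by name for a stable order.
    return sorted(
        ((gene, len(p)) for gene, p in partners.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Placeholder pairs: repeated and self pairs do not inflate the counts.
pairs = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_B", "GENE_A"),
    ("GENE_C", "GENE_C"),
]
ranking = rank_by_partners(pairs)
```

Counting distinct partner genes rather than SNP-level interactions keeps genes with many SNPs from dominating the ranking simply because of their size.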

Table 4 Genes from enriched gene sets with highest number of interaction partners

Discussion

The purpose of this study was to apply a novel method, VSEA, which capitalizes on existing biological data to gain new insight about CHD genetics by testing for association on the basis of functional units such as gene sets and pathways beyond individual SNPs. We identified gene sets enriched with genes that have been previously associated with CHD. We also discovered gene sets with emerging evidence supporting roles in a variety of cardiovascular diseases and related illnesses. Importantly, many CHD genes ranked poorly in single-SNP tests, whereas their member groups were successfully picked up by analyzing pathway-based gene sets. Thus, VSEA identified gene sets that would have been otherwise missed by conventional single-SNP analyses.

There is ample evidence to support the biological plausibility of association with CHD among the identified enriched gene sets. Among the 25 sets with a permutation P-value <0.01, some pathways have been previously linked with CHD and/or CHD risk factors. For example, among genes from the vasculogenesis pathway, published reports have shown that SNPs in VEGFA modulate atherosclerosis severity and the prevalence of myocardial infarction.18, 19 Likewise, WARS2 was recently identified in a meta-analysis of GWA studies for adiposity, a CHD risk factor.20 SNPs in genes from the fatty-acid biosynthetic and metabolic process pathways, particularly those participating in the synthesis of prostaglandins (eg, ALOX5, ALOX5AP, ALOX12, ALOX15, PTGS1, PTGS2, and COX2), have also been identified as risk factors for atherosclerotic plaque burden and CHD events.21, 22, 23, 24, 25, 26 SNPs in other genes from these pathways modulate CHD risk, presumably through their effects on lipids.27, 28, 29, 30, 31 Thus, inclusion of these genes among enriched gene sets is supported by existing scientific literature.

Among genes in sets with less well-characterized associations with CHD, a few have recently been linked with CHD and/or CHD risk factors. For example, a functional promoter polymorphism in GCLC, a member of the sulfur amino-acid metabolic process pathway, has been associated with endothelium-dependent dilation of coronary arteries and myocardial infarction.32 GCLM and MSRA, scavengers of reactive oxygen species, were recently found to protect the myocardium from ischemia-reperfusion injury, a critical determinant of survival following myocardial infarction.33, 34, 35 Several other genes in this pathway (eg, BHMT, MTHFR, MTR, MTRR, and CBS) regulate the metabolism of homocysteine, a risk factor for CHD.36, 37, 38, 39, 40, 41 Many genes in this gene set (GCLC, GCLM, MTHFR, MTHFD1, MTR, and MTRR) are also related to methylation processes, a potential, but understudied, mechanism for CVD.42, 43

Several genes in the Rac 1 cell-motility signaling pathway have also been recently implicated in CHD. For example, Rac1, a subunit of the Nox2 NADPH oxidase enzyme, which is responsible for generating damaging reactive oxygen species in the heart, has been mechanistically linked with ischemia-reperfusion injury, adverse remodeling of the left ventricle, and survival in transgenic mice following myocardial infarction.44, 45, 46 Genes from the Rac 1 pathway that encode actin- and myosin-associated proteins, including RAC1 and MYL2, have also been associated with left ventricular mass, an intermediate trait that is a known risk factor for cardiovascular morbidity and mortality.47, 48 Also notable is that among the genes with the greatest number of cross-gene interactions, two of the top-ranked genes (CDH13 and PARD3) have been recently associated with CHD risk traits, including left ventricular hypertrophy, dyslipidemia, metabolic syndrome, type 2 diabetes, and adiponectin levels.7, 49, 50, 51

We note that similar gene-set enrichment approaches were used by others to evaluate particular pathways52 or to prioritize candidate genes.53 However, there is an inherent difficulty in defining the potential relevance of any pathway to a specific disease process. Incorporating more specific types of biological functions, such as protein–protein interactions as done by Jensen et al,54 should improve the functional relevance of detected gene sets. In the absence of well-informed disease-based pathway databases, it is difficult to give an unbiased assessment of the validity of the results. Although for some gene sets the relatedness to the disease trait may be more certain, for the majority this is less clear or unknown. The four gene sets identified as relevant to CHD were highlighted based on existing literature. Ultimately, functional studies are necessary to confirm the biological relevance of genetic variation in these pathways to CHD.

In summary, the present study shows the use of VSEA as a robust novel extension to existing analysis methods for GWA data. This study confirmed the interplay of multiple loci among the genes in the pathways responsible for CHD. More importantly, it also showed that analysis methods that capitalize on existing knowledge and directly test for gene–gene interactions can allow an improved understanding of the genetic variants and the pathways responsible for CHD.