Main

Genome-wide association studies are now abundant, with hundreds of newly identified single loci shown with a high degree of probability to influence a variety of traits and diseases.1 However, for almost all tested traits, typically only 1–2 genes survive correction for multiple testing in a genome-wide context, leaving open the question of whether additional risk genes exist. An emerging approach to understanding these studies in a larger biological context is to examine the upper tail of the distribution of the most significant genes for enrichment of particular functional classes. Because of the depth of its annotation of the genome, the preferred tool for this is the gene ontology (GO).2 This is a relatively new approach, stemming from the application of GO-based analyses to gene expression data,3 but despite its promise only a handful of replicated cases of pathway enrichment have emerged.4, 5 A critical step in enabling this strategy is converting, with high fidelity, the single-nucleotide polymorphism (SNP) lists from genome-wide platforms into the list of genes they represent. Toward this end, we developed a software program in Perl that takes genome-wide SNP results as input (primarily from PLINK6), accounts for linkage disequilibrium (LD) across regions of significance and corrects for the inflation of significance due to gene length.5 In brief, our software automates the conversion of genome-wide SNP lists to gene lists, beginning with the retrieval of LD structure from analogous populations with denser genotyping data (that is, HapMap). When a group of markers is in high LD in HapMap (we use an r²>0.8 threshold), the markers are assigned to a ‘proxy cluster’ that is treated as a single signal.
Subsequently, each marker in the original SNP list with statistically significant evidence of association with a phenotype is evaluated to see (a) whether it belongs to any proxy cluster and (b) whether the marker itself or any marker in its cluster is located in a genic region. Any marker or cluster that overlaps a region extending across a gene is assigned to that gene as a signal indicating its possible association. To correct for the multiple testing that arises when several signals fall within one gene, the P-value for each gene is taken as the lowest P-value among its assigned signals multiplied by the number of signals. An illustration of the algorithm can be found in our earlier paper.5 Here, we have applied this program to a genome-wide association study in a French Alzheimer disease (AD) case–control sample.
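The gene-level correction described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Perl implementation: the function name, the input layout (gene, signal identifier, P-value) and the cap at 1.0 are assumptions introduced for the example.

```python
from collections import defaultdict

def gene_level_p_values(signals):
    """Collapse SNP-level results into one adjusted P-value per gene.

    `signals` is an iterable of (gene, signal_id, p_value) tuples, where
    signal_id names either a lone SNP or the proxy cluster the SNP belongs to.
    (Hypothetical input layout, chosen for illustration.)
    """
    # Keep the lowest P-value seen for each (gene, signal) pair, so that
    # several SNPs in one proxy cluster count as a single signal.
    best = {}
    for gene, signal_id, p in signals:
        key = (gene, signal_id)
        best[key] = min(p, best.get(key, 1.0))

    per_gene = defaultdict(list)
    for (gene, _signal_id), p in best.items():
        per_gene[gene].append(p)

    # Adjusted P = lowest signal P-value x number of signals in the gene.
    # Capping at 1.0 is an assumption; the text does not mention a cap.
    return {gene: min(1.0, min(ps) * len(ps)) for gene, ps in per_gene.items()}


example = [
    ("GENE_A", "cluster1", 0.004),  # two SNPs in the same proxy cluster
    ("GENE_A", "cluster1", 0.010),
    ("GENE_A", "snp_x", 0.020),     # an independent signal in the same gene
    ("GENE_B", "snp_y", 0.030),
]
adjusted = gene_level_p_values(example)
```

In this toy input GENE_A carries two signals, so its lowest P-value is multiplied by 2, whereas GENE_B carries one and is left unchanged.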

The genome-wide association study included 2032 AD cases and 5328 controls of French ancestry and was conducted on the Illumina 610 platform (Illumina, San Diego, CA, USA).7 Appropriate institutional review board permission was obtained for this study (see Lambert et al.7 for details). A total of 511 978 SNPs passed quality control (genotypes with call rates <98%, a minor allele frequency of 1% or less, or significant deviation from Hardy–Weinberg equilibrium at P<10⁻⁶ were excluded) and were parsed and converted into a list of 16 503 genes using our algorithm. We note that the most significant signal (P=2.3 × 10⁻¹³⁰) overlapped the TOMM40 gene, near APOE. Also notable is that, apart from those around the APOE locus, no genes in this set show genome-wide significance. The resultant list of genes, the most significant marker assigned to each gene, the number of genetic markers used for gene-based correction and a list of genes indiscernible due to LD are provided in Supplementary Table 1. For enrichment analysis we used our software together with the public domain tools provided by both the DAVID bioinformatics platform8 and Genecodis.9 After adjustment for gene length, 1351 genes were assigned a P-value of 0.05 or less, and these were tested for enrichment against the study base set of 16 503 genes. Importantly, testing the top genes against a default full-genome base set gives, as anticipated, a highly significant (and incorrect) enrichment of multiple high-level GO categories, emphasizing the importance of using as the base set the genes that are actually represented on the platform in use (here, the Illumina 610).
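As a compact restatement of the quality-control thresholds given above, a minimal Python sketch of the SNP filter follows. The record type and field names are hypothetical, and the study's actual filtering was presumably done with dedicated tools rather than code like this.

```python
from dataclasses import dataclass

@dataclass
class SnpQc:
    """Per-SNP summary statistics (hypothetical record layout)."""
    snp_id: str
    call_rate: float  # fraction of samples successfully genotyped
    maf: float        # minor allele frequency
    hwe_p: float      # Hardy-Weinberg equilibrium test P-value

def passes_qc(snp, min_call_rate=0.98, maf_floor=0.01, hwe_p_min=1e-6):
    """Return True if the SNP survives the three exclusion rules in the text:
    call rate < 98%, minor allele frequency of 1% or less, HWE P < 1e-6."""
    return (
        snp.call_rate >= min_call_rate
        and snp.maf > maf_floor
        and snp.hwe_p >= hwe_p_min
    )

snps = [
    SnpQc("snp_ok", 0.995, 0.12, 0.40),
    SnpQc("snp_low_call", 0.970, 0.12, 0.40),
    SnpQc("snp_rare", 0.995, 0.01, 0.40),
    SnpQc("snp_hwe", 0.995, 0.12, 1e-7),
]
kept = [s.snp_id for s in snps if passes_qc(s)]
```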

In this genome-wide data set, we observed a highly significant enrichment of genes annotated as being involved in the biological process of intracellular transmembrane protein transport (GO:0065002, P=7.2 × 10⁻⁶ based on a hypergeometric test, P<0.001 based on 1000 permutations). Genecodis and DAVID provided equivalent results (the P-value for this pathway with DAVID was slightly lower, at 5.2 × 10⁻⁶). Eighteen genes contributed to this significance; these genes, together with the best genetic marker for each and its associated P-value, are shown in Table 1. Both DAVID and Genecodis use a hypergeometric test for significance estimation. Taking the Genecodis example, significance was derived from 18 of the 1331 genes in the enriched list being associated with the protein transport term, versus 69 of the total of 16 283 annotated genes. We note that the genes contributing to the signal for protein transport are dispersed widely in terms of individual significance across the top 1351 genes, emphasizing the possible existence of true association signals beyond the first few most significant genes. A common problem with analyses of this nature is the false appearance of enrichment due to chromosomal clustering of functionally related genes.5 For this particular analysis, all genes contributing to enrichment were located in distinct genomic loci (also shown in Table 1), with the closest genes being several megabases apart. However, a few ontology categories could be dismissed because of positional clustering, the most prominent being ‘cytokine activity’, owing to an enrichment of interferon genes that are located in tight genomic proximity (not shown).
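The hypergeometric test mentioned above can be written out directly from the counts given in the text (18 of 1331 listed genes versus 69 of 16 283 annotated genes). The sketch below is a plain upper-tail calculation in Python using exact integer arithmetic; it is not the DAVID or Genecodis code, and those tools may apply additional corrections, so the value it returns need not match the reported figures exactly.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from a base set
    of N genes, K of which carry the annotation of interest."""
    total = comb(N, n)
    upper = min(n, K)
    # Sum the exact probability mass for every outcome at least as extreme as k.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Counts taken from the text: 18 annotated genes among the 1331 in the
# enriched list, 69 annotated genes among the 16 283 in the base set.
p_enrichment = hypergeom_upper_tail(k=18, n=1331, K=69, N=16283)
```

Replacing N and K with counts for the whole genome rather than the genes represented on the platform changes the null expectation, which is why the choice of base set matters for this test.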

Table 1 Enriched genes in Alzheimer disease involved in intracellular transmembrane protein transport

To understand in more detail the relationships among the genes contributing to the protein transport signal, we used FunCoup,10 which enables connections to be visualized based on genomics and experimental data, such as protein–protein interaction and gene expression correlations. We were particularly interested in how the identified protein transport pathway might be related to the APOE locus, which contains four genes that cannot be readily discerned due to LD (APOE, TOMM40, PVRL2 and BCL3). We therefore tested these 4 genes in turn for network connectivity to the 18 genes identified by enrichment analysis. To evaluate statistical significance we developed a custom algorithm based on a previously described randomization strategy.11 In each randomization the network was re-wired in such a way that the number of links for each node was preserved while its network neighbors were shuffled. The real (that is, FunCoup-predicted) network was randomized 100 times. In FunCoup, each link is characterized by a confidence value termed the final Bayesian score, which is the sum of the individual log likelihood ratios of the integrated data sets (51 sets from 7 eukaryotes) that inform on functional coupling. For the analysis, we selected network edges with final Bayesian score 4.8 (natural logarithm), which defined a network of 14 899 genes connected by 709 343 links. After every randomization, the connections between a gene of interest and a gene group were counted, and these counts were used to calculate the mean and s.d. Together with the respective number of links in the real network, these values produced one-sided Z-scores that estimated significance. In this analysis, only TOMM40 was strongly connected to the set of 18 protein transport genes, with 4 direct links and 792 indirect links, that is, links through a third gene (P<10⁻⁴ and P<10⁻⁷, respectively). BCL3 had a single, much weaker link (based on subcellular colocalization) to NUP88, but this was not significant.
Figure 1 shows a clear division into two groupings, one containing members of the nucleoporin gene family and the other consisting of mitochondrial genes. These two groups are connected, albeit weakly, by interactions between NUP98, TIMM44 and TIMM17A. Most of the original 18 genes are represented in the final network (Figure 1); 3 genes (Magmas, TNKS and C18orf55) have no significant connections with the remaining 15. Notably, Magmas and C18orf55 are mitochondrial genes, whereas TNKS is a nuclear pore protein.
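The connectivity test described above (degree-preserving re-wiring, 100 randomizations, and a one-sided Z-score from the mean and s.d. of the randomized link counts) can be sketched as follows. This is a minimal illustration using double edge swaps, a common way to preserve node degree; it is not the authors' custom algorithm, and the function names, the number of swaps per randomization and the handling of a zero s.d. are all assumptions. Only direct links are counted here.

```python
import random
from statistics import mean, stdev

def rewire(edges, n_swaps, rng):
    """Return a re-wired copy of an undirected edge list in which every node
    keeps its number of links but its neighbors are shuffled."""
    edges = [tuple(sorted(e)) for e in edges]
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # choose one of the two possible pairings
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Skip swaps that would create a self-loop or a duplicate link.
        if a == d or c == b or new1 == new2 or new1 in present or new2 in present:
            continue
        present.discard(edges[i])
        present.discard(edges[j])
        present.update((new1, new2))
        edges[i], edges[j] = new1, new2
    return edges

def direct_links(edges, gene, group):
    """Count links between one gene and any member of a gene group."""
    group = set(group)
    return sum(
        1 for u, v in edges
        if (u == gene and v in group) or (v == gene and u in group)
    )

def connectivity_z(edges, gene, group, n_random=100, swaps_per_edge=10, seed=0):
    """One-sided Z-score of the observed link count against re-wired networks.
    swaps_per_edge is an assumed setting, not taken from the text."""
    rng = random.Random(seed)
    observed = direct_links(edges, gene, group)
    null = [
        direct_links(rewire(edges, swaps_per_edge * len(edges), rng), gene, group)
        for _ in range(n_random)
    ]
    sd = stdev(null)
    return (observed - mean(null)) / sd if sd > 0 else float("nan")
```

Because each swap replaces links (a, b) and (c, d) with (a, d) and (c, b), every node appears in exactly as many links after the swap as before, which is the property the randomization relies on.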

Figure 1

A network connectivity map of genes identified in the intracellular transmembrane protein transport pathway from a GWAS of Alzheimer disease. A total of 18 genes were identified, consisting of both mitochondrial and nuclear pore genes. TOMM40 is included as it connects tightly to TIMM17A and TIMM44, illustrating its importance in this pathway. No other genes from the BCL3-PVRL2-TOMM40-APOE LD block are significantly associated with this network. The lines (edges) between genes denote the strength and origin of connectivity and are derived primarily from mRNA coexpression, protein–protein interaction, sub-cellular colocalization and phylogenetic similarity (thicker=stronger).

We used gene expression data to explore the relationships of TOMM40 and APOE to the known base set of genes that have been confirmed to lead to AD (PSEN1, PSEN2 and APP). For this, we used a human brain data set with gene expression level estimates for 14 077 transcripts in 193 individuals.12 In testing TOMM40 and APOE against the base set, we observed a very strong correlation of TOMM40 with PSEN2 (P=1.3 × 10⁻¹³, r²=0.24) and a weaker correlation of APOE with APP (P=4.5 × 10⁻⁷, r²=0.12). The other correlations were not significant at α=0.05.

This study marks one of the first attempts to explore genome-wide association data in AD in the context of pathway enrichment. The enriched pathway that we have uncovered provides an intriguing indication that dysfunction of intracellular protein trafficking may be a common biological theme in AD. Although there is little support in the literature for the involvement of nucleoporin genes in AD, there is more substantial evidence for the importance of the mitochondria. In this regard, recent evidence suggests that import of β-amyloid into mitochondria may underlie β-amyloid toxicity,13, 14 in line with a larger body of evidence linking mitochondrial function to AD.15 More importantly, this process is facilitated by the translocase of the mitochondrial outer membrane complex, illustrating the potential importance of TOMM40, itself the highest ranked gene in this GWAS and the only gene in the BCL3-PVRL2-TOMM40-APOE LD block that is significantly connected to the identified pathway. It is plausible that age-related susceptibility to β-amyloid is mediated by the decrease in mitochondrial function that occurs with advancing age.16, 17 Import of β-amyloid into the nucleus through nucleoporins may also be an avenue worth pursuing in functional studies. Although TOMM40 shows pathway connectivity whereas APOE does not, we emphasize that we do not claim that the association of the region with AD is mediated by TOMM40. Rather, the data indicate that TOMM40 may also have a role in the disease, and this is echoed in the strong correlation of TOMM40 with PSEN2 expression. In summary, our approach rests on the idea that the genetic architecture of complex traits is not dispersed over unrelated genes in the genome; rather, the mutational events that ultimately underlie trait variance can occur in functionally related genes.
While implicating intracellular protein transport in AD is a highlight of the present study, we also consider the success of identifying a significant pathway component to a complex disease an important validation of this strategy.