Introduction

It is increasingly realized that evolutionary forces produce substantial genetic heterogeneity in human disease.1 Different affected individuals may have a large number of different risk variants. These different risk variants may be located in the same genes or in different genes, but in the same or related pathways. Vast allelic, locus and phenotypic heterogeneity in common disease implies that identifying pathways associated with disease is a key approach to unraveling the pathogenesis of the disease.2 Pathway analysis typically tests the association of a predefined set of related genes, which are often defined by biological knowledge.3 Although pathway analysis methods have been developed and successfully applied to association studies of common variants,4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 the statistical methods for pathway-based association analysis of rare variants have not been well developed.18, 19, 20, 21 The current methods for pathway-based association analysis of rare variants are classified into two approaches. One approach is a two-step strategy, first generate gene-level statistics (combining information on all rare variants in the gene) or P-values and then aggregate the gene-level statistics or combining P-values across all genes in a pathway by gene set enrichment analysis (GSEA). An alternative approach is to aggregate all rare variants (combining all rare variants within genes in a pathway) in a pathway directly and to collectively test the association of all rare variants in the pathway by rare variant association test statistics.18 However, many investigators have observed highly inflated false-positive rates and low power in pathway-based tests of association of rare variants.22, 23 The inflated false-positive rates and low true-positive rates of the current methods are mainly due to their lack of ability to account for gametic phase disequilibrium and to reduce the high dimensionality of the data in the pathway-based association analysis.21

To overcome these limitations, we develop a novel statistical method for pathway-based association studies, which are based on the smoothed functional principal component analysis (SFPCA). The SFPCA is used to view the genotype profiles of SNPs as a function of genomic position of the SNPs and perform basis function expansion of genotype function. Therefore, the SFPCA takes information across all variants in the genomic region into account and hence, includes all individual varaint distribution. The SFPCA statistic globally compares differences in the average of functional principal component scores between cases and controls. In other words, it tests accumulation of differences in all variant variation in the genomic region between cases and controls. We extend SFPCA from a single gene to multiple genes and compare the difference in functional principal component scores that are calculated from all genes in a pathway between cases and controls. The SFPCA-based statistic for testing the association of pathway with the disease combines a measure of goodness-of-fit with a roughness penalty to retain the advantages of basis expansion and reduce the dimensionality of the data in the pathway. The SFPCA can utilize merits of both individual variant analysis and group tests. It can also efficiently use information of both risk and protective variants and allow for sign and size heterogeneity of genetic variants in the pathways. Many statistics can be used to test for association of either common variants or rare variants, but very few can test association of both common and rare variants. The SFPCA is designed to test the association of the entire allelic spectrum of genetic variation.

To evaluate the performance of the SFPCA-based statistic for pathway analysis, we will use large-scale simulations to calculate the type I error rates and systematically evaluate the power of 23 statistical methods: SFPCA, functional principal component analysis (FPCA), the weighted sum (WSS),24 variable-threshold (VT),25 combined multivariate and collapsing (CMC),26 linear combination test (LCT/LCT),13 quadratic test (QT/QT),13 de-correlation test (DT/DT),13 WSS/Sidak, WSS/Fisher combination, WSS/Fisher exact, WSS/GESA, VT/Sidak, VT/Fisher combination, VT/Fisher exact, VT/GESA, CMC/Sidak, CMC/Fisher combination, CMC/Fisher exact, CMC/GESA, PCA, SKAT27 and GESA.28

To further explore and illustrate some valuable features of the SFPCA, SFPCA and other popular statistics for pathway analysis are applied to the early-onset myocardial infarction (EOMI) exome sequence data sets, which contain individuals with European origin (EA) and African origin (AA) from the NHLBI’s Exome Sequencing Project (ESP). The ESP may be the largest publically available exome sequencing data set when this paper is completed. Our results show that although sample sizes are small, we still can identify pathways significantly associated with EOMI, which can be replicated in two independent studies and confirmed in the literature.

Materials and methods

FPCA for pathway association

Let t be a genomic position in the gene. Define a genetic variant function xi(t) of the i-th individual as 1, 0, −1, if the number of major alleles at the SNP located at the genomic position t is 2, 1 and 0, respectively.

Consider k genes in a pathway. The j-th gene is located in the genomic region [aj,bj]. Let R(s,t) be a covariance function between xi(t) and xi(s), and β(t) be a functional principal component. By extension of multivariate PCA to functional PCA, the formula for the variance of stochastic integral29 and calculus of variations,30 we obtain the following k eigenequations (Supplementary Information):

Equation (1) can be solved by basis function expansion (Supplementary Information).

The observed genetic variant functions are often not smooth, which will lead to substantial variability in the estimated functional principal component curves.31, 32 To improve the smoothness of the estimated functional principal component curves, we impose the roughness penalty on the functional principal component weight functions. The smoothed functional principal components can be found by solving the following integral equations (Supplementary Information):

Note that when μ=0 the smoothed functional principal components analysis is reduced to unsmoothed FPCA. Again, integral Equation (2) can be solved by basis function expansion, which leads to the principal component function (Supplementary Information).

Test statistic

We use the pooled genetic variant functions Xi(t) of cases and Yi(t) of controls to estimate the principal component function , j=1,...,k using the basis expansion methods (Supplementary Information). Using orthonormality of the functional principal components, we can obtain the functional principal component scores and of Xi(t) and Yi(t) (Supplementary Information). Let and , where M is the number of functional principal components in the eigenfunction expansion. Define the pooled covariance matrix.

where

Define

Then, the statistic is defined as

Under the null hypothesis of no association of pathway with the disease, the statistic TSFPCAP is asymptotically distributed as a distribution where M is the number of functional principal components in the eigenfunction expansion (Supplementary Information).

Results

Type 1 error rates

To assess the type 1 error rates of the test statistics, we performed a series of simulation studies. We randomly sampled (with replacement) from the exome sequence data in the TGF-β signaling pathway (19 genes) with 2242 European Americans in the NHLBI’s ESP project to simulate 100 000 individuals. A total of 1394 SNPs with minor allele frequencies (MAFs) ranging from 4.06 × 10−4 to 0.4523 and 595 rare variants (MAF≤ 0.05) were included in the analysis. A total of 2000 individuals were randomly sampled (with replacement) and equally assigned to cases and controls. A total of 5000 simulations were repeated.

Two approaches are usually taken in pathway analysis. One approach is to first generate gene-level statistics or P-values assessed by marginal association analysis or a group test, and then aggregate the gene-level statistics or combine P-values across all genes in a pathway by GSEA. An alternative approach is to aggregate all SNPs in the pathway directly and to collectively test the association of all SNPs in the pathway by association test statistics. In this paper, the gene-based statistics that test for association of a gene with disease were the LCT, QT, DT, CMC method, WSS, VT approach and SKAT method. The statistics for aggregating the gene-level statistics in the pathway analysis in which LD information in the pathway can be explored were LCT, QT and DT. The methods for combining P-values across all genes in the pathway were GSEA with Kolmogorov-Smirnov test, Fisher’s exact test and Sidak test. FPCA and SFPCA that aggregated all SNPs in the pathway were used to jointly test the association of all SNPs in the pathway. The CMC, WSS and VT were also used to serve this purpose.

Table 1 summarized type 1 error rates of 15 statistics to test pathway association including rare variants (MAF <0.05) only with disease. We observed that LCT/LCT and QT/QT tests were inflated at three significance levels, DT/DT, WSS/Fisher-exact and VT/Fisher-exact were inflated at a 0.05 significance level, but other tests were not appreciably different from the nominal levels. To examine the validity of the tests for testing the pathway association including both common and rare variants, we presented Table 2. Similar to Table 1, LCT/LCT and QT/QT tests were inflated at three significance levels. DT/DT, WSS/Fisher-exact and VT/Fisher-exact were inflated at a 0.05 significance level, and other tests were not appreciably different from the nominal levels.

Table 1 Type 1 error rates of 15 statistics for testing the association of pathway that includes only rare variants with disease
Table 2 Type 1 error rates of 15 statistics for testing the association of pathway that includes all variants with disease

Power evaluation

To evaluate the performance of the proposed statistics for testing the pathway association, we used simulated data to estimate their power to detect true associations. Again, we used exome sequence data in the TGF-β signaling pathway (19 genes) of 2242 European Americans in the ESP project to simulate 100 000 individuals. The number of SNPs and range of allele frequencies were the same as that described in the previous selection. We randomly selected six risk genes. In each risk gene, we selected a proportion of variants as the risk variant.

We assumed that the relative risks across all variant sites are equal and that the variants influence disease susceptibility independently (ie, no epistasis). Each individual was assigned to the group of cases or controls depending on their disease status (Supplementary Information). The process for sampling individuals from the population of 100 000 individuals was repeated until the desired samples were reached for simulations. A total of 5000 simulations were repeated.

By simulations we evaluated the power of 23 statistics: SFPCA, FPCA, WSS, VT, CMC, LCT/LCT, QT/QT, DT/DT, WSS/Sidak, WSS/Fisher combination, WSS/Fisher exact, WSS/GESA,VT/Sidak, VT/Fisher combination, VT/Fisher exact, VT/GESA, CMC/Sidak, CMC/Fisher combination, CMC/Fisher exact, CMC/GESA, PCA, SKAT and GESA. Figure 1 and Supplementary Figure 1 plotted the power curves of 23 statistics for testing the pathway association harbored rare variants (MAF ≤0.05). All variants including both rare and common variants, respectively, as a function of the total number of individuals at the α=0.05 significance level, assuming that 20% of the variants in each of six genes were randomly selected as risk variants. Several remarkable features emerged from Figure 1 and Supplementary Figure 1. First, the SFPCA-based statistic had the highest power, followed by the FPCA-based statistics and other statistics. Second, the SFPCA and FPCA are designed to directly test the association of all SNPs in the pathway. We observed that SFPCA, FPCA and direct application of WSS, SKAT, PCA, VT and CMC statistics to test the pathway association had higher power than the two-stage approache. Third, the power pattern of test statistics for pathway analysis of all variants was similar to that for pathway analysis of rare variants.

Figure 1
figure 1

The power curves of 23 statistics for testing the association of a pathway including rare variants (MAF ≤0.05) only with disease as a function of the total number of individuals at the significance level α=0.05, where six risk genes and 20% of the risk rare variants in each gene were randomly selected. The power curves of SFPCA, FPCA, WSS, VT, CMC, LCT/LCT, QT/QT, DT/DT, WSS/Sidak, WSS/Fisher Combination, WSS/Fisher exact, WSS/GESA, VT/Sidak, VT/Fisher combination, VT/Fisher exact, VT/GESA, CMC/Sidak, CMC/Fisher Combination, CMC/Fisher exact, CMC/GESA, PCA, SKAT and GESA were denoted by red solid (-), red dashed (—), blue dashdot (-.), blue solid (-), green solid (-), black solid (-), magenta solid (-), magenta dashed (—), magenta dotted (..), blue dashed (—), blue dashdot (-.), blue dotted (..), blue solid with * marktype (-*), green dashed (—), green dashdot (-.), green dotted (..), green solid with plus marktype (-+), black dashed (—), black dashdot (-.), black dotted (..), black solid with * marktype (-*), magenta solid with * marktype (-*) and magenta dashed with plus marktype (—+), respectively.

To examine the impact of the direction of association of alleles with disease risk on the power of the tests, we assume that the risk genes in the pathway include both risk and protective variants. Figure 2 and Supplementary Figure 2 plotted the power curves of 23 statistics for testing the pathway association that harbored rare variants (MAF ≤0.05), and all variants including both rare and common variants, respectively, as a function of the total number of individuals at the α=0.05 significance level. We assumed that 15% of rare (all) variants in each of six genes were randomly selected as risk variants and 15% of the rare (all) variants in each gene were protective variants. Figure 2 and Supplementary Figure 2 showed that the power pattern of the statistics for testing the association of pathway with both risk and protective variants was similar to the power patterns of the statistics for testing the pathway association with only risk variants. However, we also observed that the impact of the direction of association of alleles on the power of SFPCA and FPCA was much less than on the power of other test statistics.

Figure 2
figure 2

The power curves of 23 statistics for testing the association of a pathway including rare variants (MAF ≤0.05) only with disease as a function of the total number of individuals at the significance level α=0.05, where six risk genes were selected, and 15% of the rare variants in each gene were assumed to be risk variants and 15% of the rare variants in each gene were assumed to be protective variants. The power curves of SFPCA, FPCA, WSS, VT, CMC, LCT/LCT, QT/QT, DT/DT, WSS/Sidak, WSS/Fisher Combination, WSS/Fisher exact, WSS/GESA, VT/Sidak, VT/Fisher combination, VT/Fisher exact, VT/GESA, CMC/Sidak, CMC/Fisher Combination, CMC/Fisher exact, CMC/GESA, PCA, SKAT and GESA were denoted by red solid (-), red dashed (—), blue dashdot (-.), blue solid (-), green solid (-), black solid (-), magenta solid (-), magenta dashed (—), magenta dotted (..), blue dashed (—), blue dashdot (-.), blue dotted (..), blue solid with * marktype (-*), green dashed (—), green dashdot (-.), green dotted (..), green solid with plus marktype (-+), black dashed (—), black dashdot (-.), black dotted (..), black solid with * marktype (-*), magenta solid with * marktype (-*) and magenta dashed with plus marktype (—+), respectively.

To further systematically compare the power of 23 statistics for testing the pathway association, we studied the power of the tests as a function of the ratio of risk variants. For simplicity, again we selected six genes as risk genes and assumed that the proportion of risk variants in each of the six genes was the same. Figure 3 and Supplementary Figure 3 plotted the power of 23 statistics for testing the pathway association that harbored rare variants (MAF ≤0.05), and all variants including both rare and common variants, respectively, as a function of proportion of risk variants under the dominant model at the α =0.05 significance level, assuming that 1600 cases and 1600 controls were sampled. We observed that the smoothed FPCA-based statistics had the highest power, followed by the FPCA in all settings. Varying the number of genes, or the proportion of disease variants across genes (so some genes have a higher proportion), influenced the magnitude of the power, but did not change the relative performance of methods (data not shown).

Figure 3
figure 3

The power curves of 23 statistics for testing the association of a pathway including rare variants (MAF ≤0.05) only with disease as a function of the proportion of risk rare variants within the six selected risk genes at the significance level α=0.05, where 1600 cases and 1600 controls were sampled. The power curves of SFPCA, FPCA, WSS, VT, CMC, LCT/LCT, QT/QT, DT/DT, WSS/Sidak, WSS/Fisher Combination, WSS/Fisher exact, WSS/GESA, VT/Sidak, VT/Fisher combination, VT/Fisher exact, VT/GESA, CMC/Sidak, CMC/Fisher Combination, CMC/Fisher exact, CMC/GESA, PCA, SKAT and GESA were denoted by red solid (-), red dashed (—), blue dashdot (-.), blue solid (-), green solid (-), black solid (-), magenta solid (-), magenta dashed (—), magenta dotted (..), blue dashed (—), blue dashdot (-.), blue dotted (..), blue solid with * marktype (-*), green dashed (—), green dashdot (-.), green dotted (..), green solid with plus marktype (-+), black dashed (—), black dashdot (-.), black dotted (..), black solid with * marktype (-*), magenta solid with * marktype (-*) and magenta dashed with plus marktype (—+), respectively.

Application to a real data example

To further evaluate their performance, 23 statistics for pathway analysis were applied to the EOMI exome sequence data from the NHLBI’s ESP, where a total of 544 (188 cases and 356 controls) with EA and 312 (39 cases and 273 controls) with AA were exome sequenced. Genotype calling and quality control were described as previously.33 After quality control of the data (Supplementary Information), a total of 18 737 genes with 575 259 SNPs were included in the analysis. We assembled 249 pathways from KEGG34 and 308 pathways from Biocarta (http://www.biocarta.com). The assignment of SNPs to a gene including all SNPs within 5 kb of the gene region was obtained from the NCBI GRCh37/hg19 (ftp://ftp.ncbi.nih.gov/gene/Data/GENE_INFO). P-value for declaring significance after the Bonferroni correction is 8.98 × 10−5. We first studied pathway association including only rare variants (MAF ≤0.05). We identified two pathways: vascular endothelial growth factor (VEGF) signaling pathway and TGFβ signaling pathway that were significantly associated with EOMI in the EA population by the SFPCA-based test. Table 3 listed P-values of 11 statistics for testing the association of these two pathways with EOMI in the EA (113 201 rare variants) and AA (179 701 rare variants) populations, where P-values were calculated by permutation. Table 3 showed that the SFPCA-based statistic had the smallest P-value among 11 statistics, followed by FPCA. We also observed that the significant results by the SFPCA-based test in the EA population could not be replicated in the AA population.

Table 3 P-values of top 2 significant pathways with rare variants associated with EOMI

VEGF has an important role in maintaining healthy vascular integrity and promoting homoeostasis, and is involved in atherosclerosis plaque disruption.35, 36 The VEGF signaling pathway is now used as a target pathway for treatment of ischemic cardiovascular disease.37, 38 It is reported that TGFβ signaling pathway is a risk factor for cardiovascular disease39 and has an important role in the development of myocardial infarction.40, 41

To demonstrate that replication of the results of pathways in independent samples is much easier than replication of genes, we plotted in Figure 4 where the genes with red and blue color were mildly associated with EOMI in the EA and AA populations, respectively, and the genes with green color were mildly associated with EOMI in both of the EA and AA population (see data presented in Supplementary Table 1). Figure 4 and Supplementary Table 1 showed that the EOMI studies in the EA and AA populations shared no common significantly associated genes with rare variants within the TGFβ pathway, in other words, we failed to replicate significantly associated genes within the TGFβ pathway in two independent studies. However, Table 3 showed that the TGFβ pathway in both studies was significantly associated with EOMI using the SFPCA test. This example showed that replication at the pathway level is easier than replication at the gene level. We also observed no significant genes associated with EOMI in both of the EA and AA populations. But, we observed a number of mild associations of genes within the TGFβ pathway with EOMI in the EA and AA populations. In Figure 4, there were 14 genes and 10 genes that were mildly associated with EOMI (P-value <0.05) in the EA and AA populations, respectively. Figure 4 and Supplementary Table 1 showed that each gene in the TGFβ pathway may confer a small contribution, but their joint actions may affect the function of the pathway, which in turn will cause EOMI. Other mildly associated pathways with P-values <0.0001were summarized in Supplementary Table 2.

Figure 4
figure 4

P-values for testing the association of genes within the TGFβ pathway with EOMI in EA and AA population by the SFPCA test, where red, blue and green colors indicate the genes associated with EOMI in EA population, AA population or both EA and AA populations, respectively.

Next we examined the association of pathways with both common and rare variants. Table 4 listed top four best pathways harboring both common and rare variants associated with EOMI. Using the SFPCA test, pyrimidine metabolism was significantly associated with EOMI in EA population, whereas association of the metabolic pathway with EOMI in the AA population quite closely reached significance level (9.40 × 10−5). Although varaint distribution association of four pathways was not significant, their P-values of association were close to significance level. It is reported that glycosaminoglycan influences the low-density lipoprotein and is involved in atherosclerosis and ischemic heart disease.42, 43, 44, 45

Table 4 P-values of top four best pathways with both common and rare variants associated with EOMI

To examine whether P-values of the pathway associations in our real data set are dependent on pathway size, we generated three types of pathway data sets. For the first type of data sets, we randomly generated a pathway with the fixed number of genes (200 genes) and fixed number of SNPs (1764). 10 000 simulations were repeated. For the second type of data sets, we randomly generated 10 000 data sets, each with 200 genes and varying number of SNPs that range from 616 to 1162, selected from different pathways of the whole EOMI data set. For the third type of data sets, we randomly selected same number of genes and SNPs as that in the four reported significant pathways identified by our method. We used three data sets to generate empirical P-values of four pathways (Supplementary Table 3). Supplementary Table 3 clearly showed that using randomly selected genes and SNPs, the observed pathway associations are no longer significant. This demonstrates that the observed pathway associations were unlikely to be caused by systematic inflation of association test statistics. Rather, our results should have some biological implication.

Discussion

We have developed a novel SFPCA-based statistic to addresses the conceptual and analytical challenges raised by pathway-based association studies for the entire spectrum of genetic variation. To our knowledge, this is among the first one to systematically evaluate the power of a large number of tests for pathway association with NGS data using large-scale simulations and real data analysis. Using simulations and real data analysis of EOMI, we have demonstrated that the SFPCA-based statistic for pathway-based association studies has broad applicability to NGS data and has several remarkable advantages over many existing methods.

First, the SFPCA-based statistic not only can jointly test association of all SNPs within a pathway, but also can capture position-level variant information. The smoothed functional principal component scores take into account information across all variants in the pathway. The SFPCA-based statistic can account for gametic phase disequilibrium. Therefore, it has a much higher power and smaller P-value than the other 22 existing statistics in all scenarios.

Second, we have developed a unified statistic that can test the association of either rare or common variants or both rare and common variants without introducing any changes in the test statistic. From large-scale simulations and real data analysis, we showed that the SFPCA-based statistic for pathway-based association studies had the correct type 1 error rate and high power.

Third, by large-scale simulations, we have shown that any GSEA that does not employ correlation information among the variants within the pathway has less power than the statistics that aggregate all genetic variants within the pathway and directly test the association of all aggregated genetic variants. The SFPCA-based statistic has the highest power to test the association of pathway among 23 statistics.

Fourth, the SFPCA-based statistic can efficiently and automatically use information of both risk and protective variants and allow for sign and size heterogeneity of genetic variants. In general, the risk and protective variants will be present in different locations in the genomic regions within the pathway. Information of risk and protective variants usually will be reflected in different eigenfunctions and hence will be included in different functional principal component scores. The SFPCA-based statistic is to summarize the square of the differences in the smoothed functional principal component scores between cases and controls. Therefore, the opposite effects of risk and protective variants on the phenotype will not compromise each other in the SFPCA-based statistics. Using simulations, we showed that the SFPCA-based statistics had substantially higher power than the existing approach in the presence of both risk and protective variants in the pathway being investigated.

Fifth, the SFPCA method can partially handle missing variant calls. By functional expansion, the SFPCA method can automatically predict genetic variation function of the missing variants from information of other known variants. We can also redefine the genetic variation function to incorporate the predicted genotypes based on its MAF, that is, the probability of carrying a rare variant.

The developed SFPCA statistic was applied to pathway analysis of EOMI with exome sequencing data. We identified the TGFβ pathway harboring rare variants significantly associated with EOMI, which were replicated in two independent EA and AA studies. We also discovered several pathways showing association, which can be confirmed in the literature. However, surprisingly, we have not observed overlap between significant pathways with common variants and significant pathways with rare variants. This may imply that genetic causes underlying diseases for common and rare variants are different. Emerging NGS technologies enable sequencing individual genomes and have the potential to discover the entire spectrum of sequence variations in a sample of well-phenotyped individuals.

The results in this paper are quite preliminary. The number of eigenfunctions in the expansion of genetic variant function and penalty parameters will influence the performance of the SFPCA for pathway-based association studies. How optimally select these parameters in pathway analysis are still open to questions in practice. Great challenges in developing innovative approaches and general framework for pathway-based association studies of NGS data need to be dealt with effectively.