INTRODUCTION

Resequencing of genomes will generate unprecedentedly high-dimensional genetic variation data that allow nearly complete evaluation of genetic variation, including several million common (>5% population frequency), low-frequency (1–5% population frequency) and rare (<1% population frequency) variants in a typical human genome.1, 2 Despite their promise, next-generation sequencing (NGS) technologies suffer from three notable limitations: high error rates, enrichment of rare variants and a large proportion of missing values.3, 4 Because an individual rare variant has a relatively small impact on a common disease and rare variants have very low population frequencies, traditional analytical tools, which were designed mainly to detect common variants, have limited power for testing the association of rare variants. Developing new analytical tools for massive sequencing data therefore poses a novel and substantial challenge to statistical analysis.5

Genetic studies of complex diseases are undergoing a paradigm shift from single marker analysis to the joint analysis of multiple variants in a genomic region, which can be a gene or another functional unit.6 Large simulations have shown that combining information across multiple variants in a genomic region greatly enhances the power to detect association of rare variants.2 In the past several years, various collapsing methods, in which all rare variants are collapsed and treated as a single variable for analysis, have been developed.2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 Although group tests in some cases have higher power than individual tests, they also have limitations. First, group tests ignore differences in the effects on phenotype of SNPs at different genomic locations. Second, group tests do not leverage linkage disequilibrium (LD) in the data. Third, because sequence errors accumulate when rare variants are grouped, group tests are sensitive to genotyping errors and missing data. To combine the advantages of single variant analysis and group tests while addressing the limitations inherent in both, we view the genome as a continuum and the variants in the genome as a realization of a stochastic process that can be modeled as a random function, and we previously proposed functional principal component analysis (FPCA) for testing the association of rare variants with disease.18 FPCA can greatly enhance the power to detect association of variants. However, when the genetic variant functions change rapidly within the genomic region, the basis expansion in FPCA cannot approximate the genetic variation data well, which decreases the power of FPCA. To overcome this limitation, we propose the smoothed FPCA (SFPCA) for testing the association of rare variants, which combines a measure of goodness-of-fit with a roughness penalty to retain the advantages of basis expansion while circumventing its limitation.

Group tests often make an implicit homogeneity assumption: all putatively functional variants within the same genomic region are assumed to have the same direction of effect. In practice, however, variants with opposite directions of effect may be present simultaneously in the same genomic region.5 Group tests have difficulty dealing with heterogeneity in the size and sign of effects. The second purpose of this paper is to show that the SFPCA takes the sign and size heterogeneity of the variants into account and is less sensitive to the presence of variants with opposite directions of effect.

There is increasing consensus that complex diseases are caused by both common and rare variants. Many statistics can be used to test for association of either common variants or rare variants, but very few can be used to test association of both. The third purpose of this report is to demonstrate that the SFPCA can be used to test the association of the entire allelic spectrum of genetic variation.

To accomplish these goals, we use large-scale simulations to calculate the type 1 error rates and evaluate the power of 12 alternative statistical methods under various scenarios: the SFPCA discretization, SFPCA Fourier expansion, FPCA discretization, FPCA Fourier expansion, sequence kernel association test (SKAT), collapsing method, combined multivariate and collapsing (CMC) method,8 generalized T2,2, 19 multivariate principal component analysis (MPCA), the weighted sum statistic (WSS),9 the variable threshold (VT) method10 and the single marker χ2 test. To further evaluate its performance, the SFPCA is applied to the ANGPTL4 sequence and six continuous phenotypes data from the Dallas Heart Study20 and to schizophrenia GWAS data.

MATERIALS AND METHODS

Smoothed FPCA

We first review the definition of genetic variant profiles.18 Let t be the position of a genetic variant within a genomic region and T be the length of the genomic region being considered. For convenience, we rescale the region [0,T] to [0,1]. Because the density of genetic variants is high, we can view t as a continuous variable in the interval [0,1]. Assume that nA cases and nG controls are sampled and sequenced. We define the genotype profile of the ith case as

$$X_i(t) = \begin{cases} 2, & MM \\ 1, & Mm \\ 0, & mm \end{cases}$$

where M is an allele at the genomic position t and m is the alternative allele. Similarly, we can define a genetic variant function Yi(t) for the ith control.
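The coding step described above can be sketched in a few lines. This is a minimal illustration, assuming the usual additive coding in which the profile value is the number of copies of allele M; the function names, the two-character genotype strings and the example positions are hypothetical and do not come from the study.

```python
def rescale_positions(positions, start, end):
    """Map base-pair positions in [start, end] onto the unit interval [0, 1]."""
    length = end - start
    return [(p - start) / length for p in positions]


def code_genotype(genotype, allele="M"):
    """Count copies of `allele` in a two-character genotype string such as 'Mm'."""
    return sum(1 for a in genotype if a == allele)


# Hypothetical region spanning positions 1000..3000 with three variant sites.
positions = [1000, 1500, 3000]
genotypes = ["MM", "Mm", "mm"]

t = rescale_positions(positions, start=1000, end=3000)
x = [code_genotype(g) for g in genotypes]
```

Here `t` holds the rescaled positions and `x` the profile values at those positions for one individual.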

Now, we review the concept of functional principal components for association studies.18 To capture the variation of the genetic variant function, we define a linear combination of functional values:

$$f = \int_0^1 \beta(t)\,X(t)\,dt,$$

where β(t) is a weight function and X(t) is a centered genetic variant function defined in equation (3). The functional principal components can be obtained by choosing the weight function β(t) to maximize the variance of f:18

$$\operatorname{var}(f) = \int_0^1\!\!\int_0^1 \beta(s)\,R(s,t)\,\beta(t)\,ds\,dt \quad \text{subject to} \quad \int_0^1 \beta^2(t)\,dt = 1,$$

where R(s,t) is the covariance function of the genetic variant function X(t).

The observed genetic variant profiles are typically not smooth, which leads to substantial variability in the estimated functional principal component curves. To improve the smoothness of the estimated functional principal component curves, we impose the roughness penalty on the functional principal component weight functions. We often penalize the roughness of the functional principal component curve using its integrated squared second derivative. The balance between the goodness-of-fit and the roughness of the function is controlled by a smoothing parameter μ.

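The smoothed functional principal components can be obtained by solving the following integral equation (see Appendix):

$$\int_0^1 R(s,t)\,\beta(s)\,ds = \lambda\left[\beta(t) + \mu D^4\beta(t)\right], \qquad (4)$$

where λ is an eigenvalue and $D^4\beta(t)$ denotes the fourth derivative of β(t). Note that when μ=0, the SFPCA reduces to an unsmoothed FPCA.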

Computations for the smoothed principal component function and the principal component score

The eigen equation is an integral equation and is difficult to solve in closed form. A general strategy for solving the eigenfunction problem in (4) is to convert the continuous eigen-analysis problem into an appropriate discrete eigen-analysis task.21 In this paper, we use basis function expansion methods to achieve this conversion (see Supplementary File 1).

Let {φj(t)} be the series of Fourier basis functions. For each j, define ω2j−1=ω2j=2πj. We expand each genetic variant profile Xi(t) as a linear combination of the basis functions φj:

$$X_i(t) = \sum_{j=1}^{T} C_{ij}\,\phi_j(t).$$

Define the vector-valued function X(t)=[X1(t),…,XN(t)]T and the vector-valued function φ(t)=[φ1(t),…,φT(t)]T. The joint expansion of all N genetic variant profiles can be expressed as

$$X(t) = C\,\phi(t),$$

where C is a coefficient matrix C=(Cij)N × T.

In matrix form, we can express the variance-covariance function of the genetic variant profiles as

$$R(s,t) = \frac{1}{N}\,\phi(s)^{T} C^{T} C\,\phi(t). \qquad (7)$$

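Similarly, the eigenfunction β(t) can be expanded as

$$\beta(t) = \phi(t)^{T} b, \qquad D^4\beta(t) = \phi(t)^{T} S_0\, b, \qquad (8)$$

where b=[b1,…,bT]T and S0=diag (ω14,…,ωT4). Let S=diag ((1+μω14)−1/2,…,(1+μωT4)−1/2). Then, we have

$$\beta(t) + \mu D^4\beta(t) = \phi(t)^{T} S^{-2}\, b.$$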

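Substituting expansions (7) and (8) of the variance-covariance function R(s,t) and the eigenfunction β(t) into the functional eigen equation (4), and using the orthonormality of the basis functions, we obtain

$$\frac{1}{N}\,\phi(t)^{T} C^{T} C\, b = \lambda\,\phi(t)^{T} S^{-2}\, b,$$

which can be rewritten as

$$\left[S\left(\frac{1}{N}\, C^{T} C\right) S\right] u = \lambda u,$$

where u=S−1b. Thus, b=Su and β(t)=φ(t)Tb is a solution to eigen equation (4).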

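Note that <uj, uj>=1 and <uj, uk>=0 for k<j. Therefore, we obtain a set of orthonormal eigenfunctions as shown in equation (11):

$$\beta_j(t) = \phi(t)^{T} S\, u_j, \qquad (11)$$

where the inner product of two functions f and g is defined as $\langle f, g\rangle_\mu = \int_0^1 f(t)\,g(t)\,dt + \mu\int_0^1 D^2 f(t)\,D^2 g(t)\,dt$, where $D^2 f(t) = d^2 f(t)/dt^2$.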
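The conversion of the smoothed eigen problem into a matrix eigen problem can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the authors' program: it assumes an orthonormal Fourier basis that includes a constant term with frequency zero, obtains the coefficient matrix C by ordinary least squares on the observed positions, and uses the diagonal matrix S and the substitution b=Su described above. The function names and argument names are hypothetical.

```python
import numpy as np


def fourier_basis(t, n_pairs):
    """Evaluate a Fourier basis on [0, 1]: a constant, then sine/cosine pairs.

    Returns the (len(t), 1 + 2 * n_pairs) design matrix and the frequencies.
    """
    cols = [np.ones_like(t)]
    omega = [0.0]  # assumption: the constant term has frequency zero
    for j in range(1, n_pairs + 1):
        w = 2.0 * np.pi * j
        cols.append(np.sqrt(2.0) * np.sin(w * t))
        cols.append(np.sqrt(2.0) * np.cos(w * t))
        omega.extend([w, w])
    return np.column_stack(cols), np.array(omega)


def smoothed_fpca(X, t, n_pairs, mu):
    """Smoothed FPCA sketch for profiles X (N individuals x m sites) at positions t."""
    Xc = X - X.mean(axis=0)                        # centre each variant site
    Phi, omega = fourier_basis(t, n_pairs)
    # Least-squares coefficients C (N x J) such that Xc is approximately C Phi^T.
    C = np.linalg.lstsq(Phi, Xc.T, rcond=None)[0].T
    s = (1.0 + mu * omega ** 4) ** -0.5            # diagonal of S
    V = C.T @ C / X.shape[0]                       # (1/N) C^T C
    lam, U = np.linalg.eigh(s[:, None] * V * s[None, :])  # eigen-analysis of S V S
    order = np.argsort(lam)[::-1]                  # largest eigenvalue first
    lam, U = lam[order], U[:, order]
    B = s[:, None] * U                             # columns are b_j = S u_j
    return lam, B, C, s
```

Each column of `B` holds the basis coefficients of one smoothed eigenfunction. Setting `mu=0` makes S the identity, so the sketch reduces to an unsmoothed basis-expansion FPCA. The number of basis functions should not exceed the number of observed sites, otherwise the least-squares step is underdetermined.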

Test statistic

We use the pooled genetic variant profiles Xi(t) of cases and Yi(t) of controls to estimate the set of orthonormal principal component function (eigenfunctions) using the basis expansion methods. By the K-L decomposition, the smoothed functional principal component score can be obtained by

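We denote the vectors of averages of the functional principal component scores in cases and controls by $\bar{\xi} = [\bar{\xi}_1, \ldots, \bar{\xi}_K]^T$ and $\bar{\eta} = [\bar{\eta}_1, \ldots, \bar{\eta}_K]^T$, where $\bar{\xi}_j = \frac{1}{n_A}\sum_{i=1}^{n_A}\xi_{ij}$ and $\bar{\eta}_j = \frac{1}{n_G}\sum_{i=1}^{n_G}\eta_{ij}$, and define the pooled covariance matrix $S_p = \frac{1}{n_A+n_G-2}\left[\sum_{i=1}^{n_A}(\xi_i-\bar{\xi})(\xi_i-\bar{\xi})^T + \sum_{i=1}^{n_G}(\eta_i-\bar{\eta})(\eta_i-\bar{\eta})^T\right]$, where $\xi_i = [\xi_{i1}, \ldots, \xi_{iK}]^T$ and $\eta_i = [\eta_{i1}, \ldots, \eta_{iK}]^T$ are the score vectors of the ith case and the ith control. Let $\Lambda = \left(\frac{1}{n_A}+\frac{1}{n_G}\right) S_p$. Then, the statistic is defined as $T = (\bar{\xi}-\bar{\eta})^T \Lambda^{-1} (\bar{\xi}-\bar{\eta})$.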

Under the null hypothesis of no association of the genomic region with a disease, the statistic T is asymptotically distributed as a central χ2(K) distribution, where K is the number of functional principal components used in the test.
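A statistic that compares average score vectors between cases and controls through a pooled covariance matrix can be sketched as follows. The sketch assumes the standard two-sample Hotelling-type quadratic form; whether this matches the authors' exact definition is an assumption, and the function name is hypothetical.

```python
import numpy as np


def two_sample_score_statistic(scores_cases, scores_controls):
    """Quadratic-form statistic comparing mean score vectors of two groups.

    Each input has one row per individual and one column per principal component.
    """
    xi = np.asarray(scores_cases, dtype=float)
    eta = np.asarray(scores_controls, dtype=float)
    n_a, n_g = xi.shape[0], eta.shape[0]
    diff = xi.mean(axis=0) - eta.mean(axis=0)
    # Pooled covariance of the scores across the two groups.
    pooled = ((n_a - 1) * np.cov(xi, rowvar=False)
              + (n_g - 1) * np.cov(eta, rowvar=False)) / (n_a + n_g - 2)
    pooled = np.atleast_2d(pooled)
    lam = (1.0 / n_a + 1.0 / n_g) * pooled
    return float(diff @ np.linalg.solve(lam, diff))
```

With K score columns, the value would be referred to a χ2 distribution with K degrees of freedom, for example through `scipy.stats.chi2.sf(T, K)` if SciPy is available. With a single component the statistic equals the square of the pooled two-sample t statistic.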

RESULTS

Null distribution of test statistics

When the sample size is large, the SFPCA test statistic for testing the association of a genomic region with a trait of interest follows, under the null hypothesis of no association, a central χ2(K) distribution, where K is the number of functional principal components used in the test. To examine the validity of this statement, we performed a series of simulation studies.

We used the MS software22 to generate a population of 2 000 000 chromosomes, each with 60 common SNPs (MAF≥0.05) and 180 rare SNPs (MAF≤0.05) in a genomic region, on the basis of a coalescent model that mimics the LD pattern and the population history. The frequencies of minor alleles in the genomic region vary from 10−5 to 0.43. A number of individuals, ranging from 1000 to 5000, each consisting of two chromosomes, were sampled from the population and equally assigned to cases and controls. A total of 10 000 data sets were generated and the proposed test statistics were computed for each data set. For each test, we selected the number of functional principal components that account for 90% of the total variation.
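The rule of keeping enough components to account for 90% of the total variation can be sketched as a cumulative sum over the eigenvalues. This is an illustrative helper with a hypothetical name; it assumes the eigenvalues measure the variance explained by each component and ignores non-positive values.

```python
def n_components_for_variance(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose eigenvalues reach `threshold`."""
    vals = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(vals)
    running = 0.0
    for k, v in enumerate(vals, start=1):
        running += v
        if running / total >= threshold:
            return k
    return len(vals)
```

For eigenvalues 5, 4.5 and 0.5, the first component explains 50% and the first two explain 95%, so two components are retained.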

Table 1 summarized the type 1 error rates of the SFPCA test statistics for testing the association of rare variants within a genomic region. It showed that the estimated type 1 error rates of the test statistic were not appreciably different from the nominal levels α=0.05, α=0.01 and α=0.001. We also performed simulation studies to examine the validity of the null distribution of the test statistics in testing the association of a set of both common and rare variants within a genomic region. All 240 common and rare variants were used to calculate the type 1 error rates. Table 2 summarized the type 1 error rates of the SFPCA statistic for testing the association of all 240 variants in the genomic region with the disease. It showed that the estimated type 1 error rates of the SFPCA statistic were also not appreciably different from the nominal levels α=0.05, α=0.01 and α=0.001.
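The empirical type 1 error rate in such a simulation is the fraction of null data sets whose P-value falls below the nominal level. A minimal sketch, with a hypothetical function name and made-up P-values:

```python
def empirical_type1_error(p_values, alpha):
    """Fraction of null-simulation P-values below the nominal level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Made-up P-values from four simulated null data sets.
rate = empirical_type1_error([0.001, 0.02, 0.3, 0.8], alpha=0.05)
```

With 10 000 null data sets, an estimate close to the nominal α indicates that the asymptotic null distribution is adequate at that level.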

Table 1 Type 1 error rates of the smoothed FPCA statistic for testing the association of the rare variants in a genomic region with a disease
Table 2 Type 1 error rates of the smoothed FPCA statistic for testing the association of both common and rare variants in a genomic region with a disease

Power evaluation

To evaluate the performance of the FPCA-based statistics for testing the association of a set of rare variants with disease, we used the same data set as that for type 1 error rate calculation to estimate their power to detect a true association. We considered four disease models: additive, dominant, recessive and multiplicative.

An individual’s disease status was determined based on the individual’s genotype and the penetrance for each locus. Let Ai be a rare risk allele at the ith locus. Let k=0, 1 and 2 index the genotypes aiai, Aiai and AiAi, respectively, and let fki be the penetrance of genotype k at the ith locus. The relative risks (RR) at the ith locus are defined as R1i=f1i/f0i and R2i=f2i/f0i, where f0i is the baseline penetrance of the wild-type genotype at the ith variant site. We assume that R2i=2R1i−1 for the additive disease model, R2i=R1i for the dominant disease model, R1i=1 for the recessive disease model and R2i=R1i2 for the multiplicative disease model. The genotype RR was assumed to be inversely proportional to the MAF, with the population attributable risk (PAR) of each group assumed to be 0.005.7 We assumed equal RR across all variant sites and independence of the variants influencing disease susceptibility. Each individual was assigned to the group of cases or controls depending on their disease status. The process of sampling individuals from the population of 2 000 000 haplotypes was repeated until the desired sample sizes were reached for each disease model.
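The four disease models can be written as a mapping from one free relative-risk parameter to the pair (R1, R2), from which the genotype penetrances follow. This is a minimal sketch with hypothetical function names; the free parameter r plays the role of R1 in the additive, dominant and multiplicative models and of R2 in the recessive model, which is an assumption about parameterization rather than something stated in the text.

```python
def relative_risks(r, model):
    """Return (R1, R2) for one free relative-risk parameter r under a disease model."""
    if model == "additive":
        return r, 2.0 * r - 1.0
    if model == "dominant":
        return r, r
    if model == "recessive":
        return 1.0, r
    if model == "multiplicative":
        return r, r ** 2
    raise ValueError(f"unknown disease model: {model}")


def penetrances(f0, r, model):
    """Penetrances (f0, f1, f2) of genotypes aa, Aa and AA given baseline f0."""
    r1, r2 = relative_risks(r, model)
    return f0, r1 * f0, r2 * f0
```

For example, with baseline penetrance 0.01 and r=2 under the multiplicative model, the three penetrances are 0.01, 0.02 and 0.04.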

Figure 1 and Supplementary Figures S1–S3 plot the power curves of 12 statistics as a function of the proportion of risk-increasing variants for testing the association of 180 rare variants with disease under the additive, dominant, multiplicative and recessive disease models, respectively. The 12 statistics are the SFPCA discretization, SFPCA Fourier expansion, FPCA discretization, FPCA Fourier expansion, sequence kernel association test (SKAT),23 WSS, VT, MPC-based statistic, collapsing method, generalized T2 statistic, single marker χ2 test (where permutation was used to adjust for multiple testing) and the CMC method (variants with frequencies ≤0.005 were collapsed). We assumed a baseline penetrance of 0.01, with 2000 cases and 2000 controls sampled for the additive, dominant and multiplicative models, and 3000 cases and 3000 controls for the recessive model. The SFPCA-based statistics had the highest power, followed by the classical FPCA-based statistics, SKAT, WSS and VT, under all four disease models. The single marker test, generalized T2 and CMC methods had the lowest power to detect association of rare variants under all disease models. When the PAR is assumed constant, the number of risk-increasing variants determines the marginal PAR of each variant in the group. From these figures, we can see that the power of all 12 statistics is an increasing function of the proportion of risk variants.

Figure 1

Power of 12 statistics: the SFPCA (discretization approach) statistic, SFPCA (Fourier expansion approach) statistic, FPCA (discretization approach)-based statistics, FPCA (Fourier expansion approach)-based statistic, SKAT, multivariate PC-based statistic, WSS, VT, collapsing method, generalized T2 statistic, single marker χ2 test and CMC method (the variants with frequencies ≤0.005 were collapsed) as a function of the proportion of risk increasing variants for testing the association of 180 rare variants at significance level α=0.05 with the disease under the additive disease model, assuming baseline penetrance of 0.01 and 2000 cases and 2000 controls.

Next, we evaluate the impact of the sample sizes on the power. We assume that 15% of rare variants were risk increasing variants. Figure 2 and Supplementary Figures S5 and S6 showed the power of the above 12 statistics as a function of sample sizes under additive, dominant, multiplicative and recessive models, respectively. Similarly to Figure 1 and Supplementary Figures S1–S3, we observed that the SFPCA-based statistics had the highest power in all settings. Differences in the power between the SFPCA-based statistics and eight other non-FPCA statistics increased as the sample sizes increased. We also observed that most of the time the difference in power between the FPCA by expansion and FPCA by the discretization method is small.

Figure 2

Power of 12 statistics: the SFPCA (discretization approach) statistic, SFPCA (Fourier expansion approach) statistic, FPCA (discretization approach)-based statistics, FPCA (Fourier expansion approach)-based statistic, SKAT, multivariate PC-based statistic, WSS, VT, collapsing method, generalized T2 statistic, single marker χ2 test and CMC method (the variants with frequencies ≤0.005 were collapsed) as a function of sample sizes for testing the association of 180 rare variants, 15% of which were risk increasing variants, with the disease under the additive disease model at significance level α=0.05, assuming baseline penetrance of 0.01.

Next, we investigated the power of the statistics for testing association of both common and rare variants. Figure 3 plots the power of the 12 statistics for testing association of all 240 common and rare variants as a function of the proportion of risk variants under the additive model, assuming that 2000 cases and 2000 controls were sampled. Supplementary Figure S7 shows the power of the 12 statistics for testing association of all 240 common and rare variants as a function of sample size under the additive model, assuming that 15% of the variants were risk variants. The power patterns of the 12 statistics under the other disease models were similar to those under the additive model (data not shown). From these figures, we observed that the SFPCA substantially outperformed the unsmoothed FPCA and the other statistics. As the sample size increased, the difference in power between the SFPCA and the other tests increased rapidly.

Figure 3

Power of 12 statistics: the SFPCA (discretization approach) statistic, SFPCA (Fourier expansion approach) statistic, FPCA (discretization approach)-based statistic, FPCA (Fourier expansion approach)-based statistic, SKAT, multivariate PC-based statistic, WSS, VT, collapsing method, generalized T2 statistic, single marker χ2 test and CMC method (the variants with frequencies ≤0.005 were collapsed) as a function of the proportion of risk increasing variants for testing the association of 240 common and rare variants at significance level α=0.05 with the disease under the additive disease model, assuming baseline penetrance of 0.01 and 2000 cases and 2000 controls.

To examine the impact of the direction of association of risk alleles with disease on the power of the tests, we randomly selected 7.5% of the variants as risk variants and 7.5% of the variants as protective variants. Figure 4 shows the power curves of the 12 statistics for testing association of 180 rare variants as a function of sample size under the additive model, and Supplementary Figure S8 shows the power curves of the 12 statistics for testing association of all 240 common and rare variants as a function of sample size under the additive model. The patterns of the power curves of the 12 statistics under the dominant, multiplicative and recessive models were similar to those in Figure 4 and Supplementary Figure S8 (data not shown). These results clearly demonstrated that the power of the SFPCA was the highest, followed by the classical FPCA, SKAT, WSS and VT. We also observed that the generalized T2, single marker test, collapsing method and CMC had almost no power to detect association of either rare variants or both common and rare variants in the presence of both risk and protective variants. It is interesting to note that the FPCA-based statistics do not assume that all variants within the genomic region being tested have the same direction of effect and do not require a testing stage to predict the direction of effect. The results showed that the SFPCA and FPCA statistics can effectively deal with the simultaneous presence of both risk and protective variants without additional computation.

Figure 4

Power of 12 statistics: the SFPCA (discretization approach) statistic, SFPCA (Fourier expansion approach) statistic, FPCA (discretization approach)-based statistic, FPCA (Fourier expansion approach)-based statistic, SKAT, multivariate PC-based statistic, WSS, VT, collapsing method, generalized T2 statistic, single marker χ2 test and CMC method (the variants with frequencies ≤0.005 were collapsed) for testing the association of 180 rare variants as a function of the sample sizes under the additive model at significance level α=0.05, assuming that 7.5% of rare variants were risk increasing variants and 7.5% of rare variants were protective variants, and baseline penetrance of 0.01.

Application to real data examples

To further evaluate their performance for testing association of rare variants, the SFPCA tests were first applied to the ANGPTL3, 4, 5 and 6 sequence and phenotype data from the Dallas Heart Study.21 The total numbers of rare variants with a minor allele frequency <0.05 in the ANGPTL3, 4, 5 and 6 genes, identified from 3553 individuals, were 49, 83, 91 and 66, respectively. Since the SFPCA method requires that each individual have at least two rare variants in the genomic region being tested, we excluded 98 individuals with only one rare variant. The total number of rare variants with a minor allele frequency <0.03 in ANGPTL4 was 71. To examine the phenotypic effects of the rare variants in the ANGPTL3, 4, 5 and 6 genes, two groups of individuals in the lowest and highest quartiles of five traits related to lipid metabolism were selected. Individuals with plasma triglyceride (Trig) levels less than or equal to the 25th percentile were classified as the lowest quartile of Trig, and individuals with plasma Trig levels greater than or equal to the 75th percentile were classified as the highest quartile of Trig. Individuals were similarly classified into the lowest and highest quartiles of high density lipoprotein cholesterol (HDL), total cholesterol, very low density lipoprotein cholesterol (VLDL) and body mass index (BMI). P-values from the SFPCA methods, the classical FPCA methods, SKAT, WSS, VT, the MPCA-based statistic, the generalized T2 statistic, the single marker χ2 test (where permutation was used to adjust for multiple testing), and the collapsing and CMC methods for testing association of 71 rare variants (MAF≤0.03) in ANGPTL4 with the five traits are summarized in Table 3. For the CMC method, variants with an allele frequency <0.005 were collapsed. The results in Table 3 clearly demonstrated that the SFPCA methods had the smallest P-values. We observed that the P-values (1.01 × 10−5 and 4.47 × 10−5) from the SFPCA-based statistics for testing association of the rare variants in ANGPTL4 with triglyceride were much smaller than the P-value (0.016) in the original study.21 In particular, we observed that only the FPCA-based statistics and SKAT identified an association of the rare variants in ANGPTL4 with HDL, and the P-values from the smoothed FPCA methods were much smaller than those from the SKAT and FPCA methods. This demonstrated that smoothing can substantially increase the power to detect association of rare variants in some cases, because the smoothed functional principal component curves fit the data more accurately. P-values from the 12 statistics for testing association of rare variants in the ANGPTL3, 5 and 6 genes with the five traits are summarized in Supplementary Tables 1–3, respectively. We observed the same pattern as that in Table 3.
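The quartile-based grouping described above can be sketched with the standard library. The percentile convention is an assumption (the text does not say how the 25th and 75th percentiles were computed), and the function name and trait values are hypothetical.

```python
import statistics


def extreme_quartile_groups(values):
    """Indices of individuals at or below the 25th and at or above the 75th percentile."""
    # Assumption: 'inclusive' quantiles, treating the data as the whole population.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low = [i for i, v in enumerate(values) if v <= q1]
    high = [i for i, v in enumerate(values) if v >= q3]
    return low, high


# Made-up trait values for nine individuals.
low, high = extreme_quartile_groups([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

The two index lists then define the two groups that are compared by the association tests for that trait.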

Table 3 P-values of 12 statistics for testing the association of rare variants in ANGPTL4 with five traits in the Dallas Heart Study

To illustrate that the SFPCA methods can be applied to common variants, we applied them to schizophrenia GWAS data downloaded from dbGaP to test the association of common variants within a genomic region. The samples were of European origin and included 1135 individuals with schizophrenia and 1362 controls, with 727 479 typed SNPs. The total number of genes tested was 13 804. The threshold for declaring genome-wide significance after Bonferroni correction was 3.6 × 10−6. The numbers of genes significantly associated with schizophrenia by 13 statistics (the SFPCA, FPCA, SKAT, collapsing, CMC, WSS, VT, single marker χ2 test, T2 test, MPCA test, linear combination test (LCT), quadratic test (QT) and de-correlation test (DT)24) are listed in Table 4. We also list the top 10 significantly associated genes identified by the SFPCA (Fourier expansion) in Table 5. Since in many cases the frequency of individuals with at least one minor allele present is close to one, the collapsing test statistic cannot be calculated, and hence its P-values are not listed in Table 5. The results clearly showed that the number of significantly associated genes identified by the SFPCA was much larger than that identified by the unsmoothed FPCA and the other statistics, and that the P-values from the SFPCA were much smaller than those from the unsmoothed FPCA and the other statistics. Therefore, smoothing provides a large improvement over the FPCA methods without smoothing. Among the genes in Table 5, PDLIM5 was reported to be associated with schizophrenia and bipolar disorder,25 CERKL with narcolepsy,26 HAAO with Parkinson’s disease27 and MTA3 with cancer.28
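The Bonferroni threshold quoted above can be checked with one division. The sketch assumes a family-wise significance level of 0.05, which the text does not state explicitly; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# 13 804 genes tested, assuming a family-wise level of 0.05.
threshold = bonferroni_threshold(0.05, 13804)
```

The result is about 3.6 × 10−6, in agreement with the threshold reported for the 13 804 genes.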

Table 4 Number of genes significantly associated with schizophrenia by 13 statistics
Table 5 P-values of the top 10 significantly associated genes identified by SFPCA

DISCUSSION

We have demonstrated here that the SFPCA statistics can be used to test association of both common and rare variants and have broad applicability to NGS data. The SFPCA statistics have several remarkable advantages over many previously proposed group tests.

The first advantage of the SFPCA is that it combines the merits of single variant analysis and group tests. The smoothed functional principal component scores take information across all variants in the genomic region into account and hence include all single variant variation. The SFPCA statistic globally compares the averages of the functional principal component scores between cases and controls. In other words, it tests the accumulated differences in all variant variation in the genomic region between cases and controls. Therefore, the SFPCA overcomes the limitations inherent in single variant analysis and group tests and effectively employs the merits of both.

The second advantage is that the SFPCA methods can efficiently use information from both risk and protective variants and allow for sign and size heterogeneity of genetic variants. In general, risk and protective variants are present at different locations in the genomic region. Information from risk and protective variants is usually reflected in different eigenfunctions and hence is included in different functional principal component scores. The SFPCA test statistic summarizes the squared differences in the smoothed functional principal component scores between cases and controls. Therefore, the opposite effects of risk and protective variants on the phenotype do not cancel each other in the SFPCA statistics. The FPCA statistics automatically take the opposite effects of the risk and protective variants on the phenotype into account and do not require additional computations. By simulations, we showed that the SFPCA test statistics had substantially higher power than the existing approaches in the presence of both risk and protective variants in the genomic region being investigated.

The third advantage is that the SFPCA statistics can be used to test the association of rare variants, common variants, or both. Empirical and theoretical studies support potential roles for both rare and common variants in complex diseases, and there is an increasing need for statistics that can test association across this range. From large-scale simulations and real data analysis, we showed that the SFPCA statistics had the correct type 1 error rate and high power in all scenarios.

The fourth advantage is that smoothing can substantially increase the accuracy with which FPCA fits the data and hence greatly improve the power to detect association of variants. The FPCA is often enhanced by the use of penalty techniques. The observed genetic variation records are not smooth. Consequently, the principal component curves often show substantial fluctuations. To reduce the variability of the principal component curves, we need to either smooth or regularize the estimated curves. The smoothing method removes the roughness in the raw principal component curves and hence improves the accuracy of the estimated functional principal component scores, which leads to improved type 1 error rates and power.

The fifth advantage is that random genetic variant function in the SFPCA is flexible. The variable xi(t) at the single variant site can take integer values to code alleles or genotypes, or real numbers to represent the number of reads of the sequences, the probability of SNP call, and the probability of the variant being functional or weights at the variant site.

NGS techniques generate extremely high-dimensional genomic data. The transition from low-dimensional data to extremely high-dimensional data demands a change in statistical methods from multivariate data analysis to functional data analysis. Functional data analysis coupled with smoothing techniques will provide a powerful tool for NGS data analysis. However, the results in this report should be considered preliminary. The number of eigenfunctions in the expansion of the genetic variant function and the penalty parameter will influence the performance of the smoothed FPCA for association studies. How to simultaneously identify the associated genomic regions and the causal variants within them, and how to select these parameters optimally in genome-wide association studies, are still open questions in practice. We are facing great challenges in developing efficient and powerful analytic platforms for association analysis of NGS data.

WEB RESOURCES

The URL for the 1000 Genomes Project data is as follows: http://www.1000genomes.org/. A program for implementing the smoothed FPCA can be downloaded from our website http://www.sph.uth.tmc.edu/hgc/faculty/xiong/index.htm.