Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models

Chiu, Chi-yang; Jung, Jeesun; Chen, Wei; Weeks, Daniel E; Ren, Haobo; Boehnke, Michael; Amos, Christopher I; Liu, Aiyi; Mills, James L; Ting Lee, Mei-ling; Xiong, Momiao; Fan, Ruzong

doi:10.1038/ejhg.2016.170

Published: 21 December 2016

European Journal of Human Genetics volume 25, pages 350–359 (2017)

Abstract

To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai–Bartlett trace, Hotelling–Lawley trace, and Wilks’s Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that multivariate analysis is in general more advantageous than univariate analysis, and that meta-analysis of multiple studies is more advantageous than analyzing the individual studies separately. The proposed models require individual observations. The value of the current paper is at least twofold: (a) the proposed methods can be applied to studies that have individual genotype data; (b) the proposed methods can be used as a benchmark for future work that uses summary statistics to build test statistics to meta-analyze the data.

Introduction

Meta-analysis of multiple studies and pleiotropy analysis of multiple traits are two areas in association studies that have recently received extensive attention in the literature.^{1, 2, 3, 4, 5, 6, 7, 8, 9, 10} To our knowledge, meta-analysis and pleiotropy analysis have so far been performed separately, and there are no gene-based meta-analysis methods that combine multiple studies and carry out a unified pleiotropy analysis. Here, multivariate functional linear models (MFLM) are developed to connect genetic variant data to multiple quantitative traits adjusting for covariates in a meta-analysis context. The goal is to take advantage of both meta-analysis and pleiotropy analysis in order to improve power and to carry out a unified analysis of multiple studies and multiple quantitative traits of complex disorders.

A noticeable feature of next-generation sequencing data is that dense panels of genetic variants are available via high-throughput sequencing technology, and so we face high-dimensional genetic data.^{11, 12, 13, 14} The genetic data can consist of rare variants, common variants, or a combination of the two, where rare variants are those whose minor allele frequencies (MAFs) are below a threshold of 0.01∼0.05. The high dimensionality of genetic data and the presence of dense rare variants raise huge challenges, and properly dealing with them has been a priority of statistical research in recent years.¹⁵

In our previous research as well as research from other groups, functional data techniques were used to reduce the dimensionality of genetic data and to build fixed effect functional regression models for association analysis of quantitative, dichotomous, and survival traits.^{10, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29} In most cases, it was shown that the functional regression test statistics perform better than the sequence kernel association test (SKAT), its optimal unified test (SKAT-O), and a combined sum test of rare and common variant effect (SKAT-C) of mixed models.^{4, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 32, 33} Specifically, mixed model-based SKAT/SKAT-O/SKAT-C performs well when (a) the number of causal variants is large and (b) each causal variant contributes a small amount to the traits, as the assumption of mixed models is satisfied under these circumstances.^{7, 21, 34} In most cases, however, fixed models perform better since the causal variants of complex disorders can be common or rare or a combination of the two and some causal variants may have relatively large effects.^{10, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27} If the number of causal variants is large and each causal variant contributes a small amount to the traits, it would be hard to show association as the power of a test can be low.³⁵ Note that SKAT and SKAT-O were shown to have higher power than burden tests, which are another main method for analyzing rare variants.^{4, 32, 36, 37, 38} Thus, fixed models can be useful in association studies of complex traits.

As functional regression models perform well in most cases, we are motivated to extend them to meta-analysis of pleiotropy traits. For individual studies, MFLM were built to perform pleiotropy analysis between multiple genetic variants and multiple quantitative traits adjusting for covariates in Wang et al.¹⁰ Similarly, functional linear models were developed to perform meta-analysis of a univariate quantitative trait in Fan et al.¹⁸ In this paper, we build MFLM to analyze multiple traits of multiple studies and introduce related approximate F-distributed test statistics to test for association based on multivariate analysis theory. The proposed methods are applied to analyze lipid traits in eight European cohorts. Simulation analysis is performed to evaluate the false-positive rates and power of the proposed tests.

Materials and methods

Consider a meta-analysis with L studies in a genomic region. For the l-th study, we assume that there are n_l individuals who are sequenced in the region at m_l variants. For each individual, we assume there are J quantitative trait phenotypes, J≥1. In this article, the research goal is to model the association between the m_l genetic variants and the J phenotypic traits by combining all L studies as a whole. We assume that the m_l variants are located at ordered physical positions $0 \le t_{l1} < \cdots < t_{lm_l}$. To simplify the notation, we normalize the region to be [0, 1]. For the i-th individual in the l-th study, let y_lij denote her/his j-th quantitative trait (j=1,2,⋯,J), $X_{li} = (x_{li}(t_{l1}), \ldots, x_{li}(t_{lm_l}))'$ denote her/his genotypes at the m_l variants, and $Z_{li}$ denote her/his vector of covariates. Hereafter, ′ denotes the transpose of a vector or matrix. For the genotypes, we assume that $x_{li}(t_{lk})$ (=0,1,2) is the number of minor alleles of individual i at the k-th variant, located at position $t_{lk}$.

Multivariate functional linear models

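We view the i-th individual’s genotype data as a genetic variant function (GVF) $X_{li}(t)$, $t \in [0, 1]$, from the l-th study. To relate the GVF to the phenotypic traits adjusting for covariates, we consider the following MFLM for $l = 1, \ldots, L$ and $i = 1, \ldots, n_l$:

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \int_0^1 \beta_l(t)\, X_{li}(t)\, dt + \varepsilon_{li}. \qquad (1)$$

The notation used in model (1) is defined as follows: $Y_{li} = (y_{li1}, \ldots, y_{liJ})'$ is the vector of the J traits, $Z_{li}$ is the vector of covariates, $\alpha_{l0}$ is a vector of overall means, $\alpha_l$ is a matrix of regression coefficients of covariates, $\beta_l(t) = (\beta_{l1}(t), \ldots, \beta_{lJ}(t))'$ is a vector of genetic effect functions $\beta_{lj}(t)$, and $\varepsilon_{li} = (\varepsilon_{li1}, \ldots, \varepsilon_{liJ})'$ is a vector of error terms. For each pair of l and i, the error vector $\varepsilon_{li}$ is normally distributed with a mean vector of zeros and a J × J variance–covariance matrix Σ. Moreover, the $\varepsilon_{li}$ are assumed to be independent.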

Expansion of Genetic Effect Function

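The genetic effect functions $\beta_{lj}(t)$ of $\beta_l(t)$ are assumed to be continuous/smooth functions of the position t. One may expand them by B-spline or Fourier basis functions. Formally, let us expand the genetic effect functions by a series of K_β basis functions $B_1(t), \ldots, B_{K_\beta}(t)$ as

$$\beta_l(t) = \Omega_l B(t), \qquad B(t) = (B_1(t), \ldots, B_{K_\beta}(t))', \qquad (2)$$

where $\Omega_l$ is a J × K_β matrix of coefficients $\beta_{ljk}$. We consider two types of basis functions: (1) the B-spline basis: $B_k(t)$, $k = 1, \ldots, K_\beta$; and (2) the Fourier basis: $B_1(t) = 1$, $B_{2r+1}(t) = \sin(2\pi r t)$, and $B_{2r}(t) = \cos(2\pi r t)$, $r = 1, \ldots, (K_\beta - 1)/2$, with K_β odd.^{39, 40, 41, 42}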
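To make the basis expansion concrete, the following R sketch evaluates a small Fourier basis and a cubic B-spline basis on a grid and forms J effect curves as a coefficient matrix times the basis vector. The helper `fourier_basis()`, the matrix `Omega`, the grid, and the small values of K_β and J are illustrative assumptions, not part of the original analysis; the ordering of sine and cosine terms is also an assumption and does not change the space spanned.

```r
library(splines)  # ships with the standard R distribution

# Hypothetical helper: constant term plus cosine/sine pairs, K must be odd
fourier_basis <- function(t, K) {
  stopifnot(K %% 2 == 1)
  B <- matrix(1, nrow = length(t), ncol = K)
  for (r in seq_len((K - 1) / 2)) {
    B[, 2 * r]     <- cos(2 * pi * r * t)
    B[, 2 * r + 1] <- sin(2 * pi * r * t)
  }
  B  # rows: positions, columns: basis functions
}

t_grid <- seq(0, 1, length.out = 101)
K_beta <- 5   # illustrative; much smaller than in a real analysis
J      <- 3   # number of traits

B_fourier <- fourier_basis(t_grid, K_beta)
# Cubic (order 4) B-spline basis with K_beta functions on [0, 1]
B_spline <- unclass(bs(t_grid, df = K_beta, degree = 3, intercept = TRUE,
                       Boundary.knots = c(0, 1)))

set.seed(1)
Omega  <- matrix(rnorm(J * K_beta), J, K_beta)  # made-up J x K_beta coefficients
beta_t <- Omega %*% t(B_fourier)                # J x 101: each row is one effect curve
```

Each row of `beta_t` is one trait's genetic effect function evaluated on the grid, so testing whether all entries of the coefficient matrix are zero is the same as testing whether all J curves vanish.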

Estimation of GVF

To estimate the GVFs $X_{li}(t)$ from the genotypes $X_{li}$, we use an ordinary least squares smoother.^{16, 17, 18, 19, 20, 42, 43} Let φ_k(t), k=1, ⋯, K, be a series of K basis functions, such as the B-spline basis and Fourier basis functions. Denote φ(t)=(φ₁(t), ⋯, φ_K(t))′. Let Φ denote the m_l by K matrix containing the values $\phi_k(t_{lj})$, where $j \in \{1, \ldots, m_l\}$. Using the discrete realizations $X_{li}$, we may estimate the GVF $X_{li}(t)$ using an ordinary least squares smoother as follows:⁴²

$$\hat{X}_{li}(t) = (x_{li}(t_{l1}), \ldots, x_{li}(t_{lm_l}))\, \Phi [\Phi'\Phi]^{-1} \phi(t). \qquad (3)$$
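The smoothing step can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions: the positions, the simulated genotypes, the choice of a cubic B-spline basis with equally spaced knots, and the names `phi`, `coef_x` and `x_hat` are all hypothetical and are not taken from the original analysis.

```r
library(splines)  # part of the standard R distribution

m <- 30; K <- 8
pos <- seq(0, 1, length.out = m)          # variant positions scaled to [0, 1]
set.seed(2)
x <- rbinom(m, size = 2, prob = 0.1)      # simulated minor-allele counts (0, 1, 2)

# Cubic B-spline basis phi(t) with K functions and fixed, equally spaced knots
knots <- seq(0, 1, length.out = K - 2)[-c(1, K - 2)]   # K - 4 interior knots
phi <- function(t) unclass(bs(t, knots = knots, degree = 3, intercept = TRUE,
                              Boundary.knots = c(0, 1)))

Phi    <- phi(pos)                                   # m x K matrix of phi_k(t_j)
coef_x <- solve(crossprod(Phi), crossprod(Phi, x))   # (Phi'Phi)^{-1} Phi' x
t_grid <- seq(0, 1, length.out = 201)
x_hat  <- drop(phi(t_grid) %*% coef_x)               # smoothed GVF on a grid
```

Fixing the knots explicitly, rather than letting `bs()` pick them from the data, keeps the basis identical when it is evaluated at the variant positions and on the plotting grid.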

Revised MFLM

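Replacing $\beta_l(t)$ by the expansion (2) and $X_{li}(t)$ in the MFLM (1) by $\hat{X}_{li}(t)$ in (3), we have a revised multivariate linear regression model

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \Omega_l W_{li} + \varepsilon_{li}, \qquad (4)$$

where $W_{li} = \left[\int_0^1 B(t)\phi'(t)\,dt\right][\Phi'\Phi]^{-1}\Phi' X_{li}$, and $Y_{li}$, $\alpha_{l0}$, $\alpha_l$, $Z_{li}$, and $\varepsilon_{li}$ are the trait vector, overall means, covariate coefficients, covariates, and errors of model (1). In the above revised regression model, one needs to calculate $[\Phi'\Phi]^{-1}\Phi'$ and $\int_0^1 B(t)\phi'(t)\,dt$ to get $W_{li}$. In the statistical computing environment R, there are readily available packages to calculate them.⁴³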
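The step from the functional model to the revised regression can be written out explicitly. The derivation below is a sketch under two assumptions: that the expansion of the effect functions has the form $\beta_l(t) = \Omega_l B(t)$ with $B(t)$ the K_β-vector of basis functions, and that the smoothed GVF is the least squares fit $\hat{X}_{li}(t) = \phi(t)'[\Phi'\Phi]^{-1}\Phi' X_{li}$; the label $W_{li}$ for the resulting regressor is used here for convenience.

```latex
\begin{aligned}
\int_0^1 \beta_l(t)\,\hat{X}_{li}(t)\,dt
  &= \int_0^1 \Omega_l B(t)\,\phi(t)'\,[\Phi'\Phi]^{-1}\Phi' X_{li}\,dt \\
  &= \Omega_l \left[\int_0^1 B(t)\,\phi(t)'\,dt\right][\Phi'\Phi]^{-1}\Phi' X_{li}
   \;=\; \Omega_l W_{li}.
\end{aligned}
```

Under these assumptions the integral term reduces to a linear term in the K_β-vector $W_{li}$, so the functional model becomes an ordinary multivariate linear regression with K_β genetic regressors per individual, regardless of how many variants m_l were sequenced.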

Dealing with missing genotype data

If some genotype data are missing, the estimation (3) can be modified to estimate the GVF $X_{li}(t)$. For instance, suppose there is no genotype information at the first variant for the i-th individual, ie, we only have $(x_{li}(t_{l2}), \ldots, x_{li}(t_{lm_l}))$. Let Φ₁ denote the $(m_l - 1)$ by K matrix containing the values $\phi_k(t_{lj})$, where $j \in \{2, \ldots, m_l\}$. Then, we may revise the estimation (3) as

$$\hat{X}_{li}(t) = (x_{li}(t_{l2}), \ldots, x_{li}(t_{lm_l}))\, \Phi_1 [\Phi_1'\Phi_1]^{-1} \phi(t). \qquad (5)$$

Note that the estimation (5) only depends on the available genotype data $(x_{li}(t_{l2}), \ldots, x_{li}(t_{lm_l}))$. Hence, each individual’s GVF is estimated from his/her own data. This is one advantage of functional data analysis, which can be useful in practice. Using the estimation (5), one may revise the model (4) accordingly.
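The per-individual handling of missing genotypes can be sketched as follows: rows of the basis matrix that correspond to missing genotypes are dropped before the least squares fit. The function name `gvf_coef()`, the basis settings, and the toy genotype vector are illustrative assumptions.

```r
library(splines)  # part of the standard R distribution

m <- 30; K <- 8
pos   <- seq(0, 1, length.out = m)                     # scaled variant positions
knots <- seq(0, 1, length.out = K - 2)[-c(1, K - 2)]   # K - 4 interior knots
Phi <- unclass(bs(pos, knots = knots, degree = 3, intercept = TRUE,
                  Boundary.knots = c(0, 1)))           # m x K basis matrix

# Hypothetical helper: smoother coefficients from the non-missing genotypes only
gvf_coef <- function(x, Phi) {
  ok <- !is.na(x)
  Phi_ok <- Phi[ok, , drop = FALSE]   # plays the role of Phi_1 in the text
  drop(solve(crossprod(Phi_ok), crossprod(Phi_ok, x[ok])))
}

x <- rep(c(0, 1, 0, 0, 2, 0), length.out = m)  # toy minor-allele counts
x[1] <- NA                                      # first variant is missing
coef_i <- gvf_coef(x, Phi)
```

Because the fit uses only that individual's observed positions, no imputation is needed, although the fit requires enough observed positions for the K × K cross-product matrix to remain invertible.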

Beta-smooth-only MFLM

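Model (1) is a theoretical MFLM.⁴² For analysis of dense genetic data, one may use a simplified MFLM as follows:

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \sum_{k=1}^{m_l} \beta_l(t_{lk})\, x_{li}(t_{lk}) + \varepsilon_{li}, \qquad (6)$$

where $\beta_l(t_{lk})$ is a vector of the genetic effects at position $t_{lk}$ for the l-th study, $x_{li}(t_{lk})$ is the genotype of individual i at that position, and the other terms (trait vector $Y_{li}$, overall means $\alpha_{l0}$, covariate coefficients $\alpha_l$, covariates $Z_{li}$, and errors $\varepsilon_{li}$) are the same as those in the general MFLM (1).

We assume that $\beta_l(t)$ is a vector of genetic effect functions of the physical position t. Therefore, the $\beta_l(t_{lk})$ in model (6) are the values of the vector $\beta_l(t)$ at the physical positions $t_{lk}$. The genetic effect functions are assumed to be smooth. One may expand them by B-spline or Fourier basis functions. Replacing $\beta_l(t_{lk})$ by expansion (2), model (6) can be revised as

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \Omega_l W_{li} + \varepsilon_{li}, \qquad (7)$$

where $W_{li} = \sum_{k=1}^{m_l} B(t_{lk})\, x_{li}(t_{lk})$, with $B(t)$ the vector of K_β basis functions and $\Omega_l$ the J × K_β coefficient matrix of expansion (2). In model (6) and its revised version (7), we use the raw genotype data directly in the analysis. The genetic effect vector $\beta_l(t)$ is assumed to be smooth or continuous. Hence, the models are called beta-smooth only.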
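For the beta-smooth only model, the genetic regressors are a weighted sum of basis vectors evaluated at the variant positions, which amounts to one matrix product. The sketch below assumes a cubic B-spline basis with equally spaced knots and uses simulated genotypes; `G`, `B`, and `W` are illustrative names.

```r
library(splines)  # part of the standard R distribution

set.seed(3)
n <- 50; m <- 40; K_beta <- 7
pos <- sort(runif(m))                                # scaled variant positions
G   <- matrix(rbinom(n * m, 2, 0.05), n, m)          # n x m genotype matrix

knots <- seq(0, 1, length.out = K_beta - 2)[-c(1, K_beta - 2)]
B <- unclass(bs(pos, knots = knots, degree = 3, intercept = TRUE,
                Boundary.knots = c(0, 1)))           # m x K_beta: B(t_k)' in row k

W <- G %*% B   # row i equals sum_k x_i(t_k) B(t_k)', the regressors for individual i
```

The m genotype columns are thus compressed to K_β regressors per individual, which is where the dimension reduction comes from when m is much larger than K_β.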

Dealing with Missing Genotype Data

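If some genotype data are missing, eg, we only have $x_{li}(t_{l2}), \ldots, x_{li}(t_{lm_l})$ and $x_{li}(t_{l1})$ is missing, we may revise the MFLM (6) as

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \sum_{k=2}^{m_l} \beta_l(t_{lk})\, x_{li}(t_{lk}) + \varepsilon_{li}. \qquad (8)$$

Again, the revised MFLM (8) only depends on the available genotype data $x_{li}(t_{l2}), \ldots, x_{li}(t_{lm_l})$, and it can be revised accordingly to the form of model (7) by expansion (2) as

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \Omega_l \sum_{k=2}^{m_l} B(t_{lk})\, x_{li}(t_{lk}) + \varepsilon_{li}.$$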

Traditional additive effect multivariate linear models

Traditionally, an additive effect model can be used to analyze the relation between the traits and the variants in the l-th study, as in Jung et al.⁴⁴ and Anderson⁴⁵

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \Omega_l X_{li} + \varepsilon_{li}, \qquad (9)$$

where $X_{li} = (x_{li}(t_{l1}), \ldots, x_{li}(t_{lm_l}))'$ is the genotype vector, $\Omega_l$ is a J × m_l matrix of coefficients $\beta_{ljk}$, which is the additive genetic effect of variant k for the j-th trait in the l-th study, and the other terms are similar to those in the MFLM (1) and (6). There is only one difference between model (6) and model (9), ie, the genetic effect coefficients $\beta_{ljk}$ in model (9) do not depend on the physical position $t_{lk}$, whereas $\beta_l(t_{lk})$ in model (6) depends on the physical position $t_{lk}$.

Approximate F-distributed test statistics

Consider the revised regression models (4), (7), and the multivariate linear model (9), which model the genetic effect on the J phenotypic traits simultaneously, adjusting for covariates, by combining the L studies. First, assume that the genetic effects among the L studies are different/heterogeneous. To test for association between the genetic variants and the J quantitative traits simultaneously, the null hypothesis is $H_0: \Omega_l = O$ for all $l = 1, \ldots, L$, where O is a zero J × K_β matrix $O_{J \times K_\beta}$ for models (4) and (7) or a zero J × m_l matrix $O_{J \times m_l}$ for model (9). We may test the null H₀ by approximate F-distribution tests based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks’s Lambda using standard statistical approaches.^{45, 46} The resulting approximate F-distributed test statistic is referred to as the heterogeneous F-approximation test statistic (Het-F).

Consider the revised models (4) and (7). If the genetic effects are homogeneous, ie, $\Omega_1 = \cdots = \Omega_L = \Omega$, we may test the association between the genetic variants and the J quantitative traits by testing the simplified null $H_0: \Omega = O_{J \times K_\beta}$. The null H₀ can be tested by approximate F-distribution tests based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks’s Lambda using standard statistical approaches. The approximate F-distributed statistic is denoted as Hom-F.

Assume that each individual of the L studies is sequenced at the same variants located at 0≤t₁<⋯<t_m and so $m_1 = \cdots = m_L = m$. In addition, assume that the genetic effects are homogeneous. Let us denote $\Omega = \Omega_1 = \cdots = \Omega_L$. Then, the model (9) is simplified as

$$Y_{li} = \alpha_{l0} + \alpha_l' Z_{li} + \Omega X_{li} + \varepsilon_{li}. \qquad (10)$$

The null hypothesis of no association between the genetic variants and the quantitative traits is $H_0: \Omega = O_{J \times m}$. The corresponding approximate F-distributed test statistic is also denoted as Hom-F.

If there is only one study, ie, L=1, the approximate F-distribution tests are equivalent to those of Wang et al.¹⁰ and Het-F is the same as Hom-F. If we only have one quantitative trait, ie, J=1, the three approximate F-distribution tests based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks’s Lambda are equivalent to the F-test statistics of the standard multiple linear regression. The models proposed in this article and the related approximate F-distribution tests extend the models and the F-test statistics in Fan et al.¹⁸

In practice, we find that the results of the three approximate F-distribution tests based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks’s Lambda are similar to each other.¹⁰ In this article, we only report the results of approximate F-distribution tests based on Pillai-Bartlett trace.
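As an illustration of how such tests can be run, the sketch below fits nested multivariate linear models with base R's `lm()` and compares them with `anova(..., test = "Pillai")`. It is a minimal sketch on simulated null data, not the authors' implementation: the stand-in regressor matrix `W`, the single covariate `z`, the sample sizes, and all object names are assumptions.

```r
set.seed(4)
n <- 300; J <- 3; K_beta <- 4
study <- factor(rep(1:3, each = n / 3))        # L = 3 studies
z <- rnorm(n)                                   # one covariate
W <- matrix(rnorm(n * K_beta), n, K_beta)       # stand-in for the genetic regressors
Y <- matrix(rnorm(n * J), n, J)                 # traits simulated with no genetic effect

# Study-specific copies of W: block l is W for study l and zero elsewhere
W_het <- cbind(W * (study == 1), W * (study == 2), W * (study == 3))

fit0   <- lm(Y ~ study + study:z)               # null: intercepts and covariates only
fitHom <- lm(Y ~ study + study:z + W)           # one Omega shared by all studies
fitHet <- lm(Y ~ study + study:z + W_het)       # study-specific Omega_l

p_hom <- anova(fit0, fitHom, test = "Pillai")[2, "Pr(>F)"]     # Hom-F style test
p_het <- anova(fit0, fitHet, test = "Pillai")[2, "Pr(>F)"]     # Het-F style test
p_homogeneity <- anova(fitHom, fitHet, test = "Pillai")[2, "Pr(>F)"]
```

Replacing `"Pillai"` with `"Wilks"` or `"Hotelling-Lawley"` gives the other two approximations. The last comparison contrasts the homogeneous and heterogeneous fits, which is one way to examine whether a common coefficient matrix is adequate.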

Parameters of Functional Data Analysis

In the data analysis and simulations, we used two functions from the fda R package to create the basis:

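library(fda)
basis = create.bspline.basis(norder = order, nbasis = bbasis)   # B-spline basis on the default range [0, 1]
basis = create.fourier.basis(c(0, 1), nbasis = fbasis)          # alternative: Fourier basis on [0, 1]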

The three parameters were taken as order=4, bbasis=15, fbasis=25 in all data analysis and simulations. To make sure that the results are valid and stable, we tried a wide range of parameters: (1) 10≤K=K_β≤23 for the heterogeneous genetic effect model and (2) 10≤K=K_β≤29 for the homogeneous genetic effect model. The results are similar to each other.

Results

A simulation study

To evaluate the performance of the proposed MFLM, we carried out simulation analyses for two cases: (1) the variants are all rare; (2) some variants are rare and some are common. Simulations were performed for three scenarios listed in Table 4 in Supplementary Materials.^{4, 18} For scenarios 1 and 2, we used the European-like (EUR) sequence data used in Lee et al.³² For scenario 3, we used both the EUR and African-American-like (AA) sequence data. Specifically, the EUR sequence data were generated using COSI’s calibrated best-fit models, and the generated European haplotypes mimic CEPH Utah individuals with ancestry from northern and western Europe in terms of site frequency spectrum and linkage disequilibrium (LD) pattern (Figure 4 in Schaffner et al.^{47, 48}). Similarly, the AA sequence data mimic individuals with a 20:80 mixture of Europeans and Africans, together with parameters calibrated to model realistic demographic history (including bottleneck, population expansion, and migration events). The EUR sequence data included 10 000 chromosomes covering 1 Mb regions, and the AA sequence data included 45 000 chromosomes covering 0.1 Mb regions. Genetic regions of 3 kb length were randomly selected in the simulations for type I error and power calculations.

Type I error simulations

To evaluate the type I error rates of the proposed MFLM and related tests, we generated phenotype data sets by using the model

Three scenarios of covariates are given in Supplementary Table S1, in which three covariates are considered: z₁ is a dichotomous covariate taking values 0 and 1 with a probability of 0.5, z₂ and z₃ are continuous covariates from a standard normal distribution N(0,1). The vector of error terms in model (11) follows a normal distribution with a mean vector of 0 and a 3 × 3 variance–covariance matrix

The 3 × 3 variance–covariance matrix Σ is taken from an empirical analysis of the three traits of the Trinity Students Study from Wang et al.¹⁰ For scenario 1 in Supplementary Table S1, the covariate regression coefficients are given by

For scenarios 2 and 3 in Supplementary Table S1, the covariate regression coefficients are given by

To obtain genotype data, 3 kb subregions were randomly selected in the 1 Mb region of EUR-like data and the 0.1 Mb region of AA-like data. The ordered genotypes were these SNPs in the 3 kb subregions. Note that the trait values are not related to the genotypes, and so the null hypothesis holds. The sample sizes were 1600 (study 1), 2200 (study 2), and 3200 (study 3). The simulation settings are summarized in Supplementary Table S1. For each sample size combination, 1.2 × 10⁶ phenotype–genotype data sets were generated to fit the proposed models and to calculate the test statistics and related P-values. Then, an empirical type I error rate was calculated as the proportion of 1.2 × 10⁶ P-values that were smaller than a given α level (ie, 0.05, 0.01, 0.001, and 0.0001, respectively).
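The empirical type I error rate described here is simply the fraction of null P-values below each α level. The R sketch below shows that calculation on a small toy simulation; the helper `empirical_rate()`, the 500 replicates, the sample size, and the random stand-in regressors are illustrative assumptions and are far smaller than the 1.2 × 10⁶ data sets used in the paper.

```r
# Hypothetical helper: proportion of P-values below each alpha level
empirical_rate <- function(p_values, alpha = c(0.05, 0.01, 0.001, 0.0001)) {
  rates <- sapply(alpha, function(a) mean(p_values < a))
  names(rates) <- as.character(alpha)
  rates
}

set.seed(5)
p_null <- replicate(500, {
  n <- 200
  z <- rnorm(n)                           # covariate
  W <- matrix(rnorm(n * 3), n, 3)         # stand-in genetic regressors
  Y <- cbind(rnorm(n), rnorm(n))          # two traits unrelated to W: the null holds
  anova(lm(Y ~ z), lm(Y ~ z + W), test = "Pillai")[2, "Pr(>F)"]
})

rates <- empirical_rate(p_null)
```

With only 500 replicates the smaller α levels cannot be estimated reliably, which is why a very large number of data sets is needed to check levels such as 0.0001.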

Empirical power simulations

To evaluate the power of the proposed MFLM and related tests, we simulated data sets under the alternative hypothesis by randomly selecting 3 kb subregions to obtain causal variants for the phenotype values as follows. Once a 3 kb subregion was selected, a subset of causal variants located in the 3 kb subregion for the l-th study was then randomly selected to obtain ordered genotypes . Then, we generated the quantitative traits by

where the covariate and error terms are the same as in the type I error model (11), and the βs are additive effects for the causal variants defined as follows. We used , where MAF_k was the MAF of the k-th variant. Three genetic effect scenarios were used to perform power calculations: (1) all causal variants had positive effects; (2) 20%/80% of causal variants had negative/positive effects; (3) 50%/50% of causal variants had negative/positive effects. As in Fan et al.¹⁸ and Lee et al.,⁴ three different settings were considered: 5, 10, and 20% of variants in the 3 kb subregion were chosen as causal variants. When 5, 10, and 20% of the variants were causal, two parameter settings of genetic effects were considered for the constants c_l of the l-th study: (1) homogeneous and (2) heterogeneous (Supplementary Table S2). In the homogeneous case, the genetic effects are the same for the three studies, ie, c₁=c₂=c₃. In the heterogeneous case, the genetic effects are different for the three studies, ie, c₂=c₁+(0.15,0.15,0.15), c₃=c₁−(0.15,0.15,0.15). For each setting, 1000 data sets were simulated to calculate empirical power as the proportion of P-values that are smaller than the α=0.0001 level.

Type I error simulation results

The empirical type I error rates are reported in Supplementary Table S3 when the variants are only rare and in Supplementary Table S4 when some variants are rare and some are common. For each entry of empirical type I error rates, we generated 1.2 × 10⁶ data sets. Results at four α levels (0.05, 0.01, 0.001, and 0.0001) are reported. For the proposed approximate F-distributed test statistics of MFLM (4) and (7) and additive model (9), all empirical type I error rates are around the nominal α levels for both the B-spline basis and the Fourier basis (columns 5–9 of Supplementary Tables S3 and S4). Therefore, the approximate F-distributed test statistics of MFLM controlled type I error rates correctly for all scenarios at all significance levels. The MFLM and related approximate F-distributed test statistics can be useful in both whole-genome and whole-exome association studies.

Power results

We compared the power of the F-test for a univariate trait with that of the approximate F-distributed tests for bivariate and trivariate traits, based on the simulated COSI sequence data. The empirical power levels of the test statistics at the α=0.0001 level are plotted in Figures 1 and 2. In the figures, 20%/80% of the causal variants had negative/positive effects for each trait. In the legends of all figures, ‘beta’ means that the power level is from the beta-smooth only model (7), and ‘add’ means that the power level is from the additive model (9). Figure 1 reports the results of ‘Hom-F’, for which the approximate F-distributed statistics were constructed using the homogeneous effect model (7) and the data were generated using the homogeneous models (Supplementary Table S2). Since the genotype data differ from study to study, there are no power levels for the homogeneous additive model (10) in Figure 1. Figure 2 reports the results of ‘Het-F’, for which the approximate F-distributed statistics were constructed using the heterogeneous effect models (7) and (9) and the data were generated using the heterogeneous models (Supplementary Table S2). Therefore, ‘correct models’ were used to analyze the simulated data in Figures 1 and 2.

In general, the power levels of F-test of the univariate y₁ trait are the lowest, the power levels of approximate F-distributed tests of the bivariate (y₁,y₂) trait are in the middle, and the power levels of approximate F-distributed tests of the trivariate (y₁,y₂,y₃) trait are the highest for either beta-smooth only model (7) or additive model (9) in Figures 1 and 2. Therefore, it makes sense to perform multivariate analysis of pleiotropy traits.

Meta-analysis of lipid traits in eight European cohorts

Lipid traits from eight European cohorts were analyzed: five from Finland (FUSION Stage 2, D2d-2007, DPS, METSIM, and DRs EXTRA), two from Norway (HUNT and Tromso), and one from Germany (DIAGEN). The two Norwegian cohorts are combined into one study for these analyses. The genotype data were generated using the Metabochip, which was designed to fine map regions that have been associated with metabolic traits.⁴⁹ For each cohort, 54 741 genetic variants were genotyped.

For our analysis, we utilized the existing literature as a reference for gene selection and found that 22 gene regions were fine mapped.⁵ We used Builder Mar. 2006 (NCBI36/hg18) to determine gene positions and 5 kb was used to extend the gene region on each side of a gene. The summary of 22 genes and the number of genetic variants in each region are given in Supplementary Table S5, Supplementary Materials. Four lipid traits were analyzed: high-density lipoprotein levels, low-density lipoprotein (LDL) levels, triglycerides (TG), and total cholesterol (CHOL). The sample sizes for each trait are provided in Supplementary Table S6, Supplementary Materials. For each trait, inverse normal rank transformation was performed to make sure that normality holds. For all studies except for METSIM, age, sex, and type 2 diabetes status were used as covariates. For METSIM, age and type 2 diabetes status were used as covariates since no females were included in the study. A significance threshold of P<3.1 × 10⁻⁶ was taken from Liu et al.⁵ (corresponding to 0.05/16 153 and allowing for the number of genes tested therein).
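Two of the preprocessing steps above are easy to illustrate in R: the rank-based inverse normal transformation of a trait and the Bonferroni-style threshold. The function `inverse_normal()` and its rank offset of 0.5 are assumptions for illustration, since the exact variant of the transformation is not stated here; the trait values are made up.

```r
# Hypothetical helper: rank-based inverse normal transformation.
# The offset 0.5 is one common convention, assumed here for illustration.
inverse_normal <- function(x, offset = 0.5) {
  r <- rank(x, na.last = "keep")          # average ranks for ties, NA kept as NA
  n <- sum(!is.na(x))
  qnorm((r - offset) / (n - 2 * offset + 1))
}

trait <- c(5.2, 1.1, NA, 3.3, 9.8)        # made-up lipid values with one missing
trait_int <- inverse_normal(trait)

# Significance threshold quoted in the text: 0.05 divided by 16 153 tests
threshold <- 0.05 / 16153                 # about 3.1e-06
```

The transformation preserves the ordering of the observed values while mapping them to normal quantiles, which is what the normality assumption on the error terms requires.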

Using homogeneous F-approximation test statistics (Hom-F) based on Pillai-Bartlett trace, Table 1 reports results of three-trait and four-trait meta-analysis of lipid traits in European studies. For each combination of three to four traits, we observed association at five genes of APOB, APOE, LDLR, LPL, and PCSK9. For each of the five genes, we observed association for some of the traits in one-trait meta-analysis by homogeneous models (Table 1 of Fan et al.¹⁸ presented in Supplementary Table S7 in the Supplementary Materials). The results of two-trait meta-analysis of the lipid traits are presented in Supplementary Table S8, and association is observed for each of the five genes for some of the two-trait combinations.

Table 1 Three-trait and four-trait meta-analysis of lipid traits in European studies using Hom-F based on Pillai-Bartlett trace

Table 2 Three-trait meta-analysis of lipid traits in European studies using Het-F based on Pillai-Bartlett trace

Table 3 Three-trait meta-analysis of lipid traits in European studies using Het-F based on Pillai-Bartlett trace

Table 4 Four-trait meta-analysis of lipid traits in European studies using Het-F based on Pillai-Bartlett trace

In study-based pleiotropy analysis of Wang et al.,¹⁰ which analyzes each data set separately, association was observed at only two genes, APOE and LDLR, in some studies (Supplementary Table S12 in the Supplementary Materials from Table 1 of Wang et al.¹⁰). Thus, it is more advantageous to perform meta-analysis of multiple studies.

Discussion

Here we develop MFLM for meta-analysis of multiple quantitative traits adjusting for covariates. On the basis of the MFLM, approximate F-distributed statistics of Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks’s Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to show that the approximate F-distributed tests control the false-positive rates accurately. By evaluating power performance, it is shown that it can be advantageous to perform the proposed pleiotropy analysis instead of individual trait analysis.^{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 27, 44} Among other merits, the MFLM can handle missing genotype data naturally.

The proposed methods were used to analyze four lipid traits in eight European cohorts. When we use the homogeneous MFLM to analyze three traits and four traits together, association is observed at five genes of APOB, APOE, LDLR, LPL, and PCSK9. For each of the five genes, we only observed association for some traits in one-trait meta-analysis and two-trait meta-analyses (Table 1 of Fan et al.¹⁸ presented in Supplementary Table S7 and Supplementary Table S8 in the Supplementary Materials). Similarly, the proposed heterogeneous MFLM detected more and stronger association signals by three-trait or four-trait analysis than one-trait or two-trait analysis.

One special feature of MFLM is that functional data analysis techniques are used to reduce the dimensionality of the next-generation sequencing data.^{39, 40, 41, 42, 43} The key idea is that the multiple genetic variants of an individual are treated as a realization of an underlying stochastic process.⁵⁰ Therefore, the genome of an individual is viewed as a continuous stochastic function that contains both genetic position and LD information of the genetic markers. In real data analysis, one may test whether the genetic effects are heterogeneous or homogeneous, ie, test H₀: Ω₁=· · ·=Ω_L=Ω. If H₀ is rejected, the genetic effects are treated as heterogeneous; otherwise, there is no evidence against homogeneity and the homogeneous model may be used.

In linkage analysis, it is well known that the genetic data are treated as functions of the recombination fraction^{51, 52} to order genes along a chromosome.⁵³ Thus, it is reasonable and desirable to treat genetic data as functions. In linkage analysis, one needs to estimate the recombination fractions based on pedigree data. In next-generation sequencing data, the physical positions in terms of base pairs are available in almost all studies and one does not need to estimate them. However, in association studies, the genetic data are usually treated as discrete and the physical positions are simply ignored in most of the literature, except in recent functional regression models.^{10, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29} Our functional regression models provide a way to properly utilize the physical positions in gene-based association studies.

In genetic meta-analysis, summary statistics from different studies are usually used to meta-analyze the data, as individual data are not always available.^{5, 54} In our case, individual-level genetic data from the European cohorts are available for analysis. Therefore, we build our MFLM using the individual-level data. If only summary statistics of functional regression models are available from different studies, it is still an open question whether those statistics can be used to meta-analyze the data. It is known that meta-analysis using individual data is advantageous over meta-analysis of summary statistics in non-genetics studies.^{55, 56, 57} It would be interesting to evaluate the pros and cons of the two approaches in genetic association analysis in future studies. Note that the functional regressions are simply ordinary regressions after revising the theoretical functional models by functional data analysis techniques, and so the usual meta-analysis strategy would be useful.⁵⁴ It should be possible to use results from functional regression models for a meta-analysis across cohorts. However, the details await further work.

References

1. Gianola D, de los Campos G, Toro MA, Naya H, Schön CC, Sorensen D : Do molecular markers inform about pleiotropy? Genetics 2015; 201: 23–29.
2. Guo X, Liu Z, Wang X, Zhang H : Genetic association test for multiple traits at gene level. Genet Epidemiol 2013; 37: 122–129.
3. Jia Y, Jannink JL : Multiple trait genomic selection methods increase genetic value prediction accuracy. Genetics 2012; 192: 1513–1522.
4. Lee S, Teslovich TM, Boehnke M, Lin X : General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet 2013; 93: 42–53.
5. Liu DJ, Peloso GM, Zhan X et al: Meta-analysis of gene-level tests for rare variant association. Nat Genet 2014; 46: 200–204.
6. Maity A, Sullivan PF, Tzeng JY : Multivariate phenotype association analysis by marker set kernel machine regression. Genet Epidemiol 2012; 36: 686–695.
7. Broadaway KA, Cutler DJ, Duncan R et al: A statistical approach for testing cross-phenotype effects of rare variants. Am J Hum Genet 2016; 98: 525–540.
8. Maier R, Moser G, Chen GB et al: Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am J Hum Genet 2015; 96: 283–294.
9. Van der Sluis S, Dolan CV, Li J et al: MGAS: a powerful tool for multivariate gene-based genome-wide association analysis. Bioinformatics 2015; 31: 1007–1015.
10. Wang YF, Liu AY, Mills JL et al: Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet Epidemiol 2015; 39: 259–275.
11. Mardis ER : Next-generation DNA sequencing methods. Annu Rev Genom Hum Genet 2008; 9: 387–402.
12. Metzker ML : Sequencing technologies: the next generation. Nat Rev Genet 2010; 11: 31–34.
13. Rusk N, Kiermer V : Primer: sequencing, the next generation. Nat Methods 2008; 5: 15.
14. Shendure J, Ji H : Next-generation DNA sequencing. Nat Biotechnol 2008; 26: 1135–1145.
15. Bansal V, Libiger O, Torkamani A, Schork NJ : Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet 2010; 11: 773–785.
16. Fan RZ, Wang YF, Mills JL, Wilson AF, Bailey-Wilson JE, Xiong MM : Functional linear models for association analysis of quantitative traits. Genet Epidemiol 2013; 37: 726–742.
17. Fan RZ, Wang YF, Mills JL et al: Generalized functional linear models for case-control association studies. Genet Epidemiol 2014; 38: 622–637.
18. Fan RZ, Wang YF, Boehnke M et al: Gene level meta-analysis of quantitative traits by functional linear models. Genetics 2015; 200: 1089–1104.
19. Fan RZ, Wang YF, Chiu CY et al: Meta-analysis of complex diseases at gene level by generalized functional linear models. Genetics 2016; 202: 457–470.
20. Fan RZ, Wang YF, Qi Y et al: Gene-based association analysis for censored traits via functional regressions. Genet Epidemiol 2016; 40: 133–143.
21. Fan RZ, Chiu CY, Jung JS et al: A comparison study of fixed and mixed effect models for gene level association studies of complex traits. Genet Epidemiol 2016; 40: 702–721.
22. Luo L, Boerwinkle E, Xiong MM : Association studies for next-generation sequencing. Genome Res 2011; 21: 1099–1108.
23. Luo L, Zhu Y, Xiong MM : Quantitative trait locus analysis for next-generation sequencing with the functional linear models. J Med Genet 2012; 49: 513–524.
24. Luo L, Zhu Y, Xiong MM : Smoothed functional principal component analysis for testing association of the entire allelic spectrum of genetic variation. Eur J Hum Genet 2013; 21: 217–224.
25. Svishcheva GR, Belonogova NM, Axenovich TI : Region-based association test for familial data under functional linear models. PLoS ONE 2015; 10: e0128999.
26. Vsevolozhskaya OA, Zaykin DV, Greenwood MC, Wei C, Lu Q : Functional analysis of variance for association studies. PLoS ONE 2014; 9: e105074.
27. Vsevolozhskaya OA, Zaykin DV, Barondess DA, Tong X, Jadhav S, Lu Q : Uncovering local trends in genetic effects of multiple phenotypes via functional linear models. Genet Epidemiol 2016; 40: 210–221.
28. Zhang F, Boerwinkle E, Xiong MM : Epistasis analysis for quantitative traits by functional regression models. Genome Res 2014; 24: 989–998.
29. Zhao JY, Zhu Y, Xiong MM : Genome-wide gene-gene interaction analysis for next-generation sequencing. Eur J Hum Genet 2016; 24: 421–428.
30. Chen H, Lumley T, Brody J et al: Sequence kernel association test for survival traits. Genet Epidemiol 2014; 38: 191–197.
31. Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X : Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet 2013; 92: 841–853.
32. Lee S, Emond MJ, Bamshad MJ et al: Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet 2012; 91: 224–237.
33. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X : Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011; 89: 82–93.
34. Fisher RA : The correlation between relatives on the supposition of Mendelian inheritance. Philos Trans R Soc Ed 1918; 52: 399–433.
35. Zuk O, Schaffner SF, Samocha K et al: Searching for missing heritability: designing rare variant association studies. Proc Natl Acad Sci USA 2014; 111: E455–E464.
36. Li B, Leal SM : Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008; 83: 311–321.
37. Madsen BE, Browning SR : A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009; 5: e1000384.
38. Morris AP, Zeggini E : An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 2010; 34: 188–193.
39. de Boor C : A Practical Guide to Splines, Revised Version. New York, NY, USA: Springer, 2001.
40. Ferraty F, Romain Y : The Oxford Handbook of Functional Data Analysis. New York, NY, USA: Oxford University Press, 2010.
41. Horváth L, Kokoszka P : Inference for Functional Data With Applications. New York, NY, USA: Springer, 2012.
42. Ramsay JO, Silverman BW : Functional Data Analysis, 2nd edn. New York, NY, USA: Springer, 2005.
43. Ramsay JO, Hooker G, Graves S : Functional Data Analysis With R and Matlab. New York, NY, USA: Springer, 2009.
44. Jung JS, Zhong M, Liu L, Fan RZ : Bi-variate combined linkage and association mapping of quantitative trait loci. Genet Epidemiol 2008; 32: 396–412.
45. Anderson TW : An Introduction to Multivariate Statistical Analysis, 2nd edn. New York, NY, USA: John Wiley & Sons, 1984.
46. Rao CR : Linear Statistical Inference and its Applications, 2nd edn. New York, NY, USA: John Wiley & Sons, 1973.
47. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D : Calibrating a coalescent simulation of human genome sequence variation. Genome Res 2005; 15: 1576–1583.
48. The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007; 449: 851–861.
49. The 1000 Genomes Project Consortium: A map of human genome variation from population scale sequencing. Nature 2010; 467: 1061–1073.
50. Ross SM : Stochastic Processes, 2nd edn. New York, NY, USA: John Wiley & Sons, 1996.
51. Lange K : Mathematical and Statistical Methods for Genetic Analysis, 2nd edn. New York, NY, USA: Springer, 2002.
52. Ott J : Analysis of Human Genetic Linkage, 3rd edn. Baltimore and London: Johns Hopkins University Press, 1999.
53. Sturtevant AH : The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association. J Exp Zool 1913; 14: 43–59.
54. Lin DY, Zeng D : Meta-analysis of genome-wide association studies: no efficiency gain in using individual participant data. Genet Epidemiol 2010; 34: 60–66.
55. Debray TP, Moons KG, Abo-Zaid GM, Koffijberg H, Riley RD : Individual participant data meta-analysis for a binary outcome: one-stage or two-stage? PLoS One 2012; 8: e60650.
56. Higgins JP, Whitehead A, Turner RM, Omar RZ, Thompson SG : Meta-analysis of continuous outcome data from individual patients. Stat Med 2001; 20: 2219–2241.
57. Mathew T, Nordström K : Comparison of one-step and two-step meta-analysis models using individual patient data. Biometric J 2010; 52: 271–287.

Acknowledgements

Two anonymous reviewers and Editor-in-Chief, Professor Dr Gertjan van Ommen, provided very good and insightful comments for us to improve the manuscript. We greatly thank the European cohort investigators for letting us analyze the data and use them as examples. Dr Heather M Stringham and Dr Tanya M Teslovich kindly sent us the data of the European cohorts and patiently answered many questions about the cohorts, and we greatly appreciated them. This study was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Maryland (Ruzong Fan and Chi-yang Chiu), by Wei Chen’s NIH grants R01EY024226 and R01HG007358 and the University of Pittsburgh (Ruzong Fan is an unpaid collaborator on the grant R01EY024226). This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov).

Author information

Ruzong Fan
Present address: Department of Biostatistics, Bioinformatics, and Biomathematics, 4000 Reservoir Road NW, Building D-180, Georgetown University Medical Center, Washington, DC 20057, USA.

Authors and Affiliations

Division of Intramural Population Health Research, Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
Chi-yang Chiu, Aiyi Liu & Ruzong Fan
Laboratory of Epidemiology and Biometry, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Bethesda, MD, USA
Jeesun Jung
Division of Pulmonary Medicine, Allergy and Immunology, The University of Pittsburgh Medical Center, Pittsburgh, PA, USA
Wei Chen
Department of Human Genetics and Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
Daniel E Weeks
Data Paradise Inc, Belle Mead, NJ, USA
Haobo Ren
Department of Biostatistics, The University of Michigan, Ann Arbor, MI, USA
Michael Boehnke
Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
Christopher I Amos
Division of Intramural Population Health Research, Epidemiology Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
James L Mills
Department of Epidemiology and Biostatistics, University of Maryland College Park, College Park, MD, USA
Mei-ling Ting Lee
Human Genetics Center, University of Texas–Houston, Houston, TX, USA
Momiao Xiong

Corresponding author

Correspondence to Ruzong Fan.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Computer program

The methods proposed in this paper are implemented using procedures of the R functional data analysis (fda) package. The R code for data analysis and simulations is available from the website: http://www.nichd.nih.gov/about/org/diphr/bbb/software/fan/Pages/default.aspx.

Supplementary Information accompanies this paper on the European Journal of Human Genetics website.

About this article

Cite this article

Chiu, Cy., Jung, J., Chen, W. et al. Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models. Eur J Hum Genet 25, 350–359 (2017). https://doi.org/10.1038/ejhg.2016.170

Received: 11 May 2016
Revised: 26 July 2016
Accepted: 27 September 2016
Published: 21 December 2016
Issue Date: March 2017
DOI: https://doi.org/10.1038/ejhg.2016.170
