A flexible likelihood framework for detecting associations with secondary phenotypes in genetic studies using selected samples: application to sequence data

Liu, Dajiang J; Leal, Suzanne M

doi:10.1038/ejhg.2011.211

Article
Published: 14 December 2011

Dajiang J Liu & Suzanne M Leal

European Journal of Human Genetics volume 20, pages 449–456 (2012)

Abstract

For most complex trait association studies using next-generation sequencing, in addition to the primary phenotype of interest, many clinically important secondary traits are also available, which can be analyzed to map susceptibility genes. Owing to high sequencing costs, most studies use selected samples, and the sampling mechanisms of these studies can be complicated. When the primary and secondary traits are correlated, analyses of secondary phenotypes can cause spurious associations in selected samples and existing methods are inadequate to adjust for them. To address this problem, a likelihood-based method, MULTI-TRAIT-ASSOCIATION (MTA) was developed. MTA is flexible and can be applied to any study with known sampling mechanisms. It also allows efficient inferences of genetic parameters. To investigate the power of MTA and different study designs, extensive simulations were performed under rigorous population genetic and phenotypic models. It is demonstrated that there are great benefits for analyzing secondary phenotypes in selected samples. In particular, using case–control samples and samples with extreme primary phenotypes can be more powerful than analyzing random samples of equivalent size. One major challenge for sequence-based association studies is that most data sets are not of sufficient size to be adequately powered. By applying MTA, data sets ascertained under distinct mechanisms or targeted at different primary traits can be jointly analyzed to map common phenotypes and greatly increase power. The combined analysis can be performed using freely available data sets from public repositories, for example, dbGaP. In conclusion, MTA will have an important role in dissecting the etiology of complex traits.

Introduction

There is solid evidence that complex traits can be influenced by rare variants.^{1, 2, 3, 4} The development and large-scale application of next-generation sequencing have already revolutionized genetic studies and enabled detecting associations with complex traits owing to rare variants. In order to design powerful studies, it is necessary to deeply sequence samples from a large number of individuals.⁵ However, many existing studies are small- to moderate-sized, owing to the high cost of sequencing or limited availability of samples, and are therefore inadequately powered. It would be advantageous if different studies, which measure the same phenotypes, could be jointly analyzed to increase power. In particular, many clinically important traits, such as body mass index (BMI), systolic (SysBP) and diastolic blood pressure (DiasBP) are often measured in different studies. When combined analysis is performed, in addition to incorporating studies that are targeted at the same primary traits, it is desirable to also analyze data from studies for which the phenotype of interest is measured as an additional outcome. Combined analyses require modeling multiple phenotypes as different studies may sequence selected samples targeted at different primary traits. Similar to the idea of analysis of covariance (ANCOVA), jointly analyzing multiple phenotypes makes it possible to distinguish the phenotype covariance component that is due to gene pleiotropy and the component that is attributable to residual correlations.

Currently, most studies sequence selected samples, for example, case–control samples or individuals with extreme phenotypes.^{1, 6} Sequencing selected samples reduces sequencing cost and improves power. Owing to sample ascertainment, secondary traits can be associated with the gene region in a selected sample even though they are independent in the general population. For example, consider a gene that is associated with the primary trait, but not with the secondary trait, in the general population (Figure 1). In a sample that consists of individuals with extreme primary trait values, the causative variant frequency will be different between individuals from the upper and lower extremes. The mean value for the secondary trait will also be different owing to phenotypic correlations. Therefore, a spurious association can occur between the gene region and the secondary trait unless the sample ascertainment scheme is correctly modeled. The selection criteria for a sequencing study can be complicated and may involve multiple traits (multiple-trait study) or sub-phenotypes. For instance, it is hypothesized that the etiologies of type-2 diabetes (T2D) are different in obese and non-obese individuals.^{7, 8} In order to reduce phenotype heterogeneity and potentially improve power, a study of T2D might be performed using an obese population. There have been methods developed for detecting associations with multiple phenotypes in selected samples.^{9, 10} However, these methods are limited to case–control studies. They are not applicable to more complicated study designs, for example, studies that sequence individuals with extreme primary traits (extreme-trait study), or studies where secondary phenotypes are also involved in sample selection. 
In particular, the extreme-trait study design is becoming increasingly popular and widely applied.^{11, 12, 13} The results for detecting associations with secondary traits can be seriously biased if the secondary traits are not properly analyzed.⁹ It is desirable to have a unified approach for analyzing secondary phenotypes from all available data sets.
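
The spurious association described above can be illustrated with a small simulation. The sketch below is not part of MTA: the carrier frequency, effect size, residual correlation, tail fraction and all function names are assumptions chosen for illustration. A gene shifts the primary trait only, the secondary trait shares a residual correlation with it, and the naive carrier versus non-carrier difference in the secondary trait is compared between the whole population and an extreme-trait sample.

```python
import random
from statistics import mean

def simulate_population(n, carrier_freq, beta_primary, rho, rng):
    """Simulate (carrier, primary, secondary) triples.

    The gene shifts the primary trait by beta_primary in carriers but has
    no effect on the secondary trait; the two traits share a residual
    correlation rho. All parameter values are illustrative assumptions.
    """
    people = []
    for _ in range(n):
        carrier = 1 if rng.random() < carrier_freq else 0
        e_primary = rng.gauss(0.0, 1.0)
        e_indep = rng.gauss(0.0, 1.0)
        primary = beta_primary * carrier + e_primary
        # Secondary trait: correlated residual only, no genetic effect.
        secondary = rho * e_primary + (1.0 - rho ** 2) ** 0.5 * e_indep
        people.append((carrier, primary, secondary))
    return people

def carrier_gap(people):
    """Mean secondary trait in carriers minus mean in non-carriers."""
    carriers = [t for c, _, t in people if c == 1]
    others = [t for c, _, t in people if c == 0]
    return mean(carriers) - mean(others)

def extreme_sample(people, tail_fraction):
    """Keep individuals in the lower and upper tails of the primary trait."""
    ranked = sorted(people, key=lambda p: p[1])
    k = int(len(ranked) * tail_fraction)
    return ranked[:k] + ranked[-k:]

rng = random.Random(2011)
population = simulate_population(20000, 0.05, 0.5, 0.6, rng)
selected = extreme_sample(population, 0.10)

gap_population = carrier_gap(population)
gap_selected = carrier_gap(selected)
print(f"population gap: {gap_population:+.3f}")
print(f"extreme-sample gap: {gap_selected:+.3f}")
```

In the full population the gap should be close to zero, whereas in the extreme-trait sample carriers are concentrated in the upper tail and therefore show a clearly higher mean secondary trait value, even though the gene has no effect on that trait.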

A flexible likelihood approach, MULTI-TRAIT-ASSOCIATION (MTA), is presented for detecting associations with multiple phenotypes in selected or randomly ascertained samples. This method can be used to detect both common and rare variant/secondary phenotype associations. MTA jointly models multiple phenotypes conditional on the study subjects being ascertained. The sampling mechanisms are incorporated by means of a prospective likelihood approach. The MTA framework is comprehensive and can be used to model multiple continuous or categorical traits. To model traits that are not continuous, a generalized linear model is used. For example, either a probit or logit link function can be applied to model binary traits. In this article, the discussion is focused on using the probit link function and the liability threshold model, which can be justified by the polygenic model of complex traits. It has been suggested that the liability of all complex traits can be considered as ‘quantitative’.¹⁴ For complex traits that are not measured in a quantitative scale, there should exist a continuous underlying liability trait, which is due to the aggregated outcome from multiple causative variants with small effects. In this case, a multivariate liability threshold model is naturally used to jointly model multiple phenotypes.

The power of MTA for detecting gene/secondary trait associations is examined in different selective study designs. Three study designs are considered, that is, case–control, extreme-trait and multiple-trait. It is assumed for each of the study designs that the same continuous secondary phenotype T is measured. For comparison purposes, study designs are also evaluated where the quantitative trait T is selected and analyzed as the primary phenotype. Simulation details for each study design can be found in Table 1.

Table 1 Definitions of selection mechanisms

It is very beneficial to be able to use and combine selected samples from existing sequencing-based genetic studies. Through extensive simulation studies, it is shown that the case–control and extreme-trait designs can be more powerful for detecting associations with secondary phenotypes than using a population-based design, where individuals are randomly selected regardless of their phenotypes. The power for detecting associations with secondary phenotypes strongly depends on the aggregation of causative variants in the sample. For study designs that facilitate enrichment of causative variants, power will be increased. In the presence of gene pleiotropy, variants that are associated with both the primary and secondary traits can be enriched through selections on the primary phenotype. When the gene region is only associated with the secondary phenotype, if the primary and secondary traits are correlated, selections on the primary phenotype can also induce selections on the secondary phenotype. In this case, for a sample of equivalent size, the power of rejecting the null hypothesis of no gene/secondary trait association in case–control or extreme-trait studies is still superior or comparable to a population-based study.

The power for detecting associations with secondary phenotypes in selected samples is jointly affected by locus phenotypic effects for both the primary and secondary phenotypes, as well as residual correlations. Concordant with observations from previous studies of multiple-trait linkage/association mapping,^{15, 16, 17} it is demonstrated that power is maximized when the locus-induced trait correlations are in the opposite direction of the residual correlations. To further demonstrate the utility of MTA in combined analysis, an example is given where samples from a case–control study and a multiple-trait study are jointly analyzed. The power for detecting associations with commonly measured phenotypes can be greatly increased when studies are combined, compared with analyzing each individual study separately.

As an application of MTA, we analyzed the sequence data from the ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6 genes generated by the Dallas Heart Study (DHS). The 3551 study participants of the DHS were phenotyped for multiple metabolism-related traits, including BMI, DiasBP, SysBP, total cholesterol level (TCL), low-density lipoprotein (LDL), high-density lipoprotein (HDL), triglyceride (TG) and glucose (Gluc). Two primary trait analyses were first performed: (1) analysis of all samples and (2) analysis of selected samples whose quantitative trait values fall within the lower and upper quartiles. Next a secondary phenotype analysis was performed where within each selected sample all other available phenotypes were analyzed as secondary traits. The results from the secondary trait analyses confirmed the primary trait analyses. These results established the importance of analyzing secondary phenotypes and the effectiveness of MTA. They provided solid support to our simulation experiment.

Materials and methods

It is assumed that there are S variant nucleotide sites for a gene locus. The multi-site genotype for individual i is given by $\vec{X}_i = (x_i^1, x_i^2, \ldots, x_i^S)$, where the genotype at the segregating nucleotide site s is coded by the number of minor alleles, $x_i^s \in \{0, 1, 2\}$ (eg, $x_i^s = 2$ if the individual is homozygous for the minor allele). To detect associations with rare variants, multiple rare variants in a gene locus are usually jointly analyzed.^{18, 19, 20, 21, 22} The locus genotype coding for an individual i is defined as $X_i = C(\vec{X}_i)$, where C(•) is the coding function.

Locus multi-site genotype coding schemes

Many statistical methods have been developed for association studies of complex traits owing to rare variants. Existing methods include combined multivariate and collapsing (CMC),²³ test of an aggregated number of rare variants (ANRV),²² weighted sum statistics (WSS),²⁰ variable threshold test (VT),²¹ kernel-based adaptive cluster (KBAC),¹⁹ the C-alpha test²⁴ and the RARECOVER (RC) method,²⁵ and so on. Most of these methods are essentially based on weighting or grouping variants. Among them, CMC and ANRV are regression-based methods, which can be incorporated into MTA through the coding function C(•):

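(1) CMC: The coding function is defined as $C(\vec{X}_i) = \delta\left(\sum_{s \in RV} x_i^s > 0\right)$, where $x_i^s$ is the number of minor alleles carried by individual i at site s, δ(•) is an indicator function and $\sum_{s \in RV}$ is a summation over the set of rare variant nucleotide sites RV, which can be determined by a pre-specified frequency cut-off.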

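(2) ANRV: The coding function belongs to a more general class of weighted sum coding (WSC), which can be defined as $C(\vec{X}_i) = \sum_{s \in RV} w^s x_i^s$, where $x_i^s$ is the number of minor alleles carried by individual i at site s. In the WSC scheme, the variant from nucleotide site s is assigned weight w^s. The ANRV coding assigns equal weight for all variants, that is, w^s takes the same value at every site s.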
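
The two coding schemes above can be sketched in a few lines of Python. This is an illustrative sketch, not the MTA R package: the function names are hypothetical, a genotype is assumed to be a list of minor-allele counts indexed by site, and the common ANRV weight is assumed to be 1.

```python
def cmc_code(genotype, rare_sites):
    """CMC collapsing: 1 if any rare site carries a minor allele, else 0."""
    return 1 if sum(genotype[s] for s in rare_sites) > 0 else 0

def wsc_code(genotype, rare_sites, weights=None):
    """Weighted sum coding over the rare sites.

    With no weights supplied, every site gets weight 1.0 (an assumption),
    which gives an ANRV-style count of rare minor alleles.
    """
    if weights is None:
        weights = {s: 1.0 for s in rare_sites}
    return sum(weights[s] * genotype[s] for s in rare_sites)

# Hypothetical individual: minor-allele counts at five sites,
# of which sites 1, 3 and 4 are below the frequency cut-off.
genotype = [0, 1, 0, 2, 0]
rare_sites = [1, 3, 4]
print(cmc_code(genotype, rare_sites))
print(wsc_code(genotype, rare_sites))
```

The collapsed CMC code only records whether any rare minor allele is present, while the weighted sum keeps the allele count, so the two codings can differ for individuals carrying more than one rare allele.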

A general probability model for multiple phenotypes in selected samples

In order to incorporate the sample ascertainment mechanism and correct for the bias induced by phenotypic residual correlations, multiple phenotypes are jointly modeled. The primary and secondary traits are assumed to follow a multivariate generalized linear model, in which each trait has its own link function and its own set of model parameters. This multivariate generalized linear model can be used with any type of link function, such as the probit or logit link function.

For selected samples, a conditional likelihood is used, which is similar to the Pearson–Aitken correction for ascertainment.²⁶ In this likelihood (2), Z_i is an indicator of individual i being sampled and N is the number of individuals in the sample; the likelihood is a product over the N sampled individuals, with each term giving the joint probability of that individual's phenotypes conditional on the locus genotype and on Z_i=1.

The sampling mechanism is characterized by the probability that an individual is sampled given his or her phenotypes, which can be explicitly calculated for case–control, extreme-trait and multiple-trait studies. The details are shown in Supplementary Material Section 1. When the probit link function is used to model binary phenotypes, the multivariate generalized linear model can be simplified. Computational details can be found in Supplementary Material Section 2.
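
A conditional likelihood of this kind can be written out as follows. The notation is assumed here for illustration and is not taken from the original display equations: Y_i^P and Y_i^T denote the primary and secondary phenotypes of individual i, X_i the coded locus genotype, and θ the collection of model parameters. The second identity is Bayes' rule applied to each term.

```latex
L(\theta) = \prod_{i=1}^{N} P\left(Y_i^{P}, Y_i^{T} \mid X_i, Z_i = 1; \theta\right),
\qquad
P\left(Y_i^{P}, Y_i^{T} \mid X_i, Z_i = 1; \theta\right)
= \frac{P\left(Z_i = 1 \mid Y_i^{P}, Y_i^{T}, X_i\right)\,
        P\left(Y_i^{P}, Y_i^{T} \mid X_i; \theta\right)}
       {P\left(Z_i = 1 \mid X_i; \theta\right)}
```

Under this sketch, the numerator combines the sampling mechanism with the unconditional phenotype model, and the denominator is obtained by summing or integrating the numerator over all possible phenotype values. When sampling depends on the phenotypes only, the first factor in the numerator does not involve X_i.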

Association testing

The likelihood-based score statistic can be applied to detect associations with rare variants. Using collapsing coding, P-values for the score statistics can be analytically evaluated. For the WSC, if the weights are only dependent on the multi-site genotypes, the score statistic will asymptotically follow a normal distribution and the P-values can also be analytically evaluated. Permutation procedures cannot be used to analyze secondary phenotypes in selected samples. This is because if the gene region is associated with the primary phenotype, study subjects are not interchangeable under the null hypothesis of no gene/secondary phenotype associations. The analyses in the article were performed using the CMC coding. The results remain the same when other coding schemes are used.

Combining different cohorts for analyses of secondary phenotypes

Statistical theories for combining multiple studies are well developed.²⁷ As heterogeneity may exist between different cohorts, meta-analysis methods that combine test statistics should be used.^{11, 12} For rare variant analysis, multiple rare variants are jointly analyzed and their phenotypic effects are not usually estimated and reported. Therefore, all the joint analyses in this study were performed by combining score statistics from different studies. In the joint analysis, score statistics from different studies are weighted and summed. The weight assigned for each score statistic is proportional to the square root of the sample size according to Skol et al.²⁸
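
The weighting scheme described above can be sketched as follows. The sketch assumes that each study's score statistic has already been standardized to a Z-score that is standard normal under the null hypothesis, and that the studies are independent; the function names and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def combine_z_scores(z_scores, sample_sizes):
    """Weighted sum of per-study Z-scores with weights sqrt(sample size).

    Dividing by the square root of the summed squared weights keeps the
    combined statistic standard normal under the null hypothesis.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

def two_sided_p(z):
    """Two-sided P-value for a standard normal statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical Z-scores from two studies of 1000 individuals each.
z_combined = combine_z_scores([1.8, 2.1], [1000, 1000])
print(round(z_combined, 3), round(two_sided_p(z_combined), 4))
```

Because only standardized statistics and sample sizes are needed, studies ascertained under different mechanisms can be combined without pooling individual-level data.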

Generation of genetic and phenotypic data

Following Boyko et al,¹⁸ a rigorous population genetic model incorporating demographic change and purifying selection was used to simulate the African variant data. Details of the generation of genetic data are given in Supplementary Material Section 3. To generate phenotypes, we assume that the phenotypic effects of causative variants are independent of their fitness. In a case–control study, the augmented phenotype (A_i^*, T_i) for an individual i follows a bivariate normal distribution whose mean is determined by the individual's multi-site genotype at the causative sites and whose covariance is determined by the trait variances and the residual correlation.

The rare variant sites CV_A^* and CV_T are randomly chosen to be causative for the traits A^* and T. Either set can be empty if the gene is not associated with the corresponding trait. Variants at sites that belong to both CV_A^* and CV_T are pleiotropic and affect both phenotypes. The binary disorder status A_i is determined by the liability A_i^* through the liability threshold model. For each scenario, 1000 individuals were simulated. Details for simulating the extreme-trait and multiple-trait study samples can be found in Supplementary Material Section 4.
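
A heavily simplified version of this phenotype-generation step can be sketched in Python. It is not the simulation code used in the article: the population genetic model is replaced by a single carrier indicator, the threshold ignores the small mean shift caused by carriers, and the function name and all parameter values are assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

def simulate_case_control(n_cases, n_controls, carrier_freq, beta_liability,
                          beta_secondary, rho, prevalence, rng):
    """Draw individuals from a bivariate liability-threshold model until the
    requested numbers of cases and controls are filled."""
    # Threshold on the liability scale from the standard normal quantile
    # (an approximation that ignores the carrier-induced mean shift).
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        carrier = 1 if rng.random() < carrier_freq else 0
        e_liab = rng.gauss(0.0, 1.0)
        e_indep = rng.gauss(0.0, 1.0)
        liability = beta_liability * carrier + e_liab
        # Residuals of the two traits are correlated with coefficient rho.
        secondary = (beta_secondary * carrier + rho * e_liab
                     + (1.0 - rho ** 2) ** 0.5 * e_indep)
        affected = 1 if liability > threshold else 0
        record = (carrier, affected, secondary)
        if affected and len(cases) < n_cases:
            cases.append(record)
        elif not affected and len(controls) < n_controls:
            controls.append(record)
    return cases + controls

rng = random.Random(42)
sample = simulate_case_control(500, 500, 0.05, 0.5, 0.0, 0.6, 0.10, rng)
mean_t_cases = mean(t for _, a, t in sample if a == 1)
mean_t_controls = mean(t for _, a, t in sample if a == 0)
print(round(mean_t_cases, 3), round(mean_t_controls, 3))
```

With a positive residual correlation, cases have a higher mean secondary trait value than controls even when the gene has no effect on the secondary trait, which is the feature of selected samples that the ascertainment correction has to account for.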

In order to evaluate type-I errors, phenotype data were generated under the null hypothesis of no gene/secondary trait T associations, that is, β_T=0. Scenarios were considered where (1) the gene region is associated with neither the primary nor the secondary phenotype and (2) the gene is associated with the primary phenotype but not with the secondary phenotype. Scenarios with a combination of two causative variant primary trait effects and four residual correlations were evaluated.

To compare the power of rejecting the null hypothesis of no gene/secondary trait associations, two causal variant secondary phenotype effects β_T =±0.5σ_T were used. The power for the three study designs was compared under scenarios with different combinations of genetic parameter values.

Software availability

An R-package implementing MTA is available at http://www.bcm.edu/genetics/leal/software, which is compatible with commonly used operating systems, including Linux, Windows and OS X.

Results

Evaluation of type-I errors

Type-I errors for each study design using MTA were evaluated empirically. Under the null hypothesis of no genetic/secondary phenotype associations, the quantile–quantile (Q-Q) plots of the empirical and theoretical distributions of P-values are shown in Figures 2 and 3 for the case–control study design. When the ascertainment mechanism is correctly specified, the type-I errors are controlled. Results are shown in Figure 2 for the scenario where the gene region is not associated with either the primary or the secondary phenotypes, and the scenario where the gene region is only associated with the primary trait. Type-I errors for the extreme-trait and multiple-trait designs were also well controlled (data not shown). The impact of mis-specified sampling mechanisms was investigated. The results are shown in Figure 3 when the prevalence parameter is 10%, but is incorrectly set to be 7% (Figure 3a) or 13% (Figure 3b) in the analyses. The results indicate that mis-specifying prevalence has only a very minimal impact on type-I error rates as can be observed in the Q-Q plot.

In order to illustrate the bias that could be induced by ascertainment, we also analyzed the simulated data using likelihood models without proper ascertainment corrections and the biases in most scenarios can be substantial. The details for the analyses are shown in Supplementary Material Section 5 and Supplementary Figure 1.

Power of detecting secondary phenotype rare variant associations

The efficiency of the three selective sampling designs for detecting secondary trait associations was compared when both the primary and the secondary traits are associated with the same gene (Table 2). Scenarios were examined where 1000 individuals are sequenced for each study design. There is considerable power for detecting secondary phenotype associations in selected samples. Analyzing secondary phenotypes in a case–control or an extreme-trait study data set can be consistently more powerful than a randomly ascertained population data set of equal size.

Table 2 Power to detect secondary trait T associations using case–control, extreme-trait and multiple-trait study design

When a population-based sample is used where 1000 individuals are randomly selected regardless of their phenotypic values, the power for rejecting the null hypothesis is only 51.7% (Supplementary Table 1). For a case–control sample where the secondary trait T is analyzed, the power can be higher (Table 2). For example, when the primary and secondary trait phenotypic effects satisfy β_A^*=0.5σ_A^* and β_T=0.5σ_T (with the residual correlation as specified in Table 2), the power is 56.5%. It is also comparable to the power (56.6%) when 200 individuals with the most extreme trait T values from a cohort of 5000 are sequenced (Supplementary Table 2).

Consistent with observations from bivariate phenotype association studies,¹⁶ the power for detecting associations with secondary phenotypes is jointly determined by the sizes and directions of the locus phenotypic effects and residual correlations. The power is the highest when the correlation between the locus phenotypic effects is in the opposite direction of the trait residual correlations. For example, when the locus-induced correlation is positive and the trait residual correlation is negative, the power is 55.7%. However, if the trait residual correlation is also positive, the power is 53.5% (Table 2).

Similar patterns of power comparisons are observed for detecting associations with the secondary phenotype T in extreme-trait studies. The power for an extreme-trait study can be substantially higher than that for a population-based study of equivalent size. For example, for one combination of primary and secondary trait effects and residual correlation reported in Table 2, the power of rejecting the null hypothesis is 66.7%. It is comparable to the power (70.6%) when 600 individuals with the most extreme trait T values from a cohort of 5000 are sequenced (Supplementary Table 2), or the power (66.6%) when 2000 randomly selected samples are sequenced (Supplementary Table 1).

When the gene region is only associated with the secondary trait T, using samples ascertained on the primary phenotype will induce selections on the secondary phenotype. For a data set of equivalent size, the power for rejecting the null hypothesis of no gene/secondary trait associations in case–control or extreme-trait samples is still greater than (or comparable to) analyzing the same trait using a randomly ascertained population sample. For example, in an extreme-trait study that sequences 1000 individuals, when causal variants in the gene affect only the secondary trait and the two traits are positively correlated with correlation coefficient ρ=0.6, the power is 60.2%. If the two traits are negatively correlated with ρ=−0.6, the power is 60.6% (Table 2). The power in both scenarios is superior to that of a population-based study (51.6%) that sequences an equivalent number of samples (Supplementary Table 1).

The MTA method can be applied to analyze samples ascertained on multiple phenotypes. In this example of a multiple-trait study, 500 affected individuals with trait T-value above the 65th percentile are sequenced and 500 unaffected individuals are also selected regardless of their trait T-values (Table 2). Compared with the extreme-trait or case–control study design, the multiple-trait study example that is given is not as powerful. This is because there is not enough phenotypic variability in the sample, as affected individuals are only sampled from the sub-population with trait T above the 65th percentile. However, in some scenarios, there can be considerable power in a multiple-trait study, in particular when sampling on the secondary trait T increases phenotypic variability, for example, affected or unaffected individuals are selected to have secondary T trait values from opposite extreme tails.

MTA allows joint analysis of commonly measured phenotypes in different genetic studies. These studies may be targeted at different primary traits. An example is given where a multiple-trait study is implemented, and the association analysis of the secondary trait T is performed by combining it with a case–control study data set (Table 3). A wide variety of scenarios were evaluated, and a sizable power increase for the combined analysis is consistently observed.

Table 3 Power to detect secondary trait T associations for individual studies (case–control and multiple-trait) and the combined analysis

Applications to the ANGPTL family of genes

When each of the eight phenotypes from the DHS was analyzed as primary phenotype using selected samples and the entire sample, four nominally significant associations were found for both types of analyses, that is, ANGPTL4 with TG (P=0.005), ANGPTL5 with BMI (P=0.003), ANGPTL5 with HDL (P=0.024) and ANGPTL6 with BMI (P=0.022). All of the above significant associations were also successfully detected when TG, BMI and HDL were analyzed as secondary phenotypes. An additional association between ANGPTL4 and HDL (P=0.018) was identified only when the entire sample was analyzed (Supplementary Table 3).

The association between TG and rare variants in the ANGPTL4 gene was identified using selected samples where the primary traits are BMI (P=0.025), SysBP (P=0.012) or LDL (P=0.010) (Table 4). These traits are only weakly positively correlated with TG, that is, ρ_{BMI, TG}=0.227, ρ_{LDL, TG}=0.197 and ρ_{SysBP, TG}=0.102 (Supplementary Table 4). The association between ANGPTL4 and TG is not significant using samples with extreme DiasBP (P=0.137), TCL (P=0.065), Gluc (P=0.117) and HDL (P=0.107) levels.

Table 4 Results for the secondary phenotype analyses using sequence data from the ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6 genes

Although the ANGPTL4 gene is significantly associated with HDL and the size of the correlation between HDL and TG is larger (ρ_{HDL, TG}=−0.374; Supplementary Table 4), the association of TG with ANGPTL4 gene is not significant when TG is analyzed as a secondary trait using samples with extreme HDL levels. This could have occurred because the locus phenotypic effects for HDL and TG are negatively correlated, and the locus-induced correlation lies in the same direction as the residual correlation, which is shown in our simulations to have reduced power compared with when the locus-induced correlation and trait residual correlations are in opposite directions.

There is one nominally significant association that was only detected in secondary phenotype analyses, that is, the association between Gluc and rare variants in the ANGPTL3 gene (P=0.024). It was identified when samples with extreme LDL levels were used. But when Gluc was analyzed as primary trait, the association is not significant (P=0.64). This could either be a novel association or a false-positive finding.

Discussion

In this article, a flexible likelihood framework MTA is proposed for jointly modeling multiple phenotypes in non-randomly ascertained samples, for example, case–control samples or extreme-trait samples. By coupling multivariate generalized linear models with prospective likelihood, complicated ascertainment mechanisms can be incorporated. The approach is flexible and particularly suitable for analyzing complex traits. It can be applied to any study with known sampling mechanisms. MTA allows efficient statistical inference for the genetic parameters of interest. Although the discussion in this article is focused on analyzing sequence data, MTA can also be applied to analyze genotype data.

The results presented in this article have important implications for the design and analysis of complex traits. Most current studies, owing to their limited sample size, are not adequately powered to detect associations with rare variants. It has been suggested that for an exome study ∼10 000 individuals with extreme traits from a cohort of 100 000 need to be sequenced in order to have adequate power.⁵ However, the sample size well exceeds the capacity of many existing studies.⁵ It is therefore particularly important that combined analysis can be performed using data from multiple studies in order to have sufficient power. Applying MTA, sequencing studies that are targeted at different primary traits can be jointly analyzed for detecting associations with a variety of commonly measured secondary traits.

The power of different selective study designs was investigated. It was shown through extensive simulations that there is considerable power for detecting secondary phenotype associations in selected samples. In particular, when the secondary trait of interest is analyzed in a case–control or an extreme-trait study data set, the power can be greater than analyzing an equivalent sized randomly ascertained sample. Using data-sharing platforms and protocols such as dbGaP,²⁹ samples from existing studies can be freely obtained and analyzed. The power can be greatly increased when data from multiple studies are jointly analyzed.

Secondary phenotypes not only have their own clinical importance, but they can also be relevant for understanding the primary trait etiologies. For example, among studies of T2D, many are targeted at related quantitative traits, including fasting glucose levels³⁰ and C-reactive protein.³¹ Given that these traits are often available for individuals who participate in T2D case–control studies,³² MTA can be applied to detect associations with these additional phenotypes.

MTA was also applied to the analysis of sequence data from the DHS. Multiple associations were identified, which confirmed previous data analyses. When the traits were analyzed as secondary phenotypes, although the same set of associations was observed, they were not detected in every selected sample. For example, the association between TG levels and ANGPTL4 was only detected in secondary trait analyses using samples with extreme BMI, SysBP and LDL, but not in samples with extreme DiasBP, HDL, TCL and Gluc. This could be due to the small sample sizes that were analyzed; the moderate effect sizes for variants involved in complex trait etiologies; or the directions and magnitudes of the correlations between the primary and secondary phenotypes. Although these identified associations are only nominally significant, they all have biological support. In fact, the effects of mutant ANGPTL-family genes on lipoprotein lipase (LPL) have been investigated through in vitro functional studies and in vivo mouse studies. LPL has been known to affect glucose metabolism,³³ cholesterol level³⁴ and blood pressure.³⁵ Additionally, the association between variants in the ANGPTL4 gene and triglyceride levels has been successfully replicated.^{3, 36}

Sensitivity of MTA to mis-specified sampling mechanisms was extensively evaluated. When the disease prevalence is reported as an interval of possible values, inferences from MTA can be conducted under different prevalence values from the interval. The results can be integrated using a model averaging procedure. It has been shown that it is an effective approach to further reduce the impact of mis-specified prevalence.³⁷

There can be heterogeneities of sequence coverage depth within and between different studies. Coverage depth differences within a single study may cause inflated type-I errors. Possible strategies to reduce the bias include incorporating the mean coverage depth of each individual in the analysis as a covariate.³⁸ This strategy can be used with the MTA model. In order to be robust against between-study heterogeneities, a meta-analysis procedure should be implemented for the joint analysis, instead of performing a mega-analysis that combines individual participant data.^{11, 12}

When multiple phenotypes are analyzed, to avoid inflated type-I error owing to testing multiple hypotheses, a stringent significance level must be specified. Owing to phenotypic correlations, Bonferroni corrections for testing multiple genes and phenotypes can be overly conservative. Instead, the spectral decomposition-based method of Nyholt et al³⁹ can be used. In addition to correctly controlling for family-wise error rates, it is important that the findings can be replicated using independent samples.⁴⁰
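
The idea behind an eigenvalue-based correction can be sketched as follows. The sketch assumes the commonly quoted form of the effective number of tests, M_eff = 1 + (M − 1)(1 − Var(λ)/M), where λ are the eigenvalues of the M × M correlation matrix, followed by a Šidák-type threshold; the function names and the correlation matrix are hypothetical, and the cited paper should be consulted for the exact procedure. The eigenvalue variance is computed without an eigen-decomposition, using the facts that the eigenvalues of a correlation matrix sum to M and their squares sum to the sum of squared matrix entries.

```python
def effective_number_of_tests(corr):
    """Effective number of independent tests from a correlation matrix.

    Implements M_eff = 1 + (M - 1) * (1 - Var(eigenvalues) / M), with the
    sample variance of the eigenvalues derived from the matrix entries.
    """
    m = len(corr)
    if m == 1:
        return 1.0
    sum_sq = sum(corr[i][j] ** 2 for i in range(m) for j in range(m))
    # Mean eigenvalue is 1, so the sum of squared deviations is sum_sq - m.
    eigen_var = (sum_sq - m) / (m - 1)
    return 1.0 + (m - 1) * (1.0 - eigen_var / m)

def sidak_threshold(alpha, m_eff):
    """Per-test significance level for a family-wise level alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m_eff)

# Hypothetical correlation matrix for three phenotypes.
corr = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
m_eff = effective_number_of_tests(corr)
print(round(m_eff, 3), sidak_threshold(0.05, m_eff))
```

For uncorrelated phenotypes the effective number equals the number of phenotypes, and for perfectly correlated phenotypes it reduces to one, so the resulting threshold is never stricter than a Bonferroni-type threshold over all M tests.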

With the large-scale implementation of sequence-based genetic association studies, the capability for mapping complex traits will be further elevated. Detecting associations with rare variants while jointly investigating multiple phenotypes can be an ambitious and difficult task given the moderate sample sizes of existing studies. Taking advantage of multiple studies and mapping commonly measured phenotypes using MTA is therefore highly beneficial and will greatly accelerate the process of dissecting complex trait genetic etiologies.

References

1. Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH : Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 2004; 305: 869–872.
2. Ji W, Foo JN, O’Roak BJ et al: Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat Genet 2008; 40: 592–599.
3. Romeo S, Pennacchio LA, Fu Y et al: Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet 2007; 39: 513–516.
4. Bodmer W, Bonilla C : Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet 2008; 40: 695–701.
5. Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR : Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci USA 2009; 106: 3871–3876.
6. Cohen JC, Pertsemlidis A, Fahmi S et al: Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. Proc Natl Acad Sci USA 2006; 103: 1810–1815.
7. Cauchi S, Nead KT, Choquet H et al: The genetic susceptibility to type 2 diabetes may be modulated by obesity status: implications for association studies. BMC Med Genet 2008; 9: 45.
8. Cauchi S, Meyre D, Dina C et al: Transcription factor TCF7L2 genetic study in the French population: expression in human beta-cells and adipose tissue and strong association with type 2 diabetes. Diabetes 2006; 55: 2903–2908.
9. Lin DY, Zeng D : Proper analysis of secondary phenotype data in case–control association studies. Genet Epidemiol 2009; 33: 256–265.
10. Richardson DB, Rzehak P, Klenk J, Weiland SK : Analyses of case–control data for additional outcomes. Epidemiology 2007; 18: 441–445.
11. Ioannidis JP, Thomas G, Daly MJ : Validating, augmenting and refining genome-wide association signals. Nat Rev Genet 2009; 10: 318–329.
12. McCarthy MI, Abecasis GR, Cardon LR et al: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 2008; 9: 356–369.
13. Cirulli ET, Goldstein DB : Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 2010; 11: 415–425.
14. Plomin R, Haworth CM, Davis OS : Common disorders are quantitative traits. Nat Rev Genet 2009; 10: 872–878.
15. Lange C, Silverman EK, Xu X, Weiss ST, Laird NM : A multivariate family-based association test using generalized estimating equations: FBAT-GEE. Biostatistics 2003; 4: 195–206.
16. Liu J, Pei Y, Papasian CJ, Deng HW : Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations. Genet Epidemiol 2009; 33: 217–227.
17. Allison DB, Thiel B, St Jean P, Elston RC, Infante MC, Schork NJ : Multiple phenotype modeling in gene-mapping studies of quantitative traits: power advantages. Am J Hum Genet 1998; 63: 1190–1201.
18. Boyko AR, Williamson SH, Indap AR et al: Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 2008; 4: e1000083.
19. Liu DJ, Leal SM : A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 2010; 6: e1001156.
20. Madsen BE, Browning SR : A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009; 5: e1000384.
21. Price AL, Kryukov GV, de Bakker PI et al: Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 2010; 86: 832–838.
22. Morris AP, Zeggini E : An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 2009; 34: 188–193.
23. Li B, Leal SM : Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008; 83: 311–321.
24. Neale BM, Rivas MA, Voight BF et al: Testing for an unusual distribution of rare variants. PLoS Genet 2010; 7: e1001322.
25. Bhatia G, Bansal V, Harismendy O et al: A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Comput Biol 2010; 6: e1000954.
26. Aitken AC : Notes on selection from a multivariate normal population. Proc Edin Math Soc 1934; 4: 106–110.
27. Munafo MR, Flint J : Meta-analysis of genetic association studies. Trends Genet 2004; 20: 439–444.
28. Skol AD, Scott LJ, Abecasis GR, Boehnke M : Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 2006; 38: 209–213.
29. Mailman MD, Feolo M, Jin Y et al: The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 2007; 39: 1181–1186.
30. Bouatia-Naji N, Rocheleau G, Van Lommel L et al: A polymorphism within the G6PC2 gene is associated with fasting plasma glucose levels. Science 2008; 320: 1085–1088.
31. Elliott P, Chambers JC, Zhang W et al: Genetic loci associated with C-reactive protein levels and risk of coronary heart disease. JAMA 2009; 302: 37–48.
32. Sladek R, Rocheleau G, Rung J et al: A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 2007; 445: 881–885.
33. Webster RJ, Warrington NM, Weedon MN et al: The association of common genetic variants in the APOA5, LPL and GCK genes with longitudinal changes in metabolic and cardiovascular traits. Diabetologia 2009; 52: 106–114.
34. Koster A, Chao YB, Mosior M et al: Transgenic angiopoietin-like (angptl)4 overexpression and targeted disruption of angptl4 and angptl3: regulation of triglyceride metabolism. Endocrinology 2005; 146: 4943–4950.
35. Li B, Ge D, Wang Y et al: Lipoprotein lipase gene polymorphisms and blood pressure levels in the Northern Chinese Han population. Hypertens Res 2004; 27: 373–378.
36. Romeo S, Yin W, Kozlitina J et al: Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J Clin Invest 2009; 119: 70–79.
37. Li M, Li C : Assessing departure from Hardy–Weinberg equilibrium in the presence of disease association. Genet Epidemiol 2008; 32: 589–599.
38. Garner C : Confounded by sequencing depth in association studies of rare alleles. Genet Epidemiol 2011; 35: 261–268.
39. Nyholt DR : A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet 2004; 74: 765–769.
40. Liu DJ, Leal SM : Replication strategies for rare variant complex trait association studies via next-generation sequencing. Am J Hum Genet 2010; 87: 790–801.

Acknowledgements

This research is supported by National Institutes of Health Grants 1RC4MD005964 and 1RC2HL102926 (SML). DJL is partially supported by a training fellowship from the Keck Center Pharmacoinformatics Training Program of the Gulf Coast Consortia (NIH Grant no. 5 R90 DK071505-04). We thank Drs Jonathan Cohen (JC) and Helen Hobbs for providing us with data from the Dallas Heart Study on the ANGPTL-family genes, which was supported by National Institutes of Health Grant RL1HL092550 (JC). Computation for this research was supported in part by the Shared University Grid at Rice funded by NSF under Grant EIA-0216467, and a partnership between Rice University, Sun Microsystems and Sigma Solutions Inc.

Author information

Authors and Affiliations

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
Dajiang J Liu & Suzanne M Leal
Department of Statistics, Rice University, Houston, TX, USA
Dajiang J Liu & Suzanne M Leal

Authors

Dajiang J Liu
Suzanne M Leal

Corresponding author

Correspondence to Suzanne M Leal.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on European Journal of Human Genetics website

About this article

Cite this article

Liu, D., Leal, S. A flexible likelihood framework for detecting associations with secondary phenotypes in genetic studies using selected samples: application to sequence data. Eur J Hum Genet 20, 449–456 (2012). https://doi.org/10.1038/ejhg.2011.211

Received: 17 May 2011
Revised: 17 October 2011
Accepted: 20 October 2011
Published: 14 December 2011
Issue Date: April 2012
DOI: https://doi.org/10.1038/ejhg.2011.211

Introduction