Introduction

There is solid evidence that complex traits can be influenced by rare variants.1, 2, 3, 4 The development and large-scale application of next-generation sequencing have already revolutionized genetic studies and enabled detecting associations with complex traits owing to rare variants. In order to design powerful studies, it is necessary to deeply sequence samples from a large number of individuals.5 However, many existing studies are small- to moderate-sized, owing to the high cost of sequencing or limited availability of samples, and are therefore inadequately powered. It would be advantageous if different studies, which measure the same phenotypes, could be jointly analyzed to increase power. In particular, many clinically important traits, such as body mass index (BMI), systolic (SysBP) and diastolic blood pressure (DiasBP) are often measured in different studies. When combined analysis is performed, in addition to incorporating studies that are targeted at the same primary traits, it is desirable to also analyze data from studies for which the phenotype of interest is measured as an additional outcome. Combined analyses require modeling multiple phenotypes as different studies may sequence selected samples targeted at different primary traits. Similar to the idea of analysis of covariance (ANCOVA), jointly analyzing multiple phenotypes makes it possible to distinguish the phenotype covariance component that is due to gene pleiotropy and the component that is attributable to residual correlations.

Currently, most studies sequence selected samples, for example, case–control samples or individuals with extreme phenotypes.1, 6 Sequencing selected samples reduces sequencing cost and improves power. Owing to sample ascertainment, secondary traits can be associated with the gene region in a selected sample even though they are independent in the general population. For example, consider a gene that is associated with the primary trait, but not with the secondary trait, in the general population (Figure 1). In a sample that consists of individuals with extreme primary trait values, the causative variant frequency will be different between individuals from the upper and lower extremes. The mean value for the secondary trait will also be different owing to phenotypic correlations. Therefore, a spurious association can occur between the gene region and the secondary trait unless the sample ascertainment scheme is correctly modeled. The selection criteria for a sequencing study can be complicated and may involve multiple traits (multiple-trait study) or sub-phenotypes. For instance, it is hypothesized that the etiologies of type-2 diabetes (T2D) are different in obese and non-obese individuals.7, 8 In order to reduce phenotype heterogeneity and potentially improve power, a study of T2D might be performed using an obese population. There have been methods developed for detecting associations with multiple phenotypes in selected samples.9, 10 However, these methods are limited to case–control studies. They are not applicable to more complicated study designs, for example, studies that sequence individuals with extreme primary traits (extreme-trait study), or studies where secondary phenotypes are also involved in sample selection. 
In particular, the extreme-trait study design is becoming increasingly popular and widely applied.11, 12, 13 The results for detecting associations with secondary traits can be seriously biased if the secondary traits are not properly analyzed.9 It is desirable to have a unified approach for analyzing secondary phenotypes from all available data sets.
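
The mechanism behind such spurious associations can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the carrier frequency, effect size, residual correlation, tail fraction and all function names are hypothetical and are not taken from the simulations reported in this article. A variant affects the primary trait only, individuals from both extremes of the primary trait are selected, and the mean secondary trait value is compared naively between carriers and non-carriers.

```python
import random
import statistics

def simulate_population(n_pop=20000, carrier_freq=0.05, beta_primary=1.0,
                        rho=0.6, seed=1):
    """Simulate (y1, y2, carrier) triples. The variant shifts the primary
    trait y1 only; y2 is linked to y1 solely through residual correlation rho.
    All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    people = []
    for _ in range(n_pop):
        carrier = 1 if rng.random() < carrier_freq else 0
        e1 = rng.gauss(0.0, 1.0)
        e2 = rho * e1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y1 = beta_primary * carrier + e1  # causal effect on the primary trait
        y2 = e2                           # no causal effect on the secondary trait
        people.append((y1, y2, carrier))
    return people

def select_extremes(people, tail=0.1):
    """Keep individuals in the lower and upper `tail` fraction of y1."""
    ranked = sorted(people, key=lambda p: p[0])
    k = round(tail * len(ranked))
    return ranked[:k] + ranked[-k:]

def carrier_mean_difference(people):
    """Naive contrast: mean y2 of carriers minus mean y2 of non-carriers."""
    carriers = [y2 for _, y2, c in people if c == 1]
    others = [y2 for _, y2, c in people if c == 0]
    return statistics.mean(carriers) - statistics.mean(others)

population = simulate_population()
selected = select_extremes(population)
print("population difference:", carrier_mean_difference(population))
print("selected-sample difference:", carrier_mean_difference(selected))
```

Under these assumed settings the naive contrast is close to zero in the full population and clearly positive in the selected sample, even though the variant has no effect on the secondary trait.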

Figure 1

Graphical illustration of gene/multiple phenotypes associations. The gene region is causal for the primary trait Y1 but not for the secondary trait Y2. Owing to the correlation between the two traits, a spurious association can be detected between the gene region and the secondary trait if the ascertainment mechanism or phenotypic correlations are not properly modeled.

A flexible likelihood approach, MULTI-TRAIT-ASSOCIATION (MTA), is presented for detecting associations with multiple phenotypes in selected or randomly ascertained samples. This method can be used to detect both common and rare variant/secondary phenotype associations. MTA jointly models multiple phenotypes conditional on the study subjects being ascertained. The sampling mechanisms are incorporated by means of a prospective likelihood approach. The MTA framework is comprehensive and can be used to model multiple continuous or categorical traits. To model traits that are not continuous, a generalized linear model is used. For example, either a probit or logit link function can be applied to model binary traits. In this article, the discussion is focused on using the probit link function and the liability threshold model, which can be justified by the polygenic model of complex traits. It has been suggested that the liability of all complex traits can be considered as ‘quantitative’.14 For complex traits that are not measured in a quantitative scale, there should exist a continuous underlying liability trait, which is due to the aggregated outcome from multiple causative variants with small effects. In this case, a multivariate liability threshold model is naturally used to jointly model multiple phenotypes.
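
The correspondence between the probit link and the liability threshold model can be made concrete with a short sketch. It assumes a standard normal residual liability and a threshold calibrated so that individuals with locus code 0 have the stated prevalence; the function names and the example effect size are hypothetical and only serve as illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal liability scale (assumption)

def liability_threshold(prevalence):
    """Threshold t such that P(liability > t) = prevalence when the
    liability is standard normal."""
    return STD_NORMAL.inv_cdf(1.0 - prevalence)

def case_probability(locus_code, beta, prevalence):
    """P(affected | locus code) when liability = beta * locus_code + e,
    e ~ N(0, 1), and an individual is affected if liability > t.
    This equals a probit regression with intercept -t and slope beta."""
    t = liability_threshold(prevalence)
    return 1.0 - STD_NORMAL.cdf(t - beta * locus_code)

print(case_probability(0, 0.5, 0.10))  # baseline risk for non-carriers
print(case_probability(1, 0.5, 0.10))  # risk for carriers with beta = 0.5
```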

The power of MTA for detecting gene/secondary trait associations is examined in different selective study designs. Three study designs are considered, that is, case–control, extreme-trait and multiple-trait. It is assumed for each of the study designs that the same continuous secondary phenotype T is measured. For comparison purposes, study designs are also evaluated where the quantitative trait T is selected and analyzed as the primary phenotype. Simulation details for each study design can be found in Table 1.

Table 1 Definitions of selection mechanisms

It is very beneficial to be able to use and combine selected samples from existing sequencing-based genetic studies. Through extensive simulation studies, it is shown that the case–control and extreme-trait designs can be more powerful for detecting associations with secondary phenotypes than using a population-based design, where individuals are randomly selected regardless of their phenotypes. The power for detecting associations with secondary phenotypes strongly depends on the aggregation of causative variants in the sample. For study designs that facilitate enrichment of causative variants, power will be increased. In the presence of gene pleiotropy, variants that are associated with both the primary and secondary traits can be enriched through selections on the primary phenotype. When the gene region is only associated with the secondary phenotype, if the primary and secondary traits are correlated, selections on the primary phenotype can also induce selections on the secondary phenotype. In this case, for a sample of equivalent size, the power of rejecting the null hypothesis of no gene/secondary trait association in case–control or extreme-trait studies is still superior or comparable to a population-based study.

The power for detecting associations with secondary phenotypes in selected samples is jointly affected by locus phenotypic effects for both the primary and secondary phenotypes, as well as residual correlations. Concordant with observations from previous studies of multiple-trait linkage/association mapping,15, 16, 17 it is demonstrated that power is maximized when the locus-induced trait correlations are in the opposite direction of the residual correlations. To further demonstrate the utility of MTA in combined analysis, an example is given where samples from a case–control study and a multiple-trait study are jointly analyzed. The power for detecting associations with commonly measured phenotypes can be greatly increased when studies are combined, compared with analyzing each individual study separately.

As an application of MTA, we analyzed the sequence data from the ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6 genes generated by the Dallas Heart Study (DHS). The 3551 study participants of the DHS were phenotyped for multiple metabolism-related traits, including BMI, DiasBP, SysBP, total cholesterol level (TCL), low-density lipoprotein (LDL), high-density lipoprotein (HDL), triglyceride (TG) and glucose (Gluc). Two primary trait analyses were first performed: (1) analysis of all samples and (2) analysis of selected samples whose quantitative trait values fall within the lower and upper quartiles. Next a secondary phenotype analysis was performed where within each selected sample all other available phenotypes were analyzed as secondary traits. The results from the secondary trait analyses confirmed the primary trait analyses. These results established the importance of analyzing secondary phenotypes and the effectiveness of MTA. They provided solid support to our simulation experiment.

Materials and methods

It is assumed that there are S variant nucleotide sites for a gene locus. The multi-site genotype for individual i is given by $\vec{X}_i = (x_i^1, x_i^2, \ldots, x_i^S)$, where the genotype $x_i^s$ at the segregating nucleotide site s is coded by the number of minor alleles (eg, $x_i^s = 2$ if the individual is homozygous for the minor allele). To detect associations with rare variants, multiple rare variants in a gene locus are usually jointly analyzed.18, 19, 20, 21, 22 The locus genotype coding for an individual i is defined as $X_i = C(\vec{X}_i)$, where C(•) is the coding function.

Locus multi-site genotype coding schemes

Many statistical methods have been developed for association studies of complex traits owing to rare variants. Existing methods include combined multivariate and collapsing (CMC),23 test of an aggregated number of rare variants (ANRV),22 weighted sum statistics (WSS),20 variable threshold test (VT),21 kernel-based adaptive cluster (KBAC),19 the C-alpha test24 and the RARECOVER (RC) method.25 Most of these methods are essentially based on weighting or grouping variants. Among them, CMC and ANRV are regression-based methods, which can be incorporated into MTA through the coding function C(•):

(1) CMC: The coding function is defined as $C(\vec{X}_i) = \delta\left(\sum_{s \in RV} x_i^s > 0\right)$, where δ(•) is an indicator function and $\sum_{s \in RV}$ is a summation over the set of rare variant nucleotide sites RV, which can be determined by a pre-specified frequency cut-off.

(2) ANRV: The coding function belongs to a more general class of weighted sum coding (WSC), which can be defined as $C(\vec{X}_i) = \sum_{s \in RV} w_s x_i^s$. In the WSC scheme, the variant from nucleotide site s is assigned weight $w_s$. The ANRV coding assigns equal weight for all variants, that is, $w_s = 1$ for every $s \in RV$.
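
The two coding schemes can be sketched in a few lines. This is a minimal illustration, not the implementation in the MTA package: the function names, the list-based genotype representation and the example frequency cut-off are assumptions made for the example.

```python
def rare_sites_by_frequency(genotype_matrix, maf_cutoff=0.01):
    """Indices of sites whose sample minor allele frequency lies in
    (0, maf_cutoff]. The cut-off value is an illustrative assumption."""
    n = len(genotype_matrix)
    n_sites = len(genotype_matrix[0])
    sites = []
    for s in range(n_sites):
        maf = sum(row[s] for row in genotype_matrix) / (2.0 * n)
        if 0.0 < maf <= maf_cutoff:
            sites.append(s)
    return sites

def cmc_code(genotypes, rare_sites):
    """Collapsing (CMC) code: 1 if the individual carries at least one
    minor allele at any rare site, 0 otherwise."""
    return 1 if any(genotypes[s] > 0 for s in rare_sites) else 0

def wsc_code(genotypes, rare_sites, weights=None):
    """Weighted sum code. With weights=None every site receives weight 1,
    which corresponds to the ANRV coding."""
    if weights is None:
        weights = {s: 1.0 for s in rare_sites}
    return sum(weights[s] * genotypes[s] for s in rare_sites)

# One individual, four sites, minor allele counts per site.
individual = [0, 1, 0, 2]
print(cmc_code(individual, rare_sites=[1, 3]))  # carrier indicator
print(wsc_code(individual, rare_sites=[1, 3]))  # number of rare minor alleles
```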

A general probability model for multiple phenotypes in selected samples

In order to incorporate the sample ascertainment mechanism and correct for the bias induced by phenotypic residual correlations, multiple phenotypes are jointly modeled. The primary and secondary traits are assumed to follow a multivariate generalized linear model:

and are link functions, and and are the model parameters related to the primary and secondary traits. This multivariate generalized linear model can be used with any type of link functions, such as probit link function or logit link function.

For selected samples, a conditional likelihood is used, which is similar to Pearson–Aitken correction for ascertainment:26

Zi is an indicator of individual i being sampled and N is the number of individuals in the sample. Each term in (2) satisfies

The sampling mechanism is characterized by which can be explicitly calculated for case–control, extreme-trait and multiple-trait studies. The details are shown in Supplementary Material Section 1. When the probit link function is used to model binary phenotypes, the multivariate generalized linear model can be simplified. Computational details can be found in Supplementary Material Section 2.
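
As an illustration of how ascertainment enters the likelihood, the sketch below evaluates a conditional log-likelihood for one simple case: an extreme-trait design with deterministic selection on a continuous primary trait. The bivariate normal model with unit variances, the parameter names (beta1, beta2, rho) and the selection thresholds are assumptions made for this example; the exact formulas used by MTA are those given in the Supplementary Material. For a selected individual the joint density is divided by the probability of being sampled given the locus code.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function

def bivariate_normal_logpdf(y1, y2, m1, m2, rho):
    """Log density of a bivariate normal with unit variances, means
    (m1, m2) and correlation rho."""
    z1, z2 = y1 - m1, y2 - m2
    q = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    return -math.log(2.0 * math.pi) - 0.5 * math.log(1.0 - rho * rho) - 0.5 * q

def selection_probability(x, beta1, lower, upper):
    """P(Z = 1 | x): the primary trait falls below `lower` or above `upper`."""
    m1 = beta1 * x
    return PHI(lower - m1) + 1.0 - PHI(upper - m1)

def conditional_loglik(sample, beta1, beta2, rho, lower, upper):
    """Sum over selected individuals of log P(y1, y2 | x, Z = 1).
    Every record in `sample` is (locus code, primary trait, secondary trait)
    and is assumed to satisfy the selection rule."""
    total = 0.0
    for x, y1, y2 in sample:
        joint = bivariate_normal_logpdf(y1, y2, beta1 * x, beta2 * x, rho)
        total += joint - math.log(selection_probability(x, beta1, lower, upper))
    return total

sample = [(0, 1.9, 0.7), (1, 2.3, 1.1), (0, -1.6, -0.4)]
print(conditional_loglik(sample, beta1=0.5, beta2=0.0, rho=0.4,
                         lower=-1.2816, upper=1.2816))
```

Maximizing this function, or differentiating it with respect to beta2 at beta2 = 0, gives ascertainment-corrected estimates and score statistics for this illustrative design.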

Association testing

The likelihood-based score statistic can be applied to detect associations with rare variants. Using collapsing coding, P-values for the score statistics can be analytically evaluated. For the WSC, if the weights are only dependent on the multi-site genotypes, the score statistic will asymptotically follow a normal distribution and the P-values can also be analytically evaluated. Permutation procedures cannot be used to analyze secondary phenotypes in selected samples. This is because if the gene region is associated with the primary phenotype, study subjects are not interchangeable under the null hypothesis of no gene/secondary phenotype associations. The analyses in the article were performed using the CMC coding defined above. The results remain the same when other coding schemes are used.

Combining different cohorts for analyses of secondary phenotypes

Statistical theories for combining multiple studies are well developed.27 As heterogeneity may exist between different cohorts, meta-analysis methods that combine test statistics should be used.11, 12 For rare variant analysis, multiple rare variants are jointly analyzed and their phenotypic effects are not usually estimated and reported. Therefore, all the joint analyses in this study were performed by combining score statistics from different studies. In the joint analysis, score statistics from different studies are weighted and summed. The weight assigned for each score statistic is proportional to the square root of the sample size according to Skol et al.28
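
The weighting scheme can be sketched as follows. The example assumes that each study contributes a standardized score statistic that is approximately standard normal under the null hypothesis and that the studies are independent; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def combine_score_statistics(z_scores, sample_sizes):
    """Weighted sum of standardized score statistics with weights
    proportional to sqrt(n_k), rescaled so that the combined statistic
    is standard normal under the null hypothesis."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

def two_sided_p(z):
    """Two-sided P-value for a standard normal statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

z_combined = combine_score_statistics([2.0, 2.0], [1000, 1000])
print(z_combined, two_sided_p(z_combined))
```

Two studies of equal size with the same statistic yield a combined statistic larger by a factor of the square root of two, which is the source of the power gain in a combined analysis.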

Generation of genetic and phenotypic data

Following Boyko et al,18 a rigorous population genetic model incorporating demographic change and purifying selections was used to simulate the African variant data. Details of generation of genetic data are given in Supplementary Material Section 3. To generate phenotypes, we assume that the phenotypic effects for causative variants are independent of their fitness. In a case–control study, the augmented phenotype for an individual i with multi-site genotype follows a bivariate normal distribution with

and

The rare variant sites CVA* and CVT are randomly chosen to be causative for the traits A* and T. Either set can be empty if the gene is not associated with the corresponding trait. Variants at sites that belong to both CVA* and CVT are pleiotropic and affect both phenotypes. The binary disorder status Ai is determined by whether the augmented phenotype A* of individual i exceeds a liability threshold. For each scenario, 1000 individuals were simulated. Details for simulating the extreme-trait and multiple-trait study samples can be found in Supplementary Material Section 4.

In order to evaluate type-I errors, phenotype data were generated under the null hypothesis of no gene/secondary trait T associations, that is, βT=0. Scenarios were considered where (1) the gene region is neither associated with the primary nor the secondary phenotypes and (2) the gene is associated with the primary phenotype but not with the secondary phenotype. Scenarios with a combination of two causative variant primary trait effects and four residual correlations were evaluated.

To compare the power of rejecting the null hypothesis of no gene/secondary trait associations, two causal variant secondary phenotype effects βT =±0.5σT were used. The power for the three study designs was compared under scenarios with different combinations of genetic parameter values.

Software availability

An R-package implementing MTA is available at http://www.bcm.edu/genetics/leal/software, which is compatible with commonly used operating systems, including Linux, Windows and OS X.

Results

Evaluation of type-I errors

Type-I errors for each study design using MTA were evaluated empirically. Under the null hypothesis of no genetic/secondary phenotype associations, the quantile–quantile (Q-Q) plots of the empirical and theoretical distributions of P-values are shown in Figures 2 and 3 for the case–control study design. When the ascertainment mechanism is correctly specified, the type-I errors are controlled. Results are shown in Figure 2 for the scenario where the gene region is not associated with either the primary or the secondary phenotypes, and the scenario where the gene region is only associated with the primary trait. Type-I errors for the extreme-trait and multiple-trait designs were also well controlled (data not shown). The impact of mis-specified sampling mechanisms was investigated. The results are shown in Figure 3 when the prevalence parameter is 10%, but is incorrectly set to be 7% (Figure 3a) or 13% (Figure 3b) in the analyses. The results indicate that mis-specifying prevalence has only a very minimal impact on type-I error rates as can be observed in the Q-Q plot.
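
An empirical evaluation of this kind can be sketched as follows, given P-values obtained from replicates simulated under the null hypothesis. The function names and the plotting positions (i + 0.5)/n for the expected quantiles are assumptions made for this example and are not taken from the article.

```python
import math

def empirical_type1_error(p_values, alpha=0.05):
    """Fraction of null replicates with a P-value at or below alpha."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)

def qq_points(p_values):
    """(expected, observed) pairs on the -log10 scale for a Q-Q plot.
    Expected quantiles use the plotting positions (i + 0.5) / n.
    All P-values must be strictly positive."""
    n = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]

null_p_values = [0.01, 0.2, 0.5, 0.9]  # toy input; real use needs many replicates
print(empirical_type1_error(null_p_values, alpha=0.05))
print(qq_points(null_p_values)[0])
```

When type-I errors are controlled, the empirical rate is close to alpha and the Q-Q points lie near the diagonal.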

Figure 2

Q-Q plot of P-values under the null hypothesis of no gene/secondary trait (T) associations. It is assumed that the disease prevalence (10%) is correctly specified. Scenarios with different combinations of primary trait phenotypic effects βA* and residual correlations ρA*,T were investigated. Results are shown where neither the primary nor the secondary traits are associated with the gene region (dashed red and blue lines), and where only the primary but not the secondary trait is associated with the gene region (solid green and brown lines). The results were obtained using 10 000 replicates.

Figure 3

Q-Q plot of P-values under the null hypothesis of no gene/secondary trait (T) associations when prevalence is mis-specified. The true prevalence (10%) is incorrectly specified as 7% (a) or 13% (b). Results are shown where neither the primary nor the secondary traits are associated with the gene region (dashed red and blue lines), and where only the primary but not the secondary trait is associated with the gene region (solid green and brown lines). The results were obtained using 10 000 replicates.

In order to illustrate the bias that could be induced by ascertainment, we also analyzed the simulated data using likelihood models without proper ascertainment corrections and the biases in most scenarios can be substantial. The details for the analyses are shown in Supplementary Material Section 5 and Supplementary Figure 1.

Power of detecting secondary phenotype rare variant associations

The efficiency of the three selective sampling designs for detecting secondary trait associations was compared when both the primary and the secondary traits are associated with the same gene (Table 2). Scenarios were examined where 1000 individuals are sequenced for each study design. There is considerable power for detecting secondary phenotype associations in selected samples. Analyzing secondary phenotypes in a case–control or an extreme-trait study data set can be consistently more powerful than a randomly ascertained population data set of equal size.

Table 2 Power to detect secondary trait T associations using case–control, extreme-trait and multiple-trait study design

When a population-based sample is used where 1000 individuals are randomly selected regardless of their phenotypic values, the power for rejecting the null hypothesis is only 51.7% (Supplementary Table 1). For a case–control sample where the secondary trait T is analyzed, the power can be higher (Table 2). For example, when the primary and secondary trait phenotypic effects, and residual correlation satisfy βA*=0.5σA*, βT=0.5σT and the power is 56.5%. It is also comparable to the power (56.6%) when 200 individuals with the most extreme trait T values from a cohort of 5000 are sequenced (Supplementary Table 2).

Compatible with observations from bivariate phenotype association studies,16 the power for detecting associations with secondary phenotypes is jointly determined by the sizes and directions of the locus phenotypic effects and residual correlations. The power is the highest when the correlation between the locus phenotypic effects is in the opposite direction of the trait residual correlations. For example, when the locus-induced correlation is positive (ie, and ) and the trait residual correlation is negative (ie, , the power is 55.7%. However, if the trait residual correlation is also positive (ie, ), the power is 53.5% (Table 2).

Similar patterns of power comparisons are observed for detecting associations with secondary phenotypes T in extreme-trait studies. The power for an extreme-trait study can be substantially higher than that for a population-based study of equivalent size. For example, if the primary and secondary trait effects, and residual correlations are given by and the power of rejecting the null hypothesis is 66.7% (Table 2). It is comparable to the power (70.6%) when 600 individuals with the most extreme trait T values from a cohort of 5000 are sequenced (Supplementary Table 2), or the power (66.6%) when 2000 randomly selected samples are sequenced (Supplementary Table 1).

When the gene region is only associated with the secondary trait T, using samples ascertained on the primary phenotype will induce selections on the secondary phenotype. For a data set of equivalent size, the power for rejecting the null hypothesis of no gene/secondary trait associations in case–control or extreme-trait samples is still greater than (or comparable to) analyzing the same trait using a randomly ascertained population sample. For example, in an extreme-trait study, which sequences 1000 individuals, when causal variants in the gene affect the secondary trait with effect and the two traits are positively correlated with correlation coefficient ρ=0.6, the power is 60.2%. If the two traits are negatively correlated with ρ=−0.6, the power is 60.6% (Table 2). The power in these two scenarios is both superior to that of a population-based study (51.6%), which sequences an equivalent number of samples (Supplementary Table 1).

The MTA method can be applied to analyze samples ascertained on multiple phenotypes. In this example of a multiple-trait study, 500 affected individuals with trait T-value above the 65th percentile are sequenced and 500 unaffected individuals are also selected regardless of their trait T-values (Table 2). Compared with the extreme-trait or case–control study design, the multiple-trait study example that is given is not as powerful. This is because there is not enough phenotypic variability in the sample, as affected individuals are only sampled from the sub-population with trait T above the 65th percentile. However, in some scenarios, there can be considerable power in a multiple-trait study, in particular when sampling on the secondary trait T increases phenotypic variability, for example, affected or unaffected individuals are selected to have secondary T trait values from opposite extreme tails.

MTA allows joint analysis of commonly measured phenotypes in different genetic studies. These studies may be targeted at different primary traits. An example is given where the association analysis of the secondary trait T is performed by combining a multiple-trait study data set with a case–control study data set (Table 3). A wide variety of scenarios were evaluated, and a sizable power increase for the combined analysis is consistently observed.

Table 3 Power to detect secondary trait T associations for individual studies (case–control and multiple-trait) and the combined analysis

Applications to the ANGPTL family of genes

When each of the eight phenotypes from the DHS was analyzed as primary phenotype using selected samples and the entire sample, four nominally significant associations were found for both types of analyses, that is, ANGPTL4 with TG (P=0.005), ANGPTL5 with BMI (P=0.003), ANGPTL5 with HDL (P=0.024) and ANGPTL6 with BMI (P=0.022). All of the above significant associations were also successfully detected when TG, BMI and HDL were analyzed as secondary phenotypes. An additional association between ANGPTL4 and HDL (P=0.018) was identified only when the entire sample was analyzed (Supplementary Table 3).

The association between TG and rare variants in the ANGPTL4 gene was identified using selected samples where the primary traits are BMI (P=0.025), SysBP (P=0.012) or LDL (P=0.010) (Table 4). These traits are only weakly positively correlated with TG, that is, ρBMI, TG=0.227, ρLDL, TG=0.197 and ρSysBP, TG=0.102 (Supplementary Table 4). The association between ANGPTL4 and TG is not significant using samples with extreme DiasBP (P=0.137), TCL (P=0.065), Gluc (P=0.117) and HDL (P=0.107) levels.

Table 4 Results for the secondary phenotype analyses using sequence data from the ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6 genes

Although the ANGPTL4 gene is significantly associated with HDL and the size of the correlation between HDL and TG is larger (ρHDL, TG=−0.374; Supplementary Table 4), the association of TG with ANGPTL4 gene is not significant when TG is analyzed as a secondary trait using samples with extreme HDL levels. This could have occurred because the locus phenotypic effects for HDL and TG are negatively correlated, and the locus-induced correlation lies in the same direction as the residual correlation, which is shown in our simulations to have reduced power compared with when the locus-induced correlation and trait residual correlations are in opposite directions.

There is one nominally significant association that was only detected in secondary phenotype analyses, that is, the association between Gluc and rare variants in the ANGPTL3 gene (P=0.024). It was identified when samples with extreme LDL levels were used. However, when Gluc was analyzed as the primary trait, the association was not significant (P=0.64). This could either be a novel association or a false-positive finding.

Discussion

In this article, a flexible likelihood framework MTA is proposed for jointly modeling multiple phenotypes in non-randomly ascertained samples, for example, case–control samples or extreme-trait samples. By coupling multivariate generalized linear models with prospective likelihood, complicated ascertainment mechanisms can be incorporated. The approach is flexible and particularly suitable for analyzing complex traits. It can be applied to any study with known sampling mechanisms. MTA allows efficient statistical inference for the genetic parameters of interest. Although the discussion in this article is focused on analyzing sequence data, MTA can also be applied to analyze genotype data.

The results presented in this article have important implications for the design and analysis of complex traits. Most current studies, owing to their limited sample size, are not adequately powered to detect associations with rare variants. It has been suggested that for an exome study 10 000 individuals with extreme traits from a cohort of 100 000 need to be sequenced in order to have adequate power.5 However, the sample size well exceeds the capacity of many existing studies.5 It is therefore particularly important that combined analysis can be performed using data from multiple studies in order to have sufficient power. Applying MTA, sequencing studies that are targeted at different primary traits can be jointly analyzed for detecting associations with a variety of commonly measured secondary traits.

The power of different selective study designs was investigated. It was shown through extensive simulations that there is considerable power for detecting secondary phenotype associations in selected samples. In particular, when the secondary trait of interest is analyzed in a case–control or an extreme-trait study data set, the power can be greater than analyzing an equivalent sized randomly ascertained sample. Using data-sharing platforms and protocols such as dbGaP,29 samples from existing studies can be freely obtained and analyzed. The power can be greatly increased when data from multiple studies are jointly analyzed.

Secondary phenotypes not only have their own clinical importance, but they can also be relevant for understanding the primary trait etiologies. For example, among studies of T2D, many are targeted at related quantitative traits, including fasting glucose levels30 and C-reactive protein.31 Given that these traits are often available for individuals who participate in T2D case–control studies,32 MTA can be applied to detect associations with these additional phenotypes.

MTA was also applied to the analysis of sequence data from the DHS. Multiple associations were identified, which confirmed previous data analyses. When the traits were analyzed as secondary phenotypes, although the same set of associations was observed, they were not detected in every selected sample. For example, the association between TG levels and ANGPTL4 was only detected in secondary trait analyses using samples with extreme BMI, SysBP and LDL, but not in samples with extreme DiasBP, HDL, TCL and Gluc. This could be affected by the small sample sizes that were analyzed; the moderate effect sizes for variants involved in complex trait etiologies; or the directions and magnitudes of the correlations between the primary and secondary phenotypes. Although these identified associations are only nominally significant, they all have biological support. In fact, the effects of mutant ANGPTL-family genes on lipoprotein lipase (LPL) have been investigated through in vitro functional studies and in vivo mice studies. LPL has been known to affect glucose metabolism,33 cholesterol level34 and blood pressure.35 Additionally, the association between variants in the ANGPTL4 gene and triglyceride levels has been successfully replicated.3, 36

Sensitivity of MTA to mis-specified sampling mechanisms was extensively evaluated. When the disease prevalence is reported as an interval of possible values, inferences from MTA can be conducted under different prevalence values from the interval. The results can be integrated using a model averaging procedure. It has been shown that it is an effective approach to further reduce the impact of mis-specified prevalence.37

There can be heterogeneities of sequence coverage depth within and between different studies. Coverage depth differences within a single study may cause inflated type-I errors. Possible strategies to reduce the bias include incorporating the mean coverage depth of each individual in the analysis as a covariate.38 The method can be used with the MTA model. In order to be robust against between-study heterogeneities, a meta-analysis procedure should be implemented for the joint analysis, instead of performing mega-analysis that combines individual participant data.11, 12

When multiple phenotypes are analyzed, to avoid inflated type-I error owing to testing multiple hypotheses, a stringent significance level must be specified. Owing to phenotypic correlations, Bonferroni corrections for testing multiple genes and phenotypes can be overly conservative. Instead, the spectral decomposition-based method of Nyholt et al39 can be used. In addition to correctly controlling for family-wise error rates, it is important that the findings can be replicated using independent samples.40
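
A spectral decomposition-based correction of this type can be sketched as follows. The sketch assumes the commonly cited form of the effective number of tests, M_eff = 1 + (M − 1)(1 − Var(λ)/M), where λ are the eigenvalues of the M × M correlation matrix, and it assumes a Šidák-type adjustment of the significance level; readers should check both against reference 39 before use. For a correlation matrix the eigenvalues sum to M and their squares sum to the sum of squared matrix entries, so the variance can be computed without an explicit eigen-decomposition. The function names are hypothetical.

```python
def effective_number_of_tests(corr):
    """Effective number of independent tests for a symmetric correlation
    matrix (list of lists), using M_eff = 1 + (M - 1)(1 - Var(lambda)/M).
    Var(lambda) is the sample variance of the eigenvalues, obtained from
    sum(lambda) = M and sum(lambda^2) = sum of squared entries."""
    m = len(corr)
    if m == 1:
        return 1.0
    sum_sq = sum(r * r for row in corr for r in row)
    var_lambda = (sum_sq - m) / (m - 1)
    return 1.0 + (m - 1) * (1.0 - var_lambda / m)

def adjusted_alpha(alpha, m_eff):
    """Sidak-type per-test significance level for m_eff effective tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m_eff)

corr = [[1.0, 0.5],
        [0.5, 1.0]]
m_eff = effective_number_of_tests(corr)
print(m_eff, adjusted_alpha(0.05, m_eff))
```

Uncorrelated phenotypes give M_eff = M, and perfectly correlated phenotypes give M_eff = 1, so the adjusted level lies between the Bonferroni-like level and the unadjusted level.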

With large-scale implementation of sequence-based genetic association studies, the capability for mapping complex traits will be further elevated. Detecting associations with rare variants and jointly investigating multiple phenotypes can be an ambitious and difficult task given the moderate sample sizes of existing studies. Taking advantage of multiple studies and mapping commonly measured phenotypes using MTA is therefore highly beneficial and will greatly accelerate the process of dissecting complex trait genetic etiologies.