Abstract
Integrating association evidence across multiple traits can improve the power of gene discovery and reveal pleiotropy. Most multi-trait analysis methods focus on individual common variants in genome-wide association studies. Here, we introduce multi-trait analysis of rare-variant associations (MTAR), a framework for joint analysis of association summary statistics between multiple rare variants and different traits. MTAR achieves substantial power gain by leveraging the genome-wide genetic correlation measure to inform the degree of gene-level effect heterogeneity across traits. We apply MTAR to rare-variant summary statistics for three lipid traits in the Global Lipids Genetics Consortium. The single-trait-based tests identified 99 genome-wide significant genes, and MTAR increases this to 139. Among the 11 novel lipid-associated genes discovered by MTAR, 7 are replicated in an independent UK Biobank GWAS analysis. Our study demonstrates that MTAR is substantially more powerful than single-trait-based tests and highlights the value of MTAR for novel gene discovery.
Introduction
Rich genome-wide association study (GWAS) findings have suggested the sharing of genetic risk variants among multiple complex traits^{1,2}. Multi-trait analyses that combine association evidence across traits can boost statistical power over single-trait analyses in detecting risk variants, especially for those traits that have weak associations with the variants. Many multi-trait methods are designed for testing the single-variant association^{3,4,5,6,7,8}. However, the statistical power of single-variant tests is low for rare-variant association studies (RVAS)^{9}. In light of this limitation, gene-based tests have been developed for RVAS to aggregate mutation information across several variant sites within a gene to enrich association signals and reduce the penalty resulting from multiple testing^{9}. Although several methods are available for multi-trait multi-variant tests, most of them require individual-level genotype and phenotype data^{10,11,12,13,14,15} or are designed for common variants^{16,17,18,19} (Supplementary Table 1). The gene-based tests for RVAS have not been fully exploited in the multi-trait analysis.
The genetic architecture of complex traits is unknown in advance and is likely to vary from one gene to another across the genome and from one trait to another. Therefore, the main challenge of multi-trait multi-variant analyses is to flexibly accommodate a variety of genetic effect patterns among traits and variants such that the test is robust and has high power. The effect structures among rare variants within a gene were well studied during the development of numerous gene-based tests. The sequence kernel association test (SKAT)^{20} and burden tests^{21,22,23,24} are the most widely used gene-based tests for RVAS and represent two main patterns of genetic effects across rare variants. Burden tests assume effects across variants are largely homogeneous and SKAT assumes they are heterogeneous. SKAT-O^{25} is a test that achieves robustness by combining tests with various degrees of effect heterogeneity, including the SKAT and burden tests as special cases. Specifically, SKAT-O assumes rare-variant effects are random variables with a uniform (exchangeable) correlation and different levels of heterogeneity can be considered by changing the correlation coefficient.
The effects on multiple traits may also exhibit homogeneous and heterogeneous patterns. However, the degree of genetic effect similarity/heterogeneity is likely to vary from one trait pair to another. As an example, for a pair of traits that are biologically related (e.g., triglycerides (TG) and high-density lipoprotein cholesterol (HDL)), we expect them to share more causal variants and have a higher level of genetic similarity than a pair of less related traits (e.g., TG and bipolar disorder)^{26}. Hence, it is not adequate to use a uniform correlation coefficient to model the degree of similarity for all trait pairs. Many recent studies have investigated the genetic overlap for many pairs of complex traits and diseases and estimated genetic correlation as a global measure of genetic similarity for trait pairs^{26,27,28}. Although a genetic correlation is calculated using common variants across the genome and rare-variant association tests are performed on the gene level, the idea of utilizing genetic correlation to guide the specification of gene-level effect heterogeneity across traits is intriguing and has not been considered in existing multi-trait methods.
Here we develop multi-trait analysis of rare-variant association (MTAR), a framework for the multi-trait analysis of RVAS. MTAR is built upon a random-effects meta-analysis model that uses different correlation structures of the genetic effects to represent a wide spectrum of association patterns across traits and variants. To model genetic effects across variants, MTAR employs the same strategy as SKAT-O. To model the rare-variant effect heterogeneity on multiple traits, MTAR leverages the genetic correlation. Specifically, we propose two correlation structures on the among-trait genetic effects. The first structure allows the between-trait effect similarity to change from the value of the genetic correlation to completely heterogeneous as an extreme and we term the resulting multi-trait association test iMTAR. The second structure allows the between-trait effect similarity to change from the value of the genetic correlation to homogeneous as an extreme and we term the resulting test cMTAR. Besides the aforementioned association patterns across traits, we also consider the scenario in which only a small number of traits are associated with the set of rare variants. This association pattern naturally occurs for genes that have very specific biological functions and do not affect many traits. To accommodate this pattern, we construct another test, cctP, which uses the Cauchy method^{29,30} to combine single-trait RVAS P-values. To achieve robustness and improve overall power, we combine the P-values of iMTAR, cMTAR, and cctP, and refer to this omnibus test as MTAR-O. To demonstrate the usefulness of MTAR empirically, we analyze summary statistics from the Global Lipids Genetics Consortium (GLGC) on low-density lipoprotein cholesterol (LDL), HDL, and TG. MTAR discovers more lipid-associated genes than single-trait-based analyses and many novel association signals are replicated in an independent UK Biobank dataset.
Moreover, our simulation results show that MTAR methods have well-preserved type I error rates and greater power than single-trait-based methods across a wide range of effect patterns across traits and variants. Finally, we compare MTAR with two existing multi-trait methods that outperform other competing methods. We find that MTAR is more powerful in almost all simulation settings and discovers more genes in the application to the GLGC data.
Results
MTAR overview
Suppose that we are interested in the effects of m variants in a gene on K traits. For k = 1, …, K, we let β_{k} = (β_{k1}, ⋯, β_{km})^{T} denote the effects of the m genetic variants on trait k. To perform MTAR tests, we first obtain the vector of variant-level score statistics for testing β_{k} = 0 denoted by U_{k} = (U_{k1}, …, U_{km})^{T} and the covariance estimate for U_{k} denoted by V_{k}. The U_{k} and V_{k} can be easily constructed using the information routinely shared in public domains (Methods). We let \(\widehat {\mathbf{\beta }}_k = {\mathbf{V}}_k^{ - 1}{\mathbf{U}}_k\) and write \(\widehat {\mathbf{\beta }} = (\widehat {\mathbf{\beta }}_1^{\mathrm{T}}, \ldots ,\widehat {\mathbf{\beta }}_K^{\mathrm{T}})^{\mathrm{T}}\). Given the true genetic effects \({\mathbf{\beta }} = ({\mathbf{\beta }}_1^{\mathrm{T}}, \ldots ,{\mathbf{\beta }}_K^{\mathrm{T}})^{\mathrm{T}}\), the \(\widehat {\mathbf{\beta }}\) approximately follows a normal distribution with mean β and covariance Σ^{31,32}, where \({\mathbf{\Sigma }} = {\mathrm{Blockdiag}}\{ {\mathbf{V}}_1^{ - 1}, \ldots ,{\mathbf{V}}_K^{ - 1}\}\) if traits are measured on studies without overlapping samples. If all the traits are from one study or multiple studies with overlapping subjects, the off-diagonal blocks in Σ are not zeros. For any given traits k and k′ with sample overlap, the formula for estimating the covariance between \(\widehat {\mathbf{\beta }}_k\) and \(\widehat {\mathbf{\beta }}_{k\prime }\) is provided in Eq. (3).
We are interested in testing the null hypothesis that the m variants are not associated with any of the K traits: H_{0} : β_{1} = β_{2} = ⋯ = β_{K} = 0. A multivariate test for this hypothesis has many degrees of freedom and low statistical power. In MTAR, we further assume that the genetic effects β are zero-mean random effects with covariance matrix σB, where σ is an unknown scalar and B is a pre-specified matrix dictating the covariances of genetic effects among traits and variants. Under this random-effects model, the equivalent null hypothesis is H_{0} : σ = 0 and we test this hypothesis using a variance-component score test (Eq. (5)). The test will have the optimal power if the specification of B reflects the true covariance structure of the effects. The true structure of B is unknown a priori. To separately model the genetic structures among traits and among variants, we propose to formulate B = B_{2} ⊗ B_{1}, the Kronecker product (⊗) of the among-trait effect covariance B_{2} and the among-variant effect covariance B_{1}. For B_{1}, we assume the exchangeable correlation structure with a uniform correlation coefficient denoted by ρ_{1} (Methods). By specifying different values of ρ_{1}, this structure allows various degrees of among-variant effect heterogeneity. As the two extremes, the effects across variants are homogeneous when ρ_{1} = 1; the effects are completely heterogeneous and vary independently when ρ_{1} = 0.
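The Kronecker structure described above can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names `exchangeable` and `kronecker`, the toy dimensions (two variants, two traits), the unit variant weights, and the value ρ_{1} = 0.5 are all assumptions chosen for illustration.

```python
def exchangeable(size, rho):
    """Return a size x size matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(size)] for i in range(size)]


def kronecker(b2, b1):
    """Kronecker product b2 (x) b1 of two square matrices given as nested lists."""
    k, m = len(b2), len(b1)
    out = [[0.0] * (k * m) for _ in range(k * m)]
    for a in range(k):          # trait index (row block)
        for b in range(k):      # trait index (column block)
            for i in range(m):  # variant index within the row block
                for j in range(m):
                    out[a * m + i][b * m + j] = b2[a][b] * b1[i][j]
    return out


# Hypothetical toy setting: m = 2 variants, K = 2 traits.
b1 = exchangeable(2, rho=0.5)          # among-variant covariance, unit weights assumed
b2 = [[1.0, -0.61], [-0.61, 1.0]]      # among-trait covariance for an illustrative trait pair
b = kronecker(b2, b1)                  # 4 x 4; rows ordered as (trait 1 variants, trait 2 variants)
```

The row ordering matches the stacking β = (β_{1}^{T}, …, β_{K}^{T})^{T}, so each m × m block of B is B_{1} scaled by one entry of B_{2}.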
For the between-trait effect covariance, we set \({\mathbf{B}}_2 = {\mathbf{W}}_2{\mathbf{\Omega }}_2{\mathbf{W}}_2\), where W_{2} is a diagonal matrix with each diagonal element being a trait-specific weight and Ω_{2} is a between-trait effect correlation matrix. By setting the diagonal elements in W_{2} to 0 or 1, we can choose to focus on any subset of the traits and consider any degree of association sparsity across traits (e.g., set only one element as 1 for single-trait analysis or all the elements as 1 for all-trait analysis). It is not sensible to assume the exchangeable correlation structure for B_{2}, because some pairs of traits are more similar in the rare-variant effects than other pairs (e.g., two diseases that are caused by the same set of rare mutations would have a large correlation in their rare-variant genetic effects). Here we propose to leverage the genetic correlation^{27} to inform the similarity of rare-variant effects among traits. Genetic correlation is a single-number measure that quantifies the overall genetic similarity between a pair of traits. Recent methodological advances enable us to conveniently estimate genetic correlation based on GWAS summary statistics^{27,28} and there are web portals to query genetic correlations among many complex traits^{26}. We hypothesize that the genetic correlation is also informative about the similarity/heterogeneity of the gene-level rare-variant effects among traits for most genes in the genome. Specifically, let C_{kk′} denote the genetic correlation between traits k and k′. We propose two types of correlation structures for Ω_{2}. In both structures, we specify a parameter ρ_{2} (0 ≤ ρ_{2} ≤ 1) to control the contribution of genetic correlation C_{kk′} to the degree of effect heterogeneity between traits k and k′. The iMTAR structure assumes the correlation coefficient is ρ_{2}C_{kk′}. 
Under this structure, the rare-variant effects across traits are heterogeneous and the degree of heterogeneity can change from C_{kk′} (when ρ_{2} = 1) to completely heterogeneous (strongest level of heterogeneity as effects across traits can vary independently when ρ_{2} = 0). The cMTAR structure assumes the correlation coefficient is ρ_{2}C_{kk′} + (1 − ρ_{2}). Under this structure, the degree of heterogeneity can change from C_{kk′} (when ρ_{2} = 1) to homogeneous (no heterogeneity when ρ_{2} = 0).
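The two among-trait correlation structures can be written out as a short sketch. The function name `omega2` and the trait ordering (LDL, HDL, TG) are assumptions made for illustration; the genetic correlation values are those reported for the lipid traits in the GLGC application.

```python
def omega2(gen_corr, rho2, structure):
    """Build the among-trait effect correlation matrix.

    gen_corr  : K x K genetic correlation matrix C (nested lists)
    rho2      : weight in [0, 1] given to the genetic correlation
    structure : "iMTAR" -> off-diagonal rho2 * C
                "cMTAR" -> off-diagonal rho2 * C + (1 - rho2)
    """
    if not 0.0 <= rho2 <= 1.0:
        raise ValueError("rho2 must lie in [0, 1]")
    k = len(gen_corr)
    out = []
    for a in range(k):
        row = []
        for b in range(k):
            if a == b:
                row.append(1.0)
            elif structure == "iMTAR":
                row.append(rho2 * gen_corr[a][b])
            elif structure == "cMTAR":
                row.append(rho2 * gen_corr[a][b] + (1.0 - rho2))
            else:
                raise ValueError("structure must be 'iMTAR' or 'cMTAR'")
        out.append(row)
    return out


# Genetic correlations among (LDL, HDL, TG); the ordering is an assumption.
C = [[1.00, 0.09, 0.35],
     [0.09, 1.00, -0.61],
     [0.35, -0.61, 1.00]]

independent = omega2(C, 0.0, "iMTAR")   # identity: effects vary independently across traits
homogeneous = omega2(C, 0.0, "cMTAR")   # all ones: effects identical across traits
```

At ρ_{2} = 1 both structures reduce to the genetic correlation matrix itself, which is why the two tests tend to agree when the genetic correlation describes the gene-level effects well.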
As the optimal values of ρ_{1} and ρ_{2} are unknown, we propose to search a grid of different values of ρ_{1} and ρ_{2} and use the Cauchy method^{29,30} to combine multiple P-values (Methods). The resulting tests are named after the two aforementioned iMTAR and cMTAR structures. The Cauchy method is a fast and powerful approach to combine multiple correlated P-values without the need for estimating and accounting for their correlation. To accommodate the situation where the gene is associated with a small number of traits, we develop a test called cctP that uses the Cauchy method to combine single-trait P-values from SKAT and burden tests. As we demonstrate in the GLGC data analysis and simulation studies, the cMTAR, iMTAR, and cctP tests cover different effect patterns among traits. To achieve further robustness, we use the Cauchy method to combine the P-values of the three complementary tests and term this omnibus test MTAR-O. A summary of the proposed iMTAR, cMTAR, cctP, and MTAR-O methods is presented in Fig. 1. The calculations of the test statistics and P-values are described in Methods.
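For readers unfamiliar with the Cauchy method, the following sketch shows the standard Cauchy combination rule: each P-value is mapped to a Cauchy variate, the variates are averaged, and the average is mapped back to a P-value. The function name `cauchy_combine` and the default of equal weights are assumptions; the weights actually used by MTAR are not stated in this section.

```python
import math


def cauchy_combine(pvalues, weights=None):
    """Combine P-values with the Cauchy combination rule.

    Each P-value p is transformed to tan((0.5 - p) * pi), the weighted mean of
    the transformed values is taken, and the result is converted back to a
    P-value through the standard Cauchy tail probability.
    Equal weights are assumed when none are supplied.
    """
    if weights is None:
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, pvalues)) / total
    return 0.5 - math.atan(stat) / math.pi


# Hypothetical P-values from three tests of the same gene.
combined = cauchy_combine([1e-6, 0.5, 0.8])
```

With one very small P-value among three, the combined value is close to three times the smallest one, so a strong signal in a single test is largely preserved. For extremely small P-values a tail approximation is numerically safer than the arctangent form used here.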
Although both Σ and B are covariance matrices among traits and variants, it is important to note the difference. Matrix Σ reflects the correlation due to the residual relatedness among traits in the presence of sample overlap and linkage disequilibrium (LD) among variants. An inaccurate estimate of Σ yields inflated type I error in the association testing. On the other hand, the matrix B = B_{2} ⊗ B_{1} reflects the similarity of the true gene-level rare-variant effects among traits and variants. This information is unknown a priori; hence, B needs to be pre-specified. The power of the tests can be greatly improved if the specification reflects the truth. MTAR utilizes the genetic correlation, a global measure of cross-trait genetic similarity, to guide the specification of B_{2}. The effectiveness of this strategy in gaining power is demonstrated in the following sections.
Application of MTAR to GLGC
We performed multi-trait RVAS for three plasma lipid traits: LDL, HDL, and TG. The GLGC data set includes ~300,000 individuals of primarily European ancestry genotyped with the HumanExome BeadChip (exome array)^{33}. The participants were from 73 different studies and single-variant association summary statistics were combined across studies via fixed-effects meta-analysis^{32}. The acquisition of the GLGC summary statistics is described in Methods.
Following Liu et al.^{33}, we considered 179,884 rare variants with minor allele frequency (MAF) < 5% and the highest priority according to their functionality and deleteriousness. We focused on 15,378 genes that contain at least two rare variants. In our analysis, we used the previously reported genetic correlation estimates among the three lipid traits^{27} in MTAR. Specifically, the genetic correlation is −0.61 for the pair (HDL, TG), 0.35 for (LDL, TG), and 0.09 for (LDL, HDL). For comparison, we performed the single-trait-based analysis by combining SKAT and burden test P-values across traits using either the cctP or the Bonferroni-corrected minimal P-value (minP, i.e., the minimal P-value multiplied by the number of tests combined).
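As a small illustration of the minP rule just described, the sketch below multiplies the smallest P-value by the number of tests combined. The function name `bonferroni_min_p`, the cap at 1, and the six example P-values (SKAT and burden for each of three traits) are assumptions made for illustration and are not taken from the GLGC results.

```python
def bonferroni_min_p(pvalues):
    """Bonferroni-corrected minimal P-value: min(p) times the number of tests.

    The result is capped at 1 so that it remains a valid P-value; the cap is
    an assumption of this sketch rather than something stated in the text.
    """
    return min(1.0, min(pvalues) * len(pvalues))


# Hypothetical single-trait P-values: (SKAT, burden) for three traits.
single_trait_p = [0.5, 0.2, 1e-7, 0.9, 0.4, 0.03]
min_p = bonferroni_min_p(single_trait_p)
```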
Similar to the previous gene-based RVAS of GLGC data^{33}, the slightly elevated genomic control lambdas in the quantile–quantile plots suggest the polygenic inheritance of the lipid traits (Supplementary Fig. 1). At a significance threshold of P < 3.3 × 10^{−6} (corresponding to 0.05/15,378), a total of 140 genes were identified by at least one test (Supplementary Table 2). MTAR tests (MTAR-O, cMTAR, iMTAR) identified 139 genes and the single-trait-based tests (cctP and minP) identified 99 genes (Fig. 2). There are 41 genes exclusively identified by MTAR tests and the MTAR P-values for many of these genes are 100-fold smaller than the single-trait-based P-values (Table 1, Manhattan plots in Fig. 3 and Supplementary Fig. 2). There is only one gene (HFE, Supplementary Table 2) missed by MTAR but its MTAR-O P-value (4.8 × 10^{−6}) is close to the single-trait-based P-value (1.8 × 10^{−6}).
Most discovered genes (>60%) have the smallest P-value when ρ_{2} is large (ρ_{2} ≥ 0.5), highlighting the informativeness of using genetic correlations to guide the among-trait effect correlation (Supplementary Fig. 3). For those genes, the association patterns among traits are generally consistent with their genetic correlations: genetic effects on HDL and TG are negatively correlated and effects on LDL and TG are positively correlated (Fig. 4a). The cMTAR and iMTAR tests produce similar P-values in this case. About 18% of the discovered genes lose significance if we do not use genetic correlations and simply assume the exchangeable correlation structure in B_{2}.
When the between-trait effects are strongly heterogeneous and vary randomly among traits (Fig. 4b), iMTAR produces much smaller P-values than other tests. When the between-trait effects are largely homogeneous (Fig. 4c), cMTAR provides the strongest evidence of association. When the gene is associated with one trait, the single-trait-based analysis (cctP and minP) is desirable (Fig. 4d). The MTAR-O P-value is close to the smallest P-value among all tests for all the identified genes (Supplementary Table 2).
Many of the 139 MTAR-identified genes have an established role in the three lipid traits, including targets for LDL-lowering drugs (e.g., PCSK9, NPC1L1, and PPARA) and genes with known association with lipid-related Mendelian disorders (e.g., LDLR, ABCG5, APOB, ABCA1, LCAT, APOA1, and CETP). Gene set enrichment analysis of the 139 genes highlighted the gene sets related to lipid metabolism and transport (Supplementary Fig. 4 and Supplementary Data 1), similar to the reported findings from gene set enrichment analysis of GWAS loci for LDL, HDL, and TG^{34}. Tissue enrichment analysis of all 139 significant genes using either Human Protein Atlas (HPA) or Genotype-Tissue Expression (GTEx) as reference sets demonstrated enrichment of liver-specific genes (Supplementary Fig. 5), in accordance with a published tissue eQTL enrichment analysis across GWAS loci associated with LDL, HDL, TG, or total cholesterol^{35}.
Among the 41 genes exclusively identified by MTAR tests, 27 (66%) genes have previously reported association evidence with at least one of the three lipid traits and 20 (74%) of them are associated with at least two lipid traits (Table 1). To replicate the associations of the genes without any existing annotation evidence, we applied the MTAR-O test to independent UK Biobank GWAS data (Methods). Despite the fact that UK Biobank GWAS data usually harbor a smaller number of rare variants in a gene than GLGC exome chip data, 7 out of 11 (64%) genes were found significant in the UK Biobank at α = 0.05/11 = 4.5 × 10^{−3} (Table 1 and Supplementary Table 3). These seven validated MTAR-discovered genes may have a causal impact on the lipid traits. One example is PNPLA2, which encodes the enzyme adipose TG lipase (ATGL); ATGL is involved in the breakdown of TG. Although variants associated with PNPLA2 have not previously been directly linked with any of the three lipid traits in humans, ATGL-knockout mice display altered very-low-density lipoprotein, HDL, and TG levels^{36}.
Simulation studies
We used simulation studies to further investigate the type I error control and power of MTAR. We considered three continuous traits that have similar residual covariances as the three lipid traits in the real data^{37}. We simulated data in three cohorts (N_{1} = 3000, N_{2} = 3500, N_{3} = 2000) that have different patterns of sample overlap for the three traits (Supplementary Fig. 6). The details of genotype and phenotype simulations are provided in Methods.
As in the GLGC data analysis, we utilized the combined summary statistics across three cohorts for each trait (Methods). We first evaluated the empirical type I error rates based on 10^{8} simulation replicates. Prior research has shown that the accuracy of the Cauchy combined P-value is generally satisfactory for practical use in rare-variant association tests, but a slight inflation is possible^{30}. Reassured that type I error was well controlled (Supplementary Table 4), we then proceeded to simulate traits under the alternative model to evaluate power. The percentage of causal variants was set to be 50% or 20% for scenarios of dense and sparse signals. For the causal variant j in trait k, the genetic effect was set to \(\beta _{kj} = s_j^{{\mathrm{snp}}} \times s_k^{{\mathrm{trait}}} \times d{\mathrm{log}}_{10}{\mathrm{MAF}}_j\), where \(s_j^{{\mathrm{snp}}}\) and \(s_k^{{\mathrm{trait}}}\) determined the heterogeneity of the effect directions among variants and traits, respectively, and dlog_{10} MAF_{j} meant that the effect size was larger for variants with smaller MAF. We set different values of d for different percentages of causal variants and different \(s_j^{{\mathrm{snp}}}\) settings such that the power of the tests in each setting is reasonably high. The effects among causal variants are either in the same direction (\(s_j^{{\mathrm{snp}}}\) = 1 for all j) or bidirectional (randomly assign 1 or −1 with equal probability to \(s_j^{{\mathrm{snp}}}\)). To run MTAR, we utilized the genetic correlations in GLGC data analysis for LDL, HDL, and TG, but we did not specify \(s_k^{{\mathrm{trait}}}\) according to their genetic correlations. In particular, we considered five patterns of \(s_k^{{\mathrm{trait}}}\) across traits: \((s_1^{{\mathrm{trait}}},s_2^{{\mathrm{trait}}},s_3^{{\mathrm{trait}}}) = (0,0,1)\); (0, 1, 1); (0, −1, 1); (1, −1, 1) and (1, 1, 1). 
All the association patterns across traits and variants considered in our power simulation are visualized in Supplementary Fig. 7.
The empirical power is estimated at the significance level of α = 2.5 × 10^{−6} based on 10^{4} replicates (Fig. 5). When the gene is associated with one trait (pattern 1 of \(s_k^{{\mathrm{trait}}}\)), the single-trait-based tests (cctP and minP) are more powerful than iMTAR and cMTAR but the trend is reversed in other patterns. cMTAR is more powerful than iMTAR when the effects are homogeneous (pattern 5). iMTAR is much more powerful than cMTAR when the effects are heterogeneous and the specified genetic correlations are not informative about the true relationship of the effects among traits (pattern 2). The power of MTAR-O is close to that of the most powerful test in all scenarios. These observations are consistent with results from the GLGC data analysis.
Comparison with other multitrait multivariant methods
In comparison with existing multi-trait multi-variant methods (Supplementary Table 1), MTAR has a unique combination of features that make it desirable for practical use. First, MTAR uses summary statistics rather than individual-level data. MTAR starts with simple summary statistics calculated in a study for each trait: variant-level score statistics and their covariance estimates^{24}. These statistics can be easily constructed using the information routinely shared in public domains^{38}. Compared with methods that require pooling individual-level data, using summary statistics can better protect study participant privacy and reduce logistical difficulties and computational burden. Second, MTAR allows the summary statistics for different traits to come from (possibly unknown) overlapping samples. Failure to account for the correlation between summary statistics induced by the overlapping samples can greatly inflate type I error^{39}. Sample overlap is prevalent in the multi-trait analysis. Sometimes the overlap pattern is clear (e.g., all traits are measured in the same study or in different studies that share controls^{40,41}), but at other times it is elusive: public domains only have combined summary statistics across many studies for each trait, and study-specific summary statistics are not available^{7}. MTAR can handle these scenarios and uses a simple approach to accurately estimate the correlation between summary statistics for the traits with sample overlap. Third, MTAR is computationally fast. The MTAR P-value calculation is analytical and does not require time-consuming procedures such as permutation and Monte Carlo simulation.
We compared the power of MTAR with Multi-SKAT^{10} (MultiSKAT R package) and MTaSPUsSet^{17} (aSPU R package) in numerical studies. These two existing methods have demonstrated superior performance to other competing multi-trait multi-variant methods such as metaCCA^{18}, MGAS^{19}, DKAT^{11}, MAAUSS^{13}, MSKAT^{15}, and GAMuT^{14}. Similar to MTAR, Multi-SKAT and MTaSPUsSet proposed several tests to accommodate different patterns of associations across traits and variants, and omnibus tests to gain robustness. We compared their omnibus tests with MTAR-O. In the simulation study, we let all cohorts have complete trait values, as required by Multi-SKAT. Empirical power was estimated at the α = 10^{−4} level due to the speed of MTaSPUsSet. MTAR-O has greater power than Multi-SKAT and MTaSPUsSet in almost all scenarios, especially when the genetic correlation reflects the heterogeneity of effects among traits (patterns 3–4 of \(s_k^{{\mathrm{trait}}}\)) (Supplementary Fig. 8). Furthermore, MTAR-O is computationally more efficient. Multi-SKAT and MTaSPUsSet, respectively, take 29 and 184 s on average to complete one replicate of simulation, whereas MTAR-O only takes 10 s. In addition, we applied MTaSPUsSet, which does not require individual-level data, to the GLGC summary statistics. MTaSPUsSet missed 52 MTAR-identified genes, whereas MTAR only missed 9 MTaSPUsSet-identified genes.
Discussion
We have introduced MTAR, a framework for conducting the meta-analysis of RVAS summary statistics across multiple traits. The cMTAR, iMTAR, and cctP tests cover a wide variety of association patterns among traits and variants. The omnibus test MTAR-O achieves robust and high power by combining the P-values of the three complementary tests. The use of summary statistics and the Cauchy P-value combination method empowers MTAR to conduct whole-genome multi-trait RVAS in a computationally efficient manner. The computation time of running MTAR methods on the simulated and GLGC datasets is summarized in Methods. Our numerical results have confirmed that MTAR tests properly control the type I error in the presence of complex patterns of sample overlap among traits and have substantial power gain relative to the separate analysis of RVAS for each trait. In the analysis of lipid traits in GLGC, MTAR identified many more genes than single-trait-based tests, including genes that have not been previously linked to lipid traits and represent novel findings. Many of these genes have been successfully replicated in independent UK Biobank data.
Utilizing genetic correlations to guide the specification of gene-level effect heterogeneity across traits is one main innovation of MTAR. The genetic correlation is a genome-wide measure of the shared genetic architecture between a pair of traits and it is calculated using common variants across the genome. The GLGC data analysis results suggest that the rare-variant effect correlation among traits is generally in accordance with the genetic correlation for most genes and the use of genetic correlation in MTAR helps to substantially improve the power of the multi-trait analysis.
Although we mainly demonstrate MTAR in the analysis of continuous traits, the method can be applied to binary traits (Methods) as long as the score statistics from the models are unbiased and their covariance estimates are accurate. For binary traits, the normal approximation to the score statistics could be inaccurate in the unbalanced case-control setting^{42}, which could affect the performance of the multi-trait analysis. For studies with related subjects, one may use methods based on mixed models to generate appropriate score statistics^{43,44}. Future research is required on how to properly handle various patterns of sample overlap across traits in the presence of familial and cryptic relatedness.
With the increasing number of complex traits available in large-scale whole exome/genome sequencing studies and electronic health record-linked biobank data, multi-trait analysis based on summary statistics of multiple rare variants will become an important tool to boost the power of discovering genetic components of complex traits and unravel their shared genetic architectures. We envision that MTAR will facilitate the accumulation of adequately large sample sizes to accelerate discoveries in complex trait genetics and provide new biological insights by revealing pleiotropic genes.
Methods
Covariance of genetic effects among variants
B_{1} is an m × m covariance matrix for the effects among variants. We set \({\mathbf{B}}_1 = {\mathbf{W}}_1{\mathbf{\Omega }}_1{\mathbf{W}}_1\), where W_{1} is a diagonal matrix with each element being a variant-specific weight and Ω_{1} is a between-variant effect correlation matrix of exchangeable structure with correlation coefficient ρ_{1} (0 ≤ ρ_{1} ≤ 1). Specifically, the single-trait analysis becomes SKAT^{20} if ρ_{1} = 0 and burden tests^{22,24,45,46} if ρ_{1} = 1. Burden tests are more powerful when the association effects are similar across the aggregated variants, whereas SKAT is more powerful when the effects are in opposite directions or the number of causal variants is small relative to neutral variants. As for the variant-specific weights (in W_{1}), by default, we set them based on the MAF through a beta distribution density function Beta(MAF; 1, 25) as in SKAT. Other weighting schemes can be employed as well.
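A minimal sketch of this construction follows. It relies on the closed form of the Beta(1, 25) density, 25(1 − x)^{24}, so that no statistics library is needed. The function names and the two example MAF values are assumptions made for illustration.

```python
def beta_1_25_density(maf):
    """Density of the Beta(1, 25) distribution at maf, i.e. 25 * (1 - maf)**24."""
    return 25.0 * (1.0 - maf) ** 24


def variant_covariance(mafs, rho1):
    """Return B1 = W1 * Omega1 * W1 as nested lists.

    W1 is diagonal with Beta(MAF; 1, 25) weights and Omega1 is exchangeable
    with off-diagonal correlation rho1.
    """
    w = [beta_1_25_density(f) for f in mafs]
    m = len(mafs)
    return [[w[i] * (1.0 if i == j else rho1) * w[j] for j in range(m)]
            for i in range(m)]


# Hypothetical MAFs for two rare variants in one gene.
mafs = [0.01, 0.001]
skat_like = variant_covariance(mafs, rho1=0.0)     # diagonal B1 (SKAT extreme)
burden_like = variant_covariance(mafs, rho1=1.0)   # rank-one B1 (burden extreme)
```

The weights decrease with MAF, so rarer variants receive larger prior effect variances under this default.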
Summary statistics
For each trait k (k = 1, …, K) and subject i (i = 1, …, n), when the individual-level phenotype (Y_{ik}), genotypes (G_{ik}), and covariates (X_{ik}) are available, the score statistics U_{k} and their covariance estimate V_{k} can be obtained from the generalized linear model with the likelihood function \({\mathrm{exp}}\left\{ {\frac{{Y_{ik}\left( {{\mathbf{\beta }}_k^{\mathrm{T}}{\mathbf{G}}_{ik} + {\mathbf{\gamma }}_k^{\mathrm{T}}{\mathbf{X}}_{ik}} \right) - b({\mathbf{\beta }}_k^{\mathrm{T}}{\mathbf{G}}_{ik} + {\mathbf{\gamma }}_k^{\mathrm{T}}{\mathbf{X}}_{ik})}}{{a(\phi _k)}} + c(Y_{ik},\phi _k)} \right\},\) where β_{k} and γ_{k} are regression parameters, ϕ_{k} is a dispersion parameter, and a, b, and c are specific functions. Specifically, we have \({\mathbf{U}}_k = a(\hat \phi _k)^{ - 1}\mathop {\sum}\nolimits_{i = 1}^n {\{ Y_{ik} - b^{\prime} (\widehat {\mathbf{\gamma }}_k^{\mathrm{T}} {\mathbf{X}}_{ik})\} } {\mathbf{G}}_{ik}\) and \({\mathbf{V}}_k = a(\hat \phi _k)^{ - 1}[\mathop {\sum}\nolimits_{i = 1}^n {b^{\prime\prime} } (\widehat {\mathbf{\gamma }}_k^{\mathrm{T}} {\mathbf{X}}_{ik}){\mathbf{G}}_{ik}{\mathbf{G}}_{ik}^{\mathrm{T}} - \{ \mathop {\sum}\nolimits_{i = 1}^n {b^{\prime\prime} } (\widehat {\mathbf{\gamma }}_k^{\mathrm{T}} {\mathbf{X}}_{ik}){\mathbf{G}}_{ik}{\mathbf{X}}_{ik}^{\mathrm{T}}\} \{ \mathop {\sum}\nolimits_{i = 1}^n {b^{\prime\prime} } (\widehat {\mathbf{\gamma }}_k^{\mathrm{T}} {\mathbf{X}}_{ik}){\mathbf{X}}_{ik}{\mathbf{X}}_{ik}^{\mathrm{T}}\} ^{ - 1}\{ \mathop {\sum}\nolimits_{i = 1}^n {b^{\prime\prime} } (\widehat {\mathbf{\gamma }}_k^{\mathrm{T}} {\mathbf{X}}_{ik}){\mathbf{X}}_{ik}{\mathbf{G}}_{ik}^{\mathrm{T}}\}]\), where \(\widehat {\mathbf{\gamma }}_k\) and \(\hat \phi _k\) are the restricted maximum likelihood estimators of γ_{k} and ϕ_{k} under H_{0} : β_{k} = 0, and b′ and b″ are the first and second derivatives of function b. 
For the linear regression model, we have \(a(\hat \phi _k) = n^{ - 1}\mathop {\sum}\nolimits_{i = 1}^n {(Y_{ik} - \widehat {\mathbf{\gamma }}_k^{\mathrm{T}} {\mathbf{X}}_{ik})^2}\), b′(z) = z, and b″(z) = 1. For the logistic regression model, we have a(\(\hat \phi _k\)) = 1, b′(z) = e^{z}/(1 + e^{z}), and b″(z) = e^{z}/(1 + e^{z})^{2}.
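To make the linear-model case concrete, the sketch below evaluates the score statistic and its variance for a single variant and a single trait under the simplifying assumption that the only covariate is an intercept (X_{ik} = 1). Under that assumption the fitted null mean is the sample mean of the trait. The function name and the toy data are hypothetical.

```python
def score_statistic_linear(y, g):
    """Score statistic U and variance V for one variant, intercept-only null model.

    With X = 1 and b''(z) = 1 the general formulas reduce to
      a_hat = mean((y - ybar)^2)
      U     = sum((y - ybar) * g) / a_hat
      V     = (sum(g^2) - sum(g)^2 / n) / a_hat
    """
    n = len(y)
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    a_hat = sum(r * r for r in resid) / n
    u = sum(r * gi for r, gi in zip(resid, g)) / a_hat
    sum_g = sum(g)
    v = (sum(gi * gi for gi in g) - sum_g * sum_g / n) / a_hat
    return u, v


# Hypothetical toy data: four subjects, genotype coded as minor-allele count.
y = [1.0, 2.0, 3.0, 4.0]
g = [0, 0, 1, 1]
u, v = score_statistic_linear(y, g)
beta_hat = u / v            # one-variant version of V^{-1} U
z_score = u / v ** 0.5      # Z-score U / sqrt(V)
```

In this toy example the effect estimate equals the difference in mean trait value between carriers and non-carriers, which provides a quick sanity check on the formulas.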
The U_{k} and V_{k} can also be derived from different forms of summary statistics shared in public domains^{38}. When the score statistics U_{k} and their variances (i.e., diag(V_{k})) are available, the covariance matrix of U_{k} can be approximated as V_{k} ≈ {diag(V_{k})}^{1/2}R{diag(V_{k})}^{1/2}, where \({\mathbf{R}} = \{ R_{j\ell }\} _{j,\ell = 1}^m\) is the SNP LD matrix, calculated as the Pearson correlation coefficients among the genotypes of the m variants from either the working genotypes or an external reference. Alternatively, when the effect estimates \(\widehat {\mathbf{\beta }}_k = \{ \hat \beta _{kj}\} _{j = 1}^m\) and their standard errors \({\mathbf{se}}_k = \{ se_{kj}\} _{j = 1}^m\) are available, we can approximate \({\mathbf{U}}_k = \{ U_{kj}\} _{j = 1}^m\) and \({\mathbf{V}}_k = \{ V_{kj\ell }\} _{j,\ell = 1}^m\) as \(U_{kj} \approx \hat \beta _{kj}{\mathrm{/}}se_{kj}^2\) and \(V_{kj\ell } \approx R_{j\ell }{\mathrm{/}}(se_{kj}se_{k\ell })\).
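The second conversion can be sketched in a few lines. The function name `summary_to_score` and the numerical inputs are hypothetical; the LD matrix R is assumed to be supplied by the user, for example from a reference panel.

```python
def summary_to_score(beta, se, R):
    """Approximate U and V from effect estimates, standard errors, and LD.

    beta, se: lists of length m for one trait.
    R: m x m LD (genotype correlation) matrix as nested lists.
    Uses U_j ~ beta_j / se_j^2 and V_jl ~ R_jl / (se_j * se_l).
    """
    m = len(beta)
    U = [beta[j] / se[j] ** 2 for j in range(m)]
    V = [[R[j][l] / (se[j] * se[l]) for l in range(m)] for j in range(m)]
    return U, V


# Hypothetical summary statistics for two variants in one gene
beta = [0.2, -0.1]
se = [0.1, 0.2]
R = [[1.0, 0.5], [0.5, 1.0]]
U, V = summary_to_score(beta, se, R)
```

When R is the identity matrix, V is diagonal and V^{-1}U recovers the input effect estimates, which is a quick sanity check on the approximation.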
Covariance of summary statistics between traits
If all the traits are from the same study or multiple studies with overlapping samples, the summary statistics U_{k} among traits k = 1, …, K are correlated. Assume trait k is from cohort A with sample size n_{A} and trait k′ is from cohort B with sample size n_{B}, and there are n_{C} overlapping subjects in these two cohorts. For any SNP j not associated with the traits, the correlation matrix of the Z-scores \(U_{kj}{\mathrm{/}}\sqrt {V_{kj}}\) among traits is invariant to SNP j^{39,47}. In particular, if both traits k and k′ are quantitative, we have
If both traits k and k′ are binary, let n_{C0} (n_{C1}) denote the number of overlapping subjects with trait value 0 (or 1), n_{A0} (n_{A1}) the number of subjects in cohort A whose trait k takes the value 0 (or 1), and n_{B0} (n_{B1}) the number of subjects in cohort B whose trait k′ takes the value 0 (or 1); then we have^{39}
Hence, we can accurately estimate ζ_{kk′} using independent null variants across the whole genome. Specifically, we first perform LD pruning with an LD threshold of r^{2} < 0.01 in 500-kb regions to obtain a set of independent common variants. We then remove variants with association test P-values < 0.05, keeping only variants that are not associated with any trait. For any traits k and k′, we calculate the between-trait sample correlation of the Z-scores on the remaining variants and denote it as \(\hat \zeta _{kk^\prime }\). In our simulation study, we benchmarked \(\hat \zeta _{kk^\prime }\) against the empirical sample covariance of Z-scores and confirmed the accuracy of the estimate \(\hat \zeta _{kk^\prime }\) (Supplementary Fig. 9). Finally, provided the gene is not associated with any trait, the covariance of \(\widehat {\mathbf{\beta }}_k\) and \(\widehat {\mathbf{\beta }}_{k^\prime }\) can be estimated using \(\hat \zeta _{kk^\prime }\)
where the matrix R is the SNP LD matrix defined in the previous subsection.
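The estimation of \(\hat \zeta _{kk^\prime }\) from null variants can be sketched as follows for a pair of traits. The sketch assumes the input Z-scores already come from an LD-pruned set of common variants, and it derives two-sided P-values from the Z-scores under a standard normal reference; the function names and the toy Z-scores are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided P-value of a Z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def estimate_zeta(z_k, z_kprime, p_threshold=0.05):
    """Between-trait Pearson correlation of Z-scores over null variants.

    z_k, z_kprime: Z-scores of the same LD-pruned variants for two traits.
    Variants with P < p_threshold for either trait are dropped first.
    """
    kept = [(a, b) for a, b in zip(z_k, z_kprime)
            if two_sided_p(a) >= p_threshold and two_sided_p(b) >= p_threshold]
    if len(kept) < 2:
        raise ValueError("need at least two null variants")
    mean_a = sum(a for a, _ in kept) / len(kept)
    mean_b = sum(b for _, b in kept) / len(kept)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in kept)
    var_a = sum((a - mean_a) ** 2 for a, _ in kept)
    var_b = sum((b - mean_b) ** 2 for _, b in kept)
    return cov / math.sqrt(var_a * var_b)


# Toy Z-scores; the fourth variant (Z = 3.5) is filtered out as associated
zeta_hat = estimate_zeta([0.5, -1.0, 1.2, 3.5, -0.3], [0.4, -0.8, 1.0, 0.2, -0.5])
```

With more than two traits, the filter would be applied across all traits before computing each pairwise correlation, in line with the requirement that retained variants are not associated with any trait.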
MTAR test statistics and P-values
We let \({\mathbf{\beta }} = ({\mathbf{\beta }}_1^{\mathrm{T}}, \ldots ,{\mathbf{\beta }}_K^{\mathrm{T}})^{\mathrm{T}}\) denote the m genetic effects across K traits and \(\widehat {\mathbf{\beta }} = ({\mathbf{V}}_1^{-1}{\mathbf{U}}_1, \ldots ,{\mathbf{V}}_K^{-1}{\mathbf{U}}_K)\) denote their effect estimates constructed from U_{k} and V_{k}. The MTAR framework assumes the hierarchical model
As described in the main text, Σ reflects the correlation due to the residual relatedness among traits in the presence of sample overlap and LD among variants, and B reflects the correlation among the rare-variant effects across traits and variants. The B matrix contains two coefficients ρ_{1} and ρ_{2}, where ρ_{1} controls the effect correlation among variants and ρ_{2} controls the contribution of the genetic correlation to the among-trait rare-variant effect correlation.
For a fixed set of ρ = (ρ_{1}, ρ_{2}), we test H_{0} : σ = 0 against H_{1} : σ ≠ 0 by a variance-component score test^{48}:
The test statistic follows a mixture of χ^{2} distributions under the null hypothesis. Davies' method can be used to accurately estimate the P-value^{49}. In addition, rare variants are often polymorphic in the data for some but not all traits; the adjustment of the formula for this case is described in the Supplementary Methods.
In the cMTAR and iMTAR tests (which correspond to two specifications of the effect correlation among traits in B), the Cauchy P-value combination method is used to combine results across values of ρ_{1} and ρ_{2}. Like the minimum P-value method, the Cauchy method is driven mainly by the few smallest P-values^{30}. Its advantage over the minimum P-value method is computational speed, because it does not rely on Monte Carlo simulation to account for the correlation among the individual tests^{29}. Specifically, the iMTAR or cMTAR test statistic is defined as
\[Q_{{\mathrm{iMTAR/cMTAR}}} = \frac{1}{|{\cal{S}}|}\mathop {\sum}\limits_{{\mathbf{\rho }} \in {\cal{S}}} \tan \left[ \{ 0.5 - p(Q_{\mathbf{\rho }})\} \pi \right],\]
where p(Q_{ρ}) is the P-value of Q_{ρ}, \({\cal{S}}\) is a set that includes a grid of possible values of ρ = (ρ_{1}, ρ_{2}), and \(|{\cal{S}}|\) is the size of the set. In our implementation, we consider the grid {0, 0.5, 1} for both ρ_{1} and ρ_{2}, giving nine combinations. Supplementary Fig. 10 shows that the GLGC analysis results are not sensitive to the choice of the grid. The P-value of Q_{iMTAR/cMTAR} can be accurately approximated by 0.5 − arctan (Q_{iMTAR/cMTAR})/π^{29}.
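A minimal sketch of this combination step is given below. It assumes equal weights across the grid, which is the form of the Cauchy combination consistent with the P-value approximation 0.5 − arctan(Q)/π; the per-grid-point P-values are hypothetical placeholders for the output of the variance-component score test, and the function name `cauchy_combine` is not taken from the MTAR package.

```python
import math
from itertools import product


def cauchy_combine(p_values):
    """Equal-weight Cauchy combination of P-values strictly between 0 and 1.

    Q is the mean of tan{(0.5 - p) * pi}; the combined P-value is
    approximated by 0.5 - arctan(Q) / pi.
    """
    q = sum(math.tan((0.5 - p) * math.pi) for p in p_values) / len(p_values)
    return 0.5 - math.atan(q) / math.pi


# Grid {0, 0.5, 1} for both rho_1 and rho_2: nine (rho_1, rho_2) pairs
grid = list(product([0.0, 0.5, 1.0], repeat=2))

# Hypothetical P-values of Q_rho, one per grid point
p_by_rho = dict(zip(grid, [0.20, 0.04, 0.01, 0.30, 0.06, 0.02, 0.50, 0.10, 0.03]))

p_combined = cauchy_combine(list(p_by_rho.values()))
```

For extremely small P-values the subtraction 0.5 − arctan(Q)/π loses floating-point precision, so a production implementation would likely need a dedicated small-P-value branch.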
In addition, the MTAR framework reduces to single-trait analysis when we set a single diagonal element of matrix W_{2} to 1 (Fig. 1). We use the Cauchy method to combine these single-trait P-values from SKAT and burden tests and construct the cctP test as
\[Q_{{\mathrm{cctP}}} = \frac{1}{2K}\mathop {\sum}\limits_{k = 1}^K \left[ \tan \{ (0.5 - p_{{\mathrm{skat}},k})\pi \} + \tan \{ (0.5 - p_{{\mathrm{burden}},k})\pi \} \right],\]
where p_{skat,k} and p_{burden,k} are the P-values from the SKAT and burden tests for trait k. The P-value of the cctP test can be approximated by 0.5 − arctan(Q_{cctP})/π.
Finally, the Cauchy method is used to construct the MTAR-O test by combining P-values from cMTAR, iMTAR, and cctP as
\[Q_{{\mathrm{MTAR - O}}} = \frac{1}{3}\left[ \tan \{ (0.5 - p_{{\mathrm{cMTAR}}})\pi \} + \tan \{ (0.5 - p_{{\mathrm{iMTAR}}})\pi \} + \tan \{ (0.5 - p_{{\mathrm{cctP}}})\pi \} \right],\]
where p_{cMTAR}, p_{iMTAR}, and p_{cctP} are the P-values of the cMTAR, iMTAR, and cctP tests. The P-value of the MTAR-O test can be approximated by 0.5 − arctan(Q_{MTAR−O})/π.
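The two-stage construction of cctP and MTAR-O can be sketched as below, again under the assumption of equal weights within each combination. The function names and the example P-values are hypothetical; in practice the inputs would come from SKAT, burden, cMTAR, and iMTAR tests.

```python
import math


def cauchy_combine(p_values):
    """Equal-weight Cauchy combination; P-value approximated by 0.5 - arctan(Q)/pi."""
    q = sum(math.tan((0.5 - p) * math.pi) for p in p_values) / len(p_values)
    return 0.5 - math.atan(q) / math.pi


def cct_p(p_skat, p_burden):
    """Combine the single-trait SKAT and burden P-values across K traits."""
    return cauchy_combine(list(p_skat) + list(p_burden))


def mtar_o(p_cmtar, p_imtar, p_cctp):
    """Omnibus P-value from the cMTAR, iMTAR, and cctP P-values."""
    return cauchy_combine([p_cmtar, p_imtar, p_cctp])


# Hypothetical P-values for one gene and three traits
p_cctp = cct_p(p_skat=[0.04, 0.30, 0.60], p_burden=[0.10, 0.45, 0.70])
p_omnibus = mtar_o(p_cmtar=0.002, p_imtar=0.010, p_cctp=p_cctp)
```

Because the combination is dominated by the smallest inputs, the omnibus P-value stays close to the best of the three component tests without requiring knowledge of their correlation.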
Summary statistics from the GLGC
The summary statistics for the lipid traits were downloaded from http://csg.sph.umich.edu/abecasis/public/lipids2017/. For each trait k, the web portal contains variant-level genetic effect estimates \(\widehat {\mathbf{\beta }}_k\) for a given gene and their standard errors se_{k}. We obtained U_{k} and V_{k} from \(\widehat {\mathbf{\beta }}_k\) and se_{k} as described in the Summary statistics subsection of the Methods. As the original genotypes from the study are not publicly available, we estimated the LD matrix R based on the genotypes of the European population from the NHLBI Exome Sequencing Project (ESP)^{37}. To account for possible sample overlap among traits, we used Eq. (3) to estimate the covariance among summary statistics across traits.
Gene set and tissue enrichment analysis
Gene set enrichment analysis was conducted using the one-sided hypergeometric test against Reactome Pathways and Gene Ontology Biological Processes, as implemented in GENE2FUNC from FUMA^{50}, with the genes tested in MTAR used as the background gene set. Enrichment P-values are adjusted for multiplicity using the Benjamini–Hochberg procedure within each set type tested; sets with adjusted P-value less than 0.05 are reported. Tissue enrichment analysis was conducted using TissueEnrich^{51}, which implements the one-sided hypergeometric test for enrichment of user-defined genes relative to lists of tissue-enriched, tissue-enhanced, and group-enhanced genes. Default settings for the definition of tissue-enriched and enhanced genes from both GTEx and HPA RNA-seq datasets were applied. Enrichment P-values are adjusted for multiplicity using the Benjamini–Hochberg procedure within each reference set (GTEx and HPA).
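The two statistical ingredients named here, the one-sided hypergeometric test and the Benjamini–Hochberg adjustment, can be sketched generically as follows. This is not the FUMA or TissueEnrich implementation; the function names and counts are hypothetical and only illustrate the calculations.

```python
from math import comb


def hypergeom_upper_tail(k, background, in_set, selected):
    """One-sided enrichment P-value P(X >= k).

    background: number of genes in the background set.
    in_set: number of background genes annotated to the gene set.
    selected: number of genes in the query list.
    k: number of query genes that fall in the gene set.
    """
    total = comb(background, selected)
    upper = min(in_set, selected)
    hits = sum(comb(in_set, x) * comb(background - in_set, selected - x)
               for x in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 3 of 4 query genes fall in a 3-gene set out of 10
p_enrich = hypergeom_upper_tail(3, background=10, in_set=3, selected=4)
p_adj = benjamini_hochberg([0.01, 0.04, 0.03])
```

Applying `benjamini_hochberg` separately to each set type or reference set mirrors the "within each" adjustment described above.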
Gene association annotation
We annotated the 41 genes exclusively discovered by MTAR in the GLGC data analysis using two recently developed databases: Open Targets^{52,53} (Supplementary Data 2) and STOPGAP^{54} (Supplementary Data 3). Open Targets and STOPGAP both link genes to a trait or disease via annotation of genomic loci detected in GWAS. For each of the 41 genes, the linked diseases were searched and filtered to the three traits (LDL, HDL, and TG) and to variant-disease associations with P-value < 5 × 10^{−8} in the two databases. In addition, the lipid association results from Supplementary Tables 9 and 12 of the previous GLGC data analysis^{33} were also used to annotate the 41 genes (Supplementary Table 5).
Replication of significant genes in the UK Biobank data
To replicate the associations of 14 genes (11 after removing genes with cumulative minor allele counts less than 10 in the UK Biobank GWAS data) exclusively identified by MTAR tests but without any annotation evidence, we applied the MTAR methods to an independent study with association summary statistics from the UK Biobank GWAS data set. The GWAS summary statistics were released by the Neale Lab with the re-release of UK Biobank genotype imputation (termed imputed-v3). The three related traits LDL direct (mmol/L), HDL direct (mmol/L), and TG (mmol/L) were jointly analyzed in a similar manner as the analysis of GLGC data.
Data simulation
For all simulations, we generated 100 haplotypes of length 1 Mb under a calibrated coalescent model to mimic the LD structure and local recombination rate of the European population^{55}. These haplotypes were used to form the genotypes of 8500 subjects across three cohorts. To simulate the genotypes for a data set, we randomly selected one thousand 3-kb regions in each haplotype and focused on rare variants with MAF < 0.05.
For each subject i, three traits were generated based on a multiresponse regression model
where β_{kj} is the genetic effect for trait k at variant j, G_{ij} is the genotype at variant j, X_{i1} is a binary covariate simulated from Bernoulli(0.5), and X_{i2} is a continuous covariate simulated from a standard normal distribution. The covariance matrix of the error term used here is based on the estimated residual correlations among the lipid traits LDL, HDL, and TG in the ESP data^{37}. The reduced model was used when we needed to generate only one or two traits for subject i.
Computation time
We estimated the computation time of MTAR tests by considering different numbers of variants m = 5, 10, 20, 50, 100 and traits K = 3, 6, or 9 (Supplementary Fig. 11). For each scenario, we generated 50 datasets and reported the average computation time. On average, MTAR-O, cMTAR, and iMTAR took less than 0.11, 0.06, and 0.05 s, respectively, on a 2.4 GHz Intel Core i5 processor (Intel Co., Santa Clara, CA) when applied to a data set with 20 variants and 3 traits. The computation time did not change much in the presence of sample overlap, but it increased to 1, 0.51, and 0.49 s when the number of traits was increased to 9. MTAR scales to genome-wide analysis. Analyzing the GLGC data (15,378 genes) using MTAR-O, cMTAR, and iMTAR took about 25, 10, and 8 h, respectively, on a laptop with a single core. After the computation jobs were distributed to multiple cores by chromosome, the analysis was finished within 2 h.
Web resources
SKAT R package v1.3.2.1: https://cran.r-project.org/web/packages/SKAT
MultiSKAT R package v1.0: https://github.com/diptavo/MultiSKAT
aSPU R package v1.48: https://cran.r-project.org/web/packages/aSPU
FUMA v1.3.5: http://fuma.ctglab.nl
TissueEnrich v1.8.0: https://tissueenrich.gdcb.iastate.edu
Open Targets: https://genetics.opentargets.org
STOPGAP: https://github.com/StatGenPRD/STOPGAP
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
No data were generated in the present study. The GLGC summary statistics are publicly available at http://csg.sph.umich.edu/abecasis/public/lipids2017/. The UK Biobank GWAS summary statistics data (Neale v2) are described at http://www.nealelab.is/uk-biobank and are publicly available at https://www.dropbox.com/s/2msvdv4axfz362b/30780_raw.gwas.imputed_v3.both_sexes.tsv.bgz?dl=0 for LDL direct (mmol/L); https://www.dropbox.com/s/sn30890f64p0htu/30760_raw.gwas.imputed_v3.both_sexes.tsv.bgz?dl=0 for HDL cholesterol (mmol/L); https://www.dropbox.com/s/0tdxu9g7itbct6m/30870_raw.gwas.imputed_v3.both_sexes.tsv.bgz?dl=0 for triglycerides (mmol/L).
Code availability
Our method is implemented in the MTAR R package, freely available at the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/MTAR.
References
Goh, K.-I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685–8690 (2007).
Solovieff, N., Cotsapas, C., Lee, P. H., Purcell, S. M. & Smoller, J. W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 14, 483–495 (2013).
He, Q., Avery, C. L. & Lin, D. Y. A general framework for association tests with multivariate traits in large-scale genomics studies. Genet. Epidemiol. 37, 759–767 (2013).
Kim, J., Bai, Y. & Pan, W. An adaptive association test for multiple phenotypes with GWAS summary statistics. Genet. Epidemiol. 39, 651–663 (2015).
Zhu, X. et al. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am. J. Hum. Genet. 96, 21–36 (2015).
Ray, D. & Boehnke, M. Methods for meta-analysis of multiple traits using GWAS summary statistics. Genet. Epidemiol. 42, 134–145 (2018).
Liu, Z. & Lin, X. Multiple phenotype association tests using summary statistics in genome-wide association studies. Biometrics 74, 165–175 (2018).
Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50, 229–237 (2018).
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).
Dutta, D., Scott, L., Boehnke, M. & Lee, S. Multi-SKAT: general framework to test for rare-variant association with multiple phenotypes. Genet. Epidemiol. 43, 4–23 (2019).
Zhan, X. et al. Powerful genetic association analysis for common or rare variants with high-dimensional structured traits. Genetics 206, 1779–1790 (2017).
Kaakinen, M. et al. MARV: a tool for genome-wide multi-phenotype analysis of rare variants. BMC Bioinformatics 18, 110 (2017).
Lee, S. et al. Rare variant association test with multiple phenotypes. Genet. Epidemiol. 41, 198–209 (2017).
Broadaway, K. A. et al. A statistical approach for testing cross-phenotype effects of rare variants. Am. J. Hum. Genet. 98, 525–540 (2016).
Wu, B. & Pankow, J. S. Sequence kernel association test of multiple continuous phenotypes. Genet. Epidemiol. 40, 91–100 (2016).
Chung, J., Jun, G. R., Dupuis, J. & Farrer, L. A. Comparison of methods for multivariate gene-based association tests for complex diseases using common variants. Eur. J. Hum. Genet. 27, 811–823 (2019).
Kwak, I.-Y. & Pan, W. Gene- and pathway-based association tests for multiple traits with GWAS summary statistics. Bioinformatics 33, 64–71 (2016).
Cichonska, A. et al. metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics 32, 1981–1989 (2016).
Van der Sluis, S. et al. MGAS: a powerful tool for multivariate gene-based genome-wide association analysis. Bioinformatics 31, 1007–1015 (2014).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Li, B. & Leal, S. M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008).
Madsen, B. E. & Browning, S. R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 5, e1000384 (2009).
Price, A. L. et al. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 86, 832–838 (2010).
Lin, D. Y. & Tang, Z.-Z. A general framework for detecting disease associations with rare variants in sequencing studies. Am. J. Hum. Genet. 89, 354–367 (2011).
Lee, S. et al. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet. 91, 224–237 (2012).
Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019).
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
Lu, Q. et al. A powerful approach to estimating annotation-stratified genetic covariance via GWAS summary statistics. Am. J. Hum. Genet. 101, 939–964 (2017).
Liu, Y. & Xie, J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 0, 1–18 (2019).
Liu, Y. et al. ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 104, 410–421 (2019).
Tang, Z.-Z. & Lin, D.-Y. Meta-analysis of sequencing studies with heterogeneous genetic associations. Genet. Epidemiol. 38, 389–401 (2014).
Tang, Z.-Z. & Lin, D.-Y. Meta-analysis for discovering rare-variant associations: statistical methods and software programs. Am. J. Hum. Genet. 97, 35–53 (2015).
Liu, D. J. et al. Exome-wide association study of plasma lipids in >300,000 individuals. Nat. Genet. 49, 1758–1766 (2017).
Segrè, A. V. et al. Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits. PLoS Genet. 6, e1001058 (2010).
Hoffmann, T. J. et al. A large electronic-health-record-based genome-wide study of serum lipids. Nat. Genet. 50, 401–413 (2018).
Haemmerle, G. et al. Defective lipolysis and altered energy metabolism in mice lacking adipose triglyceride lipase. Science 312, 734–737 (2006).
Lin, D.-Y., Zeng, D. & Tang, Z.-Z. Quantitative trait analysis in sequencing studies under trait-dependent sampling. Proc. Natl Acad. Sci. USA 110, 12247–12252 (2013).
Hu, Y.-J. et al. Meta-analysis of gene-level associations for rare variants based on single-variant statistics. Am. J. Hum. Genet. 93, 236–248 (2013).
Lin, D. Y. & Sullivan, P. F. Meta-analysis of genome-wide association studies with overlapping subjects. Am. J. Hum. Genet. 85, 862–872 (2009).
Burton, P. R. et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
Li, Y. R. et al. Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases. Nat. Med. 21, 1018–1027 (2015).
Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
Chen, H. et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 98, 653–666 (2016).
Chen, H. et al. Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. Am. J. Hum. Genet. 104, 260–274 (2019).
Morgenthaler, S. & Thilly, W. G. A strategy to discover genes that carry multiallelic or monoallelic risk for common diseases: a cohort allelic sums test (CAST). Mutat. Res. 615, 28–56 (2007).
Morris, A. P. & Zeggini, E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet. Epidemiol. 34, 188–193 (2010).
LeBlanc, M. et al. A correction for sample overlap in genome-wide association studies in a polygenic pleiotropy-informed framework. BMC Genomics 19, 494 (2018).
Zhang, D. & Lin, X. Hypothesis testing in semiparametric additive mixed models. Biostatistics 4, 57–74 (2003).
Davies, R. The distribution of a linear combination of χ^{2} random variables. J. R. Stat. Soc. Ser. C. 29, 323–333 (1980).
Watanabe, K., Taskesen, E., Van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).
Jain, A. & Tuteja, G. TissueEnrich: tissue-specific gene enrichment analysis. Bioinformatics 35, 1966–1967 (2018).
Koscielny, G. et al. Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Res. 45, D985–D994 (2016).
Carvalho-Silva, D. et al. Open Targets Platform: new developments and updates two years on. Nucleic Acids Res. 47, D1056–D1065 (2018).
Shen, J., Song, K., Slater, A. J., Ferrero, E. & Nelson, M. R. STOPGAP: a database for systematic target opportunity assessment by genetic association predictions. Bioinformatics 33, 2784–2786 (2017).
Schaffner, S. F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583 (2005).
Acknowledgements
This work was supported by the Data Science Initiative Award provided by the University of Wisconsin-Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. We thank Dr D.J. Liu for providing information on the GLGC data.
Author information
Contributions
Z.Z.T. and J.S. oversaw the study. The theory underlying MTAR was conceived of and developed by Z.Z.T., with contributions from L.L., J.S., and H.Z. L.L. developed MTAR software and performed lipid data analyses. J.S. and A.C. conducted gene annotation and result interpretation. L.L., J.S., and H.Z. performed the simulation studies. Z.Z.T. and L.L. wrote the first version of the manuscript. J.S., H.Z., A.C., and D.V.M. also contributed to the writing. All authors provided input and revisions for the final manuscript.
Ethics declarations
Competing interests
J.S., H.Z., A.C., and D.V.M. are employees at Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA. The remaining authors declare no competing interests.
Additional information
Peer review information: Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer review reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Luo, L., Shen, J., Zhang, H. et al. Multi-trait analysis of rare-variant association summary statistics using MTAR. Nat Commun 11, 2850 (2020). https://doi.org/10.1038/s41467-020-16591-0