Exploiting family history in aggregation unit-based genetic association tests

Wang, Yanbing; Chen, Han; Peloso, Gina M.; DeStefano, Anita L.; Dupuis, Josée

doi:10.1038/s41431-021-00980-0

Article
Published: 25 October 2021

European Journal of Human Genetics volume 30, pages 1355–1362 (2022)

Abstract

The development of sequencing technology calls for new powerful methods to detect disease associations and lower the cost of sequencing studies. Family history (FH) contains information on the disease status of relatives, adding valuable information about the probands’ health problems and risk of diseases. Incorporating data from FH is a cost-effective way to improve statistical evidence in genetic studies and, moreover, overcomes limitations in study designs with insufficient cases or missing genotype information for association analysis. We proposed the family history aggregation unit-based test (FHAT) and the optimal FHAT (FHAT-O) to exploit available FH for rare variant association analysis. Moreover, we extended the liability threshold model of case–control status and FH (LT-FH) method to aggregation unit-based methods and compared it with FHAT and FHAT-O. The computational efficiency and flexibility of FHAT and FHAT-O were demonstrated through both simulations and applications. We showed that the FHAT, FHAT-O, and LT-FH methods offer reasonable control of the type I error unless the case/control ratio is unbalanced, in which case they result in smaller inflation than that observed with conventional methods excluding FH. We also demonstrated that FHAT and FHAT-O are more powerful than LT-FH and conventional methods in many scenarios. By applying FHAT and FHAT-O to the analysis of all cause dementia and hypertension using the exome sequencing data from the UK Biobank, we showed that our methods can improve significance for known regions. Furthermore, we replicated the previous associations in all cause dementia and hypertension and detected novel regions through the exome-wide analysis.

Introduction

Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex diseases at the genome-wide significance level (p < 5 × 10^–8). Most of the variants identified by GWAS are common variants with minor allele frequency (MAF) ≥ 1%, and most of these variants display modest effect sizes and can only explain a small portion of the total heritability of complex diseases. Yet, rare variants (MAF < 1%) are important to uncovering unexplained heritability and discovering novel genes contributing to complex diseases [1,2,3]. Because standard association approaches testing each variant individually are grossly underpowered for rare variants, aggregation unit-based methods that jointly analyze variants have been proposed to improve power to detect rare variant associations. Aggregation unit-based approaches include, among others, the sequence kernel association test (SKAT) [4], Burden tests [5,6,7], SKAT-O [8], and aggregated Cauchy association test (ACAT) [9]. However, power of these methods to identify disease regions can be limited by insufficient number of cases in unascertained cohorts.

In genetic association studies, family history (FH) of disease in relatives is often collected in large population cohorts. FH provides an overview of a phenotype within families. Such information typically includes phenotypes of un-genotyped parents or more distant relatives of probands. FH is related to the genotypes of probands at disease loci based on the Mendelian laws of transmission, and is important in assessing health problems and risk of diseases [10,11,12]. While collecting cases is expensive, incorporating FH information into standard case–control genetic association analyses is a cost-effective way to potentially increase statistical power [11, 13,14,15]. Many study designs have limitations for genetic research of late-onset diseases such as Alzheimer’s disease (AD), because disease cases may be deceased with unavailable genotype data. The standard statistical association tests in younger cohorts with low prevalence of some late-onset diseases are not powerful to identify genetic regions associated with a trait of interest. In contrast, the incorporation of available information of disease status in the form of FH may increase the sample size in cohorts with limited cases or individuals with unavailable genotypes. Genetic association studies using only cases and controls will greatly benefit by incorporating available FH information to detect associations.

FH cannot be directly incorporated in standard genetic association methods, limiting its use in genetic association testing. FH has been included as a covariate to improve disease prediction [16], or used to infer mode of inheritance to construct statistical tests [17]. However, only a few reported methods allow FH to be exploited in genetic association analysis to improve statistical power to detect disease loci. The method developed by Ghosh et al. [13] enables the incorporation of FH as a phenotype into the standard single variant analysis, and the results confirmed that exploiting the information contained in FH substantially boosts power to detect individual variants at disease loci. Nevertheless, these single variant tests suffer from loss of power to detect rare variant associations. While numerous aggregation unit-based methods that jointly analyze rare variants have been proposed to improve power to detect rare variant associations, aggregation unit-based methods that can directly incorporate FH information are still needed.

We developed a new and powerful method, the family history aggregation unit-based test (FHAT), that enables the incorporation of FH to enhance the statistical power for rare variant associations. We also developed an optimal unified test, FHAT-O, to maintain robust power in complex scenarios regardless of the directions of genetic effects or the proportion of causal variants. To allow a comparison with the recently developed liability threshold model of case–control status and FH (LT-FH) [11], we proposed a novel way to incorporate LT-FH into aggregation unit-based methods for rare variant analysis. We performed an extensive simulation study to evaluate the type I error and power of FHAT and FHAT-O under various scenarios, and illustrated the methods using whole exome sequencing data from the UK Biobank.

Material and methods

Family history aggregation unit-based test (FHAT)

We propose a novel approach, FHAT, to incorporate FH information in aggregation unit-based tests. We assume that there are n probands with m observed variants included in the aggregation unit-based test. When we have FH on a relative of each proband, let $Y_i^P$ denote the phenotype of the ith proband, $Y_i^R$ the phenotype of the relative of the ith proband, $G_i^P$ the genotypes of the ith proband, $X_i^P$ the covariates for the ith proband, and $X_i^R$ the covariates of the relative of the ith proband, such as age and ancestral principal components (PCs) that account for population structure. The probability of observing ($Y_i^P$,$Y_i^R$) conditional on $G_i^P$ can be written as follows (see details in the Supplementary Method):

$$P\left( {Y_i^P,Y_i^R|G_i^P,X_i^P,X_i^R} \right) = P\left( {Y_i^P|G_i^P,X_i^P} \right)P\left( {Y_i^R|G_i^P,Y_i^P,X_i^R} \right)$$

(1)

Therefore, the evidence for association can be assessed from two separate analyses for probands and relatives. We assume an additive model and code the genotypes in $G_i^P$ as the number of minor alleles. One can also use dominant or recessive models by coding the variants appropriately. Based on $P\left( {Y_i^P|G_i^P} \right)$, we first assess the association between probands’ genotypes and their disease status using

$$g\left( {E\left( {Y_i^P|G_i^P,X_i^P} \right)} \right) = X_i^P\alpha _P + G_i^P\beta _P$$

(2)

where g(∙) is the link function, α_P is a vector of regression coefficients for covariate effects, β_P is a vector of regression coefficients for the observed genotypes in probands. The model for relatives based on $P\left( {Y_i^R|G_i^P,Y_i^P} \right)$ is specified as follows:

$$g\left( {E\left( {Y_i^R|G_i^P,Y_i^P,X_i^R} \right)} \right) = X_i^R\alpha _R + G_i^P\beta _R + Y_i^P\lambda _R$$

(3)

where λ_R is a scalar regression coefficient for the probands’ phenotypes in the relatives’ model; α_R is a vector of regression coefficients for the relatives’ covariates; β_R is a vector of regression coefficients for the m observed variants in probands. Inclusion of $Y_i^P$ is necessary to make the analyses of probands and relatives independent based on (1). The relatives’ model (3) can analyze FH from unrelated relatives, i.e., a single relative per proband, or FH from both parents, since mothers and fathers are conditionally independent. We observe that the two underlying association estimators, ($\hat \beta _P,\hat \beta _R$), have the relationship [18] $\hat \beta _R \approx 2\Omega \hat \beta _P$, where Ω is the kinship coefficient between probands and their relatives and ${{\Omega }} = \frac{1}{4}$ for first-degree relatives such as parents.

Conventional aggregation unit-based methods evaluate the association between a set of variants and phenotype among probands. One such aggregation unit-based method is called the SKAT [4]. The weighted score statistic based on the probands’ model (2) is

$$Q_{\mathrm{SKAT}} = \frac{\left( Y^P - \hat\mu_P \right)^T G^P W W G^{P^T} \left( Y^P - \hat\mu_P \right)}{\hat\phi_P^2}$$

where W = diag(w₁,w₂,…,w_m) is a pre-specified weight matrix for m variants; G^P is an n × m genotype matrix with (i, j)th element corresponding to the additively coded genotype for variant j of proband i; $\hat \mu _P$ is the estimated mean of Y^P using the null model with only covariates; $\hat \phi _P$ is the estimate of the dispersion parameter in the generalized linear model, which is related to the variance of the distribution under H₀: for binary outcomes it is fixed to 1, and for continuous outcomes it is the variance of the random errors. A score statistic that evaluates whether genetic variants are associated with disease status can be obtained similarly from the relatives’ model (3), with the relatives’ phenotypes replacing the probands’ phenotypes. The pre-specified weights can be a function of MAF. For example, one can use Wu’s weights [4] w_j = Beta(MAF_j;1,25) to up-weight the effect of rarer variants.
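The SKAT statistic and the Beta(1, 25) weights described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names `wu_weights` and `skat_q`, the toy genotypes, and the intercept-only null model are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import beta

def wu_weights(maf, a=1.0, b=25.0):
    """Beta(MAF; 1, 25) density weights, which up-weight rarer variants."""
    return beta.pdf(maf, a, b)

def skat_q(resid, G, weights, phi=1.0):
    """Q_SKAT = (Y - mu)^T G W W G^T (Y - mu) / phi^2 (hypothetical helper)."""
    score = G.T @ resid                      # per-variant scores, length m
    return float(np.sum((weights * score) ** 2)) / phi**2

# Toy data (assumed for illustration): 200 probands, 5 rare variants.
rng = np.random.default_rng(0)
n, m = 200, 5
maf = np.array([0.001, 0.002, 0.005, 0.008, 0.009])
G = rng.binomial(2, maf, size=(n, m)).astype(float)   # additive coding
y = rng.binomial(1, 0.2, size=n).astype(float)        # binary phenotype
mu_hat = np.full(n, y.mean())     # fitted mean of an intercept-only null model
w = wu_weights(maf)
q = skat_q(y - mu_hat, G, w)      # dispersion fixed to 1 for a binary trait
```

Because W is diagonal, the quadratic form reduces to a sum of squared weighted scores, which avoids forming any n × n matrix.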

We propose to combine the score statistics from the two association models for probands and their relatives using a weighted meta-analysis to increase the flexibility of incorporating relatives with different degrees of relatedness (thus different kinship coefficients), as well as different numbers of available relatives for each proband. Meta-analysis is often used in genetic association analysis to increase the power by combining results from multiple studies. Methods to meta-analyze SKAT results have been developed [19]. Such meta-analysis of rare variant association tests is based on study-specific summary statistics, that is, score statistics for each variant and linkage disequilibrium estimates in a region. Because of the genetic relationship between probands and their relatives, we weight the scores for relatives by 2Ω when combining the score statistics in a meta-analysis, assuming homogeneous genetic effects among probands and their relatives. Specifically, because relative k of each proband may or may not have phenotype data available, we use $Y^{R_k}$ to denote the collective phenotype vector for relative k of all probands (e.g., all mothers), including missing values, with kinship coefficient Ω_k. The diagonal matrix D(R_k) indicates whether the corresponding element in $Y^{R_k}$ for each proband is missing (denoted by 0) or not (denoted by 1). Therefore, relatives with missing phenotype data do not contribute to the test statistic. We fit a single relatives’ model jointly using all relatives’ phenotypes and covariates conditional on their probands’ phenotypes to get $\hat \mu _{R_k}$, the estimated mean vector of $Y^{R_k}$ for relative k of all probands, as well as the dispersion parameter estimate $\hat \phi _R$ under the null hypothesis of no genetic effects. We assume that all relatives are independent in the model. The general form of the FHAT statistic that incorporates FH from relatives is

$$Q_{\mathrm{FHAT}} = \left[ \frac{\left( Y^P - \hat\mu_P \right)^T}{\hat\phi_P} + \sum_k \frac{2\Omega_k D\left( R_k \right)\left( Y^{R_k} - \hat\mu_{R_k} \right)^T}{\hat\phi_R} \right] G^P W W G^{P^T} \left[ \frac{\left( Y^P - \hat\mu_P \right)}{\hat\phi_P} + \sum_k \frac{2\Omega_k D\left( R_k \right)\left( Y^{R_k} - \hat\mu_{R_k} \right)}{\hat\phi_R} \right]$$

(4)
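As a rough illustration of Eq. (4), the sketch below adds the masked, kinship-weighted residuals of each relative type to the proband residuals and then forms the same quadratic form as SKAT. The function name `fhat_q`, the tuple layout for relatives, and the random placeholder residuals are assumptions for the example; in practice the residuals would come from the fitted null models (2) and (3).

```python
import numpy as np

def fhat_q(resid_p, G, weights, relatives, phi_p=1.0, phi_r=1.0):
    """Sketch of Eq. (4) (hypothetical helper, not the authors' code).

    resid_p   : proband residuals Y^P - mu_P, length n
    G         : n x m additively coded proband genotypes
    weights   : length-m variant weights (diagonal of W)
    relatives : list of (resid_k, observed_k, omega_k), one per relative
                type k; observed_k is the 0/1 diagonal of D(R_k)
    """
    combined = resid_p / phi_p
    for resid_k, observed_k, omega_k in relatives:
        # D(R_k): relatives with missing phenotypes contribute nothing
        masked = np.where(observed_k, resid_k, 0.0)
        combined = combined + 2.0 * omega_k * masked / phi_r
    score = G.T @ combined
    return float(np.sum((weights * score) ** 2))

# Toy data with placeholder residuals (assumed for illustration).
rng = np.random.default_rng(1)
n, m = 6, 3
G = rng.integers(0, 3, size=(n, m)).astype(float)
w = np.ones(m)
r_p = rng.normal(size=n)
r_mother = rng.normal(size=n)
r_father = rng.normal(size=n)
r_father[2] = np.nan                      # one father has no phenotype
obs_m = np.ones(n, dtype=bool)
obs_f = ~np.isnan(r_father)
q = fhat_q(r_p, G, w, [(r_mother, obs_m, 0.25), (r_father, obs_f, 0.25)])
```

With an empty `relatives` list the function returns the proband-only SKAT statistic, which makes the role of FH as an additive contribution to the score explicit.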

Under the null hypothesis, Q_FHAT follows a weighted sum of χ² distributions with 1 degree of freedom, $Q_{\mathrm{FHAT}} \sim \sum_{j = 1}^m \lambda_j \chi_{1,j}^2$. The weights λ_j can be estimated from the eigenvalues of $WG^{P^T}\left( \hat P + \sum_k 4\Omega_k^2 D\left( R_k \right)\hat P_{R_k}D\left( R_k \right) \right)G^PW$, where $\hat P$ and $\hat P_{R_k}$ are the projection matrices in probands and relatives k, respectively (see the Supplementary Method). The p value can be estimated by Davies’ method [20]. The general form can be reduced to

$$Q_{\mathrm{FHAT}} = \left[ \left( Y^P - \hat\mu_P \right)^T + \frac{D\left( R_m \right)\left( Y^{R_m} - \hat\mu_{R_m} \right)^T}{2} + \frac{D\left( R_f \right)\left( Y^{R_f} - \hat\mu_{R_f} \right)^T}{2} \right] G^P W W G^{P^T} \left[ \left( Y^P - \hat\mu_P \right) + \frac{D\left( R_m \right)\left( Y^{R_m} - \hat\mu_{R_m} \right)}{2} + \frac{D\left( R_f \right)\left( Y^{R_f} - \hat\mu_{R_f} \right)}{2} \right]$$

(5)

for incorporating FH from both parents (with mothers denoted by m and fathers denoted by f) when using logistic regression models for a binary trait with the estimates of the dispersion parameters fixed to 1 (i.e., $\hat \phi _P = \hat \phi _R = 1$), and the kinship coefficients (Ω_m, Ω_f) fixed to $\frac{1}{4}$.
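The null distribution described above can be sketched numerically for the both-parents case. Two assumptions are made loudly here. First, the projection matrix is taken to have the usual logistic-regression form P = V − VX(XᵀVX)⁻¹XᵀV with V = diag(μ(1 − μ)); the paper defers the exact definition to the Supplementary Method. Second, the p value below uses a Satterthwaite moment-matching approximation as a simple stand-in, not Davies’ method, which the paper uses. All function names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def glm_projection(X, mu):
    """Assumed form: P = V - V X (X^T V X)^{-1} X^T V, V = diag(mu(1-mu))."""
    v = mu * (1.0 - mu)
    VX = X * v[:, None]
    return np.diag(v) - VX @ np.linalg.solve(X.T @ VX, VX.T)

def mixture_weights(G, weights, P_p, relatives):
    """Eigenvalues of W G^T (P_p + sum_k 4 Omega_k^2 D_k P_k D_k) G W."""
    middle = P_p.copy()
    for P_k, observed_k, omega_k in relatives:
        d = observed_k.astype(float)          # diagonal of D(R_k)
        middle += 4.0 * omega_k**2 * (d[:, None] * P_k * d[None, :])
    GW = G * weights[None, :]
    lam = np.linalg.eigvalsh(GW.T @ middle @ GW)
    return lam[lam > 1e-10]                   # keep positive eigenvalues

def satterthwaite_pvalue(q, lam):
    """Match two moments of sum_j lam_j chi2_1 with a scaled chi-square.

    This is an approximation swapped in for Davies' method.
    """
    scale = np.sum(lam**2) / np.sum(lam)
    dof = np.sum(lam) ** 2 / np.sum(lam**2)
    return float(chi2.sf(q / scale, dof))

# Toy example: intercept-only null models, both parents fully observed.
rng = np.random.default_rng(2)
n, m = 50, 4
G = rng.binomial(2, 0.1, size=(n, m)).astype(float)
w = np.ones(m)
X = np.ones((n, 1))
P_p = glm_projection(X, np.full(n, 0.2))      # probands, fitted mean 0.2
P_par = glm_projection(X, np.full(n, 0.3))    # parents, fitted mean 0.3
obs = np.ones(n, dtype=bool)
lam = mixture_weights(G, w, P_p, [(P_par, obs, 0.25), (P_par, obs, 0.25)])
p = satterthwaite_pvalue(5.0, lam)            # 5.0 is an arbitrary Q value
```

With Ω = 1/4 the factor 4Ω² equals 1/4, so each parent contributes a quarter of its projection matrix to the covariance of the combined score. For the small significance levels used in exome-wide testing, an exact method such as Davies’ is preferable to moment matching.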

Optimal FHAT (FHAT-O)

Using the same framework adopted in FHAT, we develop an FHAT-O statistic based on the optimal unified test SKAT-O [8]. Since SKAT-O combines the features of SKAT and Burden tests, its power is robust whether the causal variants have effects in the same direction or in different directions.

We first develop an FHAT-Burden statistic, which is a weighted sum of the weighted score statistics in probands and relatives based on their relationships (Supplementary Method). We then propose a unified test defined as the weighted average of FHAT and FHAT-Burden:

$$Q_\rho = \left( 1 - \rho \right)Q_{\text{FHAT}} + \rho Q_{\text{FHAT-Burden}}$$

where the weight ρ can be estimated to minimize the p value using the procedure proposed by Lee et al. [21]. When ρ = 1, Q_ρ reduces to FHAT-Burden, and when ρ = 0, Q_ρ is equivalent to FHAT. The statistic for optimal test FHAT-O that combines the features of FHAT and FHAT-Burden is determined as follows:

$$Q_{\text{FHAT-O}} = \min_{0 \le \rho \le 1} P_\rho$$

(6)

where P_ρ is the p value estimated for each given ρ (more details are in the Supplementary Method).
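The grid search behind Eq. (6) can be sketched as follows. Several things are assumed for illustration: the FHAT-Burden statistic is taken, by analogy with SKAT-O, to be the squared sum of the weighted combined scores (the paper defines it in the Supplementary Method); `u` is the vector of weighted combined scores with null covariance `cov_u`; and each P_ρ is approximated by Satterthwaite moment matching rather than the procedure of Lee et al. [21]. All names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def q_rho(u, rho):
    """(1 - rho) * sum(u_j^2) + rho * (sum_j u_j)^2 for weighted scores u."""
    return (1.0 - rho) * np.sum(u**2) + rho * np.sum(u) ** 2

def p_rho(u, cov_u, rho):
    """Approximate p value of Q_rho = u^T R_rho u, where
    R_rho = (1 - rho) I + rho 1 1^T and u ~ N(0, cov_u) under the null."""
    m = len(u)
    R = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
    lam = np.real(np.linalg.eigvals(cov_u @ R))   # mixture weights
    lam = lam[lam > 1e-10]
    scale = np.sum(lam**2) / np.sum(lam)          # Satterthwaite matching
    dof = np.sum(lam) ** 2 / np.sum(lam**2)
    return float(chi2.sf(q_rho(u, rho) / scale, dof))

def fhat_o_statistic(u, cov_u, grid=np.linspace(0.0, 1.0, 11)):
    """Return (min over the grid of P_rho, the rho attaining it)."""
    pvals = np.array([p_rho(u, cov_u, r) for r in grid])
    best = int(np.argmin(pvals))
    return float(pvals[best]), float(grid[best])

# Toy scores that all point in the same direction (assumed values).
u = np.array([1.5, 2.0, 1.0])
cov_u = np.eye(3)
stat, rho_best = fhat_o_statistic(u, cov_u)
```

The minimum P_ρ is a test statistic, not a p value: because it is the smallest of several correlated p values, its own significance has to be evaluated against the distribution of that minimum, which is what the procedure referenced in the text provides.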

Simulation analysis

Simulations were performed to evaluate the FHAT and FHAT-O statistics in terms of empirical type I error and statistical power. We generated 10,000 haplotypes for a 4 kb region on chromosome 19 using HapGen2 software [22]. The data from the 1000 Genomes Project were used as the reference panel to simulate haplotypes. In all simulations, we focused on binary traits, because they are more often collected through questionnaires in relatives, and on rare variants with MAF < 1%. We used the definition from Chen et al. [23] to calculate the genetic effect size. We simulated probands with both genotypes and phenotypes, and available FH data from both parents. We used the LT-FH phenotype in SKAT (SKAT-LTFH) and SKAT-O (SKATO-LTFH) and compared the results to FHAT and FHAT-O; all four methods incorporated the FH from relatives (i.e., mothers and fathers) into the analysis. The standard methods (SKAT, SKAT-O, Burden test, and ACAT-V) only used proband data in the analysis. Because mothers and fathers were simulated as independent samples, they were analyzed using a single relatives’ model (3), and the FHAT and FHAT-O statistics were then calculated using (5) and (6). The type I error and power of FHAT and FHAT-O were compared to SKAT-LTFH, SKATO-LTFH, SKAT, SKAT-O, Burden test, and ACAT-V. Note that ACAT-V is an aggregation unit-based test combining variant-level p values using ACAT. The detailed description of the type I error and power simulations can be found in the Supplementary Method.
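For readers unfamiliar with ACAT, the core Cauchy combination of variant-level p values mentioned above can be sketched as below. This shows only the combination rule; the full ACAT-V procedure [9] has additional details (such as variant weighting and the handling of very rare variants) that are not reproduced here. The function name `acat` and the equal default weights are assumptions.

```python
import numpy as np

def acat(pvals, weights=None):
    """Cauchy combination: T = sum_i w_i tan((0.5 - p_i) * pi),
    with weights normalized to sum to 1; p = 0.5 - arctan(T) / pi."""
    pvals = np.asarray(pvals, dtype=float)
    if weights is None:
        weights = np.ones_like(pvals)       # equal weights (assumption)
    weights = np.asarray(weights, dtype=float)
    weights = weights / np.sum(weights)
    t = np.sum(weights * np.tan((0.5 - pvals) * np.pi))
    return float(0.5 - np.arctan(t) / np.pi)

# One small p value dominates the combined result.
combined = acat([0.001, 0.4, 0.7, 0.9])
```

Because the tangent transform grows very quickly as a p value approaches 0, the combined result is driven by the few smallest p values, which is consistent with ACAT-V performing best when only a small proportion of variants are causal.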

Analysis of whole exome sequencing data in the UK Biobank

The UK Biobank is a large prospective cohort study with information on clinical traits, covariates, and genome-wide genotype data for over 500,000 individuals with age at assessment between 37 and 73 years at baseline (2006–2010). The second tranche of exome sequence data of approximately 4 million coding variants for 200,000 individuals has been recently completed in the UK Biobank. FH of all cause dementia and hypertension was collected from questionnaires. Rare variant (with MAF < 1%) gene-based analyses detailed in the Supplementary Method were conducted to analyze all cause dementia and hypertension in the UK Biobank data.

Results

Type I error and power

A total of 20 million simulation replicates were first generated to evaluate type I error at various alpha levels for FHAT, FHAT-O, SKAT, SKAT-O, Burden test, and ACAT-V using 5000 probands with available parental history (Table 1). SKAT and SKAT-O have inflated type I error for prevalence = 20%, while the type I error is better controlled by FHAT and FHAT-O. When the disease prevalence is low (i.e., 10%), FHAT and FHAT-O have inflated type I error, especially at exome-wide significance (alpha = 2.5 × 10^–6), but because they incorporate additional cases from relatives, the inflation is smaller than that observed with SKAT and SKAT-O. A slightly deflated type I error was observed in FHAT and SKAT for prevalence = 50%. The conservativeness of SKAT when the prevalence is 50% was also observed in prior publications [4, 8]. Burden test and ACAT-V control the type I error relatively better in some scenarios shown in Table 1. Comparing the type I error of the methods shown in Table 1 to that of SKAT-LTFH and SKATO-LTFH (Supplementary Table S2), FHAT and FHAT-O yield type I error results similar to those of SKAT-LTFH and SKATO-LTFH, respectively. To reduce the computational cost, the type I error for the LTFH methods was evaluated only at alpha levels down to 2.5 × 10^–5.
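The quantity reported in Table 1, the empirical type I error rate divided by the significance level, can be computed from null-simulation p values as in the minimal sketch below. The function name and the uniform p values used as a stand-in for simulated null results are assumptions.

```python
import numpy as np

def type1_ratio(null_pvalues, alpha):
    """Empirical type I error rate divided by the nominal level alpha.

    Values near 1 indicate good calibration, values above 1 inflation,
    and values below 1 a conservative test.
    """
    p = np.asarray(null_pvalues, dtype=float)
    return float(np.mean(p < alpha) / alpha)

# Stand-in for null replicates: perfectly calibrated (uniform) p values.
rng = np.random.default_rng(3)
null_p = rng.uniform(size=1_000_000)
ratio = type1_ratio(null_p, alpha=0.01)
```

The number of replicates limits which alpha levels can be assessed: with 20 million replicates, a calibrated test is expected to produce only 20,000,000 × 2.5 × 10^–6 = 50 rejections at the exome-wide level, so ratios at that level carry noticeable sampling noise.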

Table 1 The empirical type I error rate divided by the significance level.


Figure 1 summarizes the power simulation results of FHAT, SKAT-LTFH, SKAT, FHAT-O, SKATO-LTFH, SKAT-O, Burden test, and ACAT-V for disease prevalence = 20% at alpha = 2.5 × 10^–6. Additional power results for prevalence = 50% and other alpha levels can be found in Supplementary Figs. S1–S3. The causal variants in a region were set either to all have positive effects, or to have positive effects for half of the causal variants and negative effects for the other half. In all scenarios, similar patterns are shown in Fig. 1 and Supplementary Fig. S1. Our main findings included: (1) FHAT and FHAT-O are more powerful than SKAT-LTFH and SKATO-LTFH, respectively, under many scenarios when the variants have larger effects on the disease among older people; (2) FHAT and FHAT-O have greatly improved power compared to standard methods that do not incorporate FH in most scenarios, except for the scenario in which the proportion of causal variants is 10% and half of the causal variants have positive effects and half have negative effects. However, ACAT-V has substantial power loss in many other scenarios; (3) FHAT suffers from a loss of power when the proportion of causal variants is high and the causal variants have effects in the same direction. In contrast, FHAT-O outperforms FHAT in those scenarios, and remains powerful regardless of the directions of genetic effects or the number of causal variants.

**Fig. 1: Empirical power of FHAT, FHAT-O, SKAT-LTFH, SKATO-LTFH, SKAT, SKAT-O, Burden test, and ACAT-V at exome-wide significance for prevalence = 20%.**

Computational cost

FHAT, FHAT-O, and other existing methods (SKAT, SKAT-O, Burden test, and ACAT-V) have lower computational cost than SKAT-LTFH and SKATO-LTFH. Table 2 summarizes the computation time (in minutes) of all methods for analyzing 1000 regions that each contain 30 variants. The computation time of FHAT, FHAT-O, SKAT, SKAT-O, Burden test, and ACAT-V depends on sample size and region size, whereas the running time for SKAT-LTFH and SKATO-LTFH (conducted using the LT-FH software v2 [11]) depends on the number of configurations of probands’ disease status and FH.

Table 2 Computational time for testing 1000 regions.


Application to the UK Biobank

We restricted the analysis to 129,670 white individuals who passed all filters and have exome sequencing data, phenotype, and available parental disease status (see details in the Supplementary Method). The age of probands at the first assessment visit is between 38 and 72, with the ages of their mothers between 60 and 105 and the ages of their fathers between 60 and 102. There are 27 dementia cases (p = 0.02%) and 32,773 hypertension cases (p = 25.3%) among probands. While the hypertension prevalence in mothers and fathers is similar to that in probands (37,145 hypertension cases in mothers, p = 28.6%; 26,063 hypertension cases in fathers, p = 20.1%), many more dementia cases are observed in the parents (10,654 dementia cases in mothers, p = 8.2%; 5720 dementia cases in fathers, p = 4.4%) than in probands.

We first evaluated the associations of all cause dementia and hypertension with known regions previously implicated in AD/dementia risk [15] and hypertension [24,25,26,27]. We performed the analysis for all unrelated white individuals using FHAT, FHAT-O, SKAT-LTFH, SKATO-LTFH, and other conventional tests (SKAT, SKAT-O, Burden test, and ACAT-V); the results are shown in Table 3. The samples involved in the analyses varied because of missing values in the covariates used for adjustment in the models. For the majority of genes, FHAT, SKAT-LTFH, FHAT-O, and SKATO-LTFH had improved significance after incorporating parental phenotype information compared to the p values calculated using the conventional tests. SKAT, SKAT-O, and ACAT-V had almost no power to detect some associations for all cause dementia due to the low prevalence in probands. The results show that BCL3 (p = 6.8 × 10^–5 in FHAT, p = 2.5 × 10^–5 in SKAT-LTFH, p = 5.9 × 10^–5 in FHAT-O, p = 1.8 × 10^–5 in SKATO-LTFH) and TOMM40 (p = 3.0 × 10^–4 in FHAT, p = 5.8 × 10^–4 in SKAT-LTFH, p = 3.8 × 10^–4 in FHAT-O, p = 7.7 × 10^–4 in SKATO-LTFH) were significantly associated with all cause dementia status at a significance level of 6.3 × 10^–3 for testing eight genes. At the same significance level, DBH (p = 1.3 × 10^–3 in FHAT, p = 2.0 × 10^–3 in SKAT-LTFH, p = 2.6 × 10^–3 in FHAT-O, p = 3.3 × 10^–3 in SKATO-LTFH) was identified for hypertension, with improved significance compared to the results from conventional methods. Although the tests that incorporate FH demonstrated improved significance for all eight AD/dementia genes we tested, some p values for hypertension genes were less significant. This may be because the prevalence of hypertension in probands was similar to that in parents, so that the associations were diluted by the noise added when combining the FH from parents.

Table 3 Association analysis for genes previously implicated in all cause dementia and hypertension susceptibility.


A comprehensive exome-wide analysis was then conducted. A total of ~18K genes with two or more rare genetic variants meeting our filtering criteria were included. We used models including the same covariates for all cause dementia and hypertension as we did in the known gene analyses. We used p < 5.6 × 10^–5 as the suggestive significance threshold for testing ~18K genes. In the analysis of all cause dementia (Table 4 and Fig. 2), the gene TREM2 [28] (p = 4.1 × 10^–9), with known effects on AD/dementia and late-onset AD, achieved strict exome-wide significance (p < 2.8 × 10^–6) using FHAT-O, and it was also detected by FHAT (p = 5.2 × 10^–6) at suggestive exome-wide significance. One known AD/dementia gene, PVR [29] (p = 1.2 × 10^–5 in FHAT and p = 1.8 × 10^–5 in FHAT-O), was identified with both FHAT and FHAT-O, and ABCA7 [30] (p = 4.1 × 10^–5), with known effects on AD/dementia, was identified by FHAT-O. Moreover, three novel genes were found to be associated with all cause dementia at the suggestive threshold using FHAT and FHAT-O (EFCAB3 with p = 4.0 × 10^–5 in FHAT and p = 4.2 × 10^–5 in FHAT-O, EMSY with p = 4.4 × 10^–5 in FHAT and p = 2.7 × 10^–5 in FHAT-O, and KLC3 with p = 1.4 × 10^–5 in FHAT-O). Because we observed highly inflated results (Fig. 2) from the hypertension analysis due to the correlation between parents’ phenotypes, we corrected the analysis by additionally adjusting for the spouse’s hypertension status in the parents’ model. For the adjusted hypertension analysis (Table 4 and Fig. 2), FHAT identified GATA5 (p = 4.1 × 10^–5), and FHAT-O identified FGD5 (p = 4.3 × 10^–5) and DDN (p = 4.2 × 10^–5) at a suggestive significance level. These genes have previously been reported to be associated with hypertension-related traits [31,32,33].
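The significance thresholds quoted in the known-gene and exome-wide analyses are consistent with simple multiple-testing arithmetic, sketched below. The Bonferroni reading of the 6.3 × 10^–3 and 2.8 × 10^–6 levels, and the reading of the suggestive level as one expected false positive among ~18,000 tests, are assumptions: the text states the thresholds but not their derivation.

```python
def bonferroni(alpha, n_tests):
    """Per-test significance level for a family-wise error rate alpha."""
    return alpha / n_tests

known_gene = bonferroni(0.05, 8)        # 0.00625, quoted as 6.3e-3
exome_wide = bonferroni(0.05, 18_000)   # about 2.8e-6
suggestive = 1 / 18_000                 # about 5.6e-5 (assumed derivation)

for name, value in [("known genes", known_gene),
                    ("exome-wide", exome_wide),
                    ("suggestive", suggestive)]:
    print(f"{name}: {value:.2e}")
```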

Table 4 Whole exome-wide association analysis for all cause dementia and hypertension.


**Fig. 2: Q–Q plots of whole exome-wide analysis results for all cause dementia and hypertension.**

Discussion

We proposed two novel approaches, FHAT and FHAT-O, that incorporate FH to increase power to detect rare variant associations in aggregation unit-based analysis. We also offered a novel way to adapt the LT-FH method to analyze rare variants. Because FH of disease is often collected through questionnaires in large cohorts, the added power is at no added cost. We applied our methods to exploit the FH from parents in simulation analysis and using the UK Biobank data, by assuming that the parents are conditionally independent. We analyzed both parents through a single relatives’ model, and combined the scores calculated from parents and probands with appropriate weights to calculate the test statistics. Because the probands’ analysis is separate from the relatives’ analysis, our methods can handle the missingness in FH as presented in (1) and (4), and one can include all probands with or without FH to optimize the usage of data.

The power was evaluated at alpha = 2.5 × 10^–6 to represent the exome-wide significance for testing 20,000 genes as well as at a suggestive threshold of alpha = 2.5 × 10^–5. Assuming that the causal variants have larger effects in older people than in younger people, we showed that FHAT and FHAT-O have slightly greater power than SKAT-LTFH and SKATO-LTFH, respectively, with greatly reduced computational cost. Compared with SKAT and ACAT-V, FHAT has greater power in most cases. However, FHAT and SKAT are less powerful than Burden test and SKAT-O when there is a high proportion of causal variants, especially when the causal variants all have positive effects. FHAT-O combines the features of both FHAT and FHAT-Burden, has robust power in many scenarios, and outperforms other methods, as shown in our extensive power simulations. ACAT-V has slightly higher power in some cases where the proportion of causal variants is low, which was expected because only a few genetic variants contribute to the results in ACAT-V, whereas the score statistic for FHAT and FHAT-O is calculated using a linear combination of squared scores from both causal and non-causal variants.

We further demonstrated that our methods have improved significance after incorporating FH in association analyses of all cause dementia and hypertension using genotypes and phenotypes collected from the UK Biobank. We compared the results of FHAT, FHAT-O, SKAT-LTFH, and SKATO-LTFH, which use probands with both genotypes and phenotypes together with their parental history of disease, to those of other methods that only use probands. Variants in eight known AD/dementia regions and eight known hypertension regions were selected for the analysis. Using the significance level of 6.3 × 10^–3 for testing eight known genes, BCL3 and TOMM40 were significantly associated with all cause dementia, while other known AD/dementia regions had improved significance compared to the methods that do not incorporate FH. Some of the hypertension genes were less significant using our method to incorporate FH, which might be caused by additional noise resulting from a similar hypertension prevalence in probands and their parents. The FHAT and FHAT-O approaches yielded conclusions similar to those of SKAT-LTFH and SKATO-LTFH, respectively.

We evaluated type I error at various alpha levels and disease prevalences. We did not evaluate the type I error for SKAT-LTFH and SKATO-LTFH at the exome-wide significance (alpha = 2.5 × 10^–6) to limit the computational cost. The type I error of SKAT was previously found to be conservative when the disease prevalence is ~50%, and the Burden test was found to have appropriate type I error when the case–control ratio is balanced [5,6,7]. However, SKAT, SKAT-O, Burden, and ACAT-V suffer from substantially inflated type I error when the prevalence is low, especially at lower alpha levels (i.e., alpha < 2.5 × 10^–4). In contrast, FHAT, SKAT-LTFH, FHAT-O, and SKATO-LTFH control the type I error rates relatively better. The type I error is overall well controlled using FHAT and FHAT-O in most scenarios, but a high inflation occurs for alpha = 2.5 × 10^–6 and prevalence = 10%, where the numbers of cases and controls are unbalanced (Table 1). An unbalanced case–control ratio yields inflated type I error rates because the imbalance invalidates the asymptotic assumption of logistic regression. The saddlepoint approximation method [34,35,36] and efficient resampling [37] have been successfully used to calibrate binary phenotype-based logistic mixed models when case–control ratios are extremely unbalanced. In the future, we plan to adopt these methods to properly account for unbalanced case–control ratios.

In the exome-wide association analysis, we used the same covariates (age, sex, PC1-5, PC11) as we did in the known region analysis for all cause dementia. However, as inflation was observed in our hypertension analysis (Fig. 2), we further adjusted for the spouse’s disease status in the parents’ model to account for the correlation between parents, in addition to the covariates of age, age squared, sex, BMI, PC1-PC5, PC8, and PC14. The FH could be correlated with household effects. In the future, we will extend the current approaches to allow for correlation in the analysis, as might be induced by household effects. Through the exome-wide analysis using FHAT and FHAT-O, we confirmed previously reported genes (TREM2, PVR, and ABCA7) [28–30] for AD/dementia as well as genes (GATA5, FGD5, DDN) [31–33] related to blood pressure and hypertension. Moreover, our methods identified three novel regions associated with all cause dementia (EFCAB3, EMSY, KLC3) using a suggestive exome-wide significance threshold. Replication analyses are needed to confirm these findings. While we observed inflated type I error for low prevalence in our simulations, we did not see evidence of large inflation of FHAT and FHAT-O in the all cause dementia analysis, as seen from the Q–Q plot (Fig. 2) and the genomic control inflation factors (λ_FHAT = 1.13 and λ_{FHAT-O} = 1.06 for the all cause dementia analysis). The methods require that all samples are unrelated. The generalized linear mixed model can be used to extend the current methods to related samples, which will allow us to incorporate FH from multiple relatives or handle consanguineous families.
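A genomic control inflation factor such as those reported above is commonly computed as the median of the 1-degree-of-freedom χ² statistics implied by the p values, divided by the median of the χ²₁ distribution. The sketch below follows that common definition, which is an assumption here since the text does not state how λ was calculated; the function name and the uniform stand-in p values are likewise hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def genomic_inflation(pvalues):
    """Median of chi2_1 statistics implied by the p values, divided by
    the median of the chi2_1 distribution (about 0.455)."""
    stats = chi2.isf(np.asarray(pvalues, dtype=float), df=1)
    return float(np.median(stats) / chi2.ppf(0.5, df=1))

# Stand-in for ~18K gene-level p values from a well-calibrated test.
rng = np.random.default_rng(4)
lam_gc = genomic_inflation(rng.uniform(size=18_000))
```

Values close to 1 indicate little genome-wide inflation. Because the median is insensitive to the tails, this summary complements rather than replaces the inspection of the Q–Q plot.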

Although the method development, simulation studies, and UK Biobank analysis described in this paper focused on population samples, our methods can also handle the ascertainment that occurs in case–control analysis, because the likelihood can be written as the product of the relatives’ phenotype model and the retrospective likelihood of the proband genotypes, which takes ascertainment into consideration: $P(G_i^P, Y_i^R \mid Y_i^P, X_i^P, X_i^R) = P(Y_i^R \mid G_i^P, Y_i^P, X_i^R)\,P(G_i^P \mid Y_i^P, X_i^P)$ (Supplementary Method). Equation (1) was derived under the assumption that the relatives’ phenotype and the probands’ covariates are independent conditional on the relatives’ covariates and the strength of the associations in relatives. However, when the proband covariates are believed to affect the relatives’ disease status, one can adjust for such covariates in the relatives’ model (3) to account for this effect. There might be a concern about the accuracy of the FH reported by the probands. Reporting bias in FH would lead to misclassification of relatives’ disease status, which might bias effect size estimates. However, the methods we proposed are variance component models that do not rely on effect estimates and only provide statistical significance (p values) for associations. The misclassification would affect the power of our methods, but would not affect the validity of the test (i.e., type I error), because the misclassification is not related to genotype data under the null hypothesis of no association between genotypes and disease status. We would expect minimal impact on our methods based on previously published work, in which FH was down-weighted according to its accuracy, calculated as the correlation of FH recorded among siblings, resulting in only small changes to the association results [11].
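The argument that misclassification unrelated to genotype leaves type I error intact can be checked with a small simulation. The Python sketch below is a deliberately simplified stand-in and not the FHAT test: it assumes a single binary carrier indicator, a relative's disease status generated independently of genotype (the null), reporting errors governed only by a hypothetical sensitivity and specificity, and a Pearson chi-square test on the resulting 2 × 2 table. All names and parameter values are illustrative assumptions.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_1df_pvalue(table):
    """Pearson chi-square p value (1 df) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0  # degenerate table: no evidence of association
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 df, the chi-square tail equals the two-sided normal tail.
    return 2.0 * (1.0 - _STD_NORMAL.cdf(stat ** 0.5))


def null_rejection_rate(sensitivity, specificity, n=500, reps=500,
                        carrier_freq=0.2, prevalence=0.3, alpha=0.05, seed=1):
    """Rejection rate under the null with genotype-independent misreporting."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        table = [[0, 0], [0, 0]]
        for _ in range(n):
            carrier = rng.random() < carrier_freq
            diseased = rng.random() < prevalence  # independent of genotype
            # Reporting error depends only on true status, never on genotype.
            if diseased:
                reported = rng.random() < sensitivity
            else:
                reported = rng.random() >= specificity
            table[int(carrier)][int(reported)] += 1
        if chi2_1df_pvalue(table) <= alpha:
            rejections += 1
    return rejections / reps
```

Calling `null_rejection_rate(1.0, 1.0)` (perfect reporting) and `null_rejection_rate(0.7, 0.9)` (imperfect reporting) should both give rejection rates close to the nominal 0.05, up to simulation noise. Misclassification of this kind would instead show up as reduced power when a true association exists.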

In this paper, we demonstrated that FHAT and FHAT-O are computationally efficient compared to SKAT-LTFH and SKATO-LTFH. The significantly reduced computational cost of FHAT and FHAT-O was shown in the time needed to run 1000 aggregation unit-based tests. Although we focused on binary traits and rare variants, our methods can also be applied to continuous traits, using linear models, and to common variants. The FHAT framework is flexible across various settings. Although we applied FHAT and FHAT-O to probands with parental disease status available in the simulations and the UK Biobank analysis, FHAT can be easily applied to other relative types. We also proposed an extension of FHAT, FHAT-O, to capture the features of SKAT-O, in particular the robustness of power when all genetic variants have the same direction of effect and the proportion of causal variants is high. The framework can easily be extended to incorporate other established aggregation unit-based methods. By allowing the incorporation of available FH, our methods improve on traditional rare variant studies that use only cases and controls, and have great potential to advance genetic association research.

Data availability

The datasets generated during and/or analyzed during the current study are available in the UK Biobank repository (http://www.ukbiobank.ac.uk/).

Code availability

The functions for FHAT and FHAT-O are available on the website http://sites.bu.edu/fhspl/publications/fhat/.

References

1. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008;40:695–701.
2. Schork NJ, Murray SS, Frazer KA, Topol EJ. Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev. 2009;19:212–9.
3. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11:446–50.
4. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89:82–93.
5. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615:28–56.
6. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–21.
7. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384.
8. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, et al. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet. 2012;91:224–37.
9. Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, Lin X. ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am J Hum Genet. 2019;104:410–21.
10. So H-C, Kwan JSH, Cherny SS, Sham PC. Risk prediction of complex diseases from family history and known susceptibility loci, with applications for cancer screening. Am J Hum Genet. 2011;88:548–65.
11. Hujoel MLA, Gazal S, Loh PR, Patterson N, Price AL. Liability threshold modeling of case-control status and family history of disease increases association power. Nat Genet. 2020;52:541–7.
12. Wacholder S, Hartge P, Struewing JP, Pee D, McAdams M, Brody L, et al. The kin-cohort study for estimating penetrance. Am J Epidemiol. 1998;148:623–30.
13. Ghosh A, Hartge P, Kraft P, Joshi AD, Ziegler RG, Barrdahl M, et al. Leveraging family history in population-based case-control association studies. Genet Epidemiol. 2014;38:114–22.
14. Liu JZ, Erlich Y, Pickrell JK. Case-control association mapping by proxy using family history of disease. Nat Genet. 2017;49:325–31.
15. Marioni RE, Harris SE, Zhang Q, McRae AF, Hagenaars SP, Hill WD, et al. GWAS on family history of Alzheimer’s disease. Transl Psychiatry. 2018;8:99.
16. Gim J, Kim W, Kwak SH, Choi H, Park C, Park KS, et al. Improving disease prediction by incorporating family disease history in risk prediction models with large-scale genetic data. Genetics. 2017;207:1147–55.
17. Shi M, Umbach DM, Weinberg CR. Using parental phenotypes in case-parent studies. Front Genet. 2015;6:221.
18. Thornton T, McPeek MS. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am J Hum Genet. 2007;81:321–37.
19. Lee S, Teslovich TM, Boehnke M, Lin X. General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet. 2013;91:42–53.
20. Davies RB. The distribution of a linear combination of chi-square random variables. J R Stat Soc. 1980;29:323–33.
21. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13:762–75.
22. Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27:2304–5.
23. Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol. 2013;37:196–204.
24. Mavani G, Kesar V, Devita MV, Rosenstock JL, Michelis MF, Schwimmer JA. Neurofibromatosis type 1-associated hypertension secondary to coarctation of the thoracic aorta. Clin Kidney J. 2014;7:394–5.
25. Sun B, Williams JS, Pojoga L, Chamarthi B, Lasky-Su J, Rabuy BA, et al. Renin gene polymorphism: its relationship to hypertension, renin levels and vascular responses. J Renin Angiotensin Aldosterone Syst. 2011;12:564–71.
26. Liu C, Kraja AT, Smith JA, Brody JA, Franceschini N, Bis JC, et al. Meta-analysis identifies common and rare variants influencing blood pressure and overlapping with metabolic trait loci. Nat Genet. 2016;48:1162–70.
27. Surendran P, Feofanova EV, Lahrouchi N, Ntalla I, Karthikeyan S, Cook J, et al. Discovery of rare variants associated with blood pressure regulation through meta-analysis of 1.3 million individuals. Nat Genet. 2020;52:1314–32.
28. Gratuze M, Leyns CEG, Holtzman DM. New insights into the role of TREM2 in Alzheimer’s disease. Mol Neurodegener. 2018;13:66.
29. Jansen IE, Savage JE, Watanabe K, Bryois J, Williams DM, Steinberg S, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat Genet. 2019;51:404–13.
30. Hollingworth P, Harold D, Sim R, Gerrish A, Lambert J-C, Carrasquillo MM, et al. Common variants at ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are associated with Alzheimer’s disease. Nat Genet. 2011;43:429–35.
31. Messaoudi S, He Y, Gutsol A, Wight A, Hébert RL, Vilmundarson RO, et al. Endothelial Gata5 transcription factor regulates blood pressure. Nat Commun. 2015;6:8835.
32. Ehret GB, Ferreira T, Chasman DI, Jackson AU, Schmidt EM, Johnson T, et al. The genetics of blood pressure regulation and its target organs from association studies in 342,415 individuals. Nat Genet. 2016;48:1171–84.
33. Giri A, Hellwege JN, Keaton JM, Park J, Qiu C, Warren HR, et al. Trans-ethnic association study of blood pressure determinants in over 750,000 individuals. Nat Genet. 2019;51:51–62.
34. Dey R, Schmidt EM, Abecasis GR, Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am J Hum Genet. 2017;101:37–49.
35. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999;86:929–35.
36. Daniels HE. Saddlepoint approximations in statistics. Ann Math Stat. 1954;25:631–50.
37. Lee S, Fuchsberger C, Kim S, Scott L. An efficient resampling method for calibrating single and gene-based rare variant association analysis in case-control studies. Biostatistics. 2016;17:1–15.

Acknowledgements

YW, GMP, ALD, and JD acknowledge grant support (No. 5U01AG058589) from the National Institute on Aging. HC acknowledges grant support (No. R00 HL130593) from the National Institutes of Health. This research was conducted using the UK Biobank Resource under Application Number 42614.

Author information

Authors and Affiliations

Department of Biostatistics, School of Public Health, Boston University, Massachusetts, MA, 02215, USA
Yanbing Wang, Gina M. Peloso, Anita L. DeStefano & Josée Dupuis
Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
Han Chen
Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
Han Chen

Contributions

YW developed the method, analyzed the data, and drafted the manuscript; HC provided critical input in the method development and revised the manuscript for intellectual content; GMP conceived the study, revised the manuscript for intellectual content; ALD revised the manuscript for intellectual content; JD supervised the work, conceived the study, revised the manuscript for intellectual content.

Corresponding authors

Correspondence to Yanbing Wang or Josée Dupuis.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary_Information_FHAT

About this article

Cite this article

Wang, Y., Chen, H., Peloso, G.M. et al. Exploiting family history in aggregation unit-based genetic association tests. Eur J Hum Genet 30, 1355–1362 (2022). https://doi.org/10.1038/s41431-021-00980-0

Received: 24 March 2021
Revised: 20 July 2021
Accepted: 04 October 2021
Published: 25 October 2021
Issue Date: December 2022
DOI: https://doi.org/10.1038/s41431-021-00980-0

This article is cited by

Leveraging family history in genetic association analyses of binary traits
- Yixin Zhang
- James B. Meigs
- Chloé Sarnowski
BMC Genomics (2022)
