Introduction

Despite the success of genome-wide association studies in identifying genetic variants associated with many diseases and traits,1 there are still many common diseases that cannot be explained by common genetic variants. Furthermore, the common variants identified through genome-wide association studies often account for only a small fraction of the heritability of the disease.2 This has led to the discovery that some common diseases arise from the aggregate effect of multiple rare variants, each of which individually has little impact. It has also been reported that rare variants tend to be functional alleles and have stronger effects on complex diseases than common variants.3 Recent advances in technology have made it possible to re-sequence large stretches of a genome in a cost-effective way. With the advent of next-generation sequencing data, the time is ripe for rare variant analysis, and as a result there has been a surge of papers on this topic.

The analysis of rare variants, however, presents many new challenges. Most existing methods of data analysis are not designed with rare attributes in mind, and their naive application will lead to imprecise estimates and tests of low power. To overcome this limitation, various strategies have been proposed to handle rare variant data, including collapsing,4 weighting,5 thresholding6 and pooling.7, 8 Neale et al.9 propose a C-alpha test based on comparing the expected variance with the actual variance of the distribution of allele frequencies. Lin and Tang10 propose the use of score-type tests, and Wu et al.11 propose the sequence kernel association test within the framework of a random variant effects model. All the above procedures are concerned with testing the overall significance of a collection of variants rather than variant selection, which is the focus of this article. Two rare variant selection procedures that we are aware of are the ‘RARECOVER’ method proposed by Bhatia et al.,12 and the increase in score statistic procedure proposed by Hoffmann et al.13 In developing RARECOVER, Bhatia et al.12 propose taking the union of rare genetic variants, which they define as those with minor allele frequency (MAF) between 0.0001 and 0.1. The union variant is said to have occurred if one or more of its component variants has occurred. By taking their union, variants with low individual MAF are combined to form a union variant with a higher frequency of occurrence that is more amenable to conventional statistical analyses. RARECOVER is essentially a greedy step-up procedure: at each step, the variant that maximizes Pearson’s χ2-statistic upon taking the union with the variants already selected in the current set S (initially the empty set φ) is added to S, provided the increase in the Pearson statistic exceeds a certain threshold c. Bhatia et al.12 commented that the choice of c is not crucial, and they used c=0.5. We demonstrate that this choice of c is much too liberal, leading to hugely inflated type I error and very high false selection rates. As a remedy, we propose a random permutation approach to determine c for a given nominal level of the type I error. For data collected from a prospective study, the unweighted version of the procedure of Hoffmann et al.13 is based on statistics of the form
TS = Σi=1,…,N (Di − D̄) Σj∈S Rij,
where N is the sample size, Di the disease status of subject i, i=1,…,N, D̄ the mean of the Di, Rij=1 if subject i has rare variant j and Rij=0 otherwise, and S is a subset of {1,…,J}. Hoffmann et al.13 interpreted TS as the score test statistic for testing H0: β=0 in the logistic model
logit P(Di=1) = β0 + β Σj∈S Rij,
where S denotes the set of variants included in the above model. Similar to RARECOVER, Hoffmann, Marini and Witte’s SCORE procedure is a step-up procedure whereby at each step, the variant that maximizes the score test statistic is added to the current set S if the increase in the score statistic exceeds a certain threshold c. Again, we will use the random permutation approach to determine c.
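To make the two step-up searches concrete, here is a minimal R sketch written under our reading of the descriptions above; the data layout (an N × J carrier matrix R with a 0/1 disease vector D), the function names and the particular normalization of the score statistic are our own illustrations rather than the authors’ implementations.

## Illustrative R sketch of the two step-up procedures (not the authors' code).
## R is an N x J matrix with R[i, j] = 1 if subject i carries at least one copy
## of rare variant j, and D is the length-N 0/1 disease indicator.

## RARECOVER-type statistic: Pearson chi-square of the union variant versus disease.
stat_union_chisq <- function(R, D, S) {
  u   <- as.integer(rowSums(R[, S, drop = FALSE]) > 0)      # union variant
  tab <- table(factor(u, levels = 0:1), factor(D, levels = 0:1))
  if (any(rowSums(tab) == 0) || any(colSums(tab) == 0)) return(0)
  unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))
}

## SCORE-type statistic: score test of beta = 0 in the logistic model with
## covariate x_i = sum_{j in S} R_ij (the normalization shown is one standard
## choice; the exact form used by Hoffmann et al. may differ).
stat_score <- function(R, D, S) {
  x <- rowSums(R[, S, drop = FALSE])
  if (var(x) == 0) return(0)
  U <- sum((D - mean(D)) * x)                                # score for beta
  U^2 / (mean(D) * (1 - mean(D)) * sum((x - mean(x))^2))     # score test statistic
}

## Generic greedy step-up loop: add the variant giving the largest increase in
## the chosen statistic; stop when the increase no longer exceeds the threshold c.
step_up <- function(R, D, stat, c) {
  S <- integer(0); current <- 0
  repeat {
    cand <- setdiff(seq_len(ncol(R)), S)
    if (length(cand) == 0) break
    vals <- vapply(cand, function(j) stat(R, D, c(S, j)), numeric(1))
    if (max(vals) - current <= c) break
    S <- c(S, cand[which.max(vals)]); current <- max(vals)
  }
  S
}

With a threshold c in hand, step_up(R, D, stat_union_chisq, c) mimics RARECOVER and step_up(R, D, stat_score, c) mimics SCORE.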

As useful information is at a premium for rare variant analysis, it is clear that some kind of pooling of information is necessary. The way we propose to do this is to treat the control frequencies of all the rare variants, which are not of direct interest, as random effects that follow a common distribution, and the effects of disease on the causal variants, which are of substantive interest, as fixed effects, resulting in a mixed model. Although mixed models have been used before in the genome-wide association studies literature, they are mostly based on the prospective14 approach of modeling the probability of disease occurrence given the genetic variants of an individual. We choose to model instead the distribution of the genetic variants of a person given his/her disease status, which is more in line with the retrospective14 nature of a case–control study. It will be shown in the Materials and methods section that the use of a Poisson approximation and gamma-distributed random effects results in a generalized negative binomial distribution for the joint distribution of the control and case frequencies. Variants are selected by conducting stepwise likelihood ratio tests (LRTs) based on the generalized negative binomial likelihood. Again, a random permutation approach is used to determine the critical value to account for multiple testing. The superiority of the proposed method over RARECOVER and SCORE is demonstrated in a simulation study. When applied to identify rare variants associated with obesity, the proposed stepwise LRT procedure identifies one additional variant that is not picked up by RARECOVER or SCORE.

Materials and methods

One hindrance in using the retrospective modeling approach (which focuses on the distribution of the genetic variants given disease status) when there are multiple variants is the dearth of distributions for multivariate binary/discrete data. Fortunately, it is common in the rare variant literature to assume independence between rare variants (RVs); see Li and Leal,4 Neale et al.,9 Bhatia et al.,12 and the references therein. With this independence assumption, the retrospective approach is greatly simplified because we can model the allele frequency given disease status one variant at a time.

Generalized negative binomial likelihood for the collapsed frequencies

Suppose there are n0 controls, n1 cases and J rare genetic variants under consideration. To focus on issues that are particular to rare variants, we consider only RVs in our analysis, which is also what Bhatia et al.12 did. As rare mutations are infrequently observed, for each individual and at each marker, we will combine the scenario of having ‘two mutant alleles’ with that of ‘one mutant allele only’ into a merged category of ‘at least one mutant allele’, in the hope that the merged category will have a slightly larger frequency than the original MAF. Another advantage of merging genotypes 1 and 2 is that it frees us from making the assumption of Hardy–Weinberg equilibrium, which is highly unlikely to be true for rare alleles. Thus, for j=1,…,J we define Yj0 as the number of individuals among the n0 controls who have ‘at least one occurrence’ of the jth RV. Similarly, Yj1 is the number of individuals among the n1 cases who have ‘at least one occurrence’ of the jth RV. To respect the data generating mechanism of case–control studies, we adopt a retrospective approach to model the data as
Yj0 | pj0 ~ Binomial(n0, pj0),   (1)
independently of
Yj1 | pj1 ~ Binomial(n1, pj1),   (2)
given the probabilities pj0 and pj1 of at least one occurrence of the jth RV for the controls and cases, respectively. We refer to Yj0 and Yj1 as the collapsed frequencies in this paper. As explained earlier, we assume independence between RVs, which means that the (Yj0,Yj1) for different j are independent. Rather than treating pj0 and pj1 as fixed parameters, of which there are many if J is large, we reduce the number of parameters by treating p10,p20,…,pJ0 as random effects generated from a common distribution, which enables pooling of information across variants. The fact that the alleles are rare means that the pj0 and pj1 are small, and so if the sample sizes n0 and n1 are reasonably large, the two binomial distributions given by (1) and (2) can be approximated by Poisson distributions to yield
Yj0 | rj0 ~ Poisson(rj0),   (3)
where rj0=n0pj0, and
Yj1 | rj1 ~ Poisson(rj1),   (4)
with rj1=n1pj1=fn0pj1, and the factor f=n1/n0 reduces to 1 when n0=n1. For the sake of mathematical convenience, we will assume that
rj0 = n0pj0 ~ Gamma(α, λ),  j=1,…,J,   (5)
independently, which implies that marginally, the collapsed control frequencies Yj0, j=1,…,J, are independently distributed according to the negative binomial distribution,15 with probability function
P(Yj0 = y) = [Γ(y + α)/{Γ(α) y!}] λ^α/(1 + λ)^(y+α),  y = 0, 1, 2,…,   (6)
where μ=α/λ=E(Yj0) is the marginal mean of Yj0, and v=α^−1 is the dispersion parameter of the negative binomial distribution. Now let
δj = log pj1 − log pj0
be the difference between the case and control probabilities for RV j on the log scale, so that rj1=n1pj1=fn0pj1=f exp(δj)rj0. Note that δj>0 corresponds to a deleterious effect of the variant, and δj<0 a protective effect, and f=n1/n0 is a factor to account for unequal sample sizes. As the δj are of substantive interest, we will treat them as fixed rather than random effects. According to (4), Yj1 is conditionally Poisson, and under the log link,
log E(Yj1 | rj0) = log rj1 = log f + δj + log rj0
is a linear function of both the fixed effects δj and logarithm of the gamma-distributed random effects rj0. This results in what is called a generalized linear mixed model. As the random effects rj0, j=1,…,J, are independent, it follows that marginally, the vectors (Yj0,Yj1) are also independent. Generalizing (6), the joint distribution of (Yj0,Yj1) is given by
P(Yj0 = yj0, Yj1 = yj1) = [Γ(yj0 + yj1 + α)/{Γ(α) yj0! yj1!}] λ^α {f exp(δj)}^yj1/{1 + λ + f exp(δj)}^(yj0+yj1+α)   (7)
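For concreteness, the logarithm of the joint probability (7), as written above, can be coded directly; the parameterization α=1/v and λ=α/μ follows the definitions given earlier, and the function name is ours.

## Log of the generalized negative binomial probability (7) for one variant,
## in terms of the marginal control mean mu, dispersion v and log effect delta;
## f = n1/n0 adjusts for unequal numbers of cases and controls.
loglik_gnb <- function(y0, y1, mu, v, delta, f) {
  alpha  <- 1 / v            # gamma shape
  lambda <- alpha / mu       # gamma rate, since mu = alpha/lambda
  m <- f * exp(delta)        # case intensity multiplier: r_j1 = m * r_j0
  lgamma(y0 + y1 + alpha) - lgamma(alpha) - lfactorial(y0) - lfactorial(y1) +
    alpha * log(lambda) + y1 * log(m) -
    (y0 + y1 + alpha) * log(1 + lambda + m)
}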
Stepwise LRTs

Within the earlier described framework, the variant selection problem that we are interested in can be formulated as finding those δj that are not equal to 0. Our approach to variant selection is to conduct stepwise LRT in the following way. We begin with testing the complete null hypothesis
Hφ: δ1 = δ2 = ··· = δJ = 0
against the alternative
H{k}: δk ≠ 0; δj = 0 for j ≠ k,
one k at a time using the LRT, and we select the rare variant that maximizes the likelihood ratio statistic provided the value of this maximized statistic is greater than some critical value or cutoff c. We will postpone discussion of the choice of c to the next section and treat c as given for our present discussion. After we have included a variant, we will try to add one more variant by maximizing the likelihood ratio statistic of the current subset versus the current subset plus one more RV. The procedure stops when the maximized likelihood ratio statistic is <c. To make this operational, we need to maximize the marginal likelihood function under the generic null hypothesis
H0: δj ≠ 0 for j ∈ S; δj = 0 for j ∉ S,
where S denotes the subset of RVs with non-zero δj in the current model. The likelihood function under H0 based on the observed frequencies yj0 and yj1 is given by the following product of terms like (7)
lik(H0) = ∏j=1,…,J P(Yj0 = yj0, Yj1 = yj1),  with δj = 0 for all j ∉ S.
To obtain the maximum likelihood estimates μ̂, v̂ and δ̂j (j∈S) of the parameters under H0, we differentiate the log-likelihood with respect to μ, v, and δj, j∈S, and set the derivatives to 0. The Newton–Raphson algorithm is used to solve these score equations.

The generic alternative hypothesis under our stepwise setup is
H1: δj ≠ 0 for j ∈ S′; δj = 0 for j ∉ S′,
where S′ = S ∪ {k} for some k ∉ S. The likelihood function lik(H1) under this alternative hypothesis has the same form as lik(H0) given above, but with S replaced by S′ = S ∪ {k}. The maximum likelihood estimates of the parameters μ, v, and δj, j ∈ S′ = S ∪ {k}, can again be obtained using the Newton–Raphson algorithm. The LRT statistic of H0 against H1 is given by
WS vs S′ = 2 log{l̂ik(H1)/l̂ik(H0)},
where l̂ik(H1) and l̂ik(H0) are the maximum values of lik(H1) and lik(H0), respectively.
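A minimal R sketch of the stepwise search just described, reusing loglik_gnb from the previous sketch; for brevity, the Newton–Raphson iterations detailed in the appendix are replaced here by a general-purpose optimizer, and all names are illustrative.

## Total log-likelihood over all J variants; delta is zero outside the current subset S.
total_loglik <- function(par, y0, y1, S, f) {
  mu <- exp(par[1]); v <- exp(par[2])       # enforce positivity via the log scale
  delta <- numeric(length(y0)); delta[S] <- par[-(1:2)]
  sum(loglik_gnb(y0, y1, mu, v, delta, f))
}

## Maximized log-likelihood for a given subset S (the paper uses Newton-Raphson;
## a generic optimizer is used here only to keep the sketch short).
max_loglik <- function(y0, y1, S, f) {
  start <- c(log(mean(y0) + 0.5), 0, rep(0, length(S)))
  optim(start, total_loglik, y0 = y0, y1 = y1, S = S, f = f,
        control = list(fnscale = -1, maxit = 5000))$value
}

## Stepwise LRT: add the variant with the largest likelihood ratio statistic,
## stop when the maximal statistic falls below the threshold c.
stepwise_lrt <- function(y0, y1, f, c) {
  S <- integer(0); base <- max_loglik(y0, y1, S, f)
  repeat {
    cand <- setdiff(seq_along(y0), S)
    if (length(cand) == 0) break
    ll <- vapply(cand, function(k) max_loglik(y0, y1, c(S, k), f), numeric(1))
    W  <- 2 * (ll - base)
    if (max(W) < c) break
    S <- c(S, cand[which.max(W)]); base <- max(ll)
  }
  S
}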

Choice of critical value

As the hypotheses being tested are nested, the LRT statistic Wk=Wφ vs {k} for testing the initial complete null hypothesis Hφ: δ1=···=δJ=0 versus H{k}: δk≠0; δj=0 for j≠k is asymptotically distributed like χ2 with 1 degree of freedom under Hφ, and so setting c=3.84 will control the asymptotic type I error at level 0.05 one test at a time. The familywise error rate, however, is not controlled at 0.05, because a variant is selected at stage 1 whenever the maximal statistic M=maxk Wk exceeds c, and this already implies the selection of at least one variant (more variants could be selected at the subsequent stages of the stepwise procedure). Thus, the familywise type I error is P(M>c). The null distribution of the maximal statistic M=maxk Wk is, however, quite complicated because the statistics W1,W2,…,WJ are not independent even when the variants are. The Bonferroni inequality leads to an overly large critical value. We propose instead the following permutation approach to find the critical value adaptively. Represent the observed data by a J by (n0+n1) matrix Y={yji}, where yji=1 if the ith individual has at least one copy of the jth RV, and yji=0 otherwise. We can assume without loss of generality that i=1,…,n0 correspond to the controls, and i=n0+1,…,n0+n1 correspond to the cases, so that Yj0=Σi=1,…,n0 yji and Yj1=Σi=n0+1,…,n0+n1 yji are the control and case frequencies, respectively. The permutation approach operates as follows.

Step 1: Generate a random permutation r(1),…,r(n0+n1) of 1,…,n0+n1.

Step 2: Compute Yj0*=Σi=1,…,n0 yj,r(i) and Yj1*=Σi=n0+1,…,n0+n1 yj,r(i) for j=1,…,J.

Step 3: For k=1,…,J, re-compute the likelihood ratio statistic for testing S=φ versus S={k} using the permuted frequencies Yj0*, Yj1*, j=1,…,J. Denote that by Wk*, and let M*=maxk Wk*.

Repeat the above procedure independently B times to obtain M1*,…,MB*. For a nominal type I error of ɛ, we will choose c to be the upper 100ɛ empirical percentile of M1*,…,MB*. To save computing time, we use B=100, which seems to do a reasonable job in our simulation study. With the aim of filtering the list to a manageable but sufficiently rich set of RVs for further investigation and confirmatory study, we recommend a relatively liberal nominal type I error such as 0.1 or 0.2 (because of the inherent problem of insufficient sample size for rare mutations) so as to select more variants at the expense of a possible increase in the false selection rate, which hovers around 15–30% in the simulation studies reported later. An alternative to the permutation approach is the bootstrap.16 The bootstrap differs from the permutation approach in that r(1),…,r(n0+n1) are now obtained by sampling with replacement n0+n1 times from 1,…,n0+n1. We do not expect sampling with or without replacement to make a big difference, and hence we expect the permutation and bootstrap approaches to produce similar critical values. This turns out to be the case in our real data examples.
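Steps 1–3 can be sketched compactly in R, reusing max_loglik from the previous sketch; the function name and defaults are ours.

## Permutation critical value for the stepwise LRT (Steps 1-3 above).
## Y is the J x (n0 + n1) 0/1 matrix with controls in the first n0 columns.
perm_critical_value <- function(Y, n0, n1, f, B = 100, eps = 0.1) {
  Mstar <- replicate(B, {
    r  <- sample(n0 + n1)                       # Step 1: random permutation
    y0 <- rowSums(Y[, r[1:n0], drop = FALSE])   # Step 2: permuted control counts
    y1 <- rowSums(Y[, r[(n0 + 1):(n0 + n1)], drop = FALSE])  # permuted case counts
    base <- max_loglik(y0, y1, integer(0), f)
    W <- vapply(seq_len(nrow(Y)),               # Step 3: single-variant LRT statistics
                function(k) 2 * (max_loglik(y0, y1, k, f) - base), numeric(1))
    max(W)
  })
  quantile(Mstar, probs = 1 - eps)              # upper 100*eps percentile of M*_1,...,M*_B
}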

To have a fair comparison of the proposed stepwise LRT procedure with RARECOVER and SCORE, we ought to control the type I error of all three procedures at the same level. Bhatia et al.12 recommended the use of c=0.5. Based on the evidence of our simulation study (not shown here to save space), this choice of c is much too liberal and grossly over-selects the number of variants, with false selection rates easily reaching 70% or more. Hoffmann, Marini and Witte13 seem to suggest the use of c=0, which also over-selects. The solution that we propose to overcome this problem is again to use the random permutation approach to determine the cutoffs for all three procedures.

Results

Selection of rare variants associated with obesity

Against the background that blockade of the endocannabinoid receptor reduces obesity and improves metabolic abnormalities, the Comprehensive Rimonabant Evaluation Study of Cardiovascular ENDpoints and Outcomes (CRESCENDO) clinical trial (trial number NCT00263042 in ClinicalTrials.gov) was conducted to assess whether rimonabant, a cannabinoid-1 receptor blocker, would improve major vascular event-free survival. The subjects in this study are patients with abdominal obesity and with previously manifested or increased risk of vascular disease. More details about the study design and protocol can be found at http://clinicaltrials.gov/ct/show/NCT00263042 and in Topol et al.17 Our concern here is not with cardiovascular outcomes, but with finding rare genetic variants that are associated with obesity. Out of 2958 Caucasian individuals aged 55 years or older in the CRESCENDO cohort, Harismendy et al.18 selected individuals at the two extreme ends of the body mass index distribution for DNA sequencing. In the end, 143 individuals (73 men and 70 women) with body mass index >40 kg m−2 were selected as the cases (obese persons), and 146 individuals (74 men and 72 women) with body mass index <30 kg m−2 were selected as the controls. The selected samples are thus balanced in gender. The DNA samples of the cases and controls were re-sequenced around two genes, FAAH and MGLL, known to be involved in endocannabinoid metabolism. Bhatia et al.12 also made use of this data set to illustrate their RARECOVER procedure, and we extracted the data from their online supplementary materials. Apparently, some new cases and controls have been added, because the online data consist of 148 cases and 150 controls, on which our analysis is based. For each rare variant in the FAAH and MGLL regions, Table 1 lists the collapsed frequencies (that is, the number of individuals with at least one occurrence of the rare variant) among the 148 cases and the 150 controls. The variants in bold type, 16 of them in the FAAH region and 12 in the MGLL region, were those selected by Bhatia et al.12 using RARECOVER with critical value c=0.5. Among the 16 RVs selected by RARECOVER in the FAAH region, the case versus control frequencies are 1:0 for 12 of them. Likewise, 6 of the 12 RVs selected by RARECOVER in the MGLL region have a frequency comparison of 1:0. In our view, it is hard to justify this mass selection of RVs, each of which occurs only once in the entire sample of 148 cases. As commented earlier, it is better to try to control the familywise type I error by using the permutation approach to determine the critical value. We will do this for all three selection procedures: stepwise LRT, RARECOVER and SCORE.

Table 1 Case- versus control-collapsed frequency comparisons for 32 rare variants near the FAAH gene on chromosome 1 and 25 rare variants near the MGLL gene on chromosome 3

Rare variants in the MGLL and FAAH regions

Even though it is commonly believed that RVs are at linkage equilibrium (that is, occur independently), we should check whether this assumption holds for the data at hand. One way to test whether two RVs are in linkage disequilibrium (LD) is to compute their correlation r from a sample of n individuals. As the data are bivariate binary rather than bivariate normal, we do not use the usual test statistic (n−2) × r2/(1−r2) to test the significance of r. Rather, we use the Pearson test of independence in the corresponding 2 × 2 table of frequencies, which, for the present case of binary variables, is numerically equal to nr2. As we are dealing with RVs, the expected frequencies will be low in some cells, and the χ2 approximation may not be accurate. As a remedy, we use the option provided in the R function ‘chisq.test’ to compute P-values by B=10 000 Monte Carlo simulations. We consider the MGLL region first. With 25 RVs in this region, there are 25C2=300 pairs of RVs to be tested for LD. To correct for multiple testing, we use Bonferroni’s correction and declare LD for a pair of RVs only when the P-value is <0.05/300=0.000167. Of the 300 pairs, only the correlation between RV6 and RV7 is declared significant by Bonferroni’s method (n=298, r=0.913, P-value=0.0001). It may not be appropriate to combine the cases with the controls to test for LD between two RVs because there may be a systematic difference between the case and control frequencies, but the same conclusion is reached if we test LD using the control data only (n=150) or the case data only (n=148). Looking more closely at RVs 6 and 7, out of the total of 298 individuals, the two variants occur simultaneously for 11 individuals, and separately for only two individuals (one for each variant), with neither variant occurring for the remaining 285 individuals. Furthermore, the positions of the two variants differ by only one (chr3:129031590 versus chr3:129031591), and so they are in tight LD. Thus, we drop RV7 (with frequencies 9:3) and keep RV6 (with frequencies 10:2) in our analysis. RVs 19 and 20 also differ by only one position, and they both have a collapsed frequency ratio of 2:0. However, their sample correlation is only 0.497, with P-value 0.014, which is not significant when the familywise type I error is set at level 0.05. Thus, we keep both RVs 19 and 20 in our analysis.

Ignoring significance for the time being, the order in which RVs are selected by stepwise LRT (Materials and methods) can be found in the top panel of Table 2. The first three are RVs 6, 3 and 1, with collapsed frequency ratios of 10:2, 15:6 and 18:13, respectively. The associated stepwise (maximal) likelihood ratio statistics are 7.22, 7.45 and 5.07. The permutation-based critical values turn out to be 6.28 for nominal level 0.1 and 6.90 for level 0.05. Thus, whether we aim to control the type I error at 0.05 or 0.1, the proposed stepwise LRT procedure will select RVs 6 and 3, but not RV1. As commented before, the bootstrap critical values (6.01 and 6.90) are similar to the permutation critical values, and the conclusion remains unchanged for this example. We will use the permutation approach to determine critical values in the remainder of this paper.
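The pairwise LD screening described at the start of this subsection can be reproduced with base R along the following lines; the function names are ours.

## Monte Carlo Pearson test of independence between two rare variants;
## y1 and y2 are 0/1 carrier indicators for the same n individuals
## (both variants are assumed to be observed at least once).
ld_pvalue <- function(y1, y2, B = 10000) {
  chisq.test(table(y1, y2), simulate.p.value = TRUE, B = B)$p.value
}

## Bonferroni threshold for all pairwise tests among J rare variants,
## e.g. 0.05/300 = 0.000167 for the J = 25 RVs in the MGLL region.
bonferroni_cut <- function(J, alpha = 0.05) alpha / choose(J, 2)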

Table 2 Selection of RVs associated with obesity in the MGLL region using stepwise LRT, RARECOVER and SCORE procedures with critical values obtained using random permutations and case- versus control-collapsed frequencies given in parentheses

The middle and lower panels of Table 2 show the results for RARECOVER and SCORE using permutation-based critical values. It can be seen that whether at level 0.05 or 0.1, both procedures select RV 6 only, but not RV3. It appears that by pooling information across variants, the proposed stepwise LRT procedure was able to identify one more potential causal variant.

In the FAAH region, only the correlation between RV4 and RV7 is found significant at level 0.05 after applying Bonferroni’s correction, but we do not merge these two variants and keep them separate as they are quite far apart. It can be seen from Table 3 that at level 0.1, no variant is selected by any of the three procedures. If one is willing to increase the level to 0.2 to induce more discoveries, then stepwise LRT will flag RV28 (with frequency ratio 18:10) as a variant worthy of further investigation, but RARECOVER and SCORE will select no variant even at level 0.2.

Table 3 Selection of RVs associated with obesity in the FAAH region using stepwise LRT, RARECOVER and SCORE procedures with critical values obtained using random permutations and case- versus control-collapsed frequencies given in parentheses

Our findings so far are purely empirical. Bhatia et al.12 and Harismendy et al.18 have reported findings that corroborate ours, and they also offer some scientific conjectures to explain how the selected RVs in the MGLL and FAAH regions could cause obesity.

Simulation results when the assumed model is correct

Mimicking the MGLL example, we generate data for n0=150 controls, n1=148 cases, and 24 RVs in the following way. First, we simulate rj0=n0pj0=150pj0 (j=1,…,24) according to the gamma (α, λ) distribution with v=α^−1=0.98 and μ=α/λ=1.72, as in the model selected by stepwise LRT based on the original MGLL data (given in bold type in Table 2). We then divide rj0 by n0=150 to get the pj0 and then generate the control data in the form of binary vectors (Yij0, j=1,…,24), i=1,…,150, with Yij0 ~ Bernoulli(pj0) independently. Summing over individuals, we obtain Yj0=Σi=1,…,150 Yij0. The case data (Yij1, j=1,…,24), i=1,…,148, are generated with Yij1 ~ Bernoulli(pj1) independently, where pj1=exp(δj)pj0. We consider two parameter settings for the δj. In setting (a), we have δ1=…=δ24=0, which corresponds to the situation of no causal variant. For this setting, we focus on the type I error. In setting (b), we set δ3=1.24, δ6=1.68, and δj=0 for j≠3,6, as in the model selected by stepwise LRT based on the original MGLL data. For this setting, our interest will focus on the procedure’s ability to select RVs 3 and 6, and on the false selection rate. To compare selection procedures on an equal footing, we will use the random permutation approach to determine the critical values, with the nominal type I error set at a liberal level of 0.2 (as our aim is to select a sufficiently rich set of potential causal RVs for further investigation).
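A sketch of this data-generating scheme in R, using the parameter values of setting (b); the function name is ours, and the individual Bernoulli draws are summed directly as binomial counts.

## Simulate collapsed control and case frequencies under the gamma random-effects
## model of the simulation study (setting (b): delta_3 = 1.24, delta_6 = 1.68).
simulate_collapsed <- function(n0 = 150, n1 = 148, J = 24, mu = 1.72, v = 0.98,
                               delta = replace(numeric(J), c(3, 6), c(1.24, 1.68))) {
  alpha <- 1 / v; lambda <- alpha / mu
  r0 <- rgamma(J, shape = alpha, rate = lambda)   # r_j0 = n0 * p_j0
  p0 <- pmin(r0 / n0, 1)                          # control carrier probabilities (capped at 1 as a safeguard)
  p1 <- pmin(exp(delta) * p0, 1)                  # case carrier probabilities
  list(y0 = rbinom(J, size = n0, prob = p0),      # collapsed control frequencies
       y1 = rbinom(J, size = n1, prob = p1))      # collapsed case frequencies
}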

The results based on 100 sets of simulations are presented in the top panel of Table 4. It can be seen that the type I errors of the three procedures range from 0.12 to 0.15 when there is no causal variant, and so the procedures are conservative. In setting (b), where RVs 3 and 6 are the causal variants with δ3=1.24 and δ6=1.68, RV6 is selected more often by stepwise LRT than by RARECOVER and SCORE (47 times as compared with 40 and 39). The number of times that RVs 3 and 6 are selected together is also highest for stepwise LRT (15 versus 11 and 10). Over the 100 samples simulated, RV3 is selected by stepwise LRT 26 times, RV6 is selected 47 times, whereas the other non-causal variants are selected a total of 15 times; thus, the false selection rate for stepwise LRT is 15/(26+47+15)=0.17. The false selection rates of both RARECOVER and SCORE are 0.167. Although the set of figures given above suggests only modest power for stepwise LRT to select the right variants, we expect the power to improve with larger sample sizes and/or larger variant effects (that is, larger values of δ3 and δ6). To illustrate this point, we conduct a simulation study with the same model parameters but double the sample sizes (that is, 300 controls and 296 cases), and the results of stepwise LRT improve to selecting RV3 35 times, RV6 55 times, and both together 21 times, with a false selection rate of 0.167, which is almost unchanged.

Table 4 Operating characteristics of stepwise LRT, RARECOVER and SCORE procedures with nominal level 0.2 and critical values obtained using random permutations, based on 100 sets of data (148 cases, 150 controls and 24 variants) simulated under both null and non-null models, gamma and non-gamma random effects, and independent and correlated rare variants

Simulation results when the assumed model is incorrect

To investigate how the performance of the three procedures is affected by violations of the model assumptions, we conduct two more sets of simulations. For the first set of extra simulations, we focus on non-gamma rj0. A general class of mixtures of Poisson distributions to model over-dispersed count data has been proposed by Hougaard et al.19 A prominent member of that class is the inverse Gaussian–Poisson distribution. The second panel of Table 4 summarizes the simulation results when the rj0 are simulated from an inverse Gaussian distribution with the same mean and variance as the gamma distribution used to generate the results of the top panel of Table 4. We can see that the powers of all three procedures are slightly reduced, but stepwise LRT is still marginally more powerful and has a lower false selection rate (0.186 versus 0.215 and 0.261).

Our last set of simulations is designed to look into the effect of LD (that is, correlated variants) on our procedure and its competitors. Instead of simulating independent binary observations given the pj0, we simulate correlated binary observations given the pj0. As there is a dearth of distributions for correlated discrete data, we resort to the familiar technique of dichotomizing multivariate normal latent variables, as is commonly done in multivariate probit models. To be specific, we first generate the rj0 from the same gamma distribution as before. Given the rj0, and for each i=1,…,n0, rather than generating Yij0 ~ Bernoulli(pj0=rj0/n0) independently for j=1,…,24, we simulate a multivariate normal vector (Zij0, j=1,…,24) ~ N24(η,Σ), where Σ=(σij) is a correlation matrix with σii=1 and σij=ρ^|ci−cj|, where ci and cj are the positions of RVs i and j listed in Table 1. This seems to be a sensible correlation structure where the correlation decays exponentially with inter-loci distance. We set ρ to 0.99, so that ρ^200=0.134, where 200 is roughly the average distance between successive RVs in the MGLL region (Table 1). The mean vector η of the multivariate normal distribution is chosen to make the marginal distribution of the dichotomized variable Ỹij0=I{Zij0>0} the same as that of Yij0, namely, Bernoulli(pj0). But unlike the Yij0, which are mutually independent, the Ỹij0=I{Zij0>0} are correlated because the Zij0 are. We will treat the Ỹij0 as the control data in this set of simulations. Similarly, to simulate the case data, we generate (Zij1, j=1,…,24) ~ N24(η,Σ), where the correlation matrix Σ is as defined above, and the mean vector η is chosen to make Ỹij1=I{Zij1>0} ~ Bernoulli(pj1), where pj1=exp(δj)pj0. As before, we consider two settings: (a) all δj=0, and (b) δ3=1.24, δ6=1.68, and δj=0 for j≠3,6. The results based on 100 simulations are given in the bottom panel of Table 4 under the headings 3(a) and 3(b). It can be seen that for setting 3(a), the type I error is inflated to 0.23 for RARECOVER, 0.25 for stepwise LRT, and 0.28 for SCORE. In setting 3(b), just as in settings 1(b) and 2(b), stepwise LRT selects the correct RVs more often than RARECOVER and SCORE, but the false selection rates increase to 0.283, 0.25 and 0.241, respectively.
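The latent-variable construction can be sketched in R as follows, using MASS::mvrnorm for the multivariate normal draws; here p stands for the vector of carrier probabilities (the pj0 for controls, the pj1 for cases), pos for the variant positions, and the function name is ours.

## Correlated 0/1 carrier indicators via dichotomized multivariate normal latent variables.
library(MASS)

simulate_correlated_carriers <- function(n, p, pos, rho = 0.99) {
  Sigma <- rho^abs(outer(pos, pos, "-"))   # sigma_ij = rho^|c_i - c_j|, unit variances
  eta   <- qnorm(p)                        # so that P(Z_ij > 0) = pnorm(eta_j) = p_j
  Z     <- mvrnorm(n, mu = eta, Sigma = Sigma)
  (Z > 0) * 1L                             # dichotomize: Y_ij = I{Z_ij > 0}
}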

Simulation results for the case of protective variants

To investigate the power of the proposed stepwise LRT procedure in picking out protective variants, we conduct extra simulations with parameter values set to v=1, μ=5, J=24, δ6=−2 or −3, and δj=0 (j≠6), with n0=150 and n1=148 as in the CRESCENDO study. This amounts to an average MAF of 5/150=0.033 in the control population. Strictly speaking, a variant with MAF 0.033 is not very rare, but we need to leave some room for the variant to be even rarer among the cases if it is in fact protective. The nominal level is 0.1 and the critical value is determined adaptively using the random permutation approach. The number of times RV6 is picked out in 100 simulations, the number of times other variants are wrongly picked, and the false selection rate are shown in the top panel of Table 5 under the heading (a). The power is moderate, but this is to be expected as it is very difficult to detect the difference between the rare and the rarer. Increasing the sample size will obviously help. Part (b) of Table 5 depicts the results when the sample sizes are doubled to n0=300, n1=296, and μ is correspondingly increased to 10. It can be seen that the power is now increased to 0.48 and 0.57 for the cases δ6=−2 and δ6=−3, respectively, whereas the false selection rate is kept at 0.2 or less.

Table 5 Power of the stepwise LRT procedure with nominal level 0.1 and permutation-based threshold to detect a protective variant based on 100 sets of data simulated from the gamma random effects model with v=1, J=24 and all δj=0 except δ6

Discussion

Due to the scarcity of information in rare variant analysis, it is important to pool information across variants. We show that one way to do this is to treat all the rare variant occurrence probabilities in the control sample as random effects from a common distribution, resulting in a mixed model. Even though the retrospective likelihood approach and its advantages have been advocated by Epstein and Satten20 and Satten and Epstein,21 genetic variant analyses are currently still mostly based on prospective likelihoods even when the study is retrospective, with justifications provided by Prentice and Pyke.22 One reason for this tendency is that it is easier to model disease status given genotypes by some kind of binary regression model than to model the genotypes at multiple sites given disease status. Our mixed models are based on a retrospective formulation to better reflect the ‘sampling given disease status’ nature of case–control studies. The justifiable and commonly made assumption of independence between rare genetic variants offers a great simplification of the retrospective likelihood, which we take full advantage of. We also prefer to treat the variant effects, which are of substantive interest, as fixed rather than random effects, which we think is the sensible choice. As we are dealing with RVs, we collapse frequencies, as it is highly unlikely for an individual to have two mutant alleles at the same locus. Another advantage of modeling the collapsed frequencies is that it does not assume Hardy–Weinberg equilibrium, which is unlikely to be true for rare alleles. Efficiency calculations reported by Kuk, Xu and Li23 within the context of haplotype frequency estimation when there is no random effect demonstrate that collapsing frequencies does not lead to much loss of estimation efficiency when the alleles are rare. We take rarity explicitly into account by making use of the Poisson approximation in Equations (3) and (4). For convenience, we assume in (5) that the rj0=n0pj0 are gamma distributed, which results in the generalized negative binomial distribution (7) for (Yj0,Yj1). Our simulation results show that the resulting stepwise LRT procedure is more powerful than RARECOVER and SCORE in selecting the correct variants. When applied to the MGLL data, stepwise LRT picks up one more variant than RARECOVER and SCORE, namely RV3. The LRT is computationally more demanding as it involves finding the maximum likelihood estimates under both the null and alternative models. But as pointed out in the Materials and methods section, the log-likelihood function has the same form under both models, and both can be fitted by the Newton–Raphson algorithm without too much difficulty. We outline in the appendix how to obtain the required first and second derivatives of the log-likelihood function. The proposed LRT approach is a parametric one and may not be robust against departures from the parametric assumptions. The fact that we obtain critical values by the random permutation approach rather than from the asymptotic distribution of the LRT should make the procedure somewhat more robust, and this is corroborated to a certain degree by the findings of our simulation study.

Non-gamma distributions could have been used for rj0=n0pj0 in our mixed model. Our modest simulation study suggests that the proposed stepwise LRT procedure based on gamma random effects is not too adversely affected by departure from the gamma assumption. Furthermore, the resulting negative binomial distribution passed the Pearson goodness of fit test when fitted to the control data in both the MGLL and FAAH regions.

Rather than integrating out the random effects rj0 to obtain the generalized negative binomial distribution (7), an alternative is to eliminate the rj0 as nuisance parameters by conditioning on the sum of the two independent Poisson counts Yj0 and Yj1, which yields a binomial distribution for Yj1 conditionally. Although not mentioned explicitly, this is the theoretical basis for the C-alpha test of Neale et al.9 But as rj0 has been eliminated from the conditional likelihood, there is no pooling of information across variants.
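In the present notation, the standard Poisson result being invoked here is that, given the total count nj = Yj0 + Yj1,

Yj1 | Yj0 + Yj1 = nj ~ Binomial(nj, f exp(δj)/{1 + f exp(δj)}),

so the conditional success probability involves only δj and f, and the random effect rj0 cancels out.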

Our framework allows both deleterious (δ>0) and protective (δ<0) variants. It is also possible to incorporate covariate effects into our retrospective modeling framework.