Introduction

Despite the success of genome-wide association studies in identifying genetic variants associated with many diseases and traits,1 there are still many common diseases that cannot be explained by common genetic variants. Furthermore, the common variants identified through genome-wide association studies often account for only a small fraction of the heritability of the disease.2 This has led to the discovery that some common diseases arise from the aggregate effect of multiple rare variants, each of which individually has little impact. It has also been reported that rare variants tend to be functional alleles and have stronger effects on complex diseases than common variants.3 Recent advances in technology have made it possible to re-sequence large stretches of a genome in a cost-effective way. With the advent of next-generation sequencing data, the time is ripe for rare variant analysis, and as a result there has been a surge of papers on this topic.

The analysis of rare variants, however, presents many new challenges. Most existing methods of data analysis are not designed with rare attributes in mind, and their naive application will lead to imprecise estimates and tests of low power. To overcome this limitation, various strategies have been proposed to handle rare variant data, including collapsing,4 weighting,5 thresholding6 and pooling.7, 8 Neale et al.9 propose a C-alpha test based on comparing the expected variance with the actual variance of the distribution of allele frequencies. Lin and Tang10 propose the use of score-type tests, and Wu et al.11 propose the sequence kernel association test within the framework of a random variant effects model. All the above procedures are concerned with testing the overall significance of a collection of variants rather than variant selection, which is the focus of this article. Two rare variant selection procedures that we are aware of are the ‘RARECOVER’ method proposed by Bhatia et al.,12 and the increase in score statistic procedure proposed by Hoffmann et al.13 In developing RARECOVER, Bhatia et al.12 propose taking the union of rare genetic variants, which they define as those with minor allele frequency (MAF) between 0.0001 and 0.1. The union variant is said to have occurred if one or more of its component variants has occurred. By taking their union, variants with low individual MAF are combined to form a union variant with a higher frequency of occurrence that is more amenable to conventional statistical analyses. RARECOVER is essentially a greedy step-up procedure: at each step, the variant that maximizes Pearson’s χ2-statistic upon taking the union with the variants already selected in the current set S (initially the empty set φ) is added to S, provided the increase in the Pearson statistic exceeds a certain threshold c. Bhatia et al.12 commented that the choice of c is not crucial, and they used c=0.5. We demonstrate that this choice of c is much too liberal, leading to hugely inflated type I error and very high false selection rates. As a remedy, we propose a random permutation approach to determine c for a given nominal level of the type I error. For data collected from a prospective study, the unweighted version of the procedure of Hoffmann et al.13 is based on statistics of the form
TS = Σi=1,…,N (Di − D̄) Σj∈S Rij,
where N is the sample size, Di the disease status of subject i, i=1,…,N, D̄ the mean of the Di, Rij=1 if subject i has rare variant j and Rij=0 otherwise, and S is a subset of {1,…,J}. Hoffmann et al.13 interpreted TS as the score test statistic for testing H0: β=0 in the logistic model
logit P(Di=1) = β0 + β Σj∈S Rij,
where S denotes the set of variants included in the above model. Similar to RARECOVER, Hoffmann, Marini and Witte’s SCORE procedure is a step-up procedure whereby at each step, the variant that maximizes the score test statistic is added to the current set S if the increase in the score statistic exceeds a certain threshold c. Again, we will use the random permutation approach to determine c.
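To make the two step-up searches concrete, here is a minimal R sketch written under our reading of the descriptions above; the data layout (an N × J carrier matrix R with a 0/1 disease vector D), the function names and the particular normalization of the score statistic are our own illustrations rather than the authors’ implementations.

## Illustrative R sketch of the two step-up procedures (not the authors' code).
## R is an N x J matrix with R[i, j] = 1 if subject i carries at least one copy
## of rare variant j, and D is the length-N 0/1 disease indicator.

## RARECOVER-type statistic: Pearson chi-square of the union variant versus disease.
stat_union_chisq <- function(R, D, S) {
  u   <- as.integer(rowSums(R[, S, drop = FALSE]) > 0)      # union variant
  tab <- table(factor(u, levels = 0:1), factor(D, levels = 0:1))
  if (any(rowSums(tab) == 0) || any(colSums(tab) == 0)) return(0)
  unname(suppressWarnings(chisq.test(tab, correct = FALSE)$statistic))
}

## SCORE-type statistic: score test of beta = 0 in the logistic model with
## covariate x_i = sum_{j in S} R_ij (the normalization shown is one standard
## choice; the exact form used by Hoffmann et al. may differ).
stat_score <- function(R, D, S) {
  x <- rowSums(R[, S, drop = FALSE])
  if (var(x) == 0) return(0)
  U <- sum((D - mean(D)) * x)                                # score for beta
  U^2 / (mean(D) * (1 - mean(D)) * sum((x - mean(x))^2))     # score test statistic
}

## Generic greedy step-up loop: add the variant giving the largest increase in
## the chosen statistic; stop when the increase no longer exceeds the threshold c.
step_up <- function(R, D, stat, c) {
  S <- integer(0); current <- 0
  repeat {
    cand <- setdiff(seq_len(ncol(R)), S)
    if (length(cand) == 0) break
    vals <- vapply(cand, function(j) stat(R, D, c(S, j)), numeric(1))
    if (max(vals) - current <= c) break
    S <- c(S, cand[which.max(vals)]); current <- max(vals)
  }
  S
}

With a threshold c in hand, step_up(R, D, stat_union_chisq, c) mimics RARECOVER and step_up(R, D, stat_score, c) mimics SCORE.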

As useful information is at a premium for rare variant analysis, it is clear that some kind of pooling of information is necessary. The way we propose to do this is to treat the control frequencies of all the rare variants, which are not of direct interest, as random effects that follow a common distribution, and the effects of disease on the causal variants, which are of substantive interest, as fixed effects, resulting in a mixed model. Although mixed models have been used before in the genome-wide association studies literature, they are mostly based on the prospective14 approach of modeling the probability of disease occurrence given the genetic variants of an individual. We choose to model instead the distribution of the genetic variants of a person given his/her disease status, which is more in line with the retrospective14 nature of a case–control study. It will be shown in the Materials and methods section that the use of a Poisson approximation and gamma-distributed random effects results in a generalized negative binomial distribution for the joint distribution of the control and case frequencies. Variants are selected by conducting stepwise likelihood ratio tests (LRTs) based on the generalized negative binomial likelihood. Again, a random permutation approach is used to determine the critical value to account for multiple testing. The superiority of the proposed method over RARECOVER and SCORE is demonstrated in a simulation study. When applied to identify rare variants associated with obesity, the proposed stepwise LRT procedure identifies one additional variant that is not picked up by RARECOVER or SCORE.

Materials and methods

One hindrance in using the retrospective modeling approach (which focuses on the distribution of the genetic variants given disease status) when there are multiple variants is the dearth of distributions for multivariate binary/discrete data. Fortunately, it is common in the rare variant literature to assume independence between rare variants (RVs); see Li and Leal,4 Neale et al.,9 Bhatia et al.,12 and the references therein. With this independence assumption, the retrospective approach is greatly simplified because we can model the allele frequency given disease status one variant at a time.

Generalized negative binomial likelihood for the collapsed frequencies

Suppose there are n0 controls, n1 cases and J rare genetic variants under consideration. To focus on issues that are particular to rare variants, we consider only RVs in our analysis, which is also what Bhatia et al.12 did. As rare mutations are infrequently observed, for each individual and at each marker, we will combine the scenario of having ‘two mutant alleles’ with that of ‘one mutant allele only’ into a merged category of ‘at least one mutant allele’, in the hope that the merged category will have a slightly larger frequency than the original MAF. Another advantage of merging genotypes 1 and 2 is that it frees us from making the assumption of Hardy–Weinberg equilibrium, which is highly unlikely to be true for rare alleles. Thus, for j=1,…,J we define Yj0 as the number of individuals among the n0 controls who have ‘at least one occurrence’ of the jth RV. Similarly, Yj1 is the number of individuals among the n1 cases who have ‘at least one occurrence’ of the jth RV. To respect the data generating mechanism of case–control studies, we adopt a retrospective approach to model the data as
Yj0 | pj0 ~ Binomial(n0, pj0),   (1)
independently of
Yj1 | pj1 ~ Binomial(n1, pj1),   (2)
given the probabilities pj0 and pj1 of at least one occurrence of the jth RV for the controls and cases, respectively. We refer to Yj0 and Yj1 as the collapsed frequencies in this paper. As explained earlier, we assume independence between RVs, which means that the (Yj0,Yj1) for different j are independent. Rather than treating pj0 and pj1 as fixed parameters, of which there are many if J is large, we reduce the number of parameters by treating p10,p20,…,pJ0 as random effects generated from a common distribution, which enables pooling of information across variants. The fact that the alleles are rare means that the pj0 and pj1 are small, and so if the sample sizes n0 and n1 are reasonably large, the two binomial distributions given by (1) and (2) can be approximated by Poisson distributions to yield
Yj0 | rj0 ~ Poisson(rj0),   (3)
where rj0=n0pj0, and
Yj1 | rj1 ~ Poisson(rj1),   (4)
with rj1=n1pj1=fn0pj1, and the factor f=n1/n0 reduces to 1 when n0=n1. For the sake of mathematical convenience, we will assume that
rj0 = n0pj0 ~ Gamma(α, λ),  j=1,…,J,   (5)
independently, which implies that marginally, the collapsed control frequencies Yj0, j=1,…,J, are independently distributed according to the negative binomial distribution,15 with probability function
P(Yj0 = y) = [Γ(y + α)/{Γ(α) y!}] λ^α/(1 + λ)^(y+α),  y = 0, 1, 2,…,   (6)
where μ=α/λ=E(Yj0) is the marginal mean of Yj0, and v=α^−1 is the dispersion parameter of the negative binomial distribution. Now let
δj = log pj1 − log pj0
be the difference between the case and control probabilities for RV j on the log scale, so that rj1=n1pj1=fn0pj1=f exp(δj)rj0. Note that δj>0 corresponds to a deleterious effect of the variant, and δj<0 a protective effect, and f=n1/n0 is a factor to account for unequal sample sizes. As the δj are of substantive interest, we will treat them as fixed rather than random effects. According to (4), Yj1 is conditionally Poisson, and under the log link,
log E(Yj1 | rj0) = log rj1 = log f + δj + log rj0
is a linear function of both the fixed effects δj and logarithm of the gamma-distributed random effects rj0. This results in what is called a generalized linear mixed model. As the random effects rj0, j=1,…,J, are independent, it follows that marginally, the vectors (Yj0,Yj1) are also independent. Generalizing (6), the joint distribution of (Yj0,Yj1) is given by
P(Yj0 = yj0, Yj1 = yj1) = [Γ(yj0 + yj1 + α)/{Γ(α) yj0! yj1!}] λ^α {f exp(δj)}^yj1/{1 + λ + f exp(δj)}^(yj0+yj1+α)   (7)
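For concreteness, the logarithm of the joint probability (7), as written above, can be coded directly; the parameterization α=1/v and λ=α/μ follows the definitions given earlier, and the function name is ours.

## Log of the generalized negative binomial probability (7) for one variant,
## in terms of the marginal control mean mu, dispersion v and log effect delta;
## f = n1/n0 adjusts for unequal numbers of cases and controls.
loglik_gnb <- function(y0, y1, mu, v, delta, f) {
  alpha  <- 1 / v            # gamma shape
  lambda <- alpha / mu       # gamma rate, since mu = alpha/lambda
  m <- f * exp(delta)        # case intensity multiplier: r_j1 = m * r_j0
  lgamma(y0 + y1 + alpha) - lgamma(alpha) - lfactorial(y0) - lfactorial(y1) +
    alpha * log(lambda) + y1 * log(m) -
    (y0 + y1 + alpha) * log(1 + lambda + m)
}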
Stepwise LRTs

Within the earlier described framework, the variant selection problem that we are interested in can be formulated as finding those δj that are not equal to 0. Our approach to variant selection is to conduct stepwise LRT in the following way. We begin with testing the complete null hypothesis
Hφ: δ1 = δ2 = ··· = δJ = 0
against the alternative
H{k}: δk ≠ 0; δj = 0 for j ≠ k,
one k at a time using the LRT, and we select the rare variant that maximizes the likelihood ratio statistic provided the value of this maximized statistic is greater than some critical value or cutoff c. We will postpone discussion of the choice of c to the next section and treat c as given for our present discussion. After we have included a variant, we will try to add one more variant by maximizing the likelihood ratio statistic of the current subset versus the current subset plus one more RV. The procedure stops when the maximized likelihood ratio statistic is <c. To make this operational, we need to maximize the marginal likelihood function under the generic null hypothesis
H0: δj ≠ 0 for j ∈ S; δj = 0 for j ∉ S,
where S denotes the subset of RVs with non-zero δj in the current model. The likelihood function under H0 based on the observed frequencies yj0 and yj1 is given by the following product of terms like (7)
lik(H0) = ∏j=1,…,J P(Yj0 = yj0, Yj1 = yj1),  with δj = 0 for all j ∉ S.
To obtain the maximum likelihood estimates μ̂, v̂ and δ̂j (j∈S) of the parameters under H0, we differentiate the log-likelihood with respect to μ, v, and δj, j∈S, and set the derivatives to 0. The Newton–Raphson algorithm is used to solve these score equations.

The generic alternative hypothesis under our stepwise setup is
H1: δj ≠ 0 for j ∈ S′; δj = 0 for j ∉ S′,
where S′ = S ∪ {k} for some k ∉ S. The likelihood function lik(H1) under this alternative hypothesis has the same form as lik(H0) given above, but with S replaced by S′ = S ∪ {k}. The maximum likelihood estimates of the parameters μ, v, and δj, j ∈ S′ = S ∪ {k}, can again be obtained using the Newton–Raphson algorithm. The LRT statistic of H0 against H1 is given by
WS vs S′ = 2 log{l̂ik(H1)/l̂ik(H0)},
where l̂ik(H1) and l̂ik(H0) are the maximum values of lik(H1) and lik(H0), respectively.
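A minimal R sketch of the stepwise search just described, reusing loglik_gnb from the previous sketch; for brevity, the Newton–Raphson iterations detailed in the appendix are replaced here by a general-purpose optimizer, and all names are illustrative.

## Total log-likelihood over all J variants; delta is zero outside the current subset S.
total_loglik <- function(par, y0, y1, S, f) {
  mu <- exp(par[1]); v <- exp(par[2])       # enforce positivity via the log scale
  delta <- numeric(length(y0)); delta[S] <- par[-(1:2)]
  sum(loglik_gnb(y0, y1, mu, v, delta, f))
}

## Maximized log-likelihood for a given subset S (the paper uses Newton-Raphson;
## a generic optimizer is used here only to keep the sketch short).
max_loglik <- function(y0, y1, S, f) {
  start <- c(log(mean(y0) + 0.5), 0, rep(0, length(S)))
  optim(start, total_loglik, y0 = y0, y1 = y1, S = S, f = f,
        control = list(fnscale = -1, maxit = 5000))$value
}

## Stepwise LRT: add the variant with the largest likelihood ratio statistic,
## stop when the maximal statistic falls below the threshold c.
stepwise_lrt <- function(y0, y1, f, c) {
  S <- integer(0); base <- max_loglik(y0, y1, S, f)
  repeat {
    cand <- setdiff(seq_along(y0), S)
    if (length(cand) == 0) break
    ll <- vapply(cand, function(k) max_loglik(y0, y1, c(S, k), f), numeric(1))
    W  <- 2 * (ll - base)
    if (max(W) < c) break
    S <- c(S, cand[which.max(W)]); base <- max(ll)
  }
  S
}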

Choice of critical value

As the hypotheses being tested are nested, the LRT statistic Wk=Wφ vs {k} for testing the initial complete null hypothesis Hφ: δ1=···=δJ=0 versus H{k}: δk≠0; δj=0 for j≠k is asymptotically distributed like χ2 with 1 degree of freedom under Hφ, and so setting c=3.84 will control the asymptotic type I error at level 0.05 one test at a time. The familywise error rate, however, is not controlled at 0.05, because a variant is selected at stage 1 whenever the maximal statistic M=maxk Wk exceeds c, and this already implies the selection of at least one variant (more variants could be selected at the subsequent stages of the stepwise procedure). Thus, the familywise type I error is P(M>c). The null distribution of the maximal statistic M=maxk Wk is, however, quite complicated because the statistics W1,W2,…,WJ are not independent even when the variants are. The Bonferroni inequality leads to an overly large critical value. We propose instead the following permutation approach to find the critical value adaptively. Represent the observed data by a J by (n0+n1) matrix Y={yji}, where yji=1 if the ith individual has at least one copy of the jth RV, and yji=0 otherwise. We can assume without loss of generality that i=1,…,n0 correspond to the controls, and i=n0+1,…,n0+n1 correspond to the cases, so that Yj0=Σi=1,…,n0 yji and Yj1=Σi=n0+1,…,n0+n1 yji are the control and case frequencies, respectively. The permutation approach operates as follows.

Step 1: Generate a random permutation r(1),…,r(n0+n1) of 1,…,n0+n1.

Step 2: Compute Yj0*=Σi=1,…,n0 yj,r(i) and Yj1*=Σi=n0+1,…,n0+n1 yj,r(i) for j=1,…,J.

Step 3: For k=1,…,J, re-compute the likelihood ratio statistic for testing S=φ versus S={k} using the permuted frequencies Yj0*, Yj1*, j=1,…,J. Denote that by Wk*, and let M*=maxk Wk*.

Repeat the above procedure independently B times to obtain M1*,…,MB*. For a nominal type I error of ɛ, we will choose c to be the upper 100ɛ empirical percentile of M1*,…,MB*. To save computing time, we use B=100, which seems to do a reasonable job in our simulation study. With the aim of filtering the list to a manageable but sufficiently rich set of RVs for further investigation and confirmatory study, we recommend a relatively liberal nominal type I error such as 0.1 or 0.2 (because of the inherent problem of insufficient sample size for rare mutations) so as to select more variants at the expense of a possible increase in the false selection rate, which hovers around 15–30% in the simulation studies reported later. An alternative to the permutation approach is the bootstrap.16 The bootstrap differs from the permutation approach in that r(1),…,r(n0+n1) are now obtained by sampling with replacement n0+n1 times from 1,…,n0+n1. We do not expect sampling with or without replacement to make a big difference, and hence we expect the permutation and bootstrap approaches to produce similar critical values. This turns out to be the case in our real data examples.
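Steps 1–3 can be sketched compactly in R, reusing max_loglik from the previous sketch; the function name and defaults are ours.

## Permutation critical value for the stepwise LRT (Steps 1-3 above).
## Y is the J x (n0 + n1) 0/1 matrix with controls in the first n0 columns.
perm_critical_value <- function(Y, n0, n1, f, B = 100, eps = 0.1) {
  Mstar <- replicate(B, {
    r  <- sample(n0 + n1)                       # Step 1: random permutation
    y0 <- rowSums(Y[, r[1:n0], drop = FALSE])   # Step 2: permuted control counts
    y1 <- rowSums(Y[, r[(n0 + 1):(n0 + n1)], drop = FALSE])  # permuted case counts
    base <- max_loglik(y0, y1, integer(0), f)
    W <- vapply(seq_len(nrow(Y)),               # Step 3: single-variant LRT statistics
                function(k) 2 * (max_loglik(y0, y1, k, f) - base), numeric(1))
    max(W)
  })
  quantile(Mstar, probs = 1 - eps)              # upper 100*eps percentile of M*_1,...,M*_B
}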

To have a fair comparison of the proposed stepwise LRT procedure with RARECOVER and SCORE, we ought to control the type I error of all three procedures at the same level. Bhatia et al.12 recommended the use of c=0.5. Based on the evidence of our simulation study (not shown here to save space), this choice of c is much too liberal and grossly over-selects the number of variants, with false selection rates easily reaching 70% or more. Hoffmann, Marini and Witte13 seem to suggest the use of c=0, which also over-selects. The solution that we propose to overcome this problem is again to use the random permutation approach to determine the cutoffs for all three procedures.

Results

Selection of rare variants associated with obesity

Against the background that blockade of the endocannabinoid receptor reduces obesity and improves metabolic abnormalities, the Comprehensive Rimonabant Evaluation Study of Cardiovascular ENDpoints and Outcomes (CRESCENDO) clinical trial (trial number NCT00263042 in ClinicalTrials.gov) was conducted to assess whether rimonabant, a cannabinoid-1 receptor blocker, would improve major vascular event-free survival. The subjects in this study are patients with abdominal obesity and with previously manifested or increased risk of vascular disease. More details about the study design and protocol can be found at http://clinicaltrials.gov/ct/show/NCT00263042 and in Topol et al.17 Our concern here is not with cardiovascular outcomes, but with finding rare genetic variants that are associated with obesity. Out of 2958 Caucasian individuals aged 55 years or older in the CRESCENDO cohort, Harismendy et al.18 selected individuals at the two extreme ends of the body mass index distribution for DNA sequencing. In the end, 143 individuals (73 men and 70 women) with body mass index >40 kg m−2 were selected as the cases (obese persons), and 146 individuals (74 men and 72 women) with body mass index <30 kg m−2 were selected as the controls. The selected samples are thus balanced in gender. The DNA samples of the cases and controls were re-sequenced around two genes, FAAH and MGLL, known to be involved in endocannabinoid metabolism. Bhatia et al.12 also made use of this data set to illustrate their RARECOVER procedure, and we extracted the data from their online supplementary materials. Apparently, some new cases and controls have been added, because the online data consist of 148 cases and 150 controls, on which our analysis is based. For each rare variant in the FAAH and MGLL regions, Table 1 lists the collapsed frequencies (that is, the number of individuals with at least one occurrence of the rare variant) among the 148 cases and the 150 controls. The variants in bold type, 16 of them in the FAAH region and 12 in the MGLL region, were those selected by Bhatia et al.12 using RARECOVER with critical value c=0.5. Among the 16 RVs selected by RARECOVER in the FAAH region, the case versus control frequencies are 1:0 for 12 of them. Likewise, 6 of the 12 RVs selected by RARECOVER in the MGLL region have a frequency comparison of 1:0. In our view, it is hard to justify this mass selection of RVs, each of which occurs only once in the entire sample of 148 cases. As commented earlier, it is better to try to control the familywise type I error by using the permutation approach to determine the critical value. We will do this for all three selection procedures: stepwise LRT, RARECOVER and SCORE.

Table 1 Case- versus control-collapsed frequency comparisons for 32 rare variants near the FAAH gene on chromosome 1 and 25 rare variants near the MGLL gene on chromosome 3

Rare variants in the MGLL and FAAH regions

Even though it is commonly believed that RVs are at linkage equilibrium (that is, occur independently), we should check whether this assumption holds for the data at hand. One way to test whether two RVs are in linkage disequilibrium (LD) is to compute their correlation r from a sample of n individuals. As the data are bivariate binary rather than bivariate normal, we do not use the usual test statistic (n−2) × r2/(1−r2) to test the significance of r. Rather, we use the Pearson test of independence in the corresponding 2 × 2 table of frequencies, which, for the present case of binary variables, is numerically equal to nr2. As we are dealing with RVs, the expected frequencies will be low in some cells, and the χ2 approximation may not be accurate. As a remedy, we use the option provided in the R function ‘chisq.test’ to compute P-values by B=10 000 Monte Carlo simulations. We consider the MGLL region first. With 25 RVs in this region, there are 25C2=300 pairs of RVs to be tested for LD. To correct for multiple testing, we use Bonferroni’s correction and declare LD for a pair of RVs only when the P-value is <0.05/300=0.000167. Of the 300 pairs, only the correlation between RV6 and RV7 is declared significant by Bonferroni’s method (n=298, r=0.913, P-value=0.0001). It may not be appropriate to combine the cases with the controls to test for LD between two RVs because there may be a systematic difference between the case and control frequencies, but the same conclusion is reached if we test LD using the control data only (n=150) or the case data only (n=148). Looking more closely at RVs 6 and 7, out of the total of 298 individuals, the two variants occur simultaneously for 11 individuals, and separately for only two individuals (one for each variant), with neither variant occurring for the remaining 285 individuals. Furthermore, the positions of the two variants differ by only one (chr3:129031590 versus chr3:129031591), and so they are in tight LD. Thus, we drop RV7 (with frequencies 9:3) and keep RV6 (with frequencies 10:2) in our analysis. RVs 19 and 20 also differ by only one position, and they both have a collapsed frequency ratio of 2:0. However, their sample correlation is only 0.497, with P-value 0.014, which is not significant when the familywise type I error is set at level 0.05. Thus, we keep both RVs 19 and 20 in our analysis.

Ignoring significance for the time being, the order in which RVs are selected by stepwise LRT (Materials and methods) can be found in the top panel of Table 2. The first three are RVs 6, 3 and 1, with collapsed frequency ratios of 10:2, 15:6 and 18:13, respectively. The associated stepwise (maximal) likelihood ratio statistics are 7.22, 7.45 and 5.07. The permutation-based critical values turn out to be 6.28 for nominal level 0.1 and 6.90 for level 0.05. Thus, whether we aim to control the type I error at 0.05 or 0.1, the proposed stepwise LRT procedure will select RVs 6 and 3, but not RV1. As commented before, the bootstrap critical values (6.01 and 6.90) are similar to the permutation critical values, and the conclusion remains unchanged for this example. We will use the permutation approach to determine critical values in the remainder of this paper.
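The pairwise LD screening described at the start of this subsection can be reproduced with base R along the following lines; the function names are ours.

## Monte Carlo Pearson test of independence between two rare variants;
## y1 and y2 are 0/1 carrier indicators for the same n individuals
## (both variants are assumed to be observed at least once).
ld_pvalue <- function(y1, y2, B = 10000) {
  chisq.test(table(y1, y2), simulate.p.value = TRUE, B = B)$p.value
}

## Bonferroni threshold for all pairwise tests among J rare variants,
## e.g. 0.05/300 = 0.000167 for the J = 25 RVs in the MGLL region.
bonferroni_cut <- function(J, alpha = 0.05) alpha / choose(J, 2)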

Table 2 Selection of RVs associated with obesity in the MGLL region using stepwise LRT, RARECOVER and SCORE procedures with critical values obtained using random permutations and case- versus control-collapsed frequencies given in parentheses

The middle and lower panels of Table 2 show the results for RARECOVER and SCORE using permutation-based critical values. It can be seen that whether at level 0.05 or 0.1, both procedures select RV 6 only, but not RV3. It appears that by pooling information across variants, the proposed stepwise LRT procedure was able to identify one more potential causal variant.

In the FAAH region, only the correlation between RV4 and RV7 is found significant at level 0.05 after applying Bonferroni’s correction, but we do not merge these two variants and keep them separate as they are quite far apart. It can be seen from Table 3 that at level 0.1, no variant is selected by any of the three procedures. If one is willing to increase the level to 0.2 to induce more discoveries, then stepwise LRT will flag RV28 (with frequency ratio 18:10) as a variant worthy of further investigation, but RARECOVER and SCORE will select no variant even at level 0.2.

Table 3 Selection of RVs associated with obesity in the FAAH region using stepwise LRT, RARECOVER and SCORE procedures with critical values obtained using random permutations and case- versus control-collapsed frequencies given in parentheses

Our findings so far are purely empirical. Bhatia et al.12 and Harismendy et al.18 have reported findings that corroborate ours, and they also offer some scientific conjectures to explain how the selected RVs in the MGLL and FAAH regions could cause obesity.

Simulation results when the assumed model is correct

Mimicking the MGLL example, we generate data for n0=150 controls, n1=148 cases, and 24 RVs in the following way. First, we simulate rj0=n0pj0=150pj0 (j=1,…,24) according to the gamma (α, λ) distribution with v=α^−1=0.98 and μ=α/λ=1.72, as in the model selected by stepwise LRT based on the original MGLL data (given in bold type in Table 2). We then divide rj0 by n0=150 to get the pj0 and then generate the control data in the form of binary vectors (Yij0, j=1,…,24), i=1,…,150, with Yij0 ~ Bernoulli(pj0) independently. Summing over individuals, we obtain Yj0=Σi=1,…,150 Yij0. The case data (Yij1, j=1,…,24), i=1,…,148, are generated with Yij1 ~ Bernoulli(pj1) independently, where pj1=exp(δj)pj0. We consider two parameter settings for the δj. In setting (a), we have δ1=…=δ24=0, which corresponds to the situation of no causal variant. For this setting, we focus on the type I error. In setting (b), we set δ3=1.24, δ6=1.68, and δj=0 for j≠3,6, as in the model selected by stepwise LRT based on the original MGLL data. For this setting, our interest will focus on the procedure’s ability to select RVs 3 and 6, and on the false selection rate. To compare selection procedures on an equal footing, we will use the random permutation approach to determine the critical values, with the nominal type I error set at a liberal level of 0.2 (as our aim is to select a sufficiently rich set of potential causal RVs for further investigation).
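A sketch of this data-generating scheme in R, using the parameter values of setting (b); the function name is ours, and the individual Bernoulli draws are summed directly as binomial counts.

## Simulate collapsed control and case frequencies under the gamma random-effects
## model of the simulation study (setting (b): delta_3 = 1.24, delta_6 = 1.68).
simulate_collapsed <- function(n0 = 150, n1 = 148, J = 24, mu = 1.72, v = 0.98,
                               delta = replace(numeric(J), c(3, 6), c(1.24, 1.68))) {
  alpha <- 1 / v; lambda <- alpha / mu
  r0 <- rgamma(J, shape = alpha, rate = lambda)   # r_j0 = n0 * p_j0
  p0 <- pmin(r0 / n0, 1)                          # control carrier probabilities (capped at 1 as a safeguard)
  p1 <- pmin(exp(delta) * p0, 1)                  # case carrier probabilities
  list(y0 = rbinom(J, size = n0, prob = p0),      # collapsed control frequencies
       y1 = rbinom(J, size = n1, prob = p1))      # collapsed case frequencies
}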

The results based on 100 sets of simulations are presented in the top panel of Table 4. It can be seen that the type I errors of the three procedures range from 0.12 to 0.15 when there is no causal variant, and so the procedures are conservative. In setting (b), where RVs 3 and 6 are the causal variants with δ3=1.24 and δ6=1.68, RV6 is selected more often by stepwise LRT than by RARECOVER and SCORE (47 times as compared with 40 and 39). The number of times that RVs 3 and 6 are selected together is also highest for stepwise LRT (15 versus 11 and 10). Over the 100 samples simulated, RV3 is selected by stepwise LRT 26 times, RV6 is selected 47 times, whereas the other non-causal variants are selected a total of 15 times; thus, the false selection rate for stepwise LRT is 15/(26+47+15)=0.17. The false selection rates of both RARECOVER and SCORE are 0.167. Although the set of figures given above suggests only modest power for stepwise LRT to select the right variants, we expect the power to improve with larger sample sizes and/or larger variant effects (that is, larger values of δ3 and δ6). To illustrate this point, we conduct a simulation study with the same model parameters but double the sample sizes (that is, 300 controls and 296 cases), and the results of stepwise LRT improve to selecting RV3 35 times, RV6 55 times, and both together 21 times, with a false selection rate of 0.167, which is almost unchanged.

Table 4 Operating characteristics of stepwise LRT, RARECOVER and SCORE procedures with nominal level 0.2 and critical values obtained using random permutations, based on 100 sets of data (148 cases, 150 controls and 24 variants) simulated under both null and non-null models, gamma and non-gamma random effects, and independent and correlated rare variants

Simulation results when the assumed model is incorrect

To investigate how the performance of the three procedures is affected by violations of the model assumptions, we conduct two more sets of simulations. For the first set of extra simulations, we focus on non-gamma rj0. A general class of mixtures of Poisson distributions to model over-dispersed count data has been proposed by Hougaard et al.19 A prominent member of that class is the inverse Gaussian–Poisson distribution. The second panel of Table 4 summarizes the simulation results when the rj0 are simulated from an inverse Gaussian distribution with the same mean and variance as the gamma distribution used to generate the results of the top panel of Table 4. We can see that the powers of all three procedures are slightly reduced, but stepwise LRT is still marginally more powerful and has a lower false selection rate (0.186 versus 0.215 and 0.261).

Our last set of simulations is designed to look into the effect of LD (that is, correlated variants) on our procedure and its competitors. Instead of simulating independent binary observations given the pj0, we simulate correlated binary observations given the pj0. As there is a dearth of distributions for correlated discrete data, we resort to the familiar technique of dichotomizing multivariate normal latent variables, as is commonly done in multivariate probit models. To be specific, we first generate the rj0 from the same gamma distribution as before. Given the rj0, and for each i=1,…,n0, rather than generating Yij0 ~ Bernoulli(pj0=rj0/n0) independently for j=1,…,24, we simulate a multivariate normal vector (Zij0, j=1,…,24) ~ N24(η,Σ), where Σ=(σij) is a correlation matrix with σii=1 and σij=ρ^|ci−cj|, where ci and cj are the positions of RVs i and j listed in Table 1. This seems to be a sensible correlation structure where the correlation decays exponentially with inter-loci distance. We set ρ to 0.99, so that ρ^200=0.134, where 200 is roughly the average distance between successive RVs in the MGLL region (Table 1). The mean vector η of the multivariate normal distribution is chosen to make the marginal distribution of the dichotomized variable Ỹij0=I{Zij0>0} the same as that of Yij0, namely, Bernoulli(pj0). But unlike the Yij0, which are mutually independent, the Ỹij0=I{Zij0>0} are correlated because the Zij0 are. We will treat the Ỹij0 as the control data in this set of simulations. Similarly, to simulate the case data, we generate (Zij1, j=1,…,24) ~ N24(η,Σ), where the correlation matrix Σ is as defined above, and the mean vector η is chosen to make Ỹij1=I{Zij1>0} ~ Bernoulli(pj1), where pj1=exp(δj)pj0. As before, we consider two settings: (a) all δj=0, and (b) δ3=1.24, δ6=1.68, and δj=0 for j≠3,6. The results based on 100 simulations are given in the bottom panel of Table 4 under the headings 3(a) and 3(b). It can be seen that for setting 3(a), the type I error is inflated to 0.23 for RARECOVER, 0.25 for stepwise LRT, and 0.28 for SCORE. In setting 3(b), just as in settings 1(b) and 2(b), stepwise LRT selects the correct RVs more often than RARECOVER and SCORE, but the false selection rates increase to 0.283, 0.25 and 0.241, respectively.
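The latent-variable construction can be sketched in R as follows, using MASS::mvrnorm for the multivariate normal draws; here p stands for the vector of carrier probabilities (the pj0 for controls, the pj1 for cases), pos for the variant positions, and the function name is ours.

## Correlated 0/1 carrier indicators via dichotomized multivariate normal latent variables.
library(MASS)

simulate_correlated_carriers <- function(n, p, pos, rho = 0.99) {
  Sigma <- rho^abs(outer(pos, pos, "-"))   # sigma_ij = rho^|c_i - c_j|, unit variances
  eta   <- qnorm(p)                        # so that P(Z_ij > 0) = pnorm(eta_j) = p_j
  Z     <- mvrnorm(n, mu = eta, Sigma = Sigma)
  (Z > 0) * 1L                             # dichotomize: Y_ij = I{Z_ij > 0}
}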

Simulation results for the case of protective variants

To investigate the power of the proposed stepwise LRT procedure in picking out protective variants, we conduct extra simulations with parameter values set to v=1, μ=5, J=24, δ6=−2 or −3, and δj=0 (j≠6), with n0=150 and n1=148 as in the CRESCENDO study. This amounts to an average MAF of 5/150=0.033 in the control population. Strictly speaking, a variant with MAF 0.033 is not very rare, but we need to leave some room for the variant to be even rarer among the cases if it is in fact protective. The nominal level is 0.1 and the critical value is determined adaptively using the random permutation approach. The number of times RV6 is picked out in 100 simulations, the number of times other variants are wrongly picked, and the false selection rate are shown in the top panel of Table 5 under the heading (a). The power is moderate, but this is to be expected as it is very difficult to detect the difference between the rare and the rarer. Increasing the sample size will obviously help. Part (b) of Table 5 depicts the results when the sample sizes are doubled to n0=300, n1=296, and μ is correspondingly increased to 10. It can be seen that the power is now increased to 0.48 and 0.57 for the cases δ6=−2 and δ6=−3, respectively, whereas the false selection rate is kept at 0.2 or less.

Table 5 Power of the stepwise LRT procedure with nominal level 0.1 and permutation-based threshold to detect a protective variant based on 100 sets of data simulated from the gamma random effects model with v=1, J=24 and all δj=0 except δ6

Discussion

Due to the scarcity of information in rare variant analysis, it is important to pool information across variants. We show that one way to do this is to treat all the rare variant occurrence probabilities in the control sample as random effects from a common distribution, resulting in a mixed model. Even though the retrospective likelihood approach and its advantages have been advocated by Epstein and Satten20 and Satten and Epstein,21 genetic variant analyses are currently still mostly based on prospective likelihoods even when the study is retrospective, with justifications provided by Prentice and Pyke.22 One reason for this tendency is that it is easier to model disease status given genotypes by some kind of binary regression model than to model the genotypes at multiple sites given disease status. Our mixed models are based on a retrospective formulation to better reflect the ‘sampling given disease status’ nature of case–control studies. The justifiable and commonly made assumption of independence between rare genetic variants offers a great simplification of the retrospective likelihood, which we take full advantage of. We also prefer to treat the variant effects, which are of substantive interest, as fixed rather than random effects, which we think is the sensible choice. As we are dealing with RVs, we collapse frequencies, as it is highly unlikely for an individual to have two mutant alleles at the same locus. Another advantage of modeling the collapsed frequencies is that it does not assume Hardy–Weinberg equilibrium, which is unlikely to be true for rare alleles. Efficiency calculations reported by Kuk, Xu and Li23 within the context of haplotype frequency estimation when there is no random effect demonstrate that collapsing frequencies does not lead to much loss of estimation efficiency when the alleles are rare. We take rarity explicitly into account by making use of the Poisson approximation in Equations (3) and (4). For convenience, we assume in (5) that the rj0=n0pj0 are gamma distributed, which results in the generalized negative binomial distribution (7) for (Yj0,Yj1). Our simulation results show that the resulting stepwise LRT procedure is more powerful than RARECOVER and SCORE in selecting the correct variants. When applied to the MGLL data, stepwise LRT picks up one more variant than RARECOVER and SCORE, namely RV3. The LRT is computationally more demanding as it involves finding the maximum likelihood estimates under both the null and alternative models. But as pointed out in the Materials and methods section, the log-likelihood function has the same form under both models, and both can be fitted by the Newton–Raphson algorithm without too much difficulty. We outline in the appendix how to obtain the required first and second derivatives of the log-likelihood function. The proposed LRT approach is a parametric one and may not be robust against departures from the parametric assumptions. The fact that we obtain critical values by the random permutation approach rather than from the asymptotic distribution of the LRT should make the procedure somewhat more robust, and this is corroborated to a certain degree by the findings of our simulation study.

Non-gamma distributions could have been used for rj0=n0pj0 in our mixed model. Our modest simulation study suggests that the proposed stepwise LRT procedure based on gamma random effects is not too adversely affected by departure from the gamma assumption. Furthermore, the resulting negative binomial distribution passed the Pearson goodness of fit test when fitted to the control data in both the MGLL and FAAH regions.

Rather than integrating out the random effects rj0 to obtain the generalized negative binomial distribution (7), an alternative is to eliminate the rj0 as nuisance parameters by conditioning on the sum of the two independent Poisson counts Yj0 and Yj1, which yields a binomial distribution for Yj1 conditionally. Although not mentioned explicitly, this is the theoretical basis for the C-alpha test of Neale et al.9 But as rj0 has been eliminated from the conditional likelihood, there is no pooling of information across variants.
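In the present notation, the standard Poisson result being invoked here is that, given the total count nj = Yj0 + Yj1,

Yj1 | Yj0 + Yj1 = nj ~ Binomial(nj, f exp(δj)/{1 + f exp(δj)}),

so the conditional success probability involves only δj and f, and the random effect rj0 cancels out.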

Our framework allows both deleterious (δ>0) and protective (δ<0) variants. It is also possible to incorporate covariate effects into our retrospective modeling framework.