Introduction

Rare variants and gene–environment interactions (GXE) have been suggested in the literature as potential causes of ‘missing heritability’ in common diseases. We consider these problems by focusing on G being a rare haplotype variant (rHTV), which may reflect a combination of common single-nucleotide polymorphisms (SNPs). Thus rHTVs can be studied even in existing genome-wide association study data without any additional sequencing. Recently, we proposed an approach for rHTV association for case–control data called logistic Bayesian LASSO (LBL).1 We have extended it to handle GXE under the assumption of GE independence as well as when this assumption is relaxed or there is uncertainty about it.2, 3, 4 LBL shrinks the effects of unassociated haplotypes or their interactions with environmental covariates toward zero, so that the associated effects can be identified with considerable power.5, 6, 7 In fact, LBL is one of the most powerful rHTV methods.8

Complex sampling designs are being utilized with increasing frequency in case–control studies, especially for sampling of the controls. Typically, all available cases are included while controls are selected by stratified sampling using frequency matching with cases. Strata are usually formed based on known risk factors, such as race, age and sex. Often one or more strata, especially those containing minorities, are oversampled to obtain more controls. To account for different sampling rates arising from unequal sampling among strata, population weights are calculated, which indicate the number of population members represented by each sample subject. It is important to use these weights in the analysis to avoid bias in the results. However, at the same time, the use of weights also reduces the power and efficiency in case–control studies because population weights for controls are usually much larger than those for cases, leading to large variability in weights.9 To regain some of the lost efficiency, rescaling of population weights has been suggested.10 For example, one way of rescaling is such that the sum of the case (control) weights is equal to the case (control) sample size. Another type of rescaling is to have the sum of weights of controls be equal to the sum of weights of cases.

The US Kidney Cancer Study (KCS) was designed using a complex sampling scheme through stratified random sampling for recruiting subjects.11, 12 It was conducted at two sites, Chicago and Detroit. Cases identified from the Metropolitan Detroit Cancer Surveillance System and Cook County hospitals were recruited. At each site, the controls were frequency matched to cases based on age, sex and race. The matching rate of controls to cases was 2:1 in blacks and 1:1 in whites. Age groups were formed at 5-year intervals from 20 to 79 years. For age groups ≥65 years, controls were chosen from the database of Medicare beneficiaries, which has information on age, sex and race. For age groups <65 years, controls were chosen from a listing of the Department of Motor Vehicles, which contains information on age and sex but not on race. As a proxy for race, strata of low and high black densities were formed based on Census data. Thus the overall strata were formed by cross-classification of age, sex and race (or black density). In addition to these stratifying variables, KCS collected covariates such as smoking status, high blood pressure, education level and body mass index. As described in Colt et al.,12 to account for features related to the complex sampling design (differential sampling rates for controls and cases, survey nonresponse and deficiencies in coverage of the population at risk in the Department of Motor Vehicles and Medicare files), population weights were formed for each sampled individual.

Several authors have analyzed the KCS data and reported risk factors for kidney cancer such as smoking, obesity and hypertension.12, 13, 14 In addition, genetic susceptibility and its interaction with environmental factors have been reported to affect the risk in the KCS and other studies.15, 16, 17, 18, 19 In particular, the N-acetyltransferase 2 (NAT2) gene is known to code for an enzyme involved in the metabolism of tobacco carcinogens. Semenza et al.15 found that the smoking-related risk of kidney cancer is higher among carriers of a polymorphic variant of NAT2 called the slow acetylator genotype than among rapid acetylators. Longuemaux et al.20 observed a higher risk of kidney cancer for NAT2 slow acetylators who also carry CYP1A1 variants; however, they did not study GXE.

To the best of our knowledge, there is currently no rHTV association method that can account for complex sampling design such as that adopted in the KCS. To fill this gap, we adapt the LBL model to analyze this type of data. We show that stratified sampling with frequency matching can be easily accounted for in the framework of LBL without any additional modeling. We conduct simulation studies to investigate the properties of the extensions and compare with the original LBL method. Finally, we also analyze the KCS data to study the NAT2–smoking interaction.

Materials and Methods

The method mostly follows from Zhang et al.4 with necessary adaptation to include stratifying variables and population weights. Suppose we have a case–control sample consisting of n1 cases and n2 controls with n1+n2=n. Let Yi=1/0 denote the case/control status of the ith individual, i=1,…,n and Y=(Y1,…,Yn). Let Gi denote the observed genotype of the ith individual and G=(G1,…,Gn). As the haplotype pair of a person may not be completely determined from the observed genotype, we consider the set of haplotype pairs compatible with Gi and denote the rth haplotype pair in this set by Zir. Next we denote the vector of environmental covariates of the ith individual by Ei. For a complex sampling design, the stratifying variables have a key role, and they are denoted collectively as Si for individual i. In this paper, we consider both E and S to be categorical.

Complex sampling design structure and analysis

For the type of complex sampling considered in this paper, the sampling mechanism leads to known (rescaled) population weights, wi, for the ith individual. In simple terms, wi is the number of individuals in the population that the ith sampled person represents. It is essentially the ratio of the number of individuals available to be sampled (population size) to the number of individuals actually sampled (sample size) in the stratum to which the ith individual belongs. In surveys, non-response and poststratification adjustments are further made to these weights, and they are made available along with the rest of the sample data.9 The weights are typically rescaled to increase efficiency as mentioned in the Introduction section. Further details on calculation of weights will be provided in the ‘Simulation study’ section.

The basic principle that we follow for incorporating complex sampling design in the Bayesian framework is to write the analysis model conditional on the information and variables that describe the data collection process.21 That is, for writing the likelihood, we condition on the fact that the frequencies of cases and controls were matched (in some way that will become apparent below) in each stratum and on the values of the variables used for matching (in this case, the stratifying variables).

Retrospective likelihood

Conditional on {wi}, i=1,…, n and S, the retrospective likelihood of the observed data is written as:

where Ψ consists of the regression coefficients and the parameters associated with the haplotype pair frequencies, which will be specified more explicitly later. Note that conditioning on the case/control status (Y), stratifying variable information and the weight for each person automatically takes care of matching frequencies of cases and controls in all strata in the retrospective (in contrast to prospective) likelihood formulation. Now we will specify the model for each component of the likelihood. In the following, we suppress the subscripts i and r for simplicity without causing ambiguity.

Modeling of P(Z|E, S, Y=0)

We start with modeling P(Z|E,S,Y=0)=aZ|E,S, the frequency of haplotype pair Z in the control population for a given E and S. Suppose there are a total of m haplotypes and assume gene–environment (GE) dependence is only due to some of the stratifying variables and/or covariates, defined as Cdep, a subset of {E,S}. That is, conditional on Cdep, G and E are independent.22, 23 Then we denote the haplotype frequencies in the control population by f(Cdep)=(f1(Cdep),…, fm(Cdep)). We model aZ|E,S for a haplotype pair Z=(zk, zk′) as follows:

$$a_{Z|E,S} = \delta_{kk'}\, d\, f_k(C_{dep}) + (1-d)\, f_k(C_{dep})\, f_{k'}(C_{dep}), \qquad (2)$$

where δkk′=1 (0) if zk=zk′ (zk≠zk′), fk and fk′ are the frequencies of zk and zk′, and d∈(−1,1) is the within-population inbreeding coefficient that captures excess/reduction of homozygosity.24 For d=0, the above expression is equivalent to assuming Hardy–Weinberg Equilibrium (HWE) while other values of d allow Hardy–Weinberg Disequilibrium (HWD).

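We then model f(Cdep) using a multinomial logistic regression model to allow GE dependence.25 Let the mth haplotype be the baseline and assume Cdep has L levels excluding baseline(s): Cdep={C1,C2,…,CL}. For example, if Cdep consists of two binary variables, then L=2 with exclusion of the baseline category of each variable. Then we have

$$\log\frac{f_k(C_{dep})}{f_m(C_{dep})} = \gamma_{k0} + \sum_{l=1}^{L}\gamma_{kl}C_l, \quad k=1,\ldots,m-1. \qquad (3)$$

Thus

$$f_k(C_{dep}) = \frac{\exp\left(\gamma_{k0} + \sum_{l=1}^{L}\gamma_{kl}C_l\right)}{1+\sum_{j=1}^{m-1}\exp\left(\gamma_{j0} + \sum_{l=1}^{L}\gamma_{jl}C_l\right)}, \quad k=1,\ldots,m-1, \qquad f_m(C_{dep}) = \frac{1}{1+\sum_{j=1}^{m-1}\exp\left(\gamma_{j0} + \sum_{l=1}^{L}\gamma_{jl}C_l\right)}. \qquad (4)$$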
Let γ denote an (m−1) × (L+1) matrix with the (k, l)th element being γkl, k=1,...,m−1 and l=0,...,L. Combining Equations (2) and (4), we have now fully specified aZ|E, S(γ, d).
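As an illustration of how the two pieces fit together, the following minimal Python sketch computes haplotype frequencies from a multinomial logistic model with the last haplotype as baseline and then the haplotype-pair frequencies for a given inbreeding coefficient d. It is not the LBL implementation. The function names, the γ values and d=0.05 are hypothetical, and the sketch assumes that haplotype pairs are ordered, with pair frequency δ·d·fk + (1−d)·fk·fk′, so that the frequencies sum to one.

```python
import math

def haplotype_freqs(gamma, c):
    """Multinomial-logit haplotype frequencies; the last (m-th) haplotype is the baseline.

    gamma: (m-1) rows, each [gamma_k0, gamma_k1, ..., gamma_kL]
    c:     L dummy values describing the level of C_dep
    """
    linear = [row[0] + sum(g * x for g, x in zip(row[1:], c)) for row in gamma]
    denom = 1.0 + sum(math.exp(v) for v in linear)
    return [math.exp(v) / denom for v in linear] + [1.0 / denom]

def pair_freq(f, k, k_prime, d):
    """Frequency of the ordered haplotype pair (z_k, z_k') for inbreeding coefficient d."""
    delta = 1.0 if k == k_prime else 0.0
    return delta * d * f[k] + (1.0 - d) * f[k] * f[k_prime]

# Hypothetical values: m = 3 haplotypes and one binary variable in C_dep (L = 1).
gamma = [[-2.0, 0.5], [-1.0, -0.3]]
f = haplotype_freqs(gamma, [1])
d = 0.05
total = sum(pair_freq(f, k, kp, d) for k in range(3) for kp in range(3))
print(f, total)  # total is 1 up to rounding error
```

Setting d=0 in `pair_freq` gives the HWE product fk·fk′, while d>0 raises the frequency of homozygous pairs.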

Modeling of P(Z|E, S, Y=1)

Next let us consider P(Z|E, S, Y=1)=bZ|E, S, the frequency of haplotype pair Z in the case population for a given value of E and S. We express bZ|E, S in terms of aZ|E, S and the odds of disease for a given Z, E and S, θZ, E, S(=P(Y=1|Z, E, S)/P(Y=0|Z, E, S)):

$$b_{Z|E,S} = \frac{a_{Z|E,S}\,\theta_{Z,E,S}}{\sum_{Z^{*}\in H} a_{Z^{*}|E,S}\,\theta_{Z^{*},E,S}}, \qquad (5)$$

where H is the set of all possible haplotype pairs and θZ, E, S is modeled using logistic regression. We consider two different ways of modeling θZ, E, S=exp(Xβ) with respect to the stratifying variables. They are included as covariates either just as main effects (LBLc-GXE) or with additional modeling of interaction effects of S with haplotypes (LBLc-GXE-GXS); ‘c’ in LBLc represents complex sampling. More specifically, X is (1, XS, XE, XZ, XZXE) in LBLc-GXE and (1, XS, XE, XZ, XZXS, XZXE) in LBLc-GXE-GXS. For each model, β is the vector comprising the corresponding regression coefficients. Here XZ=(x1, x2, …, xm−1), where xk is the number of copies of haplotype zk in haplotype pair Z with the mth haplotype assumed to be the baseline. XE and XS consist of the usual dummy variables corresponding to E and S, respectively. XZXE and XZXS are obtained by (scalar) multiplication of XZ and XE and XZ and XS, respectively.
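The relation between case and control haplotype-pair frequencies can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: each control frequency is reweighted by the disease odds exp(Xβ) and renormalized over all haplotype pairs, for one fixed level of E and S. The function name, the two-haplotype setup and the OR of 3 are hypothetical.

```python
import math

def case_pair_freqs(a, design, beta):
    """Case-population pair frequencies: b_Z = a_Z * theta_Z / sum over H of a_Z* * theta_Z*.

    a:      dict mapping haplotype pair -> control-population frequency
    design: dict mapping haplotype pair -> design vector X
    beta:   regression coefficients, same length as X
    """
    theta = {z: math.exp(sum(b * x for b, x in zip(beta, design[z]))) for z in a}
    norm = sum(a[z] * theta[z] for z in a)
    return {z: a[z] * theta[z] / norm for z in a}

# Hypothetical example: rare haplotype h1 (frequency 0.1), baseline h2, HWE in controls.
f1 = 0.1
a = {("h1", "h1"): f1 * f1, ("h1", "h2"): f1 * (1 - f1),
     ("h2", "h1"): (1 - f1) * f1, ("h2", "h2"): (1 - f1) ** 2}
design = {z: (1.0, float(z.count("h1"))) for z in a}  # X = (1, x_1) at a fixed E and S
beta = (math.log(0.1 / 0.9), math.log(3.0))           # intercept and log OR of h1
b = case_pair_freqs(a, design, beta)
print(b)
```

The intercept cancels in the normalization, so only the haplotype effect changes the pair frequencies: pairs carrying the risk haplotype are enriched among cases.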

Modeling of P(Ei|Yi=0, Si) and P(Ei|Yi=1, Si)

It remains to model P(Ei|Yi=0, Si) and P(Ei|Yi=1, Si) in Equation (1). Assuming a saturated model for P(E,S), P(E|Y,S) ∝ P(Y|E,S) without loss of information.26, 27 Then using the Bayes rule, we get the following:

and

Thus we can write the observed data retrospective likelihood in Equation (1) as:

where Ψ=(β, γ, d).

Priors, posterior distributions and inference on association

These follow closely from LBL-GXE4 as elucidated briefly in the following. Bayesian LASSO is used to regularize the regression coefficients βs by assigning each of them a double exponential prior centered at 0 with variance 2/λ². Such regularization helps in weeding out the unassociated effects, making it possible for the associated ones, especially those involving rHTVs, to stand out. The parameter λ controls the degree of penalty. It is assigned a Gamma(a,b) hyper-prior with parametrization such that its mean is a/b. When a=b=20, we obtain SD(β)=1.53, which corresponds to a realistic variability in odds ratios. For γ parameters, we use a double exponential prior with hyper-parameter ν set to be 0.5, which provides well-calibrated results as seen in our simulation study. For d, we note that it is dependent on f(Cdep) as aZ|E, S should be non-negative. Thus d ≥ max_k{−fk/(1−fk)}. As −1<d<1, we get max[−1, max_k{−fk/(1−fk)}] < d < 1. Therefore, we set the prior for d to be uniformly distributed in that range.
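The value SD(β)=1.53 can be checked with a short calculation. The sketch below assumes the double exponential density (λ/2)exp(−λ|β|), whose variance is 2/λ², and a Gamma hyper-prior with shape a and rate b (mean a/b). Under these assumptions the marginal variance of β is 2·E[1/λ²], with E[1/λ²] = b²/((a−1)(a−2)) for a>2. The variable names are illustrative.

```python
import math

a = b = 20.0  # Gamma(shape a, rate b) hyper-prior on lambda, mean a / b = 1

# E[1 / lambda^2] for lambda ~ Gamma(shape a, rate b), valid for a > 2.
e_inv_lambda_sq = b ** 2 / ((a - 1.0) * (a - 2.0))

# Marginal variance of beta: E[Var(beta | lambda)] = E[2 / lambda^2]; the prior mean is 0.
var_beta = 2.0 * e_inv_lambda_sq
sd_beta = math.sqrt(var_beta)
print(round(sd_beta, 2))  # 1.53
```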

The posterior distributions of all parameters in Ψ are estimated using Markov chain Monte Carlo (MCMC) methods. Finally, we test for significance of each β coefficient by computing its 95% credible set (CS) using MCMC samples from its posterior distribution. A 95% CS not covering 0 is considered evidence of significance. Alternatively, a Bayes factor (BF) >2 can also be used to declare significance.1 For the KCS data analysis, we report both the 95% CS and the BF.
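The CS-based decision rule can be illustrated with a minimal Python sketch. It assumes an equal-tailed credible set taken from sorted posterior draws, which is one common choice and not necessarily the one used in LBL. The function names and the toy draws are hypothetical.

```python
import math

def credible_set(samples, level=0.95):
    """Equal-tailed credible set from posterior draws (simple order-statistic version)."""
    s = sorted(samples)
    n = len(s)
    lower = s[int(math.floor((1.0 - level) / 2.0 * (n - 1)))]
    upper = s[int(math.ceil((1.0 + level) / 2.0 * (n - 1)))]
    return lower, upper

def is_significant(samples, level=0.95):
    """Declare significance when the credible set does not cover 0."""
    lower, upper = credible_set(samples, level)
    return lower > 0.0 or upper < 0.0

def detection_rate(replicates, level=0.95):
    """Fraction of replicates whose credible set excludes 0 (power or type I error rate)."""
    return sum(is_significant(r, level) for r in replicates) / len(replicates)

# Toy draws standing in for MCMC output of one coefficient.
away_from_zero = [0.01 * i for i in range(1, 1001)]     # all draws positive
around_zero = [-5.0 + 0.01 * i for i in range(1001)]    # draws spread over [-5, 5]
rate = detection_rate([away_from_zero, around_zero])
print(is_significant(away_from_zero), is_significant(around_zero), rate)
```

Applied to a truly associated coefficient, `detection_rate` estimates power; applied to a null coefficient, it estimates the type I error rate.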

Results

Simulation study

One stratifying variable

We carry out simulation studies to investigate the performance of LBL for complex sampling data. In this subsection, we consider one binary stratifying variable S (=0/1) with prevalence pS=P(S=1)=0.3. There is also a binary environmental covariate E (=0/1) with prevalence pE|S=0=P(E=1|S=0)=0.3 and pE|S=1=P(E=1|S=1)=0.7. There are three haplotype settings with 6, 9 and 12 haplotypes in a haplotype block as listed in Table 1. Each haplotype block is formed by five SNPs with alleles labeled as 0 or 1. There are two rHTVs, denoted as R1 and R2, in each block. Note that there is GS dependence as frequencies of haplotypes differ in the two strata. This, in turn, induces GE dependence as prevalence of E differs across strata.

Table 1 Simulation setup for one stratifying variable: OR under association scenarios 1–6 and frequencies of haplotypes and environmental covariate in each stratum

For creating association scenarios, we use various combinations of the following effects: R1, R2XS, R2XE, S, and E, as listed in Table 1. We also simulate a completely null model with all odds ratios (ORs) set to be 1 (scenario 6).

To mimic a complex sampling design for generating data, we first generate a population of cases and controls and then sample from it using matching based on the stratifying variable. For a specific combination of association scenario and haplotype setting, we generate a population of 10 000 subjects in the following manner. For each individual, first we simulate a stratifying variable value, say S using the pS value. Then we generate an environmental covariate value, E, using the pE|S value. Then we generate a phased haplotype pair, say Z, using the frequencies given in Table 1 and assuming HWE (d=0). Next, the individual is assigned to be a case or control using a logistic regression model: log(p/(1−p))=X β, where p is the probability that the individual is case, and X=(1, XS, XE, XZ, XZXS, XZXE). The intercept is calculated using a baseline prevalence of 0.1, that is, β0=log(0.1/0.9). For the other β coefficients, we use the corresponding ORs as listed in Table 1. We set the most frequent haplotype as the baseline in the regression model. After the case/control status is assigned, the phase information is removed and only genotypes are retained. Once a population of 10 000 subjects is generated in this manner, we obtain a sample from it as described next.
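The population-generation steps above can be sketched as follows. This is a simplified illustration rather than the simulation code used in the study. The haplotype names and frequencies are hypothetical placeholders for Table 1, the function and key names are assumptions, and the final step of discarding phase information is omitted.

```python
import math
import random

def simulate_population(n, p_s, p_e_given_s, hap_freqs, beta0, effects, seed=1):
    """Return a list of (S, E, Z, Y) tuples for n individuals.

    hap_freqs[s]: dict haplotype -> frequency in stratum s
    effects:      log odds ratios keyed by "S", "E", ("hap", h), ("hapxS", h),
                  ("hapxE", h); missing keys mean OR = 1. The baseline haplotype
                  simply has no entries.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        s = int(rng.random() < p_s)
        e = int(rng.random() < p_e_given_s[s])
        names = list(hap_freqs[s])
        weights = [hap_freqs[s][h] for h in names]
        z = tuple(rng.choices(names, weights=weights, k=2))  # HWE (d = 0)
        eta = beta0 + effects.get("S", 0.0) * s + effects.get("E", 0.0) * e
        for h in set(z):
            copies = z.count(h)
            eta += copies * (effects.get(("hap", h), 0.0)
                             + s * effects.get(("hapxS", h), 0.0)
                             + e * effects.get(("hapxE", h), 0.0))
        y = int(rng.random() < 1.0 / (1.0 + math.exp(-eta)))
        people.append((s, e, z, y))
    return people

# Hypothetical frequencies that differ by stratum, which induces GS dependence.
hap_freqs = {0: {"base": 0.90, "R1": 0.05, "R2": 0.05},
             1: {"base": 0.80, "R1": 0.10, "R2": 0.10}}
beta0 = math.log(0.1 / 0.9)  # baseline prevalence of 0.1
null_pop = simulate_population(10000, 0.3, {0: 0.3, 1: 0.7}, hap_freqs, beta0, {})
prevalence = sum(person[3] for person in null_pop) / len(null_pop)
print(prevalence)
```

A non-null scenario is obtained by passing, for example, `{("hap", "R1"): math.log(3), ("hapxE", "R2"): math.log(3)}` as `effects`.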

Suppose the numbers of cases and controls in the population of Stratum h (h=0, 1; h=0 corresponds to S=0 and h=1 corresponds to S=1) are $N_h^{case}$ and $N_h^{control}$. Correspondingly, let the numbers of cases and controls in the sample from Stratum h be $n_h^{case}$ and $n_h^{control}$. First, we select all the cases in the population to be included in the sample for each of the strata, that is, $n_h^{case}=N_h^{case}$, h=0, 1.19 For selecting controls, to mimic the KCS data, we use differential sampling rates in the two strata. In Stratum 0, the number of controls is set to be the same as the number of cases, that is, $n_0^{control}=n_0^{case}$, while in Stratum 1, we select a simple random sample of $2n_1^{case}$ controls, that is, $n_1^{control}=2n_1^{case}$. In most situations, out of a population of size 10 000, we get a sample of size of 2200–3800 with the number of cases varying between 1000–1500 (700–900 in Stratum 0 and 150–800 in Stratum 1) depending on the scenario.

Next we calculate the population weights for sampled cases and controls in each stratum and rescale them. The rescaling is such that the sum of weights for cases is the same as the sum of weights for controls, as in the analysis of the KCS data reported by Hofmann et al.14 Denote the rescaled weights of sampled cases and controls in stratum h by $w_h^{case}$ and $w_h^{control}$. As all cases are sampled, the weight for a case is 1, that is, $w_h^{case}=1$, h=0, 1. Thus the sum of weights for cases in the sample is the sample size of the cases ($n^{case}=n_0^{case}+n_1^{case}$). The population weight of a control in stratum h is $N_h^{control}/n_h^{control}$. For rescaling, we divide these population weights by their sum over all sampled controls, that is, $N_0^{control}+N_1^{control}$, and then multiply by the case sample size, that is, $n^{case}$. Thus $w_0^{control}=\frac{N_0^{control}}{n_0^{case}}\cdot\frac{n^{case}}{N_0^{control}+N_1^{control}}$ and $w_1^{control}=\frac{N_1^{control}}{2n_1^{case}}\cdot\frac{n^{case}}{N_0^{control}+N_1^{control}}$. Therefore, we can see that if we oversample the controls for one stratum, their weights will be reduced. This can be clearly seen from the above expressions of weights if the control-to-case ratio in the population is constant across different strata. Note that all persons in a stratum have the same weight, original as well as rescaled, and these are computed only once for a given sample.
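The weight calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the stratum counts are hypothetical, Stratum 1 controls are sampled at twice the number of cases to mirror the 2:1 matching in the KCS, and the function name and its `target_sum` parameter are not from the original.

```python
def rescaled_control_weights(pop_controls, sampled_controls, target_sum):
    """Per-stratum control weights, rescaled so that they sum to target_sum over the sample.

    pop_controls[h]:     number of controls in the population of stratum h
    sampled_controls[h]: number of controls sampled from stratum h
    """
    # Population weight: population members represented by each sampled control.
    pop_w = {h: pop_controls[h] / sampled_controls[h] for h in pop_controls}
    # Sum of population weights over all sampled controls.
    total = sum(pop_w[h] * sampled_controls[h] for h in pop_w)
    return {h: pop_w[h] * target_sum / total for h in pop_w}

# Hypothetical population of 10 000: all cases are sampled and have weight 1.
pop_cases = {0: 800, 1: 300}
pop_controls = {0: 6200, 1: 2700}
sampled_controls = {0: pop_cases[0], 1: 2 * pop_cases[1]}  # 1:1 and 2:1 matching
n_cases = sum(pop_cases.values())

w = rescaled_control_weights(pop_controls, sampled_controls, n_cases)
weight_sum = sum(w[h] * sampled_controls[h] for h in w)
print(w, weight_sum)  # control weights sum to the case sample size
```

Passing the control sample size as `target_sum` instead gives the other rescaling mentioned in the Introduction, in which the control weights sum to the number of sampled controls.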

We analyze each sample using LBLc-GXE and LBLc-GXE-GXS. For comparison, we also apply LBL-GXE from Zhang et al.,4 which models GE dependence but ignores the stratifying variables. (Note that, in Zhang et al.,4 this method was referred to as LBL-GXE-D; for the sake of simplicity, we refer to it here as LBL-GXE.) Additionally, we analyze the data using a variation of LBL-GXE, referred to as LBL-GXE-GXS, which includes the stratifying variables as covariates but does not use sampling weights, that is, ignores the complex sampling scheme. For each of these four methods, we use a total of 120 000 iterations with a burn-in period of 20 000 iterations to ensure satisfactory convergence.21 The total number of replications in each simulation is 500. For each β coefficient, we calculate the percentage of replicates (out of 500) in which its 95% CS does not cover 0 to study the power or type I error rate.

Figures 1–3 and Supplementary Figures S1–S3 show the powers and type I error rates for LBLc-GXE, LBLc-GXE-GXS, LBL-GXE-GXS and LBL-GXE for association scenarios 1–6 (null model), respectively. In scenario 1 (Figure 1), the performances of LBLc-GXE and LBLc-GXE-GXS are comparable, detecting the main haplotype effect and the interaction effect with E with similar powers and keeping the type I error rates under control, while LBL-GXE-GXS and LBL-GXE have inflated type I error rates. In scenario 2 (Figure 2), where an interaction effect with S is present in the data, LBLc-GXE-GXS continues to perform well, while the other three methods, including LBLc-GXE, have inflated type I error rates. In scenario 3 (Figure 3), where the main effect of S is included in the data, LBLc-GXE, LBLc-GXE-GXS and LBL-GXE-GXS control the type I error rates successfully while LBL-GXE leads to inflated type I error rates. However, we should note that the main effect of S detected by LBL-GXE-GXS here is not really an indication of its power because this method always detects the main effect of S as significant, irrespective of whether S has a true main effect or not, as seen in Figures 1 and 2 and Supplementary Figures S1–S3. In summary, LBLc-GXE controls type I error rates in situations where there is no interaction between haplotype and stratifying variable, while LBLc-GXE-GXS performs well in all scenarios. Supplementary Figure S3 for the null model (scenario 6) shows that LBL-GXE-GXS and LBL-GXE lead to seriously inflated type I error rates while LBLc-GXE and LBLc-GXE-GXS control these rates well.

Figure 1

Powers (in gray shadow) and type I error rates of LBLc-GXE, LBLc-GXE-GXS, LBL-GXE-GXS and LBL-GXE for scenario 1 (OR.R1=3, OR.R2XE=3 and all other ORs=1). Each plot has three panels for main effects (bottom row), interactions of the corresponding haplotypes with S (middle row) and interactions of the corresponding haplotypes with E (top row). 5% is marked by a gray horizontal dashed line. The haplotype frequencies are listed in Table 1. A full color version of this figure is available at the Journal of Human Genetics journal online.

Figure 2

Powers (in gray shadow) and type I error rates of LBLc-GXE, LBLc-GXE-GXS, LBL-GXE-GXS and LBL-GXE for scenario 2 (OR.R1=3, OR.R2XS=3, OR.R2XE=3 and all other ORs=1). Each plot has three panels for main effects (bottom row), interactions of the corresponding haplotypes with S (middle row) and interactions of the corresponding haplotypes with E (top row). 5% is marked by a gray horizontal dashed line. The haplotype frequencies are listed in Table 1. A full color version of this figure is available at the Journal of Human Genetics journal online.

Figure 3

Powers (in gray shadow) and type I error rates of LBLc-GXE, LBLc-GXE-GXS, LBL-GXE-GXS and LBL-GXE for scenario 3 (OR.R1=3, OR.S=3, OR.R2XE=3 and all other ORs=1). Each plot has three panels for main effects (bottom row), interactions of the corresponding haplotypes with S (middle row) and interactions of the corresponding haplotypes with E (top row). 5% is marked by a gray horizontal dashed line. The haplotype frequencies are listed in Table 1. A full color version of this figure is available at the Journal of Human Genetics journal online.

We also explore scenarios 2 and 6 with pS=0.15 and pE|S=0=pE|S=1=0.19 so that S and E mimic race and smoking. We use the fact that the prevalence of blacks in the United States is about 15% and the prevalence of smoking among whites or blacks in the United States is about 19%. Supplementary Figures S4 and S5 show the corresponding results. The methods perform similarly as before except that, with lower prevalences of S and E, LBLc-GXE and LBLc-GXE-GXS have reduced power, as expected.

For scenario 2 and setting 1, we also analyzed the data using a standard haplotype association method haplo.glm.28 Haplo.glm is based on the generalized linear model and uses maximum likelihood methods for inference. The results are reported in Figure 4 and Supplementary Table S1, which show that haplo.glm has inflated type I error rates.

Figure 4

Powers (in gray shadow) and type I error rates of LBLc-GXE, LBLc-GXE-GXS, LBL-GXE-GXS, LBL-GXE and haplo.glm (with and without S) for scenario 2 (OR.R1=3, OR.R2XS=3, OR.R2XE=3 and all other ORs=1). Each plot has three panels for main effects (bottom row), interactions of the corresponding haplotypes with S (middle row) and interactions of the corresponding haplotypes with E (top row). 5% is marked by a gray horizontal dashed line. The haplotype frequencies are listed in Table 1. A full color version of this figure is available at the Journal of Human Genetics journal online.

Additionally, we investigate a different rescaling of the weights such that the sum of the case (control) weights is equal to case (control) sample size. We compare the two types of rescaling by applying LBLc-GXE and LBLc-GXE-GXS to the data generated under setting 1 of scenario 2. The results of these two types of rescaling are comparable as shown in Supplementary Table S2. We also examine the methods for data generated under HWD by setting d=0.1 in the data simulation procedure for setting 1 of scenario 2. The relative performances of the methods are similar to what we found earlier under HWE. The detailed results are shown in Supplementary Figure S6.

Two stratifying variables

We next conduct simulation studies using two stratifying variables S1 (0/1) and S2 (0/1) to mimic race and sex. We set the prevalences pS1=P(S1=1)=0.15 and pS2=P(S2=1)=0.5. These two stratifying variables form four strata: Stratum 1 (S1=0, S2=0), Stratum 2 (S1=0, S2=1), Stratum 3 (S1=1, S2=0), and Stratum 4 (S1=1, S2=1). The binary environmental covariate E has prevalence pE|S2=0=P(E=1|S2=0)=0.15 and pE|S2=1=P(E=1|S2=1)=0.2, which mimics the prevalence of smoking among females and males of about 15% and 20%, respectively (http://kff.org/other/state-indicator/smoking-adults-by-gender/). We consider six haplotypes and two types of GS dependence: dependence on S1 only (GS1 dependence) or on both S1 and S2 (GS1S2 dependence), as listed in Table 2.

Table 2 Simulation setup for two stratifying variables: OR and haplotype frequencies under two types of GS dependence

The sample generation and weight calculation procedure is similar to that in the ‘One stratifying variable’ subsection. Specifically, we generate a population of size 10 000 and select all cases in the population. In Strata 1 and 2, we select a simple random sample of controls of the same size as the number of cases in the corresponding stratum. In Strata 3 and 4, we select a simple random sample of controls with size double of that of the cases in the corresponding stratum. The total sample sizes range from 2000 to 2500 with roughly 1000 cases (about 400 each in Strata 1 and 2 and 100 each in Strata 3 and 4).

Figure 5 shows the results for both GS1 dependence and GS1S2 dependence. The relative performances of the methods are comparable to what we observe in the case of one stratifying variable. That is, LBLc-GXE-GXS has type I error rates well controlled while the other three methods, including LBLc-GXE, have inflated type I error rates as the simulation model includes non-null effects of both GXE and GXS. The powers are lower under GS1S2 dependence compared with GS1 dependence as the former involves additional modeling.

Figure 5

Powers (in gray shadow) and type I error rates of LBLc-GXE, LBLc-GXE-GXS, LBL-GXE-GXS and LBL-GXE when there are two stratifying variables S1 and S2, where pS1 =0.15, pS2 =0.5, pE|S2=0=0.15, pE|S2=1=0.2, OR.R1=3, OR.R2XS1=5, OR.R2XE=4 and all other ORs=1. Each plot has four panels for main effects (bottom row), interactions of the corresponding haplotypes with S1 (second from bottom row), interactions of the corresponding haplotypes with S2 (third from bottom row) and interactions of the corresponding haplotypes with E (top row). 5% is marked by a gray horizontal dashed line. The haplotype frequencies are listed in Table 2. A full color version of this figure is available at the Journal of Human Genetics journal online.

Application to the KCS data

Following our motivation described in the Introduction section, we study the NAT2 gene and its interaction with smoking. Deitz et al.29 report that seven SNPs (rs1801279, rs1041983, rs1801280, rs1799929, rs1799930, rs1208 and rs1799931) explain 100% of the alleles detected in NAT2. Out of these seven SNPs, six are available in the KCS data. From them, a haplotype block consisting of the following five SNPs is detected by Haploview:30 rs1041983, rs1801280, rs1799929, rs1799930, and rs1208. We focus on analyzing this five-SNP haplotype block.

The KCS data include rescaled population weights; the rescaling is such that the sum of the weights for the cases is the same as the sum of the weights for the controls. We used these weights in our analyses to account for the complex sampling design. We consider smoking status as a covariate with three levels: never smoking, former smoking, and current smoking (consisting of occasional and regular current smokers). Further, we adjust for all four stratifying variables: site (Detroit, Chicago), age (<45, 45–54, 55–64, 65–74, ≥75 years), race (white, black), and sex following Li and Graubard19 and Hofmann et al.14 Note that, at each site (city), both cases and controls were recruited, and so using site as a stratifying variable along with race can address population stratification due to geographical location to some extent.

After removing subjects with missing genotype or smoking status, there are 909 cases and 936 controls in the KCS data. Table 3 shows some characteristics of these data. There is a higher proportion of current smokers among cases than in controls for both whites and blacks. More details about these data can be found in Hofmann et al.14 Haplotype frequencies as estimated using the hapassoc software31 based on maximum likelihood estimation are shown in Table 4. They vary substantially between the two races as well as cases and controls. These estimates are used as starting values of the frequency (γ) parameters in the MCMC procedures.

Table 3 Characteristics distributions of the KCS data according to several variables
Table 4 Haplotype frequencies in the KCS data as reported by hapassoc

In our analysis, we set haplotype TTCAA as the baseline as it has similar frequencies in the cases and controls among whites as well as blacks. In addition, we assume that GE dependence can be captured through the dependence of haplotypes on race, that is, Cdep={Race}. As there are several haplotypes that are extremely rare, we run LBL for a large number of iterations to ensure convergence and accurate results. In particular, to monitor convergence, we run three chains from three different starting points and make diagnostic plots and calculate the R2 statistics.21 We run each chain for 300 000 iterations, discard initial 100 000 as burn-in and combine the three chains to obtain the posterior distributions.

The results are reported in Table 5. Both LBLc-GXE and LBLc-GXE-GXS find an interaction effect of a rare haplotype CTCGG and current smoking to be highly significant with BF>100. LBLc-GXE also detects the main effects of CTCGG and current smoking to be significant while LBLc-GXE-GXS finds only the latter to be significant. Specifically, LBLc-GXE-GXS estimates the OR of the interaction to be 0.37 and the main effect of CTCGG to be null. Therefore, among current smokers, the carriers of CTCGG have reduced odds of kidney cancer compared with the carriers of the baseline haplotype TTCAA. The two methods also detect a few other effects with their 95% CS excluding 1; however, their corresponding BF values are small.

Table 5 Results of analysis of the KCS data

On the other hand, if the complex sampling design is ignored in the analysis (that is, LBL-GXE-GXS or LBL-GXE is used), we fail to detect the main effect of former or current smoking. Besides, LBL-GXE-GXS, which models main and interaction effects of stratification variables, even detects a protective effect of the black race, which contradicts the fact that blacks are at a higher risk of kidney cancer than whites.17 These contradictory results illustrate the importance of accounting for the complex sampling design in the analysis.

Discussion

Complex sampling schemes such as stratified sampling with frequency matching are now increasingly used in practice. At the same time, in the quest to dissect the etiology of common diseases, tremendous efforts are being directed toward detecting rare variants and their interactions with environmental covariates. Yet most of the current genetic association methods do not take the design of data collection into account, which can lead to biased results. Thus there is a pressing need for methods, especially for rare variants, that can properly account for complex sampling design. Here we adapted the LBL framework to analyze data originating from complex sampling schemes. As LBL is based on retrospective likelihood, it automatically conditions on the matched frequencies of cases and controls in each stratum once we condition on the stratifying variables. The differential sampling rates across strata are accounted for using the (rescaled) population weights.

When there is no interaction between stratifying variable and haplotype, we found that LBLc-GXE provides considerable power and controlled type I error rates. However, it has increased type I error rates when such an interaction is present. In such situations, the method that additionally models the interaction term, LBLc-GXE-GXS, performs well. On the other hand, the originally proposed LBL method has high type I error rates even when stratifying variables are included as covariates in the model. In addition to inference on association, which is our main focus, we also report in Supplementary Table S3 the bias, standard errors and mean squared errors of the point estimates of the regression coefficients whose true OR>1. For the null effects (OR=1), these values are smaller than the ones reported in the table and are thus omitted for brevity. As we can see from the table, these are all small for LBLc-GXE-GXS. The same is true for LBLc-GXE except for the bias and the mean squared errors of the R2XE effect when there are two stratifying variables and there is also an R2XS effect. In this case, LBLc-GXE is not the correct model and thus gives inflated type I errors, as already noted above.

To examine the methods under realistic linkage disequilibrium patterns and potential cryptic relatedness among subjects, we also carried out simulations based on the haplotypes and results from the KCS data analysis. We use the haplotype frequencies from Table 4 (separately for whites and blacks), with race as the stratifying variable (S) and smoking as a binary environmental covariate (E). To mimic the prevalence of blacks in the United States and of smoking in the two races, we set p_S=0.15 and p_{E|S=0}=p_{E|S=1}=0.19, as used in some of the earlier simulations. The data are generated in the same manner as described in the ‘Simulation study’ section. We consider two scenarios: (1) Null, with all ORs set to 1, and (2) Non-null, with OR=1.4 for E and OR=0.3 for the interaction of haplotype CTCGG with E, which are similar to those estimated in the KCS data analysis. The results, presented in Supplementary Figure S7, are consistent with our earlier simulation study results.
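The covariate part of this setup can be sketched as follows. This is only an illustration of drawing S and E with the stated probabilities (0.15 for S, and 0.19 for E in both strata); it does not generate haplotypes or disease status, and the function name, seed and sample size are hypothetical choices.

```python
import random


def simulate_s_and_e(n, p_s=0.15, p_e_given_s=(0.19, 0.19), seed=1):
    """Draw (S, E) pairs: S ~ Bernoulli(p_s), E | S ~ Bernoulli(p_e_given_s[S])."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        s = 1 if rng.random() < p_s else 0
        e = 1 if rng.random() < p_e_given_s[s] else 0
        subjects.append((s, e))
    return subjects


subjects = simulate_s_and_e(100_000)
freq_s = sum(s for s, _ in subjects) / len(subjects)
freq_e = sum(e for _, e in subjects) / len(subjects)
print(f"empirical P(S=1) = {freq_s:.3f}, empirical P(E=1) = {freq_e:.3f}")
```

Because the two conditional probabilities are equal here, E is generated independently of S; setting different values in `p_e_given_s` would induce dependence between the two.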

When applied to the KCS data, our method found current smokers to be at an increased risk for kidney cancer, consistent with the literature. Further, our finding of an interaction between smoking and the NAT2 gene has also been reported in the literature. However, this is the first time, to the best of our knowledge, that an interaction with a specific rHTV has been implicated. Moreover, we found that current smokers carrying the rHTV CTCGG have reduced odds of the disease compared with those carrying the baseline haplotype. Semenza et al.15 and Chow et al.17 state that kidney cancer risk is higher for NAT2 slow acetylators than for rapid acetylators among smokers. The haplotype CTCGG appears to be of a rapid acetylator type as per http://www.snpedia.com/index.php/NAT2, which might explain its protective effect for current smokers. However, the finding of this significant interaction effect appears to be novel and should be investigated in future studies. Moreover, the population stratification issue, in general, might need to be handled more carefully because genetic background can sometimes vary even within the same race and site.

As an alternative to LBLc-GXE, which models stratifying variables as covariates, we also explored including stratifying variables in the model by assigning to each stratum its own intercept,32 an approach denoted by LBLc-GXE(I). We compared LBLc-GXE and LBLc-GXE(I) for a few simulation settings with one binary stratifying variable and found that they perform similarly. This is expected as the two models are actually equivalent in this case. When there are two or more stratifying variables and their effects are not additive, LBLc-GXE(I) may perform better than LBLc-GXE; however, its power will suffer if the model is additive, given that it has a large number of intercept parameters.
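The equivalence for a single binary stratifying variable can be seen by looking only at the part of the linear predictor that involves S. The symbols below are illustrative notation introduced here and are not necessarily those used in the model specification.

```latex
\begin{align*}
\text{covariate form:} \quad & \alpha + \beta_S S, \qquad S \in \{0, 1\}, \\
\text{stratum-intercept form:} \quad & \alpha_0\,\mathbb{1}(S = 0) + \alpha_1\,\mathbb{1}(S = 1), \\
\text{with} \quad & \alpha_0 = \alpha, \qquad \alpha_1 = \alpha + \beta_S .
\end{align*}
```

The map between $(\alpha, \beta_S)$ and $(\alpha_0, \alpha_1)$ is one-to-one, so the two forms describe the same set of models. With two binary stratifying variables, the stratum-intercept form has four intercepts whereas the additive covariate form has only three parameters, so the two coincide only when the effects of the stratifying variables are additive.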

The LBL methods are computationally intensive and hence are better suited for zooming into genes/regions of interest implicated previously by fast, typically single-SNP-based and genome-wide, algorithms. LBLc-GXE-GXS is computationally slower than LBLc-GXE as it has more parameters. For example, when there is one stratifying variable, LBLc-GXE takes 915, 1379 and 1993 s to finish 120 000 iterations under settings 1–3 of scenario 2, respectively, while the corresponding times for LBLc-GXE-GXS are 1095, 1694 and 2435 s. These computing times are for a 3.60 GHz Xeon processor with 15.55 GB RAM under the Linux operating system.
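For a quick sense of the relative cost, the reported run times can be converted into slowdown ratios and per-iteration times. The sketch below uses only the numbers quoted above; the variable names are arbitrary.

```python
# Reported run times (seconds) for 120 000 iterations, settings 1-3 of scenario 2.
iterations = 120_000
times_gxe = [915, 1379, 1993]        # LBLc-GXE
times_gxe_gxs = [1095, 1694, 2435]   # LBLc-GXE-GXS

# Ratio of LBLc-GXE-GXS time to LBLc-GXE time for each setting.
ratios = [slow / fast for fast, slow in zip(times_gxe, times_gxe_gxs)]

# Average time per iteration in milliseconds for each method and setting.
ms_per_iter_gxe = [1000 * t / iterations for t in times_gxe]
ms_per_iter_gxe_gxs = [1000 * t / iterations for t in times_gxe_gxs]

for setting, (r, a, b) in enumerate(
        zip(ratios, ms_per_iter_gxe, ms_per_iter_gxe_gxs), start=1):
    print(f"setting {setting}: ratio = {r:.3f}, "
          f"LBLc-GXE = {a:.2f} ms/iter, LBLc-GXE-GXS = {b:.2f} ms/iter")
```

In these three settings, modeling the additional interaction term increases the run time by roughly 20 to 23%.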

To summarize, we have extended the original LBL method to incorporate complex sampling schemes, in particular, stratified random sampling. Its main advantage stems from the fact that none of the current haplotype association methods can handle both rare variants and complex sampling design in the model. Another complex sampling scheme that is gaining popularity is matching controls to cases individually rather than by frequency matching (typically referred to as matched case–control). Although we focus on the stratified sampling design for a more concise discussion, the model for an individually matched case–control design would be similar because the retrospective likelihood will take care of conditioning on individual-level matching, as it does for frequency matching. LBL has also been extended to handle longitudinal data33 and case–parent triad data.34 Thus LBL is now a comprehensive suite of rHTV methods, which can be used for various types of data. We plan to extend the methods to quantitative traits and extended family data as well as to other sampling designs, such as nested case–control and case–cohort, to further increase LBL’s capability.

Software

The methods have been implemented in an R package LBL available at:

http://www.utdallas.edu/~swati.biswas and

http://www.stat.osu.edu/~statgen/SOFTWARE/LBL