Logistic Bayesian LASSO for genetic association analysis of data from complex sampling designs

Zhang, Yuan; Hofmann, Jonathan N; Purdue, Mark P; Lin, Shili; Biswas, Swati

doi:10.1038/jhg.2017.43

Original Article
Published: 20 April 2017

Yuan Zhang¹,
Jonathan N Hofmann²,
Mark P Purdue²,
Shili Lin³ &
Swati Biswas¹

Journal of Human Genetics volume 62, pages 819–829 (2017)

Abstract

Detecting gene–environment interactions with rare variants is critical in dissecting the etiology of common diseases. Interactions with rare haplotype variants (rHTVs) are of particular interest. At the same time, complex sampling designs, such as stratified random sampling, are becoming increasingly popular for designing case–control studies, especially for recruiting controls. The US Kidney Cancer Study (KCS) is an example, wherein all available cases were included while the controls at each site were randomly selected from the population by frequency matching with cases based on age, sex and race. There is currently no rHTV association method that can account for such a complex sampling design. To fill this gap, we consider logistic Bayesian LASSO (LBL), an existing rHTV approach for case–control data, and show that its model can easily accommodate the complex sampling design. We study two extensions that include stratifying variables either as main effects only or with additional modeling of their interactions with haplotypes. We conduct extensive simulation studies to compare the complex sampling methods with the original LBL methods. We find that, when there is no interaction between haplotype and stratifying variables, both extensions perform well while the original LBL methods lead to inflated type I error rates. However, when such an interaction exists, it is necessary to include the interaction effect in the model to control the type I error rate. Finally, we analyze the KCS data and find a significant interaction between (current) smoking and a specific rHTV in the N-acetyltransferase 2 gene.

Introduction

Rare variants and gene–environment interactions (GXE) have been suggested in the literature as potential causes of ‘missing heritability’ in common diseases. We consider these problems by focusing on G being a rare haplotype variant (rHTV), which may reflect a combination of common single-nucleotide polymorphisms (SNPs). Thus rHTVs can be studied even in existing genome-wide association studies data without the need to sequence any additional data. Recently, we have proposed an approach for rHTV association for case–control data called logistic Bayesian LASSO (LBL).¹ We have extended it to handle GXE under the assumption of G–E independence as well as when this assumption is relaxed or there is an uncertainty about it.^{2, 3, 4} LBL shrinks the effects of unassociated haplotypes or their interactions with environmental covariates toward zero, so that the associated effects can be identified with considerable power.^{5, 6, 7} In fact, LBL is one of the most powerful rHTV methods.⁸

Complex sampling designs are being utilized with increasing frequency in case–control studies, especially for sampling of the controls. Typically, all available cases are included while controls are selected by stratified sampling using frequency matching with cases. Strata are usually formed based on known risk factors, such as race, age and sex. Often one or more strata, especially those containing minorities, are oversampled to obtain more controls. To account for different sampling rates arising from unequal sampling among strata, population weights are calculated, which indicate the number of population members represented by each sample subject. It is important to use these weights in the analysis to avoid bias in the results. However, the use of weights also reduces the power and efficiency of case–control studies because population weights for controls are usually much larger than those for cases, leading to large variability in weights.⁹ To regain some of the lost efficiency, rescaling of population weights has been suggested.¹⁰ For example, one way of rescaling is to make the sum of the case (control) weights equal to the case (control) sample size. Another is to make the sum of the control weights equal to the sum of the case weights.

The US Kidney Cancer Study (KCS) was designed using a complex sampling scheme through stratified random sampling for recruiting subjects.^{11, 12} It was conducted at two sites—Chicago and Detroit. Cases identified from the Metropolitan Detroit Cancer Surveillance System and Cook County hospitals were recruited. At each site, the controls were frequency matched to cases based on age, sex and race. The matching rate of controls to cases was 2:1 in blacks and 1:1 in whites. Age groups were formed at 5-year intervals starting from 20 to 79 years. For age groups ⩾65 years, controls were chosen from the database of Medicare beneficiaries, which has information on age, sex and race. For age groups <65 years, controls were chosen from a listing of the Department of Motor Vehicles, which contains information on age and sex but not on race. As a proxy for race, strata of low and high black densities were formed based on Census data. Thus the overall strata were formed by cross-classification of age, sex and race (or black density). In addition to these stratifying variables, KCS collected covariates such as smoking status, high blood pressure, education level and body mass index. As described in Colt et al.,¹² to account for features related to the complex sampling design (differential sampling rates for controls and cases, survey nonresponse and deficiencies in coverage of the population at risk in the Department of Motor Vehicles and Medicare files), population weights were formed for each sampled individual.

Several authors have analyzed the KCS data and reported risk factors for kidney cancer such as smoking, obesity and hypertension.^{12, 13, 14} In addition, genetic susceptibility and its interaction with environmental factors have been reported to affect the risk in the KCS and other studies.^{15, 16, 17, 18, 19} In particular, the N-acetyltransferase 2 (NAT2) gene is known to code for an enzyme involved in the tobacco-carcinogen mechanism. Semenza et al.¹⁵ found that the smoking-related risk of kidney cancer is higher among carriers of a polymorphic variant of NAT2 called the slow acetylator genotype than among rapid acetylators. Longuemaux et al.²⁰ observed a higher risk of kidney cancer for NAT2 slow acetylators who also carried CYP1A1 variants; however, they did not study GXE.

To the best of our knowledge, there is currently no rHTV association method that can account for complex sampling design such as that adopted in the KCS. To fill this gap, we adapt the LBL model to analyze this type of data. We show that stratified sampling with frequency matching can be easily accounted for in the framework of LBL without any additional modeling. We conduct simulation studies to investigate the properties of the extensions and compare with the original LBL method. Finally, we also analyze the KCS data to study the NAT2–smoking interaction.

Materials and Methods

The method mostly follows from Zhang et al.⁴ with necessary adaptation to include stratifying variables and population weights. Suppose we have a case–control sample consisting of n₁ cases and n₂ controls with n₁+n₂=n. Let Y_i=1/0 denote the case/control status of the ith individual, i=1,…,n and Y=(Y₁,…,Y_n). Let G_i denote the observed genotype of the ith individual and G=(G₁,…,G_n). We then let Z(G_i) be the set of haplotype pairs compatible with G_i, as the haplotype pair of a person may not be completely determined from the observed genotypes. Further, we denote the rth haplotype pair in Z(G_i) by Z_ir. Next we denote the vector of environmental covariates of the ith individual by E_i. For a complex sampling design, the stratifying variables have a key role, and they are denoted collectively as S_i for individual i. In this paper, we consider both E and S to be categorical.

Complex sampling design structure and analysis

For the type of complex sampling considered in this paper, the sampling mechanism leads to known (rescaled) population weights, w_i, for the ith individual. In simple terms, w_i is the number of individuals in the population that the ith sampled person represents. It is essentially the ratio of the number of individuals available to be sampled (population size) to the number of individuals actually sampled (sample size) in the stratum to which the ith individual belongs. In surveys, non-response and poststratification adjustments are further made to these weights, and they are made available along with the rest of the sample data.⁹ The weights are typically rescaled to increase efficiency as mentioned in the Introduction section. Further details on calculation of weights will be provided in the ‘Simulation study’ section.

The basic principle that we follow for incorporating complex sampling design in the Bayesian framework is to write the analysis model conditional on the information and variables that describe the data collection process.²¹ That is, for writing the likelihood, we condition on the fact that the frequencies of cases and controls were matched (in some way that will become apparent below) in each stratum and on the values of the variables used for matching (in this case, the stratifying variables).

Retrospective likelihood

Conditional on {w_i}, i=1,…, n and S, the retrospective likelihood of the observed data is written as:

$$L(\Psi)=\prod_{i=1}^{n}\left[P(G_i, E_i \mid Y_i, S_i)\right]^{w_i}=\prod_{i=1}^{n}\left[\sum_{r} P(Z_{ir}\mid E_i, S_i, Y_i)\,P(E_i\mid Y_i, S_i)\right]^{w_i},\qquad (1)$$

where the sum over r runs over the haplotype pairs Z_ir compatible with G_i, and Ψ consists of the regression coefficients and the parameters associated with the haplotype pair frequencies, which will be specified more explicitly later. Note that conditioning on the case/control status (Y), the stratifying variable information and the weight for each person automatically takes care of matching frequencies of cases and controls in all strata in the retrospective (in contrast to prospective) likelihood formulation. Now we will specify the model for each component of the likelihood. In the following, we suppress the subscripts i and r for simplicity without causing ambiguity.

Modeling of P(Z|E, S, Y=0)

We start with modeling P(Z|E,S,Y=0)=a_{Z|E, S}, the frequency of haplotype pair Z in the control population for a given E and S. Suppose there are a total of m haplotypes and assume gene–environment (G–E) dependence is only due to some of the stratifying variables and/or covariates, defined as C_dep, a subset of {E,S}. That is, conditional on C_dep, G and E are independent.^{22, 23} Then we denote the haplotype frequencies in the control population by f(C_dep)=(f₁(C_dep),…, f_m(C_dep)). We model a_{Z|E, S} for a haplotype pair Z=(z_k, z_k′) as follows:

$$a_{Z|E,S}=\delta_{kk'}\,d\,f_k(C_{dep})+(1-d)\,f_k(C_{dep})\,f_{k'}(C_{dep}),\qquad (2)$$

where δ_kk′=1(0) if z_k=z_k′(z_k≠z_k′), f_k and f_k′ are the frequencies of z_k and z_k′, and d∈(−1,1) is the within-population inbreeding coefficient that captures excess/reduction of homozygosity.²⁴ For d=0, the above expression is equivalent to assuming Hardy–Weinberg Equilibrium (HWE), while other values of d allow Hardy–Weinberg Disequilibrium (HWD).
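As a minimal sketch of the haplotype pair model above, the function below evaluates the pair frequency under the assumption that it takes the usual inbreeding-coefficient form δ_kk′ d f_k + (1−d) f_k f_k′. The function name and the three haplotype frequencies are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def pair_frequency(k, k_prime, f, d):
    """Frequency of the haplotype pair (z_k, z_k') given frequencies f and coefficient d.

    f is a list of haplotype frequencies that sums to 1; k and k_prime are indices into f.
    Assumed form: delta * d * f_k + (1 - d) * f_k * f_k'.
    """
    delta = 1.0 if k == k_prime else 0.0
    return delta * d * f[k] + (1.0 - d) * f[k] * f[k_prime]


# Hypothetical frequencies for three haplotypes (illustration only).
f = [0.70, 0.25, 0.05]
d = 0.1

# Summing over all ordered pairs (k, k') gives d + (1 - d) = 1 for any valid d.
total = sum(pair_frequency(k, kp, f, d) for k in range(3) for kp in range(3))

# With d = 0 the homozygote frequency reduces to the HWE value f_k squared.
hwe_homozygote = pair_frequency(0, 0, f, 0.0)
```

A positive d shifts probability toward homozygous pairs, and a negative d shifts it away from them, which is how the model departs from HWE.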

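We then model f(C_dep) using a multinomial logistic regression model to allow G–E dependence.²⁵ Let the mth haplotype be the baseline and assume C_dep has L levels excluding baseline(s): C_dep={C₁,C₂,…,C_L}. For example, if C_dep consists of two binary variables, then L=2 with exclusion of the baseline category of each variable. Then we have

$$\log\frac{f_k(C_{dep})}{f_m(C_{dep})}=\gamma_{k0}+\sum_{l=1}^{L}\gamma_{kl}C_l,\qquad k=1,\dots,m-1.\qquad (3)$$

Thus

$$f_k(C_{dep})=\frac{\exp\left(\gamma_{k0}+\sum_{l=1}^{L}\gamma_{kl}C_l\right)}{1+\sum_{j=1}^{m-1}\exp\left(\gamma_{j0}+\sum_{l=1}^{L}\gamma_{jl}C_l\right)},\quad k=1,\dots,m-1,\qquad f_m(C_{dep})=\frac{1}{1+\sum_{j=1}^{m-1}\exp\left(\gamma_{j0}+\sum_{l=1}^{L}\gamma_{jl}C_l\right)}.\qquad (4)$$

Let γ denote an (m−1) × (L+1) matrix with the (k, l)th element being γ_kl, k=1,...,m−1 and l=0,...,L. Combining Equations (2) and (4), we have now fully specified a_{Z|E, S}(γ, d).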
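The mapping from γ to haplotype frequencies can be sketched as follows, assuming the standard baseline-category multinomial logistic form with the mth haplotype as baseline. The function name and the γ values are hypothetical and serve only to show the shape of the computation.

```python
import math


def haplotype_frequencies(gamma, c):
    """Haplotype frequencies f_1..f_m for one configuration of C_dep.

    gamma: list of m-1 rows, each [gamma_k0, gamma_k1, ..., gamma_kL].
    c: list of L dummy variables coding the levels of C_dep.
    The last returned entry is the baseline haplotype m.
    """
    scores = [row[0] + sum(g * x for g, x in zip(row[1:], c)) for row in gamma]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    freqs = [math.exp(s) / denom for s in scores]
    freqs.append(1.0 / denom)  # baseline haplotype
    return freqs


# Hypothetical example: m = 3 haplotypes, C_dep is one binary variable (L = 1).
gamma = [[0.0, 0.5], [-1.0, -0.5]]
freqs_level0 = haplotype_frequencies(gamma, [0])
freqs_level1 = haplotype_frequencies(gamma, [1])
```

Because the frequencies are recomputed for each level of C_dep, the same γ matrix yields different haplotype distributions in different strata, which is what allows G–E dependence through C_dep.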

Modeling of P(Z|E, S, Y=1)

Next let us consider P(Z|E, S, Y=1)=b_{Z|E, S}, the frequency of haplotype pair Z in the case population for a given value of E and S. We express b_{Z|E, S} in terms of a_{Z|E, S} and the odds of disease for a given Z, E and S, θ_{Z, E, S}(=P(Y=1|Z, E, S)/P(Y=0|Z, E, S)):

$$b_{Z|E,S}=\frac{a_{Z|E,S}\,\theta_{Z,E,S}}{\sum_{Z'\in H}a_{Z'|E,S}\,\theta_{Z',E,S}},\qquad (5)$$

where H is the set of all possible haplotype pairs and θ_{Z, E, S} is modeled using logistic regression. We consider two different ways of modeling θ_{Z, E, S}=exp(Xβ) with respect to the stratifying variables. They are included as covariates either just as main effects (LBLc-GXE) or with additional modeling of interaction effects of S with haplotypes (LBLc-GXE-GXS); ‘c’ in LBLc represents complex sampling. More specifically, X is (1, X_S, X_E, X_Z, X_ZX_E) in LBLc-GXE and (1, X_S, X_E, X_Z, X_ZX_S, X_ZX_E) in LBLc-GXE-GXS. For each model, β is the vector comprising the corresponding regression coefficients. Here X_Z=(x₁, x₂, …, x_{m−1}), where x_k is the number of copies of haplotype z_k in haplotype pair Z, with the mth haplotype assumed to be the baseline. X_E and X_S consist of the usual dummy variables corresponding to E and S, respectively. X_ZX_E and X_ZX_S consist of the products of the elements of X_Z with those of X_E and of X_S, respectively.
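The relation between case and control haplotype pair frequencies can be illustrated with a short sketch. It assumes that the case frequency is the control frequency reweighted by the odds θ=exp(Xβ) and renormalized over H, which follows from Bayes' rule and the definition of θ. The design rows, frequencies and coefficients below are hypothetical.

```python
import math


def case_pair_frequencies(a, x_rows, beta):
    """Case-population pair frequencies from control frequencies and odds.

    a: control-population frequency of each haplotype pair in H.
    x_rows: design row X for each pair (same order as a), for a fixed E and S.
    beta: regression coefficients matching the columns of X.
    """
    theta = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in x_rows]
    weighted = [a_z * t for a_z, t in zip(a, theta)]
    norm = sum(weighted)
    return [w / norm for w in weighted]


# Hypothetical setting: one non-baseline haplotype, E = 1, no stratifying variable.
# Pairs carry 0, 1 or 2 copies of the non-baseline haplotype.
a = [0.81, 0.18, 0.01]
# Columns of X: intercept, X_E, X_Z, X_Z * X_E.
x_rows = [[1, 1, 0, 0], [1, 1, 1, 1], [1, 1, 2, 2]]
beta = [math.log(0.1 / 0.9), 0.0, math.log(2.0), 0.0]  # haplotype OR of 2

b = case_pair_frequencies(a, x_rows, beta)
```

The intercept and any term that does not involve the haplotype cancel in the normalization, so only haplotype main effects and haplotype interactions change b relative to a.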

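Modeling of P(E_i|Y_i=0, S_i) and P(E_i|Y_i=1, S_i)

It remains to model P(E_i|Y_i=0, S_i) and P(E_i|Y_i=1, S_i) in Equation (1). Assuming a saturated model for P(E,S), P(E|Y,S)∝P(Y|E,S) without loss of information.^{26, 27} Then using the Bayes rule, we get the following:

$$P(E\mid Y=0,S)\propto P(Y=0\mid E,S)=\frac{1}{1+\sum_{Z\in H}\theta_{Z,E,S}\,a_{Z|E,S}}\qquad (6)$$

and

$$P(E\mid Y=1,S)\propto P(Y=1\mid E,S)=\frac{\sum_{Z\in H}\theta_{Z,E,S}\,a_{Z|E,S}}{1+\sum_{Z\in H}\theta_{Z,E,S}\,a_{Z|E,S}}.\qquad (7)$$

Thus we can write the observed data retrospective likelihood in Equation (1) as:

$$L(\Psi)\propto\prod_{i:\,Y_i=1}\left[\frac{\sum_{r}a_{Z_{ir}|E_i,S_i}\,\theta_{Z_{ir},E_i,S_i}}{1+\sum_{Z\in H}\theta_{Z,E_i,S_i}\,a_{Z|E_i,S_i}}\right]^{w_i}\prod_{i:\,Y_i=0}\left[\frac{\sum_{r}a_{Z_{ir}|E_i,S_i}}{1+\sum_{Z\in H}\theta_{Z,E_i,S_i}\,a_{Z|E_i,S_i}}\right]^{w_i},\qquad (8)$$

where the sums over r run over the haplotype pairs compatible with G_i and Ψ=(β, γ, d).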
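To make the structure of the weighted likelihood concrete, here is a hedged sketch of one subject's log-likelihood contribution. It assumes that P(Y=0|E,S)=1/(1+Σ_H θa), which follows from the definitions of a and θ, and that the population weight enters as an exponent on the subject's contribution. This is an illustration of the model as described in the text, not the authors' implementation, and all names are hypothetical.

```python
import math


def weighted_loglik_term(y, w, compatible, a, theta):
    """Weighted log-likelihood contribution of one subject, up to a constant.

    y: 1 for a case, 0 for a control.
    w: (rescaled) population weight of the subject.
    compatible: indices of the haplotype pairs in H consistent with the genotype.
    a: control-population pair frequencies over H for the subject's (E, S).
    theta: disease odds exp(X beta) for each pair in H for the subject's (E, S).
    """
    denom = 1.0 + sum(a_z * t for a_z, t in zip(a, theta))
    if y == 1:
        numer = sum(a[r] * theta[r] for r in compatible)
    else:
        numer = sum(a[r] for r in compatible)
    return w * (math.log(numer) - math.log(denom))


# Hypothetical check: three pairs, all compatible, odds of 1 for every pair.
a = [0.5, 0.3, 0.2]
theta = [1.0, 1.0, 1.0]
case_term = weighted_loglik_term(1, 2.0, [0, 1, 2], a, theta)
control_term = weighted_loglik_term(0, 1.0, [0, 1, 2], a, theta)
```

Summing such terms over all subjects gives the log of the weighted retrospective likelihood; a weight below 1 for oversampled controls reduces the influence of each such control on the fit.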

Priors, posterior distributions and inference on association

These follow closely from LBL-GXE⁴ as elucidated briefly in the following. Bayesian LASSO is used to regularize the regression coefficients β by assigning each of them a double exponential prior centered at 0 with variance 2/λ². Such regularization helps in weeding out the unassociated effects, making it possible for the associated ones, especially those involving rHTVs, to stand out. The parameter λ controls the degree of penalty. It is assigned a Gamma(a,b) hyper-prior with parametrization such that its mean is a/b. When a=b=20, we obtain SD(β)=1.53, which corresponds to a realistic variability in odds ratios. For the γ parameters, we use a double exponential prior with hyper-parameter ν set to 0.5, which provides well-calibrated results as seen in our simulation study. For d, we note that it depends on f(C_dep), as a_{Z|E, S} should be non-negative. Thus $d\ge\max_{k}\{-f_k(C_{dep})/(1-f_k(C_{dep}))\}$. As −1<d<1, we get $\max\{-1,\ \max_{k}[-f_k(C_{dep})/(1-f_k(C_{dep}))]\}<d<1$. Therefore, we set the prior for d to be uniformly distributed in that range.
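The stated value SD(β)=1.53 for a=b=20 can be checked with a few lines of arithmetic. The sketch assumes the double exponential prior has density (λ/2)exp(−λ|β|), so that Var(β|λ)=2/λ², and that Gamma(a,b) is parametrized by shape a and rate b; under these assumptions the marginal variance of β is 2b²/((a−1)(a−2)).

```python
import math


def marginal_sd_beta(a, b):
    """Marginal prior SD of beta under a double exponential prior with Gamma hyper-prior.

    Var(beta | lam) = 2 / lam**2, and for lam ~ Gamma(shape a, rate b) with a > 2,
    E[1 / lam**2] = b**2 / ((a - 1) * (a - 2)). Since E[beta | lam] = 0,
    Var(beta) = 2 * E[1 / lam**2].
    """
    return math.sqrt(2.0 * b * b / ((a - 1.0) * (a - 2.0)))


sd = marginal_sd_beta(20, 20)  # rounds to 1.53
```

An SD of about 1.5 on the log-odds scale means that odds ratios between roughly exp(−1.5) and exp(1.5) are within one prior SD of the null.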

The posterior distributions of all parameters in Ψ are estimated using Markov chain Monte Carlo (MCMC) methods. Finally, we test for significance of each β coefficient by computing its 95% credible set (CS) using MCMC samples from its posterior distribution. A 95% CS not covering 0 is considered evidence for significance. Alternatively, a Bayes factor (BF) >2 can also be used to declare significance.¹ For the KCS data analysis, we report both the 95% CS and the BF.
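The credible set rule can be sketched as below. The text does not say how the 95% CS is constructed, so this example assumes an equal-tailed interval taken from empirical quantiles of the posterior draws; a highest posterior density interval would be an equally plausible choice. Function names are hypothetical.

```python
import math


def credible_set(samples, level=0.95):
    """Equal-tailed credible set from posterior draws (assumed construction)."""
    xs = sorted(samples)
    n = len(xs)
    tail = (1.0 - level) / 2.0
    lower = xs[int(math.floor(tail * (n - 1)))]
    upper = xs[int(math.ceil((1.0 - tail) * (n - 1)))]
    return lower, upper


def is_significant(samples, level=0.95):
    """Declare a coefficient significant when its credible set excludes 0."""
    lower, upper = credible_set(samples, level)
    return lower > 0.0 or upper < 0.0


# Hypothetical draws: one posterior well away from 0, one centered at 0.
away_from_zero = [0.5 + i / 1000.0 for i in range(1001)]
centered = [-0.5 + i / 1000.0 for i in range(1001)]
```

In a simulation study, applying `is_significant` to each replicate and averaging the results gives the empirical power for non-null coefficients and the type I error rate for null ones.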

Results

Simulation study

One stratifying variable

We carry out simulation studies to investigate the performance of LBL for complex sampling data. In this subsection, we consider one binary stratifying variable S (=0/1) with prevalence p_S=P(S=1)=0.3. There is also a binary environmental covariate E (=0/1) with prevalences p_{E|S=0}=P(E=1|S=0)=0.3 and p_{E|S=1}=P(E=1|S=1)=0.7. There are three haplotype settings with 6, 9 and 12 haplotypes in a haplotype block as listed in Table 1. Each haplotype block is formed by five SNPs with alleles labeled as 0 or 1. There are two rHTVs, denoted as R1 and R2, in each block. Note that there is G–S dependence as frequencies of haplotypes differ in the two strata. This, in turn, induces G–E dependence as the prevalence of E differs across strata.

Table 1 Simulation setup for one stratifying variable: OR under association scenarios 1–6 and frequencies of haplotypes and environmental covariate in each stratum

For creating association scenarios, we use various combinations of the following effects: R1, R2XS, R2XE, S, and E, as listed in Table 1. We also simulate a completely null model with all odds ratios (ORs) set to be 1 (scenario 6).

To mimic a complex sampling design for generating data, we first generate a population of cases and controls and then sample from it using matching based on the stratifying variable. For a specific combination of association scenario and haplotype setting, we generate a population of 10 000 subjects in the following manner. For each individual, we first simulate a stratifying variable value, S, using the p_S value. Then we generate an environmental covariate value, E, using the p_{E|S} value. Then we generate a phased haplotype pair, Z, using the frequencies given in Table 1 and assuming HWE (d=0). Next, the individual is assigned to be a case or control using a logistic regression model: log(p/(1−p))=Xβ, where p is the probability that the individual is a case, and X=(1, X_S, X_E, X_Z, X_ZX_S, X_ZX_E). The intercept is calculated using a baseline prevalence of 0.1, that is, β₀=log(0.1/0.9). For the other β coefficients, we use the corresponding ORs as listed in Table 1. We set the most frequent haplotype as the baseline in the regression model. After the case/control status is assigned, the phase information is removed and only genotypes are retained. Once a population of 10 000 subjects is generated in this manner, we obtain a sample from it as described next.
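The population-generation steps above can be sketched in a few lines. The haplotype frequencies and ORs below are hypothetical stand-ins (the values of Table 1 are not reproduced here), only one risk haplotype with a main effect and an interaction with E is modeled, and the conversion of haplotype pairs to unphased SNP genotypes is omitted.

```python
import math
import random


def simulate_population(n, p_s, p_e_given_s, hap_freqs, beta, seed=0):
    """Return a list of (s, e, haplotype_pair, y) tuples for n subjects."""
    rng = random.Random(seed)
    hap_ids = list(range(len(hap_freqs[0])))
    people = []
    for _ in range(n):
        s = int(rng.random() < p_s)                # stratifying variable
        e = int(rng.random() < p_e_given_s[s])     # covariate, depends on S
        # HWE (d = 0): two haplotypes drawn independently, stratum-specific frequencies.
        pair = rng.choices(hap_ids, weights=hap_freqs[s], k=2)
        x_r = pair.count(1)                        # copies of the risk haplotype (index 1)
        eta = (beta["intercept"] + beta["S"] * s + beta["E"] * e
               + beta["R"] * x_r + beta["RxE"] * x_r * e)
        p_case = 1.0 / (1.0 + math.exp(-eta))
        y = int(rng.random() < p_case)
        people.append((s, e, tuple(sorted(pair)), y))
    return people


# Hypothetical inputs; haplotype 0 is the most frequent and acts as baseline.
hap_freqs = {0: [0.60, 0.02, 0.38], 1: [0.50, 0.05, 0.45]}
beta = {"intercept": math.log(0.1 / 0.9), "S": 0.0, "E": 0.0,
        "R": math.log(2.0), "RxE": math.log(2.0)}
population = simulate_population(10_000, 0.3, {0: 0.3, 1: 0.7}, hap_freqs, beta)
```

Because the haplotype frequencies differ by stratum and the prevalence of E differs by stratum, the generated data carry both G–S and induced G–E dependence, as described in the text.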

Suppose the numbers of cases and controls in the population of Stratum h (h=0, 1; h=0 corresponds to S=0 and h=1 corresponds to S=1) are N_h^{case} and N_h^{ctrl}. Correspondingly, let the numbers of cases and controls in the sample of Stratum h be n_h^{case} and n_h^{ctrl}. First, we select all the cases in the population to be included in the sample for each of the strata, that is, n_h^{case}=N_h^{case}, h=0, 1.¹⁹ For selecting controls, to mimic the KCS data, we use differential sampling rates in the two strata. In Stratum 0, the number of controls is set to be the same as the number of cases, that is, n_0^{ctrl}=n_0^{case}. In Stratum 1, we select a simple random sample of 2n_1^{case} controls, that is, n_1^{ctrl}=2n_1^{case}. In most situations, out of a population of size 10 000, we get a sample of size 2200–3800 with the number of cases varying between 1000 and 1500 (700–900 in Stratum 0 and 150–800 in Stratum 1) depending on the scenario.

Next we calculate the population weights for sampled cases and controls in each stratum and rescale them. The rescaling is such that the sum of weights for cases is the same as the sum of weights for controls, as in the analysis of the KCS data reported by Hofmann et al.¹⁴ Denote the rescaled weights of sampled cases and controls in stratum h by w_h^{case} and w_h^{ctrl}. As all cases are sampled, the weight for a case is 1, that is, w_h^{case}=1, h=0, 1. Thus the sum of weights for cases in the sample is the sample size of the cases (n_0^{case}+n_1^{case}). The population weight of a control in stratum h is N_h^{ctrl}/n_h^{ctrl}. For rescaling, we divide these population weights by their sum over all sampled controls, that is, N_0^{ctrl}+N_1^{ctrl}, and then multiply by the case sample size, that is, n_0^{case}+n_1^{case}. Thus

$$w_0^{ctrl}=\frac{N_0^{ctrl}}{n_0^{ctrl}}\cdot\frac{n_0^{case}+n_1^{case}}{N_0^{ctrl}+N_1^{ctrl}}\quad\text{and}\quad w_1^{ctrl}=\frac{N_1^{ctrl}}{n_1^{ctrl}}\cdot\frac{n_0^{case}+n_1^{case}}{N_0^{ctrl}+N_1^{ctrl}}.$$

Therefore, if we oversample the controls in one stratum, their weights will be reduced. This can be clearly seen from the above expressions if the control-to-case ratio in the population is constant across strata. Note that all persons in a stratum have the same weight, original as well as rescaled, and these are computed only once for a given sample.
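A small numeric sketch of the weight calculation follows. It assumes that the population weight of a control is the stratum's population control count divided by its sampled control count, and that rescaling makes the control weights sum to the number of sampled cases. The function name and the stratum counts are hypothetical.

```python
def rescaled_control_weights(pop_controls, sample_controls, sample_cases):
    """Rescaled weight of a sampled control in each stratum.

    pop_controls, sample_controls, sample_cases: dicts keyed by stratum.
    After rescaling, the control weights sum to the total number of sampled cases.
    """
    pop_w = {h: pop_controls[h] / sample_controls[h] for h in pop_controls}
    # Sum of population weights over all sampled controls (equals the population control count).
    total = sum(pop_w[h] * sample_controls[h] for h in pop_w)
    n_cases = sum(sample_cases.values())
    return {h: pop_w[h] / total * n_cases for h in pop_w}


# Hypothetical counts: 1:1 control sampling in stratum 0, 2:1 in stratum 1.
pop_controls = {0: 6300, 1: 2700}
sample_cases = {0: 700, 1: 300}
sample_controls = {0: 700, 1: 600}

weights = rescaled_control_weights(pop_controls, sample_controls, sample_cases)
sum_control_weights = sum(weights[h] * sample_controls[h] for h in weights)
```

In this example the population control-to-case ratio is 9 in both strata, so doubling the control sampling rate in stratum 1 halves the weight of each of its controls relative to stratum 0.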

We analyze each sample using LBLc-GXE and LBLc-GXE-GXS. For comparison, we also apply LBL-GXE from Zhang et al.,⁴ which models G–E dependence but ignores the stratifying variables. (Note that, in Zhang et al.,⁴ this method was referred to as LBL-GXE-D; for simplicity, we refer to it here as LBL-GXE.) Additionally, we analyze the data using a variation of LBL-GXE, referred to as LBL-GXE-GXS, which includes the stratifying variables as covariates but does not use sampling weights, that is, ignores the complex sampling scheme. For each of these four methods, we use a total of 120 000 iterations with a burn-in period of 20 000 iterations to ensure satisfactory convergence.²¹ The total number of replications in each simulation is 500. For each β coefficient, we calculate the percentage of times (out of 500) that its 95% CS does not cover 0 to study the power or type I error rate.

Figures 1–3 and Supplementary Figures S1–S3 show the powers and type I error rates for LBLc-GXE, LBLc-GXE-GXS, LBL-GXE-GXS and LBL-GXE for association scenarios 1–6 (scenario 6 being the null model), respectively. In scenario 1 (Figure 1), the performances of LBLc-GXE and LBLc-GXE-GXS are comparable: both detect the main haplotype effect and the interaction effect with E with similar powers and keep the type I error rates under control, while LBL-GXE-GXS and LBL-GXE have inflated type I error rates. In scenario 2 (Figure 2), where an interaction effect with S is present in the data, LBLc-GXE-GXS continues to perform well, while the other three methods, including LBLc-GXE, have inflated type I error rates. In scenario 3 (Figure 3), where the main effect of S is included in the data, LBLc-GXE, LBLc-GXE-GXS and LBL-GXE-GXS control the type I error rates successfully while LBL-GXE leads to inflated type I error rates. However, we should note that the main effect of S detected by LBL-GXE-GXS here is not really an indication of its power, because this method always detects the main effect of S as significant irrespective of whether S has a true main effect, as seen in Figures 1 and 2 and Supplementary Figures S1–S3. In summary, LBLc-GXE controls type I error rates in situations where there is no interaction between haplotype and stratifying variable, while LBLc-GXE-GXS performs well in all scenarios. Supplementary Figure S3 for the null model (scenario 6) shows that LBL-GXE-GXS and LBL-GXE lead to seriously inflated type I error rates while LBLc-GXE and LBLc-GXE-GXS control these rates well.

We also explore scenarios 2 and 6 with p_S=0.15 and p_{E|S=0}=p_{E|S=1}=0.19 so that S and E mimic race and smoking. We use the fact that the prevalence of blacks in the United States is about 15% and the prevalence of smoking among whites or blacks in the United States is about 19%. Supplementary Figures S4 and S5 show the corresponding results. The methods perform similarly to before except that, with lower prevalences of S and E, LBLc-GXE and LBLc-GXE-GXS have reduced power, as expected.

For scenario 2 and setting 1, we also analyzed the data using a standard haplotype association method haplo.glm.²⁸ Haplo.glm is based on the generalized linear model and uses maximum likelihood methods for inference. The results are reported in Figure 4 and Supplementary Table S1, which show that haplo.glm has inflated type I error rates.

Additionally, we investigate a different rescaling of the weights such that the sum of the case (control) weights is equal to case (control) sample size. We compare the two types of rescaling by applying LBLc-GXE and LBLc-GXE-GXS to the data generated under setting 1 of scenario 2. The results of these two types of rescaling are comparable as shown in Supplementary Table S2. We also examine the methods for data generated under HWD by setting d=0.1 in the data simulation procedure for setting 1 of scenario 2. The relative performances of the methods are similar to what we found earlier under HWE. The detailed results are shown in Supplementary Figure S6.

Two stratifying variables

We next conduct simulation studies using two stratifying variables S₁ (0/1) and S₂ (0/1) to mimic race and sex. We set the prevalences p_{S₁}=P(S₁=1)=0.15 and p_{S₂}=P(S₂=1)=0.5. These two stratifying variables form four strata: Stratum 1 (S₁=0, S₂=0), Stratum 2 (S₁=0, S₂=1), Stratum 3 (S₁=1, S₂=0), and Stratum 4 (S₁=1, S₂=1). The binary environmental covariate E has prevalences p_{E|S₂=0}=P(E=1|S₂=0)=0.15 and p_{E|S₂=1}=P(E=1|S₂=1)=0.2, which mimic the smoking prevalences of about 15% among females and 20% among males (http://kff.org/other/state-indicator/smoking-adults-by-gender/). We consider six haplotypes and two types of G–S dependence: dependence on S₁ only (G–S₁ dependence) or on both S₁ and S₂ (G–S₁–S₂ dependence), as listed in Table 2.

Table 2 Simulation setup for two stratifying variables: OR and haplotype frequencies under two types of G–S dependence

The sample generation and weight calculation procedure is similar to that in the ‘One stratifying variable’ subsection. Specifically, we generate a population of size 10 000 and select all cases in the population. In Strata 1 and 2, we select a simple random sample of controls of the same size as the number of cases in the corresponding stratum. In Strata 3 and 4, we select a simple random sample of controls with size double of that of the cases in the corresponding stratum. The total sample sizes range from 2000 to 2500 with roughly 1000 cases (about 400 each in Strata 1 and 2 and 100 each in Strata 3 and 4).

Figure 5 shows the results for both G–S₁ dependence and G–S₁–S₂ dependence. The relative performances of the methods are comparable to what we observe in the case of one stratifying variable. That is, LBLc-GXE-GXS has type I error rates well controlled while the other three methods, including LBLc-GXE, have inflated type I error rates as the simulation model includes non-null effects of both GXE and GXS. The powers are lower under G–S₁–S₂ dependence compared with G–S₁ dependence as the former involves additional modeling.

Application to the KCS data

Following our motivation described in the Introduction section, we study the NAT2 gene and its interaction with smoking. Deitz et al.²⁹ report that seven SNPs (rs1801279, rs1041983, rs1801280, rs1799929, rs1799930, rs1208 and rs1799931) explain 100% of the alleles detected in NAT2. Out of these seven SNPs, six are available in the KCS data. From them, a haplotype block consisting of the following five SNPs is detected by Haploview:³⁰ rs1041983, rs1801280, rs1799929, rs1799930, and rs1208. We focus on analyzing this five-SNP haplotype block.

The KCS data include rescaled population weights; the rescaling is such that the sum of the weights for the cases is the same as the sum of the weights for the controls. We used these weights in our analyses to account for complex sampling design. We consider smoking status as a covariate with three levels: never smoking, former smoking, and current smoking (consisting of occasional and regular current smokers). Further, we adjust for all four stratifying variables: site (Detroit, Chicago), age (<45, 45–54, 55–64, 65–74, ⩾75 years), race (white, black), and sex following Li and Graubard¹⁹ and Hofmann et al.¹⁴ Note that, at each site (city), both cases and controls were recruited, and so using site as a stratifying variable along with race can address population stratification due to geographical location to some extent.

After removing subjects with missing genotype or smoking status, there are 909 cases and 936 controls in the KCS data. Table 3 shows some characteristics of these data. There is a higher proportion of current smokers among cases than in controls for both whites and blacks. More details about these data can be found in Hofmann et al.¹⁴ Haplotype frequencies as estimated using the hapassoc software³¹ based on maximum likelihood estimation are shown in Table 4. They vary substantially between the two races as well as cases and controls. These estimates are used as starting values of the frequency (γ) parameters in the MCMC procedures.

Table 3 Characteristics distributions of the KCS data according to several variables

Table 4 Haplotype frequencies in the KCS data as reported by hapassoc

In our analysis, we set haplotype TTCAA as the baseline as it has similar frequencies in the cases and controls among whites as well as blacks. In addition, we assume that G–E dependence can be captured through the dependence of haplotypes on race, that is, C_dep={Race}. As there are several haplotypes that are extremely rare, we run LBL for a large number of iterations to ensure convergence and accurate results. In particular, to monitor convergence, we run three chains from three different starting points, make diagnostic plots and calculate the R̂ (potential scale reduction) statistics.²¹ We run each chain for 300 000 iterations, discard the initial 100 000 as burn-in and combine the three chains to obtain the posterior distributions.
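The multi-chain convergence check can be sketched as follows, under the assumption that the convergence statistic meant here is the Gelman–Rubin potential scale reduction factor, which compares between-chain and within-chain variance. The function name and the toy chains are hypothetical, and this basic version omits refinements such as splitting each chain in half.

```python
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of equal-length lists of post-burn-in draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [statistics.mean(c) for c in chains]
    within = statistics.mean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Toy chains: three identical chains versus three chains stuck in different regions.
mixed = [[0.0, 1.0, 2.0, 3.0]] * 3
stuck = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0], [20.0, 21.0, 22.0, 23.0]]

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

Values close to 1 suggest that the chains have reached the same distribution, whereas values well above 1 indicate that chains started from different points have not yet mixed.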

The results are reported in Table 5. Both LBLc-GXE and LBLc-GXE-GXS find an interaction effect of a rare haplotype CTCGG and current smoking to be highly significant with BF>100. LBLc-GXE also detects the main effects of CTCGG and current smoking to be significant while LBLc-GXE-GXS finds only the latter to be significant. Specifically, LBLc-GXE-GXS estimates the OR of the interaction to be 0.37 and the main effect of CTCGG to be null. Therefore, among current smokers, the carriers of CTCGG have reduced odds of kidney cancer compared with the carriers of the baseline haplotype TTCAA. The two methods also detect a few other effects with their 95% CS excluding 1; however, their corresponding BF values are small.
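The interpretation for current smokers can be spelled out with a short calculation. Under a logistic model with a haplotype main effect and a haplotype-by-smoking interaction, and for a subject at the baseline levels of the stratifying variables (an assumption made here to set aside the haplotype-by-stratum terms of LBLc-GXE-GXS), the odds ratio per copy of CTCGG relative to the baseline haplotype among current smokers is the product of the main-effect OR and the interaction OR. The subscripts below are illustrative labels, not notation from the paper.

```latex
\mathrm{OR}_{\text{CTCGG vs. TTCAA}\,\mid\,\text{current smoker}}
  = \exp\!\left(\beta_{\text{CTCGG}} + \beta_{\text{CTCGG}\times\text{current}}\right)
  = \mathrm{OR}_{\text{CTCGG}} \times \mathrm{OR}_{\text{CTCGG}\times\text{current}}
  \approx 1 \times 0.37 = 0.37
```

With the main effect estimated to be null, the reduced odds among current smokers are driven entirely by the interaction term.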

Table 5 Results of analysis of the KCS data^a

On the other hand, if the complex sampling design is ignored in the analysis (that is, LBL-GXE-GXS or LBL-GXE is used for analysis), we fail to detect the main effect of former or current smoking. Moreover, LBL-GXE-GXS, which models main and interaction effects of stratification variables, even detects a protective effect of the black race, which contradicts the fact that blacks are at a higher risk of kidney cancer than whites.¹⁷ These contradictory results illustrate the importance of accounting for complex sampling design in the analysis.

Discussion

Complex sampling schemes such as stratified sampling with frequency matching are now increasingly used in practice. At the same time, in the quest to dissect the etiology of common diseases, tremendous efforts are being directed toward detecting rare variants and their interactions with environmental covariates. Yet most of the current genetic association methods do not take the design of data collection into account, which can lead to biased results. Thus there is a pressing need for methods, especially for rare variants, that can properly account for complex sampling design. Here we adapted the LBL framework to analyze data originating from complex sampling schemes. As LBL is based on retrospective likelihood, it automatically conditions on the matched frequencies of cases and controls in each stratum once we condition on the stratifying variables. The differential sampling rates across strata are accounted for using the (rescaled) population weights.

When there is no interaction between stratifying variable and haplotype, we found that LBLc-GXE provides considerable power and controlled type I error rates. However, it has increased type I error rates when such an interaction is present. In such situations, the method that additionally models the interaction term, LBLc-GXE-GXS, performs well. On the other hand, the originally proposed LBL method has high type I error rates even when stratifying variables are included as covariates in the model. In addition to inference on association, which is our main focus, we also report in Supplementary Table S3 the bias, standard errors and mean squared errors of the point estimates of the regression coefficients whose true OR>1. For the null effects (OR=1), these values are smaller than the ones reported in the table and are thus omitted for brevity. As we can see from the table, these are all small for LBLc-GXE-GXS. The same is true for LBLc-GXE except for the bias and the mean squared error of the R2XE effect when there are two stratifying variables and there is also an R2XS effect. In this case, LBLc-GXE is not the correct model and thus gives inflated type I errors, as already noted above.

To examine the methods under realistic linkage disequilibrium patterns and potential cryptic relatedness among subjects, we also carried out simulations based on the haplotypes and results from the KCS data analysis. We use the haplotype frequencies from Table 4 (separately for whites and blacks) and use race as the stratifying variable (S) and smoking as a binary environmental covariate (E). To mimic the prevalence of blacks in the United States and of smoking in the two races, we set p_S = 0.15 and p_{E|S=0} = p_{E|S=1} = 0.19, as used earlier in some simulations. The data are generated in the same manner as described in the ‘Simulation study’ section. We consider two scenarios: (1) null, with all ORs set to 1, and (2) non-null, with OR=1.4 for E and OR=0.3 for the interaction of haplotype CTCGG with E, which are similar to the values estimated in the KCS data analysis. The results, presented in Supplementary Figure S7, are consistent with our earlier simulation study results.
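
The covariate settings above can be sketched in Python as follows. Only the probabilities and odds ratios come from the text; the function name, the seed and the sample size are hypothetical, and the sketch covers neither haplotype generation nor the stratified sampling of cases and controls described in the ‘Simulation study’ section.

```python
import math
import random

P_S = 0.15                         # P(S = 1), from the text
P_E_GIVEN_S = {0: 0.19, 1: 0.19}   # P(E = 1 | S = s), from the text


def simulate_covariates(n, seed=0):
    """Draw (S, E) pairs for n subjects from the stated probabilities."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        s = 1 if rng.random() < P_S else 0
        e = 1 if rng.random() < P_E_GIVEN_S[s] else 0
        subjects.append((s, e))
    return subjects


# Log-odds coefficients for the non-null scenario.
beta_e = math.log(1.4)        # main effect of E
beta_hap_e = math.log(0.3)    # interaction of haplotype CTCGG with E

# Assumption for illustration: one copy of CTCGG and a null haplotype main
# effect, so the OR for an exposed carrier versus an unexposed subject with
# the baseline haplotype is the product of the two ORs.
combined_or = math.exp(beta_e + beta_hap_e)

subjects = simulate_covariates(100_000, seed=1)
freq_s = sum(s for s, _ in subjects) / len(subjects)
freq_e = sum(e for _, e in subjects) / len(subjects)
print(freq_s, freq_e, combined_or)
```

Under these assumptions the combined OR is the product 1.4 × 0.3, which is below 1 and thus consistent with a protective effect of the haplotype among exposed subjects.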

When applied to the KCS data, our method found current smokers to be at an increased risk for kidney cancer, consistent with the literature. Further, our finding of an interaction between smoking and the NAT2 gene has also been reported in the literature. However, this is the first time, to the best of our knowledge, that an interaction with a specific rHTV has been implicated. Moreover, we found that current smokers carrying the rHTV CTCGG have reduced odds of the disease compared with those carrying the baseline haplotype. Semenza et al.¹⁵ and Chow et al.¹⁷ state that kidney cancer risk is higher for NAT2 slow acetylators than for rapid acetylators among smokers. The haplotype CTCGG appears to be of a rapid acetylator type as per http://www.snpedia.com/index.php/NAT2, which might explain its protective effect for current smokers. Because this significant interaction effect appears to be novel, it should be investigated in future studies. Moreover, the population stratification issue, in general, might need to be handled more carefully because genetic background can sometimes vary even within the same race and site.

As an alternative to LBLc-GXE, which models stratifying variables as covariates, we also explored including stratifying variables in the model by assigning to each stratum its own intercept;³² we denote this method by LBLc-GXE(I). We compared LBLc-GXE and LBLc-GXE(I) for a few simulation settings with one binary stratifying variable and found that they perform similarly. This is expected as the two models are actually equivalent in this case. When there are two or more stratifying variables and their effects are not additive, LBLc-GXE(I) may perform better than LBLc-GXE; however, its power will suffer if the true model is additive because it has a larger number of intercept parameters.
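
The equivalence noted above can be seen from a simple reparameterization of the linear predictor. In the sketch below, the symbols \alpha, \gamma, \alpha_s and \eta are introduced only for illustration and are not the notation of the original model; \eta collects all haplotype and environmental terms, and any priors placed on the parameters are ignored.

```latex
% One binary stratifying variable S: covariate form versus stratum-intercept form
\operatorname{logit} P(Y = 1 \mid S, \cdot) = \alpha + \gamma S + \eta
\qquad \text{versus} \qquad
\operatorname{logit} P(Y = 1 \mid S = s, \cdot) = \alpha_s + \eta, \quad s \in \{0, 1\}.

% The two forms are linked by the one-to-one map
\alpha_0 = \alpha, \qquad \alpha_1 = \alpha + \gamma .

% Two binary stratifying variables S_1, S_2: the additive covariate form
\alpha + \gamma_1 S_1 + \gamma_2 S_2 + \eta
% has three stratum-related parameters, whereas the stratum-intercept form
\alpha_{s_1 s_2} + \eta, \quad (s_1, s_2) \in \{0, 1\}^2
% has four. They coincide only if
\alpha_{11} - \alpha_{10} - \alpha_{01} + \alpha_{00} = 0 .
```

With one binary stratifying variable both forms have two stratum-related parameters, so nothing is gained or lost. With several stratifying variables, the stratum-intercept form spends additional parameters on possible non-additive stratum effects, which helps when such effects exist and costs power when they do not.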

The LBL methods are computationally intensive and hence are better suited for zooming into genes or regions of interest implicated previously by fast, typically single-SNP-based and genome-wide, algorithms. LBLc-GXE-GXS is computationally slower than LBLc-GXE as it has more parameters. For example, when there is one stratifying variable, LBLc-GXE takes 915, 1379 and 1993 s to finish 120 000 iterations under settings 1–3 of scenario 2, respectively, while the corresponding times for LBLc-GXE-GXS are 1095, 1694 and 2435 s. These computing times are for a 3.60 GHz Xeon processor with 15.55 GB RAM running Linux.
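
To put these timings in relative terms, the short Python sketch below computes the extra cost of LBLc-GXE-GXS over LBLc-GXE and the average time per iteration. The timings and the iteration count are those reported above; the variable and function names are hypothetical.

```python
ITERATIONS = 120_000

# Reported run times in seconds for settings 1-3 of scenario 2.
times_gxe = [915, 1379, 1993]
times_gxe_gxs = [1095, 1694, 2435]


def relative_overhead(base, extended):
    """Fractional increase in run time of the extended model over the base."""
    return [(e - b) / b for b, e in zip(base, extended)]


def milliseconds_per_iteration(times, iterations=ITERATIONS):
    """Average time per iteration in milliseconds."""
    return [1000 * t / iterations for t in times]


overheads = relative_overhead(times_gxe, times_gxe_gxs)
for setting, overhead in enumerate(overheads, start=1):
    print(f"setting {setting}: {100 * overhead:.1f}% slower")

print(milliseconds_per_iteration(times_gxe))
print(milliseconds_per_iteration(times_gxe_gxs))
```

For these three settings, modeling the additional interaction terms increases the run time by roughly 20% to 23%.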

To summarize, we have extended the original LBL method to incorporate complex sampling schemes, in particular, stratified random sampling. Its main advantage stems from the fact that none of the current haplotype association methods can handle both rare variants and a complex sampling design in the model. Another complex sampling scheme that is gaining popularity is matching controls to cases individually rather than by frequency matching (typically referred to as matched case–control). Although we focus on the stratified sampling design for a more concise discussion, the model for an individually matched case–control design would be similar because the retrospective likelihood takes care of conditioning on individual-level matching, just as it does for frequency matching. LBL has also been extended to handle longitudinal data³³ and case–parent triad data.³⁴ Thus LBL is now a comprehensive suite of rHTV methods that can be used for various types of data. We plan to extend the methods to quantitative traits and extended family data, as well as to other sampling designs such as nested case–control and case–cohort, to further increase LBL’s capability.

Software

The methods have been implemented in an R package, LBL, available at http://www.utdallas.edu/~swati.biswas and http://www.stat.osu.edu/~statgen/SOFTWARE/LBL.

References

1. Biswas, S. & Lin, S. Logistic Bayesian LASSO for identifying association with rare haplotypes and application to age-related macular degeneration. Biometrics 68, 587–597 (2012).
2. Biswas, S., Xia, S. & Lin, S. Detecting rare haplotype-environment interaction with logistic Bayesian LASSO. Genet. Epidemiol. 38, 31–41 (2014).
3. Zhang, Y. & Biswas, S. An improved version of logistic Bayesian LASSO for detecting rare haplotype-environment interactions with application to lung cancer. Cancer Inform. 14, 11–16 (2015).
4. Zhang, Y., Lin, S. & Biswas, S. Detecting rare haplotype-environment interaction under uncertainty of gene-environment independence assumption. Biometrics 73, 344–355 (2017).
5. Biswas, S. & Papachristou, C. Evaluation of logistic Bayesian LASSO for identifying association with rare haplotypes. BMC Proc. 8, S54 (2014).
6. Datta, A. S., Zhang, Y., Zhang, L. & Biswas, S. Association of rare haplotypes on ULK4 and MAP4 genes with hypertension. BMC Proc. 10, 363–369 (2016).
7. Wang, M. & Lin, S. Detecting associations of rare variants with common diseases: collapsing or haplotyping? Brief Bioinform. 16, 759–768 (2015).
8. Datta, A. S. & Biswas, S. Comparison of haplotype-based statistical tests for disease association with rare and common variants. Brief Bioinform. 17, 657–671 (2016).
9. Korn, E. L. & Graubard, B. Analysis of Health Surveys (Wiley, New York, NY, USA, 1999).
10. Scott, A. J. & Wild, C. J. Case-control studies with complex sampling. Appl. Stat. 50, 389–401 (2001).
11. Digaetano, R., Graubard, B., Rao, S., Severynse, J. & Wacholder, S. Sampling racially matched population controls for case-control studies: using DMV lists and oversampling minorities. (2003) https://fcsm.sites.usa.gov/files/2014/05/2003FCSM_DiGaetano.pdf (accessed 8 August 2016).
12. Colt, J. S., Schwartz, K., Graubard, B. I., Davis, F., Ruterbusch, J., DiGaetano, R. et al. Hypertension and risk of renal cell carcinoma among white and black Americans. Epidemiology 22, 797–804 (2011).
13. Purdue, M. P., Moore, L. E., Merino, M. J., Boffetta, P., Colt, J. S., Schwartz, K. L. et al. An investigation of risk factors for renal cell carcinoma by histologic subtype in two case-control studies. Int. J. Cancer 132, 2640–2647 (2013).
14. Hofmann, J. N., Schwartz, K., Chow, W. H., Ruterbusch, J. J., Shuch, B. M., Karami, S. et al. The association between chronic renal failure and renal cell carcinoma may differ between black and white Americans. Cancer Causes Control 24, 167–174 (2013).
15. Semenza, J. C., Ziogas, A., Largent, J., Peel, D. & Anton-Culver, H. Gene-environment interactions in renal cell carcinoma. Am. J. Epidemiol. 153, 851–859 (2001).
16. Moore, L. E., Brennan, P., Karami, S., Menashe, I., Berndt, S. I., Dong, L. et al. Apolipoprotein E/C1 locus variants modify renal cell carcinoma risk. Cancer Res. 69, 8001–8008 (2009).
17. Chow, W., Dong, L. M. & Devesa, S. S. Epidemiology and risk factors for kidney cancer. Nat. Rev. Urol. 7, 245–257 (2010).
18. Purdue, M. P., Johansson, M., Zelenika, D., Toro, J. R., Scelo, G., Moore, L. E. et al. Genome-wide association study of renal cell carcinoma identifies two susceptibility loci on 2p21 and 11q13.3. Nat. Genet. 43, 60–65 (2011).
19. Li, Y. & Graubard, B. I. Pseudo semiparametric maximum likelihood estimation exploiting gene environment independence for population-based case-control studies with complex samples. Biostatistics 13, 711–723 (2012).
20. Longuemaux, S., Delomenie, C., Gallou, C., Mejean, A., Vincent-Viry, M., Bouvier, R. et al. Candidate genetic modifiers of individual susceptibility to renal cell carcinoma: a study of polymorphic human xenobiotic-metabolizing enzymes. Cancer Res. 59, 2903–2908 (1999).
21. Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. Bayesian Data Analysis (Chapman and Hall/CRC, Boca Raton, FL, USA, 2003).
22. Chatterjee, N. & Carroll, R. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92, 399–418 (2005).
23. Mukherjee, B., Zhang, L., Ghosh, M. & Sinha, S. Semiparametric Bayesian analysis of case-control data under conditional gene-environment independence. Biometrics 63, 834–844 (2007).
24. Weir, B. S. Genetic Data Analysis II (Sinauer Associates Inc, Sunderland, MA, USA, 1996).
25. Mukherjee, B. & Chatterjee, N. Exploiting gene-environment independence for analysis of case-control studies: An empirical-Bayes type shrinkage estimator to trade off between bias and efficiency. Biometrics 64, 685–694 (2008).
26. Prentice, R. L. & Pyke, R. Logistic disease incidence models and case-control studies. Biometrika 66, 403–411 (1979).
27. Kwee, L. C., Epstein, M. P., Manatunga, A. K., Duncan, R., Allen, A. S. & Satten, G. A. Simple methods for assessing haplotype-environment interactions in case-only and case-control studies. Genet. Epidemiol. 31, 75–90 (2007).
28. Lake, S. L., Lyon, H., Tantisira, K., Silverman, E. K., Weiss, S. T., Laird, N. M. et al. Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum. Hered. 55, 56–65 (2003).
29. Deitz, A. C., Rothman, N., Rebbeck, T. R., Hayes, R. B., Chow, W. H., Zheng, W. et al. Impact of misclassification in genotype-exposure interaction studies: example of N-Acetyltransferase 2 (NAT2), smoking, and bladder cancer. Cancer Epidemiol. Biomarkers Prev. 13, 1543–1546 (2004).
30. Barrett, J. C., Fry, B., Maller, J. & Daly, M. J. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21, 263–265 (2005).
31. Burkett, K., Graham, J. & McNeney, B. hapassoc: software for likelihood inference of trait associations with SNP haplotypes and other attributes. J. Stat. Softw. 16, 1–19 (2006).
32. Scott, A. J. & Wild, C. J. in Analysis of Survey Data (eds Chambers, R. L., Skinner, C. J.) 109–120 (Wiley, Chichester, England, 2003).
33. Xia, S. & Lin, S. Detecting longitudinal effects of haplotypes and smoking on hypertension using B-Splines and Bayesian LASSO. BMC Proc. 8, S85 (2014).
34. Wang, M. & Lin, S. FamLBL: detecting rare haplotype disease association based on common SNPs using case-parent triads. Bioinformatics 30, 2611–2618 (2014).


Acknowledgements

This work was partially supported by the grant R03CA171011 from the National Cancer Institute, NIH and by allocations of computing times from the Texas Advanced Computing Center at the University of Texas at Austin. The US Kidney Cancer Study was supported by the Intramural Research Program of the NIH, National Cancer Institute. We are thankful to the two anonymous referees for their constructive comments and suggestions.

Author information

Authors and Affiliations

Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
Yuan Zhang & Swati Biswas
Division of Cancer Epidemiology and Genetics, Occupational and Environmental Epidemiology Branch, National Cancer Institute, Bethesda, MD, USA
Jonathan N Hofmann & Mark P Purdue
Department of Statistics, The Ohio State University, Columbus, OH, USA
Shili Lin

Corresponding authors

Correspondence to Shili Lin or Swati Biswas.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on Journal of Human Genetics website

About this article

Cite this article

Zhang, Y., Hofmann, J., Purdue, M. et al. Logistic Bayesian LASSO for genetic association analysis of data from complex sampling designs. J Hum Genet 62, 819–829 (2017). https://doi.org/10.1038/jhg.2017.43

Received: 11 August 2016
Revised: 21 March 2017
Accepted: 22 March 2017
Published: 20 April 2017
Issue Date: September 2017
DOI: https://doi.org/10.1038/jhg.2017.43
