Introduction

Gallbladder disease (GBD) is a major cause of morbidity, hospital admission, surgical intervention and economic burden and is caused by gallstones.1 Gallstones occur in 10–40% of adults in developed countries,2 predominantly in women.3 The incidence increases with age and other factors such as obesity and diabetes, although there are differences between men and women in the determinants of GBD.4 These differences make advisable the analysis of the genetic basis of GBD separate by sex. Causes of morbidity include biliary colic, cholecystitis, choledocholithiasis and pancreatitis.5 In the USA, >700 000 cholecystectomies are undertaken per year.6 Large genetic studies have shown that the additive genetic heritability of symptomatic gallstones ranges from 25 to 29%.7, 8

In developed countries, most gallstones are formed of cholesterol, but pigment stones consisting substantially of calcium bilirubinate predominate in regions where bacterial and parasitic infections of the biliary tree, and hemolysis, are common.9 Exogenous estrogens are a risk factor, as are fibrates.10

Some symptomatic patients require surgical intervention, whereas prevention through measures targeting important causal pathways should avoid all morbidity and sequelae. Long-term statin use was associated with reduced GBD in a recent observational case–control design, but bias and confounding cannot be ruled out,11 and results of long-term randomized controlled trials are not yet available.

Although family-based genetic linkage studies of GBD have met with limited success, inbred mouse studies12 and human genome-wide association studies (GWASs)13 converged on one notable locus, the head-to-head gene pair ABCG5/ABCG8, which together encode a heterodimeric transporter responsible for apical cholesterol secretion from both hepatocytes and enterocytes. However, a systems biology view must be taken since liver cell homeostasis compensates lesser absorption (more secretion) with more endogenous cholesterol synthesis.

We therefore conducted a large-scale genetic association study of GBD. This was conducted in women sampled from the general population. This design avoided potential selection bias of genotypes; with high single-nucleotide polymorphism (SNP) density coverage including many loci already well established to be involved in lipid metabolism or obesity. We analyzed our data in relation to this knowledge and interpret the findings in relation to metabolic pathways relevant to GBD.

Materials and methods

Study design

Genetic association analyses were conducted in order to identify candidate SNPs associated with GBD. Fine mapping of causal loci and effects of lipids were also considered. Only women were tested as two of the cohorts analyzed included only women. Findings derived from this work will therefore be applicable only to women.

Study descriptions and phenotype definitions are shown in Supplementary Information.

Data Access for the NHLBI Candidate gene Association Resource (CARe). The NHLBI initiated the CARe to create a shared genotype/phenotype resource for analyses of the association of genotypes with phenotypes relevant to the mission of the NHLBI. The resource comprises nine cohort studies funded by the NHLBI including: Atherosclerosis Risk in Communities (ARIC), Cardiovascular Health Study (CHS), Cleveland Family Study (CFS), Coronary Artery Risk Development in Young Adults (CARDIA), Framingham Heart Study (FHS), Jackson Heart Study (JHS), Multi-Ethnic Study of Atherosclerosis (MESA) and the Sleep Heart Health Study (SHHS). A database of genotype and phenotype data was created that includes records for ~41 000 study participants with ~50 000 SNPs from >2000 selected candidate genes. Data from individual cohorts such as ARIC are available to the approved investigators upon submission of data requests through the dbGaP portal.

Genotyping and quality control

Genotype quality control characteristics from the three data sets are described in Supplementary Table 2. Genomic control inflation factors (λ) were 1.00 for all three cohorts. Genotyping was performed using the Human cardiovascular disease (HumanCVD) BeadChip (Illumina, San Diego, CA, USA), also known as the ITMAT-Broad-CARe (IBC) array as previously described.14 The ARIC and BWHHS used the IBCv2 array, which contained up to 49 240 SNPs, whereas the WHI used the IBCv3 array, which contained up to 53 400 SNPs (containing the entire v2 content plus an additional ~4200 SNPs from updated metabolic GWAS findings in the literature from 2007 to 2008). All SNPs were clustered into genotypes using the Illumina Beadstudio software and subjected to quality control filters at the sample and SNP level, separately within each cohort. Samples were excluded for individual call rates <90%, gender mismatch and duplicate discordance. SNPs were removed for call rates <95% or Hardy–Weinberg disequilibrium P<10–7. Because of the low frequency SNPs included in the design, and the aim to capture low frequency variants of large effect across the large dataset, we filtered only on minor allele frequency (MAF) <0.01. Following identification of the TTC39B SNP rs581080 in WHI that was absent in IBCv2, all BWHHS samples were genotyped for rs581080 using the KASPar system (KBiosciences, Essex, UK).15

Statistical analyses

Only women of European ancestry were included. For each of the three studies, we verified self-reported ethnicity by multidimensional scaling analysis of identity-by-state distances as implemented in PLINK,16 including HapMap panels as reference standards. After pruning of SNPs in linkage disequilibrium (LD; r2>0.3), Eigenstrat was used to compute principal components (PCs) on the subset of non-excluded individuals for use as covariates in the regression analyses to control for the influence of population admixture.17, 18

GBD association analysis was performed in each study using an additive genetic model. We performed association analyses using GBD case/control adjusted for age and additionally adjusted for PCs if population structure was evident. Analyses adjusting for BMI were also performed. The λ was calculated in each case–control study and used for within-study correction before meta-analysis. Meta-analysis was performed using a fixed-effect standard error-based approach using METAL (1). The Candidate gene Association Resource (CARe) IBC array studies,19 determined that after accounting for LD, the effective number of independent tests was ~20 500 for Europeans producing an experimental or ‘array-wide’ statistical threshold of P=2.4 × 10−6, respectively, to maintain a false-positive rate of 5%,20 which has been employed in numerous studies to date.21, 22, 23, 24 We also highlight loci significantly associated at a more conventional genome-wide significant threshold of P<5.0 × 10−8.

Loci harboring marginally significant evidence for association at P<10−5 were examined for independent signals via regression analyses in PLINK.16 A term was added to the regression model including the lead SNP as a covariate, and SNPs within the same candidate gene, or ±200 kb if the candidate gene region was <200 kb, were evaluated for maintaining array-wide significance. HDL-adjusted meta-analysis and conditional analysis were performed in the WHI and ARIC cohorts in which individual-level genotype data were available. A locus-specific Bonferroni correction was applied to determine the significance of independent signals.

Regional association plots for the genomic regions associated with GBD were constructed using the online tool LocusZoom (http://csg.sph.umich.edu/locuszoom/). When drawing the plots, the genome build HG18 was selected to provide the genome coordinates of SNPs. 1000 Genome CEU was used as the LD population.

The variance explained (adjusted R2) by associated SNPs (defined to be with P-value<1 × 10−4 in the primary genetic association tests) was calculated within cohorts with individual-level genotype and phenotype data available (ARIC and WHI) using a linear regression model incorporating the adjustment by covariates of age and three PC's. The average variance explained, weighted by the sample size of each contributing study, is reported.

Fine mapping of candidate loci

Fine mapping of candidate loci has been performed using an approach we previously developed (Sequential Sentinel SNP Regional Association Plot, SSS-RAP)25 (software available at http://apps.biocompute.org.uk/sssrap/sssrap.cgi). SSS-RAP, together with conditional analysis, was used to identify independent signals. In short, this method is biologically driven and considers the relationship between LD and linear/logistic regression for pairwise SNPs. We adopted the additive model, typical for genes with small effect, and transformed the effect of a sentinel SNP to the predicted effect for a possibly dependent SNP.

There are 77 SNPs in the ABCG5/8 locus designed on the IBC array, thus we used locus-wide significance level of 0.05/77=6.49 × 10−4 as the cutoff for conditional analysis. In the first round conditional analysis, we took genotypes of the lead SNP (rs4953023:G>A) in the un-conditional meta-analysis, redid association tests by adjusting its genotypes as a covariate in each individual-level cohort, and meta-analyzed the results. Then in the second round, we extracted the lead SNP of the first round conditional analysis (rs4299376:G>T), took the combination of genotypes for SNPs rs4953023:G>A and rs4299376:G>T as a covariate, and repeated the above procedure. We performed third and fourth rounds similarly. Within the ABCG5-8 region, 360 SNPs were analysed from the 1000 genomes database based on 26 different populations from many different worldwide locations (http://www.1000genomes.org/). A total of 118 SNPs remained after excluding SNPs in perfect LD (r2=1). A random subset of SNPs not in the meta-analysis was excluded, with the remaining SNPs analysed using SSS-RAP. The rationale of random exclusion was to keep a representative sample of the study. Using this approach, 10/118 SNPs remained significant as signals in the meta-analysis.

Associations of lipid SNPs with GBD

Median −log10 P-values were analyzed for 65 loci previously reported to associate with plasma lipid levels26 in comparison with the distribution of median −log10P-values generated by 10 000 simulations using random sampling of 65 loci for each set of simulations. These Monte Carlo simulations were performed in R.

Representations of the histogram of median estimates were performed using the ‘hist’ function in the basic package of R (http://svn.r-project.org/R/trunk/src/library/graphics/R/hist.R), and the quantile–quantile (QQ) plots were also performed in R, but using the ‘ggplot2’ package (http://cran.r-project.org/web/packages/ggplot2/index.html).

Results

Meta-analysis

Seventeen SNPs with call rates >95% and Hardy–Weinberg disequilibrium (P>10–7) with MAF >0.01 showed significant association with GBD at a level of P≤2.4x10−6 in our meta-analysis (Table 1). Fourteen of these variants localized to the ABCG5/ABCG8 locus (Table 1), including the strongest signal found in our study (with rs4953023:G>A having the strongest value, P=7.41x10−47). The three remaining SNPs corresponded to the genes GCKR and TTC39B (Table 1). A less stringent threshold of P≤10−5 revealed 18 more SNPs that showed marginally significant association with GBD at a level of P≤10−5 in our meta-analysis (Supplementary Table 3). One of these variants localized to the ABCG5/ABCG8 locus. The 17 remaining SNPs corresponded to the genes TTC39B, GCKR, C2orf16, CCL20, RUNX1, EIF2B4, MMP24, CYP4F2, RIPK1, ZNF512 and GAA (Supplementary Table 3). The effect sizes and standard errors are also shown in Table 1 and Supplementary Table 3.

Table 1 Association of SNPs with GBD in the ARIC, BWHHS and WHI IBC array studies

As there are two versions of the IBC array used in this study, this resulted in missingness for some SNPs across the cohorts. Table 1 shows the concordance of the direction of the effect found in all three cohorts (ARIC, BWHHS and WHI). There is concordance of direction for nine ABCG5/ABCG8 SNPs, discordance for none, and incomplete information for six (five of the six SNPS were present on the version 3 array only). With regard to the other genes, we found the same direction of effect in all three cohorts for SNPs in GCKR, C2orf16, CCL20, RUNX1, MMP24, CYP4F2, RIPK1 and ZNF512. TTC39B showed incomplete information (Supplementary Table 3).

Results for HDL-adjusted meta-analysis are shown in Supplementary Table 4. In summary, after adjusting for HDL, all of the ABCG5/8 SNPs with the exception of rs4148189:C>T and rs6544713:T>C are significant considering the array-wide significance level, whereas TTC39B SNPs are not significant considering the array-wide significance level. Most strong hit for this gene is rs686030:C>A, with P=0.0038. GCKR SNP rs1260326 it is still marginally significant (P=1.01 × 10−5 after the HDL adjustment).

SNPs with P-value <1 × 10−4 (37 SNPs in total) explain 0.0808 of the variance in ARIC (note 11 SNPs are not genotyped in ARIC), while they explain 0.0917 of the variance in WHI (one ABCG5 SNP rs4148191:G>T was not genotyped in WHI).

Conditional analyses and regional association plots

Our conventional conditional analysis detected four independent SNP effects in ABCG8/5 rs4953023:G>A, rs4299376:G>T, rs6544718:T>C and rs6720173:G>C (Supplementary Table 6) A different method to perform conditional analyses (SSS-RAP), after setting rs4953023:G>A as a top hit (lowest P-value), identified three SNPs (rs10208987:T>G, rs4299376:G>T and rs3806470: G>A) as possible independent signals associated with GBD. The possible independence among these SNPs was supported by the low pairwise r2 values observed (r2<0.22).

High degrees of heterogeneity when combining the three cohorts and unavailability of individual-level data in BWHHS are possible causes to explain differences between conditional analysis and SSS-RAP results.

Regional association plots are shown in Supplementary Figures 4–15.

LD involving loci significantly associated with GBD

LD results for loci significantly associated with GBD are shown in Supplementary Information.

Analyses of loci in lipid pathway

From the 95 independent loci previously reported to be associated with blood lipid levels,26 it was possible to subject 65 to a test using IBC array data (Supplementary Table 5). Figure 1a shows the QQ plots relating the distribution of significant associations for GBD observed for SNPs representing 65 independent loci previously reported to associate with plasma lipid levels, compared with a random set. Out of 65 SNPs, 59 showed an observed P-value above the expected P-value in a QQ plot. The observed distribution (for lipid-associated loci) was significantly different than the expected distribution (for a random set). This is in accordance with a combined effect of these SNPs both on GBD and on plasma lipid levels variation.

Figure 1
figure 1

Pathway analysis relating GBD with SNPs associated with plasma lipid levels. (a) GBD QQ plot for SNPs representing 65 independent loci reported in a published study to influence plasma lipid levels. (b) Monte Carlo analysis of the median log10 P-value when sampling 65 SNPs at random from our IBC array meta-analysis (excluding SNPs with MAF <0.05), 10 000 random samplings. The observed median was 1.49. The median observed for the lipid set (b) was 1.06 (mean=1.07, SD=0.19).

Figure 1b shows results for Monte Carlo simulations analyzing the median −log10P-value of 10 000 samples of groups of 65 random SNPs in IBC. The observed median value for the 65 lipid-associated loci was 1.06 (mean=1.07, SD=0.19). A median −log10P-value greater than that observed for the lipid loci set was only observed in 1.7% of random samplings.

In our analysis, a handful of loci influencing either LDLc or triglycerides showed the most prominent associations with GBD. Specifically, these were rs4299376:G>T in ABCG5/ABCG8 (see above); rs2081687:T>C in CYP7A1 (our proxy rs8192870:T>G has an r2=0.806 with rs2081687:T>C, P=0.021 for GBD); rs5756931:A>G in PLA2G6 (our proxy rs4820314:C>T r2=0.661, P=0.0022 for GBD); rs28385705:T>C in PCSK9 (P=0.011) and rs483140:G>C in LIPC (P=0.0037). Details about all the associations are presented in Supplementary Table 3.

Pathway analysis

Tissue expression and pathway rationale were evaluated to support the associated gene loci reported in Table 1. We consulted a recent survey of tissue expression by RNA sequencing27 for transcripts showing particularly relevant expression to GBD. We noted that CCL20 (highest expression in gallbladder), ABCG5, ABCG8, CYP4F2, GCKR, RUNX1 and TTC39B all showed predominant expression in hepatobiliary and gastrointestinal tissues, providing biological support for a role in GBD. Extending our analysis further, we used GeneGO Metacore (Thomson Reuters, New York, NY, USA) to evaluate the 13 genes for direct interaction with 142 genes previously linked to GBD in the literature by Medical Subject Headings terms. Five genes (ABCG5, ABCG8, CCL20, RUNX1 and CYP4F2) showed direct interaction with genes linked to GBD by MESH in a network of directly interacting genes. Notably, the RUNX1 gene interacts directly with 13 other GBD-linked genes, whereas CCL20 interacts with three GBD-linked genes, suggesting multiple levels of biological support for both genes. Supplementary Figures 2 and 3 show a sub-network from this analysis based only on direct interactors of the genes reported in this study.

Analysis of potential pharmacologic targets

Five of the novel genes reported here have available small molecule modulators (mainly inhibitory or binding) based on a query of the ChEMBL database (www.ebi.ac.uk/chembldb/). The genes are listed with the number of tool compounds in brackets: CYP4F2 (7), GAA (drugged, 22256), GCKR (68), RUNX1 (7437), TPH (4). As the differing number of tool compounds indicates, some of these targets are likely to be the focus of pre-existing drug development and projects may still be ongoing or terminated. Once molecular properties of these compounds are considered to have a favorable profile, particularly concerning the required direction of therapeutic effect, they could be investigated in animal models of GBD. Some specific candidate molecules are discussed further below.

Other relevant phenotypes

Supplementary Table 4 shows the values observed for cases and controls in BWHHS, ARIC and WHI in relation to relevant covariates for the interpretation of our meta-analysis. Significant differences were observed between cases and controls for both age and BMI in all three cohorts, with the exception of WHI in relation to age. This is in accordance with the known relation between these variables and GBD. However, no significant differences were observed between cases and controls in relation to height, hormone replacement therapy and statin use. Significant differences were observed for diabetes status in ARIC and WHI but not in BWHHS.

Discussion

GBD is a common disease, with >700 000 cholecystectomies performed in the United States alone per year at a cost of approximately $6.5 billion.6 This study, the largest genetic epidemiological analysis of GBD conducted to date, shows that a substantial number of genotypes, which associate with plasma lipid levels, also associate with risk of GBD in women of European ancestry. Taken together with some of their known functions, these data provide evidence of causal contributions of metabolic pathways involving both cholesterol, intermediary metabolism of fats, carbohydrates, and phospholipids in relation to GBD. Overall, this systems genetics analysis, by establishing causality of associations, shows that appropriate approaches to reduction of total cholesterol including reduction of carbohydrate and triglyceride substrate availability to the hepatocyte, should result in a decrease in risk of GBD.

Discussion about the relevance of phenotypic definitions of GBD based on questionnaires is shown in Supplementary Information.

From our examination of up to ~53 000 SNPs included in the IBC array analysis, the most prominent SNPs, which reached array-wide significance, were at the ABCG5/ABCG8 locus (Table 1). Signals in this locus, at such small P-values, are reassuring as it is a known positive control for GBD. The high density of SNP coverage of this locus in IBC compared with conventional GWAS arrays, combined with the large number of cases, allowed us to show conclusively (based on the absence of LD and retention in a combined model for variable selection) the presence of several mutually independent signals in known functional sites. Our leading SNP, rs4953023:G>A (P=7.41 × 10−47, effect size=0.734) is in perfect LD with the previously reported rs11887534:G>C (NM_022437.2(ABCG8):c.55G>C (p.Asp19His). This amino-acid variant is believed to be a functional SNP, which increases cholesterol secretion and hence the risk of super-saturation of bile. The rs4299376:G>T variant, also in ABCG8, tags an independent effect, possibly reflecting an earlier report of NM_022437.2(ABCG8):c.161A>G (p.Tyr54Cys) (r2~0.2 between rs4299376:G>T and rs4148211:A>G in Europeans) on GBD in Taiwanese. Two further independent SNPs were identified in ABCG5, rs10439467:C>T in intron 10 and rs6720173:G>C NM_022436.2(ABCG5)c.1810C>G (p.Gln604Glu). The ABCG5/ABCG8 transporter is important in regulating biliary cholesterol. Our identification of further new alleles at this locus influencing gallstone risk emphasises that an approach, which might target cholesterol secretion directly, could be useful in GBD prevention, though adverse effects on TC and LDLc levels would need to be avoided or countered. However, it does not prove that reduction of total cholesterol will be successful, as sequestration anywhere away from the biliary tree might have no consequence for GBD risk. Furthermore, the impact of ABCG5/ABCG8 is much greater in GBD than it is on LDLc levels. ABCG8 does emerge in lipid GWAS but with a small effect, apparently because the hepatocyte compensates increased secretion by increasing cholesterol synthesis.

A pathway analysis of the gene loci reported here highlighted extensive connectivity between ABCG5, ABCG8, RUNX1, CCL20 and nuclear hormone receptors (PPARG, PPARD, LXRA and RAR) with important roles in lipid homeostasis. The influence of thiazolidinedione drugs on gallstones has not been extensively studied, although two preclinical studies of pioglitazone conflict, a study by Al-Azzawi et al.28 showed a drug induced increase in gallbladder volume leading to increased gallstone formation in mice, whereas another study showed that the drug reduced gallstone formation in rats, by the action on retinol-binding protein-4.29 Our findings highlight the potential therapeutic importance of the network of interactions between members of the PPAR family, ABCG5/8, CCL20 and RUNX1. We evaluated all the associated loci for evidence of small molecule druggability (Supplementary Methods). The protein products of several loci reported here are potentially druggable, but RUNX1 (Runt-related transcription factor 1) may be of immediate interest in a novel target for GBD. RUNX1 has an osteogenic role in the pathology of osteoarthritis, which was downregulated by a thienoindazole molecule (TD-198946) recently described by Yano et al.30 RUNX1 may also have a role in atherosclerotic plaque formation.31 It is intriguing to speculate that RUNX1 may have a similar role in the gallstone matrix in GBD.

Our pathway analysis relies on prior knowledge of gene function; however, we also describe potentially important associations in several highly novel genes of unknown function. A strong genetic signal is reported in TTC39B (rs686030:C>A), at array-wide significance, the first report implicating this locus in GBD. A variant in tetratricopeptide repeat domain 39B (TTC39B) at 9p22, rs581080:C>G, has been associated with levels of HDL cholesterol and with genotype-expression association in the Global Lipids Consortium.26 Lower TTC39B transcript levels and higher HDL cholesterol are evident in humans32 and knockdown of TTC39B in murine models also showed a clear increase in HDL.26 The function of TTC39B is unknown; however, the gene shows predominant gastrointestinal expression and the gallbladder was the second most highly expressed tissue after the skin, providing further support for a role of this novel gene in GBD. Our results showing that the TTC39B association is attenuated after HDL adjustment are consistent with HDL either mediating the association with GBD or acting as a confounder. Further studies are required to elucidate this.

Therapeutic implications of our findings are shown in Supplementary Information.

Discussion about our combined analysis of 65 independent loci for blood lipid levels and GBD is presented in Supplementary Information.

An overall system, from the perspective of the gallbladder, can be described as three loops (Figure 2): first and most proximal, the enterohepatic cholesterol circulation; second, the loops of hepatocyte cellular homeostasis of cholesterol (and intermediary metabolism); and more distally, the total body status and turnover of lipids. Each known genetic perturbation represents an experiment of nature showing the likely directional outcome of an equivalent environmental or pharmacological intervention, similar to the causal role reported for IL6R in relation to coronary heart disease.33 It is evident from the above that enhancing cholesterol breakdown, reducing synthesis and reducing reuptake from plasma should reduce GBD risk, although the third strategy would not be attractive from a vascular perspective. It is notable that fibrate usage to reduce plasma triglycerides, increases GBD risk apparently through increasing cholesterol synthesis and hence flux through the biliary tree. The LIPC and GCKR data suggest that reducing substrate availability for endogenous cholesterol synthesis should also be effective in modifying GBD risk.

Figure 2
figure 2

Gallbladder-centric view of cholesterol system. The cholesterol system can be described as three loops from a gallbladder-centric perspective. These are the enterohepatic cholesterol circulation, the loops of hepatocyte cellular homeostasis of cholesterol (and intermediary metabolism) and the total body status and turnover of lipids. Enhancing cholesterol breakdown, reducing synthesis and reducing reuptake from plasma should reduce gallstone risk.

Overall, these systems genetic analyses show that regulation of hepatocyte cholesterol and intermediary metabolism exert causal effects in GDB risk in women. For pragmatic reasons, we only included women in our study and results may therefore only generalise to women. Further studies in cohorts of men with similar sample sizes would be required to extend our findings in men. Preventive measures, dietary and pharmacological, should meet with success considering the genetic evidence presented here that the observed associations are causal.