Introduction

Serum concentrations of lipids, such as cholesterol and triglycerides, are heritable risk factors for cardiovascular disease and targets for therapeutic intervention.1, 2 To date, 19 genome-wide association studies (GWASs) and over 200 single-nucleotide polymorphisms (SNPs) have been reported for lipid traits and are listed online in the catalog of published GWASs.3 Recently, Kim et al.4 reported a GWAS of metabolic traits, including lipid levels, in East Asians that implicated 18 loci influencing lipid levels, including three novel loci: MYL2, OAS3 and C12orf51.

In conventional approaches, multiple testing of several hundred thousand SNPs makes the detection of an association difficult, and the variants detected by Kim et al.4 cannot be categorized into functional variants such as nonsynonymous SNPs (nsSNPs). By selecting SNPs according to their functional significance, it is possible to overcome the problem of an overwhelming number of SNPs and to identify candidate SNPs for experimental validation.5 The nsSNPs are located in coding regions and result in amino-acid changes in the protein products of genes; they are thought to be the class of SNPs that have the greatest impact on phenotype. Several studies have suggested the impact of amino-acid allelic variants on protein structure and function and on human disease.6, 7, 8

Here, we used an alternative approach based on nsSNP analysis of GWAS data. A total of 11 558 nsSNPs, distributed genome wide, were analyzed for an association with lipid levels in a new Korean cohort (CAVAS; Cardio Vascular disease Association Study), which is a subset of the Korean Genome and Epidemiology Study (KoGES), and tested for data replication in another two independent populations.

Materials and methods

Study subjects

Study subjects were selected from a cohort of the ongoing KoGES. Participants were recruited among the residents of three cities located in rural areas of Korea: Yang-pyeong, Nam-won and Go-ryeong. The study population, termed CAVAS, comprised a total of 4052 participants (aged 38–89 years) who were recruited from 2004 to 2008. Written informed consent was obtained from all participants, and this research project was approved by the institutional review board of the Korea National Institute of Health.

The CAVAS subjects did not have self-reported metabolic diseases (type 2 diabetes, hypertension or hyperlipidemia), cardiovascular diseases (myocardial infarction or stroke) or cancers. In addition, they had a normal range of blood pressure (systolic blood pressure <130 mm Hg and diastolic blood pressure <90 mm Hg) and fasting glucose levels (<126 mg dl−1), as measured during health examinations of the cohort members.

For the replication analyses, we used two independent Korean populations, from the Korean Association Resource (KARE) and Health Examinee (HEXA) study cohorts. The KARE is a consortium for analyzing the Ansung–Ansan population, and the HEXA is a cohort from two metropolitan areas (Seoul and Pusan); both are part of the KoGES project. The initial populations of the KARE and HEXA cohorts included 8842 and 3703 individuals, respectively, and we applied the same inclusion criteria that were used for the CAVAS cohort. Ultimately, 3679 and 2123 individuals, respectively, were selected for the replication study. Detailed descriptions of the KARE and HEXA cohorts have been reported elsewhere.4, 9

Fasting serum lipid levels (HDL cholesterol, triglycerides and total cholesterol) were measured. Biochemical measurements were obtained in the morning before the first meal of the day. The concentration of LDL cholesterol was calculated using Friedewald’s formula (LDL cholesterol=total cholesterol−HDL cholesterol−(triglycerides/5)). LDL cholesterol was set to missing for individuals with triglycerides >400 mg dl−1.
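The LDL cholesterol calculation described above can be illustrated with a minimal sketch. The function name and the example values are hypothetical and are not taken from the study data; the only rules encoded are Friedewald's formula and the missing-value rule for triglycerides >400 mg dl−1 stated in the text.

```python
from typing import Optional


def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float) -> Optional[float]:
    """Estimate LDL cholesterol (mg/dl) with Friedewald's formula.

    Returns None (a missing value) when triglycerides exceed 400 mg/dl,
    mirroring the rule described in the text.
    """
    if triglycerides > 400:
        return None
    return total_chol - hdl - triglycerides / 5.0


# Hypothetical example values in mg/dl, not taken from the study data.
print(friedewald_ldl(200.0, 50.0, 150.0))  # 120.0
print(friedewald_ldl(240.0, 45.0, 450.0))  # None
```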

Genotyping and quality control

Genomic DNA samples, isolated from peripheral blood drawn from CAVAS cohort participants, were genotyped using the Illumina Human 1M-duo Beadchip (Illumina Inc., San Diego, CA, USA). A total of 4052 samples were genotyped using 500 ng genomic DNA with an Illumina chip. The Bayesian Robust Linear Modeling using Mahalanobis Distance (BRLMM) genotyping algorithm was used for genotype calling of 1 010 624 SNPs.10 Subjects with genotype accuracies below 98% and high missing genotype call rates (>4%), high heterozygosity (>30%) or inconsistency in sex were excluded from subsequent analyses. Related individuals whose estimated identity-by-state values were high (>0.80) were also excluded. The methods used to estimate heterozygosity and identity-by-state have been described elsewhere.10 After these quality control steps, 3667 samples were selected. Markers with a high missing genotype call rate (>5%), low minor allele frequency (<0.01) or significant deviation from Hardy–Weinberg equilibrium (P<1 × 10−6) were excluded, leaving a total of 723 056 markers to be examined in 3667 individuals.
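As an illustration of the marker-level filters listed above (missing call rate >5%, minor allele frequency <0.01, Hardy–Weinberg P<1 × 10−6), the following sketch applies the three thresholds to the genotype counts of a single marker. It is not the pipeline used in the study: the function names are hypothetical, and the Hardy–Weinberg test here is a simple 1-df chi-square approximation chosen for brevity, which is an assumption rather than the test actually applied.

```python
import math


def hwe_chisq_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """1-df chi-square test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic marker: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chisq / 2.0))


def marker_passes_qc(n_aa, n_ab, n_bb, n_missing,
                     max_missing=0.05, min_maf=0.01, min_hwe_p=1e-6):
    """Apply the three marker filters from the text to one marker's counts."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    missing_rate = n_missing / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    return (missing_rate <= max_missing
            and maf >= min_maf
            and hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)


# Hypothetical counts for one marker typed in 3667 samples.
print(marker_passes_qc(2500, 1000, 100, 67))
```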

Functional SNP selection

We previously reported nsSNP GWAS results for blood pressure regulation and described the nsSNP selection methods.11 Briefly, using BIOMART version 0.7 (http://www.ensembl.org/biomart/), nsSNPs were selected from the 723 056 SNPs based on the data in dbSNP v.135. BIOMART is a web-based tool that allowed us to use information from the latest release of a database to select SNPs limited to a specific nonsynonymous category.

Estimation of genetic variance

The genetic variances were computed via GCTA v1.24,12 which is a tool for estimating the proportion of phenotypic variance that is explained by genome-wide SNPs for complex traits.13 First, we estimated the pairwise genetic relationship using the make-grm option for the nsSNPs and all SNPs on the array. We then estimated the proportion of phenotypic variance contributed by the nsSNPs and all SNPs, respectively, based on the restricted maximum likelihood.14

Replication study

Because the KARE and HEXA cohorts were genotyped using the Affymetrix 5.0 and 6.0 SNP arrays (Affymetrix, Santa Clara, CA, USA), respectively, only one-third of the SNPs on the Illumina Beadchip were available. For this reason, we used imputed SNPs reported in a previous study.4, 9 Briefly, we used the IMPUTE program (http://mathgen.stats.ox.ac.uk/impute/) for SNP imputation based on the International HapMap data (http://hapmap.ncbi.nlm.nih.gov/; phase 2, release 22 k NCBI build 36 and dbSNP build 126), including 2.2 million SNPs from 90 individuals from the JPT and CHB populations for use as a reference panel.4, 14, 15 After excluding imputed SNPs with low genotype information content (<0.05), a posterior probability score <0.90, a call rate <0.90, a minor allele frequency <0.01 or Hardy–Weinberg equilibrium P<1 × 10−6, a total of 1 573 409 SNPs for the KARE cohort and 1 984 393 SNPs for the HEXA cohort were selected.

Statistical analysis

For the quantitative analysis of HDL cholesterol, LDL cholesterol, triglyceride and total cholesterol levels, we used linear regression, controlling for the age, sex and body mass index of the cohort subjects. To assess the environmental effect, we added smoking and alcohol drinking habits as covariates.
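The regression model described above can be sketched as an ordinary least-squares fit of the phenotype on the minor-allele count (0, 1 or 2) plus covariates. This is only a minimal illustration with hypothetical function names and simulated data; the analyses in this study were run in PLINK, and the sketch returns only the genotype coefficient, without its standard error or P-value.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def additive_model_beta(genotypes, covariates, phenotype):
    """Fit phenotype ~ 1 + genotype + covariates; return the genotype coefficient."""
    rows = [[1.0, float(g)] + [float(c) for c in cov]
            for g, cov in zip(genotypes, covariates)]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, phenotype)) for i in range(p)]
    return solve_linear_system(xtx, xty)[1]


# Simulated, noise-free data (not study data): covariates are (age, sex, BMI).
geno = [0, 1, 2, 0, 1, 2, 1, 0]
covs = [(40, 0, 22), (50, 1, 24), (60, 0, 23), (70, 1, 25),
        (45, 1, 21), (55, 0, 26), (65, 0, 27), (75, 1, 20)]
ldl = [100 + 4.67 * g + 0.3 * age - 5 * sex + 0.8 * bmi
       for g, (age, sex, bmi) in zip(geno, covs)]
print(f"{additive_model_beta(geno, covs, ldl):.2f}")  # 4.67
```

Because the simulated phenotype is an exact linear function of the predictors, the fit recovers the per-allele effect used to generate it; with real data the estimate would carry sampling error.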

Statistical analyses were performed using PLINK (version 1.07).16 We conducted GWASs using the 723 056 SNPs (all SNPs) and the 11 558 nsSNPs. All tests were based on an additive model, and P-values were not adjusted for multiple tests. Significantly associated SNPs were examined in the replication study. The inverse-variance meta-analysis method, assuming fixed effects, was used to generate meta-analysis statistics. Cochran’s Q-test was used to assess between-study heterogeneity.17 All meta-analysis calculations were performed using PLINK (version 1.07).16 Possible damaging effects of nonsynonymous SNPs on protein structure were predicted using PolyPhen 2,18 SIFT,19 Mutation Taster20 and LRT predictor.21
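For readers unfamiliar with the inverse-variance fixed-effects method and Cochran's Q, the following sketch shows the standard calculations. It is not PLINK's implementation; the function name is hypothetical. The inputs are the rounded effect sizes and standard errors quoted in the Results for rs3733197, so the output is illustrative and is not expected to reproduce the reported meta-analysis statistics exactly.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect meta-analysis with Cochran's Q and I^2."""
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of the variation attributable to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"beta": beta, "se": se, "p": p_value, "q": q, "df": df, "i2": i2}


# Rounded beta and s.e. values quoted in the text for CAVAS, HEXA and KARE.
result = fixed_effect_meta([4.67, 2.88, 1.26], [0.94, 1.12, 0.97])
print(result)
```

With these rounded inputs the sketch gives an I2 of roughly 0.69, in line with the heterogeneity reported in the Results.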

Population stratification was tested by a principal component analysis using PLINK.16 To prevent overrepresentation of regions with more redundant SNPs, we used the indep-pairwise command in PLINK16 to reduce linkage disequilibrium between the remaining variants by eliminating any SNP that had a pairwise r2>0.3 with any other SNP in a 1500 bp window (step size, 150 bp). This reduced the CAVAS data set to 148 902 SNPs; subsequently, we selected 19 159 SNPs that were also present in KARE and HEXA, and 545 SNPs that were also present in HapMap 3 (n=957), for which PLINK-format files were obtained from GitHub (https://github.com/gabraham/flashpca). Using the 545 common SNPs, we computed the principal components with the PLINK --pca command.

Results

Discovery GWAS by all SNPs

The clinical characteristics of the CAVAS, KARE and HEXA cohorts are described in Table 1. In the CAVAS cohort, we conducted conventional GWASs for total cholesterol, HDL cholesterol, LDL cholesterol and triglyceride levels using 723 056 SNPs. Manhattan plots and quantile–quantile (QQ) plots are shown in Supplementary Figures S1–S4. No SNP reached the genome-wide significance level (P-value<5 × 10−8). The genomic inflation factors, which ranged from 1.000 (triglycerides) to 1.045 (LDL cholesterol), indicated no substantial deviation from the expected distribution.
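The genomic inflation factor mentioned above is conventionally computed as the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). The sketch below assumes that definition and a list of two-sided P-values as input; the function name and the evenly spaced test P-values are hypothetical and stand in for real GWAS output.

```python
from statistics import NormalDist, median


def genomic_inflation_factor(p_values):
    """Median observed 1-df chi-square statistic over its expected null median."""
    std_normal = NormalDist()
    # Convert each two-sided P-value (0 < p <= 1) back to a 1-df chi-square statistic.
    chisq = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = std_normal.inv_cdf(0.75) ** 2  # about 0.455
    return median(chisq) / expected_median


# Evenly spaced P-values mimic a null distribution, so the factor is close to 1.
null_like = [(i + 0.5) / 1001 for i in range(1001)]
print(round(genomic_inflation_factor(null_like), 3))
```

Values near 1, such as the 1.000 to 1.045 range reported here, suggest little inflation of test statistics from stratification or cryptic relatedness.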

Table 1 Basic patient characteristics and lipid measurements

We then tested SNPs previously reported to be associated with lipid traits in the GWAS catalog (Supplementary Table S1).22 Among 242 SNPs reported for the four lipid traits (total cholesterol, HDL cholesterol, LDL cholesterol and triglyceride), 122 SNPs were available in our genotype data, of which a small fraction (23 SNPs) showed P-value<0.05 (Supplementary Table S1).

Alternative approach GWAS by nsSNPs

In the CAVAS cohort, we conducted the alternative approach GWAS for total cholesterol, HDL cholesterol, LDL cholesterol and triglyceride levels using 11 558 nsSNPs. To account for multiple comparisons, we declared associations significant based on the Bonferroni correction criteria (P-value<4.3 × 10−6 (0.05/11 558)). We plotted −log10(P) values against chromosomal position on Manhattan plots (Figure 1), and all significant associations are described in Table 3.
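The arithmetic behind the Bonferroni threshold is shown in the short sketch below, which also contrasts it with the conventional genome-wide threshold of 5 × 10−8. The function name is hypothetical; the numbers are those given in the text.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


ns_threshold = bonferroni_threshold(0.05, 11558)
print(f"{ns_threshold:.2e}")  # 4.33e-06

# The discovery P-value reported for rs3733197 (1.0e-6) passes the nsSNP
# threshold but would not pass the conventional genome-wide threshold.
discovery_p = 1.0e-6
print(discovery_p < ns_threshold, discovery_p < 5e-8)  # True False
```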

Figure 1

A Manhattan plot of nonsynonymous single-nucleotide polymorphisms (SNPs) from the genome-wide association studies (GWASs) that are associated with lipid traits. SNPs are plotted based on their physical chromosomal positions (horizontal axis) together with their −log10(P-values) in the GWAS (vertical axis). The blue line indicates the suggestive threshold and the red line indicates the genome-wide significance threshold of P=5 × 10−8. HDL, high-density lipoprotein; LDL, low-density lipoprotein. A full color version of this figure is available at the Journal of Human Genetics journal online.

In the discovery GWAS using the nsSNPs, we identified one SNP (rs3733197) for LDL cholesterol levels that passed the Bonferroni correction criteria. Individuals with the minor allele of rs3733197 showed significantly increased LDL cholesterol levels compared with those with the major allele (beta±s.e.=4.67±0.94, P-value=1.0 × 10−6). The rs3733197 nsSNP is located in the coding region of the BANK1 (B-cell scaffold protein with ankyrin repeats 1) gene, and the minor allele replaces the alanine at position 383 with threonine. Interestingly, SNP rs3733197 was previously reported to be a candidate variant for systemic lupus erythematosus,23, 24 psoriasis25 and systemic sclerosis,26, 27 which are all diseases that are affected by serum cholesterol levels.28

To validate this association, we genotyped rs3733197 in other independent populations (the KARE and HEXA cohorts). The association with LDL cholesterol was significantly replicated in the HEXA cohort (beta±s.e.=2.88±1.12, P-value=0.016), but was not significant in the KARE cohort (beta±s.e.=1.26±0.97, P-value=0.196). Furthermore, the meta-analysis showed a significant association (P-value=6.19 × 10−7), but Cochran’s Q (0.013) and the heterogeneity index I2 (69%) were also significant.

Estimation of the phenotypic variance explained by genetic variances (Vg/Vp)

Vg/Vp was estimated for all 723 056 SNPs and for the 11 558 nsSNPs by the restricted maximum likelihood method,14 and the results are described in Table 2. The full SNP set was ~70 times larger than the nsSNP set; however, compared with the Vg/Vp explained by all SNPs, the Vg/Vp explained by nsSNPs was similar for LDL cholesterol (4.2% for nsSNPs vs 5.4% for all SNPs) and about a third for total cholesterol (1.9% for nsSNPs vs 4.8% for all SNPs) and for HDL cholesterol (0.6% for nsSNPs vs 1.8% for all SNPs). Therefore, we inferred that the nsSNPs explain a large proportion of the phenotypic variance of lipid traits relative to all SNPs.

Table 2 The proportion of phenotypic variance explained by genotypic variance (Vg/Vp)

Discussion

The nsSNP rs3733197 identified in our study has not been reported in previous GWASs of lipid traits. Our study subjects were selected from general population cohorts (CAVAS, KARE and HEXA) and had normal values for the intermediate phenotypes of metabolic diseases (that is, diabetes, hypertension and hyperlipidemia) and no cancers. When we ran the all-SNP GWAS on subjects without excluding those with diabetes or hypertension, we found no significantly associated SNP (data not shown). We therefore suggest that the nsSNP association with lipid levels might be specific to healthy populations, and that its effect could be diluted by the effects of other genetic risk variants in mixed populations.

The minor allele frequencies of SNP rs3733197 in the three cohorts were similar (see Table 3), but Cochran’s Q (0.013) and the heterogeneity index I2 (69%) were significant. Therefore, we examined whether population stratification could have led to a spurious association. First, we conducted a principal component analysis using the genome-wide SNPs of the three cohorts (CAVAS, KARE and HEXA). As shown in Supplementary Figure S5, the genotypes of the cohorts clustered together, indicating no differences in genetic structure. Next, we examined environmental factors, such as smoking and alcohol drinking habits, that might differentially affect the association results. The KARE cohort had a higher proportion of current smokers. We therefore tested the association between rs3733197 and LDL cholesterol using smoking and alcohol drinking as covariates. The results showed slightly increased significance for all three cohorts, but the P-value for KARE remained non-significant (see Table 3).

Table 3 Significant results from the nsSNP GWAS for LDL cholesterol levels and validation in two independent populations (the KARE and HEXA cohorts)

BANK1 has four functional domains (DBB, ANK1, ANK2 and interaction with ITPR2); rs3733197 is located in the ANK2 domain, which is known to function as a protein–protein interaction motif. On the basis of the Ala to Thr substitution caused by rs3733197, this SNP was predicted to be ‘probably damaging’ to the BANK1 protein structure, with a PolyPhen risk score of 0.983. The PolyPhen risk score indicates the Bayes posterior probability that a given mutation is damaging. Although the PolyPhen risk score suggested a high probability of reduced protein stability or function,18 the other prediction tools indicated a mild effect of the variant: the SIFT score was 0.27 (tolerated),19 Mutation Taster predicted a harmless polymorphism20 and the LRT predictor classified the variant as neutral.21

BANK1 is predominantly expressed in B cells, but low expression can be detected in other cell populations.29 BANK1 functions in B-cell receptor-induced calcium mobilization from intracellular stores.30 In addition, the protein can promote tyrosine phosphorylation of inositol 1,4,5-triphosphate receptors.31 Mice homozygous for a Bank1 knock-out allele exhibit enhanced B-cell responses.32

Increased cholesterol levels are associated with a greater abundance of B cells.32 In our GWAS, individuals with the minor allele of rs3733197 showed significantly increased LDL cholesterol levels. Therefore, the genotype effect of rs3733197 may contribute to B-cell activation through interactions with cholesterol levels.

Interestingly, a recent clinical study indicated that statins (a class of cholesterol-lowering drugs), which were originally developed to treat lipid disorders, had immune-modulating effects in systemic lupus erythematosus patients.33 Systemic lupus erythematosus [MIM:152700] is a chronic autoimmune disorder of the connective tissues that affects organs such as the skin, joints and kidneys. Therefore, our results suggest that the effects of rs3733197 and statins on systemic lupus erythematosus may be related, and an improved understanding of the role of LDL cholesterol in these contexts should be a goal of future studies.

In conclusion, we identified a novel functional variant of BANK1 that affects the serum lipid levels in three Korean cohorts. Our findings suggest several hypotheses for the effects of BANK1 on several immune disorders and the interactions of BANK1 alleles with cholesterol levels. Furthermore, we expect our strategy that focuses on nsSNPs to be an efficient way to discover functional associations from whole-genome data sets.