Fine mapping the CETP region reveals a common intronic insertion associated to HDL-C

Background: Individuals with exceptional longevity and their offspring have significantly larger high-density lipoprotein cholesterol (HDL-C) particle sizes owing to increased homozygosity for the I405V variant in the cholesteryl ester transfer protein (CETP) gene. In this study, we investigate the association of CETP and HDL-C further to identify novel, independent CETP variants associated with HDL-C in humans. Methods: We performed a meta-analysis of HDL-C within the CETP region using 59,432 individuals imputed with 1000 Genomes data. We performed replication in an independent sample of 47,866 individuals and validated the insertion by Sanger sequencing. Results: The meta-analysis of HDL-C within the CETP region identified five independent variants, including an exonic variant and a common intronic insertion. All five variants replicated significantly in an independent sample of 47,866 individuals. Sanger sequencing of the insertion within a single family confirmed segregation of this variant. The strongest previously reported association between HDL-C and a CETP variant was for rs3764261; however, after conditioning on the five novel variants we identified, the support for rs3764261 was greatly reduced (βunadjusted=3.179 mg/dl (P value=5.25×10−509), βadjusted=0.859 mg/dl (P value=9.51×10−25)). This finding suggests that these five novel variants may partly explain the association of CETP with HDL-C. Indeed, three of the five novel variants (rs34065661, rs5817082, rs7499892) are independent of rs3764261. Conclusions: The causal variants in CETP that account for the association with HDL-C remain unknown. We used studies imputed to the 1000 Genomes reference panel for fine mapping of the CETP region. We identified and validated five variants within this region that may partly account for the association of the known variant (rs3764261), as well as other sources of genetic contribution to HDL-C.


INTRODUCTION
Aging is characterized by a deterioration in the maintenance of homeostatic processes over time, leading to functional decline and increased risk for disease and death. 1 One of the genes linked to healthy aging and longevity is the cholesteryl ester transfer protein (CETP) gene. 1,2 Homozygosity for the 405V variant (405VV) of CETP is associated with lower concentrations of CETP, higher concentrations of high-density lipoprotein cholesterol (HDL-C), and greater HDL-C particle size, all of which are associated with both protection against cardiovascular disease 3 and exceptional longevity. 4 Functional analyses in mice, 5 hamsters, 6 and rabbits 7 have revealed that the protein encoded by the CETP gene mediates the transfer of cholesteryl esters from HDL-C to other lipoproteins, such as atherogenic (V)LDL particles, and is a key participant in the reverse transport of cholesterol from the periphery to the liver. 8 Because of the function of CETP and the association of the gene with HDL-C in humans, 9,10 the CETP gene is one of the targets of drug development for dyslipidemia. 6,11,12 CETP inhibition increases HDL-C by 30% to 140%, depending on the compound used. The first drug of its class, torcetrapib, was associated with increased mortality and morbidity in patients receiving the CETP inhibitor in addition to atorvastatin. 13,14 The estimated heritability of HDL-C levels is high in humans: 47-76%. [15][16][17][18][19][20][21][22][23] Previously published whole-genome sequence data 23 reported that common variants (minor allele frequency (MAF) >1%) explain up to 61.8% of the variance in HDL-C levels and that rare variants (MAF <1%) explain an additional 7.8% of the variance. Genome-wide association studies revealed that numerous variants are associated with HDL-C, among which are various common 9,10 and rare 24,25 variants within the CETP gene in multiple ancestries. 4,8,[26][27][28] In this paper, we investigate the association between CETP and HDL-C in humans in further detail to identify variants that are likely to be causal.
To this end, we used a meta-analysis of association studies with imputed genotypes within the CETP region. Our study consisted of data from 59,432 samples whose genotypes were imputed to the 1000 Genomes project reference panel (version Phase 1 integrated release v3, April 2012, all populations). By using 1000 Genomes imputed data, we expected to find more rare or low-frequency variants, as well as novel insertions and deletions.

Study descriptions
The descriptions of the participating cohorts can be found in the Supplementary Information. All studies were performed with the approval of the local medical ethics committees, and written informed consent was obtained from all participants.

Study samples and phenotypes
The total number of individuals in the discovery phase was 59,432 and in the replication phase 47,866. Of the discovery samples, 44,108 individuals (74.21%) were of European ancestry. Of the replication samples, 47,081 individuals (98.36%) were of European ancestry. A summary of the details of both the discovery and replication cohorts participating in this study can be found in Supplementary Table 1.

Genotyping and imputations
All cohorts were genotyped using commercially available Affymetrix or Illumina genotyping arrays, or custom Perlegen arrays. Quality control was performed independently for each study. To facilitate meta-analysis and replication, each discovery and replication cohort performed genotype imputation using IMPUTE2 29

Association analysis in discovery cohorts
The lipid measurements were adjusted for sex, age, and age² in all cohorts, and if necessary also for cohort-specific covariates (Supplementary Table 1). Some cohorts included samples using lipid-lowering medication; we did not adjust for lipid-lowering medication in our analysis because HDL-C levels are only minimally influenced by lipid-lowering medication. Each discovery cohort ran association analysis for all variants within the CETP region (chromosome 16, 56.99-57.02 Mbp) with HDL-C.
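The covariate adjustment described above can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit and uses simulated toy data; the function name `adjust_hdl` and all values are hypothetical, and the individual cohorts may have used different software or additional cohort-specific covariates.

```python
import numpy as np

def adjust_hdl(hdl, sex, age):
    """Return HDL-C residuals after regressing out sex, age, and age squared.

    Illustrative only: a plain least-squares adjustment, not the cohorts' pipeline.
    """
    hdl = np.asarray(hdl, dtype=float)
    sex = np.asarray(sex, dtype=float)
    age = np.asarray(age, dtype=float)
    # Design matrix: intercept, sex, age, age squared.
    X = np.column_stack([np.ones_like(age), sex, age, age ** 2])
    coef, *_ = np.linalg.lstsq(X, hdl, rcond=None)
    return hdl - X @ coef

# Toy data (hypothetical values, not taken from any cohort).
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200)
sex = rng.integers(0, 2, size=200)
hdl = 50 + 5 * sex + 0.1 * age - 0.001 * age ** 2 + rng.normal(0, 10, size=200)
residuals = adjust_hdl(hdl, sex, age)
```

The residuals, rather than the raw HDL-C values, would then serve as the phenotype in the per-variant association test.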

Meta-analysis of discovery cohorts
The association results of all discovery cohorts for all variants within the CETP region (chromosome 16, 56.99-57.02 Mbp) were combined using inverse-variance weighting as applied by METAL. 31 This tool also applies genomic control, automatically correcting the test statistics for small amounts of population stratification or unaccounted relatedness, and allows for heterogeneity. Prior to meta-analysis, we applied the following filters to the variants: 0.3 < R² ≤ 1.0 (R² being the measure of imputation quality) and expected minor allele count (expMAC = 2 × MAF × R² × sample size) > 10. After meta-analysis of all available variants, we excluded the variants that were not present in at least three cohorts, to prevent false-positive findings.
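As a rough illustration of the two steps in this paragraph, the sketch below applies the variant filters and then pools per-cohort estimates with fixed-effect inverse-variance weights. It is not METAL itself: genomic control and heterogeneity testing are omitted, the function names are hypothetical, and treating the upper R² bound as inclusive is an assumption.

```python
import math

def passes_filters(maf, r2, n, min_r2=0.3, max_r2=1.0, min_expmac=10):
    """Imputation-quality and expected-minor-allele-count filters.

    Assumption: the upper R^2 bound is inclusive.
    """
    exp_mac = 2 * maf * r2 * n  # expMAC = 2 x MAF x R^2 x sample size
    return (min_r2 < r2 <= max_r2) and (exp_mac > min_expmac)

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted pooled effect and its s.e."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se

# Toy example with two hypothetical cohorts of equal precision.
pooled_beta, pooled_se = inverse_variance_meta([1.0, 3.0], [1.0, 1.0])
```

With equal standard errors the pooled effect is the simple average of the cohort effects; a cohort with a smaller standard error would pull the pooled estimate toward its own value.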
Fine mapping the association between HDL-C and CETP EM van Leeuwen et al

Selection of independent variants
To select only variants that were independently associated with HDL-C, we used the Genome-wide Complex Trait Analysis (GCTA) tool, version 1.13. 32 Although this tool supports multiple functionalities, we used only the function for conditional and joint genome-wide association analysis. This function performs a stepwise selection procedure to select independent single-nucleotide polymorphism (SNP) associations by a conditional and joint analysis approach. It uses summary-level statistics from the meta-analysis, and linkage disequilibrium (LD) correlations between SNPs are estimated from the 1000 Genomes reference panel (1000G Phase I Integrated Release Version 22 Haplotypes (2010-11 data freeze, 14 February 2012 haplotypes)). GCTA estimates the effective sample size and determines the effect size, the s.e., and the P value from a joint analysis of all the selected SNPs. In this way, we selected the best associated variants in CETP. We subsequently checked whether these variants were in LD within the 1000 Genomes reference panel using PLINK 33 software (Supplementary Table 3).
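The idea behind a joint analysis from summary statistics can be sketched in simplified form. Under the assumption that genotypes and phenotype are standardized and all SNPs share the same sample, the marginal effects equal the LD correlation matrix multiplied by the joint effects, so the joint effects follow from solving that linear system. This is a toy illustration only, not GCTA's implementation, which additionally handles allele frequencies, per-SNP sample sizes, standard errors, and the stepwise selection; the function name is hypothetical.

```python
import numpy as np

def joint_effects(marginal_betas, ld_matrix):
    """Approximate joint effects from marginal effects and LD correlations.

    Assumes standardized genotypes, so that marginal = R @ joint.
    """
    b = np.asarray(marginal_betas, dtype=float)
    R = np.asarray(ld_matrix, dtype=float)
    return np.linalg.solve(R, b)

# Toy example: SNP 1 is causal, SNP 2 is only correlated with it (r = 0.5).
ld = np.array([[1.0, 0.5],
               [0.5, 1.0]])
true_joint = np.array([1.0, 0.0])
marginal = ld @ true_joint            # SNP 2 shows a marginal effect through LD
recovered = joint_effects(marginal, ld)
```

In the toy example the second SNP has a nonzero marginal effect purely because of LD, and its joint effect returns to zero once the first SNP is accounted for.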

Replication of independent CETP variants
Five variants were selected for replication in a sample of 12 independent cohorts: Athero-Express, CHS, FINCAVAS, LBC1936, Lifelines, LLS, NTR-NESDA, PREVEND, PROSPER, QIMR, TRAILS, and YFS. The lipid measurements were adjusted for sex, age, and age² in all cohorts, and if necessary also for cohort-specific covariates (Supplementary  31 We used the genotypes of all 1,092 individuals of the 1000 Genomes project to calculate the correlation between the 38 variants. This correlation matrix was used by matSpDlite, 34 which examines the ratio of observed eigenvalue variance to its theoretical maximum to determine the number of independent variables. For these 38 genome-wide significant variants within the CETP region, the effective number of independent variables is 18, and therefore the experiment-wide significance threshold required to keep the type I error rate at 5% is 2.85×10−3.
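The threshold arithmetic can be checked with a short sketch. The first function follows the eigenvalue-variance description in the text (an estimator of the form 1 + (M − 1)(1 − Var(λ)/M)); whether matSpDlite computes exactly this quantity is an assumption. The second function uses a Šidák-type correction, which is also an assumption, chosen because it reproduces the reported 2.85×10−3 for 18 effective tests, whereas a plain Bonferroni division (0.05/18) would give about 2.78×10−3.

```python
import numpy as np

def effective_number_of_tests(corr):
    """Eigenvalue-variance estimate of the number of independent variables."""
    corr = np.asarray(corr, dtype=float)
    m = corr.shape[0]
    eigenvalues = np.linalg.eigvalsh(corr)
    var_obs = np.var(eigenvalues, ddof=1)  # theoretical maximum of this variance is m
    return 1 + (m - 1) * (1 - var_obs / m)

def sidak_threshold(m_eff, alpha=0.05):
    """Per-test threshold that keeps the experiment-wide type I error at alpha."""
    return 1 - (1 - alpha) ** (1 / m_eff)

threshold = sidak_threshold(18)        # effective number of tests from the text
uncorrelated = effective_number_of_tests(np.eye(4))       # fully independent
fully_correlated = effective_number_of_tests(np.ones((4, 4)))  # fully redundant
```

The two toy matrices show the limiting behavior: uncorrelated variables count in full, and perfectly correlated variables collapse to a single effective test.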

Conditional analysis of independent CETP variants
The replicated independent variants were selected for conditional analysis in both the discovery and the replication cohorts. In this analysis we adjusted for the lead SNP for this region as reported by Teslovich et al. 9 (rs3764261, chromosome 16, position 56,993,324 bp). The association results of all discovery and replication cohorts were combined and the s.e. based weights were calculated by METAL. 31 The Bonferroni-corrected P value for multiple testing was 0.01, since none of the five variants is in LD (Supplementary Table 3).

Validation of the new CETP insertion within a family
Within the ERF study, 3,658 individuals have been genotyped on various Illumina (Illumina, San Diego, CA, USA) and Affymetrix chips (Affymetrix, Santa Clara, CA, USA), followed by imputations with MaCH (1.0.18c) and Minimac (minimac-β-14 March 2012) to the 1000 Genomes reference panel. Based on the best guess imputed genotypes, we selected one family in which we expected the insertion to segregate.
Validation of the insertion was performed by Sanger sequencing. Genomic DNA was isolated from peripheral blood using standard protocols (salting-out). Intron 2-3 of the CETP gene (Supplementary Table 4) was amplified by PCR using the following primer sequences: forward, 5ʹ-tgggggactcaggtctctcc-3ʹ; reverse, 5ʹ-aaagcacctggcccacaacc-3ʹ; product size 409 bp.

Meta-analysis in all discovery cohorts to select independent variants
The association of all variants within the CETP region (chromosome 16, 56.99-57.02 Mbp) with HDL-C was tested in all discovery cohorts. These results were combined using the inverse-variance weights as applied by METAL. 31 After exclusion of the variants that were not present in at least three cohorts, 254 variants remained (Figure 1). A conditional and joint analysis of the 254 variants using GCTA identified five independent variants (Figure 2). Three variants were intronic (rs5817082, rs4587963, and rs7499892), one variant was intergenic (rs12920974), and one variant was exonic (rs34065661) (Table 1). Using PLINK software, 33 we calculated the LD between the five variants based on the 1000 Genomes reference panel, and found that none are in high LD with each other (Supplementary Table 3).

Test to explain the previously published results
In each discovery and replication cohort, we tested whether the five independent variants explain the associations within the CETP region (chromosome 16, 56.99-57.02 Mbp) as reported in the study by Teslovich et al. 9 We tested a total of 38 genome-wide significant (P value <5×10−8) SNPs within this region identified by Teslovich et al. 9 and conditioned on the five independent variants in all discovery and replication cohorts.
Abbreviations: EA, effect allele, the allele for which the effect on HDL-C is estimated; Freq, the frequency of the reference allele in the discovery cohorts; Freq_geno, the frequency of the variant within the reference panel. β is the effect of the effect allele. βj is the effect of the effect allele after joint analysis of all selected variants by GCTA.

Conditional analysis of the independent CETP variants
Next, we performed conditional analysis of the independent variants in both the discovery and replication cohorts. We conditioned on the lead SNP for the CETP region as reported by Teslovich et al. 9 (rs3764261, chromosome 16, position 56,993,324 bp); see Table 4 and Figure 4. This analysis showed that three out of the five variants (rs34065661, rs5817082, rs7499892) are independent of rs3764261. For all variants the P values and β's decreased, but all P values remained significant.
The question marks mean that the variant was removed prior to meta-analysis due to a low imputation quality and/or expMAC <10.
of HDL-C within this family is 27.47%. DNA was available for 16 individuals. Figure 5 shows the results of the Sanger sequencing for rs5817082 for these 16 individuals within the family. The sequencing of the insertion confirmed the best-guess results for 10 individuals (62.5%), of whom 7 were heterozygous for the insertion, 1 was homozygous for the insertion, and 2 did not carry the insertion. Three individuals who are homozygous for the insertion were predicted to be heterozygous by the best-guess imputations. Three individuals who are heterozygous for the insertion were not predicted to carry the insertion by the best
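The concordance figure follows directly from the counts given in this paragraph, as the short tally below shows. The genotype labels (RR for no insertion, IR for heterozygous, II for homozygous) are illustrative codes built from the I/R notation of Figure 5, not identifiers from the study data.

```python
from collections import Counter

# (best-guess imputed genotype, Sanger genotype) -> number of individuals,
# as described in the text. Labels are illustrative:
# RR = no insertion, IR = heterozygous, II = homozygous for the insertion.
pairs = Counter({
    ("IR", "IR"): 7,   # confirmed heterozygous
    ("II", "II"): 1,   # confirmed homozygous
    ("RR", "RR"): 2,   # confirmed non-carriers
    ("IR", "II"): 3,   # imputed heterozygous, sequenced homozygous
    ("RR", "IR"): 3,   # imputed non-carrier, sequenced heterozygous
})

total = sum(pairs.values())
concordant = sum(n for (imputed, sanger), n in pairs.items() if imputed == sanger)
concordance = concordant / total
print(f"{concordant}/{total} concordant ({concordance:.1%})")
```

In both discordant groups the imputation carried one insertion allele fewer than was found by sequencing.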

DISCUSSION
We conducted an analysis to fine map the association between CETP genetic variants and HDL-C. To this end, a total of 59,432 samples were imputed to the latest version of the 1000 Genomes reference panel (version Phase 1 integrated release v3, April 2012, all populations). We identified and replicated five independent variants within the CETP region (chromosome 16, 56.99-57.02 Mbp), of which four are SNPs and one is an insertion. We validated the insertion by Sanger sequencing within a large family, as the largest effect on HDL-C comes from this insertion. The relationship between the CETP gene and HDL-C has been known for a long time, 9 and genome-wide association studies have revealed many common and rare variants in this region. Although the associated genetic variants are strongly correlated with HDL-C, the causal variants have not been determined. Our study showed that using the latest 1000 Genomes reference panel gives more power to fine map this association. Conditioning on the five variants reduced the significance of the genome-wide significant associations previously published by Teslovich et al. 9 Furthermore, conditional analysis showed that three out of the five variants are independent of the lead SNP for the CETP region as reported by Teslovich et al. 9 (rs3764261).
Several fine-mapping efforts have been published previously, 36,37 and all of them used sequencing for the fine mapping. In our project we did not use sequencing, but imputation using the 1000 Genomes as a reference panel. This method has been widely used in the past and is much lower in cost. With new reference panels available, we were able to revisit this region. The 1000 Genomes reference panel consists of 30 million variants, including a million insertions and deletions. By using this reference panel for imputation, we were able to impute these insertions and deletions in 59,432 samples from various cohorts. This led to the significant association of an insertion within a known region with HDL-C. So far, no association between a structural variation and HDL-C has been found in such a large sample size. Validation of the insertion by Sanger sequencing confirmed the imputed genotype of this insertion in 62.5% of the individuals: seven heterozygous carriers, one homozygous carrier, and two individuals who did not carry the insertion. The results of this study showed that by using the 1000 Genomes reference panel, the proportion of the variance explained can be increased and that multiple common variants in the same region may be implicated in a single family of the ERF study. The insertion we identified in this study explains 35.50% of variation in the HDL-C level in a single family of the ERF study; this is in concordance with the results of the whole-genome sequence data. 23 This is much higher than the proportion of the variance explained (14.11%) in the same family by rs3764261, which was reported before as the lead variant of this region. Fine mapping of various associations may help us to unravel the genetic background of various phenotypes.
Although rs3764261 was identified by Teslovich et al. 9 as the lead SNP of this region, other variants are used in clinical settings. Three of the classical variants are located in the promoter region of the CETP gene: the −1337C/T (rs708272 or Taq1B), −971G/A, and −629C/A (rs1800775) polymorphisms. 38 Carriers of the B2 allele of the common Taq1B polymorphism exhibit lower plasma CETP levels and higher HDL-C. Furthermore, a recent meta-analysis showed that the B2 allele is associated with a reduced risk for coronary heart disease. 39 One more classical variant is rs5882A (405I/V), which is located outside the promoter region. 40 The −1337C/T and −629C/A polymorphisms are in strong LD with each other; however, they are in very low LD (r² of 0.442 for rs708272 and 0.461 for rs1800775) with rs3764261, despite the fact that all three variants are within 3,000 bp of each other.
Large HDL-C particle sizes have been associated with exceptional longevity before and with an increased homozygosity for the I405V variant within the CETP gene. [1][2][3][4] Many of the studies confirm this relationship; however, all are based on genotyping of the I405V variant. Our study shows that more variants within the CETP gene are associated with HDL-C levels in the blood circulation. Therefore, we suggest investigating more variants within the CETP gene for their association with longevity and healthy aging.
Some genetic variants identified in our study were published before, 41,42 but so far no conditional analyses have been performed with these variants. Our study suggests that various CETP variants may be relevant for HDL-C levels in the blood circulation and that these may have a substantial role in the heritability of HDL-C in specific families.
Figure 5. Validation of the insertion (rs5817082) within a large family. The numbers in the first row give the imputed dosage for rs5817082, the second row the best-guess genotype (I is insertion, R is reference), and the third row the genotype of the insertion from Sanger sequencing.