Introduction

Serum concentrations of low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), triglycerides (TGs) and total cholesterol (TC) are highly heritable phenotypes associated with the risk of cardiovascular morbidity and mortality.1, 2, 3, 4 A number of genome-wide association studies (GWASs) successfully identified multiple genes influencing circulating lipid levels.5, 6, 7, 8, 9, 10, 11, 12 There are currently over 100 established loci that include both common variants with relatively small effects as well as a considerable number of rare variants with large effects.13 Despite these successes, a substantial proportion of the heritability of each trait remains unexplained, suggesting that many determinants have yet to be identified.14

Several plausible explanations may underlie the unexplained heritability of lipid traits, including the presence of both unknown common variants with small effects and novel rare variants with larger effects. The ENGAGE GWAS5 was one of the first large population-based studies designed to find variants associated with circulating lipid levels. The study, based on 16 European cohorts including up to 22 562 individuals, identified 6 novel loci, in addition to replicating 16 previously known loci. However, as demonstrated by the recent GWAS from the Global Lipid Genetics Consortium (GLGC), numerous additional variants passed the genome-wide significance threshold as a result of increased sample size.15 The GLGC GWAS, which included over 100 000 individuals of European ancestry, reported 95 loci, with 59 reaching genome-wide significance for the first time. These results raise an interesting question: if common variants remain to be discovered, how many should we expect? Are there still a limited number of loci or can we expect a polygenic mechanism that involves a very large number of variants with very small effects? In the latter case, these variants would contribute to a continuous spectrum of alleles spanning the genome and single genes involved in this complex polygenic model might not be detectable by GWAS, regardless of sample size.16 Evidence for this type of genetic architecture can be shown using a genome-wide scoring approach, as was recently demonstrated for a number of psychiatric outcomes.17, 18, 19 Additionally, these polygenic scores may provide extra information useful in determining P-value thresholds for pathway analysis.

The current study aimed to explore the extent to which common variation accounts for the unexplained heritability of circulating lipid levels using the genome-wide scoring method. We also evaluated the evidence for a common polygenic effect underlying different lipid traits, using the same risk scoring approach. Finally, we examined the utility of genome-wide polygenic scores for identifying pathways beyond those identified using a classical GWAS approach.

Materials and methods

The polygenic risk score approach involves using results from a discovery set to explore the genetic architecture of an independent target sample. Our discovery set consisted of the meta-analysis of 16 European populations from the ENGAGE Lipid Consortium (N=17 798–22 562) (Table 1). A detailed description of this study, including populations, genotyping information and statistical analysis, was previously published.5

Table 1 Descriptive data of discovery and replication samples

The target sample consisted of RS-II, an extension of the Rotterdam Study (RS), a prospective cohort study started in 1990 in the Ommoord district of the city of Rotterdam. RS-II, which was not a part of the ENGAGE discovery set, consists of 3011 participants (out of 4472 invitees) who were 55 years or older during the recruitment period (2000–2001).20 Of the 3011, 2540 persons were successfully genotyped with an Illumina 610K array. Fasting HDL-C, TG and TC were measured with enzymatic colorimetric tests on a Roche/Hitachi 911 analyzer (Roche Diagnostics, Meylan, France). LDL-C was estimated using the Friedewald formula.21
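
As an illustration of the Friedewald estimate mentioned above, the following minimal sketch computes LDL-C from TC, HDL-C and TG. The function name, the units argument and the TG cut-off are assumptions made for this example; the text does not state which units or exclusion rule were applied in RS-II.

```python
def friedewald_ldl(tc, hdl, tg, units="mmol/L"):
    """Estimate LDL-C from TC, HDL-C and TG with the Friedewald formula."""
    # TG divided by a constant approximates VLDL cholesterol:
    # 2.2 when values are in mmol/L, 5 when they are in mg/dL.
    divisors = {"mmol/L": 2.2, "mg/dL": 5.0}
    # Assumption: the estimate is conventionally not used above
    # roughly 4.5 mmol/L (400 mg/dL) of TG, so None is returned there.
    limits = {"mmol/L": 4.5, "mg/dL": 400.0}
    if units not in divisors:
        raise ValueError("units must be 'mmol/L' or 'mg/dL'")
    if tg > limits[units]:
        return None
    return tc - hdl - tg / divisors[units]


# Hypothetical fasting values in mg/dL.
print(friedewald_ldl(200.0, 50.0, 150.0, units="mg/dL"))
```

Because the estimate is a linear combination of the three measured lipids, any measurement error in TC, HDL-C or TG carries over to LDL-C.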

SNPs included in the construction of the polygenic scores were based on the results from the ENGAGE study. We selected different clusters of SNPs for the calculation of the scores using several P-value thresholds (Pdiscovery) ranging from 5 × 10−8 to 0.5. We calculated genetic scores for those various clusters of SNPs in the target sample by multiplying the number of risk alleles for each SNP (0, 1 or 2) by the effect sizes from the discovery set, and summing them up across all the SNPs in that cluster. For this analysis, we used the PLINK ‘profile scoring’ option. SNPs that had a call rate <90% or HWE P-value <1 × 10−8 were excluded from these computations. A/T and G/C polymorphisms were also excluded to avoid potential strand inconsistencies. SNPs in linkage disequilibrium (LD) were pruned over 200 SNP sliding windows using a pairwise r2 threshold of 0.25 in PLINK.22 LD pruning was performed per SNP cluster. (See Supplementary Table 2 for the number of SNPs remaining in each cluster and used for analysis.)
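
The score construction described above can be sketched as follows. This is an illustrative reimplementation of the weighted sum, not the PLINK routine used in the study. The SNP identifiers, effect sizes and genotypes are hypothetical, and skipping SNPs that are absent from a person's genotypes is an assumption of this sketch.

```python
def select_snps(discovery, p_threshold):
    """Return SNP ids whose discovery P-value is below the threshold."""
    return [snp for snp, (beta, p) in discovery.items() if p < p_threshold]


def polygenic_score(allele_counts, discovery, snps):
    """Sum over SNPs of allele count (0, 1 or 2) times discovery effect size."""
    # Assumption: SNPs without a genotype for this person are skipped.
    return sum(allele_counts[snp] * discovery[snp][0]
               for snp in snps if snp in allele_counts)


# Hypothetical discovery results: SNP id -> (effect size, P-value).
discovery = {
    "snp_a": (0.10, 1e-9),
    "snp_b": (-0.05, 3e-6),
    "snp_c": (0.02, 0.2),
}
# Hypothetical counts of the allele each effect size refers to.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

# One score per P-value threshold, from strict to liberal.
for threshold in (5e-8, 1e-5, 0.5):
    cluster = select_snps(discovery, threshold)
    print(threshold, cluster, polygenic_score(person, discovery, cluster))
```

Each more liberal threshold adds SNPs to the cluster, so the scores for successive thresholds are nested.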

The associations between these scores and serum lipid levels were tested in SPSS for Windows version 15 (SPSS, Chicago, IL, USA) using linear regression models, with sex, age and age2 as covariates (the same covariates as included in the discovery GWAS). The proportion of total variance explained by the genetic score, here referred to as the percentage of explained variance (PEV), was determined by comparing models with/without the risk score.
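
One way to read the model comparison above is as the gain in R2 when the score is added to a covariate-only regression. The sketch below follows that reading, which is an assumption rather than a description of the SPSS procedure. All data are hypothetical, and age is centred before squaring purely for numerical stability, which does not change the fitted values.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(predictors, y):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def pev(sex, age, score, lipid):
    """Percentage of explained variance: R^2 gain from adding the score."""
    mean_age = sum(age) / len(age)
    centred = [a - mean_age for a in age]
    base = [(s, a, a * a) for s, a in zip(sex, centred)]
    full = [(s, a, a * a, g) for s, a, g in zip(sex, centred, score)]
    return 100.0 * (r_squared(full, lipid) - r_squared(base, lipid))


# Hypothetical target-sample data for eight individuals.
sex = [0, 1, 0, 1, 0, 1, 0, 1]
age = [55, 60, 62, 67, 70, 73, 78, 81]
score = [0.1, -0.2, 0.3, 0.0, -0.1, 0.4, -0.3, 0.2]
lipid = [1.40, 1.25, 1.50, 1.30, 1.20, 1.60, 1.10, 1.45]
print(pev(sex, age, score, lipid))
```

Because the two models are nested, the R2 gain cannot be negative apart from rounding error.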

To evaluate whether the PEV results were driven by the GWAS hits, we also constructed a variable comprising only the significant GWAS variants and included it as a covariate in our original models. When calculating the polygenic scores for these analyses, we also removed SNPs within 2 Mb windows surrounding the GWAS hits. We employed exactly the same pruning approach for this analysis.
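
The exclusion of SNPs around the GWAS hits can be sketched as a simple positional filter. The text does not specify whether the 2 Mb window is centred on each hit, so the sketch assumes a centred window (1 Mb on either side) and exposes the width as a parameter. SNP names and positions are hypothetical.

```python
def exclude_near_hits(snps, hits, window_bp=2_000_000):
    """Drop SNPs inside a window centred on any GWAS hit on the same chromosome."""
    half = window_bp // 2  # assumption: the window is centred on the hit
    kept = []
    for name, chrom, pos in snps:
        near = any(chrom == hit_chrom and abs(pos - hit_pos) <= half
                   for hit_chrom, hit_pos in hits)
        if not near:
            kept.append(name)
    return kept


# Hypothetical GWAS hit and candidate SNPs: (name, chromosome, base-pair position).
hits = [("16", 55_000_000)]
snps = [
    ("snp_a", "16", 55_500_000),  # 0.5 Mb from the hit: removed
    ("snp_b", "16", 57_000_000),  # 2 Mb from the hit: kept
    ("snp_c", "1", 55_000_000),   # different chromosome: kept
]
print(exclude_near_hits(snps, hits))
```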

To search for evidence for a shared genetic background between various lipid traits, we tested additional models in which we used the polygenic score for a particular lipid and tried to predict the others, for instance, utilizing the HDL-C polygenic score to predict TG and vice versa.

The score which yielded the highest PEV for a given lipid trait ostensibly includes the most valuable genetic information; therefore, we selected these thresholds to utilize in pathway analysis (in contrast to using only genome-wide significant loci). For these analyses, we used the PANTHER tools (http://www.pantherdb.org).23 We first tested the genome-wide significant SNPs (Pdiscovery<5 × 10−8) from the ENGAGE GWAS in the pathway analysis. These results were then compared with those obtained using alternative Pdiscovery thresholds selected on the predictive ability of the polygenic scores. After SNP selection, SNPs within gene regions were converted to gene symbols using the ‘SCAN SNP and CNV annotation database’ (http://www.scandb.org). Gene lists were tested for enrichment in three PANTHER categories: (1) pathways, (2) biological processes and (3) molecular functions. Testing for enrichment basically involves comparing one gene list to the reference list to statistically determine over- or under- representation of PANTHER classification categories. Based on the reference list, an expected value is computed (the number of genes one would expect in the list for a particular PANTHER category) and it is assumed that, under the null hypothesis, genes in the tested list are sampled from the same distribution as genes from the reference set. The Homo sapiens gene list from National Centre for Biotechnology Information was used as the reference gene list. To avoid bias caused by multiple testing, PANTHER's Bonferroni correction option was implemented. (See Supplementary Figure 1 for the overall flowchart of the study.)
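
The enrichment logic described above can be illustrated with a small sketch: the expected count follows from the category's share of the reference list, and an over-representation P-value is then Bonferroni corrected. The binomial model is an assumption made here for illustration and is not confirmed by the text; only over-representation is covered, and the function names and gene counts are hypothetical.

```python
from math import comb


def expected_count(list_size, category_in_reference, reference_size):
    """Genes expected in the tested list for one category under the null."""
    return list_size * category_in_reference / reference_size


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enrichment_p(observed, list_size, category_in_reference,
                 reference_size, n_tests):
    """Bonferroni-corrected over-representation P-value (binomial assumption)."""
    p_category = category_in_reference / reference_size
    raw = binomial_upper_tail(observed, list_size, p_category)
    return min(1.0, raw * n_tests)


# Hypothetical example: 20 tested genes, 4 of them in a category that holds
# 500 of 20 000 reference genes, with 150 categories tested.
print(expected_count(20, 500, 20000))
print(enrichment_p(4, 20, 500, 20000, n_tests=150))
```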

Results

Table 1 shows summary statistics for the discovery and target samples. The female/male ratio in the discovery set was significantly higher than in the target set (1.6 vs 1.2, P<0.001). Genome-wide significant SNPs from the ENGAGE GWAS were checked for their associations in the target sample using linear regression. Generally, the evidence of association between those SNPs and lipid levels was marginally significant or non-significant (Supplementary Table 1). The GWAS of circulating lipids in RS-II did not show any genome-wide significant findings except for SNPs in the CETP gene region, which were associated with HDL-C (rs7499892, P=3.4 × 10−13). Manhattan plots for the GWAS of HDL-C, LDL-C, TG and TC can be found in Supplementary Figure 2.

Prediction

Figure 1 shows the PEV obtained for each lipid trait using the polygenic scores generated for a number of P-value thresholds in the target sample (RS-II). For HDL-C, the polygenic score computed using 19 genome-wide significant SNPs from 8 gene regions (Pdiscovery<5 × 10−8) resulted in the maximum PEV compared with the null model (4.75%, P=3.6 × 10−30; Figure 1a). For LDL-C (Figure 1b), the maximum PEV was observed with the polygenic score that included 21 SNPs with a Pdiscovery<1 × 10−6 (2.6%, P=5.1 × 10−16). Figure 1c shows PEVs for TG levels; the score that included 12 SNPs from 8 regions with Pdiscovery<1 × 10−7 (3.8%, P=2.8 × 10−21) was the best predictor. For these traits, the variance explained decreased with the inclusion of additional SNPs in the polygenic score selected using more liberal Pdiscovery thresholds (Figures 1a–c). Finally, for TC, the highest PEV was obtained using 46 SNPs from 24 regions with Pdiscovery<10−5 (2.7%, P=1.4 × 10−16). This was higher than the PEV obtained using only the genome-wide significant SNPs (PEV=2.1%, P=8.2 × 10−13, n=20 SNPs from 11 regions; Figure 1d). As with HDL-C, LDL-C and TG, the explained variance for TC dropped when more liberal Pdiscovery thresholds were used to construct the polygenic score. For LDL-C, HDL-C and TC, all scores were significant (up to a threshold of Pdiscovery<0.5). We observed similar patterns when we used unpruned data (Supplementary Figure 3).

Figure 1

Graphs a–d show the PEV of circulating lipids with risk scores by different Pdiscovery thresholds. Adjusted for age, sex and age2. +P<5 × 10−8; *5 × 10−8<P<0.05.

Figure 2 shows the results from the second approach, in which models were adjusted for genome-wide significant variants. For HDL-C (Figure 2a), the PEV increased as SNPs were added, up to 0.5% with Pdiscovery<0.1 (P=1.0 × 10−4) and remained significant until Pdiscovery<0.5 (P=2.3 × 10−4). A similar pattern was observed for LDL-C (Figure 2b), for which the explained variance reached 0.4% (P=0.002) at a Pdiscovery threshold of 0.2. In contrast, the polygenic score for TG, when the effects of known variants were excluded, was not associated with TG levels in the target population (Figure 2c). For TC (Figure 2d), the maximum PEV was observed with Pdiscovery<1 × 10−5 (0.6%, P=1.8 × 10−4).

Figure 2

Graphs a–d show the PEV of circulating lipids when the top regions are excluded. Adjusted for age, sex, age2 and the risk score computed from genome-wide significant findings. The lack of association in the first cluster of SNPs is due to the exclusion of SNPs within the 2 Mb window surrounding the top findings, as only a few SNPs remained for analysis after excluding the top regions. *P<0.05.

Cross-prediction

Table 2 shows the phenotypic correlations for the four outcomes studied, and additionally shows the correlations between the polygenic scores for different Pdiscovery thresholds. Correlations between the traits were modest, with the exceptions of TC and LDL-C (r=0.9) and TG and HDL-C (r=−0.5). The correlations between the polygenic scores were weaker than the phenotypic correlations (0.8 for TC/LDL-C and −0.2 for TG/HDL-C for Pdiscovery<5 × 10−8).

Table 2 Correlation matrix of circulating lipids and genetic risk scores in RS-II

To evaluate the evidence for common polygenic effects underlying lipid levels, we performed cross-prediction analyses (Figure 3). The highest PEV was based on the TC score at Pdiscovery (TC)<1 × 10−5, which explained up to 2.7% of the variance in circulating LDL-C (P=2.0 × 10−5; Figure 3k). Similarly, LDL-C risk profiles explained up to 1.8% of the variance in TC when we selected all SNPs with a Pdiscovery (LDL-C)<10−6 (P=1.4 × 10−11; Figure 3f). These findings are in line with the high phenotypic correlations between those variables. Figures 3g–i show the predictions based on a TG score, which explained up to 0.8% of the variance in other lipids. HDL-C scores explained up to 0.3% of the variance in other lipids (Figures 3a–c).

Figure 3

Cross-prediction across different lipids. Evaluation of the evidence for a joint polygenic effect underlying various lipids: (a–c) Prediction based on HDL risk scores imposed on LDL, TGs and TC. (d–f) Prediction based on LDL risk scores imposed on HDL, TGs and TC. (g–i) Prediction based on TG risk scores. (j–l) Prediction based on TC risk scores. *P<0.05.

Pathway analysis

Pathway analyses using only genome-wide significant SNPs were compared with the analogous analyses using SNPs from the polygenic scores that yielded the highest PEV for each trait (Figure 1). These scores used thresholds of P<1 × 10−6 for LDL-C, P<1 × 10−5 for TC, P<5 × 10−8 for HDL-C and P<1 × 10−7 for TG. Table 3 shows the findings from the pathway analysis, based on alternatives to a P-value threshold of 5.0 × 10−8. None of the pathways among categories defined by the PANTHER tool were significant after strict adjustment for multiple testing (Bonferroni correction). With respect to biological processes, lipid and fatty acid transport and lipid, fatty acid and steroid metabolism were the two processes enriched in the HDL-C and LDL-C GWAS findings. At the level of molecular function, genes with an apolipoprotein and transfer/carrier function were enriched in LDL-C, while genes with a lipase function were observed to be significantly enriched among the top GWAS results for HDL-C. For HDL-C and TG, we were not able to select alternative P-value thresholds as the highest PEVs were observed with P<5 × 10−8. With respect to LDL-C, the pathway analysis utilizing two different P-value thresholds (P<1 × 10−6 and P<5 × 10−8) resulted in the same findings. No additional pathways were identified by using extra information from the risk profiles for LDL-C, TG and HDL-C. For TC, on the other hand, the lipid, fatty acid and steroid metabolism, lipid and fatty acid transport and transport terms additionally emerged among biological processes tested using the alternative threshold (Table 3).

Table 3 Pathway analysis

Discussion

Using prediction modelling, we could explain up to 4.8% of the variance in HDL-C, 2.6% in LDL-C, 3.8% in TG and 2.7% in TC. These PEVs are very similar to those from similar studies5, 9 and much higher than the single SNP analysis of genome-wide significant SNPs from the ENGAGE GWAS (Supplementary Table 1).

However, these proportions are much lower than those identified by GLGC, which were estimated to explain 12.4% (TC), 12.2% (LDL-C), 12.1% (HDL-C) and 9.6% (TG) of the variance in the Framingham Heart Study sample, as mentioned by Teslovich et al.24 This is expected as increases in sample size lead to better estimation of the effect sizes of the SNPs and GLGC had a sample size 5 times larger than the ENGAGE sample, which we used as a discovery set in our study.

For all of the traits, the PEV reached a maximum and then decreased with the use of more liberal Pdiscovery thresholds to calculate the polygenic scores (Figure 1). This is most likely explained by the inclusion of more and more biologically non-relevant SNPs, so that the effects of true positive findings are diluted and this is reflected by the decreases in PEV. For all of the studied traits, we found the highest PEV when the polygenic score was based on SNPs with a low Pdiscovery value (5 × 10−8 for HDL-C, 1 × 10−7 for TG, 1 × 10−6 for LDL-C and 1 × 10−5 for TC). Including the top regions from the ENGAGE GWAS data set as a separate predictor in the models (Figure 2) uncovered a residual polygenic component which does not explain >1% of HDL-C, LDL-C and TC levels. These findings suggest that there are unknown genes with much smaller effects involved in determining these outcomes. However, the PEVs for these additional variants were small when compared with those for the top findings. For TG, on the contrary, excluding the top regions from the polygenic score resulted in non-significant findings. For TC, which is highly heterogeneous compared with the other traits, it seems that some variants remain to be discovered (Pdiscovery<1 × 10−5). It is of note that among newly discovered loci for HDL-C by GLGC, leading SNPs from 10 loci had P-values >0.05 in the ENGAGE HDL-C analysis. Similar findings were observed for 10 loci for LDL-C, 3 loci for TG and for 9 loci in TC.24

It is already known that monogenic disorders25 and rare variants also account for variation in circulating lipid levels.26, 27, 28, 29, 30, 31, 32 This may help to explain why the explained variance is small compared with the high heritability of the traits, especially as many rarer variants are population specific, and might not have been well represented in our European data set, or not well tagged by the common SNPs under study. For instance, APOE gene variations are tagged by the CEACAM16-TOMM40 region among the ENGAGE GWAS top findings, and SNPs from this region were not associated with LDL-C levels in RS-II; however, APOE ɛ2 carrier status explains 2.6% of the phenotypic variation in LDL-C levels in RS-II. Additionally, the gender ratio difference between the discovery and target samples may have been a limitation to the current study, as some loci show different effect sizes for males and females.5 Our findings have implications for gene discovery and suggest that GWAS of much larger samples may be needed to discover additional variants with small effects for HDL-C and LDL-C. However, at the same time, this study suggests that many of the unknown SNPs have relatively large effects and that is confirmed by the GLGC data. Our findings suggest that GWAS on serum lipids in the future will still be successful as sample sizes increase.14

Our cross-prediction results are interesting from a biological perspective. These findings showed very little overlap between the polygenic scores for different circulating lipids. A strong inverse relationship exists between low HDL-C and elevated plasma TG (r=−0.5 in RS-II). Low HDL-C levels are strongly associated with hypertriglyceridemia as high levels of plasma TGs drive an exchange reaction for HDL-C cholesteryl esters mediated by CETP.33 In addition, the TG and phospholipids in HDL-C are hydrolysed by LIPC.13, 33 However, using our genetic evaluation it was not possible to predict a large proportion of the variance in TG levels using HDL-C risk profiles despite the correlation between the two lipids. The polygenic score for TG was slightly better in predicting HDL-C than when we used the top SNPs; however, the PEV did not exceed 0.6% and was lower than the variance explained by HDL-C SNPs and also lower than the variance explained in circulating TG by TG SNPs.
Thus, our data imply that common genetic variants involved in determining both TG and HDL-C levels do not explain the phenotypic correlation between these traits, suggesting that the correlation may be influenced strongly by environmental factors, and/or restricted to a few genes. An alternative explanation may be that we tested the polygenic effects of common variants weighted by their effect size from the initial GWAS. When there are strong causal variants among the top hits that are specific to HDL-C but not to TG, this may dilute the effect of genes with small effect sizes on both outcomes. Also, the current analyses do not account for other forms of genetic variation, such as rare variants or copy number variations (CNVs). As expected, we also found evidence for a number of genes that regulate both HDL-C and LDL-C (Figure 3a) and a similar overlap between TG and LDL-C (Figure 3h). TC SNPs were able to explain up to 2.7% of the variation in LDL-C, suggesting that the genes determining LDL-C and TC are to a large extent overlapping. This result is in line with the high phenotypic correlation between the two measures. Genome-wide significant findings from the ENGAGE GWAS harboured two loci (apolipoprotein B and LPL) influencing both HDL-C and TG, 2 loci influencing both TG and TC (DOCK7 and CEACAM16-TOMM40 regions) and 7 loci influencing both LDL-C and TC (CELSR2, APOB, ABCG5, HMGCR, FADS2/3, LDLR and CEACAM16-TOMM40). A limitation here is that LDL-C was not directly measured but calculated with the Friedewald formula in the RS-II sample and so, by definition, depends directly on TC, HDL-C and TG. This may cause a potential bias in findings for LDL-C and may inflate the association between lipids in cross-prediction findings with this phenotype.

We investigated whether the polygenic score approach can be used as a tool for selecting SNPs of interest in order to further evaluate them in a pathway analysis. First, we evaluated the genome-wide significant SNPs from an existing GWAS and compared the results with those obtained using the SNPs from the polygenic model with the maximum PEV. Neither of the approaches yielded any novel pathways/biological processes (only those already known to be involved in lipid metabolism, such as cholesterol biosynthesis; lipid and fatty acid transport; and lipid, fatty acid and steroid metabolism). Also, we see that, although the use of the polygenic score approach did not provide extra information concerning LDL-C, HDL-C or TG, for TC, pathway analysis based on the best predicting polygenic score (with Pdiscovery<1 × 10−5) was more informative than analysis based solely on the genome-wide significant findings. Including TC SNPs up to a more liberal threshold of 1 × 10−5 suggested three processes, which are already biologically known but were not detectable with the 5 × 10−8 discovery threshold. This finding shows that for complex traits like TC, the risk scoring approach might be used to select the SNP cluster which harbours a large number of true positives that are not significant at the genome-wide level. Taken together with the polygenic component analysis results, it is likely that ENGAGE TC-GWAS results harbour undiscovered associated variants distributed between 1 × 10−6<Pdiscovery<1 × 10−5.

Using a gene scoring approach, we tested the evidence of a polygenic component for the heritable circulating lipids. We concluded that a polygenic form of inheritance exists for HDL-C, LDL-C, TG and TC. These findings may be useful for future gene discovery efforts for lipids. We also tested for possible genetic overlap between biologically related lipid traits and compared two different approaches for pathway analysis. This study gives an example of utilizing the risk scoring approach to search for the common genetic background of different quantitative traits; thus, it may also be an example for more sophisticated future studies.