Introduction

A previous weighted burden analysis of rare coding variants observed in 200,000 exome-sequenced UK Biobank participants implicated four protein-coding genes as being involved in risk of hyperlipidaemia at exome-wide significance: LDLR, PCSK9, ANGPTL3 and IFITM5 [1]. The report of that study also reviewed the broader contribution of genetic variation to hyperlipidaemia risk and that discussion will not be repeated here. For all of these four genes except LDLR, rare variants predicted to impair gene functioning were associated with lower risk of hyperlipidaemia. It was noted that overall 55 genes, of which 47 were protein-coding, had uncorrected p values significant at p < 0.001 whereas only 23 would be expected by chance and a number of these genes seemed to be plausible biological candidates. Exome sequence data for a full set of 470,000 participants has now been released and a follow-up study was carried out in order to determine if any of these 47 genes of interest demonstrated evidence for association in the 270,000 newly available samples. For such genes, further analyses were performed in the whole sample to investigate the overall evidence for association and the contributions from different types of variant.

Most of this exome sequence data has also been used in two previous studies which tested for gene-wise associations with very large numbers of phenotypes, including some hyperlipidaemia-related phenotypes [2, 3]. It was recognised that results from the current investigation would need to be considered in this context of these studies.

Methods

In order to maintain compatibility with the previous study, the same methods were used for the genetic analyses and for phenotype definition. The description is repeated here for convenience.

UK Biobank dataset

UK Biobank participants are volunteers intended to be broadly representative of the UK population and are not selected on the basis of having any health condition. UK Biobank had obtained ethics approval from the North West Multi-centre Research Ethics Committee which covers the UK (approval number: 11/NW/0382) and had obtained informed consent from all participants. The UK Biobank approved an application for use of the data (ID 51119) and ethics approval for the analyses was obtained from the UCL Research Ethics Committee (11527/001). The UK Biobank Research Analysis Platform was used to access the Final Release Population level exome OQFE variants in PLINK format for 469,818 exomes which had been produced at the Regeneron Genetics Center using the protocols described here: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/protocol-for-processing-ukb-whole-exome-sequencing-data-sets [3]. All variants were then annotated using the standard software packages VEP, PolyPhen and SIFT [4,5,6]. To obtain population principal components reflecting ancestry, version 2.0 of plink (https://www.cog-genomics.org/plink/2.0/) was run with the options --maf 0.1 --pca 20 approx [7, 8].

Hyperlipidaemia phenotype

The hyperlipidaemia phenotype was determined in the same way as previously from four sources in the dataset: self-reported high cholesterol; reporting taking cholesterol lowering medication; reporting taking a named statin; having an ICD10 diagnosis for hyperlipidaemia in hospital records or as a cause of death [9]. Participants in any of these categories were deemed to be cases with hyperlipidaemia while all other subjects were taken to be controls. As previously described, the UK Biobank sample does contain some subjects who are related to each other [10]. These subjects were not excluded as including them is not theoretically expected to cause major difficulties for the methods of analysis used and using them in previous similar analyses of this dataset had proved unproblematic. When carrying out the original study a deliberate decision had been made to attempt to identify clinically defined cases of hyperlipidaemia without including information on measured lipid levels in the blood, since these levels would likely be influenced by medication. In the primary analyses to implicate specific genes, attention was restricted to participants not included in the earlier study, consisting of 62,066 cases and 207,216 controls. For the subsequent analyses using the whole sample there were 106,091 cases and 363,674 controls.

Variant weighting

The SCOREASSOC program was used to carry out a weighted burden analysis to test whether, in each gene, sequence variants which were rarer and/or predicted to have more severe functional effects occurred more commonly in cases than controls [11,12,13]. Attention was restricted to rare variants with minor allele frequency (MAF) ≤0.01 in cases or controls or both. As previously described in detail, variants were weighted by overall MAF so that variants with MAF = 0.01 were given a weight of 1 while very rare variants with MAF close to zero were given a weight of 10. This is done by taking the weight to be a parabolic function of MAF passing through the points (0, 10) and (0.01, 1). Variants were also weighted according to their functional annotation using the GENEVARASSOC program, which was used to generate the input files for weighted burden analysis by SCOREASSOC. Variants predicted to cause complete loss of function (LOF) of the gene were assigned a weight of 100. Nonsynonymous variants were assigned a weight of 5 but if PolyPhen annotated them as possibly or probably damaging then 5 or 10 was added to this and if SIFT annotated them as deleterious then 20 was added. The full set of weights and categories is displayed in Table 1 of the previous study [1]. This means that each variant is assigned a weight according to its MAF and a weight according to its functional annotation. As described previously, the weight due to MAF and the weight due to functional annotation were multiplied together to provide an overall weight for each variant. Variants were excluded if there were more than 10% of genotypes missing in the controls and cases or if the heterozygote count was smaller than both homozygote counts in controls and cases. If a subject was not genotyped for a variant then they were assigned the subject-wise average score for that variant. For each subject a gene-wise weighted burden score was derived as the sum of the variant-wise weights, each multiplied by the number of alleles of the variant which the given subject possessed.

Table 1 Genes with absolute value of SLP exceeding 3 (equivalent to p < 0.001) for association with hypertension in previous study showing the SLP obtained in the new sample

Logistic regression analysis

Analyses were restricted to the 47 protein-coding genes significant at p < 0.001 in the previous study. For each gene, logistic regression analysis was carried out with hyperlipidaemia as the dependent variable including the first 20 population principal components and sex as covariates and a likelihood ratio test was performed comparing the likelihoods of the models with and without the gene-wise burden score. This is a test for association between the gene-wise burden score and caseness and the statistical significance was summarised as a signed log p value (SLP), which is the log base 10 of the p value given a positive sign if the score is higher in cases and negative if it is higher in controls. Since only 47 genes were analysed, after correcting for multiple testing a gene could be declared statistically significant if it achieved an SLP with absolute value greater than -log10(0.05/47) = 2.97 using the new samples.

Follow-up analyses

Follow-up analyses were performed on all genes individually achieving this significance level and also ANGPTL4. For this subset of genes the weighted burden analysis described above was carried out using the whole sample of 106,091 cases and 363,674 controls. Additionally, for each subject a count was obtained of the number of variants they carried falling into particular broad annotation categories, such as LOF, protein altering, etc. The full list of these categories is shown in Supplementary Table 1. These counts were entered into a multiple logistic regression analysis with hyperlipidaemia as the dependent variable and again including sex and 20 principal components as covariates in order to elucidate the contribution of different types of variant to the overall evidence for association. The odds ratios (ORs) associated with each category were estimated along with their standard errors and the Wald statistic was used to obtain a p value. This p value was converted to an SLP, again with the sign being positive if the OR was greater than 1, indicating that variants in that category tended to increase risk.

Software

Data manipulation and statistical analyses were performed using GENEVARASSOC, SCOREASSOC and R [12,13,14].

Results

Table 1 shows the results of the primary analysis. Three of the four protein-coding genes which reach exome-wide significance in the earlier study show convincing evidence of association with hyperlipidaemia, LDLR, PCSK9 and ANGPTL3, with SLPs of 87.28, −29.08 and −7.48 respectively. However the other gene, IFITM5, shows no evidence of association, with SLP of only 0.44, so it seems reasonable to conclude that the original results for this gene represented a type 1 error. Of the remaining genes which were originally significant at p < 0.001, most show no evidence of association in this new sample and can be dismissed as chance findings. However ABCG5, APOC3 and NPC1L1 all produce SLPs which are statistically significant after correcting for testing 47 genes, with values of 5.08, −11.01 and −3.42. These six genes were carried forward for secondary analyses along with ANGPTL4 which was considered to be of interest because of its similarity to ANGPTL3, even though it achieved SLPs of only −3.66 in the first sample and −1.95 in the second.

The original study considered 22,642 genes, meaning that for a gene-wise result to be considered exome-wide significant the magnitude of the SLP obtained should exceed -log10(0.05/22642) = 5.66. For the seven genes carried forward, the results of weighted burden analysis in the entire sample of 106,091 cases and 363,674 controls are also shown in Table 1 and it can be seen that all six of the genes which produced results which were statistically significant after multiple testing in the second sample also produce results which would be regarded as exome-wide significant in the full sample. However the SLP for ANGPTL4 in the combined sample is only −4.79.

In order to gain insights into the effects of different categories of variant within these seven genes of interest, counts for variants of each category in each subject were entered into multiple logistic regression analysis along with sex and 20 principal components as covariates. These results are shown in Tables 24 and are summarised briefly as follows.

Table 2 Results from logistic regression analysis showing the contribution different categories of variant within each gene make to risk of hyperlipidaemia
Table 3 Results of variant category analysis for PCSK9, NPC1L1 and APOC3
Table 4 Results of variant category analysis for ANGPTL3 and ANGPTL4

Variants in LDLR (SLP = 156.81) and ABCG5 (SLP = 6.95) increase risk of hyperlipidaemia and results for each variant category are shown in Table 2. Table 2A shows the results for LDLR and it can be seen that LOF variants are associated with hyperlipidaemia risk with OR > 20. 113 participants carry a LOF variant and all but 16 of these are cases. Of note, there are also 19 subjects who carry an inframe indel and all but 2 of these are also cases, again yielding an OR over 20 though with a wide confidence interval. Detailed inspection of these results reveals that they are driven by two inframe deletions, 19:11105556ATGG > A (rs121908027) which is carried by 8 participants who are all cases and 19:11116925ACGG > A (rs1221971156) which is carried by 7 participants, 6 of whom are cases. The first of these, rs121908027, is reported to the be most common familial hyperlipidaemia (FH) mutation in Ashkenazi Jews and was found in 35% of FH families in Israel [15]. As well as the large effect of these LOF and indel variants there is statistically significant evidence for an overall small effect on risk of the much commoner variants in the “Protein altering” category (consisting mostly of nonsynonymous variants) with OR of 1.12 and a further modest increase in risk if these are annotated as deleterious by SIFT and/or possibly or probably damaging by PolyPhen, with ORs of 1.44, 1.27 and 1.61. While all these categories of variant are associated with increased risk of hyperlipidaemia, the category “Splice region” is actually associated with reduced risk, with OR of 0.82 and SLP of −19.81. This result is driven by 19:11120527 G > A (rs72658867), which has MAF 0.0126 in controls and 0.0096 in cases and which has previously been reported to lower HDL cholesterol and to be protective against coronary artery disease [16].

Table 2B shows the results for ABCG5 and it can be seen that although a few hundred participants carry LOF variants these do not appear to have any strong effect on hyperlipidaemia risk with an OR of 1.2 which is not statistically significant. Instead, the signal for this gene seems to be driven largely by the “Splice region category”, with OR of 1.44 and SLP of 6.20. Although there are 74 variants in this category, the result seems to be mainly driven by three variants which are somewhat commoner in cases, 2:43813316 A > C (rs114780578), 2:43822939 G > C (rs370895243) and 2:43825025 A > T (rs201469377). The category “Protein altering” yields an OR of 1.05 and an SLP of 2.07 but there is no suggestion that nonsynonymous variants recognised as more severe by SIFT or PolyPhen are associated with increased risk. This result may be largely driven by 2:43813208 T > C (rs140374206) which had frequency 0.006049 in controls and 0.006246 in cases and which has previously been reported to be associated with raised non-HDL cholesterol and increased risk of gallstones [17].

Variants in PCSK9 (SLP = −48.57), NPC1L1 (SLP = −7.60) and APOC3 (SLP = −13.19) are protective against hyperlipidaemia and their results detailed are shown in Table 3. As can be seen in Table 3A, LOF variants in PCSK9 reduce hyperlipidaemia risk, with OR of 0.39. On average, protein altering variants in general have a mild effect on lowering risk, with OR of 0.92, but those which are additionally annotated as deleterious by SIFT have a larger effect, with OR 0.69, whereas there is no additional effect associated with being characterised as possibly or probably damaging by PolyPhen.

The results in Table 3B show that the overall signal for NPC1L1 is mainly due to LOF variants, with OR 0.64 and SLP of −5.20, with possibly some additional contribution from variants annotated as deleterious by SIFT, which have OR 0.89 and SLP −1.76. A similar scenario is seen for APOC3 in Table 3C, with LOF variants have OR of 0.68 and SLP of −11.22 but with other categories not showing clear evidence of association.

Variants in ANGPTL3 (SLP = −12.68) and ANGPTL4 (SLP = −4.79) also appear to be protective against hyperlipidaemia, although the result for ANGPTL4 is not exome-wide significant. Nevertheless, as the products of both genes modulate the activity of lipoprotein lipase and as inactivating variants in both genes have previously been shown to be associated with hypolipidaemia, it seems appropriate to present the detailed results for both, as shown in Table 4 [18, 19]. For ANGPTL3 the signal is again mainly due to LOF variants, with OR of 0.59 and SLP of −8.36, but it can be seen that splice region variants also have OR of 0.69 and SLP of −4.70. This latter result is driven by 1:62598067 T > C (rs372257803) which has MAF 0.00100 in controls and 0.00069 in cases and which has been previously reported to be associated with lower non-HLD cholesterol and triglycerides [20]. The results for ANGPTL4 are not statistically significant after correction for multiple testing but it can be seen that they are consistent with the possibility that LOF variants lower hyperlipidaemia risk modestly, with OR 0.78 and SLP −1.84.

Discussion

As mentioned above, this dataset has been used for previous two studies testing for association between exome sequence variance with a very large number of phenotypes, which for convenience we can refer to as the Regeneron and AstraZeneca studies [2, 3]. The Regeneron study carried out a variety of single variant and gene-wise burden tests on 3994 health-related traits to produce a total of about 2.3 billion tests, yielding a critical p value of 2.18 × 10−11 (corresponding to SLP = 10.66)) and reported 8865 significant associations which are presented in their Supplementary Data 2 [3]. This reports significant gene-wise associations of measured cholesterol and/or LDL levels for all seven of the genes implicated in the present study. However the only genes reported to be associated with clinical hyperlipidaemia, as indicated by taking lipid lowering medication or having a hyperlipidaemia diagnosis recorded, were LDLR, PCSK9 and APOC3. For the AstraZeneca study, all gene-wise and variant-wise associations with 17,361 binary and 1419 quantitative phenotypes are reported on the AstraZeneca PheWAS Portal at https://azphewas.com/ [2]. This was accessed to find the most significant p value for any analysis of each of these genes with the phenotype which on the website is labelled “Union#E78#E78 Disorders of lipoprotein metabolism and other lipidaemias” and Table 5 shows the results obtained compared with those for the current study. This shows that, for every gene, the AstraZeneca results provide less support for association than the present study and in particular both ABCG5 and NPC1L1 would be regarded as exome-wide significant in the present study. While, these two genes were both previously shown to influence lipid levels, the current study implicates variants in these genes as impacting the risk of developing clinically relevant hyperlipidaemia, suggesting that such variants could be included when drawing up genetically informed individual risk profiles.

Table 5 Comparison of results from current study to those reported for the AstraZeneca study

Of the four genes in which variants impairing function are associated with reduced risk of hyperlipidaemia, two are already established drug targets. The product of NPC1L1, which is essential for intestinal sterol absorption, is the molecular target of ezetimibe, a cholesterol absorption inhibitor that lowers blood cholesterol [21]. The product of ANGPTL3 is the target of evinacumab, a human monoclonal antibody designed to treat hypercholesterolaemia [18, 22]. The therapeutic role of ANGPTL3 inhibition has recently been reviewed [23]. The well-established evidence for PCSK9 and APOC3 variants as being protective is fuelling research into developing strategies to find novel ways to antagonise PCSK9 and lower apoC-III [24, 25].

The detailed analyses of variant categories provide insights into the magnitude of effects of different kinds of variant in different genes, along with information about their cumulative frequencies. One feature of note is the heterogeneity of effects between genes. For example, for LDLR nonsynonymous variants classified as probably damaging by PolyPhen are more strongly implicated and have a larger effect size than variants classified as deleterious by SIFT, but for PCSK9 this is not the case and only SIFT deleterious variants have an effect. For ANGPTL3 neither SIFT nor PolyPhen is helpful for identifying risk-associated variants. For most genes LOF variants have the largest effect size but this is not the case for ABCG5 in which LOF variants do not clearly have any effect and the signal is instead driven by specific splice region variants. This inconsistency of effects of different methods for weighting across different genes has been noted in a previous study [26]. A total of 43 different methods of predicting the pathogenic effect of nonsynonymous variants were compared across ten different genes associated with common phenotypes, including LDLR, PCSK9 and ANGPTL3. These results showed that while SIFT and PolyPhen performed reasonably well for some genes, other methods were better for other genes. However no single prediction method performed consistently well across all genes. Using additional prediction methods would introduce complications around correcting for multiple testing and interpretation of results but it seems that there is no single weighting system which would be optimal for all genes.

Using biobank data to identify genes influencing complex traits contrasts with the strategy of concentrating on individuals and families with extreme phenotypes who may harbour variants with quasi-Mendelian effects [27]. Ascertaining cases from biobank data can yield large sample sizes but with less severe and less well-defined phenotypes. The present study quantifies the effect on clinically relevant hyperlipidaemia risk of naturally occurring variation within the identified genes, along with their cumulative frequencies in a sample broadly representative of the population. The fact that people with severe, early onset cardiovascular disease might have been less likely to survive to be recruited into UK Biobank may mean that the magnitude of effect of some variants on risk is somewhat underestimated, but this does not seem likely to be a major consideration. Variants with large effect sizes are very rare. Around 10,000 participants carry a variant in a category with a moderate effect on risk, i.e. with OR below 0.7 or above 1.4, but it is debatable how relevant such effects would be for quantifying individual risk in the context of personalised medicine. Additionally, we may note that, although formal analyses were not carried out, the individual allele frequencies of these variants would be too low to have an appreciable effect of any common variants which might be in linkage disequilibrium with them, such as could be detected in genome-wide association studies. It seems that the main value of identifying coding variants associated with risk is to clarify the pathophysiological mechanisms influencing risk of hyperlipidaemia in order to support the development of novel therapeutic approaches.