Introduction

As large exome-sequenced datasets become available it has become possible to detect gene-level associations between the burden of extremely rare coding variants and a variety of phenotypes [1, 2]. Typically, tests for association involve considering together variants falling into a particular category based on their predicted effect. Variants expected to completely disrupt function of a gene, consisting of stop gained, frameshift and splice site variants are jointly termed loss of function (LOF) or protein truncating variants (PTV) and when considered jointly it is usually the case that this category of variant is associated with the largest effect size. While nonsynonymous variants may also have effects, the nature and magnitude of these effects is likely to be heterogeneous and if all nonsynonymous variants are considered to form a single category then the average estimated effect will naturally be smaller than that of the variants having the largest effect sizes. Rare variant association studies which simply consider all nonsynonymous variants jointly have yielded informative results [3]. However a more widely used approach is to use some form of secondary annotation method which attempts to distinguish those nonsynonymous variants which are more likely to have a biological effect and applying such approaches may allow one to demonstrate that those nonsynonymous variants predicted to be most impactful are indeed the ones which show association with a phenotype [4].

A large number of methods are available to carry out such secondary annotations and we have recently assessed their relative performance [5]. Since then a new method, AlphaMissense, has been released with the aim of recognising whether a nonsynonymous variant observed in a patient is or is not likely to be pathogenic [6]. The report of this study also discusses at length the various issues involved in attempting to interpret the likely effects of nonsynonymous variants. The AlphaMissense prediction is based on machine learning approaches to assimilate information about the protein structural context and about evolutionary conservation to generate a score reflecting likely pathogenicity. It was demonstrated to perform well on benchmarks derived from clinically identified variants as well as from multiplexed assays of variant effect (MAVEs).

Since the publication of AlphaMissense, a small number of published reports have carried out further evaluation of its ability to identify pathogenic variants in a clinical context. In a study of 37 nonsynonymous variants in IRF6, experimental work determined that of 18 variants predicted by AlphaMissense to be pathogenic 15 were in fact benign [7]. In a study of CFTR variants, AlphaMissense predicted a high proportion to be pathogenic, resulting in a high false positive rate and fair classification rate with only modest correlation with clinical measures of pathogenicity [8]. A study of 2073 variants in haematological malignancies found good agreement between AlphaMissense and previous classifications which had been made for clinical reporting purposes although they noted that some of the variants for which there was disagreement seemed to represent misclassifications by AlphaMissense [9].

Although one might hope that the same classification methods used to identify single variants causing severe disease might also be helpful in attempting to discriminate those variants increasing risk of common phenotypes, this is not necessarily the case. For example, when using PolyPhen-2 it is recommended that the version trained on HumVar be used to assist the diagnosis of Mendelian disorders while the version trained on HumDiv should be used to evaluate rare alleles for complex genotypes (Adzhubei et al. [10]). This consideration means that it would be helpful to assess the extent to which AlphaMissense could assist as an annotation tool in the context of large case control studies of exome-sequenced datasets aiming to identify genes influencing the risk of common phenotypes.

The aim of the present study is to compare the performance of AlphaMissense with other annotation methods in terms of their ability to produce evidence for association between a gene and a common, clinically relevant phenotype. Such associations were previously established using weighted burden analyses, in which different variants within a gene were weighted differentially according to their annotation and rarity. In the present analyses, variants are weighted for rarity and then the contributions to evidence for association are examined separately for AlphaMissense and a number of other annotation methods.

The genes and phenotypes used in this study are listed in Table 1 and they were selected because they had previously produced exome-wide significant results in weighted burden analyses using phenotypes of hypertension, hyperlipidaemia and type 2 diabetes in the UK Biobank dataset [11,12,13]. In Table 1, the SLP entry indicates the signed log10 p value for the evidence in favour of association, as explained in greater detail in the Methods section below. For convenience, if variants in a gene are protective against, for example, hyperlipidaemia then the phenotype of interest is characterised as “Not hyperlipidaemia”. For a number of these genes, their association with the relevant phenotype was previously well established and for others there is at least some independent evidence for their involvement, increasing the confidence that they do not represent false positives. These findings are summarised as follows. Rare variants with a large dominant effect in LDLR, APOB and PCSK9 are recognised to cause 40% of cases of familial hyperlipidaemia [14]. Recessively acting variants in ABCG5, involved in the intestinal and biliary excretion of cholesterol, are recognised to cause sitosterolemia and recent reports support an effect of heterozygous variants as causes of hypercholesterolaemia [15, 16] The product of NPC1L1, is involved in intestinal sterol absorption and is the target of ezetimibe, a cholesterol absorption inhibitor which lowers blood cholesterol [17] The product of ANGPTL3 is the target of evinacumab, a human monoclonal antibody for treating hypercholesterolaemia [18, 19]. The therapeutic role of ANGPTL3 inhibition for hyperlipidaemia has recently been reviewed [20] The well-established evidence for variants impairing the function of PCSK9 and APOC3 as being protective against hyperlipidaemia is fuelling research into developing strategies to find novel ways to antagonise PCSK9 and lower apoC-III [21, 22]. Knockdown of Dnmt3a in mice results in reduced methylation of the gene for angiotensin receptor type 1a, Agtr1a, leading to increased Agtr1a expression and salt-induced hypertension [23]. In studies using common variants, an SNP close to FES, rs2521501, shows robust evidence for association with blood pressure [24]. NPR1 and GUCY1A1 both code for guanylate cyclases producing cyclic GMP, which acts as an intracellular messenger to mediate responses such as vasodilation in order to reduce blood pressure [25, 26]. DBH codes for dopamine beta hydroxylase, which catalyzes the oxidative hydroxylation of dopamine to norepinephrine and recessively acting variants are known cause norepinephrine deficiency syndrome, characterised by reduced blood pressure [27]. Rare variants in GCK, HNF1A and HNF4A are recognised as causes of maturity onset diabetes of the young [28]. The association of rare predicted loss of function variants in GIGYF1 with T2D was replicated in an independent cohort and these variants were also shown to be associated with associated with decreased cholesterol levels and increased risk of hypothyroidism [29]. Overall, there is good to excellent evidence that rare variants in these genes are associated with the relevant phenotype and hence it becomes reasonable to investigate the relative performance of different annotation methods in terms of detecting these associations.

Table 1 List of genes used for these analyses along with the SLP obtained in the original analyses with the corresponding phenotype [19,20,21].

Materials and methods

The methods used closely followed those described in the previous study exploring different annotation and weighting schemes, and the description is partly repeated here for the convenience of the reader [5].

The UK Biobank Research Analysis Platform was used to access the Final Release Population level exome OQFE variants in PLINK format for 469,818 exomes which had been produced at the Regeneron Genetics Center using the protocols described here: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/protocol-for-processing-ukb-whole-exome-sequencing-data-sets [2]. UK Biobank had obtained ethics approval from the North West Multi-centre Research Ethics Committee which covers the UK (approval number: 11/NW/0382) and had obtained written informed consent from all participants. The UK Biobank approved an application for use of the data (ID 51119) and ethics approval for the analyses was obtained from the UCL Research Ethics Committee (11527/001). To obtain 20 population principal components reflecting ancestry, version 2.0 of plink (https://www.cog-genomics.org/plink/2.0/) was run with the options --maf 0.1 --pca 20 approx [30, 31].

To assess overall evidence for gene-wise associations with different phenotypes, weighted burden analyses had previously been carried out using the SCOREASSOC and GENEVARASSOC programs [32]. Attention was restricted to rare variants with minor allele frequency (MAF) ≤ 0.01 in both cases and controls. As previously described, variants were weighted by overall MAF so that variants with MAF = 0.01 were given a weight of 1 while very rare variants with MAF close to zero were given a weight of 10, with a parabolic function used to assign weights with intermediate MAFs [33]. Additionally each variant was annotated with the Variant Effect Predictor (VEP), SIFT and PolyPhen SIFT [10, 34, 35]. A weight was assigned according to this annotation and the overall weight for each variant consisted of the frequency weight multiplied by the annotation weight. For each subject and each gene, the weights for the variants carried by the subject were summed to provide an overall weighted burden score. Regression modelling was done to calculate the likelihood for the phenotype data given covariates consisting of sex and the first 20 principal components and then the likelihood was recalculated for the model additionally incorporating the weighted burden score. Twice the natural log of the ratio of these likelihoods is a likelihood ratio statistic taken to be distributed as a chi-squared statistic with 1 degree of freedom. The evidence for association is summarised as the signed log p value (SLP) taken as the log base 10 of the p value and given a positive sign if there is a positive correlation between the weighted burden score and the phenotype.

For the present study variant annotation was performed in two stages. First, a primary categorisation was made using VEP, which uses information based on the reference sequence and coordinates of known transcripts to report findings such as whether variants occur within exons, if so whether they change amino acid sequence, etc. [34]. For purposes of the present analyses, variants predicted to have a similar kind of effect were grouped together so that, for example, stop gained, frameshift and essential splice site variants were all treated as LOF. The full list of annotations as reported by VEP and the category they were assigned to is shown in Table 2, along with the weights which were used for the previous weighted burden analyses, which had been arbitrarily assigned based on expectations of the likely biological importance of each annotation. Each of the annotation categories was then used to generate a separate burden score, so that for example the burden score relating to the category LOF for a subject would consist of the number of LOF variants carried by that subject, each multiplied by the weight according to allele frequency as described above.

Table 2 Table showing annotations produced by VEP, the weights assigned to them for the previous weighted burden analyses and the categories they were assigned to for the current analyses.

In order to obtain secondary annotations using AlphaMissense for all nonsynonymous variants, VEP was run with the options b --canonical –regulatory --plugin AlphaMissense [6].This produces two AlphaMissense annotations, a raw score and a categorisation of likely pathogenic, likely benign or ambiguous. These three categories were converted to numerical scores of 2, 0 or 1 respectively. To obtain secondary annotations for other predictors, dbNSFP v4 was used [36]. For the nonsynonymous and splice site variants listed in dbNSFP v4, secondary annotation scores were obtained consisting of the rank scores for a variety of different prediction and conservation methods. For each secondary annotation for a variant, the annotation score was then multiplied by the weight based on allele frequency. Thus, a subject’s overall score for the SIFT annotation would consist of the sum of all the SIFT rank scores of the variants carried by that subject, with the score for each variant also each being weighted according to allele frequency. For ease of processing, special characters in dbNSFP annotation names were replaced, for example GERP++ was changed to GERPPP. A total of 43 such scores were used, as presented below and as detailed at http://database.liulab.science/dbNSFP.

The genes selected for this study are listed in Table 1 and consisted of those which had previously produced exome-wide significant results in weighted burden analyses using phenotypes of hypertension, hyperlipidaemia and type 2 diabetes [11,12,13]. For each phenotype, a mixture of self-report, recorded diagnoses and medication reports was used to designate a set of participants as cases, with all other participants taken to be controls. There were a total of 469,818 exome-sequenced UK Biobank participants, of whom 167,127 were designated cases for hypertension, 106,091 for hyperlipidaemia and 33,629 for type 2 diabetes. As noted in the legend for Table 1, for some of these genes the original SLPs obtained were negative, indicating that variants impairing the function of these genes were protective and were associated with lower risk of developing the clinical phenotype. For the purpose of the current study, in order to make it easier to interpret the results for these genes alongside the others, the phenotype of interest for these genes is taken to be “being a control”, meaning that all variants associated with the phenotype would tend to generate positive SLPs. The magnitude of the SLP represents the strength of evidence in favour of association and hence, when applied to truly associated gene-phenotype pairs, the relative values of SLPs obtained using different methods provide an indication of their relative utility in terms of their ability to correctly assess the effect of variants on a phenotype.

To gain an understanding of the relationships between the different annotation methods, a correlation matrix was produced of all the secondary annotation scores across all the nonsynonymous variants in all these genes and this matrix was visualised using the correl package in R [37, 38].

In order to assess the contributions of each different category of variant to the evidence for association, a logistic regression analysis was performed separately on the weighted burden score for each of the primary categories, with population principal components and sex being included as covariates. The Wald statistic was then used to obtain an SLP for each variant category for each gene and these were tabulated and compared.

Similar analyses were performed for secondary annotations obtained from AlphaMissense and dbNSFP, except that for these analyses the weighted burden score produced by the ProteinAltering category was included as an additional covariate. This is because the overall burden for each of these secondary annotations would depend on the total number of nonsynonymous variants each subject carries and the purpose of these analyses is to assess the relevant performance of the different secondary annotation methods to distinguish the effect of different nonsynonymous variants. Again, the Wald statistic was used to obtain SLPs for each secondary annotation and these were tabulated and compared.

Data manipulation and statistical analyses were performed using GENEVARASSOC, SCOREASSOC and R [32, 33, 38].

Results

Figure 1 shows a heatmap which illustrates the relative magnitude of the SLP produced by each variant category for each gene and the actual SLPs are shown in Table 3. From this it can be seen that for most genes the only variant categories to generate SLPs of a large magnitude were LOF and ProteinAltering. However for ABCG5 and ANGPTL3 the SpliceRegion category had large SLPs whereas for LDLR and HNF4A the InDelEtc category had large SLPs. For some genes both the LOF and ProteinAltering categories had large SLPs but for others only one category did. For example, for HNF1A the LOF category produced a much larger SLP than ProteinAltering did, whereas for HNF41A this situation was reversed and the ProteinAltering category produced a fairly large SLP whereas the LOF SLP was minimal. Thus it can be seen that there is no consistency across genes regarding which variant categories make the most substantial contribution to evidence for association, implying that no single scheme could be optimal for all genes.

Fig. 1: Heatmap of SLPs produced by each variant category for each gene.
figure 1

The sizes of the dots for each gene are proportional to the SLP for each variant category relative to the maximum category SLP obtained for that gene. White circles indicate negative SLPs.

Table 3 SLPs produced individually by each variant category for each gene, including sex and principal components as covariates.

In order to gain insights into the relationships between the secondary annotations, pairwise correlation coefficients were obtained between all pairs across variants in all genes, comprising 10,567 nonsynonymous variants, and a heatmap illustrating these correlations is shown in Fig. 2. The raw correlation coefficients themselves are tabulated in Supplementary Table 1. It can be seen that the AlphaMissense annotations are positively correlated with each other and with 15 other annotations, forming a block. There is then a second block comprising 8 annotations which are again positively correlated with each other but which show little correlation with any of the annotations forming the first block. Interestingly, the Mutation Predictor (MutPred) score is positively correlated with the annotations of the first block and somewhat negatively correlated with those in the second block [39]. Following these two blocks are a number of other annotations showing little in the way of correlation with any of the others. Notably, this list includes the CADD annotations, which are quite widely used but which somewhat surprisingly seem to pick up different variant characteristics than the other methods [40].

Fig. 2: Plot of pairwise correlations between secondary annotations for the variants used in this study.
figure 2

Black circles indicate positive correlations and white circles negative correlations.

The relative performance of the different secondary annotations in terms of producing evidence for association is displayed as a heatmap in Fig. 3 and the underlying SLPs are listed in Table 4. One thing of note is that there is considerable variability between genes as to the extent to which any of the secondary annotation methods produces evidence for association, as measured by the magnitude of the SLP. For some genes the methods are clearly quite effective. For example, LDLR, PCSK9 and GCK all yield large SLPs for a variety of different annotations. Interestingly, although APOC3 produced a negligible SLP of 1.34 for the ProteinAltering category taken as a whole, when these variants are annotated with MutationTaster they yield an SLP of 11.33 [41]. Conversely, ABCG5 produced SLP = 7.80 for the ProteinAltering category but none of the secondary annotation methods seems able to distinguish which variants within this category are more associated with risk and the maximum any of them produces is SLP = 2.25 for the SiPhy score, which is a conservation score based on comparison of human and mammalian genomes [42].

Fig. 3: Heatmap of SLPs produced by each secondary annotation for each gene.
figure 3

The sizes of the dots for each gene are proportional to the SLP for each annotation relative to the maximum SLP produced by any annotation for that gene. White circles indicate negative SLPs.

Table 4 SLPs produced individually by each secondary annotation from AlphaMissense and dbNSFP for each gene, including sex and principal components as covariates.

When a secondary annotation method is able to produce a high SLP, there is inconsistency between genes with regard to the relative performance of the different methods. While the AlphaMissense annotations have the best performance on average across all genes (for AlphaMissense category SLP = 7.12 and for AlphaMissense score SLP = 7.50), they actually produce the maximum SLP for only 4 genes: LDLR, ANGPTL3, NPR1 and HNFA1. There are some genes where AlphaMissense is able to produce reasonable evidence for association but other methods do considerably better. For example, PCSK9 yields SLP = 11.70 with the AlphaMissense score but SLP = 21.07 with MutationTaster, while GCK yields SLP = 6.49 with the AlphaMissense category but SLP = 18.46 with MutationTaster and SLP = 18.35 with the Variant Effect Scoring Tool (VEST4) [43]. More strikingly, there were genes for which AlphaMissense was not able to find any evidence for association whereas another method performed well. For example, AlphaMissense produces negligible SLPs for both APOC3 and ASXL1 whereas MutationTaster produces SLP = 11.07 for APOC3 and CADD produces SLP = 4.50 for ASXL1.

Discussion

Examining this relatively small number of gene-phenotype pairs in detail is sufficient to establish that there is dramatic variability in the performance of secondary annotation methods in terms of their ability to produce evidence for association. We should note that for AlphaMissense the categorical predictions and raw scores have been used whereas for the other methods rank scores were extracted from dbNSFP and this might possibly have had some impact on the results obtained. However, we note that for AlphaMissense the category is in fact simply a trichotomized version of the score, consisting of highest, lowest and intermediate scores, and that both the category and the score produce very similar results (as seen in Table 4). This supports the notion that using rank scores rather than raw scores for the other methods may be unlikely to have a major effect. Additionally, we have chosen to incorporate a scheme to weight variants by rarity and this may also have led to results which differ somewhat from those which would have been obtained if a simple frequency cut-off had been applied, as demonstrated in previous investigations [5]. Nevertheless, it seems unlikely that such considerations would dramatically alter the main conclusion, which is that some secondary annotation methods work better than others but that their relative performance varies between genes. This would seem to have a number of implications.

The first implication is that it is not at all clear what is the optimal approach to use when testing for association between coding variants and a complex phenotype. For some genes nonsynonymous variants produce little evidence for association with any of the annotation methods, meaning that weighted burden analysis would have no advantage over simply testing for an excess of LOF variants among cases. If weighted burden analysis is to be used then there are choices to be made between carrying out multiple different analyses using different categorisations, annotation methods and weighting schemes or attempting to combine information from multiple sources into a smaller number of analyses. The results shown here seem to demonstrate that relying on a single annotation method would risk failing to detect some real associations, although if one were forced to rely on a single method then it does seem that AlphaMissense has the best performance on average.

The second implication seems to be that, because different methods work better for different gene-phenotype pairs, one would want to take account of this if the aim was to use sequence data for individual level risk prediction. For example, if one wished to obtain a comprehensive assessment of an individual’s probability of developing type 2 diabetes by combining information from rare variants into some form of risk score, then based on these results one might use the MutationTaster prediction for GCK variants, AlphaMissense for HNF4A and CADD for HNF1A and for GIGYF1. It would be suboptimal to apply a single annotation method to characterise variants across multiple genes.

Finally, it seems that it would be very desirable to be in a position where one could in advance identify for a given gene or gene-phenotype pair which annotation method would best distinguish the relevant variants. As knowledge accrues it would be helpful to investigate what are the characteristics of a gene which mean that one method will perform well and another poorly. Ultimately one would then seek to develop an automated method in which the first step was gene classification and then this would be followed by application of a gene-relevant annotation method.